Powerset Blog
Parsing Miss South Carolina's Statement
It’s not like it’s easy to parse Wikipedia, but at least most of its text is (usually) written with correct spelling, capitalized proper names, meaningful paragraph structure and so on. A natural question is: how will our system perform on the rest of the Web, with all of its slang and non-standard syntax? To put Powerset to the test, two of our engineers, Lukas Biewald and Brendan O’Connor, ran our entire parsing and indexing system on the hardest corpus we could find: Miss South Carolina’s response to the question, "Recent polls have shown that a fifth of Americans can’t locate the US on a map. Why do you think this is?" They fed this transcription into the XLE verbatim, disfluencies and all:
I personally believe that U.S. Americans are unable to do so because uh some uh people out there in our nation don’t *have* maps and uh I believe that our ed- education like such as in South Africa and uh the- the Iraq everywhere like such as and I believe that they should uh our education over here in the U.S. should help the U.S. or- or- should help South Africa and should help the Iraq and the Asian countries so we will be able to build up our future
One might think that such a convoluted mess of words (I hesitate to call it "English") would be impossible to parse, but here are the C-structures that our parser generates:

[Image: C-structures generated by the parser for the transcript]
Unsurprisingly, the sentence is fragmented quite a bit, but the parser clearly managed to extract useful structure throughout. The last large verb phrase, “should help South Africa and should help the Iraq and the Asian countries so we will be able to build up our future”, seems very close to correct, which is pretty impressive (Language Log has more discussion on the weird “the Iraq” construction). The output of the semantics system is too long to include here, but in some ways, it’s amazing that we were able to extract any semantics at all. And, believe it or not, we can actually run some queries against the Carolina Index (as it’s known at Powerset). It’s hard to think of a reasonable question, but we asked, "Who does education help?" and the system returned and highlighted the right answer: "Americans". Or should it have returned "U.S. Americans"?