Analyzing the News: UC Irvine Researchers “Text Mine” The New York Times
Performing what a team of dedicated and bleary-eyed newspaper librarians would need months to do, scientists at UC Irvine have used an up-and-coming technology to complete in hours a complex topic analysis of 330,000 stories published primarily by The New York Times.
Text mining allows a computer to extract useful information from unstructured text. Until recently, text mining required a great deal of preparation before documents could be analyzed in a meaningful way. A new text-mining technique called “topic modeling” (which UCI scientists used in their New York Times experiment) looks for patterns of words that tend to occur together in documents, then automatically categorizes those words into topics, all with minimal human effort.
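To make the "words that tend to occur together" idea concrete, here is a minimal Python sketch that simply counts how many documents each pair of words shares. It is not the UCI team's method and not a real topic model (those are probabilistic and far more involved); it only illustrates the co-occurrence signal such models build on. The function name and the toy headlines are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how many documents each unordered word pair appears in together."""
    counts = Counter()
    for text in documents:
        # Use each word once per document; sort so a pair has one canonical order.
        words = sorted(set(text.lower().split()))
        counts.update(combinations(words, 2))
    return counts


# Hypothetical toy headlines, invented for illustration only.
docs = [
    "senate votes on budget bill",
    "budget bill passes senate",
    "team wins playoff game",
    "playoff game goes to overtime",
]

pair_counts = cooccurrence_counts(docs)

# Keep pairs that show up together in at least two documents.
top_pairs = sorted(pair for pair, n in pair_counts.items() if n >= 2)
for pair in top_pairs:
    print(pair, pair_counts[pair])
```

In this toy data the repeated pairs fall into a rough "politics" group and a rough "sports" group, which is the kind of grouping a topic model finds automatically, and at a much larger scale.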
UCI researchers didn’t invent topic modeling, but they developed a technique that allows the technology to be used on huge document collections. They also are among the first to demonstrate its ease and effectiveness by applying it to a newspaper archive. The results reveal few surprises, but the application demonstrates the ability of topic modeling to spot trends and make connections in a way that could be applied to more complicated and cumbersome documents such as those used by medical researchers and lawyers.
The news release has more.
For those of you who would like to read the paper where this research was presented, you can find it here (13 pages; PDF).
On the surface, and from a non-technical point of view, some of this technology resembles what Clusty is doing with dynamic categorization and what Ask.com provides with its narrow/expand options.
The UC Irvine researchers are:
David Newman
Padhraic Smyth
Mark Steyvers
Chaitanya Chemudugunta
More in this ZDNet Blog story.
