From a post by Brewster Kahle:
The Center for Intelligent Information Retrieval at UMass Amherst, the Perseus Digital Library Project at Tufts, and the Internet Archive are investigating large-scale information extraction and retrieval technologies for digitized book collections. The NSF has awarded a grant of $2.7 million for a project to apply advanced OCR, topic modeling and metadata extraction techniques to over one million books at the Internet Archive.
Source: Internet Archive
See Also: NSF Grant Document
See Also: Scanning: Internet Archive Text Collection Passes 1.5 Million Titles (August 10, 2009)
Note: The Internet Archive is constantly adding new content, so these numbers are already somewhat out of date.
