Archive for the ‘Information Science’ Category

The Library of Congress Unveils API for Chronicling America Digitized Newspaper Database and Directory

Friday, October 30th, 2009

What follows is a post that might be of special interest to web developers, webmasters, site owners, or anyone who can work with an API (Application Programming Interface). The API draws on a digitized collection of more than 1 million historic newspaper pages and a searchable directory of newspaper information. Even if you don’t have the technical skills required, you may know someone who does, and with their help you can partner to develop new resources, create mashups, etc. Btw, if you know people who are able to work with an API, feel free to share this post with them.

First, some background.

We’ve posted about the Chronicling America (CA) program since the day it launched in March 2007. The project is a joint effort between the Library of Congress and the National Endowment for the Humanities to digitize historic American newspapers. In addition to the digitized newspaper database, CA also provides the Chronicling America directory. It’s both searchable, with a powerful interface (a great example of what good metadata can do), and browsable. The directory contains information about most American newspapers published from 1690 to today.

On June 16, 2009, we ran a story about CA reaching a milestone. CA had just hit the one million digitized pages mark. It has grown a lot since then. About five weeks ago we posted an item about CA adding more than 192,000 pages to the database. The media release said that the database at that time contained 1,442,000 digitized pages from 171 titles published between 1880 and 1922.

Thanks for the info, but what about the API (Application Programming Interface)?

The following is from the “About the Chronicling America API” web page:

Chronicling America provides access to information about historic newspapers and select digitized newspaper pages. To encourage a wide range of potential uses, we designed several different views of the data we provide, all of which are publicly visible. Each uses common Web protocols, and access is not restricted in any way. You do not need to apply for a special key to use them. Together they make up an extensive application programming interface (API) which you can use to explore all of our data in many ways.

The rest of the web page offers technical details about the API.
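For readers who want to experiment, here is a minimal Python sketch of what calling an open, key-free search API of this kind might look like. The base URL, the `andtext`, `page`, and `format` parameter names, and both function names are our assumptions rather than details taken from the post; check the “About the Chronicling America API” page for the actual views and parameters.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: base URL and parameter names are illustrative, not from the post.
BASE_URL = "https://chroniclingamerica.loc.gov/search/pages/results/"


def build_search_url(text, page=1, fmt="json"):
    """Build a page-search URL. No registration or API key is involved."""
    query = {"andtext": text, "page": page, "format": fmt}
    return BASE_URL + "?" + urlencode(query)


def fetch_results(text, page=1):
    """Fetch and decode one page of JSON results (requires network access)."""
    with urlopen(build_search_url(text, page)) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the URL; call fetch_results() to actually hit the service.
    print(build_search_url("world series"))
```

Because no key is required, the whole request is just a URL, which is what makes this sort of API easy to use in mashups.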

Programmable Web has also posted about the new API.

Here are a couple of highlights:

Search results on the web site appear with terms highlighted. The API does not have access to highlight information, but it does contain thumbnails. Each page has a permalink back to the Library of Congress site, which displays the page in a zoomable, draggable viewer similar to Google Maps.

The Library of Congress is focused on making these public domain works widely available. As such, this is an API without any registration or key necessary. That’s pretty wide open.

Among the interesting technical details is that the API can return linked data via RDF. It’s good to see reference sites, especially government ones, support semantic web formats (there are now 20 APIs in our directory with RDF support.)

Sources: Library of Congress, Programmable Web
Hat Tip: Dan C.

New Report: Digitisation of special collections: Mapping, assessment, prioritisation

Friday, October 30th, 2009

From the Executive Summary:

Traditionally, digitisation has been led by supply rather than demand. While end users are seen as a priority they are not directly consulted about which collections they would like to have made available digitally or why. This can be seen in a wide range of policy documents throughout the cultural heritage sector, where users are positioned as central but where their preferences are assumed rather than solicited. Post-digitisation consultation with end users is equally rare. How are we to know that digitisation is serving the needs of the Higher Education community and is sustainable in the long-term?

[Snip]

Key Findings

+ The communities of both intermediary and end users are willing to express their view on prioritising digitisation of special collections; the participation in the project was a matter of good will and the good response (see p. 25) makes evident that there is definitely interest of the professional communities to express their opinion on the matter of digitisation needs. It should be noted here that the community of intermediaries sees collections on a finer level of granularity; end users often refer to super-collections such as the holdings of an institution

+ The top user-driven priority criteria that emerged from consultation with both intermediaries and end users are: Improve access; Enhance impact on research and/or studies; Enhance impact on teaching; Allow for collaboration; Improve access outside

+ The geographic and institutional boundaries of collections nominated for digitisation are wider – this study was aimed at the higher education institutions in the UK, but 14% of the nominated collections were from institutions outside of the higher education sector, and 6% were from overseas (see p. 27)

+ The complementarity of collections is strongly favoured by both users’ communities (see section 5)

+ The criteria for digitisation nominated by intermediary and end users include general criteria but also a number of criteria where metrics can be applied; thus allowing to establish a ranking mechanism (see p. 45)

Access the Complete Report (62 pages; PDF)

Access the Final Report Appendices (94 pages; PDF)

Source: JISC, Research Information Network

Open Book Alliance Co-Founder Peter Brantley Visits Spain to Talk About the Alliance and Google Book Search

Friday, October 30th, 2009

Brantley is attending meetings in Spain and discussing the OBA and Google Book Search. He’s been interviewed by two newspapers, El Pais and Publico.es.

Here are links to both interviews in Spanish along with mechanically generated translations from two services.

1) “Google no ve libros, se limita a ver datos” (via El Pais)

+ Translation by Google: “Google does not see books, is limited to viewing data” (via El Pais)

+ Translation by Systran: “Google does not see books, is limited to see data” (via El Pais)

2) El bibliotecario que se enfrentó a Google (via Público.es)

+ Translation by Systran: “The Librarian Who Faced Google” (via Público.es)

+ Translation by Google: “The librarian who challenged Google” (via Público.es)

University of Illinois Press Signs Agreement With JSTOR

Tuesday, October 27th, 2009

From the Announcement:

The University of Illinois Press, the not-for-profit publishing division of the University of Illinois, and JSTOR, the preservation archive and research platform that is part of the not-for-profit ITHAKA, announced an agreement today to make leading journals from the Press available worldwide as part of the Current Scholarship Program.

The Current Scholarship Program is a new collaboration initiated by University of California Press and JSTOR and first announced on August 13, 2009.

[Snip]

Current and historical content from at least ten University of Illinois Press-published journals will be available on a re-designed JSTOR in 2011. This will offer faculty and students around the world access to current issues alongside back issues and a growing set of primary source materials from libraries easily and seamlessly. JSTOR’s nearly 6,000 library participants worldwide will be able to license the Press’s current journals, either individually or as part of current issue collections, together with JSTOR back issue collections in a single transaction. University of Illinois Press-published journals available as part of the Program will include American Journal of Psychology, American Music, Journal of Aesthetic Education, and Journal of American Ethnic History among others. The journals will also be preserved in Portico, the digital preservation service that is also part of ITHAKA.

Source: ITHAKA

New Research Findings: Students and the Mobile Internet

Tuesday, October 27th, 2009

Some new research from the U.K.

From the Summary:

The qualitative research with second year undergraduate students from a range of disciplines and universities, consisted of four focus groups and eight depth interviews, held in Manchester and London. The research was conducted by FDS International on behalf of Intute and the findings reinforce the motivation behind the work of the project, which is to provide a user friendly mobile site that is fast and inexpensive to load, providing the right content, presented in the right order and with an adapted layout.

[Snip]

The extent to which the mobile Internet was used varied greatly, with only a small number of students using their mobile Internet for academic work. Given the cost and generally slow access to the Internet from mobile devices, primarily determined by the type of contract and the handset, most students only ever occasionally accessed the Internet using their mobile phone for social purposes and for short durations of time. Consequently, those most likely regularly to access the internet on their mobile phones possessed new telephones with large screens, and had a contract which included free internet access. These represented only a small fraction of those interviewed.

Despite the fact that students rarely used the mobile Internet for their university course, many stated that they would if:

+ their phones had larger screens;
+ it was quick and easy to load and navigate websites; and
+ it was cheaper or free (included in their contract) to access the Internet.

Access the Complete Summary

See Also: Mobilising the Internet Detective (August 14, 2009)

Source: Intute

Spammers Continue to Abuse the Names of Top Government Executives by Misusing the Name of the United States Attorney General

Tuesday, October 27th, 2009

As with previous spam attacks, which have included the names of high-ranking FBI executives and names of various government agencies, a new version misuses the name of the United States Attorney General, Eric Holder.

The current spam alleges that the Department of Homeland Security and the Federal Bureau of Investigation were informed the e-mail recipient is allegedly involved in money laundering and terrorist-related activities. To avoid legal prosecution, the recipient must obtain a certificate from the Economic Financial Crimes Commission (EFCC) Chairman at a cost of $370. The spam provides the name of the EFCC Chairman and an e-mail address from which the recipient can obtain the required certificate.

Source: Federal Bureau of Investigation (FBI)

New Project Report: Newspaper Digitisation: British Newspapers 1620-1900

Tuesday, October 27th, 2009

From the Summary:

This report describes all of the stages and issues that occurred during a second complex mass newspaper digitisation project. The project was an innovative and challenging example of a public/private partnership between Gale Cengage Learning, CCS and the British Library.

Access the Executive Summary

Access the Complete Report (57 pages; PDF)

Source: JISC

See Also: Newspaper Digitisation News from the British Library: £33m Saves the World’s Greatest Newspaper Collection for the Nation

See Also: Video and Slides Available from OCR for the Mass Digitisation of Textual Materials Workshop

The World Media Has Responsibility to Save Audio-Visual Archives + Library of Congress Research Project

Tuesday, October 27th, 2009

October 27 is UNESCO Audio-visual Heritage Day.

From the Article:

Federation president Herbert Hayduck says that the world media community has a common responsibility to save audio-visual archives, many of which are on the verge of being lost.

Source: CCTV

See Also: UNESCO World Day for Audiovisual Heritage: Library of Congress Engaged in Cutting Edge Grooved Recording Imaging Research

In celebration of UNESCO’s World Day for Audiovisual Heritage, the Library of Congress Preservation Directorate is featuring information about an innovative project using imaging technology to recover ‘lost’ sound from grooved analog recordings.

+ Learn More about the IRENE Project

+ Webcast: Capturing Recorded Sound through Imaging: The I.R.E.N.E. Project and Future Prospects

See Also: UNESCO World Day for Audiovisual Heritage Web Page

See Also: Message from Director-General of UNESCO

Podcast: Professor Robert Darnton on Harvard’s Success With Open Access

Tuesday, October 27th, 2009

From the Summary:

In October 2008 Harvard University in the US adopted an open access policy for all its research papers to be made available in their university repository, on an opt-out basis. 12 months on, since the policy was adopted, JISC’s Rebecca O’Brien speaks with Professor Robert Darnton, Director of Harvard University Library and trustee of New York Public Library and the Oxford University Press (USA), about the cultural change that is taking place at Harvard and the background to why professors at the university decided to share their knowledge in this way.

The podcast runs 23 minutes. You’ll find it near the bottom of this page.

Source: JISC

See Also: DASH (Digital Access to Scholarship at Harvard): Harvard University Scholarly Repository

See Also: Harvard University Library: Open Collection Program

GeoCities Says So Long as Internet Archive Works to Preserve Content

Tuesday, October 27th, 2009

In August, we first posted about the Internet Archive (IA) asking GeoCities users to make sure their content was archived by the IA. Why? As of yesterday, GeoCities is no longer online.

From the Article:

Yahoo, which acquired the site for $3.57bn (£2.17bn) in 1999 at the height of the dotcom boom, said sites would no longer be accessible from 26th October.

However, many of the pages have been archived and will still be available to view via the nonprofit Internet Archive project.

The giant digital library, which has been archiving the public web since 1996, has set up a special project to archive GeoCities before it is lost forever.

“We’ve collected a lot of GeoCities sites over the years – but might not have every site and every page,” the Internet Archive said.

Access the Complete Article

Source: BBC

See Also: Saving a Historical Record of GeoCities (via Internet Archive)

Library of Congress’ National Digital Information Infrastructure and Preservation Program Wins Government Computer News Award

Saturday, October 24th, 2009

The NDIIPP is one of 11 projects to receive a GCN [Government Computer News] Award for Agency IT Achievement.

From the Summary:

It took two centuries for the Library of Congress to acquire its 29 million books and 105 million other items. Today, it only takes 15 minutes for the world to produce an equal amount of information in digital form, creating unprecedented archiving challenges for the Library of Congress. The Library is meeting the challenge of digital preservation by developing new tools to transfer large quantities of digital content. To date, more than 3 million files have been transferred and stored using the BagIt specification. Due to the Library’s digital preservation initiatives, more than 1,000 collections of digital content have been selected, captured, preserved, and made available to the U.S. public and online visitors across the globe.
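To give a rough idea of what packaging content with BagIt involves, here is a minimal Python sketch, not the Library’s actual tooling. It assumes the basic bag layout (a `bagit.txt` declaration, a `data/` payload directory, and a checksum manifest); the function names, the `0.97` version string, and the choice of MD5 are our assumptions for illustration.

```python
import hashlib
from pathlib import Path


def make_bag(bag_dir, files):
    """Write a minimal BagIt-style bag.

    `files` maps relative payload paths to their bytes. The payload goes
    under data/, and each file's checksum is recorded in manifest-md5.txt.
    """
    bag = Path(bag_dir)
    bag.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for rel_path, content in sorted(files.items()):
        target = bag / "data" / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.md5(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{rel_path}")
    # Bag declaration; the version string here is an assumption.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag / "manifest-md5.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag


def verify_bag(bag_dir):
    """Return True if every payload file matches its manifest checksum."""
    bag = Path(bag_dir)
    manifest = (bag / "manifest-md5.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.md5((bag / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True
```

The point of the manifest is that the receiving side can re-run the checksums after a transfer and confirm that nothing was lost or altered along the way.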

Access the Complete Article

We are warned to be careful about what we put online because data on the Internet lives forever. But keeping random copies of files on servers, routers and databases is not the same as preservation, said Martha Anderson, director of program management for the Library of Congress’ National Digital Information Infrastructure and Preservation Program. Digital data can be ephemeral. “That is the paradox,” she said.

Much More in the Summary and Complete Article

Source: GCN

See Also: Library of Congress News Release

Electronic Frontier Foundation and Other Groups Send Letter to Judge in Google Book Search Case

Friday, October 23rd, 2009

From a Blog Post:

EFF today led a coalition of authors, publishers, companies and nonprofit organizations in sending a letter to the judge overseeing the Google Book Search settlement urging the Court to ensure that those concerned about the settlement receive adequate notice of, and have sufficient time to study and comment on, any amended settlement agreement that Google, the Authors Guild, and the Association of American Publishers present.

Those following the twists and turns of the Google Book Search settlement will recall that the original Fairness Hearing scheduled for October 7, 2009, was put off because of what the Court called: “significant issues, as demonstrated not only by the number of objections, but also by the fact that the objectors include countries, states, non-profit organizations, and prominent authors and law professors.” The Court received over 400 submissions about the settlement, including the EFF-led coalition of authors and publishers concerned about reader privacy, as well as significant concerns raised by the Department of Justice.

Read the Complete Letter Sent to Judge Denny Chin (4 pages; PDF)

The letter was signed by a large group of people and organizations including:

+ The Open Book Alliance*
+ Amazon.com
+ The Picture Archive Council of America
+ National Writers Union
+ Electronic Frontier Foundation
+ Pamela Samuelson (UC Berkeley Law Professor)
+ Microsoft
+ Washington Legal Foundation
+ The Internet Archive
+ Consumer Watchdog
+ Lyrasis, Nylink and Bibliographical Center for Research Rocky Mountain, Inc.
+ Public Knowledge
+ Urban Libraries Council

* The Special Libraries Association and the New York Library Association are two of the members of the Open Book Alliance.

Source: Electronic Frontier Foundation

Getting to Know the HathiTrust Digital Library

Friday, October 23rd, 2009

Barbara Quint Writes:

With all the controversy still swirling around Google Books and its post-settlement offerings, an alternative route to the millions of digitized books and journals supplied by leading Google Book Search library partners has arrived. The HathiTrust (www.hathitrust.org) is a collaboration of 25 research libraries already participating in Google Book Search to produce a shared digital repository for preservation and access to a curated collection. By mid-November, the HathiTrust Digital Library will have a full-featured, full-text search service for 4.3-5 million items. The searches will retrieve bibliographic citations and page references, including those for in-copyright books. Content will extend beyond the digitized copies of books returned to early library partners by Google. HathiTrust is pushing to acquire other digitized special collections from its members, as well as making arrangements for opening access to university press books.

[Snip]

The new launch will open indexing to nearly 1.5 billion pages from well more than 4.3 million volumes with full-text searching by keyword or phrase. (Just between us, if you simply cannot wait until mid-November, go to http://babel.hathitrust.org/cgi/ls. [John] Wilkin, [associate university librarian at the University of Michigan and executive director of the HathiTrust], tipped me off that, [our emphasis] although this “experimental search” site claims to search only 500,000 documents, it actually includes the full 4.3-5 million volumes. Feedback options appear at the top and bottom of each search results page.) The system already had the equivalent of library cataloging searching, though they expect to upgrade even that kind of searching under a cooperative program with OCLC.

Much More in the Complete Article

Source: InfoToday NewsBreaks

China: Google Responds to Complaints Regarding Copyright Issues

Friday, October 23rd, 2009

It was just a few days ago when we posted that the China Written Works Copyright Society (CWWCS) was not happy with Google over copyright issues stemming from Google Book Search.

Today, in another Wall Street Journal blog post, we learn that Google has responded to CWWCS.

From the Post:

Here is the latest from Google:

“Today we have more than 50 Chinese publishers participating in Google Book Search, who together have authorized more than 30,000 books to be found through Google web search–and made available through a short preview. We also have some Chinese books that have been scanned by our Book Search library partners; in those cases, we only make the books available as a short snippet of text–as we do with web search–unless the rightsholder authorizes a greater use. We also honor rightsholders’ preferences if they ask not to be included.”

“Like all rightsholders, Chinese authors and publishers will be able to tell Google whether or not to display their books, and will be paid if the books are included in sales or subscriptions authorized under the settlement.”

Source: WSJ

See Also: Here’s How The Story Was Reported in the China Daily
Hat Tip: James Grimmelmann, The Laboratorium

Google Book Search: Video from D for Digitize Conference is Now Available Online

Friday, October 23rd, 2009

A few weeks ago the D for Digitize Conference took place. It was sponsored by the New York Law School and organized by Professor James Grimmelmann. The focus of the conference was Google Book Search (GBS). The list of speakers/panelists reads like a Who’s Who of people representing all sides of the many issues being debated at the conference and elsewhere.

Now, you can watch each session online (free). Even two pre-conference tutorials are included. A list of sessions and speakers along with links to the videos can be accessed here.

Finally, if you want to read about what was discussed during a session before viewing the video, or just don’t have time to watch, no worries. Peter Hirtle from the LibraryLaw Blog provides excellent text summaries of each session.

LibraryLaw Blog is a co-production between Peter and Mary Minow.

See Also: LibraryLaw Blog also has a Twitter feed at:
http://twitter.com/librarylaw

Article: Missing Links: The Enduring Web

Thursday, October 22nd, 2009

From the Abstract:

The Web runs at risk. Our generation has witnessed a revolution in human communications on a trajectory similar to that of the origins of the written word and language itself. Early Web pages have an historical importance comparable with prehistoric cave paintings or proto-historic pressed clay ciphers. They are just as fragile. The ease of creation, editing and revising gives content a flexible immediacy: ensuring that sources are up to date and, with appropriate concern for interoperability, content can be folded seamlessly into any number of presentation layers. How can we carve a legacy from such complexity and volatility?

Access the Complete Article (PDF)

Source: International Journal of Digital Curation (4.2)

Washington University: Libraries receive federal grant to digitize pre-war slave lawsuits

Wednesday, October 21st, 2009

Here’s more about a very brief item we posted when IMLS National Leadership Grants were announced at the end of September.

From the Article:

Washington University Libraries received one of the largest grants in the institution’s history, a $376,426 National Leadership Grant from the Institute of Museum and Library Services. The money will fund the St. Louis Freedom Suits Legal Encoding Project, which aims to digitize pre-Civil War lawsuits that slaves brought against slaveholders in the St. Louis Circuit Court.

[Snip]

The newly funded Freedom Suits Legal Encoding Project takes the digitalization process a step further. In addition to finishing the scanning of more than 20,000 pages of city directories and court records, the project also seeks to transcribe the documents to enable full-text searches.

[Snip]

The primary novel aspect of this project is to “develop extensions to the Text Encoding Initiative (TEI) for encoding legal documents to reflect legal function, genres and roles, and employ these extensions in this collection,” according to a grant announcement.

In other words, this project seeks to develop a computer language for annotating the legal functions of documents. This language would be comparable to HTML, which is used to denote structural semantics for Web pages. Ultimately, this innovation will be integrated into TEI, the existing language, to provide a model for similar archives.
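To make the idea of annotating legal roles and functions a bit more concrete, here is a small Python sketch that builds a TEI-style fragment. It is purely illustrative: the project’s actual extensions are not described in the article, so the `petition`, `plaintiff`, and `defendant` values, the function name, and the sample names are all hypothetical placeholders of ours.

```python
import xml.etree.ElementTree as ET


def encode_petition(plaintiff, defendant, text):
    """Build a TEI-like fragment marking the parties and the document type.

    The type and role values are hypothetical; they only illustrate how
    markup can record legal function alongside the transcribed text.
    """
    div = ET.Element("div", {"type": "petition"})
    p = ET.SubElement(div, "p")
    first = ET.SubElement(p, "persName", {"role": "plaintiff"})
    first.text = plaintiff
    first.tail = " v. "
    second = ET.SubElement(p, "persName", {"role": "defendant"})
    second.text = defendant
    second.tail = ": " + text
    return ET.tostring(div, encoding="unicode")


if __name__ == "__main__":
    # Invented sample names, not taken from any real case file.
    print(encode_petition("Jane Doe", "John Roe", "petition for freedom"))
```

Once roles are marked up this way, a full-text search can be narrowed to, say, every document in which a given person appears as a plaintiff.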

Access the Complete Article

Source: Student Library (Washington University, St. Louis, MO)