New Research: SpotSigs: Near-Duplicate Detection in Web Page Collections

A recently published research paper from the Stanford Info Lab:
SpotSigs: Near-Duplicate Detection in Web Page Collections
by Siddharth Jonathan and Andreas Paepcke

From the abstract:

Motivated by our work with political scientists we present an algorithm that detects near-duplicate Web pages. These scientists analyze Web archives of news sites. The archives were collected with crawlers and contain a large number of pages that look very different because the frame around their core content differs. However, the news stories in the pages are nearly identical. The close proximity of unrelated items on the pages makes the detection of content overlap difficult. Our SpotSigs algorithm generates signatures that are spread across each document. Places for these signatures are determined by the placement of common words, like ‘is’ and ‘they’ in the documents. We can vary our method of computing the signatures. Using hash collisions the algorithm detects overlap among the signatures of matching contents. We explore how the different SpotSig parameters impact precision and recall performance.
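
To make the idea in the abstract more concrete, here is a minimal Python sketch of signatures anchored at common words. It is not the authors' implementation. The abstract does not say how a signature is built, so this sketch assumes that each signature is an anchor word followed by the next few non-common words, and that overlap is scored with Jaccard similarity over the hashed signatures. The names ANTECEDENTS, STOPWORDS, spot_signatures, jaccard, and chain_length are hypothetical and chosen only for illustration.

```python
import hashlib
import re

# Assumption: a small, hand-picked set of common anchor words.
# The abstract only names 'is' and 'they' as examples.
ANTECEDENTS = {"is", "they", "the", "a", "that", "was"}

# Assumption: common words that are skipped when building a signature.
STOPWORDS = ANTECEDENTS | {"of", "in", "and", "to", "on", "for", "with", "by", "at", "an"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def spot_signatures(text, chain_length=2):
    """Return a set of hashed signatures anchored at common words.

    Each signature is an anchor word plus the next `chain_length`
    tokens that are not in STOPWORDS. Anchors too close to the end
    of the text to complete a chain are dropped.
    """
    tokens = tokenize(text)
    signatures = set()
    for i, token in enumerate(tokens):
        if token not in ANTECEDENTS:
            continue
        chain = []
        for following in tokens[i + 1:]:
            if following in STOPWORDS:
                continue
            chain.append(following)
            if len(chain) == chain_length:
                break
        if len(chain) == chain_length:
            key = token + ":" + ":".join(chain)
            # Hashing lets equal signatures be matched by digest equality.
            signatures.add(hashlib.sha1(key.encode("utf-8")).hexdigest())
    return signatures


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Under these assumptions, navigation links and footers contribute few signatures because they tend to contain few common words, while the running text of a news story contributes many. Two pages that wrap the same story in different frames would therefore be expected to score a high similarity, and pages with unrelated stories a low one.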

Direct to Full Text (8 pages; PDF)

Source: Stanford Info Lab
