Research Paper: Detecting Spam Web Pages through Content Analysis

Detecting Spam Web Pages through Content Analysis
10 pages; PDF.
by Alexandros Ntoulas, Marc Najork, Mark Manasse, and Dennis Fetterly
Full text of the paper presented at the 15th International World Wide Web Conference (WWW 2006), Edinburgh, United Kingdom, May 2006.
From the abstract:

In this paper, we continue our investigations of “web spam”: the injection of artificially-created pages into the web in order to influence the results from search engines, to drive traffic to certain pages for fun or profit. This paper considers some previously-undescribed techniques for automatically detecting spam pages, examines the effectiveness of these techniques in isolation and when aggregated using classification algorithms. When combined, our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages (13.8%) in our judged collection of 17,168 pages, while misidentifying 526 spam and non-spam pages (3.1%).
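
The percentages quoted in the abstract can be checked from the raw counts. The short Python sketch below does only that arithmetic. The variable and function names are illustrative rather than taken from the paper, and the denominators are inferred from the wording: the 3.1% figure is assumed to be a share of the whole judged collection, since that is the reading that reproduces the quoted number.

```python
# Counts quoted in the abstract.
total_pages = 17_168    # judged collection
spam_pages = 2_364      # pages judged to be spam
spam_detected = 2_037   # spam pages correctly identified
misidentified = 526     # spam and non-spam pages classified wrongly


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Illustrative names; the denominators are assumptions inferred from the text.
detected_share = pct(spam_detected, spam_pages)   # share of spam pages caught
spam_share = pct(spam_pages, total_pages)         # share of the collection that is spam
error_share = pct(misidentified, total_pages)     # share of the collection misidentified

print(detected_share, spam_share, error_share)
```

Under these assumptions the three values round to the 86.2%, 13.8%, and 3.1% given in the abstract.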
