A Comparison of Open Source Search Engines
46 pages; PDF.
by Christian Middleton, Ricardo Baeza-Yates
The present work is, to the best of our knowledge, the first study to compare the main features of 17 search engines, as well as their performance during indexing and retrieval tasks with different document collections and several types of queries. The objective of this work is to serve as a reference for deciding which open source search engine best fits the particular constraints of the search problem to be solved. Chapter 2 presents a background of the general concepts of Information Retrieval. Chapter 3 describes the search engines used in this work. Chapter 4 then describes the methodology used during the experiments. Sections 5.1 and 5.2 present the results of the different experiments conducted, and Section 5.3 analyzes these results. Finally, Chapter 6 presents the conclusions.
Which engines were considered? Which were compared? From pages 17-18:
We compared 29 search engines: ASPSeek, BBDBot, Datapark, ebhath, Eureka, ht://Dig, Indri, ISearch, IXE, Lucene, Managing Gigabytes (MG), MG4J, mnoGoSearch, MPS Information Server, Namazu, Nutch, Omega, OmniFind IBM Yahoo! Ed., OpenFTS, PLWeb, SWISH-E, SWISH++, Terrier, WAIS/freeWAIS, WebGlimpse, XML Query Engine, XMLSearch, Zebra, and Zettair.
Based on the information collected, it is possible to discard some projects because they are outdated (e.g. the last update predates the year 2000), because they are no longer maintained or have stalled, or because it was not possible to obtain information about them. For these reasons we discarded ASPSeek, BBDBot, ebhath, Eureka, ISearch, MPS Information Server, PLWeb, and WAIS/freeWAIS.
In some cases, a project was rejected because of additional factors. For example, although the MG project (presented in the book “Managing Gigabytes”) is one of the most important works in the area, it was not included in this work because it has not been updated since 1999. Another special case is the Nutch project. The Nutch search engine is based on the Lucene search engine, and is just an implementation that uses the API provided by Lucene. For this reason, only the Lucene project will be analyzed. Finally, XML Query Engine and Zebra were discarded since they focus on structured data (XML) rather than on semi-structured data such as HTML. Therefore, the initial list of search engines that we wanted to cover in the present work was:
Datapark, ht://Dig, Indri, IXE, Lucene, MG4J, mnoGoSearch, Namazu, OmniFind, OpenFTS, Omega, SWISH-E, SWISH++, Terrier, WebGlimpse (Glimpse), XMLSearch, and Zettair. However, in the preliminary tests we observed that the indexing times for Datapark, mnoGoSearch, Namazu, OpenFTS, and Glimpse were 3 to 6 times longer than those of the rest of the search engines for the smallest database, and hence we also did not consider them in the final performance comparison.
Source: Universitat Pompeu Fabra (Barcelona, Spain)
