Archive for the ‘Search Tools’ Category

Research Paper: SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections

Thursday, May 1st, 2008

8 pages; PDF.

From the abstract:

Motivated by our work with political scientists who need to manually analyze large Web archives of news sites, we present SpotSigs, a new algorithm for extracting and matching signatures for near duplicate detection in large Web crawls. Our spot signatures are designed to favor natural language portions of Web pages over advertisements and navigational bars.

The contributions of SpotSigs are twofold: 1) by combining stopword antecedents with short chains of adjacent content terms, we create robust document signatures with a natural ability to filter out noisy components of Web pages that would otherwise distract pure n-gram-based approaches such as Shingling; 2) we provide an exact and efficient self-tuning matching algorithm that exploits a novel combination of collection partitioning and inverted index pruning for high-dimensional similarity search. Experiments confirm an increase in combined precision and recall of more than 24 percent over state-of-the-art approaches such as Shingling or I-Match and up to a factor of 3 faster execution times than Locality Sensitive Hashing (LSH), over a demonstrative “Gold Set” of manually assessed near-duplicate news articles as well as the TREC WT10g Web collection.

Source: Stanford InfoLab
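
For readers curious what "stopword antecedents with short chains of adjacent content terms" might look like in practice, here is a rough Python sketch of the signature idea. It is our own illustration, not the authors' code: the stopword lists, the chain length of two, the function names, and the use of plain set-based Jaccard similarity for comparison are all assumptions made for the example. It does not attempt the partitioning or inverted index pruning described in the abstract.

```python
import re

# Assumed, illustrative word lists and chain length (not taken from the paper).
ANTECEDENTS = {"a", "an", "the", "is"}
STOPWORDS = ANTECEDENTS | {"to", "of", "for", "in", "on", "and", "at", "off"}
CHAIN_LENGTH = 2


def spot_signatures(text, chain_length=CHAIN_LENGTH):
    """Return signatures of the form 'antecedent:term1:term2' found in text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    signatures = set()
    for i, token in enumerate(tokens):
        if token not in ANTECEDENTS:
            continue
        # Collect the next non-stopword terms that follow the antecedent.
        chain = []
        for following in tokens[i + 1:]:
            if following not in STOPWORDS:
                chain.append(following)
                if len(chain) == chain_length:
                    break
        # Keep only complete chains.
        if len(chain) == chain_length:
            signatures.add(":".join([token] + chain))
    return signatures


def jaccard(a, b):
    """Set-based Jaccard similarity; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

With these assumed settings, a menu-like string such as "Home | News | Sports | Contact" contains no antecedents and so yields no signatures at all, while a sentence of ordinary prose yields several. That is the intuition behind favoring natural language over navigation bars and ads.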

Briefs: More New Google Features

Friday, April 18th, 2008

+ Google Maps Now Offers Traffic Predictions (via SEL)

+ Google News Makes Quotes More Discoverable (via SEL)

Microsoft Launches Live Search News

Thursday, April 17th, 2008

Barry Schwartz writes:
Live Search News takes a more linear view of news, when you compare it to the Yahoo News home pages. Live Search News looks more like a Techmeme style news approach, but it obviously uses a different algorithm.

Direct to Live Search News

Source: Search Engine Land

See Also:
Two More Excellent News Resources:

1) NewsNow

2) Topix

Briefs: OCLC and Orbis Cascade Alliance to develop new consortial borrowing solution

Wednesday, April 16th, 2008

+ OCLC and Orbis Cascade Alliance to develop new consortial borrowing solution

+ Hakia Launches Health Vertical

+ Updated Web Browsers: Which One Works Best? (via PC World)

+ Compete: Microsoft Gains Share; Google Hits New High In Raw Searches (via Search Engine Land)

+ AOL Acquires Sphere (via News Release)

+ Google’s Paid Clicks Weak In March, Says ComScore (via Dow Jones)

Indeed.com Launches Job Search by Salary

Wednesday, April 16th, 2008

From a blog post overview:

You can now enter an annual salary in the keyword search box to find all jobs we estimate pay at least that much. To find marketing manager positions paying over $60,000 per year, for example, search Marketing Manager $60,000.

Source: Indeed.com

Five Web-Based Apps and Tools Worth a Look: Image Search; Shorter URLs; Web Organization; YouTube Spying; New Tech News Aggregator

Friday, March 28th, 2008

Here are five new web-based tools and apps we discovered using KillerStartUps.com. Perhaps one or more will be of interest to you or those you work with.

+ Picollator.com - An Image Based Search Engine

+ LinkGap.com - Shortens Those Long URLs

+ TubeSpy - Spy On Other YouTube Viewers

+ Techsted.com - Technology News Aggregator

+ Eluma.com - Organize that Web Clutter

Rocketinfo Launches New Version of News Search Engine (Rocketnews.com)

Friday, March 28th, 2008

An online news search pioneer releases some new technology. We’re going to give it a whirl.

From the announcement:

Rocketnews.com goes further, working with news seekers to bring them what they are looking for by creating easy to configure, user-defined feeds from a database of over 60,000 sources, and growing…Rocketnews.com introduces the Topic Discovery Engine, which expands a contextual search to include blog posts, photos, video clips and research data, besides an abundance of updated and historical news. The Topic Discovery Engine examines all 60,000 news sources; it collects, analyzes and categorizes news stories; and then updates category pages, topic pages and related RSS feeds. Topic pages, a new feature at Rocketnews.com, highlight popular news topics by displaying related news stories, blog posts, photos and noteworthy quotes.

Source: News Release

RSS — eufeeds: over 300 newspapers updated every 20 minutes

Sunday, March 23rd, 2008

From RSS4Lib:

EUFeeds is a special-purpose RSS aggregator for European newspapers that provides access to more than 300 papers from the European Union. Provided by the European Journalism Centre in the Netherlands, this site lets you quickly browse the print media from each EU member nation.

The site defaults to UK newspapers; there is no apparent way to set a different country as your default entry page. It also does not provide an RSS feed for the aggregated content — so you cannot subscribe to the aggregated Czech Republic news, only visit it on a web page.

New Lookup Database from Melissa Data: Email Location

Saturday, March 22nd, 2008

The folks at Melissa Data have just placed a new email location database online at no charge. After entering the email address, the database will tell you where the mail server is located. Of course, this does not guarantee that the sender is located in the same place. For example, the mail server might be located in the UK but the sender is in the U.S.

Direct to Email Lookup Database Interface

Displays the city, state, country & a map of an email address.

Review All Melissa Data Lookup Databases

Source: Melissa Data

SearchMedica Offers Medical Professionals Six New Specialized Clinical Web Searches

Wednesday, March 19th, 2008


From the news release:

SearchMedica adds cardiovascular, diabetes/endocrine, infectious disease, musculoskeletal, pediatric, and respiratory disease categories to cancer/hemic, mental/nervous system and general medicine.

Direct to SearchMedica

Updated: Databases: Chronicling America Newspaper Site Adds More Pages, Features

Monday, March 17th, 2008


From the announcement:

More than 79,000 newly digitized newspaper pages, along with several new site features, have recently been added to the Chronicling America Web site at www.loc.gov/chroniclingamerica/. With this update, the site now provides access to more than 500,000 digitized newspaper pages, dating primarily from 1900 to 1910, and representing 61 newspapers from California, the District of Columbia, Florida, Kentucky, New York, Utah and Virginia. Chronicling America is a project of the National Digital Newspaper Program (NDNP), which is a partnership between the Library of Congress and the National Endowment for the Humanities (NEH).

New features in Chronicling America include:

+ “See All Available Newspapers” page - A list of all newspapers with pages available on the site.

+ RSS feed and E-mail Update service - Users can subscribe to Really Simple Syndication (RSS) updates or e-mail delivery at www.loc.gov/rss/ (see list under Topics/Newspapers and Journalism). Updates will include notices of added content and other points of interest.

Make sure to see the news release with links to a few highlights from the database.

Source: LC

CrossRef Integrates With Papers (Software for the Mac) To Help Scientists Manage Their Personal Libraries

Friday, March 14th, 2008

From the announcement:

CrossRef, the multi-publisher linking platform, announced today that Mekentosj, creator of Papers, had signed on as a CrossRef affiliate in order to integrate DOIs and CrossRef metadata into its services. Papers is an award-winning application for researchers that improves their Mac-based workflow for searching, downloading, and managing PDF articles.

Papers already uses the DOI as a standard way to identify and look up scientific articles. With the new partnership, Papers will add a tighter integration with CrossRef’s OpenURL service to facilitate the discovery of both new and existing scientific publications. As a result of the CrossRef integration, Papers can recognize the DOI in PDF files and on web pages, and automatically retrieve the available bibliographic information, including title, authors and journal names, from CrossRef’s metadata database. With one click, this information is then added to the researcher’s personal library, making scientific articles more accessible and manageable.

Source: CrossRef
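
As a small illustration of what "recognizing the DOI" in a block of text can involve, here is a hedged Python sketch. It is not how Papers or CrossRef do it. The pattern is a common-sense approximation of the DOI shape ("10.", a numeric prefix, a slash, then a suffix), the function name is ours, and the sample DOIs in the usage note are made up. Retrieving metadata for a DOI is left out entirely.

```python
import re

# Approximate DOI shape: "10." + 4-9 digit prefix + "/" + suffix characters.
# This is an assumed pattern for illustration and will miss unusual DOIs.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_dois(text):
    """Return the distinct DOI-like strings in text, in order of appearance."""
    found = []
    for match in DOI_PATTERN.finditer(text):
        # Trailing sentence punctuation is very unlikely to belong to the DOI.
        doi = match.group(0).rstrip(".,;")
        if doi not in found:
            found.append(doi)
    return found
```

For example, find_dois("See doi:10.1000/xyz123, and https://doi.org/10.1234/abc.def.") picks out the two made-up identifiers 10.1000/xyz123 and 10.1234/abc.def.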

BizJournals.com Introduces gClick, A Place to Find Company Profiles and Related Info

Thursday, January 31st, 2008

A full review will be coming soon on ResourceShelf.

What is it?

1) Available for IE only!

2) The gClick™ button allows readers to dynamically extract real-time comprehensive intelligence on companies, executives, and events from any Web page with the click of a button. Within seconds, you can go from scanning an article, anywhere on the web, to viewing in-depth information about the companies and executives referenced in the article.

3) gClick gathers real-time, contextual business intelligence from any story or HTML page by clicking on the button or using embedded links.

More here. The technology comes from a company named Generate Inc. American City Business Journals became a “Strategic Investor” in Generate Inc. in 2005.

Here are two screen caps of gClick in action using a WSJ story. It works with all content, not only American City Biz Journals material.

1 (the story itself) ||| 2 (clicking on a company mentioned in the story)

Worth a look, and more is coming from RS in the future about gClick. It’s a free app, btw. We hope a Firefox version is also in the works.

Briefs: It’s Hard to Hide From Your ‘Friends’; Oklahoma Governor Pushes Bill To Create Rx Drug Web Site

Thursday, January 31st, 2008

+ Add Footnotes & Endnotes to your Zoho Writer Documents
Hooray! Hooray!! Hooray!!!

+ Google Universal Search: 2008 Edition (via SEL)

+ Hackers Rig Google to Deliver Malware (via PC World)

+ It’s Hard to Hide From Your ‘Friends’ (via WSJ)
Note: No mention of the Ask.com Eraser feature that might also be of interest. You can read about it here. Gary is Director of Online Info Resources at Ask.com.

+ Oklahoma Governor Pushes Bill To Create Rx Drug Web Site

+ Middle East and Asia lose internet access after cable fails (via The Guardian, Hat Tip, Barry)

New Health Topic Resources from MedlinePlus: Diabetes Complications

Saturday, January 12th, 2008

New Health Topic Resources from MedlinePlus: Diabetes Complications

Source: MedlinePlus