QProber: Classifying and Searching "Hidden-Web" Text Databases

PERSIVAL Project
Computer Science Department
Columbia University

Project Summary

Many valuable text databases on the web have non-crawlable contents that are "hidden" behind search interfaces, so traditional search engines do not index this information. One way to facilitate access to "hidden-web" databases is through commercial Yahoo!-like directories, which manually organize these databases into categories that users can browse. Our QProber system automates the classification of searchable text databases (whether their contents are "hidden" or not) by adaptively probing the databases with queries derived from document classifiers, without retrieving any documents. A large-scale experimental evaluation over 130 real web databases indicates that our technique produces highly accurate database classification results using, on average, fewer than 200 queries of four words or fewer to classify a database (TOIS'03 paper; SIGMOD'01 paper). Interestingly, our technique is also attractive for classifying crawlable text databases (i.e., databases whose contents are not "hidden"), as long as search interfaces for the databases are available (IEEE Data Engineering Bulletin'02 paper).
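The adaptive probing idea can be illustrated with a minimal Python sketch. It is a simplified illustration and not the QProber implementation: the category tree, the probe queries, the two thresholds (min_coverage, min_share), and the match_count callback are all hypothetical. The sketch assumes only that a database's search interface reports how many documents match a query, so no documents are retrieved.

```python
from typing import Callable, Dict, List

# Hypothetical category hierarchy (illustrative only).
CHILDREN: Dict[str, List[str]] = {
    "Root": ["Health", "Sports"],
    "Health": ["Cancer", "Diabetes"],
    "Sports": [],
    "Cancer": [],
    "Diabetes": [],
}

# Hypothetical probe queries; in practice these would be derived
# from a document classifier trained for each category.
PROBES: Dict[str, List[str]] = {
    "Health": ["medicine hospital", "patient treatment"],
    "Sports": ["soccer league", "basketball playoffs"],
    "Cancer": ["chemotherapy tumor"],
    "Diabetes": ["insulin glucose"],
}


def classify(
    match_count: Callable[[str], int],
    node: str = "Root",
    min_coverage: int = 10,   # assumed threshold on absolute match counts
    min_share: float = 0.4,   # assumed threshold on share among siblings
) -> List[str]:
    """Return the most specific categories the database appears to cover.

    Only match counts are used; probing descends into a subcategory
    only when its probes generate enough matches (adaptive probing).
    """
    counts = {
        child: sum(match_count(q) for q in PROBES[child])
        for child in CHILDREN[node]
    }
    total = sum(counts.values())
    result: List[str] = []
    for child, n in counts.items():
        if total and n >= min_coverage and n / total >= min_share:
            result.extend(classify(match_count, child, min_coverage, min_share))
    # If no child qualifies, the database stays at the current node.
    return result or [node]


# Usage with a stand-in for a real search interface.
fake_counts = {
    "medicine hospital": 120, "patient treatment": 80,
    "soccer league": 3, "basketball playoffs": 1,
    "chemotherapy tumor": 90, "insulin glucose": 5,
}
categories = classify(lambda q: fake_counts.get(q, 0))
```

Because subcategories are probed only when their parent qualifies, the number of queries sent depends on the database's contents rather than on the size of the full hierarchy.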

An alternative way to facilitate access to text databases is through "metasearchers," which provide a unified query interface to search many databases at once. For efficiency, a critical task for a metasearcher is selecting the most promising databases to search for a query, a task that typically relies on statistical summaries of the database contents. We derive content summaries from searchable text databases by exploiting our probing-based classification algorithm to adaptively zoom in on, and extract, documents that are representative of the topic coverage of the databases. We can then build content summaries from these topically focused document samples. A large-scale experimental evaluation over a variety of databases indicates that our content-summary construction technique is efficient and produces more accurate summaries than those from previously proposed strategies (VLDB'02 paper; SIGMOD'04 paper).
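As a rough illustration of how a metasearcher might use such summaries, the Python sketch below builds a content summary from a document sample and ranks databases for a query. It rests on assumptions not stated above: a summary is taken to be the fraction of sampled documents containing each word, and the scoring rule (the product of those fractions over the query words) is one simple choice among many. The function names and sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def build_summary(sample_docs: List[str]) -> Dict[str, float]:
    """Map each word to the fraction of sampled documents containing it."""
    n = len(sample_docs)
    if n == 0:
        return {}
    doc_freq: Counter = Counter()
    for doc in sample_docs:
        # Count each word once per document.
        doc_freq.update(set(doc.lower().split()))
    return {word: count / n for word, count in doc_freq.items()}


def score(summary: Dict[str, float], query: str) -> float:
    """Assumed scoring rule: product of per-word document fractions."""
    result = 1.0
    for word in query.lower().split():
        result *= summary.get(word, 0.0)
    return result


def rank_databases(summaries: Dict[str, Dict[str, float]], query: str) -> List[str]:
    """Order database names from most to least promising for the query."""
    return sorted(summaries, key=lambda name: score(summaries[name], query), reverse=True)


# Usage with hypothetical document samples from two databases.
summaries = {
    "med": build_summary(["cancer treatment trial", "cancer drug", "heart surgery"]),
    "sports": build_summary(["soccer match", "cancer charity soccer game"]),
}
ranking = rank_databases(summaries, "cancer treatment")
```

A word missing from a summary drives the score to zero here, which is why the quality of the document sample matters: a sample that reflects the database's actual topic coverage is less likely to omit words that the database does contain.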

Panagiotis G. Ipeirotis (panos@nyu.edu)