PERSIVAL Project
Computer Science Department
Columbia University
Many valuable text databases on the web have non-crawlable contents that are "hidden" behind search interfaces, so traditional search engines do not index this information. One way to facilitate access to "hidden-web" databases is through commercial Yahoo!-like directories, which manually organize these databases into categories that users can browse. Our QProber system automates the classification of searchable text databases (whether their contents are "hidden" or not) by adaptively probing the databases with queries derived from document classifiers, without retrieving any documents. A large-scale experimental evaluation over 130 real web databases indicates that our technique produces highly accurate database classification results using, on average, fewer than 200 queries of four words or fewer to classify a database (TOIS'03 paper; SIGMOD'01 paper). Interestingly, our technique is attractive even for classifying crawlable text databases (i.e., databases whose contents are not "hidden"), as long as search interfaces for the databases are available (IEEE Data Engineering Bulletin'02 paper).
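The probing idea above can be illustrated with a minimal sketch. It assumes a toy in-memory list of documents standing in for a remote search interface that reports only match counts; the probe queries, the `min_share` and `min_matches` thresholds, and all function names are hypothetical choices for illustration, not QProber's actual rules or parameters.

```python
def match_count(database, query):
    """Stand-in for a search interface: number of documents containing every query word."""
    words = query.lower().split()
    return sum(all(w in doc.lower().split() for w in words) for doc in database)


def classify(database, probes, min_share=0.4, min_matches=1):
    """Assign categories using only match counts; no document is retrieved.

    probes maps a category name to the short queries associated with it
    (assumed here to come from a document classifier's rules).
    """
    # Total number of matches generated by each category's probes.
    counts = {
        category: sum(match_count(database, q) for q in queries)
        for category, queries in probes.items()
    }
    total = sum(counts.values())
    chosen = []
    for category, count in counts.items():
        share = count / total if total else 0.0
        # Keep a category only if it has enough matches in absolute
        # terms and accounts for a large enough share of all matches.
        if count >= min_matches and share >= min_share:
            chosen.append(category)
    return sorted(chosen), counts


# Toy data (hypothetical): a mostly health-related database.
database = [
    "heart disease treatment",
    "cancer therapy trial",
    "heart surgery recovery",
    "soccer match report",
]
probes = {
    "Health": ["heart", "cancer therapy"],
    "Sports": ["soccer", "basketball"],
}
categories, counts = classify(database, probes)
```

With this toy data the Health probes account for three of the four matches, so only Health passes the share threshold. A real deployment would replace `match_count` with a call to the database's search interface and read the reported number of hits.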
An alternative way to facilitate access to text databases is through "metasearchers," which provide a unified query interface to search many databases at once. For efficiency, a critical task for a metasearcher is selecting the most promising databases to search for a query, a task that typically relies on statistical summaries of the database contents. We derive content summaries from searchable text databases by exploiting our probing-based classification algorithm to adaptively zoom in on and extract documents that are representative of the topic coverage of the databases. We can then build content summaries from these topically focused document samples. A large-scale experimental evaluation over a variety of databases indicates that our content-summary construction technique is efficient and produces more accurate summaries than those from previously proposed strategies (VLDB'02 paper; SIGMOD'04 paper).
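As a rough illustration of how a metasearcher might use content summaries, the sketch below builds a summary (per-word document frequencies) from a document sample and ranks databases for a query. The scoring rule, a simple product of document-frequency fractions, is an assumption chosen for brevity; it is not the selection algorithm evaluated in the papers, and all names and sample data are hypothetical.

```python
from collections import Counter


def build_summary(sample_docs):
    """Content summary: sample size plus, for each word, the number of sampled documents containing it."""
    df = Counter()
    for doc in sample_docs:
        df.update(set(doc.lower().split()))  # count each word once per document
    return {"size": len(sample_docs), "df": df}


def score(summary, query):
    """Estimated fraction of documents matching all query words, assuming word independence."""
    n = summary["size"]
    if n == 0:
        return 0.0
    result = 1.0
    for word in query.lower().split():
        result *= summary["df"][word] / n  # Counter yields 0 for unseen words
    return result


def select_databases(summaries, query, k=1):
    """Return the names of the k highest-scoring databases for the query."""
    ranked = sorted(summaries, key=lambda name: score(summaries[name], query), reverse=True)
    return ranked[:k]


# Hypothetical document samples extracted from two databases.
summaries = {
    "med": build_summary(["heart disease", "heart surgery", "cancer therapy"]),
    "sport": build_summary(["soccer match", "heart of the team"]),
}
best = select_databases(summaries, "heart surgery")
```

Here "surgery" never appears in the sports sample, so that database scores zero for the query and the medical database is selected. Zero scores caused by small samples are one reason summary accuracy matters for database selection.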
Modeling and Managing Content Changes in Text Databases
Proceedings of the 21st IEEE International Conference on Data Engineering (ICDE 2005), 2005
P. Ipeirotis, A. Ntoulas, J. Cho, and L. Gravano
When one Sample is not Enough: Improving Text Database Selection Using Shrinkage
Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, 2004
P. Ipeirotis and L. Gravano
QProber: A System for Automatic Classification of Hidden-Web Databases
ACM Transactions on Information Systems, vol. 21, no. 1, Jan. 2003
L. Gravano, P. Ipeirotis, and M. Sahami
Distributed Search over the Hidden-Web: Hierarchical Database Sampling and Selection
Proceedings of the 28th International Conference on Very Large Databases (VLDB 2002), 2002
P. Ipeirotis and L. Gravano
Probe, Count, and Classify: Categorizing Hidden-Web Databases
Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, 2001
P. Ipeirotis, L. Gravano, and M. Sahami
Query- vs. Crawling-based Classification of Searchable Web Databases
IEEE Data Engineering Bulletin, vol. 25, no. 1, March 2002
L. Gravano, P. Ipeirotis, and M. Sahami
PERSIVAL Demo: Categorizing Hidden-Web Resources
Proceedings of the First ACM+IEEE Joint Conference on Digital Libraries (JCDL 2001), 2001
P. Ipeirotis, L. Gravano, and M. Sahami
Automatic Classification of Text Databases through Query Probing
Proceedings of the ACM SIGMOD Workshop on the Web and Databases (WebDB'00); also in LNCS Series no. 1997, Springer, 2001
P. Ipeirotis, L. Gravano, and M. Sahami