|
Automatic Classification of Text Databases through Query Probing
P. Ipeirotis, L. Gravano, and M. Sahami |
|
Many text databases on the web are "hidden" behind search
interfaces, and their documents are only accessible through querying.
Search engines typically ignore the contents of such search-only
databases. Recently, Yahoo-like directories have started to manually
organize these databases into categories that users can browse to find
these valuable resources. We propose a novel strategy to automate the
classification of search-only text databases. Our technique starts by
training a rule-based document classifier, and then uses the
classifier's rules to generate probing queries. The queries are sent
to the text databases, which are then classified based on the number
of matches that they produce for each query. We report initial
exploratory experiments showing that our approach is a promising way to
automatically characterize the contents of text databases accessible
on the web. |
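
The probing idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The rule set `RULES`, the `ToyDatabase` class, and the `classify_database` function are hypothetical names introduced here, and the decision step (pick the category whose probes return the most matches in total) is an assumption, since the abstract does not say how match counts are combined.

```python
from typing import Callable, Dict, List, Optional, Tuple

Query = Tuple[str, ...]  # a probe is a conjunction of terms

# Hypothetical rules, standing in for those of a trained rule-based
# document classifier: each category maps to its probing queries.
RULES: Dict[str, List[Query]] = {
    "computers": [("mac",), ("linux", "kernel")],
    "health": [("cancer",), ("diabetes", "insulin")],
}


class ToyDatabase:
    """Stand-in for a search-only database that reports match counts."""

    def __init__(self, docs: List[str]) -> None:
        self._docs = [set(d.lower().split()) for d in docs]

    def count_matches(self, query: Query) -> int:
        # Count documents containing every term of the query.
        return sum(1 for d in self._docs if all(t in d for t in query))


def classify_database(
    count_matches: Callable[[Query], int],
    rules: Dict[str, List[Query]],
) -> Tuple[Optional[str], Dict[str, int]]:
    """Send every probe and total the match counts per category.

    Assumption: the database is assigned to the category with the
    largest total; None is returned when no probe matches anything.
    """
    totals = {
        category: sum(count_matches(q) for q in queries)
        for category, queries in rules.items()
    }
    if sum(totals.values()) == 0:
        return None, totals
    return max(totals, key=totals.get), totals


if __name__ == "__main__":
    db = ToyDatabase([
        "linux kernel patch released",
        "new mac laptop review",
        "cancer screening guidelines",
    ])
    print(classify_database(db.count_matches, RULES))
```

Only match counts are used here, never the documents themselves, which mirrors the setting in which a database is reachable solely through its search interface. A real system would replace `ToyDatabase.count_matches` with a call to that interface.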