Searching the Hidden Web


Donghui Xu
Spring 2011, COMS E6125
Prof. Gail Kaiser
• What is the hidden Web
• Two approaches to searching the hidden Web
◦ Browsing a Yahoo!-like Web directory
◦ Crawling the hidden Web
• Conclusion

The surface Web
◦ reachable via hyperlinks

The hidden Web
◦ no static hyperlink points to the page
◦ accessed via a query interface
◦ content is dynamically generated based on the submitted query
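To make this concrete: a hidden-Web page only comes into existence when a query is submitted through the site's search form. Below is a minimal sketch, assuming a hypothetical site whose form accepts a single GET parameter named q; real query interfaces differ in fields, HTTP method, and result markup.

# Minimal sketch: fetch a dynamically generated result page by submitting
# a keyword query to a (hypothetical) search form that accepts an HTTP GET
# parameter named "q".
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "http://example.org/search"  # hypothetical query interface

def fetch_result_page(keyword):
    """Submit one keyword query and return the dynamically generated HTML."""
    url = SEARCH_URL + "?" + urlencode({"q": keyword})
    with urlopen(url) as response:  # the page exists only in response to this query
        return response.read().decode("utf-8", errors="replace")

if __name__ == "__main__":
    html = fetch_result_page("climate data")
    print(len(html), "bytes of dynamically generated content")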

About 500 times larger than the surface Web
◦ The surface Web: 1 billion pages
◦ The hidden Web: over 550 billion pages

The sixty largest deep Web sites are collectively about 40 times larger than the surface Web.
The deep Web vs. the surface Web (from Bergman)
Name | URL | Web Size (GB)
National Climatic Data Center (NOAA) | http://www.ncdc.noaa.gov/ol/satellite/satelliteresources.html | 366,000
NASA EOSDIS | http://harp.gsfc.nasa.gov/~imswww/pub/imswelcome/plain.html | 219,600
National Oceanographic (combined with Geophysical) Data Center (NOAA) | http://www.nodc.noaa.gov/, http://www.ngdc.noaa.gov/ | 32,940
MP3.com | http://www.mp3.com/ | 4,300
US PTO - Trademarks + Patents | http://www.uspto.gov/tmdb/, http://www.uspto.gov/patft/ | 2,440
Informedia (Carnegie Mellon Univ.) | http://www.informedia.cs.cmu.edu/ | 1,830
UC Berkeley Digital Library Project | http://elib.cs.berkeley.edu/ | 766
US Census | http://factfinder.census.gov | 610
NCI CancerNet Database | http://cancernet.nci.nih.gov/ | 488
Amazon.com | http://www.amazon.com/ | 461
IBM Patent Center | http://www.patents.ibm.com/boolquery | 345
NASA Image Exchange | http://nix.nasa.gov/ | 337
Some of the largest hidden Web sites (from Bergman)


Browsing a Yahoo!-like Web directory
Crawling the hidden Web


Manually populate a Yahoo!-like directory
Classify collections of text databases into categories and subcategories (a toy sketch of one automated approach appears after this list)

Pros
◦ Intuitive
◦ Easy to use

Cons
◦ Labor intensive
the Yahoo! Directory contains about 200,000 categories, and there are millions of databases searchable online
◦ Accurate classification is not an easy task
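One way to reduce the manual labor is probe-based classification: issue a few category-specific keyword queries against a database and assign it to the category whose probes match the most documents. The toy sketch below only illustrates the idea; the probe lists are invented, and a real system would send the probes to the remote query interface rather than scan an in-memory document list.

# Toy sketch of probe-based classification: assign a text database to the
# category whose probe keywords match the most of its documents.
PROBES = {
    "Health": ["cancer", "diabetes", "vaccine"],
    "Patents": ["patent", "trademark", "claim"],
    "Music": ["album", "mp3", "concert"],
}

def count_matches(documents, keyword):
    """Number of documents containing the keyword (stand-in for the hit
    count a remote query interface would report)."""
    return sum(keyword in doc.lower() for doc in documents)

def classify(documents):
    """Assign the collection to the category whose probes hit most often."""
    scores = {cat: sum(count_matches(documents, kw) for kw in kws)
              for cat, kws in PROBES.items()}
    return max(scores, key=scores.get)

sample_db = ["New cancer vaccine trial results", "Diabetes treatment guidelines"]
print(classify(sample_db))  # expected: Health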

Main challenge in searching the hidden Web
◦ how to automatically generate meaningful queries to submit to a site's query interface

The query generation problem
◦ assume that a Web site contains a set of pages S
◦ each issued query q_i returns a subset S_i of S
◦ the task is to select a set of queries that returns the maximum number of unique pages in the database at minimum cost
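Stated a little more formally (one plausible reading of the slide's description, assuming a fixed query-and-download budget B), this is a budgeted maximum-coverage problem:

\[
\max_{q_1,\dots,q_n} \;\Bigl|\,\bigcup_{i=1}^{n} S_i \Bigr|
\quad \text{subject to} \quad \sum_{i=1}^{n} \mathrm{cost}(q_i) \le B,
\]

where S_i \subseteq S is the set of pages returned by query q_i.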



Random - select a query at random from a list of keywords (e.g., a random word from an English dictionary).
Generic Frequency - select the most frequent keywords from a generic document corpus.
Adaptive - select promising keywords from the documents downloaded in response to previously issued queries (a simplified sketch follows the comparison figures below).
Comparison of policies for dmoz (modified from Ntoulas et al.)
Comparison of policies for PubMed (modified from Ntoulas et al.)
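As a concrete illustration of the adaptive policy, the sketch below greedily picks the next query as the most frequent term, among everything downloaded so far, that has not yet been issued. It is a simplification of the policy in Ntoulas et al. (which also weighs estimated new pages against cost), and fetch_results is a hypothetical helper that submits a keyword to the site and returns the matching documents.

# Simplified sketch of the adaptive policy: after each query, choose the
# most frequent not-yet-issued term in the documents retrieved so far.
import re
from collections import Counter

def adaptive_crawl(seed_keyword, fetch_results, max_queries=50):
    downloaded = {}  # doc id -> document text
    issued = set()
    keyword = seed_keyword
    for _ in range(max_queries):
        issued.add(keyword)
        for doc_id, text in fetch_results(keyword):  # hypothetical helper
            downloaded.setdefault(doc_id, text)
        # Re-estimate term frequencies over everything downloaded so far.
        counts = Counter(
            term
            for text in downloaded.values()
            for term in re.findall(r"[a-z]+", text.lower())
        )
        next_terms = [t for t, _ in counts.most_common() if t not in issued]
        if not next_terms:
            break
        keyword = next_terms[0]  # most promising unseen keyword
    return downloaded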



The surface Web is only the tip of the iceberg
Beneath it lies an even vaster hidden Web
Two main approaches to accessing the hidden Web
◦ Yahoo!-like Web directory
◦ Crawling the hidden Web


Much work remains to be done.
Hidden Web search technology would enable us to connect different data sources and allow businesses to use data in new ways.







[1] "The Deep Web: Surfacing Hidden Value"Michael K. Bergman. .
The Journal of Electronic Publishing, August 2001
[2] "Exploring a 'Deep Web' That Google Can’t Grasp"Alex Wright. .
New York Times, February 3 2009
[3] S. Raghavan and H. Garcia-Molina. “Crawling the Hidden Web.” In
Proceedings of the International Conference on Very Large
Databases (VLDB), 2001.
[4] Panagiotis G. Ipeirotis, Alexandros Ntoulas, Junghoo Cho, Luis
Gravano "Modeling and Managing Content Changes in Text
Databases."ACM Transactions on Database Systems, 32(3): June
2007.
[5] Christopher D. Manning, Prabhakar Raghavan and Hinrich
Schütze, Introduction to Information Retrieval, Cambridge University
Press. 2008.
[6] Alexandros Ntoulas, Petros Zerfos, Junghoo Cho "Downloading
Textual Hidden Web Content by Keyword Queries" ,In Proceedings of
the Joint Conference on Digital Libraries (JCDL),June 2005
[7] J. P. Callan and M. E. Connell. Query-based sampling of text
databases. Information Systems, 97–130, 2001.
Thanks!