The “Deep Web”

ISC 110 Final Project
Kaila Ryan - 12/12/2013
What is the “Deep Web”?




Web content that is hidden behind an HTML form and generally cannot be
indexed by search engines (Madhavan et al., 2008).
Largely made up of web-connected databases (Wright, 2009).

Shopping catalogs

Scientific research data

Public transport information, etc.
Requires “valid input values” to access (Madhavan et al., 2008), in other
words, a query or a similar form of typed input.
Web crawlers are not yet sophisticated enough to automate the formulation
of relevant queries, so they cannot reach this data.
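To make the idea of “valid input values” concrete, here is a minimal sketch of a database that can only be read through a form. The timetable data and the name submit_form are hypothetical, invented for illustration; a real site would sit behind an HTML form and a server.

```python
# Hypothetical transit timetable that is reachable only through a search form.
TIMETABLE = {
    ("Albany", "Boston"): ["07:15", "12:40"],
    ("Albany", "Buffalo"): ["09:00"],
}


def submit_form(origin: str, destination: str) -> list[str]:
    """Return departure times for a valid (origin, destination) pair, else an empty list."""
    return TIMETABLE.get((origin, destination), [])


print(submit_form("Albany", "Boston"))  # a valid input value returns data
print(submit_form("", ""))              # an empty submission returns nothing
```

A crawler that only follows links never supplies the first kind of input, so the departure times stay out of its index.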
A bit about search engines...



Most modern search engines use automated “web crawler” programs
to index websites.
Crawlers follow a “trail” of links from webpage to webpage, indexing
each new page they find so that it becomes searchable, part of the
“surface web” (Wright, 2009).
Because of the very nature of how they function, traditional crawling
methods fail to index some documents, such as:

Databases, which require specific queries to access the
information contained in them

It is impossible (or at least inefficient and impractical) to try
every possible query on every database found.

Narrowing the possible queries down to relevant terminology
has proven challenging.
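The link-following behavior described above can be sketched as a breadth-first traversal. This is a simplified illustration, not how any particular search engine is implemented; the page names in LINKS are hypothetical, and real crawlers fetch pages over HTTP rather than reading a dictionary.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "home": ["about", "search-form"],
    "about": ["home"],
    "search-form": [],        # the form itself has no links to result pages
    "results?q=boston": [],   # exists, but is reachable only by submitting a query
}


def crawl(start: str) -> set[str]:
    """Breadth-first traversal that indexes every page reachable by links."""
    indexed = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in LINKS.get(page, []):
            if target not in indexed:
                indexed.add(target)
                queue.append(target)
    return indexed


surface_web = crawl("home")
print(sorted(surface_web))               # ['about', 'home', 'search-form']
print(sorted(set(LINKS) - surface_web))  # ['results?q=boston']
```

The results page is never indexed because no link points to it. Guessing queries blindly does not scale either: as an illustrative figure, a form with three free-text fields and a 10,000-word vocabulary admits 10,000^3 = 10^12 possible submissions.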
Finding the Deep Web:




No single, exhaustive method of locating this data is available yet.
Many competing theories and projects are working toward the creation of
functioning Deep Web crawlers and search engines.
Primary methods of locating Deep Web content at present:

Directories, like “The Hidden Wiki” (requires Tor browser)

Referral by current users of a particular
site/service/database
Many in the field of Information Science are focused on developing
technology capable of “surfacing” Deep Web content through new
methods of locating and querying databases and indexing the
results of these queries.

Google has a team dedicated specifically to this task
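A rough sketch of what “surfacing” might look like: submit a list of candidate queries to a form-backed database and index only the ones that return results. This is an assumption-laden toy, not a description of Google's system; CATALOG, submit_query, and surface are hypothetical names, and choosing good candidate queries is the hard part that this sketch skips.

```python
# Hypothetical form-backed catalog: keyword -> matching record titles.
CATALOG = {
    "bicycle": ["Road bicycle, 54 cm", "Bicycle repair kit"],
    "helmet": ["Cycling helmet, medium"],
}


def submit_query(keyword: str) -> list[str]:
    """Stand-in for submitting one keyword to the site's search form."""
    return CATALOG.get(keyword, [])


def surface(candidates: list[str]) -> dict[str, list[str]]:
    """Submit each candidate keyword and index only the queries that return results."""
    index = {}
    for keyword in candidates:
        results = submit_query(keyword)
        if results:
            index[keyword] = results
    return index


index = surface(["bicycle", "zebra", "helmet"])
print(sorted(index))  # ['bicycle', 'helmet']
```

Queries that return nothing, such as "zebra" here, are discarded, so the index grows only with content that actually exists behind the form.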
The Deep Web's value:


You may be asking yourself, “Why should we bother surfacing the 'Deep
Web'? What is it worth to us?”
The ability to automate database querying and indexing opens up the
potential for automated cross-referencing of otherwise unconnected databases.

Invaluable to the field of medical and scientific research.

Important step in the movement toward a semantic web.

Could potentially be used to search for answers to complex
questions, for which all of the information is available, but is
either not unified, or not easily accessible (“What is the cheapest
way to get from X to Y at 9am on a Sunday?”)

In general, ability to discover a wealth of knowledge that is
already freely available, but hidden: up to 96% of the Web may
be considered the Deep Web.
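The “cheapest way to get from X to Y” question above can be sketched as a cross-reference over two otherwise unconnected databases. All records, fares, and names here (BUS_DB, TRAIN_DB, cheapest) are hypothetical and exist only to illustrate the idea.

```python
# Hypothetical records from two unconnected databases.
BUS_DB = [
    {"origin": "X", "destination": "Y", "day": "Sunday", "departs": "09:00", "fare": 12.50},
    {"origin": "X", "destination": "Y", "day": "Monday", "departs": "09:00", "fare": 9.00},
]
TRAIN_DB = [
    {"origin": "X", "destination": "Y", "day": "Sunday", "departs": "09:00", "fare": 18.00},
]


def cheapest(origin, destination, day, departs):
    """Merge both sources and return (mode, fare) for the cheapest matching trip, or None."""
    matches = [
        (mode, trip["fare"])
        for mode, db in (("bus", BUS_DB), ("train", TRAIN_DB))
        for trip in db
        if (trip["origin"], trip["destination"], trip["day"], trip["departs"])
        == (origin, destination, day, departs)
    ]
    return min(matches, key=lambda match: match[1], default=None)


print(cheapest("X", "Y", "Sunday", "09:00"))  # ('bus', 12.5)
```

Neither database can answer the question alone; the answer appears only once both have been queried and their results combined.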
Sources




Bergman, M. K. (2001, Sept 24). The deep web: Surfacing hidden value.
Deep Content, Retrieved from
http://grids.ucs.indiana.edu/courses/xinformatics/searchindik/deepwebwhitepaper.pdf
He, B., Patel, M., Zhang, Z., & Chang, K. C.-C. (2007). Accessing the deep
web. Communications of the ACM, 50(5), 94-101. DOI: 10.1145/1230819.1241670.
Retrieved from http://doi.acm.org/10.1145/1230819.1241670
Madhavan, J., Ko, D., Kot, Ł., Ganapathy, V., Rasmussen, A., & Halevy, A.
(2008). Google's Deep Web crawl. Proceedings of the VLDB Endowment, 1(2),
1241-1252.
Wright, A. (2009, Feb 23). Exploring a 'deep web' that google can't grasp.
The New York Times. Retrieved from
http://cob.jmu.edu/williamson/mktg470/reading/search/2009/Exploring a
‘Deep Web’ That Google Can’t Grasp.pdf