Slides for 10/25 -- The Deep Web


The Invisible/Deep Web
The Private Web
Sites excluded by a Webmaster
Password protection
“noindex” meta tag
Robots Exclusion Protocol
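The exclusion mechanisms above can be illustrated concretely. A webmaster keeps a directory out of search engines either with a robots.txt file at the site root or with a per-page meta tag (the `/private/` path is a hypothetical example):

```text
# robots.txt — tells compliant crawlers to skip a directory
User-agent: *
Disallow: /private/
```

```html
<!-- per-page alternative: placed in the page's <head> -->
<meta name="robots" content="noindex">
```

Note the difference: robots.txt asks crawlers not to fetch the pages at all, while the noindex meta tag lets them fetch but not index.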
The Opaque Web
Files not indexed by search engines because of:
Depth of crawl
Frequency of crawl
Disconnected URLs
Sherman and Price (2001)
The Proprietary Web
Registered Sites
e.g., New York Times
Fee-based Sites
e.g., Wall Street Journal
Sherman and Price (2001)
Truly Invisible Web
Technical Reasons
Crawlers cannot handle newer file formats
Dynamically generated information
Content in relational databases (e.g., census data)
Sherman and Price (2001)
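Database-backed content is "truly invisible" because no static page exists for a crawler to fetch; each results page is generated only when a user submits a query. A minimal sketch of this pattern, using an in-memory SQLite table with made-up city data (all names and numbers are illustrative, not real census figures):

```python
import sqlite3

# Hypothetical census-style table; on a real deep-web site this sits
# behind a search form, and results pages are built per query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE census (city TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO census VALUES (?, ?)",
    [("Springfield", 116313), ("Shelbyville", 48924)],  # illustrative data
)

def results_page(city):
    """Generate an HTML results page on demand.

    Until someone asks for this city, no URL or file for it exists,
    so a crawler following links never encounters the content.
    """
    row = conn.execute(
        "SELECT population FROM census WHERE city = ?", (city,)
    ).fetchone()
    if row is None:
        return "<p>No results.</p>"
    return f"<p>{city}: population {row[0]}</p>"

print(results_page("Springfield"))
```

The crawler's problem is not the database itself but the form in front of it: a link-following crawler has no way to guess which queries to submit.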
Dynamically Generated Content
stock quotes
airline flights
weather
phone directories
library catalogs
people finders
dictionary definitions
online store products (e.g., eBay)
census data
patents
news
Why the Web Is “Invisible” to Search Engines
Password-protected sites
“noindex” meta tag
Robots Exclusion Protocol
Sites that require forms to be filled out
Dynamically generated content
New file formats
Newly added Web pages
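How a well-behaved crawler honors the Robots Exclusion Protocol can be sketched with Python's standard-library parser (the `example.com` URLs and `/private/` rule are illustrative):

```python
from urllib import robotparser

# Parse a robots.txt policy; normally this is fetched from the site root.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A compliant crawler checks each URL before fetching it.
print(rp.can_fetch("*", "https://example.com/private/report.html"))  # False
print(rp.can_fetch("*", "https://example.com/public/index.html"))    # True
```

Anything under the disallowed path is simply never fetched, so its content stays out of the index regardless of how valuable it is.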
Finding the Invisible/Deep Web
Try these search terms:
webliography, libguides, database, research guides, resource guides
Use Invisible Web directories.
Explore library websites
– Pathfinders, Research Guides, etc.
Chris Sherman
Gary Price