The Invisible Web - ALIA conferences

The Invisible Web
Chris Sherman
Editor, SearchDay
SearchEngineWatch.com
Information Online 2003
Sydney, Australia January 23, 2003
Overview
• How Search Engines Work
• What is the Invisible Web?
• Tactics for Searching the Invisible Web
• Future Trends
The Parts of a Search Engine
• Three main parts of every search engine:
– The Crawler (aka spider)
– The Indexer
– The Search Engine Database
How Search Engines Work
[Diagram: the crawler downloads pages (URL1 to URL4) from the Web and passes them to the indexer, which builds the search engine database. A query for "Eggs?" is answered from the database with ranked results scored 90%, 81%, 40%, and 10%.]
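As a rough illustration of the indexer and database stages, here is a minimal Python sketch of an inverted index. The page texts, the function names, and the "match every query word" rule are assumptions made for the example; it does not compute relevance scores like the percentages in the diagram.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word in the query, sorted."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

# Hypothetical page texts, standing in for what a crawler downloaded.
pages = {
    "URL1": "All about eggs",
    "URL2": "Eggo waffles",
    "URL3": "Your eggs, cooked your way",
    "URL4": "EgoBrowser release notes",
}
index = build_index(pages)
print(search(index, "Eggs?"))  # ['URL1', 'URL3']
```

The query never touches the live Web: it is answered entirely from the index built earlier, which is why content the crawler never fetched cannot appear in results.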
How Crawlers Work
• Crawlers are like hypercaffeinated browsers
• Seeded with a set of URLs
• Download Web pages, then:
– Extract all links on every page for further crawling
– Hand the page off to the indexer
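The crawl loop described above can be sketched in a few lines of Python. This is an illustration only: the `crawl` function, its parameters, and the in-memory `FAKE_WEB` dictionary are hypothetical stand-ins so the example runs without a network, and a production crawler would add politeness delays, robots.txt handling, and error recovery.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, fetch, index_page, max_pages=100):
    """Breadth-first crawl: download, extract links, hand off to the indexer."""
    queue = deque(seeds)
    seen = set(seeds)
    crawled = []
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:       # page could not be downloaded
            continue
        crawled.append(url)
        index_page(url, html)  # hand the page off to the indexer
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return crawled

# A hypothetical four-page "web" held in memory.
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a>',
    "http://example.com/b.html": "no links here",
    "http://example.com/orphan.html": "nothing links to this page",
}
indexed = {}
order = crawl(["http://example.com/"], FAKE_WEB.get, indexed.__setitem__)
print(order)
```

Note that `orphan.html` is never visited: no crawled page links to it, so a link-following crawler has no way to discover it.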
The Bow Tie Model
• 30% in the core
• 24% origination pages
• 24% termination pages
• 22% disconnected pages: these are effectively invisible to search engines
Source: IBM
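To make the four categories concrete, the following Python sketch labels pages in a tiny, made-up link graph by reachability to and from a page known to be in the core. It is a simplification for illustration, not the method used in the cited study; the graph, the page names, and the `bow_tie` function are all hypothetical.

```python
from collections import deque

def reachable(graph, start):
    """Return every node reachable from start by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def reverse(graph):
    """Flip every link so that reachability runs backwards."""
    flipped = {node: [] for node in graph}
    for node, targets in graph.items():
        for target in targets:
            flipped.setdefault(target, []).append(node)
    return flipped

def bow_tie(graph, core_page):
    """Label each page relative to the core that contains core_page."""
    downstream = reachable(graph, core_page)          # core can reach these
    upstream = reachable(reverse(graph), core_page)   # these can reach core
    labels = {}
    for page in graph:
        if page in downstream and page in upstream:
            labels[page] = "core"
        elif page in upstream:
            labels[page] = "origination"
        elif page in downstream:
            labels[page] = "termination"
        else:
            labels[page] = "disconnected"
    return labels

# Hypothetical link graph: page -> pages it links to.
graph = {
    "hub1": ["hub2"],
    "hub2": ["hub1", "leaf"],
    "newsite": ["hub1"],
    "leaf": [],
    "island": [],
}
labels = bow_tie(graph, "hub1")
print(labels)
```

A crawler seeded inside the core reaches the core and termination pages by following links; origination pages are found only if they are seeded directly, and disconnected pages such as `island` are not found at all.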
What is the Invisible Web?
• “Stuff” that search engine crawlers (spiders) cannot, or will not, add to their databases
• 2 to 50 times larger than the visible Web
• Resources often of much higher quality than those on the visible Web
What is the Invisible Web?
• Certain file formats (PDF, Flash, Office files, streaming media)
– Why? They aren’t HTML text
• Most real-time data (stock quotes, weather, airline flight info)
– Why? Ephemeral & storage intensive
What is the Invisible Web?
• Dynamically generated pages (CGI, JavaScript, ASP, or most pages with “?” in the URL)
– Why? Spider traps
• Web accessible databases
– Why? Spiders can’t type
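The reasons listed on this slide and the previous one can be turned into a rough URL heuristic. The Python sketch below is an assumption-laden illustration: the extension sets, the `/cgi-bin/` check, and the `crawl_obstacles` function are invented for the example and do not describe any particular search engine's rules.

```python
import posixpath
from urllib.parse import urlsplit

# Extensions chosen to match the slides' examples; the exact sets are an assumption.
NON_HTML_EXTENSIONS = {".pdf", ".swf", ".doc", ".xls", ".ppt"}
DYNAMIC_EXTENSIONS = {".cgi", ".asp"}

def crawl_obstacles(url):
    """Return the reasons a crawler of this era might skip the URL."""
    parts = urlsplit(url)
    ext = posixpath.splitext(parts.path)[1].lower()
    reasons = []
    if ext in NON_HTML_EXTENSIONS:
        reasons.append("non-HTML file format")
    if parts.query:
        reasons.append('"?" in URL')
    if ext in DYNAMIC_EXTENSIONS or "/cgi-bin/" in parts.path:
        reasons.append("dynamically generated page")
    return reasons

print(crawl_obstacles("http://example.com/report.pdf"))
print(crawl_obstacles("http://example.com/search.asp?q=eggs"))
print(crawl_obstacles("http://example.com/index.html"))
```

An empty list means only that the URL shows none of these warning signs; the page could still be invisible for other reasons, for example because nothing links to it.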
The Opaque Web
• Visible pages “hidden” behind dynamic navigation codes
• Mostly graphic, non-text pages
• “Disconnected” pages
The URL Test
Invisible Web Searching: Core Tactics
• The first step in determining the best approach for searching the Invisible Web is to have a clear idea of what you’re seeking.
• Limit your search to appropriate tools for the particular type of information you’re looking for.
Use Invisible Web Pathfinders
• Intelliseek
– http://www.invisibleweb.com
• Invisible-web.net
– http://www.invisible-web.net/
• Librarians’ Index to the Internet
– http://www.lii.org
Finding Non-HTML File Formats
• Google & AlltheWeb: use the filetype operator
– filetype:pdf
– filetype:doc
• Use specialized engines
– searchpdf.adobe.com
– Research Index
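For anyone scripting such searches, a small Python helper can append the filetype restriction and encode the query for use in a URL. The function name `with_filetype` is hypothetical, and no particular search engine's URL format is assumed; only the `filetype:` syntax shown above is used.

```python
from urllib.parse import urlencode

def with_filetype(query, extension):
    """Append a filetype: restriction to a search query string."""
    extension = extension.lower().lstrip(".")
    return f"{query.strip()} filetype:{extension}"

query = with_filetype("invisible web", ".PDF")
print(query)                    # invisible web filetype:pdf
print(urlencode({"q": query}))  # q=invisible+web+filetype%3Apdf
```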
Finding Real Time Information
• Weather Underground
• Google News Search
• Yahoo Finance
• J-Track Spacecraft Tracker
Finding Images
• Google/FAST/AltaVista Image Search
• Google Catalogs
• Visoo
• Webseek @ Columbia
Finding Streaming Media Files
• Speechbot
• Singingfish
• MSN Music
• British Pathe
• WindowsMedia.com (v.9 player)
Future Trends: The Invisible Web Revealed
• More “difficult” content indexed
– Flash, dynamic content
• “Data centric” search engines
– ResearchIndex
• Agent-brokered database search
• Form crawlers
Conclusion
• Searching the Invisible Web isn’t hard. It just takes a different mindset.
• It’s crucial to develop your own personal collection of resources.
• Expect the unexpected: the boundary between visible and invisible is changing as we speak.
More Info
CyberAge Books, ISBN 0-910965-51-X
http://www.invisible-web.net
More Ranting
• SearchDay Newsletter
– http://searchenginewatch.com/searchday/
• Searchwise
– http://www.searchwise.net