The Invisible Web

The Invisible Web
Definition
Searching
The Invisible Web
Also called:
 deep content
 hidden internet
 dark matter
The Invisible Web
The vast number of pages that search engines cannot or will not
index:
 Restricted: requires a login or password (such as intranets and
private or proprietary databases)
 Sites not linked from anywhere (undiscovered)
 Sites that use a robots.txt file to keep files off limits from spiders
 Unsearchable or un-indexable file formats
 Non-static - searchable databases that only produce results
dynamically in response to a specific search request (such as CGI,
ASP, CFM)
 Real-time data – changes rapidly – too “fresh”
 Sites that are too “deep”
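As a rough illustration of the robots.txt case above, the sketch below uses Python's standard urllib.robotparser to check whether a spider may fetch a page. The robots.txt rules, the paths, and the spider name "ExampleSpider" are made up for this example and do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the disallowed paths are invented.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /reports/
"""


def allowed(url, user_agent="ExampleSpider"):
    """Return True if the hypothetical rules let this spider fetch the URL."""
    parser = RobotFileParser()
    # Parse the rules from memory instead of downloading them.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed("http://www.website.com/index.html"))           # True
print(allowed("http://www.website.com/private/budget.html"))  # False
```

A well-behaved spider skips the second page, so it never reaches the index even though the page is publicly reachable.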
The Invisible Web
Search engines often avoid indexing web pages that are
delivered dynamically, such as via database programs:
 Often, the search engine cannot handle the URL used to
retrieve the document. Many dynamic delivery
mechanisms make use of the ? symbol.
 For example, a page may be found this way:
http://www.website.com/cgi-bin/getpage.cgi?name=sitemap
 Most search engines will not read past the ? in that URL.
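To make the ? problem concrete, here is a minimal sketch using Python's standard urllib.parse. The helper names is_dynamic and visible_part are hypothetical, and the "stop at the ?" behavior is an assumption that models the claim on this slide, not the documented behavior of any particular search engine.

```python
from urllib.parse import urlsplit, parse_qs


def is_dynamic(url):
    """Return True if the URL carries a query string (the part after '?')."""
    return bool(urlsplit(url).query)


def visible_part(url):
    """Return what a spider would keep if it stopped reading at the '?'."""
    return url.split("?", 1)[0]


url = "http://www.website.com/cgi-bin/getpage.cgi?name=sitemap"
parts = urlsplit(url)

print(parts.path)             # /cgi-bin/getpage.cgi
print(parse_qs(parts.query))  # {'name': ['sitemap']}
print(is_dynamic(url))        # True
print(visible_part(url))      # http://www.website.com/cgi-bin/getpage.cgi
```

Without the name=sitemap parameter, the spider sees only the script address, not the page the script would have produced.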

The Invisible Web
Invisible Web sources tend to be:
 More current
 More comprehensive
 Searchable (though not by search engines)
 More specific/targeted
 Greater depth and breadth
 Often better quality
The Invisible Web
Top types of “invisible” information:
 News
 RSS
 Blogs
 Public company filings, stock prices
 Customized maps and directions
 Clinical trials
 Telephone numbers and addresses, postal codes
 Definitions
 Job postings
 Grant information
 Statistics
 Weather
 Museum, gallery, and library holdings
Finding the “Dark Matter”
 Search Engines
 Specialized Search Engines
 Directories
 Vortals

Traditional Search Engines
Traditional search engines incorporate
some “invisible” databases:
 Weather
 Maps
 Phone directories
 Catalogs
 Stock prices
Traditional Search Engines
Unless specially programmed, though,
spiders can’t find all the valuable
resources available
Specialized Search Engines
Search deeper into sites:
 Go beyond top page, or homepage
 Choose sources to spider—topical sites
only
 “Smart” ranking and indexing based on
knowledge of the specific subject
Specialized Search Engines
There are hundreds of specialized search
engines for almost every topic:
 Search Engine Guide
 Specialty Search Engines
Directories
Collections of pre-screened websites organized into
categories based on a controlled ontology
 Ontology: classification of human
knowledge into topics, similar to traditional
library catalogs

Directories
 Closed Model: paid editors; quality control
(LookSmart, Yahoo)
 Open Model: volunteer editors (Open
Directory Project, Google)

Directories
 Easier access to relevant results
 Faster
 Access to materials not always indexed by
search engines—content in databases or
file types not searched by spiders

Directories
Issues with directories:
 Inherently small
 Unseen editorial policies
 May charge for listing
 Lopsided coverage
 Timeliness: harder to keep updated
Vortals
 Vortal: vertical portal. Instead of being a
horizontal, all-inclusive entry point into the Web,
vortals are vertical, specialized entry points.
 Comprehensive sites that focus on gathering and
providing links to the best resources on a specific
topic
 Usually combine subject-specific search
engines and subject-specific directories
 Also called “focused crawlers”, metasites, guru sites,
authority sites, industry guides, or subject directory sites
Vortals
Advantages – best of directories and
subject-specific search engines
 More up-to-date - crawl subject-specific
pages more often
 Deeper crawl - gets more of the content on
each server
 Higher precision, lower recall
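The precision and recall trade-off can be shown with a small calculation. Precision is the share of retrieved results that are relevant; recall is the share of all relevant documents that were retrieved. The document IDs and result sets below are invented purely for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document IDs."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up data: four documents are truly relevant to the query.
relevant = {"a", "b", "c", "d"}

# A general engine returns many results, several of them off-topic.
general = {"a", "b", "c", "v", "w", "x", "y", "z"}

# A vortal returns fewer results, all of them on-topic.
vortal = {"a", "b"}

print(precision_recall(general, relevant))  # (0.375, 0.75)
print(precision_recall(vortal, relevant))   # (1.0, 0.5)
```

In this toy case the vortal misses two relevant documents but returns nothing irrelevant, which is the pattern the last bullet describes.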
Searching the Invisible Web
How do you find these sites?
 Use known directories to find
invisible web searching and browsing
tools:
 Librarians’ Index to the Internet
 Open Directory
 Google Directory
 Teoma works well, too.
Searching the Invisible Web
Rethink your search:
 Think key terms versus specific details (macro vs. micro)
 Example: you want to find the melting point of hydrogen
peroxide. On the general web, you’d enter the keywords
melting, point, and “hydrogen peroxide”. On the
invisible web, you’d look for a chemical database that
includes melting points as one of its fields. Once in
the database, you’d search for hydrogen peroxide.
Searching the Invisible Web
Remember that some concepts are assumed
 Do not use the subject as a search term
 Example: If you are looking for information
on gender inequity in math education,
exclude terms like “education” from your
search in AskERIC, an education-specific
search tool
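The same idea as a minimal sketch: before sending a query to a subject-specific tool, drop the words the tool already assumes. The function strip_assumed_terms and the assumed-term set are hypothetical and are not part of AskERIC or any other real tool.

```python
def strip_assumed_terms(query, assumed):
    """Remove words that the subject-specific tool already takes for granted."""
    kept = [word for word in query.split() if word.lower() not in assumed]
    return " ".join(kept)


# Assumption: an education-specific tool treats "education" as implied.
assumed_by_tool = {"education"}

print(strip_assumed_terms("gender inequity math education", assumed_by_tool))
# gender inequity math
```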
Mining the Invisible Web
Tips: Certain kinds of sites can prove to be
clearinghouses of information:
 Government - statistics of all kinds
 Professional organizations - archives of relevant
research and statistics
 Media sites (TV and radio) - transcripts and
speeches
 College and university professor sites - lectures
and personal publications
Mining the Invisible Web
 Look for library guides and commercial
portals for more guidance in finding the
hidden, valuable content available for free
on the Web (more on this in the next
lesson):
 My Ready Reference on the Web Resource