
Introductory Survey of
Internet Search Services
Michael Hunter
Reference Librarian
Hobart and William Smith Colleges
for Rochester Regional Library Council
Member Libraries’ Staff
Sponsored by the
Rochester Regional Library Council
Supported by Library Services and Technology Act (LSTA) and/or Regional
Bibliographic Databases and Resources Sharing (RBDB) funds granted by the
New York State Library
2000
What is a search engine?
• A searchable database of resources extracted from
the Internet by computer-generated search and
retrieval processes.
– Updated frequently
– Search features vary among engines
– Results of searches are ranked for “relevance” as predicted by automated ranking algorithms.
Search Engines and Subject Directories in
2000: Genres in Flux
• What types are available today? (a spectrum from automatic to human-compiled)
– “Pure” crawler
– Crawler “plus”
– Specialized crawler
– Peer-reviewed
Search Engines in 2000
• “Pure” crawler-based
– Google
– Fast
• Crawler “plus”
– Subject directory (HotBot, Lycos, Excite, AltaVista, Infoseek, WebCrawler)
– Special collection (Northern Light)
– Pre-programmed answers: Ask Jeeves (also used by AltaVista)
Search Engines in 2000
• Specialized (chiefly crawler-based)
– SearchEdu.com
• Specialized (crawler/human compiled)
– Scicentral.com metasite
• Peer reviewed
– Hippias - Philosophy http://hippias.evansville.edu/
– Argos - Classics and ancient history http://argos.evansville.edu/
How large is a search engine?
• Typical personal computer - 64 MB RAM
• “General” search engines - 4,000 MB RAM
(and more)
• Database - 1,000 GB of storage
[Diagram 1: a crawler (CR) harvests pages from many web servers (WS) into the search engine’s database.]
[Diagram 2: Users 1–7 send queries to the search engine, which answers from the database built by the crawler. Legend: CR = Crawler, WS = Web Server.]
Crawling the Web
The Big Picture ...
• Crawlers
– download a page
– extract links to other web pages
– index words from the page
– “crawl” the extracted links and continue the cycle (a minimal sketch follows below)
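To make the cycle concrete, here is a minimal sketch in Python. The seed URL and the regex-based link extraction are purely illustrative; a real crawler adds politeness delays, robust parsing, and thousands of parallel fetchers.

    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen
    import re

    def crawl(seed_url, max_pages=10):
        """Toy crawler: download a page, index its words, queue its links, repeat."""
        queue, seen, index = deque([seed_url]), {seed_url}, {}
        while queue and len(index) < max_pages:
            url = queue.popleft()
            try:
                html = urlopen(url).read().decode("utf-8", errors="ignore")
            except OSError:
                continue                      # unreachable page: skip it
            # "Index" the page: record the words found at this URL.
            index[url] = set(re.findall(r"[a-z]+", html.lower()))
            # Extract links and queue any we have not seen yet.
            for href in re.findall(r'href="([^"]+)"', html):
                link = urljoin(url, href)
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
        return index

    index = crawl("http://www.example.com/")  # hypothetical seed page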
Crawling the Web
The Detailed View ...
• While downloading one page the crawlers
simultaneously …
– check for the next page to download in the “queue”
– check for any “robots exclusion” files that prohibit downloading of pages from a web server (see the sketch after this list)
– download the whole page
– extract all links from the page and add them to the
“queue”
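The “robots exclusion” convention is standard enough that Python ships a parser for it. A sketch of the check a polite crawler makes before each download; the server and crawler name are hypothetical:

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")  # hypothetical server
    rp.read()                                        # fetch and parse its robots.txt

    # Download the page only if the site's rules permit it for this crawler.
    if rp.can_fetch("MyCrawler", "http://www.example.com/private/page.html"):
        print("allowed to download")
    else:
        print("excluded by robots.txt")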
Crawling the Web
The Detailed View ...
– Index contents (extract all words and save them to a database associated with the page’s URL; also save the order of the words to allow for phrase searching; see the toy index below)
– Optionally filter for adult content, language of
document, other criteria
– Save (or make) summary of the page
– Record the date downloaded for future reference in
scheduling re-visits to the site
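Saving word order is what makes phrase searching work later on. A toy positional index in Python, with invented sample pages; real engines compress these position lists heavily:

    def positional_index(docs):
        """Map each word to {url: [positions]} so phrases can be matched later."""
        index = {}
        for url, text in docs.items():
            for pos, word in enumerate(text.lower().split()):
                index.setdefault(word, {}).setdefault(url, []).append(pos)
        return index

    def phrase_search(index, phrase):
        """A page matches only if the words occur at consecutive positions."""
        words = phrase.lower().split()
        hits = []
        for url, positions in index.get(words[0], {}).items():
            for p in positions:
                if all(p + i in index.get(w, {}).get(url, [])
                       for i, w in enumerate(words[1:], start=1)):
                    hits.append(url)
                    break
        return hits

    docs = {"a.html": "plate tectonics in northern California",
            "b.html": "tectonics of plate boundaries"}
    print(phrase_search(positional_index(docs), "plate tectonics"))  # ['a.html']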
Scale?
• One page at a time?
– Covering the Internet would take several years
• Instead …
– Thousands of pages are processed simultaneously
by multiple crawlers (Google has ca. 4,000)
Performance?
• What about maintenance down-time?
– Services have duplicate machines so no
interruptions occur during maintenance
• Why are interface changes so rare?
– Updating software on complex systems is
expensive
– Usually slows service down, or stops it
completely
Performance?
• If I execute the same search in the same
engine several times in succession I get
different results. Why?
– Query is run against multiple machines in
parallel
– Ranking may be performed on a limited subset of the hits (i.e., those returned first) rather than the entire set of results.
Why do search engines exist?
• To make money!!!
– Advertising
– Banner ads
– Allied services
– Pay-for-placement in search results
– Many other commercial endeavors
In pursuit of user loyalty . . .
• Advertisers want “stickiness,” i.e., users that return often and stay at length
• “Stickiness” drives design
– Portalization: “One-stop access for all your Internet needs”
– Speed
– Freshness
– Relevance of results
– Value-added search features such as
customization (My Yahoo, etc.)
How Search Engines Differ . . .
• Content
• Update frequency (“freshness”)
• Ways you can search
• Ways results are presented to you
Breadth of Content
• How much of the “geographic” Internet is
searched and to what degree?
• What types of files are included?
– Web sites
– Usenet News
– Software
– Image/Video/Audio
– Multimedia
– FTP
Depth of Content
• How much of a given site has been downloaded?
– URL?
– Title?
– First heading?
– First 200 words?
– Full text?
– Full text and some of the documents linked to?
– Full text and all of the documents linked to?
– Full text and documents that are linking to this one?
Update frequency
• When was the content last refreshed or
rebuilt from direct searching of the Internet?
Ways you can search
• Boolean operators
– Requiring, combining or excluding words or phrases (set operations, sketched after this list)
• Searching for a phrase
• Searching by word stem (truncation)
• Searching by location in the document (field
searching)
• Searching by date
• Searching by media
• Searching by language
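Under the hood, these Boolean operators reduce to set operations over an inverted index. A simplified sketch with a made-up index (word-level AND/OR/NOT only, no nested expressions):

    # Toy inverted index: word -> set of page URLs containing it.
    index = {
        "plate":     {"a.html", "c.html"},
        "tectonics": {"a.html", "b.html"},
        "geology":   {"b.html", "c.html"},
    }

    def AND(*words):   # every word must appear (requiring)
        return set.intersection(*(index.get(w, set()) for w in words))

    def OR(*words):    # any word may appear (combining)
        return set().union(*(index.get(w, set()) for w in words))

    def NOT(docs, word):  # drop pages containing the word (excluding)
        return docs - index.get(word, set())

    print(AND("plate", "tectonics"))                  # {'a.html'}
    print(NOT(OR("plate", "geology"), "tectonics"))   # {'c.html'}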
Ways results are presented to you
Relevance Prediction
• Based on
– Text on the page
– Factors external to the page
Relevance Prediction
Text on the page
• Based on
– Word frequency profiles (sketched below)
• “More like this”
• “Suggested similar sites”
– Relational clustering
• Northern Light’s “Custom Folders”
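A hedged sketch of word-frequency ranking: score each page by how often the query words appear relative to its length. Real frequency profiles also weight titles, headings, and word proximity; the sample pages are invented.

    from collections import Counter

    def tf_score(text, query_terms):
        """Term-frequency score: query-word hits divided by page length."""
        words = text.lower().split()
        counts = Counter(words)
        return sum(counts[t] for t in query_terms) / max(len(words), 1)

    pages = {
        "a.html": "plate tectonics plate boundaries and plate motion",
        "b.html": "an essay that mentions tectonics once in passing",
    }
    query = ["plate", "tectonics"]
    ranked = sorted(pages, key=lambda u: tf_score(pages[u], query), reverse=True)
    print(ranked)   # a.html first: the query words dominate its text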
Relevance Prediction
Problems with text on the page ranking
• Designed for “text-heavy” pages; “design-heavy” pages may be ranked lower as a result
• No added weight possible for evaluated,
rated or reviewed sites
• Ill-suited for a web that grows so rapidly
Relevance Prediction
Factors external to the page
• Link popularity
– Sites with more links pointing to them are ranked higher (counted in the sketch below)
• Click popularity
– Sites visited more often and longer ranked higher
(Direct Hit’s knowledge base of users’ click paths)
• “Sector” popularity
– Tracking demographic or social groups’ click paths
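Link popularity in its simplest form just counts inbound links. A sketch over a made-up link graph; Google’s PageRank goes further and also weights each link by the rank of the page casting it:

    from collections import Counter

    # Hypothetical link graph: page -> pages it links to.
    links = {
        "a.html": ["b.html", "c.html"],
        "b.html": ["c.html"],
        "d.html": ["c.html", "b.html"],
    }

    # Count inbound links: pages pointed to more often rank higher.
    inbound = Counter(target for targets in links.values() for target in targets)
    for page, n in inbound.most_common():
        print(page, n)   # c.html (3 inbound links) ranks above b.html (2)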
Relevance Prediction
Factors external to the page
• Pre-packaged human-generated questions
with answers (Ask Jeeves)
• Business alliances among services
• Editorial partnerships
• Pay-for-placement options (GoTo)
Relevance Prediction
Factors external to the page
• Advantages
– Helps focus and limit results for popular, common
queries
– Human-generated criteria improve quality of
results
• Disadvantages
– Increases the “invisible layer” between the
searcher and the results
• “How did I get these results?”
• “Who is controlling the search process?”
– Privacy issues around tracking users’ click paths
What is a subject directory?
• A human-generated listing of resources
usually classified and hierarchically
arranged by subject category, often
containing descriptions of the resources
included.
Subject Directories
• Ways directories differ from search engines
–
–
–
–
Sites are examined and cataloged by a human being
Descriptions of the sites are often included
Generally fewer ways of searching
Generally not updated as frequently
Types of Subject Directories
• “My favorite links”
– Personal homepages
• Subject-focused sites with “related links”
– The Cervantes Home Page
• Subject-focused metasites
– Scicentral http://sciquest.com
– Sections of the WWW Virtual Library http://wwwvl.org
• General “comprehensive” directories
– Yahoo, Snap, Excite
Important aspects of subject
directories
• Authorship/sponsorship
• Intended audience
• Update frequency
How are users faring?
NPD User Study April, 2000
• 40,000 respondents chosen randomly
• Fielded October–November 1999
• Conducted by NPD New Media Services “on
behalf of 13 major search services”
• Summary at
http://searchenginewatch.com/reports/npd.html
• See http://www.npd.com for more information
Search Engine or Subject Directory—
Which one do I use?
• Portalization has blurred the distinctions; however:
• Use a search engine for
– Narrowly defined topics: “Plate tectonics in northern California”
– Up-to-date news and research
– Occurrences of a name or phrase
Search Engine or Subject Directory—
Which one do I use?
• Use a subject directory for
– Broadly defined topics: “geophysical research”
– Subject-specific gateways or “vortals”
• websites
• discussion groups
• media files
– “A few good sites”
– General browsing
Improving search strategy
• Little overlap in coverage among engines (see Greg Notess at http://searchengineshowdown.com)
• Even the largest engines cover no more than 20–25% of the Internet
• Therefore use two or more engines you know and trust to ensure a wider range of results
Improving search strategy
• Know the advanced features of your
favorite engine(s) and use them.
• Use unique identifiers or keywords
• Use phrase searching when possible
• Restrict search to title or other fields
• Incorporate date searching when available
• Use the “Find in page” function to locate
your search term(s) quickly
What NO search engine covers . . .
• Dynamic Web content
– Created through user interaction
– File extensions include *.asp, *.php, *.jsp
• PDF files (see Adobe’s new engine for these at http://searchpdf.adobe.com)
• Pages requiring a login
• Wireless content
– WAP (Wireless Application Protocol) engine available at FAST (http://alltheweb.com)
Once you have a list of hits ask
yourself . . .
• How might the domain type influence the
content of this site?
• Do I trust the author/creator? Why or why
not?
• How might the organization responsible
influence the content?
Once you have a list of hits ask
yourself . . .
• Is the date of publication critical or
important in this case?
• Is the intended audience appropriate for this
information need?
Search is . . .
Intriguing
Frustrating
Exciting
Maddening
Gratifying
The Internet is . . .
Vast
Constantly changing
Uncataloged
Of wildly varying quality
Search Services are . . .
Presently our best hope of
locating the increasingly
valuable resources found on
the ‘Net
How can I keep up???
• Use monitoring services such as
– http://searchenginewatch.com
– http://searchengineshowdown.com
– http://researchbuzz.com
• Network with colleagues and other expert users
• Try new services out (on your own, at first!!)
• Learn how to evaluate new services on your own
Thank you and best of luck!!!
Michael Hunter
Reference Librarian
Warren Hunting Smith Library
Hobart and William Smith Colleges
Geneva, NY 14456
(315) 781-3552
[email protected]