Searching and the Web - School of Communication and Information

Web sources and library &
information services
Finding, evaluating and using a
variety of Web sources for
searching and reference
© Tefko Saracevic, Rutgers University
Similarities between Web
searching & IR & reference
• Basic principles of approach are the same
– human-human interaction - interview
• social, organizational, cognitive, affective aspects to explore, including task, need …
– preparation of search concepts, terms, logic
– determination of range, restrictions
– estimation of relevance
Differences
• Vastly different sources
– as to contents, authority, reliability, persistence
– variation in amounts, depth, breadth
• Very different organization
– little standardization, few if any fields
• Quite different search engines & capabilities - basic & advanced
– also different from engine to engine
• Differing search strategies needed
Also: invisible Web
• Materials that general search engines
cannot or WILL not include in their
collection of Web pages (indexes)
• You cannot find them through general search engines
• Contains a vast amount of information
– much of it authoritative, of high quality
Why search engines miss?
• Size: Web is huge, cannot cover all
• Economics: associated costs are high
– also pay per crawl & rank
• Technical: still limited capabilities
• Spam: eliminating bad also loses good
• Restrictions: some sites do not let crawlers in
• Deep structure: some sites are complex
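One common mechanism behind the "Restrictions" point is a robots.txt file, which tells crawlers what they may not fetch. The slide does not name this mechanism, so treat the following as an illustrative sketch: the robots.txt content, the crawler name `ExampleBot` and the example.org URLs are all made up, and the check uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site: every crawler is asked
# to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def crawler_may_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the robots.txt rules allow `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules without any network access
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("http://example.org/public/index.html",
                "http://example.org/private/report.html"):
        verdict = "allowed" if crawler_may_fetch(ROBOTS_TXT, "ExampleBot", url) else "blocked"
        print(url, verdict)
```

Pages under a disallowed path never reach a well-behaved engine's index, which is one way material ends up in the invisible Web.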
Needed for Web searching
• Knowledge & competencies
– variety of Web sources
– their organization
– search engines
– Web search strategies
– search dynamics, feedback
• Keeping up & up & up
– constant updates, changes, innovations
– many domain/subject specific
Web size - who knows?
• Estimated over 16 million web servers (Lawrence & Giles, 1999)
– but only a fraction are of direct search relevance
• Domains of sites
– 83% commercial; 6% scientific or educational; 3% health
– 2.5% personal; 2% societies; 1.5% government
– about 1% each community, religion
– 1.5% pornographic
• Web Characterization Project - OCLC
– statistics, trends, reports, links …; for 2001 it reports 8.5 million web sites
– http://wcp.oclc.org/
Organization of sources
• No standardization across sources
• Major approaches in search engines
– classification: many directory types used
– statistical analyses of terms, links
• Metatags in sources
– to enable retrieval by fields
– HTML “keywords”, “description”
• 34% of sites use them
– Dublin Core - 0.3% of sites use it
• Organization: hindrance to retrieval
– also faked contents to force retrieval
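To make the metatag point concrete, here is a minimal sketch of how the HTML "keywords" and "description" fields could be pulled out of a page so that retrieval by field becomes possible. The sample page and the names `MetaTagParser` and `extract_meta_fields` are hypothetical; only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect the content of <meta name="keywords"> and <meta name="description">."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)  # attribute names arrive lowercased
        name = (attributes.get("name") or "").lower()
        if name in ("keywords", "description"):
            self.fields[name] = attributes.get("content") or ""


def extract_meta_fields(html: str) -> dict:
    """Return a dict with whichever of the two fields the page declares."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.fields


# Hypothetical page used only for illustration.
SAMPLE_PAGE = """
<html><head>
<title>Drug index</title>
<meta name="keywords" content="drugs, dosage, interactions">
<meta name="description" content="A searchable index of prescription drugs.">
</head><body>...</body></html>
"""

if __name__ == "__main__":
    print(extract_meta_fields(SAMPLE_PAGE))
```

Because the page author fills in these fields, nothing stops them from being stuffed with misleading terms, which is the "faked contents" problem noted above.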
Sources & search engines
• Indexed by search engines (publicly indexed)
– by terms, selection, links, registration
• Not publicly indexed
– many domain sources will not be found, e.g. digital libraries, online journals, reference
– many commercial sites will hardly be found
• Differing approaches to inclusion/selection
– mostly automatic; also generic source providers
– increasingly added human evaluation & selection
Search engine coverage
• No engine covers more than 16% of WWW
• With respect to the combined coverage of the top 11 engines:
– Northern Light 38.3%; Snap 37.1%; AltaVista 37.1%; HotBot 27.1%; MS 20.3%; Infoseek 19.2%; Google 18.6%; Yahoo 17.6%; Excite 13.5%; Lycos 5.9%; EuroSeek 5.2%
– HotBot, MS, Snap & Yahoo use Inktomi as search provider, but have different filtering & Inktomi databases
• Northern Light has a ‘special collection’ - documents not part of the publicly indexable web
• Hard to discern & compare coverage
• Many national search engines - own coverage
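The two kinds of figures above (share of the whole Web versus share of the combined coverage) can be related with a little arithmetic. The sketch below assumes, and this is an assumption rather than something the slide states, that the 16% ceiling belongs to the engine with the largest relative share (Northern Light). Under that assumption the 11 engines together would cover roughly 16 / 38.3, or about 42%, of the Web.

```python
# Relative coverage copied from the slide: percent of the combined
# coverage of the 11 engines.
RELATIVE_COVERAGE = {
    "Northern Light": 38.3, "Snap": 37.1, "AltaVista": 37.1, "HotBot": 27.1,
    "MS": 20.3, "Infoseek": 19.2, "Google": 18.6, "Yahoo": 17.6,
    "Excite": 13.5, "Lycos": 5.9, "EuroSeek": 5.2,
}

# Assumption: the "no engine covers more than 16%" ceiling is the
# absolute coverage of the engine with the largest relative share.
TOP_ENGINE_ABSOLUTE = 0.16


def estimate_absolute_coverage(relative: dict, top_absolute: float):
    """Return (combined coverage, per-engine coverage) as fractions of the Web."""
    top_relative = max(relative.values()) / 100
    combined = top_absolute / top_relative
    per_engine = {name: combined * share / 100 for name, share in relative.items()}
    return combined, per_engine


if __name__ == "__main__":
    combined, per_engine = estimate_absolute_coverage(RELATIVE_COVERAGE, TOP_ENGINE_ABSOLUTE)
    print(f"combined coverage of all 11 engines: {combined:.1%}")
    for name, share in sorted(per_engine.items(), key=lambda item: -item[1]):
        print(f"{name:15s} {share:.1%}")
```

Even on this rough estimate, searching all 11 engines together would still miss more than half of the publicly indexable Web.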
Search features among
engines
• Some search features are the same across all engines, but details differ - particularly in advanced search
– Boolean available
• but the default is sometimes AND, sometimes OR
– Differences may be found in:
• phrases, proximity, truncation, case sensitivity, relevance feedback, field searching, special features
• term expansion to concepts (latent semantic indexing)
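The difference between an AND default and an OR default is easy to see on a toy collection. The following is a minimal sketch, not how any particular engine works: the three documents, the whitespace tokenizer and the function names `build_index` and `search` are all made up for illustration.

```python
def build_index(docs: dict) -> dict:
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index: dict, query: str, default_operator: str = "AND") -> set:
    """Combine the query terms with the engine's default Boolean operator."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    if default_operator == "AND":
        return set.intersection(*postings)  # documents containing every term
    if default_operator == "OR":
        return set.union(*postings)         # documents containing any term
    raise ValueError(f"unsupported operator: {default_operator}")


# Hypothetical three-document collection.
DOCS = {
    1: "invisible web sources",
    2: "web search engines",
    3: "library reference sources",
}

if __name__ == "__main__":
    index = build_index(DOCS)
    print("AND default:", sorted(search(index, "web sources", "AND")))
    print("OR default: ", sorted(search(index, "web sources", "OR")))
```

The same two-word query returns one document under an AND default and all three under an OR default, which is why it pays to know how each engine treats a query.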
Search strategies & outputs
• Geared toward very short searches
– big majority of searches 2-3 terms (av. 2.5)
• in IR av. 7-14 - making a big difference
• Directory browsing a big component - not in IR
• Geared toward limited top outputs
• Ranking output by relevance predominates
– relevance calculations differ & are proprietary (secret)
– except Google - they published their method
– affects search strategy - you have to guess how it is done
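The method Google published is the link-analysis algorithm known as PageRank. The sketch below is a simplified power-iteration version on a made-up four-page link graph, meant only to show the idea that a page ranks higher when more (and better ranked) pages link to it. The damping value 0.85, the iteration count and the page names are assumptions of this example, and a production ranking combines such a score with many other signals.

```python
def pagerank(links: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """Simplified PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Pass the damped rank on in equal parts to linked pages.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: pages a, b and d all link to c; c links back to a.
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}

if __name__ == "__main__":
    for page, score in sorted(pagerank(LINKS).items(), key=lambda item: -item[1]):
        print(page, round(score, 3))
```

In this graph page c comes out on top because three pages link to it, and page a follows because it is the only page that c links to.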
Meta search engines
• Search engines that cover search engines – many around, e.g.
– All4one http://all4one.com/
• four windows - good for comparison
– CNET Search.com http://www.search.com/
• meta engine of meta engines - customization
– Search Engines Worldwide http://www.twics.com/~takakuwa/search/search.html
• 174 countries, over 1300 engines
• More on the horizon & differing
Specialized meta engines
• Selective with directories & large number
of databases & search engines
– Complete Planet http://completeplanet.com
– Invisible Web http://invisibleweb.com
• U.S. federal information via Government
Printing Office Access
http://www.gpo.gov/gpoaccess
– Federal Bulletin Board (file libraries for download
from many agencies): http://fedbbs.access.gpo.gov
Reference (expert) services
• Reference services - several models
– Q&A, directories, email answers etc. – e.g.
– Martindale’s Reference Desk - comprehensive
http://www-sci.lib.uci.edu/~martindale/Ref.html
– Ask Jeeves! – most popular http://www.ask.com/
– Ask ERIC – education questions - email answers
http://www.askeric.org/Qa/
– Information Please - almanac type questions
http://www.infoplease.com/
• Academic libraries developing reference
models - new service area
Libraries as Web sources
• Academic libraries providing open
collections & services; models vary
– Rutgers libraries - big long term effort
http://www.libraries.rutgers.edu/
– various sources & links involved
• for domain information & sources go to:
– Electronic Reference Sources; Subject Research
Guides: Social Sciences & Law; Library &
Information Science
– University of California, Berkeley - a most elaborate effort together with Sun Microsystems http://sunsite.berkeley.edu/
Virtual libraries on the Web
• Libraries emerging only on the Web
– More & more libraries & organizations involved
• Examples of academic & public libraries
– Virtual Library - Switzerland, US, UK & other
countries – ‘oldest virtual library on the Web’
• http://vlib.org
– Toronto Public Library
• http://vrl.tpl.toronto.on.ca/
– Internet Public Library, Michigan
• http://www.ipl.org/
Domain sites
• Many domain/issue specific sites
– rich & often unique coverage & services
– different approaches & requirements
• Examples in health related domains:
– Medscape - registration required
http://www.medscape.com/
– Rxlist - The Internet Drug Index
http://www.rxlist.com/
– Mayo Clinic HealthOasis
http://www.mayohealth.org/
Societies, organizations, publishers
• Great many rich sources for searching
– differences in requirements, depth, richness
• Examples from variety of organizations:
– Assoc. for Computing Machinery
http://www.acm.org/
• Digital Library; subscription or registration
– State Department http://www.state.gov/
• about the U.S. & other countries
– R.R. Bowker http://www.bowker.com/
• Free Resources from Bowker; Library Resource Guide
– Genealogy: http://www.familysearch.org/
Language barriers on the Web
• English still the major language
– but declining, now slightly over 50%
• Multilingual retrieval search engines
– Euroseek – searches 40 languages
http://www.euroseek.com/
– All the Web – 45 languages
http://www.alltheweb.com/
– in both, search in different languages covers
primarily their language sources
Language barriers: translations
• A number of translation sites
– machine aided – i.e. plug in terms, phrases, sentences in one & review in the other language, but effectiveness???
– Free Translations
http://www.freetranslations.com
– Babel Fish http://babelfish.altavista.com/tr
– Travlang – great for travelers – phrases
http://www.travlang.com
Key professional competencies
• Knowledge of SOURCES in area of interest
– search engines not enough
– not too helpful in finding these other sources; structure hard to discern
• Evaluation of sources
– a key professional skill!
• standard criteria: quality, veracity, coverage etc.
• plus Web criteria: authority; accuracy; currency (timeliness); objectivity; coverage; persistence; usability
– http://www.otterbein.edu/learning/libpages/subeval.htm
competencies …
• Knowledge of users & use
• Knowledge of searching
• Use of technology
• Adaptability, flexibility
• Integration with other resources
• Teaching others
• Constant learning & update