Searching and the Web
Information jungle on the
Web:
finding and evaluating
information sources
Tefko Saracevic, PhD
Rutgers University
http://www.scils.rutgers.edu/people/faculty/tefko.html
Web & information:
key problems
SEARCHING the Web for information
Retrieving a MANAGEABLE AMOUNT
Selecting the most RELEVANT sources
EVALUATING sources & information
Three laws for information on the Web:
1. EVALUATE
2. EVALUATE
3. EVALUATE
Tefko Saracevic, Rutgers University
Characteristics of
information on the Web
VARIETY - amazing
rich source on myriad topics & subjects
DISTRIBUTION - all over, global
information scattered across great many sites
LINKAGE - many hyperlinks, hypertexts
elaborate web of connections, paths, and mazes
AMOUNT - huge, growing exponentially
millions of sites, billions of pages
Characteristics … (cont.)
CONTENT VALUE NEUTRAL - anything goes
no control of content
some accurate, trustworthy, verifiable
some biased, self-serving, propaganda, promotional
some false accidentally
some false deliberately, some even with evil intent
Thus, the three Web laws
Size of the Web
Over 16 million web servers; 800 million pages
83% commercial, 6% scientific or educational; 3% health
2.5% personal; 2% societies; 1.5% government,
about 1% each community, religion; 1.5% pornographic
Growth 97-99 public sites +179%
Countries of origin:
U.S. 55% (59% in 1997), Germany 6%, Canada 5%, UK 5%, Japan 3%,
Australia, Brazil, France, Italy 2% each, all others 18%
Languages: 80% English (84% in 1997)
US sites & English language predominate, but % falling steadily
Sources: Lawrence & Giles, Nature (1999): http://www.wwwmetrics.com/
OCLC Web Characterization Project
http://oclc.org/oclc/research/projects/webstats/index.htm
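The growth figure above covers a two-year span. A rough sketch of what it implies per year, assuming growth compounded evenly over the two years (an assumption for illustration; the cited sources do not state the yearly pattern):

```python
# Assumption: the +179% growth in public sites over 1997-99 compounded
# evenly across two years. This is illustrative, not a figure from the sources.
total_growth = 1.79   # +179% over the whole period
years = 2

growth_factor = 1 + total_growth                 # sites multiplied by 2.79
annual_rate = growth_factor ** (1 / years) - 1   # compound annual growth rate

print(f"Implied annual growth: {annual_rate:.0%}")
```

Under that assumption the number of public sites grew by roughly two thirds each year.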
Organization of Web sites
Metatags - to enable retrieval by fields - low use
HTML “keywords”, “description”
34% of sites use them
Dublin Core - 0.3% of sites use it
No standardization across sources
Classification a predominant approach
many types used
Lack of organization major hindrance to retrieval
also faked contents to force retrieval
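As an illustration of the HTML "keywords" and "description" metatags mentioned above, here is a minimal sketch that reads them from a page using Python's standard library. The sample page is made up for this example, and the class name MetaTagParser is hypothetical.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name")
        content = attr.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


# Hypothetical page, invented for illustration only
sample_page = """<html><head>
<title>Example</title>
<meta name="keywords" content="information retrieval, Web searching">
<meta name="description" content="Notes on evaluating Web sources">
</head><body></body></html>"""

parser = MetaTagParser()
parser.feed(sample_page)

# Split the comma-separated keyword field into individual terms
keywords = [k.strip() for k in parser.meta.get("keywords", "").split(",") if k.strip()]
print(keywords)
print(parser.meta.get("description"))
```

Because the page author supplies these fields, they are also where faked contents can be placed, so an engine cannot take them at face value.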
Comparison: Web & library
or inf. retrieval searching
SIMILARITIES in searching
Basic principles of approach are the same
human-human interaction - mediated or introspection
to determine content, explore information need for a task
preparation of search concepts, terms, logic
determination of range, restrictions
estimation of relevance
Differences
Vastly different sources
as to contents, authority, reliability, persistence
variation in amounts, depth, breadth
Very different organization
little standardization, few if any fields
Quite different search engines
Differing search strategies needed
Presence of many links; complex connections
Evaluation more complex
Needed for Web searching
Knowledge & competencies
about great variety of sources
great variety in their organization
search engines
search strategies; search dynamics
exploring & exploiting links & networks
keeping up: constant changes, innovations
Web economics - no such thing as free lunch
Effectiveness proportional to that knowledge
Criteria for evaluation
http://www.otterbein.edu/learning/libpages/subeval.htm
Authority
Author - possible bias? Publisher - reputation?
Professional society? Academic source?
Reason for being on the Web?
Vanity pages? Sponsor? Advocacy association?
Domain name - who put up the site?
Accuracy - possible independent verification? Sources?
Currency - verification
Prior review, experiences - checking review sources
Critical thinking & constant verification
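The domain name check above can be started mechanically. The sketch below pulls the host and top-level domain out of a URL with Python's standard library. The function name site_clues is hypothetical, and treating "/~" in the path as a sign of a personal directory is only a common convention assumed here, not a rule.

```python
from urllib.parse import urlparse


def site_clues(url):
    """Return simple clues about who may have put up a site."""
    parts = urlparse(url)
    host = parts.hostname or ""
    labels = host.split(".") if host else []
    return {
        "host": host,
        "top_level_domain": labels[-1] if labels else "",
        # Assumption: "/~name/" often marks a personal directory on a server
        "personal_directory": "/~" in parts.path,
    }


clues = site_clues("http://www.otterbein.edu/learning/libpages/subeval.htm")
print(clues)
```

Such clues only narrow the question. Judging authority, accuracy, and currency still takes the critical reading described above.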
Ways to search & retrieve
Most popular: search engines
global, regional, country, specialized engines
Following links from major sites & portals
e.g. from Library of Congress to many libraries
from newspapers to archives
Reference sites - growing numbers
Library sites - becoming ever richer sources
Web addresses in print sources, newspapers
Referrals, emails, bookmarks
Web sites & search engines
Indexed by search engines (public sites)
by keywords, classification, links, registration
Hard to find
most domain sources will not be found, e.g. digital
libraries, online journals, reference sources
many commercial sites
Differing approaches to inclusion/selection
mostly automatic; also generic source providers
increasingly added human evaluation
Search engine coverage
No US engine covers more than 16% of the Web
Very hard to discern coverage
In respect to combined coverage of 11 top engines:
Northern Light 38.3%; Snap 37.1%; AltaVista 37.1%; HotBot 27.1%; MS 20.3%;
Infoseek 19.2%; Google 18.6%; Yahoo 17.6%; Excite 13.5%; Lycos 5.9%;
EuroSeek 5.2%
HotBot, MS, Snap & Yahoo use Inktomi as search provider, but have
different filtering & Inktomi databases
Large European engines geared to country coverage
E.g. Wanadoo (France), T-online (Germany)
highest use among engines in their countries
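The two kinds of figures above (share of the whole Web, and share of the combined coverage of 11 engines) can be related with simple arithmetic. The sketch below assumes that the 16% ceiling applies to the top-ranked engine, Northern Light, and uses the 800 million page estimate from the size figures. Both are assumptions made for illustration, so the derived numbers are rough.

```python
# Share of the combined coverage of the 11 engines, in percent (from the slide)
relative_share = {
    "Northern Light": 38.3, "Snap": 37.1, "AltaVista": 37.1, "HotBot": 27.1,
    "MS": 20.3, "Infoseek": 19.2, "Google": 18.6, "Yahoo": 17.6,
    "Excite": 13.5, "Lycos": 5.9, "EuroSeek": 5.2,
}

# Assumptions: the 16% ceiling belongs to the top engine; the Web has 800M pages
top_engine_share_of_web = 0.16
web_pages_millions = 800

# If 38.3% of the combined coverage equals 16% of the Web,
# the combined coverage of all 11 engines follows by division.
combined_share_of_web = top_engine_share_of_web / (relative_share["Northern Light"] / 100)

# Estimated share of the whole Web for each engine
share_of_web = {
    name: pct / 100 * combined_share_of_web for name, pct in relative_share.items()
}

for name, share in share_of_web.items():
    print(f"{name:15s} {share:6.1%}  ~{share * web_pages_millions:4.0f}M pages")
print(f"Combined coverage of all 11 engines: {combined_share_of_web:.1%}")
```

Under these assumptions even the 11 engines taken together reach well under half of the Web.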
Unique search engines
A number of specialized engines - looking for a niche
good for scientific, technical, professional searches
include manual evaluation & selection of sources
Northern Light has ‘special collections’
not found on publicly indexable Web
http://www.northernlight.com
Oingo has word associations, evaluations
includes elaborate classification
http://www.oingo.com/
Search features among
engines
Some search features the same across all but
details differ - particularly in advanced
Boolean available
but sometimes AND sometimes OR default
Differences may be found in:
phrases, proximity, truncation, case sensitivity,
relevance feedback, field searching, special features
some have term expansion to concepts & lists of
associated terms (e.g. latent semantic indexing)
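Why the default operator matters can be seen in a toy example. The sketch below builds a tiny in-memory index over three made-up documents and runs the same two-term query with an AND default and with an OR default. It illustrates the general idea only, not how any particular engine works.

```python
# Made-up documents for illustration
docs = {
    1: "web search engines",
    2: "library catalog search",
    3: "web directories and portals",
}


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query, default_op="AND"):
    """Combine the terms of a query with the given default operator."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    if default_op == "AND":
        return set.intersection(*postings)   # documents containing every term
    return set.union(*postings)              # documents containing any term


index = build_index(docs)
and_hits = search(index, "web search", default_op="AND")
or_hits = search(index, "web search", default_op="OR")
print(sorted(and_hits), sorted(or_hits))
```

The same query returns one document under AND and all three under OR, so a searcher needs to know which default an engine applies.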
Search strategies &
outputs
Geared toward very short searches
big majority of searches 2-3 terms (av. 2.5)
big majority of users view one page only
Geared toward limited top outputs
Ranking output by relevance predominates
relevance calculations differ & are kept secret
Also heavy & increasing use of classification
Browsing a big component
Meta search engines
Search engines that cover search engines e.g.
All4one
http://all4one.com/
four windows - good for comparison
Savvy Search
http://www.savvysearch.com/
indicates search engine source
More on the horizon & differing
Search Engine Watch http://www.searchenginewatch.com/
listing, reviews, ratings, tests, resources, tutorials
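The basic idea of a meta search engine, sending one query to several engines and showing which engine each result came from, can be sketched as below. The engine names and result lists are invented, and the ordering rule (more engines first, then best rank) is just one plausible choice. It is not a description of how All4one or Savvy Search work.

```python
def merge_results(results_by_engine):
    """Merge ranked URL lists, recording which engines returned each URL."""
    merged = {}
    for engine, urls in results_by_engine.items():
        for rank, url in enumerate(urls, start=1):
            entry = merged.setdefault(url, {"engines": [], "best_rank": rank})
            entry["engines"].append(engine)
            entry["best_rank"] = min(entry["best_rank"], rank)
    # Assumed ordering: URLs found by more engines first, then by best rank
    return sorted(
        merged.items(),
        key=lambda item: (-len(item[1]["engines"]), item[1]["best_rank"], item[0]),
    )


# Hypothetical engines and results, for illustration only
results = {
    "engine_a": ["http://example.org/a", "http://example.org/b"],
    "engine_b": ["http://example.org/b", "http://example.org/c"],
}

merged_list = merge_results(results)
for url, info in merged_list:
    print(url, "from", ", ".join(info["engines"]))
```

Keeping the source engine with each result is what makes side-by-side comparison of engines possible.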
Reference sites - facts
Reference services & access changing drastically
Several models in reference services:
Martindale’s Reference Desk - comprehensive
http://www-sci.lib.uci.edu/~martindale/Ref.html
Ask Jeeves! - natural language http://www.ask.com/
over 2 million queries per day; growing 46% per quarter
Electric Library - membership http://www.elibrary.com/
Review of several reference sites
http://www.libraryjournal.com/articles/multimedia/webwatch/19991101_12593.asp
Reference ...
Sources … continued
Information Please - almanacs
http://www.infoplease.com/
Reference Desk - rich http://www.refdesk.com/
Encyclopedia Britannica
http://www.britannica.com/
great many cross-references & other sources
Webhelp - “real people, real answers, real time”
live conversation with one of the 1000+
“Web wizards” www.webhelp.com
Libraries as Web sources
Libraries providing open collections & services
growth of digital libraries & Web access
models vary; parts open to all, parts only to own users
One example, among great many:
Rutgers libraries - large & long term effort
http://www.libraries.rutgers.edu/
various sources & links involved
e.g. for domain information & sources go to:
Electronic Ready Reference Shelf; Research Guides; Social Sciences
& Law; Library & Information Science
Virtual libraries on the
Web
Libraries emerging only on the Web
More & more libraries & organizations involved
Examples of libraries rich in sources & links
Virtual Library - Switzerland, US, UK & other countries,
started by Tim Berners-Lee, the creator of the Web http://vlib.org
Toronto Public Library http://vrl.tpl.toronto.on.ca/
Internet Public Library, Michigan http://www.ipl.org/
Academic Info - “Gateway to Quality Educational
Resources.” International http://academicinfo.net/
New modes of access
Libraries, agencies, companies, developing reference
& service models - new, rich, innovative e.g.
For & about children Los Angeles Public Library - great
fun! http://www.lapl.org/kidsweb/
Parenting: Parenttime http://www.parenttime.com/home/homepage.cgi
Fathom - consortium of six leading institutions in US & UK
beta testing - top quality research coverage http://www.fathom.com/
Course on Internet use with links http://www.newbie.org/
Domain sites
Many domain/issue specific sites
rich & often unique coverage & services
different approaches & requirements
Examples in health related domains:
Medscape - registration required
http://www.medscape.com/
Rxlist - The Internet Drug Index
http://www.rxlist.com/
Mayo Clinic HealthOasis http://www.mayohealth.org/
Societies, organizations,
publishers
Great many rich sources for searching
differences in requirements, depth, richness
Examples from variety of organizations:
Assoc. for Computing Machinery http://www.acm.org/
Digital Library; subscription or registration, searchable
State Department http://www.state.gov/
about the U.S. & other countries
R.R. Bowker http://www.bowker.com/
free sections - Yours for the Asking; Library Resource Guide
Genealogy: http://www.familysearch.org/
Newspapers
Various online newspaper models are being explored
beyond having just a print copy on the Web
subscription; links; archives; more elaborate stories …
e.g. San Francisco Examiner - http://examiner.com/
articles, in depth projects, area guide (SF Gate), archive ...
Finding stories & papers: Excite News Tracker
http://nt.excite.com/
Includes: World Newspapers Resources
Index of some major world newspapers (from New
Zealand) http://www.ccc.govt.nz/Library/Resources/Newspapers/index.asp
Summary
Web is:
rapidly evolving, changing, expanding
unpredictable, rich, and valuable source
Knowledge & competencies needed to use it
effectively, also common sense & flexibility
Three Web laws always in effect!
Web economics
rewards big, but costs significant
But … limitations
The public Web does not have it all
Many rich resources not accessible without paying
DIALOG covers many fields & is larger than the Web
similarly Lexis - Nexis, Data Star etc.
Majority of content in libraries is NOT on Web
Majority of archives, old newspapers NOT on Web
WEB IS RICH, BUT NOT A BEGINNING & END
OF INFORMATION SOURCES