InvisibleWeb - School of Communication and Information


The Invisible Web
Tefko Saracevic, PhD
Rutgers University
http://www.scils.rutgers.edu/~tefko
(contains also a list of sites relevant to the
topic and this presentation)
© Tefko Saracevic, Rutgers University
What is the invisible Web?
• Materials that general search engines
cannot or WILL not include in their
collection of Web pages (indexes)
• Materials you cannot find through general search
engines
• Contains a vast amount of information
– much of it authoritative and of high quality
Why do search engines miss material?
• Size: Web is huge, cannot cover all
• Economics: associated costs are high
– also pay per crawl & rank
• Technical: still limited capabilities
• Spam: eliminating bad also loses good
• Restrictions: some sites do not let crawlers in
• Deep structure: some sites are complex
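One common form of the "Restrictions" point is a robots.txt file that asks crawlers to stay out of part of a site. The sketch below is only an illustration: the robots.txt content, the crawler name "ExampleBot", and the example.org URLs are all hypothetical. It uses Python's standard urllib.robotparser to show how a well-behaved crawler would decide what it may fetch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; not taken from any real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.org/index.html",
                "http://example.org/private/report.html"):
        print(url, is_crawlable(ROBOTS_TXT, "ExampleBot", url))
```

Pages under the disallowed path are never requested by a compliant crawler, so they stay out of the engine's index even though they are publicly reachable.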
Web size - who knows?
• Estimated over 16 million web servers
Lawrence & Giles, 1999
– But only a fraction of direct search relevance
• Domains of sites
• 83% commercial; 6% scientific or educational; 3% health
• 2.5% personal; 2% societies; 1.5% government
• about 1% each: community, religion
• 1.5% pornographic
• Web Characterization Project - OCLC
– statistics, trends, reports, links …; the 2001 report counts 8.5 million web sites
– http://wcp.oclc.org/
Organization of sources
• No standardization across sources
• Major approaches in search engines
– classification: many directory types used
– statistical analyses of terms, links
• Metatags in sources
– to enable retrieval by fields
– HTML “keywords”, “description”
• 34% of sites use them
– Dublin Core - 0.3% of sites use it
• Organization: hindrance to retrieval
– also faked contents to force retrieval
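To make the metatag point concrete, here is a minimal sketch of how an indexer might read the HTML "keywords" and "description" fields from a page. The sample page is invented for illustration, and the class and function names (MetaTagParser, extract_meta) are hypothetical; only Python's standard html.parser module is used.

```python
from html.parser import HTMLParser

# Hypothetical page used only for illustration.
SAMPLE_PAGE = """<html><head>
<title>Example</title>
<meta name="keywords" content="invisible web, databases">
<meta name="description" content="A hypothetical page.">
</head><body></body></html>"""


class MetaTagParser(HTMLParser):
    """Collects the content of <meta name="keywords"> and <meta name="description">."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = attr.get("content")
        if name in ("keywords", "description") and content is not None:
            self.fields[name] = content


def extract_meta(html: str) -> dict:
    """Return a dict with whichever of the two fields the page declares."""
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    return parser.fields


if __name__ == "__main__":
    print(extract_meta(SAMPLE_PAGE))
```

Because the page author writes these fields, they can also be stuffed with misleading terms, which is the "faked contents" problem noted above.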
Sources & search engines
• Indexed by search engines (publicly indexed)
– by terms, selection, links, registration
• Not publicly indexed
– many domain sources will not be found, e.g. digital
libraries, online journals, reference sources
– many commercial sites will hardly be found
• Differing approaches to inclusion/selection
– mostly automatic; also generic source providers
– increasingly added human evaluation & selection
Search engine coverage
• No engine covers more than 16% of the WWW
• Share of the combined coverage of the top 11 engines:
– Northern Light 38.3%; Snap 37.1%; AltaVista 37.1%; HotBot 27.1%;
MS 20.3%; Infoseek 19.2%; Google 18.6%; Yahoo 17.6%;
Excite 13.5%; Lycos 5.9%; EuroSeek 5.2%
– HotBot, MS, Snap & Yahoo use Inktomi as search provider,
but have different filtering & Inktomi databases
• Northern Light has a ‘special collection’ - documents
not part of the publicly indexable web
• Hard to discern & compare coverage
• Many national search engines - own
coverage
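The two kinds of percentages on this slide can be related with a little arithmetic. The sketch below assumes, which the slide does not state explicitly, that the 16% ceiling is the absolute coverage of Northern Light, the engine with the largest relative share. From that assumption it estimates the combined coverage of the 11 engines and converts relative shares into rough absolute ones. The function names are hypothetical.

```python
# Relative shares of the combined coverage, as given on the slide
# (three engines shown; the others work the same way).
RELATIVE_SHARE = {"Northern Light": 0.383, "Google": 0.186, "Lycos": 0.059}

# Assumption (not stated on the slide): the 16% ceiling is Northern Light's
# absolute coverage of the Web, since it leads the relative list.
TOP_ABSOLUTE = 0.16


def combined_coverage(top_absolute: float, top_relative: float) -> float:
    """Estimate the fraction of the Web covered by all engines together."""
    return top_absolute / top_relative


def absolute_coverage(relative_share: float, combined: float) -> float:
    """Convert a share of the combined coverage into a share of the whole Web."""
    return relative_share * combined


combined = combined_coverage(TOP_ABSOLUTE, RELATIVE_SHARE["Northern Light"])
estimates = {name: absolute_coverage(share, combined)
             for name, share in RELATIVE_SHARE.items()}

if __name__ == "__main__":
    print(f"combined coverage: {combined:.1%}")
    for name, value in estimates.items():
        print(f"{name}: {value:.1%}")
```

Under this assumption the 11 engines together reach roughly 42% of the Web, so more than half of it is missed even when all of them are searched.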
Meta search engines
• Search engines that cover search engines –
many around e.g.
– All4one http://all4one.com/
• four windows - good for comparison
– CNET Search.com http://www.search.com/
• meta engine of meta engines - customization
– Search Engines Worldwide
http://www.twics.com/~takakuwa/search/search.html
• 174 countries, over 1300 engines
• More on the horizon & differing
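The core step of a meta search engine, combining the ranked lists returned by several engines, can be sketched as follows. This is a simplified illustration, not how any of the services named above actually work: the result lists are hypothetical, and the merge is a plain round-robin interleave that drops duplicate URLs.

```python
from itertools import zip_longest

# Hypothetical ranked results from two engines for the same query.
ENGINE_A = ["http://a.example/1", "http://shared.example/"]
ENGINE_B = ["http://shared.example/", "http://b.example/2"]


def merge_results(result_lists):
    """Interleave ranked result lists round-robin, keeping each URL once."""
    merged, seen = [], set()
    # zip_longest pads shorter lists with None so no result is lost.
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


if __name__ == "__main__":
    for url in merge_results([ENGINE_A, ENGINE_B]):
        print(url)
```

A merge like this can only surface what the underlying engines already index, so it widens coverage of the visible Web without reaching the invisible one.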
Major source for the invisible Web
• Book
Chris Sherman & Gary Price (2001).
Invisible Web: Uncovering information
sources search engines can’t see.
Information Today
• Site
www.invisible-web.net
Specialized meta engines
• Selective with directories & large number
of databases & search engines
– Complete Planet http://completeplanet.com
– Invisible Web http://invisibleweb.com
• In the U.S., federal information is available via
Government Printing Office Access
http://www.gpo.gov/gpoaccess
• Federal Bulletin Board (file libraries for download from
many agencies): http://fedbbs.access.gpo.gov
Reference (expert) services
• Reference services - several models
– Q&A, directories, email answers etc. – e.g.
– Martindale’s Reference Desk - comprehensive
http://www-sci.lib.uci.edu/~martindale/Ref.html
– Ask Jeeves! – most popular http://www.ask.com/
– Ask ERIC – education questions - email answers
http://www.askeric.org/Qa/
– Information Please - almanac type questions
http://www.infoplease.com/
• Academic libraries developing reference
models - new service area
Libraries as Web sources
• Academic libraries providing open
collections & services; models vary
– Rutgers libraries - big long term effort
http://www.libraries.rutgers.edu/
– various sources & links involved
• for domain information & sources go to:
– Electronic Reference Sources; Subject Research
Guides: Social Sciences & Law; Library &
Information Science
– University of California, Berkeley - a most
elaborate effort together with Sun
Microsystems http://sunsite.berkeley.edu/
Virtual libraries on the Web
• Libraries emerging only on the Web
– More & more libraries & organizations involved
• Examples of academic & public libraries
– Virtual Library - Switzerland, US, UK & other
countries – ‘oldest virtual library on the Web’
• http://vlib.org
– Toronto Public Library
– Internet Public Library, Michigan
• http://www.ipl.org/
Domain sites
• Many domain/issue specific sites
– rich & often unique coverage & services
– different approaches & requirements
• Examples in health related domains:
– Medscape - registration required
http://www.medscape.com/
– Rxlist - The Internet Drug Index
http://www.rxlist.com/
– Mayo Clinic HealthOasis
http://www.mayohealth.org/
Societies, organizations, publishers
• Great many rich sources for searching
– differences in requirements, depth, richness
• Examples from a variety of organizations:
– Assoc. for Computing Machinery
http://www.acm.org/
• Digital Library; subscription or registration
– State Department http://www.state.gov/
• about the U.S. & other countries
– R.R. Bowker http://www.bowker.com/
• Free Resources from Bowker; Library Resource Guide
– Genealogy: http://www.familysearch.org/
Language barriers on the Web
• English still the major language
– but declining, now slightly over 50%
• Multilingual retrieval search engines
– Euroseek – searches 40 languages
http://www.euroseek.com/
– All the Web – 45 languages
http://www.alltheweb.com/
– in both, search in different languages covers
primarily their language sources
Language barriers: translations
• A number of translation sites
– machine aided – i.e. plug in terms,
phrases, sentences in one language & review in the
other language, but effectiveness is uncertain
– Free Translations
http://www.freetranslations.com
– Babel Fish http://babelfish.altavista.com/tr
– Travlang – great for travelers – phrases
http://www.travlang.com
News sources about the Web, visible & invisible
– The Virtual Acquisition Shelf & News Desk
http://resourceshelf.blogspot.com/
– Free Pint http://www.freepint.com/
– ResearchBuzz.
http://www.researchbuzz.com/index.shtml
– Internet Resources Newsletter.
http://www.hw.ac.uk/libwww/irn/
– Search Engine Watch.
http://www.searchenginewatch.com/
Sample of great sources for the invisible Web
– Direct Search.
http://gwis2.circ.gwu.edu/~gprice/direct.htm
– eLibrary. http://ask.elibrary.com/
– The Scout Report. http://scout.cs.wisc.edu/
– Museum of online museums.
http://www.coudal.com/archives/museum.html
– Librarians' Index to the Internet. http://www.lii.org/
– Profusion. http://www.profusion.com/
– Research Index. http://www.researchindex.com/
– Cybercafe Search Engine.
http://www.cybercaptive.com
Needed for Web searching in general
• Knowledge & competencies
– variety of Web sources
– their organization
– search engines
– Web search strategies
– search dynamics, feedback
• Keeping up & up & up
– constant updates, changes, innovations
– many domain/subject specific
Needed for Web searching by professionals
• Knowledge of SOURCES in area of interest
• search engines not enough
• not too helpful in finding these other sources;
structure hard to discern
• Evaluation of sources
– a key professional skill!
• standard criteria: quality, veracity, coverage etc.
• plus Web criteria:
authority; accuracy; currency (timeliness);
objectivity; coverage, persistence, usability
competencies …
• Knowledge of users & use
• Knowledge of searching
• Use of technology
• Adaptability, flexibility
• Integration with other resources
• Teaching others
• Constant learning & update