Searching the Web
Or
“If there’s so much out there, why can’t I find it?”
Presented by: Allen Brown IS/SE
Date: 2003-05-12

Outline - Searching the Web
1. Information Cartography
2. Visible and Invisible Web Information
3. Information Finding Strategies
4. Reference Tools, Pathfinders, Specialized Information Repositories, Subject Directories, and Search Engines
5. Information Search Strategies
6. Information Evaluation Strategies
7. Information Finding Summary
8. Search Engines and their Characteristics

Information Cartography
Imagine a physical map of an ocean basin
• identifiable areas of the sea floor
• large abyssal plain
• many undulating hills above the plain
• occasional higher elevations or plateaus
• sparse atolls and seamounts
Imagine the Web
• some information content identifiable by subject
• vast amounts of very low value information
• some good stuff distributed across many sites
• occasional sites offering both high quality and quantity
• sparse stunningly useful sites (to die for)

Information Cartography - 2
Information issues:
quality
completeness
+ location!
In searching for information we need to adjust the:
• breadth of search to find all that is relevant in an “ocean” of information
• quality level to find only “atolls” of information quality
to find everything that is important and useful

Visible and Invisible Information
Information space
Visible = indexed by search engine
Invisible = not indexed but accessible
[Figure: the information space, containing sites (3, 5, 7) and databases (db 1, 2, 4, 6); the indexes of search engines 1 to 4 cover only part of it]
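The visible / invisible split can be expressed with sets. The following is a minimal sketch in Python, not a measurement: the site and database names echo the diagram, but which engine indexes which item is an invented assumption, as is the helper name `coverage`.

```python
# Hypothetical toy data: names echo the slide's diagram; the assignment of
# items to engine indexes is invented purely for illustration.
information_space = {"site 3", "site 5", "site 7", "db 1", "db 2", "db 4", "db 6"}

engine_indexes = {
    "engine 1": {"site 7", "db 1"},
    "engine 2": {"site 3"},
    "engine 3": {"site 3", "site 7"},
    "engine 4": {"site 5"},
}

# Visible = indexed by at least one search engine.
visible = set().union(*engine_indexes.values())

# Invisible = accessible, but not indexed by any engine.
invisible = information_space - visible


def coverage(index, space):
    """Fraction of the information space that one index covers."""
    return len(index & space) / len(space)


print("visible:  ", sorted(visible))
print("invisible:", sorted(invisible))
for name, index in sorted(engine_indexes.items()):
    print(name, f"{coverage(index, information_space):.0%}")
```

In this toy data every database except db 1 stays invisible no matter how many of the four engines are consulted, which is why the other finding tools matter.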

Search Engines Won’t Do It All!
According to a recent study reported in Nature (1), no search
engine indexes more than 16% of the Web. Even though search
engine databases are enormous, they cover very little of what's
actually available on the Web.
1) Steve Lawrence and C. Lee Giles. (July 8, 1999). Accessibility of
Information on the Web. Nature, 400, 107-109.

Information Finding Strategies
Identify Starting Points based on your question:
What type of information do you need?
Facts, statistics, government document, scholarly articles, popular
opinion, music, picture, multimedia, news, …
What form do you want the information in?
Dictionary definition, encyclopedia entry, journal article, elementary
school project, video file, audio file, …
What type of site would offer this information?
Academic, commercial, government, non-government organization
How much information do you need?
Introduction, in-depth, references, …

Information Finding
Reference Materials (Often invisible)
– dictionaries, thesauri, encyclopedia, newspapers
Information Pathfinders (Sometimes invisible) / Portals / Vortals
– subject specific, highly relevant, sometimes bizarre
– usually high quality
– managed by dedicated enthusiasts, possibly amateur
– e.g., Web design, Perl, micro cars, Curta calculators, …
Specialized Information Repositories (Often invisible) / Portals
– institution-based, sometimes obscure
– usually high quality
– managed by information professionals
– e.g., government documents, archives, …

Information Finding - 2
Subject Indices (Often invisible – but this is changing)
– subject-based
– e.g., Yahoo
Search Engines and Search Brokers (Visible web)
– e.g., Google, Alta Vista, HotBot, Lycos, Vivisimo, Dogpile

Reference Tools - Dictionaries
http://www.yourdictionary.com/

Reference Tools - Thesauri
http://www.visualthesaurus.com/index.jsp

Reference Tools - Encyclopedia
http://www.britannica.com/

Pathfinders
A pathfinder site provides an information map of what is available
within a fairly narrow area of interest and is usually compiled by domain
experts. These sites are often called “vortals” (vertical portals).

Specialized Information Repositories
National Library of Canada
A specialized information repository often collects and catalogues
relatively specific information and is usually compiled by information
experts. Some are considered to be vortals.

Subject Directories
www.yahoo.com
Subject directories are lists compiled by people. They are organized in
a hierarchy where each subject includes a list of sub-topics. These sites
are often called “portals” - a one-site starting location for general
information seeking.

Subject Directories
Subject lists are usually evaluated, but sites are not presented in
order of relevancy. In other words, the best sites on a topic are not
necessarily listed first. Sites are compiled through submission of
URLs by site creators and human evaluation and selection.
One advantage of subject directories is their browsability, although this
feature is only useful for fairly general topics. A disadvantage is their
relatively small size.
Other examples of subject directories:
Infomine: http://infomine.ucr.edu
Scout Report Signpost: http://www.signpost.org/signpost

Invisible Web Directories
Look at
http://www.invisibleweb.net/

Search Engines
Search engines use computer programs that automatically collect web
sites using "spiders" or "robots". The sites are indexed and stored in
an index database.
To query a search engine, type topic keywords and Boolean
connectors into a search "box." The search engine scans its index and
returns links to websites containing the specified keyword
relationships.
Size matters - an advantage of using search engines is their coverage
(though size is relative), but this can also be a disadvantage if
relevance ranking is poor.
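The index-and-query idea can be sketched in a few lines of Python. This is a minimal illustration and not how any particular engine is built: the three pages and their URLs are hypothetical stand-ins for what a spider would have collected, and `tokenize`, `build_index` and `search_all` are names chosen here.

```python
import re
from collections import defaultdict

# Hypothetical pages standing in for what a spider would have collected.
pages = {
    "http://example.org/a": "Large male dog seeks a quiet home",
    "http://example.org/b": "A large cat and a small dog",
    "http://example.org/c": "Male cat care for beginners",
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search_all(index, keywords):
    """Return the URLs containing every keyword (an implicit Boolean AND)."""
    sets = [index.get(word.lower(), set()) for word in keywords]
    return set.intersection(*sets) if sets else set()


index = build_index(pages)
print(sorted(search_all(index, ["large", "dog"])))
print(sorted(search_all(index, ["large", "male", "dog"])))
```

The query never touches the pages themselves, only the index, which is why a page that was never collected cannot be found.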

Search Engines: Operational Concepts
[Figure: the search engine crawls the World Wide Web, extracts and indexes page contents into an index database, and performs query parsing, index lookup, and results ranking and management; the user sends a query and receives query results]

Search Engines - Does Size Matter?

Size
If you are looking for unusual or hard-to-find information, you should
try one or more of the search engines with a large index to check more
web content. This improves the likelihood of finding what you seek.
However, for general searches or when looking for information about
popular topics, a large index does not necessarily equal better results.
Also, large indexes may have longer re-visit intervals.

Search Engines: Search Scoping and Ranking / Results Management
It is essential to learn and apply each engine's specialized search
formats to narrow results, filter them, and push the most relevant
pages to the top of the results list. Use Boolean operators,
proximity connectors, stems, wildcards, sounds-like, media-type
and metadata filters.
Result relevancy ranking also depends on the size of the search
index and how the search engine interprets and uses your
query.
Each engine determines result relevancy ranking in unique ways.
Consult the help file of each engine to learn about these.
Some engines offer search refinement and conceptual clustering
for better focus (tighter “hit cluster”) or greater accuracy / validity
(centred on the “right stuff”).

Search Engines - Search Scoping
+ expands the scope, - reduces the scope
• Exact phrase - - quotes, e.g., “We hold these things to be self-evident”
• Boolean operators - and - (default) or + (caution!) not - (extreme caution!), e.g., large male dog, large or male or dog, not cat
• Proximity connectors - near - (depends on engine), e.g., spring near flower
• Stemming and wildcards - + e.g., swim* → swim, swimming, swimmer, swimmers, swimmingly, …
• Sounds-like - + e.g., table → cable, able, fable, …
• Media type - - e.g., image, audio file, …
• Concept-based + - e.g., synonym → thesaurus, antonym, homonym, …
• Metadata-based - - in some systems
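How the operators widen or narrow a result set can be seen on a toy corpus. The sketch below is only an illustration in Python: the four documents are invented, the function names are chosen here, and real engines each have their own query syntax.

```python
import re

# Hypothetical mini-corpus; document ids and texts are invented.
docs = {
    1: "the large male dog likes to swim",
    2: "a large cat watched the swimmers",
    3: "swimming lessons for every dog",
    4: "spring flower show",
}


def words(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def match_all(terms):
    """AND: every term must appear (reduces the scope)."""
    return {d for d, t in docs.items() if all(w in words(t) for w in terms)}


def match_any(terms):
    """OR: any one term is enough (expands the scope)."""
    return {d for d, t in docs.items() if any(w in words(t) for w in terms)}


def exclude(results, term):
    """NOT: drop documents containing the term (reduces the scope)."""
    return {d for d in results if term not in words(docs[d])}


def match_phrase(phrase):
    """Exact phrase: the words must be adjacent and in order."""
    target = words(phrase)
    n = len(target)
    hits = set()
    for d, t in docs.items():
        w = words(t)
        if any(w[i:i + n] == target for i in range(len(w) - n + 1)):
            hits.add(d)
    return hits


def match_prefix(stem):
    """Wildcard such as swim*: any word starting with the stem."""
    return {d for d, t in docs.items() if any(w.startswith(stem) for w in words(t))}


print(match_all(["large", "dog"]))
print(match_any(["large", "dog"]))
print(exclude(match_any(["large", "dog"]), "cat"))
print(match_phrase("large male dog"))
print(match_prefix("swim"))
```

Note how "not cat" silently discards document 2 even though it mentions "large": that is the reason for the "extreme caution!" warning.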

Search Engines - Ranking
Result relevancy ranking (=“usefulness”) can be done according to
two techniques (or some combination):
• Conventional - using intra-page information
• Relative - using extra-page information

Search Engines - Conventional Ranking
Conventional (intra-page):
• frequency of words (number and density)
• phrases (exact word sequences)
• hierarchy (e.g., closer to the top of the document)
• adjacency (proximity of words)
• metadata (keywords provided by content owners)
• font size and style (relative intra-page)
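A few of these intra-page signals can be combined into a single score. The Python sketch below is an assumption-laden illustration, not any engine's formula: it uses frequency, density, position on the page and exact-phrase matching only, the weights are arbitrary, and the two sample texts are invented.

```python
import re


def words(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def intra_page_score(text, query):
    """Score one page for a query using only the page's own text."""
    page = words(text)
    terms = words(query)
    if not page or not terms:
        return 0.0
    # Frequency: how often the query words occur on the page.
    count = sum(page.count(t) for t in terms)
    # Density: occurrences relative to the page length.
    density = count / len(page)
    # Hierarchy: an earlier first occurrence earns a larger bonus (1.0 at the top).
    positions = [page.index(t) for t in terms if t in page]
    top_bonus = 1.0 - min(positions) / len(page) if positions else 0.0
    # Phrase: bonus when the query words appear as an exact sequence.
    n = len(terms)
    has_phrase = any(page[i:i + n] == terms for i in range(len(page) - n + 1))
    phrase_bonus = 1.0 if has_phrase else 0.0
    # Arbitrary illustrative weights.
    return count + 10.0 * density + top_bonus + 2.0 * phrase_bonus


# Hypothetical page texts.
samples = {
    "repair": "Curta calculator repair and Curta calculator cleaning",
    "mention": "My grandfather owned many things including a calculator made by Curta",
}

query = "curta calculator"
ranked = sorted(samples, key=lambda name: -intra_page_score(samples[name], query))
print(ranked)
```

Because every signal comes from the page itself, a page owner can inflate the score simply by repeating keywords, which is one motivation for the extra-page techniques.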

[Figure: sample web page in which Jack Christensen answers questions about cleaning and repairing Curta calculators]

Search Engines - Relative Ranking
Relative (extra-page):
• popularity (page visits - from the search engine)
• citation (links pointing to the item)
• relevance of the pages containing the links pointing to the item (!)
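The last two bullets, counting links and weighting each link by the standing of the page it comes from, can be sketched as an iterative calculation. The Python below is a simplified illustration in the spirit of link-analysis ranking such as PageRank, not the algorithm of any particular engine; the link graph, the function name `link_rank` and the damping value are assumptions made for the example.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Score pages so that links from well-ranked pages count for more.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: three pages point at "yahoo", one also at "a".
links = {
    "a": ["yahoo"],
    "b": ["yahoo"],
    "c": ["yahoo", "a"],
    "yahoo": [],
}

rank = link_rank(links)
for page in sorted(rank, key=lambda p: -rank[p]):
    print(page, round(rank[page], 3))
```

Here "yahoo" ranks first because the most links point to it, and "a" outranks "b" only because one other page links to it.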
[Figure: Web pages with links pointing to Yahoo]

Search Engines: Keys to Success
[Figure: World Wide Web]
• Size → large index and / or several engines
• Scoped query → “wide net” but appropriate “sieve”, carefully constructed for your needs
• Ranked and manageable results → query construction and search engine features

Meta Search Engines
“Meta” search tools are able to search the index databases of
multiple engines “simultaneously”, via a single interface.
“Meta” search tools don’t really search metadata. They are simply
brokers that reformulate a query and hand it off to a set of
search engines, then combine the results.
“Meta” engines are very fast but they do not offer the same level of
control over the relationship between keywords as do individual
search engines.
Also, meta search engines may produce poor ranking of combined
results.
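The broker role can be sketched as follows. This Python example is hypothetical throughout: the three "engines" are stub functions returning fixed lists, and the merge step uses reciprocal rank fusion, which is one common way to combine ranked lists and not necessarily what any real meta engine does.

```python
from collections import defaultdict


# Hypothetical stand-ins for real engines: each returns a ranked list of URLs.
def engine_a(query):
    return ["http://example.org/1", "http://example.org/2", "http://example.org/3"]


def engine_b(query):
    return ["http://example.org/2", "http://example.org/4"]


def engine_c(query):
    return ["http://example.org/2", "http://example.org/1"]


def meta_search(query, engines, k=60):
    """Hand the query to every engine and merge the ranked lists.

    Merging uses reciprocal rank fusion: each URL earns 1 / (k + rank)
    from every list it appears in, so URLs ranked well by several
    engines rise to the top.
    """
    scores = defaultdict(float)
    for engine in engines:
        for rank, url in enumerate(engine(query), start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest combined score first; ties broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))


results = meta_search("curta calculator", [engine_a, engine_b, engine_c])
for url in results:
    print(url)
```

The merge sees only positions in each list, not the engines' internal scores, which is one reason the combined ranking can be poor.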

Search Engines
Examples of popular search engines include:
Google: http://www.google.com
Alta Vista: http://www.altavista.com
All the Web: http://www.alltheweb.com
Northern Light: http://www.northernlight.com
Also see
The KartOO clustering visual engine http://www.kartoo.com/
For meta engines, try Vivisimo at http://vivisimo.com/

Information Search Strategies
• Think hard about what you are looking for!
• Use a Reference Tool, if appropriate
• Use a Pathfinder, if you know one
• Use a Specialized Information Repository, if appropriate
• Use Subject Indexes, if it is a common topic
• Use several Search Engines, if needed, especially for the obscure or academic topic, but learn how they work
• Use keywords - be narrow, and specific (and technical)
• Use phrases - try synonyms or related concepts
• Use Boolean connectors - but find out if / how the engine uses them
• Use stemming and wildcards - but find out if / how the engine uses them
• Use media-type filters or metadata, if appropriate

Information Search Tools - Use
[Figure: tools placed in the information space by depth and breadth, ranging from “easy to use” to “hard to use well”]
• Pathfinder - focused content; pre-selected by domain experts
• Search Engines and Meta-engines - obscure or academic; caveat emptor!
• Subject Indexes - popular or common; pre-selected by interested people
• Specialized Information Repository - related or themed; pre-selected by professionals; contains “invisible” content
• Reference Tool - generic simple lookup; created by professionals; contains “invisible” content

Information Evaluation Strategies: CARS
CARS checklist:
http://library.queensu.ca./inforef/guides/evalchart.htm
• Credibility
- author credentials stated with email contact
- evidence of quality control (site location)
• Accuracy
- timeliness
- comprehensiveness
- audience & purpose
• Reasonableness
- fairness
- objectivity
- consistency
- world view
• Support
- source documentation or bibliography

Summary
• There is much information on the Web, but it’s not:
- all there
- all good (or all bad)
- always easy to locate
• Use an information search strategy that:
- matches the information sought
- uses the appropriate tools
- uses them in the correct ways
• Use an information evaluation strategy, e.g., CARS methodology.
• Choose and use search engines wisely, knowing their strengths, features, and their limitations.

How Do Search Engines Work?
Three Activities Occur:
1. Crawling
– fetch pages
– compile URL list (a db)
– re-visit pages
2. Page harvesting
– parse page
– add to index db and establish ranking
3. Responding to search requests
– parse query
– apply to index
– present and rank results
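The three activities can be traced end to end on a toy example. The Python sketch below makes several assumptions: the "Web" is a small in-memory dictionary invented for the purpose, page names stand in for URLs, ranking is a plain word count, and the re-visit step is left out.

```python
import re
from collections import defaultdict, deque

# Hypothetical in-memory "Web": page name -> (page text, outgoing links).
WEB = {
    "home": ("Curta calculator home page", ["repair", "history"]),
    "repair": ("Curta repair and cleaning", ["home"]),
    "history": ("History of the mechanical calculator", ["repair", "missing"]),
    "orphan": ("Nothing links to this calculator page", []),
}


def crawl(seed):
    """1. Crawling: fetch pages and compile the URL list, breadth first."""
    url_db = []
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if url not in WEB:  # dead link: nothing to fetch
            continue
        url_db.append(url)
        for link in WEB[url][1]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return url_db


def harvest(url_db):
    """2. Page harvesting: parse each page and add its words to the index."""
    index = defaultdict(lambda: defaultdict(int))
    for url in url_db:
        for word in re.findall(r"[a-z]+", WEB[url][0].lower()):
            index[word][url] += 1
    return index


def respond(index, query):
    """3. Responding: parse the query, apply it to the index, rank results."""
    terms = re.findall(r"[a-z]+", query.lower())
    scores = defaultdict(int)
    for term in terms:
        for url, count in index.get(term, {}).items():
            scores[url] += count
    # Highest word count first; ties broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))


url_db = crawl("home")
index = harvest(url_db)
print(url_db)
print(respond(index, "curta calculator"))
```

The "orphan" page mentions calculators but is never returned: no crawled page links to it, so it stays part of the invisible Web.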

Search Engines: Operation
[Figure: search engine operation. A Crawler Robot fetches and re-visits URLs on the World Wide Web and maintains the URL database; a Harvester Robot fetches page contents into the Index database; a Query Processor takes the user's query and returns results. Annotations: “Really clever stuff in here” and “Fairly clever stuff in here”.]

Search Engine - Hardware
(not really …)

How Do Search Engines Work?
• See “The Anatomy of a Large-Scale Hypertextual Web Search Engine” at
http://wwwdb.stanford.edu/~backrub/google.html

References
• Information Search Strategies:
<http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/FindInfo.html>
• Information Evaluation Strategies:
<http://www.vuw.ac.nz/~agsmith/evaln/evaln.htm>
• Search Engines:
<http://www.library.arizona.edu/search.htm>
<http://www.brightplanet.com/deepcontent/tutorials/search/index.asp>
<http://www.searchenginewatch.com/>
• Susan Maze, David Moxley, Donna Smith:
Authoritative Guide to Web Search Engines,
Neal Schuman Pub, 1997, ISBN 1555703054
