Boolean vs Statistical Retrieval Systems
IS530 Lesson 12
Boolean vs. Statistical
Retrieval Systems
Boolean or Statistical?
Most web search engines default to
statistical, use Boolean for advanced searching
Most proprietary online systems default
to Boolean, offer statistical as an alternative
Statistical search engine vs. relevance
ranking of Boolean results
Web Search Engines
Databases generated by robotic programs
(non-human)
spiders, wanderers, web walkers, agents
Full-text indexing of website contents
Supports advanced, complex search
strategies
3 Parts of a Web Search Engine
1. Spider or web-crawler
reads webpage, follows links
2. Index
catalogs webpages read by spider
3. Search engine software
matches queries
lists most relevant site first
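The three parts can be sketched in a few lines of Python. This is only an illustration: the pages stand in for what a spider would fetch, the URLs are made up, and ranking by the number of matching query words is an assumed, deliberately simple measure.

```python
from collections import defaultdict

# Hypothetical pages a spider might have fetched (URL -> page text).
PAGES = {
    "a.example/ice": "slipping on ice and snow",
    "b.example/risk": "assumption of risk in winter sports",
    "c.example/snow": "snow removal and risk of falling",
}

def build_index(pages):
    """Index: map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Search software: order URLs by how many query words they contain."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for url in index.get(word, ()):
            scores[url] += 1
    # Highest score first; ties broken alphabetically by URL.
    return sorted(scores, key=lambda u: (-scores[u], u))

INDEX = build_index(PAGES)
RESULTS = search(INDEX, "risk snow")
```

Here the page containing both query words is listed first, followed by the pages containing only one of them.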
3 Parts of an Online System
1) Database building software (dataware)
(follows rules with known fields)
2) Index/dictionary file
(list of all words and sometimes phrases
in the indexed fields)
3) Search engine software
(matches queries; Boolean or statistical;
LIFO or relevance-ranked output)
Boolean Operators
AND limits search
decreases hits
increases precision
OR expands search
increases hits
decreases precision
NOT limits search
seldom used
too strong
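The effect of each operator on the number of hits can be sketched with Python sets, treating each term's postings as a set of record numbers. The terms and record numbers below are invented for illustration.

```python
# Hypothetical postings: term -> set of record numbers containing it.
POSTINGS = {
    "ice": {1, 2, 5},
    "snow": {2, 3, 5, 8},
    "ski": {5, 8},
}

ice, snow, ski = POSTINGS["ice"], POSTINGS["snow"], POSTINGS["ski"]

and_hits = ice & snow   # ice AND snow: records containing both terms
or_hits = ice | snow    # ice OR snow: records containing either term
not_hits = snow - ski   # snow NOT ski: drops every record mentioning ski
```

AND returns fewer records than either term alone, OR returns more, and NOT discards record 5 even though it also mentions ice, which is one way to see why NOT is described as too strong.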
Proximity Operators
Adj, (N)ear, (W)ith
limit a search
increase precision
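A minimal sketch of proximity matching over word positions follows. The exact meaning of Adj, Near, and With differs between systems, so the definitions here (adjacent in order, within n words in either order) are assumptions, and the function names are hypothetical.

```python
def positions(text):
    """Map each word to the list of positions where it occurs."""
    pos = {}
    for i, word in enumerate(text.lower().split()):
        pos.setdefault(word, []).append(i)
    return pos

def near(text, a, b, n):
    """True if words a and b occur within n words of each other, any order."""
    pos = positions(text)
    return any(abs(i - j) <= n for i in pos.get(a, []) for j in pos.get(b, []))

def adj(text, a, b):
    """True if word b immediately follows word a."""
    pos = positions(text)
    return any(j - i == 1 for i in pos.get(a, []) for j in pos.get(b, []))

DOC = "the plaintiff assumed the risk of falling on ice"
```

With this document, `near(DOC, "assumed", "risk", 5)` is true while `adj(DOC, "assumed", "risk")` is false, because another word sits between the two terms.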
Command Interface
Boolean Searching (Westlaw)
Find information about the assumption of
risk involving people who fall after slipping
in wintry conditions.
assum! /5 risk /p (ic* or snow****) /p
(slip! or fell or fall***)
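The truncation symbols in the query above can be sketched as regular expressions. The semantics are assumed and simplified: a trailing `!` is taken to match any number of further letters, and each trailing `*` at most one further letter. The connectors `/5` and `/p` are not handled, and the function names are hypothetical.

```python
import re

def term_to_regex(term):
    """Translate one search term into a compiled, whole-word regex.

    Assumed semantics (a simplification): a trailing '!' matches any
    number of further letters; each trailing '*' matches at most one.
    """
    if term.endswith("!"):
        pattern = re.escape(term[:-1]) + r"\w*"
    else:
        root = term.rstrip("*")
        stars = len(term) - len(root)
        pattern = re.escape(root) + r"\w{0,%d}" % stars
    return re.compile(r"\b" + pattern + r"\b", re.IGNORECASE)

def matches(term, text):
    """Return every word in text that the term would retrieve."""
    return [m.group(0) for m in term_to_regex(term).finditer(text)]
```

Under these assumptions, `matches("snow****", "snow snowy snowing snowstorm")` retrieves the first three words but not "snowstorm", which adds five letters to the root.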
Natural Language and
Relevance Ranking (WIN)
I need information on
assumption of risk involving a
person who has fallen on ice or
snow.
Non-Boolean Retrieval Systems
Statistical
(associative, probabilistic,
or relevance systems)
Linguistic
(semantic)
Statistical Retrieval Systems
Incorporate relevance ranking
May incorporate relevance feedback
May have natural language interface
Almost all web search engines use statistical retrieval
Algorithm
Latin algorismus, after al-Khwārizmī
Arabian mathematician (AD 825)
Step-by-step procedure for solving
mathematical problems
Merriam-Webster
http://www.m-w.com/
Statistical search engines use weighting
algorithms to compute relevance
Statistical Search Engines
Weighting algorithms are proprietary
Search engines differ in how they assign
weights and compute relevance ranking
Search results differ
studies found only about 40% overlap
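One way to measure overlap between two engines is sketched below. The definition used (shared results divided by all distinct results) is an assumption, since the slide does not say how the studies measured it, and the result lists are invented.

```python
def overlap(results_a, results_b):
    """Share of distinct results that both engines returned."""
    a, b = set(results_a), set(results_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical top-five lists from two engines for the same query.
ENGINE_A = ["u1", "u2", "u3", "u4", "u5"]
ENGINE_B = ["u3", "u4", "u5", "u6", "u7"]
```

For these lists, three of the seven distinct results are shared, so `overlap(ENGINE_A, ENGINE_B)` is 3/7.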
Statistical Web Retrieval Factors
Popularity: number of other sites that link to a site
authoritative sites given heavier weight
Google
Meta-tags may boost ranking
Inktomi/Overture
Direct Hit may boost ranking
HotBot
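Link popularity in its simplest form can be sketched as a count of inbound links. This is not any engine's actual algorithm, which the slides note is proprietary; it ignores the heavier weight given to authoritative sites, and the site names are placeholders.

```python
from collections import Counter

# Hypothetical links as (linking site, linked-to site) pairs.
LINKS = [
    ("a", "c"), ("b", "c"),
    ("c", "d"), ("a", "d"), ("b", "d"),
    ("d", "a"),
    ("a", "a"),  # a site linking to itself is not counted
]

def inlink_counts(links):
    """Count the distinct other sites that link to each site."""
    counts = Counter()
    seen = set()
    for src, dst in links:
        if src != dst and (src, dst) not in seen:
            seen.add((src, dst))
            counts[dst] += 1
    return counts

COUNTS = inlink_counts(LINKS)
RANKING = [site for site, _ in COUNTS.most_common()]
```

Site "d" ranks first here because three other sites link to it.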
Linguistic Retrieval System
Natural Language & Relevance
Ranking
WIN (Westlaw Is Natural) has some linguistic elements
I need information on assumption of risk
involving a person who has fallen on ice or
snow.
WIN Steps
1. Enter query in plain English
2. System removes stop phrases
3. Matches legal phrases from thesaurus,
adjusts weighting
4. Removes stop words
WIN Steps (cont.)
5. Stemming
6. Searches database indexes in OR
relationship
7. Statistical comparison applied
8. Results placed in ranked order
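The eight steps can be sketched end to end as follows. This is an illustration under stated assumptions, not the WIN implementation: the stop phrases, stop words, thesaurus entry, weights, stemmer, and case texts are all invented for the example.

```python
STOP_PHRASES = ["i need information on"]
LEGAL_PHRASES = {"assumption of risk"}  # hypothetical thesaurus entry
STOP_WORDS = {"a", "who", "has", "on", "or", "of", "the"}

def stem(word):
    """Very crude suffix stripping; real stemmers are far more careful."""
    for suffix in ("ing", "en", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def parse_query(query):
    """Steps 1-5: turn a plain-English query into weighted terms."""
    text = query.lower().strip(" .")
    for phrase in STOP_PHRASES:            # step 2: remove stop phrases
        text = text.replace(phrase, " ")
    weights = {}
    for phrase in LEGAL_PHRASES:           # step 3: legal phrases weigh more
        if phrase in text:
            weights[phrase] = 2.0
            text = text.replace(phrase, " ")
    for word in text.split():              # steps 4-5: stop words, stemming
        word = word.strip(".,")
        if word and word not in STOP_WORDS:
            weights[stem(word)] = 1.0
    return weights

def rank(docs, weights):
    """Steps 6-8: OR search, score each document, sort by score."""
    scored = []
    for doc_id, text in docs.items():
        text = text.lower()
        stems = {stem(w.strip(".,")) for w in text.split()}
        score = sum(w for term, w in weights.items()
                    if (term in text if " " in term else term in stems))
        if score > 0:                      # any single matching term suffices
            scored.append((score, doc_id))
    return [d for _, d in sorted(scored, key=lambda p: (-p[0], p[1]))]

QUERY = ("I need information on assumption of risk involving a person "
         "who has fallen on ice or snow.")
DOCS = {
    "case1": "Plaintiff fell on snow and assumption of risk was argued.",
    "case2": "A person slipped on ice.",
    "case3": "Contract dispute over delivery dates.",
}
WEIGHTS = parse_query(QUERY)
RANKED = rank(DOCS, WEIGHTS)
```

Because the terms are searched in an OR relationship, a document matching only some of them is still retrieved; the weighting then decides its place in the ranked list.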
Factors in Determining Relevance
Proximity of query words to each other
Position of query words
keywords in title rank higher
keyword in headline or near top
Relative length of document
(“normalization”)
Stemming
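The position factor can be sketched as field weighting, where a keyword in the title counts for more than the same keyword in the body. The weights of 3.0 and 1.0 and the two documents are assumed values chosen only for illustration.

```python
# Assumed weights: a title match counts three times as much as a body match.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}

def field_score(query, doc):
    """Sum query-word occurrences in each field, scaled by the field weight."""
    terms = query.lower().split()
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = doc.get(field, "").lower().split()
        total += weight * sum(words.count(t) for t in terms)
    return total

DOC1 = {"title": "Ice safety", "body": "Walk slowly in winter"}
DOC2 = {"title": "Winter walking", "body": "Watch for ice on steps"}
```

For the query "ice", DOC1 scores 3.0 and DOC2 scores 1.0, so the document with the keyword in its title ranks higher.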
Factors in Determining Relevance
(cont.)
Ignore very frequent terms
Inverse document frequency (terms found in fewer documents weigh more)
Relevance feedback
Stop words
Query expansion/thesaurus
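Several of these factors (stop words, down-weighting frequent terms, and normalization for document length) can be combined in a small tf-idf style score. This is a textbook-style sketch, not any particular engine's formula; the stop list and documents are made up.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "of", "on", "a", "and"}

DOCS = {
    "d1": "risk of ice on the road",
    "d2": "ice ice and snow on the ice rink",
    "d3": "the assumption of risk",
}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def score(query, docs):
    """Score each document: length-normalized term frequency times log(N/df)."""
    tokens = {d: tokenize(t) for d, t in docs.items()}
    n = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for words in tokens.values() for w in set(words))
    scores = {}
    for d, words in tokens.items():
        tf = Counter(words)
        total = 0.0
        for term in tokenize(query):
            if tf[term]:
                total += (tf[term] / len(words)) * math.log(n / df[term])
        scores[d] = total
    return scores

SCORES = score("ice snow", DOCS)
```

In this example "snow" occurs in only one document, so it contributes more per occurrence than "ice", which occurs in two; a word present in every document would contribute nothing.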
Features Users Can Control
Designating “bound phrases”
Flagging terms that must be present*
Specifying truncat?
Indicating (synonym groups)
Synonym dictionaries
Web Sites that list search engines and features:
www.pandia.com
www.searchenginewatch.com
http://notess.com