Transcript: Web IR

Internet Resources Discovery (IRD)
Web IR
Web IR
• What’s Different about Web IR?
• Web IR Queries
• How to Compare Web Search Engines?
• The ‘HITS’ Scoring Method
What’s different about the Web?
• Bulk: 500M pages; growth at 20M/month
• Lack of Stability: estimates of 1%/day to 1%/week
• Heterogeneity
  – Types of documents: text, pictures, audio, scripts...
  – Quality
  – Document Languages: 100+
• Duplication
• Non-running text
• High Linkage: ≥ 8 links/page on average
Taxonomy of Web Document Languages
[Figure: taxonomy tree of Web document languages]
– Metalanguages: SGML, XML
– Languages: HyTime, TEI Lite, HTML (SGML-based); XHTML, RDF, MathML, SMIL (XML-based)
– Style sheets: DSSSL, XSL, CSS
Non-running Text
What’s different about the Web Users?
• Make poor queries
  – short (2.35 terms average)
  – imprecise terms
  – sub-optimal syntax (80% without operators)
  – low effort
• Wide variance in
  – Needs
  – Expectations
  – Knowledge
  – Bandwidth
• Specific behavior
  – 85% look at one result screen only
  – 78% of queries not modified
Why don’t the Users get what they Want?
Example of the translation problems (polysemy and synonymy):
  – User need: I need to get rid of mice in the basement
  – User request (verbalized): What's the best way to trap mice alive?
  – Query to IR system: Mouse trap
  – Results: computer supplies, software, etc.
Alta Vista: Mouse trap [results screenshot]
Alta Vista: Mice trap [results screenshot]
Challenges on the Web
• Distributed data
• Dynamic data
• Large volume
• Unstructured and redundant data
• Data quality
• Heterogeneous data
Web IR Advantages
• High Linkage
• Interactivity
• Statistics
  – easy to gather
  – large sample sizes
Evaluation in the Web Context
• Quality of pages varies widely
• Relevance is not enough
• We need both relevance and high quality = value of page
Example of Web IR Query Results
How to Compare Web Search Engines?
• Search engines hold huge repositories!
• Search engines hold different resources!
• Solution: Precision at top 10
  – % of the top 10 returned pages that are relevant (“ranking quality”); see the sketch below
[Diagram: Venn diagram of the Resources, the Retrieved/Returned set (Ret), and the Relevant set; their overlap RR is the relevant retrieved pages]
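To make the metric concrete, here is a minimal sketch (not from the slides) of computing precision at top 10 from a ranked result list and a set of judged-relevant pages; the function name and the example URLs are made up for illustration:

```python
def precision_at_k(ranked_results, relevant, k=10):
    """Fraction of the top-k returned pages that are judged relevant."""
    top_k = ranked_results[:k]
    return sum(1 for page in top_k if page in relevant) / k

# Hypothetical example: 10 returned URLs, 4 of them judged relevant.
results = ["http://example.com/page%d" % i for i in range(10)]
relevant_pages = {results[0], results[2], results[3], results[7]}
print(precision_at_k(results, relevant_pages))  # 0.4
```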
The ‘HITS’ Scoring Method
• New method from 1998:
  – improved quality
  – reduced number of retrieved documents
• Based on the Web’s high linkage
• Simplified implementation in Google (www.google.com)
• Advanced implementation in Clever
• Reminder: hypertext has a nonlinear graph structure
‘HITS’ Definitions
• Authorities: good sources of content
• Hubs: good sources of links
‘HITS’ Intuition
• Authority comes from in-edges; being a hub comes from out-edges.
• Better authority comes from in-edges from hubs; being a better hub comes from out-edges to authorities.
[Diagram: hub pages (H) pointing to authority pages (A)]
‘HITS’ Algorithm
Repeat until HUB and AUTH converge:
  Normalize HUB and AUTH
  HUB[v]  := Σ AUTH[ui] for all ui with Edge(v, ui)
  AUTH[v] := Σ HUB[wi] for all wi with Edge(wi, v)
[Diagram: node v with in-edges from w1 ... wk and out-edges to u1 ... uk]
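A minimal sketch of this iteration in Python (not from the slides). The graph is a hypothetical adjacency list mapping each page to the pages it links to; the slide does not specify the normalization, so the sketch uses the Euclidean norm:

```python
import math

def hits(graph, iterations=50):
    """graph: dict mapping each page to the list of pages it links to."""
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # HUB[v] := sum of AUTH[u] over all u with Edge(v, u)
        new_hub = {v: sum(auth[u] for u in graph.get(v, ())) for v in pages}
        # AUTH[v] := sum of HUB[w] over all w with Edge(w, v)
        new_auth = {v: sum(new_hub[w] for w in pages if v in graph.get(w, ())) for v in pages}
        # Normalize both score vectors so they stay bounded
        for scores in (new_hub, new_auth):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
        hub, auth = new_hub, new_auth
    return hub, auth

# Hypothetical 4-page web: 'a' and 'b' act as hubs, 'c' as the main authority.
web = {"a": ["c", "d"], "b": ["c", "d"], "d": ["c"]}
hub, auth = hits(web)
print(max(auth, key=auth.get))  # 'c' comes out as the top authority
```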
Google Output: Princess Diana [results screenshot]
Prototype Implementation (Clever)
1. Selecting documents using the index (the root set)
2. Adding linked documents to form the base set (steps 1-2 are sketched in code below)
3. Iterating to find hubs and authorities
[Diagram: the root set expanded with linked documents into the larger base set]
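A rough sketch of steps 1-2 in Python (not from the slides; search_index, out_links, and in_links are hypothetical lookups standing in for a real engine index and link database, and the size limits are illustrative):

```python
def build_base_set(query, search_index, out_links, in_links,
                   root_size=200, max_back_links=50):
    """Step 1: take the top-ranked pages for the query as the root set.
    Step 2: expand it into the base set with the pages the root links to
    and a bounded sample of the pages that link into the root.
    Assumed interfaces:
      search_index(query, k) -> list of k page ids
      out_links(page) / in_links(page) -> list of page ids
    """
    root = set(search_index(query, root_size))
    base = set(root)
    for page in root:
        base.update(out_links(page))                   # forward links
        base.update(in_links(page)[:max_back_links])   # sampled back links
    return root, base

# Step 3 would then run the HITS iteration (see the earlier sketch)
# on the subgraph induced by the base set.
```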
By-products
• Separates Web sites into clusters.
• Reveals the underlying structure of the World Wide Web.