Transcript: AIRS 2005

Fusion-based Approach to Web Search Optimization
Kiduk Yang, Ning Yu
WIDIT Laboratory, SLIS, Indiana University
OUTLINE
Introduction
WIDIT in TREC Web Track
Results & Discussion
Introduction
Web IR
Challenges
• Size, Heterogeneity, Quality of data
• Diversity of user tasks, interests, characteristics
Opportunities
• Diverse Sources of Evidence
• Data Abundance
WIDIT Approach to Web IR
Leverage Multiple Sources of Evidence
Utilize Multiple Methods
Apply Fusion
Research Question
What to combine?
How to combine?
WIDIT in Web Track 2004
Data
Documents
• 1.25 million .gov Web pages (18 GB).
Topics (i.e. queries)
• 75 Topic Distillation (TD), 75 Home Page (HP), 75 Named Page (NP)
Task
Retrieve relevant documents given a mixed set of query types (QT)
Main Strategy
Fusion of multiple data representations
• Static Tuning (QT-independent)
Fusion of multiple sources of evidence
• Dynamic Tuning (QT-specific)
WIDIT: Web IR System Architecture
[Architecture diagram: Documents feed the Indexing Module, which builds body, anchor, and header indexes (each split into sub-indexes); Topics pass through the Query Classification Module to yield simple queries, expanded queries, and query types; the Retrieval Module searches the sub-indexes, the Fusion Module combines the search results under static tuning, and the Result Re-ranking Module applies QT-specific dynamic tuning fusion to produce the final result.]
WIDIT: Indexing Module
Document Indexing
1. Strip HTML tags
   • extract title, meta keywords & description, emphasized words
   • parse out hyperlinks (URL & anchor texts)
2. Create surrogate documents (see the sketch below)
   • anchor texts of inlinks
   • header texts (title, meta text, emphasized text)
3. Create subcollection indexes
   • stop & stem (Simple, Combo stemmer)
   • compute SMART & Okapi term weights
4. Compute whole-collection term statistics
Query Indexing
1. Stop & stem
2. Identify nouns, phrases
3. Expand acronyms
4. Mine synonyms and definitions from Web search
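A minimal sketch of the surrogate-document step (step 2 of Document Indexing): each page gets an anchor document built from the anchor texts of its inlinks and a header document built from its own title, meta, and emphasized text. The `inlinks` map and field names here are illustrative, not WIDIT's actual data structures.

```python
# Hypothetical shapes: `page` is a parsed-page dict from step 1;
# `inlinks` maps a URL to the anchor texts of links pointing at it.
def make_surrogates(url, page, inlinks):
    """Build the anchor and header surrogate documents for one page."""
    anchor_doc = " ".join(inlinks.get(url, []))   # anchor texts of inlinks
    header_doc = " ".join(filter(None, [
        page.get("title", ""),
        page.get("meta", ""),                     # meta keywords & description
        " ".join(page.get("emphasized", [])),     # emphasized text
    ]))
    return anchor_doc, header_doc
```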
WIDIT: Retrieval Module
1. Parallel searching
   • Multiple document indexes
     o body text (title, body)
     o anchor text (title, inlink anchor text)
     o header text (title, meta kw & desc, first heading, emphasized words)
   • Multiple query formulations
     o stemming (Simple, Combo)
     o expanded query (acronym, noun)
   • Multiple subcollections
     o for search speed and scalability
     o search each subcollection using whole-collection term statistics
2. Merge subcollection search results (see the sketch below)
   • merge & sort by document score
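Because every subcollection is searched with whole-collection term statistics, the scores are directly comparable, so merging reduces to concatenating and sorting. A minimal sketch, assuming each result list holds (doc_id, score) pairs:

```python
def merge_results(result_lists, k=1000):
    """Merge per-subcollection rankings into one list, best first."""
    merged = [pair for lst in result_lists for pair in lst]
    merged.sort(key=lambda pair: pair[1], reverse=True)  # sort by score
    return merged[:k]
```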
WIDIT: Fusion Module
Fusion Formula
Weighted Sum (see the sketch below): $FS_{ws} = \sum_i w_i \cdot NS_i$
   • $w_i$ = weight of system i (relative contribution of each system)
   • $NS_i$ = normalized score of a document by system i
     = $(S_i - S_{min}) / (S_{max} - S_{min})$
Select candidate systems to combine
   • Top performers in each category
     o e.g. best stemmer, query expansion, document index
   • Diverse systems
     o e.g. content-based, link-based
     o Complementary Strength effect
Determine system weights ($w_i$): Static Tuning
   • Evaluate fusion formulas using a fixed set of weight values (e.g. 0.1..1.0) with training data
   • Select the formulas with best performance
   • One-time brute-force combinations for validation
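A minimal sketch of the weighted-sum fusion with min-max normalization, assuming each run is a {doc_id: raw_score} dict; in static tuning the weights would be swept over a fixed grid (e.g. 0.1..1.0) on training data.

```python
def min_max_normalize(scores):
    """NS_i = (S_i - S_min) / (S_max - S_min)."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0                  # guard against identical scores
    return {doc: (s - lo) / span for doc, s in scores.items()}

def weighted_sum_fusion(runs, weights):
    """FS_ws = sum of w_i * NS_i over the systems being combined."""
    fused = {}
    for system, scores in runs.items():
        for doc, ns in min_max_normalize(scores).items():
            fused[doc] = fused.get(doc, 0.0) + weights[system] * ns
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)
```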
WIDIT: Query Classification Module
Statistical Classification (SC)
   • Classifiers
     o Naïve Bayes
     o SVM
   • Training data
     o Titles of 2003 topics (50 TD, 150 HP, 150 NP)
     o w/ and w/o stemming (Combo stemmer)
   • Training data enrichment for the TD class
     o Added top-level Yahoo Government category labels
Linguistic Classification (LC)
   • Word cues (ad-hoc heuristics)
     o e.g. HP if ends in all caps, NP if contains YYYY, TD if short topic
   • HP and NP lexicons
Combination (more ad-hoc heuristics; see the sketch below)
   • if strong word cue, use LC
   • else if single word, assign TD
   • else use SC
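A minimal sketch of the combination logic, with two illustrative LC cues standing in for the full lexicons; `sc_classifier` is a placeholder for the trained Naïve Bayes/SVM model.

```python
import re

def strong_word_cue(title):
    """Two of the slide's cues; the real LC uses richer lexicons."""
    if re.search(r"\b(19|20)\d{2}\b", title):   # contains YYYY -> named page
        return "NP"
    words = title.split()
    if words and len(words[-1]) > 1 and words[-1].isupper():
        return "HP"                             # ends in all caps -> home page
    return None

def classify_query(title, sc_classifier):
    cue = strong_word_cue(title)
    if cue:                                     # strong word cue: trust LC
        return cue
    if len(title.split()) == 1:                 # single word: topic distillation
        return "TD"
    return sc_classifier(title)                 # otherwise fall back to SC
```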
WIDIT: Re-ranking Module
Re-ranking Features
   • Field-specific match
     o query words, acronyms, phrases in URL, title, header, anchor text
   • Exact match
     o title, header text, anchor text
     o body text
   • Indegree & outdegree
   • URL type: root, subroot, path, file
     o based on URL ending and slash count
   • Page type: HPP, HP, NPP, NP, ??
     o based on word cues & heuristics
Re-ranking Formula
   • Weighted sum of re-ranking features (see the sketch below)
Dynamic Tuning
   • Dynamic/interactive optimization of the QT-specific re-ranking formula
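A minimal sketch of a QT-specific weighted-sum re-rank; the feature names and weight values are hypothetical placeholders for what dynamic tuning would actually set per query type.

```python
# Hypothetical per-query-type weights; dynamic tuning adjusts these
# interactively while watching the effect on the ranked list.
WEIGHTS = {
    "HP": {"exact_title": 2.0, "indegree": 1.0, "url_root": 1.5},
    "NP": {"exact_title": 1.5, "indegree": 0.3, "url_root": 0.0},
    "TD": {"acronym":     1.0, "outdegree": 0.8, "url_root": 1.0},
}

def rerank_score(orig_score, features, qtype):
    """Original retrieval score plus a weighted sum of feature scores."""
    w = WEIGHTS[qtype]
    return orig_score + sum(w.get(name, 0.0) * val
                            for name, val in features.items())
```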
WIDIT: Dynamic Tuning Interface
[Screenshot of the dynamic tuning interface]
Dynamic Tuning: Observations
Effective re-ranking factors
   • HP
     o indegree, outdegree, exact match, URL/page type
     o minimum outdegree = 1
   • NP
     o indegree, outdegree, URL type
     o about 1/3 of the impact seen for HP
   • TD
     o acronym, outdegree, URL type
     o minimum outdegree = 10
Strengths
   • Combines human intelligence (pattern recognition) with the computational power of the machine
   • Good for tuning systems with many parameters
   • Facilitates failure analysis
Weaknesses
   • Over-tuning
   • Sensitivity to initial results & re-ranking parameter selection
Results
Run Descriptions
   • Best fusion run: F3 = 0.4*A + 0.3*F1 + 0.3*F2
     o where F1 = 0.8*B + 0.05*A + 0.15*H (A = anchor, B = body, H = header)
   • Dynamic re-ranking run: DR_o (dynamic re-ranking w/ official QT)

Run             MAP (TD)          MRR (HP)          MRR (NP)
DR_o            0.1349 (+38.5%)   0.6265 (+47.2%)   0.6545 (+6.7%)
F3 (baseline)   0.0974            0.4256            0.6134
TREC Median     0.1010            0.5838            0.5888
(percentages are improvements over the F3 baseline)

Observations
   • Dynamic tuning works well
     o significant improvement over baseline (TD, HP)
   • NP re-ranking needs to be optimized
     o relatively small improvement from re-ranking
Discussion: Web IR Methods
What worked?
Fusion
• Combining multiple sources of evidence (MSE)
Dynamic Tuning
• Helps multi-parameter tuning & failure analysis
What next?
Expanded MSE mining
• Web server and search engine logs
Enhanced Reranking Feature Selection & Scoring
• Modified PageRank/HITS
• Link noise reduction based on page layout
Streamlined Fusion Optimization
Discussion: Fusion Optimization
Conventional Fusion Optimization approaches
Exhaustive parameter combination
• Step-wise search of the whole solution space
• Computationally demanding when the number of parameters is large
Parameter combination based on past evidence
• Targeted search of restricted solution space
i.e., parameter ranges are estimated based on training data
Next-Generation Fusion Optimization approaches
Non-linear transformation functions for re-ranking feature scores (see the sketch below)
• e.g. log transformation to compensate for the power-law distribution of PageRank
Hybrid Fusion Optimization
• Semi-automatic Dynamic Tuning
• Automatic Fusion Optimization by Category
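A minimal sketch of the log-transform idea: because PageRank scores are power-law distributed, taking the log before min-max scaling keeps a handful of very high-scoring pages from dominating the fused score. The epsilon guard is an implementation detail, not from the slides.

```python
import math

def log_scale(scores, eps=1e-9):
    """Log-transform then min-max scale a {doc: score} map to [0, 1]."""
    logged = {doc: math.log(s + eps) for doc, s in scores.items()}
    lo, hi = min(logged.values()), max(logged.values())
    span = (hi - lo) or 1.0
    return {doc: (v - lo) / span for doc, v in logged.items()}
```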
Automatic Fusion Optimization
[Flowchart: result sets are fetched for different system categories (Category 1: top 10 systems; Category 2: top system for each query length; ...; Category n) into a results pool; automatic fusion optimization runs over the pool and loops back for more result sets until the performance gain falls below a threshold, at which point it outputs the optimized fusion formula.]
Resources
WIDIT (Web Information Discovery Integrated Tool) Lab
   • http://widit.slis.indiana.edu/
WIDIT projects
   • http://elvis.slis.indiana.edu/
Dynamic Tuning Interface (TREC Web track)
   • http://elvis.slis.indiana.edu/TREC/web/results/test/postsub0/wdf3oks0a.htm
Dynamic Tuning Interface (TREC HARD track)
   • http://elvis.slis.indiana.edu/TREC/hard/results/test/postsub0/wdf3oks0a.htm

Thank you!
Questions?
SMART
Length-Normalized Term Weights
   • SMART lnu weight for document terms
   • SMART ltc weight for query terms

$$d_{ik} = \frac{\log(f_{ik}) + 1}{\sqrt{\sum_{j=1}^{t} (\log(f_{ij}) + 1)^2}}
\qquad
q_k = \frac{(\log(f_k) + 1)\, idf_k}{\sqrt{\sum_{j=1}^{t} [(\log(f_j) + 1)\, idf_j]^2}}$$

where:
   f_ik = number of times term k appears in document i
   idf_k = inverse document frequency of term k
   t = number of terms in the document/query

Document Score
   • inner product of document and query vectors

$$\vec{q}^{\,T} \vec{d}_i = \sum_{k=1}^{t} q_k \, d_{ik}$$

where:
   q_k = weight of term k in the query
   d_ik = weight of term k in document i
   t = number of terms common to query & document
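A minimal sketch of this scoring as reconstructed above (log-tf with cosine normalization for the document, log-tf times idf with cosine normalization for the query), assuming plain {term: frequency} dicts; the pivoted length normalization of the full lnu scheme is not shown.

```python
import math

def log_tf(freqs):
    return {t: math.log(f) + 1.0 for t, f in freqs.items() if f > 0}

def cosine_normalize(weights):
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {t: w / norm for t, w in weights.items()}

def smart_score(doc_freqs, query_freqs, idf):
    d = cosine_normalize(log_tf(doc_freqs))
    q = cosine_normalize({t: w * idf.get(t, 0.0)
                          for t, w in log_tf(query_freqs).items()})
    return sum(q[t] * d[t] for t in q.keys() & d.keys())  # inner product
```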
Okapi
Document Ranking

$$\sum_{T \in Q} w_{RS} \cdot \frac{(k_1 + 1)\, tf}{K + tf} \cdot \frac{(k_3 + 1)\, qtf}{k_3 + qtf}$$

where:
   Q = query containing terms T
   K = k_1 * ((1 - b) + b * (doc_length / avg_doc_length))
   tf = term frequency in a document
   qtf = term frequency in a query
   k_1, b, k_3 = parameters (1.2, 0.75, 7..1000)
   w_RS = Robertson-Sparck Jones weight:

$$w_{RS} = \log \frac{(r + 0.5)(N - n - R + r + 0.5)}{(n - r + 0.5)(R - r + 0.5)}$$

   N = total number of documents in the collection
   n = total number of documents in which the term occurs
   R = total number of relevant documents in the collection
   r = total number of relevant documents retrieved

Document term weight (simplified formula, with r = R = 0):

$$\log\frac{N - n + 0.5}{n + 0.5} \cdot \frac{tf}{K + tf}$$

Query term weight:

$$\frac{(k_3 + 1)\, qtf}{k_3 + qtf}$$
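A minimal sketch of the simplified scoring above (no relevance information, so r = R = 0 in the RSJ weight), assuming {term: frequency} dicts and a document-frequency map `df`.

```python
import math

def okapi_score(doc_tf, query_tf, df, N, doc_len, avg_len,
                k1=1.2, b=0.75, k3=1000):
    """Simplified Okapi score of one document for one query."""
    K = k1 * ((1 - b) + b * doc_len / avg_len)   # length-normalized k1
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0 or term not in df:
            continue
        w_rs = math.log((N - df[term] + 0.5) / (df[term] + 0.5))
        score += w_rs * (tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score
```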
Webpage Type Identification
URL Type (Tomlinson, 2003; Kraaij et al., 2002)
• Heuristic (see the sketch below)
  o root:    slash_cnt = 0, or HP_end & slash_cnt = 1
  o subroot: HP_end & slash_cnt = 2
  o path:    HP_end & slash_cnt >= 3
  o file:    rest
  (HP_end = 1 if the URL ends with index.htm, default.htm, /, etc.)
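A minimal sketch of this heuristic; the HP_END list mirrors only the examples given on the slide, and slash_cnt counts the slashes in the URL path.

```python
from urllib.parse import urlparse

HP_END = ("index.htm", "index.html", "default.htm", "default.html", "/")

def url_type(url):
    """Classify a URL as root / subroot / path / file."""
    path = urlparse(url).path            # "" for a bare domain
    hp_end = path.endswith(HP_END)       # trailing "/" counts as HP_end
    slash_cnt = path.count("/")
    if slash_cnt == 0 or (hp_end and slash_cnt == 1):
        return "root"
    if hp_end and slash_cnt == 2:
        return "subroot"
    if hp_end and slash_cnt >= 3:
        return "path"
    return "file"
```

For example, "http://a.gov/" maps to root, "http://a.gov/x/" to subroot, and "http://a.gov/x/y.html" to file.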
Page Type
• Heuristic (see the sketch below)
  o if "welcome" or "home" in title, header, or anchor text → HPP
  o else if "YYYY" in title or anchor → NPP
  o else if NP lexicon word → NP
  o else if HP lexicon word → HP
  o else if ends in all caps → HP
  o else → ??
• NP lexicon
  o about, annual, report, guide, studies, history, new, how
• HP lexicon
  o office, bureau, department, institute, center, committee, agency, administration, council, society, service, corporation, commission, board, division, museum, library, project, group, program, laboratory, site, authority, study, industry
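A minimal sketch of the page-type cascade using the slide's lexicons; the year regex stands in for the "YYYY" cue, and the field arguments are assumed to be pre-extracted text.

```python
import re

NP_LEXICON = {"about", "annual", "report", "guide", "studies",
              "history", "new", "how"}
HP_LEXICON = {"office", "bureau", "department", "institute", "center",
              "committee", "agency", "administration", "council",
              "society", "service", "corporation", "commission", "board",
              "division", "museum", "library", "project", "group",
              "program", "laboratory", "site", "authority", "study",
              "industry"}

def page_type(title, header, anchor):
    """Classify a page as HPP / NPP / NP / HP / ?? via word cues."""
    words = set(f"{title} {header} {anchor}".lower().split())
    if {"welcome", "home"} & words:
        return "HPP"                                  # home-page phrase
    if re.search(r"\b(19|20)\d{2}\b", f"{title} {anchor}"):
        return "NPP"                                  # named-page phrase
    if words & NP_LEXICON:
        return "NP"
    if words & HP_LEXICON:
        return "HP"
    parts = title.split()
    if parts and len(parts[-1]) > 1 and parts[-1].isupper():
        return "HP"                                   # ends in all caps
    return "??"
```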