Transcript AIRS 2005

Fusion-based Approach to
Web Search Optimization
Kiduk Yang, Ning Yu
WIDIT Laboratory
SLIS, Indiana University
OUTLINE
 Introduction
 WIDIT in TREC Web Track
 Results & Discussion
Introduction
 Web IR
  Challenges
  • Size, heterogeneity, and quality of data
  • Diversity of user tasks, interests, and characteristics
  Opportunities
  • Diverse sources of evidence
  • Data abundance
 WIDIT Approach to Web IR
  • Leverage multiple sources of evidence
  • Utilize multiple methods
  • Apply fusion
 Research Questions
  • What to combine?
  • How to combine?
WIDIT in Web Track 2004
 Data
  Documents
  • 1.25 million .gov Web pages (18 GB)
  Topics (i.e., queries)
  • 75 Topic Distillation (TD), 75 Home Page (HP), 75 Named Page (NP)
 Task
  • Retrieve relevant documents given a mixed set of query types (QT)
 Main Strategy
  Fusion of multiple data representations
  • Static tuning (QT-independent)
  Fusion of multiple sources of evidence
  • Dynamic tuning (QT-specific)
WIDIT: Web IR System Architecture
[Architecture diagram: Documents feed the Indexing Module, which builds body, anchor, and header indexes (each split into sub-indexes). Topics pass through the Query Classification Module, which produces query types plus simple and expanded queries. The Retrieval Module searches the indexes with the queries; the Fusion Module combines the search results via static tuning, and the Re-ranking Module re-ranks the fused result via dynamic tuning to produce the final result.]
WIDIT: Indexing Module

 Document Indexing (steps 1-2 sketched below)
  1. Strip HTML tags
     • extract title, meta keywords & description, emphasized words
     • parse out hyperlinks (URLs & anchor texts)
  2. Create surrogate documents
     • anchor texts of inlinks
     • header texts (title, meta text, emphasized text)
  3. Create subcollection indexes
     • stop & stem (Simple, Combo stemmer)
     • compute SMART & Okapi term weights
  4. Compute whole-collection term statistics
 Query Indexing
  1. Stop & stem
  2. Identify nouns and phrases
  3. Expand acronyms
  4. Mine synonyms and definitions from Web search
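A minimal sketch of document-indexing steps 1-2 referenced above: stripping HTML, pulling out header fields, and building inlink-anchor surrogate documents. The regular expressions, field names, and helper functions are illustrative assumptions, not WIDIT's actual code.

```python
# Illustrative sketch of WIDIT-style document indexing (steps 1-2).
import re
from collections import defaultdict

TAG_RE   = re.compile(r"<[^>]+>")
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.I | re.S)
META_RE  = re.compile(r'<meta\s+name="(?:keywords|description)"\s+content="(.*?)"', re.I)
LINK_RE  = re.compile(r'<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>', re.I | re.S)
EMPH_RE  = re.compile(r"<(?:b|i|em|strong|h[1-6])[^>]*>(.*?)</(?:b|i|em|strong|h[1-6])>", re.I | re.S)

def parse_page(url, html):
    """Step 1: strip tags and extract title, meta text, emphasized text, and hyperlinks."""
    title = " ".join(TITLE_RE.findall(html))
    meta  = " ".join(META_RE.findall(html))
    emph  = " ".join(TAG_RE.sub(" ", m) for m in EMPH_RE.findall(html))
    links = [(href, TAG_RE.sub(" ", text).strip()) for href, text in LINK_RE.findall(html)]
    return {"url": url,
            "title": title,
            "header": " ".join([title, meta, emph]),   # header-text surrogate
            "body": TAG_RE.sub(" ", html),
            "outlinks": links}

def build_anchor_surrogates(pages):
    """Step 2: surrogate document for a URL = its title + anchor texts of its inlinks."""
    inlink_anchors = defaultdict(list)
    for page in pages:
        for href, anchor_text in page["outlinks"]:
            inlink_anchors[href].append(anchor_text)
    return {p["url"]: " ".join([p["title"]] + inlink_anchors.get(p["url"], []))
            for p in pages}
```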
WIDIT: Retrieval Module
 1. Parallel Searching
    Multiple document indexes
    • body text (title, body)
    • anchor text (title, inlink anchor text)
    • header text (title, meta keywords & description, first heading, emphasized words)
    Multiple query formulations
    • stemming (Simple, Combo)
    • expanded query (acronym, noun)
    Multiple subcollections
    • for search speed and scalability
    • search each subcollection using whole-collection term statistics
 2. Merge subcollection search results (sketched below)
    • merge & sort by document score
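A minimal sketch of step 2 above: because every subcollection is searched with whole-collection term statistics, the scores are assumed directly comparable and can simply be pooled and sorted. The data layout is an assumption.

```python
def merge_subcollection_results(result_lists, top_n=1000):
    """result_lists: one [(doc_id, score), ...] list per subcollection.
    Scores were computed with whole-collection term statistics, so they
    are comparable across subcollections and can be merged directly."""
    merged = [pair for results in result_lists for pair in results]
    merged.sort(key=lambda pair: pair[1], reverse=True)   # sort by score, descending
    return merged[:top_n]
```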
WIDIT: Fusion Module

 Fusion Formulas
  Weighted Sum (sketched below)

    $$FS_{ws} = \sum_i w_i \cdot NS_i$$

    where:
      w_i  = weight of system i (relative contribution of each system)
      NS_i = normalized score of a document by system i
           = (S_i - S_min) / (S_max - S_min)

 Select candidate systems to combine
  Top performers in each category
  • e.g., best stemmer, query expansion, document index
  Diverse systems
  • e.g., content-based, link-based
  • complementary strength effect
  One-time brute-force combinations for validation

 Determine system weights (w_i)
  Static Tuning
  • Evaluate fusion formulas over a fixed set of weight values (e.g., 0.1..1.0) on training data
  • Select the formulas with the best performance
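A minimal sketch of the weighted-sum fusion formula above together with a brute-force static-tuning loop over a fixed weight grid (0.1..1.0). The `evaluate` callback (e.g., MAP over training topics) and the run data structures are assumptions.

```python
# Weighted-sum fusion: FS_ws = sum_i w_i * NS_i, where NS_i is the
# min-max normalized score of a document in system i's result list,
# plus a brute-force "static tuning" search over a fixed weight grid.
from itertools import product

def normalize(run):
    """run: dict doc_id -> raw score; returns min-max normalized scores."""
    lo, hi = min(run.values()), max(run.values())
    span = (hi - lo) or 1.0
    return {doc: (score - lo) / span for doc, score in run.items()}

def weighted_sum_fusion(runs, weights):
    """runs: list of doc->score dicts; weights: one weight per run."""
    fused = {}
    for run, w in zip(runs, weights):
        for doc, ns in normalize(run).items():
            fused[doc] = fused.get(doc, 0.0) + w * ns
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

def static_tuning(runs, evaluate, grid=tuple(i / 10 for i in range(1, 11))):
    """Try every weight combination in the grid on training data and keep the
    best; `evaluate` maps a fused ranking to an effectiveness score (placeholder)."""
    best_score, best_weights = float("-inf"), None
    for weights in product(grid, repeat=len(runs)):
        score = evaluate(weighted_sum_fusion(runs, weights))
        if score > best_score:
            best_score, best_weights = score, weights
    return best_weights, best_score
```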
WIDIT: Query Classification Module

 Statistical Classification (SC)
  Classifiers
  • Naïve Bayes
  • SVM
  Training data
  • Titles of 2003 topics (50 TD, 150 HP, 150 NP)
  • w/ and w/o stemming (Combo stemmer)
  Training data enrichment for the TD class
  • Added top-level Yahoo Government category labels
 Linguistic Classification (LC)
  Word cues
  • Create HP and NP lexicons
  Ad-hoc heuristics
  • e.g., HP if the query ends in all caps, NP if it contains YYYY, TD if it is a short topic
 Combination (sketched below)
  More ad-hoc heuristics
  • if strong word cue → LC
  • else if single word → TD
  • else → SC
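A hedged sketch of the combination heuristic above (strong word cue → linguistic classifier, single-word query → TD, otherwise statistical classifier). The lexicons are abbreviated and `statistical_class` stands in for the Naïve Bayes/SVM classifier.

```python
# Combination of linguistic (LC) and statistical (SC) query-type
# classification, following the heuristic on this slide.
import re

HP_LEXICON = {"office", "bureau", "department", "institute", "center"}   # abbreviated
NP_LEXICON = {"about", "annual", "report", "guide", "history"}           # abbreviated

def linguistic_class(query):
    """Return 'HP' or 'NP' on a strong word cue, else None."""
    words = set(query.lower().split())
    if query.split() and query.split()[-1].isupper():     # ends in all caps -> HP
        return "HP"
    if re.search(r"\b(19|20)\d{2}\b", query):             # contains a year (YYYY) -> NP
        return "NP"
    if NP_LEXICON & words:
        return "NP"
    if HP_LEXICON & words:
        return "HP"
    return None

def classify_query(query, statistical_class):
    cue = linguistic_class(query)
    if cue is not None:                  # strong word cue: trust LC
        return cue
    if len(query.split()) == 1:          # single-word query: assume TD
        return "TD"
    return statistical_class(query)      # otherwise fall back to SC
```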
WIDIT: Re-ranking Module

 Re-ranking Features
  Field-specific match
  • query words, acronyms, phrases in URL, title, header, anchor text
  Exact match
  • title, header text, anchor text
  • body text
  Indegree & outdegree
  URL type: root, subroot, path, file
  • based on URL ending and slash count
  Page type: HPP, HP, NPP, NP, ??
  • based on word cues & heuristics
 Re-ranking Formula (sketched below)
  • Weighted sum of re-ranking features
 Dynamic Tuning
  • Dynamic/interactive optimization of the QT-specific re-ranking formula
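A minimal sketch of the re-ranking formula above: a weighted sum of the original retrieval score and the re-ranking feature values, with a separate weight set per query type. The feature names and weights here are illustrative; they are the quantities that dynamic tuning adjusts interactively.

```python
# Re-ranking score = weighted sum of the original retrieval score and the
# re-ranking feature values; each query type (TD/HP/NP) gets its own weights.
EXAMPLE_HP_WEIGHTS = {          # hypothetical weights for home-page queries
    "orig_score": 1.0, "exact_match": 0.6, "field_match": 0.4,
    "indegree": 0.3, "outdegree": 0.1, "url_root": 0.5, "page_type_hp": 0.5,
}

def rerank(results, features, weights=EXAMPLE_HP_WEIGHTS):
    """results: [(doc_id, score), ...]; features: doc_id -> {feature: value}."""
    rescored = []
    for doc, score in results:
        values = dict(features.get(doc, {}), orig_score=score)
        new_score = sum(weights.get(name, 0.0) * value for name, value in values.items())
        rescored.append((doc, new_score))
    return sorted(rescored, key=lambda item: item[1], reverse=True)
```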
WIDIT: Dynamic Tuning Interface
[Screenshot of the WIDIT dynamic tuning interface]
Dynamic Tuning: Observations

 Effective re-ranking factors
  HP
  • indegree, outdegree, exact match, URL/page type
  • minimum outdegree = 1
  NP
  • indegree, outdegree, URL type
    o about 1/3 the impact seen for HP
  TD
  • acronym, outdegree, URL type
  • minimum outdegree = 10
 Strengths
  • Combines human intelligence (pattern recognition) with the computational power of the machine
  • Good for tuning systems with many parameters
  • Facilitates failure analysis
 Weaknesses
  • Over-tuning
  • Sensitive to the initial results & the selection of re-ranking parameters
Results

 Run Descriptions
  Best fusion run: F3 (composition sketched below)
  • F3 = 0.4*A + 0.3*F1 + 0.3*F2, where F1 = 0.8*B + 0.05*A + 0.15*H
    (A = anchor run, B = body run, H = header run)
  Dynamic re-ranking runs
  • DR_o: dynamic re-ranking of F3 w/ official QT

                  MAP (TD)          MRR (HP)          MRR (NP)
  TREC Median     0.1010            0.5838            0.5888
  F3 (baseline)   0.0974            0.4256            0.6134
  DR_o            0.1349 (+38.5%)   0.6265 (+47.2%)   0.6545 (+6.7%)
  (percentages are improvements of DR_o over the F3 baseline)

 Observations
  Dynamic tuning works well
  • significant improvement over the baseline (TD, HP)
  NP re-ranking needs to be optimized
  • relatively small improvement from re-ranking
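For concreteness, a small sketch of how the F3 score of a document is composed from the (assumed min-max normalized) anchor, body, and header run scores. F2 is another intermediate fusion run whose weights are not given on the slide, so its score is taken as an input here.

```python
def f1_score(a, b, h):
    """F1 = 0.8*B + 0.05*A + 0.15*H (A = anchor, B = body, H = header)."""
    return 0.8 * b + 0.05 * a + 0.15 * h

def f3_score(a, f1, f2):
    """F3 = 0.4*A + 0.3*F1 + 0.3*F2 (F2: intermediate run not defined on the slide)."""
    return 0.4 * a + 0.3 * f1 + 0.3 * f2
```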
Discussion: Web IR Methods
 What worked?
  Fusion
  • Combining multiple sources of evidence (MSE)
  Dynamic Tuning
  • Helps multi-parameter tuning & failure analysis
 What next?
  Expanded MSE mining
  • Web server and search engine logs
  Enhanced re-ranking feature selection & scoring
  • Modified PageRank/HITS
  • Link noise reduction based on page layout
  Streamlined fusion optimization
Discussion: Fusion Optimization
 Conventional fusion optimization approaches
  Exhaustive parameter combination
  • Step-wise search of the whole solution space
  • Computationally demanding when the number of parameters is large
  Parameter combination based on past evidence
  • Targeted search of a restricted solution space
    (i.e., parameter ranges are estimated from training data)
 Next-generation fusion optimization approaches
  Non-linear transformation functions for re-ranking feature scores (sketched below)
  • e.g., log transformation to compensate for the power-law distribution of PageRank
  Hybrid fusion optimization
  • Semi-automatic dynamic tuning
  • Automatic fusion optimization by category
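A small sketch of the proposed non-linear transformation: log-scaling a heavily skewed (power-law) feature such as PageRank or indegree before it enters the weighted-sum formula. The normalization choice is an assumption.

```python
# Log transformation to compress a power-law-distributed feature
# (e.g. PageRank or indegree) into [0, 1] before fusing it with
# other re-ranking features.
import math

def log_scale(value, max_value):
    """Map a skewed, non-negative feature value onto a [0, 1] log scale."""
    return math.log1p(value) / math.log1p(max_value) if max_value > 0 else 0.0
```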
Automatic Fusion Optimization
[Flowchart: result sets are fetched for different categories of systems (Category 1: top 10 systems; Category 2: top system for each query length; ... Category n) into a results pool; automatic fusion optimization is run on the pool, and if the performance gain exceeds a threshold the optimized fusion formula is output, otherwise the process iterates.]
Resources
WIDIT (Web Information Discovery Integrated Tool) Lab:
http://widit.slis.indiana.edu/
http://elvis.slis.indiana.edu/
Dynamic Tuning Interface (Web track)
http://elvis.slis.indiana.edu/TREC/web/results/test/postsub0/wdf3oks0a.htm
WIDIT → projects → TREC → Web track
Dynamic Tuning Interface (HARD track)
http://elvis.slis.indiana.edu/TREC/hard/results/test/postsub0/wdf3oks0a.htm
WIDIT → projects → TREC → HARD track
Thank you!
Questions?
SMART

 Length-Normalized Term Weights
  • SMART lnu weight for document terms
  • SMART ltc weight for query terms

    $$d_{ik} = \frac{\log(f_{ik}) + 1}{\sum_{j=1}^{t} \left(\log(f_{ij}) + 1\right)}
    \qquad
    q_k = \frac{\left(\log(f_k) + 1\right) \cdot idf_k}{\sqrt{\sum_{j=1}^{t} \left[\left(\log(f_j) + 1\right) \cdot idf_j\right]^2}}$$

    where:
      f_ik  = number of times term k appears in document i
      idf_k = inverse document frequency of term k
      t     = number of terms in the document/query

 Document Score
  • inner product of document and query vectors

    $$\vec{q}^{\,T} \cdot \vec{d}_i = \sum_{k=1}^{t} q_k \, d_{ik}$$

    where:
      q_k  = weight of term k in the query
      d_ik = weight of term k in document i
      t    = number of terms common to the query & document
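A minimal sketch of the weighting and scoring formulas above; raw term-frequency dictionaries and idf values are assumed to be precomputed elsewhere.

```python
# SMART-style weights as in the formulas above, plus the inner-product score.
import math

def doc_weights(tf):
    """tf: term -> raw frequency in the document (d_ik above)."""
    logs = {term: math.log(freq) + 1.0 for term, freq in tf.items()}
    denom = sum(logs.values()) or 1.0
    return {term: w / denom for term, w in logs.items()}

def query_weights(tf, idf):
    """ltc-style query weights (q_k): log tf * idf with cosine normalization."""
    raw = {term: (math.log(freq) + 1.0) * idf.get(term, 0.0) for term, freq in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
    return {term: w / norm for term, w in raw.items()}

def inner_product(query_w, doc_w):
    """Document score: sum over terms common to the query and the document."""
    return sum(w * doc_w[term] for term, w in query_w.items() if term in doc_w)
```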
Okapi

 Document Ranking

    $$\sum_{T \in Q} w^{RS} \cdot \frac{(k_1 + 1)\, tf}{K + tf} \cdot \frac{(k_3 + 1)\, qtf}{k_3 + qtf}$$

    where:
      Q   = query containing terms T
      K   = k1 * ((1 - b) + b * (doc_length / avg_doc_length))
      tf  = term frequency in a document
      qtf = term frequency in the query
      k1, b, k3 = parameters (1.2, 0.75, 7..1000)
      w^RS = Robertson-Sparck Jones weight:

    $$w^{RS} = \log \frac{(r + 0.5)\,/\,(R - r + 0.5)}{(n - r + 0.5)\,/\,(N - n - R + r + 0.5)}$$

 Document term weight (simplified formula)

    $$\log\!\left(\frac{N - n + 0.5}{n + 0.5}\right) \cdot \frac{tf}{K + tf}$$

 Query term weight

    $$\frac{(k_3 + 1)\, qtf}{k_3 + qtf}$$

    where:
      N = total number of documents in the collection
      n = total number of documents in which the term occurs
      R = total number of relevant documents in the collection
      r = total number of relevant documents retrieved
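A minimal sketch of Okapi scoring using the simplified document term weight and the query term weight above, with the stated parameter values (k1 = 1.2, b = 0.75, and k3 = 7 from the 7..1000 range); relevance counts (R, r) are omitted, as in the simplified formula.

```python
# Okapi scoring with the simplified document term weight above:
#   log((N - n + 0.5)/(n + 0.5)) * tf/(K + tf) * (k3 + 1)*qtf/(k3 + qtf)
# where K = k1*((1 - b) + b*doc_len/avg_doc_len).
import math

def okapi_score(query_tf, doc_tf, df, N, doc_len, avg_doc_len,
                k1=1.2, b=0.75, k3=7.0):
    """query_tf/doc_tf: term -> frequency; df: term -> document frequency."""
    K = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0 or term not in df:
            continue
        idf_part = math.log((N - df[term] + 0.5) / (df[term] + 0.5))   # simplified w^RS
        score += idf_part * (tf / (K + tf)) * ((k3 + 1.0) * qtf / (k3 + qtf))
    return score
```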
Webpage Type Identification

 URL Type (Tomlinson, 2003; Kraaij et al., 2002)
  • Heuristic
    o root:    slash_cnt = 0, or (HP_end & slash_cnt = 1)
    o subroot: HP_end & slash_cnt = 2
    o path:    HP_end & slash_cnt >= 3
    o file:    rest
    (HP_end = 1 if the URL ends with index.htm, default.htm, /, etc.)
 Page Type
  • Heuristic
    if "welcome" or "home" in title, header, or anchor text → HPP
    else if "YYYY" in title or anchor → NPP
    else if NP lexicon word → NP
    else if HP lexicon word → HP
    else if ends in all caps → HP
    else → ??
  • NP lexicon
    o about, annual, report, guide, studies, history, new, how
  • HP lexicon
    o office, bureau, department, institute, center, committee, agency, administration, council, society, service, corporation, commission, board, division, museum, library, project, group, program, laboratory, site, authority, study, industry
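A hedged sketch of the two heuristics above. The host-stripping regex is an assumption and the lexicons are abbreviated from the slide.

```python
# Illustrative URL-type and page-type heuristics from this slide.
import re

HP_ENDINGS = ("index.htm", "index.html", "default.htm", "default.html", "/")
NP_LEXICON = {"about", "annual", "report", "guide", "studies", "history", "new", "how"}
HP_LEXICON = {"office", "bureau", "department", "institute", "center", "committee",
              "agency", "administration", "council", "society", "service"}  # abbreviated

def url_type(url):
    path = re.sub(r"^https?://[^/]+", "", url)            # drop scheme and host
    hp_end = path == "" or path.endswith(HP_ENDINGS)      # "HP-style" ending
    slash_cnt = path.count("/")
    if slash_cnt == 0 or (hp_end and slash_cnt == 1):
        return "root"
    if hp_end and slash_cnt == 2:
        return "subroot"
    if hp_end and slash_cnt >= 3:
        return "path"
    return "file"

def page_type(title, header, anchor):
    text = " ".join([title, header, anchor]).lower()
    words = set(text.split())
    if "welcome" in words or "home" in words:
        return "HPP"
    if re.search(r"\b(19|20)\d{2}\b", title + " " + anchor):   # contains a year (YYYY)
        return "NPP"
    if NP_LEXICON & words:
        return "NP"
    if HP_LEXICON & words:
        return "HP"
    if title.split() and title.split()[-1].isupper():          # ends in all caps
        return "HP"
    return "??"
```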