Robust Pseudo Feedback
& HMM Passage Extraction
UIUC at TREC 2006 Genomics Track
Jing Jiang, Xin He, ChengXiang Zhai
University of Illinois at Urbana-Champaign
Goal of Participation
• To test the effectiveness of some recent
language modeling methods for genomics
retrieval
– Robust pseudo feedback [Tao & Zhai 06]
– HMM passage extraction [Jiang & Zhai 06]
• Task at 2006 genomics track
– Document-level retrieval
– Passage-level retrieval
– Aspect-level retrieval
11/16/06
Overall Approach
[Diagram: a query Q is sent to the Document Retrieval Module, which ranks paragraphs from Medline articles (1, 2, …, k) and is connected to pseudo relevance feedback and user relevance feedback. The Passage Extraction Module then turns the ranked paragraphs into ranked passages (1, 2, …, k).]
KL-Divergence Retrieval Model
[Lafferty & Zhai 01]
[Diagram: a topic language model is compared with the language model of each document D1, D2, …, Dk.]
topic model: role 0.2, prnp 0.2, mad 0.2, cow 0.2, diseas 0.2
document model: the 0.020, for 0.015, prp 0.102, mad 0.034, cow 0.034, diseas 0.068, …
D1: "The…for…spongiform…PrP protein…"
D2: "Prion diseases…that…(PrP C)…This…"
Dk: "…which…(PrP C)…to the…prion protein…"
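The ranking idea on this slide can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `neg_kl_score`, the `floor` probability used for words missing from a document model (a crude stand-in for proper smoothing), and the `off_topic` document are assumptions made for illustration. The topic and document probabilities are the ones shown on the slide.

```python
import math

def neg_kl_score(query_model, doc_model, floor=1e-4):
    """Return -KL(query || doc); a higher score means a better match.

    `floor` is a hypothetical stand-in for smoothing of unseen words.
    """
    score = 0.0
    for word, p_q in query_model.items():
        if p_q <= 0.0:
            continue  # zero-probability query words contribute nothing
        p_d = doc_model.get(word, floor)
        score -= p_q * math.log(p_q / p_d)
    return score

# Topic model and one document model, with the values from the slide.
topic = {"role": 0.2, "prnp": 0.2, "mad": 0.2, "cow": 0.2, "diseas": 0.2}
doc = {"the": 0.020, "for": 0.015, "prp": 0.102,
       "mad": 0.034, "cow": 0.034, "diseas": 0.068}

# A hypothetical off-topic document that never mentions the topic words.
off_topic = {"the": 0.05, "for": 0.03}

scores = {"doc": neg_kl_score(topic, doc),
          "off_topic": neg_kl_score(topic, off_topic)}
# Rank documents by decreasing score (increasing divergence from the topic).
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because the topic-model entropy is the same for every document, ranking by this score is equivalent to ranking by the cross-entropy term alone.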
Model-Based Feedback
[Zhai & Lafferty 01]
[Diagram: a feedback language model is estimated from the feedback documents D1, D2, …, Dk with the EM algorithm, given a background model, and is shown next to the topic model.]
topic model: role 0.2, prnp 0.2, mad 0.2, cow 0.2, diseas 0.2
background model: the 0.02, for 0.01, …, prp 0.003, prion 0.004
feedback model (estimated by the EM algorithm): the 0.003, for 0.002, …, prp 0.02, prion 0.05
2 parameters: α and λ
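As a rough illustration of the estimation step, the sketch below uses the two-component mixture formulation of model-based feedback: each word occurrence in the feedback documents is assumed to come from the background model with probability λ and from the unknown feedback model otherwise, EM fits the feedback model, and α interpolates it with the original topic model. The word counts, the background probabilities, the parameter values, and the function names are made up for the example and are not taken from the talk.

```python
def estimate_feedback_model(counts, background, lam=0.5, iterations=50):
    """EM for p(w | feedback) under the mixture (1 - lam) * feedback + lam * background."""
    vocab = list(counts)
    theta = {w: 1.0 / len(vocab) for w in vocab}  # uniform starting point
    for _ in range(iterations):
        # E-step: probability that an occurrence of w came from the feedback model.
        resp = {}
        for w in vocab:
            topic_part = (1.0 - lam) * theta[w]
            resp[w] = topic_part / (topic_part + lam * background[w])
        # M-step: re-estimate the feedback model from the expected counts.
        total = sum(counts[w] * resp[w] for w in vocab)
        theta = {w: counts[w] * resp[w] / total for w in vocab}
    return theta

def interpolate(query_model, feedback_model, alpha=0.5):
    """Updated query model: (1 - alpha) * query + alpha * feedback."""
    words = set(query_model) | set(feedback_model)
    return {w: (1.0 - alpha) * query_model.get(w, 0.0)
               + alpha * feedback_model.get(w, 0.0) for w in words}

# Hypothetical word counts pooled over the feedback documents.
counts = {"the": 50, "for": 30, "prp": 10, "prion": 10}
# Hypothetical background model restricted to the same vocabulary.
background = {"the": 0.6, "for": 0.3, "prp": 0.05, "prion": 0.05}

feedback = estimate_feedback_model(counts, background, lam=0.5)
query = {"role": 0.2, "prnp": 0.2, "mad": 0.2, "cow": 0.2, "diseas": 0.2}
updated_query = interpolate(query, feedback, alpha=0.5)
```

In this toy setting the background model absorbs much of the mass of common words such as "the", so the feedback model gives topical words such as "prion" more weight than their raw relative frequency.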
Regularized Estimation
[Tao & Zhai 06]
[Diagram: the feedback model is estimated from the feedback documents D1, D2, …, Dk with a regularized EM algorithm, given a background model, with the topic model labeled as the prior.]
topic model (prior): role 0.2, prnp 0.2, mad 0.2, cow 0.2, diseas 0.2
background model: the 0.02, for 0.01, …, prp 0.003, prion 0.004
feedback model (estimated by the regularized EM algorithm): the 0.003, for 0.002, …, prp 0.02, prion 0.05
1 parameter: η
Original vs. Regularized EM
[Diagram: α shown over the feedback documents D1, D2, …, Dk for each method.]
original: α manually set
regularized: α dynamically set
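HMM Passage Extraction
[Jiang & Zhai 06]
[Diagram: an HMM labels each word w of a paragraph with a state. Words before the relevant passage are labeled B (state B1), words inside it are labeled R, and words after it are labeled B (state B2), e.g. B B … B R R R … R R R B … B.]
emission probabilities:
p(w|B1): the: 0.02, for: 0.01, prp: 0.001, …
p(w|R): the: 0.003, for: 0.002, prp: 0.02, …
p(w|B2): the: 0.02, for: 0.01, prp: 0.001, …
transition probabilities:
p(B1|B1) = 0.9, p(R|B1) = 0.1
p(R|R) = 0.95, p(B2|R) = 0.05
p(B2|B2) = 1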
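One way to read the diagram is as a decoding problem: find the most likely state sequence for a paragraph and return the words labeled R. The sketch below does this with standard Viterbi decoding using the transition and emission values from the slide. Several things are assumptions for the sake of a runnable example: the paragraph is assumed to start in B1, words without a listed emission probability get the same hypothetical `FLOOR` value in every state, and the token sequence is a toy input rather than data from the experiments.

```python
import math

STATES = ["B1", "R", "B2"]
# Transition probabilities from the slide; missing entries mean probability 0.
TRANS = {"B1": {"B1": 0.9, "R": 0.1},
         "R": {"R": 0.95, "B2": 0.05},
         "B2": {"B2": 1.0}}
BACKGROUND = {"the": 0.02, "for": 0.01, "prp": 0.001}
EMIT = {"B1": BACKGROUND,
        "R": {"the": 0.003, "for": 0.002, "prp": 0.02},
        "B2": BACKGROUND}
FLOOR = 1e-4  # hypothetical emission probability for unlisted words

def safe_log(p):
    return math.log(p) if p > 0 else -math.inf

def viterbi(words):
    """Return the most likely state sequence, assuming the paragraph starts in B1."""
    if not words:
        return []
    score = {s: (safe_log(EMIT[s].get(words[0], FLOOR)) if s == "B1" else -math.inf)
             for s in STATES}
    backpointers = []
    for w in words[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor for state s at this position.
            prev = max(STATES, key=lambda p: score[p] + safe_log(TRANS[p].get(s, 0.0)))
            new_score[s] = (score[prev] + safe_log(TRANS[prev].get(s, 0.0))
                            + safe_log(EMIT[s].get(w, FLOOR)))
            pointers[s] = prev
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

words = ["the", "for", "prp", "prp", "prp", "the", "for", "the", "for"]
path = viterbi(words)
passage = [w for w, s in zip(words, path) if s == "R"]
```

With these numbers, a run of topical words has to be strong enough to pay for the two low-probability transitions (into R and out to B2), which is what makes the extracted passage a single contiguous span.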
HMM Passage Extraction
[Jiang & Zhai 06]
[Diagram: extended HMM with states B1, R, B2, B3 and E.]
• transition probabilities estimated from observations
• an end-of-paragraph state (E)
• a background state for smoothing
Experiment Design
• Pre-processing
– HTML parsing
– paragraph boundaries
– Tokenization
• User relevance feedback
Official Runs
[Diagram: pipeline over paragraphs of Medline articles, with components Q, KL-Div Retrieval, ranked paragraphs (1, 2, …, k), Q', HMM Passage Extraction, and ranked passages (1, 2, …, k).]
UIUCauto
[Diagram: the same pipeline (Q, KL-Div Retrieval, ranked paragraphs, Q', HMM Passage Extraction, ranked passages), annotated with "regularized estimation".]
UIUCinter
[Diagram: the same pipeline (Q, KL-Div Retrieval, ranked paragraphs, Q', HMM Passage Extraction, ranked passages), annotated with "regularized estimation".]
UIUCinter2
[Diagram: the same pipeline (Q, KL-Div Retrieval, ranked paragraphs, Q', HMM Passage Extraction, ranked passages), annotated with "original estimation" and an additional label "F".]
Pseudo Relevance Feedback
(k = 10)
Method                                    Doc MAP   Rel. Impr.
Baseline (no feedback)                    0.3484    N/A
Original Estimation, Def                  0.3606    +3.50%
Original Estimation, Opt                  0.3943    +13.2%
Regularized Estimation, Def (UIUCauto)    0.3842    +10.3%
Regularized Estimation, Opt               0.3952    +13.4%
η is similar to λ / (1 − λ)
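The relative improvements in this table are measured against the no-feedback baseline. As a quick check, the small sketch below recomputes them from the Doc MAP values; the function and variable names are only illustrative.

```python
def relative_improvement(baseline, value):
    """Relative improvement of `value` over `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline

baseline_map = 0.3484  # Baseline (no feedback)
doc_map = {
    "Original, Def": 0.3606,
    "Original, Opt": 0.3943,
    "Regularized, Def": 0.3842,
    "Regularized, Opt": 0.3952,
}
improvements = {name: relative_improvement(baseline_map, value)
                for name, value in doc_map.items()}
```

Rounded to one decimal place, these values agree with the Rel. Impr. column above.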
Parameter Sensitivity
(pseudo feedback, k = 10)
User Relevance Feedback
Doc MAP
Method                         Pseudo Feedback     User Feedback        Rel. Impr.
Original Estimation, Def       0.3606              0.3986               +10.5%
Original Estimation, Opt       0.3943              0.4511               +14.4%
Regularized Estimation, Def    0.3842 (UIUCauto)   0.4261 (UIUCinter)   +10.9%
Regularized Estimation, Opt    0.3952              0.4515               +14.2%
HMM Passage Extraction
Psg MAP
Run          Paragraph   HMM Passage   Rel. Impr.
UIUCauto     0.03753     0.04864       +29.6%
UIUCinter    0.04481     0.05906       +31.8%
UIUCinter2   0.04580     0.06038       +31.8%
Passage Length (In Bytes)
                Max    Min   Avg      Std
True Passages   6928   27    399.8    489.4
HMM Passages    6955   34    1525.8   949.7
Paragraph       8670   60    2105.4   1136.8
HMM passages are generally too long!
Example Passage
Prion diseases, which include Creutzfeldt-Jacob disease
in humans, mad cow disease in cattle, and scrapie in
sheep, involve the misfolding of the benign cellular
prion protein (PrP C) 1 to the infectious disease-causing
scrapie isoform PrP Sc. The prion protein (PrP C) is a
copper-binding cell surface glycoprotein. The role of copper
in the normal function of PrP, as well as in prion diseases,
has been the subject of a number of excellent reviews. The
mature cellular form of PrP consists of residues 23 to 231
and is tethered to the cell surface via a
glycosylphosphatidylinositol anchor at the C terminus. There
are now a number of NMR solution structures of copper-free
mammalian PrPs. A crystal structure of PrP C has also been
published; this structure is dimeric involving domain
swapping of the monomeric form.
Conclusions and Future Work
• In general, the two language modeling methods work well in the genomics domain
– Regularized feedback estimation can effectively eliminate parameter α
– HMM passages improve over paragraphs
• User relevance feedback is effective
• Limitations and future work
– Regularized feedback estimation still has parameter η to tune
• How to eliminate η?
– The inherent coherence property of HMM passages may not suit the task well
• Different/better HMM architecture?
The End
• Questions?