Tehran University
Using OWA Fuzzy Operator to Merge
Retrieval System Results
CAASL2 2007
July 21-22
Hadi Amiri, Abolfazl AleAhmad, Caro Lucas, Masoud Rahgozar
School of Electrical and Computer Engineering
University of Tehran
Farhad Oroumchian
University of Wollongong in Dubai
Outline
The Persian Language
Used Methods
Vector Space Model
Language Modeling
OWA Operator
The test collections
Experiment results
Conclusion
University of Tehran - Database Research Group
The Persian Language
It is spoken in countries such as Iran, Tajikistan and Afghanistan
It is written in an Arabic-like script of 32 characters that are written continuously from right to left
Its morphological analyzers need to deal with many word forms that are not actually Farsi
Example
• The word “کافر” (singular) and its plural “کفار”
• Or “عادت”, which has two plural forms in Farsi:
– Farsi form: “عادت ها”
– Arabic form: “عادات”
Vector Space Model
List of weights that produced the best results
Name | Weighting
tf.idf | tf*log(N/n) / ((tf^2) * (qtf^2))
lnc.ltc | (1+log(tf))*(1+log(qtf))*log((1+N)/n) / ((tf^2) * (qtf^2))
nxx.bpx | (0.5+0.5*tf/max tf) + log((N-n)/n)
tfc.nfc | tf*log(N/n)*(0.5+0.5*qtf/max qtf)*log(N/n) / ((tf^2) * (qtf^2))
tfc.nfx1 | tf*log(N/n)*(0.5+0.5*qtf/max qtf)*log(N/n) / ((tf*log(N/n))^2)
tfc.nfx2 | tf*log(N/n)*(0.5+0.5*qtf/max qtf)*log(N/n) / ((tf^2))
Lnu.ltu (best) | ((1+log(tf))*(1+log(qtf))*log((1+N)/n)) / ((1+log(average tf)) * ((1-s) + s * N.U.W / average N.U.W)^2)
We used the Lnu.ltu and Lnc.btc weighting schemes
Language Modeling
Hiemstra (2001) proposed four ways to compute the rank of a document d against a query q with query terms t1,...,tn, where P(D=d) is the prior probability of relevance of document d.
Lambda (λ) is a smoothing parameter and is equal for each query term. Hiemstra (2002) emphasizes that if no previous relevance information is available for a query, each query term is considered equally important.
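Language Modeling - Cont.
LM1(d) = \sum_{i=1}^{n} \log\left(1 + \frac{\lambda \cdot tf(t_i,d) \cdot \sum_t cf(t)}{(1-\lambda) \cdot cf(t_i) \cdot \sum_t tf(t,d)}\right)
LM2(d) = \sum_{i=1}^{n} \log\left(1 + \frac{\lambda \cdot tf(t_i,d) \cdot \sum_t df(t)}{(1-\lambda) \cdot df(t_i) \cdot \sum_t tf(t,d)}\right)
LM3(d) = \log\left(\sum_t tf(t,d)\right) + \sum_{i=1}^{n} \log\left(1 + \frac{\lambda \cdot tf(t_i,d) \cdot \sum_t cf(t)}{(1-\lambda) \cdot cf(t_i) \cdot \sum_t tf(t,d)}\right)
LM4(d) = \log\left(\sum_t tf(t,d)\right) + \sum_{i=1}^{n} \log\left(1 + \frac{\lambda \cdot tf(t_i,d) \cdot \sum_t df(t)}{(1-\lambda) \cdot df(t_i) \cdot \sum_t tf(t,d)}\right)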
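A minimal Python sketch of the four scoring variants, assuming each query term contributes log(1 + λ·tf(ti,d)·Σt bg(t) / ((1-λ)·bg(ti)·Σt tf(t,d))), where bg is cf for LM1/LM3 and df for LM2/LM4, and that LM3/LM4 add the log document length as a prior. The function name lm_score, its parameters, and the default λ = 0.5 are illustrative assumptions, not values taken from the slides.

```python
import math


def lm_score(query_terms, doc_tf, background, lam=0.5, length_prior=False):
    """Score one document against a query (sketch, not the authors' code).

    doc_tf       -- dict: term -> tf(t, d) for this document
    background   -- dict: term -> cf(t) (LM1/LM3) or df(t) (LM2/LM4)
    lam          -- smoothing parameter lambda, same for every query term
                    (0.5 is an assumed default)
    length_prior -- False for LM1/LM2, True for LM3/LM4
    """
    doc_len = sum(doc_tf.values())        # sum_t tf(t, d)
    bg_total = sum(background.values())   # sum_t cf(t) or sum_t df(t)
    if doc_len == 0:
        # An empty document matches nothing; log(0) is undefined for the prior.
        return float("-inf") if length_prior else 0.0

    score = math.log(doc_len) if length_prior else 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        bg = background.get(term, 0)
        if tf == 0 or bg == 0:
            continue  # the term adds log(1 + 0) = 0
        ratio = (lam * tf * bg_total) / ((1.0 - lam) * bg * doc_len)
        score += math.log(1.0 + ratio)
    return score


# Toy statistics, invented for illustration only.
doc_tf = {"a": 2, "b": 2}
cf = {"a": 4, "b": 4, "c": 8}
lm1 = lm_score(["a"], doc_tf, cf)                     # no prior, cf background
lm3 = lm_score(["a"], doc_tf, cf, length_prior=True)  # length prior, cf background
```

Passing a df table instead of cf gives the LM2 and LM4 variants under the same assumptions.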
OWA Operator
We used the OWA operator as the merge operator. The OWA operator with n dimensions is a nonlinear aggregation operator OWA: [0, 1]^n \to [0, 1] with a weighting vector W=[w1,w2,…,wn] such that \sum_{i=1}^{n} w_i = 1 with wi in [0, 1].
The OWA weight of each document d is defined as:
OWA(d) = OWA(x_1, x_2, ..., x_n)
where xi is the score of document d in the ith list.
Each score xi is assigned to document d by Ri (the ith search engine). If d is not present in the ith list then xi=0.
OWA Operator - Cont.
The OWA weight of each document is computed by this equation:
OWA(d) = OWA(x_1, x_2, ..., x_n) = W^T \cdot B = \sum_{i=1}^{n} w_i \, b_i
in which W^T is the transpose of the vector W, which defines the semantics associated with the OWA operator, and B=[b1,b2,..,bn] is the vector X=[x1, x2,…, xn] reordered so that bj=Minj(x1, x2,…, xn), that is, the jth smallest element of x1, x2,…, xn.
We used a simple function to bring the scores ({xi, i=1,…,n}) onto the same scale
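The aggregation above can be sketched in Python as follows. The names owa and min_max are illustrative. The slides do not say which "simple function" was used for rescaling, so min-max scaling here is only an assumption.

```python
def min_max(scores):
    """Rescale one engine's scores to [0, 1].

    ASSUMPTION: min-max scaling stands in for the unspecified
    'simple function' mentioned on the slide.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def owa(scores, weights):
    """OWA(d) = sum_i w_i * b_i, where b_j is the j-th smallest score.

    scores  -- x_1..x_n, the score of document d in each engine's list
               (0 when d is absent from a list)
    weights -- w_1..w_n, the weighting vector W
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    ordered = sorted(scores)  # ascending, so ordered[0] is the smallest score
    return sum(w * b for w, b in zip(weights, ordered))


# A document scored by three engines and missing from the second list.
x = [0.9, 0.0, 0.4]
all_score = owa(x, [1, 0, 0])           # picks the minimum (AND-like)
at_least_one = owa(x, [0, 0, 1])        # picks the maximum (OR-like)
in_between = owa(x, [0, 0.5, 0.5])      # averages the two largest scores
```

With the ascending ordering, a weight on the first position rewards only documents that every engine scored highly, while a weight on the last position rewards a document that any single engine scored highly.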
OWA Operator - Weighting Methods
Quantifier Based Weighting
Degree of Importance Based Weighting
Quantifier Based Weighting
We used the linguistic quantifiers All, Most_n, Few_n, and At-Least-One as the weighting schemes:
All: considers only documents that appear in all retrieval engines’ lists. This quantifier is suitable when the user is looking for a precise answer.
Most: a fuzzy majority operator that treats retrieval by most of the engines as sufficient for inclusion in the fused list.
Few: a weaker weighting scheme in which it is enough for a document to be retrieved by a few of the retrieval engines.
At-Least-One: the weakest weighting scheme, in which it is enough for a document to appear in only one retrieval engine’s list to be included in the fused list.
Hence the All quantifier has AND semantics and the At-Least-One quantifier has OR semantics. The Most and Few quantifiers have semantics in between the AND and OR operators.
Degree of Importance Based Weighting
As the second weighting scheme we use the position of the documents in the retrieved lists to produce the weighting vector W= [w1, w2,…, wn]. The weight of each document d in the list Li,q is defined by
w_i = \frac{N_i - POS_i + 1}{N_i}
in which Ni is the number of elements in the ith list, Li,q, and POSi is the position of document d in Li,q.
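A short Python sketch of this position-based weight, assuming the formula reads w_i = (N_i - POS_i + 1) / N_i with positions counted from 1. The function name doi_weight is illustrative.

```python
def doi_weight(position, list_length):
    """Weight of a document from its rank in one engine's list.

    ASSUMPTION: w_i = (N_i - POS_i + 1) / N_i, positions start at 1,
    so the top document gets weight 1 and the last gets 1 / N_i.
    """
    if not 1 <= position <= list_length:
        raise ValueError("position must be between 1 and list_length")
    return (list_length - position + 1) / list_length


top = doi_weight(1, 20)      # first of 20 documents
middle = doi_weight(11, 20)  # eleventh of 20 documents
last = doi_weight(20, 20)    # last of 20 documents
```

Under this reading, the weight decreases linearly with rank, so documents near the top of a list count more in the merged score.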
Test Collections
1. Qvanin Collection
Documents: Iranian Law Collection
• 177089 passages
• 41 queries and relevance judgments
2. Hamshahri Collection
Documents: 600+ MB of news from Hamshahri Newspaper
• 160000+ news articles
• 60 queries and relevance judgments
3. BijanKhan Tagged Collection
Documents: 100+ MB from different sources
• A tag set of 41 tags
• 2590000+ tagged words
Hamshahri Collection
We used HAMSHAHRI, a test collection for Persian text prepared and distributed by the DBRG (IR team) of the University of Tehran
The 3rd version:
– contains 160000+ distinct textual news articles in Farsi
– 60 queries and relevance judgments for the top 20 relevant documents for each query
Experiment results
[Figure: Precision (axis from 0.32 to 0.62) at document cut-offs 5, 10, 15 and 20 for Lnc.btc, Lnu.ltu 0.25, LM1, LM2, LM3 and LM4]
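Precision at a document cut-off, the measure used in these results, can be sketched in Python as below. The function name precision_at_k and the document identifiers are made up for illustration. The sketch assumes binary relevance judgments.

```python
def precision_at_k(ranked_doc_ids, relevant_ids, k):
    """Fraction of the top-k ranked documents that are judged relevant.

    ranked_doc_ids -- document ids in ranked order, best first
    relevant_ids   -- set of ids judged relevant for the query
    k              -- document cut-off (e.g. 5, 10, 15 or 20)
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


# Hypothetical ranking and judgments for one query.
ranking = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
p_at_2 = precision_at_k(ranking, relevant, 2)
p_at_5 = precision_at_k(ranking, relevant, 5)
```

Averaging this value over all queries at each cut-off gives one point per method per cut-off.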
Quantifier Based OWA Weighting
Method | Weighting Vector | Orness Degree
All | W=[1,0,0,0,0,0] | 0.00
Most2 | W=[0,.5,.5,0,0,0] | 0.30
Most3 | W=[0,.33,.33,.33,0,0] | 0.40
Most4 | W=[0,.25,.25,.25,.25,0] | 0.50
Few3 | W=[0,0,.33,.33,.33,0] | 0.59
Few2 | W=[0,0,0,.5,.5,0] | 0.70
At-Least-One | W=[0,0,0,0,0,1] | 1.00
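The orness degrees listed for these weighting vectors are consistent with the usual orness measure once it is adapted to scores sorted in ascending order, orness(W) = (1/(n-1)) * sum_i (i-1) * w_i. The slides do not state this formula, so the Python sketch below is an assumption that happens to reproduce the listed values to two decimals.

```python
def orness(weights):
    """Orness of an OWA weighting vector applied to ascending-sorted scores.

    ASSUMPTION: orness(W) = (1 / (n - 1)) * sum_i (i - 1) * w_i with
    1-based i, so 0.0 means pure AND (minimum) and 1.0 pure OR (maximum).
    """
    n = len(weights)
    if n < 2:
        raise ValueError("need at least two weights")
    # enumerate starts at 0, which is exactly (i - 1) for 1-based i
    return sum(i * w for i, w in enumerate(weights)) / (n - 1)


vectors = {
    "All": [1, 0, 0, 0, 0, 0],
    "Most2": [0, .5, .5, 0, 0, 0],
    "Most3": [0, .33, .33, .33, 0, 0],
    "Most4": [0, .25, .25, .25, .25, 0],
    "Few3": [0, 0, .33, .33, .33, 0],
    "Few2": [0, 0, 0, .5, .5, 0],
    "At-Least-One": [0, 0, 0, 0, 0, 1],
}
degrees = {name: round(orness(w), 2) for name, w in vectors.items()}
```

Vectors with weight concentrated on the early positions lean toward AND, and those with weight on the late positions lean toward OR.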
Experiment results
[Figure: results (axis from 0.500 to 0.620) at document cut-offs 5, 10, 15 and 20 for Most2, Most3, Most4, Few2, Few3 and DOI]
Experiment results
[Figure: results (axis from 0.500 to 0.620) at document cut-offs 5, 10, 15 and 20 for Most3, Most4, LM4 and Lnu.ltu 0.25]
Statistical significance tests
Wilcoxon Signed Rank
Method | Most3 | Most4 | LM4
LM4 | 0.861 | 0.894 | ---
Lnu.ltu 0.25 | 0.081 | 0.077 | 0.444
T-Test
Method | Most3 | Most4 | LM4
LM4 | 0.456 | 0.383 | ---
Lnu.ltu 0.25 | 0.028 | 0.027 | 0.147
Statistical significance tests
Based on the T-Test, both the Most3 and Most4 methods are significantly better than the LM4 method, which confirms the Wilcoxon Signed Rank test.
However, with the T-Test we cannot confirm the significance of the Most3 and Most4 methods over the Lnu.ltu method with slope of 0.25.
Conclusion
We used two weighting methods: quantifier based and degree-of-importance based.
The experimental results show that the best OWA operators, Most3 and Most4 (quantifier based OWA operators), only marginally improve over LM4, the best retrieval method on Persian text.
However, they seem to produce better rankings, since they push the relevant documents to higher ranks.
The significance tests we conducted seem to confirm that Most3 and Most4 are significantly better than all other methods except Lnu.ltu with slope of 0.25.
The superiority over Lnu.ltu with slope of 0.25 was not confirmed by the T-Test.
Thanks. Questions?
http://ece.ut.ac.ir/dbrg