
Tehran University
N-Gram and Local Context Analysis for
Persian text retrieval
ISSPA 2007
12 January
Abolfazl AleAhmad, Parsia Hakimian, Farzad Mahdikhani
School of Electrical and Computer Engineering
University of Tehran
Farhad Oroumchian
University of Wollongong in Dubai
Outline
The Persian Language
Used Methods
Pivoted normalization
N-Gram approach
Local Context Analysis
The test collections
Our experiment and the results
Conclusion
University of Tehran - Database Research Group
The Persian Language
It is spoken in countries such as Iran, Tajikistan and Afghanistan
It is written in an Arabic-like script of 32 characters, written continuously from right to left
Its morphological analyzers need to deal with many forms of words that are not actually Farsi
Example
• The word "کافر" (singular) → "کفار" (plural)
• Or "عادت", which has two plural forms in Farsi:
– Farsi form: "عادت ها"
– Arabic form: "عادات"
So N-grams are a solution
Our Study
We investigated the vector space model on the Persian language in three configurations:
unstemmed single terms
N-gram based
Local Context Analysis
using the HAMSHAHRI collection, which contains 160,000+ news articles
Vector Space Model
List of weights that produced the best results:

Name       Weighting
tf.idf     tf · log(N/n) / √(Σ tf² · Σ qtf²)
lnc.ltc    (1 + log(tf)) · (1 + log(qtf)) · log((1+N)/n) / √(Σ tf² · Σ qtf²)
nxx.bpx    (0.5 + 0.5 · tf/max tf) + log((N−n)/n)
tfc.nfc    tf · log(N/n) · (0.5 + 0.5 · qtf/max qtf) · log(N/n) / √(Σ tf² · Σ qtf²)
tfc.nfx1   tf · log(N/n) · (0.5 + 0.5 · qtf/max qtf) · log(N/n) / √(Σ (tf · log(N/n))²)
tfc.nfx2   tf · log(N/n) · (0.5 + 0.5 · qtf/max qtf) · log(N/n) / √(Σ tf²)
Lnu.ltu    ((1 + log(tf)) · (1 + log(qtf)) · log((1+N)/n)) / ((1 + log(average tf)) · ((1−s) + s · N.U.W / average N.U.W)²)   ← Best
Problem with Document Length Normalization
It is supposed to remove the differences between documents' lengths
Under cosine normalization, shorter documents get higher weights, but they are less relevant
[Figure: average probability of relevance/retrieval vs. average of median bin length]
Lnu.ltu weighting scheme
A good weighting scheme proposed by Amit Singhal, et al. and tested on TREC collections
Based on reducing the gap between relevance and retrieval

Lnu = [(1 + log(tf)) / (1 + log(average tf))] / [(slope × N.U.T) + (1 − slope) × pivot]

ltu = [(ln(tf) + 1.0) × ln(N/n)] / [(slope × N.U.T) + (1 − slope) × pivot]
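The two factors can be transcribed directly into Python. This is a sketch with our own variable names, not the authors' code: nut stands for N.U.T. (the number of unique terms in the document), and pivot and slope are collection-level constants.

```python
import math

def lnu(tf, avg_tf, nut, pivot, slope=0.25):
    # Log-averaged term frequency, divided by the pivoted
    # unique-term normalizer.
    return ((1 + math.log(tf)) / (1 + math.log(avg_tf))) / (
        slope * nut + (1 - slope) * pivot)

def ltu(tf, N, n, nut, pivot, slope=0.25):
    # (ln tf + 1) times idf, with the same pivoted normalizer.
    return ((math.log(tf) + 1.0) * math.log(N / n)) / (
        slope * nut + (1 - slope) * pivot)
```

When the document's number of unique terms equals the pivot, the denominator reduces to the pivot itself, so documents of average vocabulary size are left essentially unchanged.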
Pivoted Normalization
[Figure: probability vs. document length, comparing the old normalization factor with the final (pivoted) normalization factor]
Source: A. Singhal, et al. "Pivoted Document Length Normalization"
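The idea behind the figure, as given in Singhal et al.'s paper: the old normalization factor is tilted around a pivot point so that short documents are normalized more and long documents less. A minimal sketch (the function name is ours):

```python
def pivoted_norm(old_norm, pivot, slope):
    """Pivoted normalization: (1 - slope) * pivot + slope * old_norm.

    At old_norm == pivot the old and new factors coincide; below the
    pivot the new factor exceeds the old one, above the pivot it is
    smaller, which counteracts cosine normalization's bias toward
    short documents.
    """
    return (1.0 - slope) * pivot + slope * old_norm
```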
N-Gram Approach
N-grams are substrings of length n.
In this approach the whole text is considered a stream of characters, which is then broken down into substrings of length n.
It is remarkably resistant to textual errors (e.g. OCR) and needs no linguistic knowledge.
Example:
"مخابرات" for n = 4:
مخاب خابر ابرا برات رات
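The extraction step can be sketched in Python. This is a minimal illustration using an English word; note it emits only full-length grams, whereas the slide's Persian example also lists a shorter trailing fragment.

```python
def ngrams(text, n=4):
    """Break a string into overlapping character n-grams.

    The text is treated as a stream of characters and cut into
    substrings of length n, one starting at each position.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# An 8-letter word yields 5 overlapping 4-grams.
print(ngrams("telecoms", 4))  # ['tele', 'elec', 'leco', 'ecom', 'coms']
```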
Word Mismatch Problem
Automatic query expansion is a good solution to the word mismatch problem in IR:
Local Analysis
• + Expansion is based on high-ranking documents
• − Needs an extra search
• − Some queries may retrieve few relevant documents
Global Analysis
• + Robust average performance
• − Expensive in terms of disk space and CPU
• − Individual queries can be significantly degraded
Local Context Analysis
Local Context Analysis (LCA) is an automatic query expansion method.
It combines global analysis (use of context and phrase structure) with local feedback (top-ranked documents).
LCA is fully automated; no information needs to be collected from the user other than the initial query.
+ It is computationally practical
− But it needs an extra search to retrieve the top-ranked documents
Local Context Analysis (Cont.)
LCA has three main steps:
1. Run the user's query, break the top N retrieved documents into passages, and rank the passages.
2. Calculate the similarity of each concept c in the top-ranked passages to the entire original query q using the similarity function:

   sim(q, c) = ∏_{ki ∈ q} ( δ + log(f(c, ki) · idf_c) / log n )^{idf_i}

3. Add the top M ranked concepts to the original query and run the initial retrieval method with the expanded query.
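The per-concept scoring in step 2 can be sketched in Python. This is a simplified illustration, not the authors' code: the function and argument names are ours, f(c, ki) is passed in as a precomputed co-occurrence table, the small delta follows Xu and Croft's formulation of LCA, and a +1 inside the logarithm is our guard against log 0.

```python
import math

def lca_sim(query_terms, cooccurrence, idf_concept, idf, n_passages, delta=0.1):
    """Similarity of one candidate expansion concept c to the whole query.

    cooccurrence[ki] plays the role of f(c, ki): the co-occurrence count
    of the concept with query term ki over the top-ranked passages.
    Raising each factor to idf[ki] emphasizes rare query terms.
    """
    sim = 1.0
    for ki in query_terms:
        f = cooccurrence.get(ki, 0)
        # +1 keeps the logarithm defined when the concept never
        # co-occurs with ki (our assumption, not from the slide).
        factor = delta + math.log(f * idf_concept + 1) / math.log(n_passages)
        sim *= factor ** idf[ki]
    return sim
```

Concepts that co-occur often with rare query terms score highest and become the top M expansion terms.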
Test Collections
1. Qavanin Collection
   Documents: Iranian law collection
   • 177,089 passages
   • 41 queries with relevance judgments
2. Hamshahri Collection
   Documents: 600+ MB of news from the Hamshahri newspaper
   • 160,000+ news articles
   • 60 queries with relevance judgments
3. BijanKhan Tagged Collection
   Documents: 100+ MB from different sources
   • A tag set of 41 tags
   • 2,590,000+ tagged words
Hamshahri Collection
We used HAMSHAHRI, a test collection for Persian text prepared and distributed by DBRG (the IR team) of the University of Tehran.
The 3rd version:
– contains 160,000+ distinct textual news articles in Farsi
– 60 queries, with relevance judgments for the top 20 relevant documents of each query
Some Example Queries
Women's rights law (قانون حقوق زنان)
Contamination of the Persian Gulf (آلودگی خلیج فارس)
Increase of gas price (افزایش قیمت بنزین)
Iranian wrestling (کشتی فرنگی ایران)
Bird migration (کوچ پرندگان)
Term-based vector space model
A. Singhal, et al., in their paper "Pivoted Document Length Normalization", reported that the following two configurations have the best performance:
Slope = 0.25 with pivoted unique normalization (P.U.N.)
• Pivot = average number of unique terms in a document
Slope = 0.75 with pivoted cosine normalization (P.C.N.)
• Pivot = average cosine factor for 1 + log(tf) weights
Our experiment results
Comparison of the vector space model with slope = 0.25 and slope = 0.75:
[Figure: precision vs. recall for Lnu.ltu with slope = 0.25 (P.U.N.) and slope = 0.75 (P.C.N.)]
Our experiment results
Comparison of vector space model and LCA:
In LCA we used Lnu.ltu (slope=0.25 and P.U.N.)
[Figure: precision vs. recall for LCA and for Lnu.ltu with slope = 0.25 (P.U.N.)]
N-Gram Experiments
Next, we assessed the N-gram based vector space model for N = 3, 4, 5 on the HAMSHAHRI collection.
In addition to Lnu.ltu we assessed atc.atc, in which both queries and documents are weighted as follows:

atc = (0.5 + 0.5 × tf / max tf) × ln(N/n) × 1/√(Σᵢ wᵢ²)
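The atc weight for every term of a document (or query) vector, including the cosine-style normalization by the vector norm √(Σ wᵢ²), can be sketched as follows (a sketch with our own names, not the paper's code):

```python
import math

def atc_weights(tfs, max_tf, N, dfs):
    """atc: augmented term frequency times idf, cosine-normalized.

    tfs[t] is the term frequency in the document (or query),
    dfs[t] the number of documents containing t, N the collection size.
    """
    raw = {t: (0.5 + 0.5 * tf / max_tf) * math.log(N / dfs[t])
           for t, tf in tfs.items()}
    # Divide by the vector norm so every vector has unit length.
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}
```

Because both query and document use the same scheme, their dot product is a pure cosine similarity.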
N-Gram experiment results
N-Grams using atc.atc and lnu.ltu (slope=0.25)
weighting schemes
[Figure: precision vs. recall for term-based, 3-gram, 4-gram and 5-gram indexing under the Lnu.ltu and atc.atc weighting schemes]
Conclusion
Previous work: comparison of a vector space system with FuFaIR
They used the first version of the HAMSHAHRI collection (300+ MB), with 30 queries, in their experiments.
In the vector space model the slope was set to 0.75 and the pivot to 13.36.
Conclusion
Comparison of vector space systems with BM25
[Table: precision at document cutoffs (P@…) for vector-Lnu.ltu, vector-tfc.nfx, vector-lnc.ltc and BM25]
Conclusion
Comparison of the best vector space run with the best N-gram runs
Source: F. Oroumchian, F. Mazhar Garamaleki. "An Evaluation of Retrieval Performance Using Farsi Text". First Eurasia Conference on Advances in Information and Communication Technology, Tehran, Iran, October 2002.
[Table: precision at document cutoffs (P@…) on the Qavanin collection for vector-Lnu.ltu, n-gram Lnu.ltu, paice-nxx.bpx and n-gram BM25 runs]
Our experiment best results
Experiments using atc.atc and lnu.ltu (slope=0.25)
weighting schemes
[Figure: precision vs. recall for Lnu.ltu term-based, Lnu.ltu 4-gram-based, and LCA runs]
Conclusion
Results Analysis (N-Gram)
As shown, the 4-gram based vector space model with the Lnu.ltu weighting scheme performs better than FuFaIR and the other vector space models:
This contradicts their relative performance on English.
The rationale is that most Farsi word roots are about 4 characters long.
Our results are more reliable than previous work because we used a better collection.
Conclusion
Results Analysis (LCA)
Local Context Analysis only marginally improved the results over the Lnu.ltu method:
The Lnu.ltu weighting method already performs very well on the Farsi language.
The LCA parameters should be tuned for the HAMSHAHRI collection.
Thanks. Questions?
http://ece.ut.ac.ir/dbrg