Language Model in Turkish IR

Melih Kandemir
F. Melih Özbekoğlu
Can Şardan
Ömer S. Uğurlu
Outline
• Indexing problem and proposed solution
• Previous Work
• System Architecture
• Language Modeling Concept
• Evaluation of the System
• Conclusion
Indexing Problem
• “A Language Modeling Approach to Information Retrieval”, Jay M. Ponte and W. Bruce Croft, 1998
• The indexing model is a central component of probabilistic retrieval models
• Current indexing models do not lead to improved retrieval results
Indexing Problem
• Failure because of unwarranted assumptions:
• 2-Poisson model
– assumes “elite” documents
• N-Poisson model
– a mixture of more than 2 Poisson distributions
Proposed Solution
• Retrieval based on probabilistic language modeling
• A language model is a probability distribution that captures the statistical regularities of language generation
• A separate language model is inferred for each document
Proposed Solution
• Estimate the probability that each document model generates the query
• Documents are ranked by these probabilities
• Users have a reasonable idea of the terms likely to occur in relevant documents
• tf and idf statistics are integral parts of the language model
Previous Work
• Robertson–Sparck Jones and Croft–Harper models
– They focus on relevance
• Fuhr integrated the indexing and retrieval models
– Used statistics as heuristics
• Wong and Yao used utility theory and information theory
Previous Work
• Kalt’s approach is the most similar to ours
– A maximum likelihood estimator is used
– Collection statistics are integral parts of the model
– Documents are members of language classes
System Overview
[Architecture diagram: the user submits queries to an Application Server; the Query Evaluator and the LM-Search Indexer work over the Document Repository and the Index DB (PostgreSQL), producing different result sets for comparison.]
System Architecture
[Pipeline diagram: documents from the Document Repository pass through one of three term-selection options (No Stemming, First 5, Lemmatiser); tf.idf and Language Model weights are then computed during Inverted Index Generation and stored in the Index DB.]
tf.idf vs. Language model
[Diagram: the tf.idf and LM result sets are displayed side by side in a GUI for seeing differences between the results.]
Vocabulary Extraction
• No stemmer
– Turkish is agglutinative
– Expectation: low precision
• First 5 characters
– As effective as more complex solutions
• Lemmatiser
– Expectation: high precision
– Zemberek2 (MPL license)
• Open-source software
• Java interface, easy to use
• Finds the stems of words
• The first valid stem will be used
• Word-sense disambiguation (using WordNet or POS tags) may be added in the future
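The “First 5” option above amounts to plain prefix truncation. A minimal Python sketch for illustration (the actual system is implemented in Java with Zemberek2; the function names here are hypothetical):

```python
def first5(word: str) -> str:
    """Truncate a word to its first 5 characters — a cheap stemming
    surrogate for agglutinative Turkish, where suffixes pile up at the end."""
    return word[:5]

def extract_vocabulary(text: str) -> list[str]:
    """Lowercase, tokenize on whitespace, and truncate each token."""
    return [first5(tok) for tok in text.lower().split()]
```

For example, `first5("evlerimizden")` and `first5("evler")` both map to the same index term `"evler"`, which is why a 5-character prefix can approximate a real stemmer.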
Database: Hybrid Index Structure
• The index contains fields for both methods
• The table can be evaluated in any of 3 index forms:
– Stemmed
– First 5
– Lemmatiser
Inverted Index Implementation
• tf.idf
• Language Model
• PostgreSQL database server
Language Modeling : Inverted Index Implementation

An example inverted index for m terms:

t1 → cft1 → d1 : P(t1|Md1), tf(t1,d1) → … → dn : P(t1|Mdn), tf(t1,dn)
t2 → cft2 → d3 : P(t2|Md3), tf(t2,d3) → … → dp : P(t2|Mdp), tf(t2,dp)
…
tm → cftm → d4 : P(tm|Md4), tf(tm,d4) → … → dk : P(tm|Mdk), tf(tm,dk)

If a document does not contain term t, its probability can still be calculated using cft:
– f̄t : mean frequency of term t in the documents containing it
– p̄avg(t) : mean probability of t in the documents containing it
– cft : raw frequency of t in the whole collection
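The hybrid index above could be sketched in memory as follows. This is an illustrative Python sketch with hypothetical names, not the authors' Java/PostgreSQL implementation; P(t|Md) is filled with the plain maximum likelihood estimate tf/dl:

```python
from collections import defaultdict

def build_inverted_index(docs: dict[str, list[str]]):
    """Build an in-memory inverted index: term -> {cft, postings},
    where postings map doc id -> (tf, P(t|Md) as the ML estimate tf/dl)."""
    index = defaultdict(lambda: {"cft": 0, "postings": {}})
    for doc_id, tokens in docs.items():
        dl = len(tokens)                     # document length in tokens
        tf = defaultdict(int)
        for t in tokens:
            tf[t] += 1
        for t, f in tf.items():
            index[t]["cft"] += f             # collection frequency of t
            index[t]["postings"][doc_id] = (f, f / dl)
    return dict(index)
```

A real deployment would store these fields as columns of a PostgreSQL table rather than a Python dictionary.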
The Baseline Approach : tf.idf
We will use the traditional tf.idf term weighting approach as the baseline model:
• Robertson’s tf score: tf'(t,d) = tf(t,d) / ( tf(t,d) + 0.5 + 1.5 · dld / avgdl )
• Standard idf score: idf(t) = log( N / dft )
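The baseline weight can be sketched as follows, assuming the common Robertson tf formulation tf/(tf + 0.5 + 1.5·dl/avgdl) and idf = log(N/df); the function names are hypothetical:

```python
import math

def robertson_tf(tf: int, dl: int, avg_dl: float) -> float:
    """Robertson's tf: dampens raw counts and normalizes by document length."""
    return tf / (tf + 0.5 + 1.5 * dl / avg_dl)

def idf(n_docs: int, df: int) -> float:
    """Standard idf: log of (collection size / document frequency)."""
    return math.log(n_docs / df)

def tfidf(tf: int, dl: int, avg_dl: float, n_docs: int, df: int) -> float:
    """Baseline term weight: Robertson's tf times the standard idf."""
    return robertson_tf(tf, dl, avg_dl) * idf(n_docs, df)
```

Note that a term occurring in every document gets idf = log(1) = 0 and therefore contributes nothing to the baseline score.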
Language Modeling : Definition
• An alternative approach to indexing and retrieval
• Definition of Language Model: A probability
distribution that captures the statistical regularities
of the generation of language
• The intuition: users have a reasonable idea of the terms that are likely to occur in documents of interest and will choose query terms that distinguish these documents from others in the collection
Language Modeling : The Approach
The following assumptions are not made :
• Term distributions in the documents are
parametric
• Documents are members of pre-defined classes
• “Query generation probability” is used rather than “probability of relevance”
Language Modeling : The Approach
P(t | Md) : the probability that term t is generated by Md, the language model of document d
Language Modeling : Theory
Maximum likelihood estimate of the probability of term t under the term distribution for document d:

  p̂ml(t | Md) = tf(t,d) / dld

tf(t,d) : raw frequency of term t in document d
dld : total number of terms in document d
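As a worked illustration of the maximum likelihood estimate (hypothetical function name, not the system's actual code):

```python
def p_ml(tf_td: int, dl_d: int) -> float:
    """Maximum likelihood estimate: p_ml(t | Md) = tf(t,d) / dl_d."""
    return tf_td / dl_d

# A 4-token document in which "ankara" occurs twice:
doc = ["ankara", "haber", "ankara", "milliyet"]
print(p_ml(doc.count("ankara"), len(doc)))  # 0.5
```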
Language Modeling : Theory
An additional, more robust estimate obtained from a larger amount of data:

  p̂avg(t) = ( Σ d∋t p̂ml(t | Md) ) / dft

p̂avg(t) : mean probability of term t in the documents containing it
dft : number of documents that contain term t
Language Modeling : Theory
The risk function, modelled as a geometric distribution:

  R̂(t,d) = ( 1 / (1 + f̄t) ) × ( f̄t / (1 + f̄t) ) ^ tf(t,d)

f̄t : mean frequency of term t in the documents which contain it
Language Modeling :
The Ranking Formula
Let the probability of term t being produced by document d, given the document model Md, be:

  p̂(t | Md) = p̂ml(t | Md)^(1 − R̂(t,d)) × p̂avg(t)^R̂(t,d)   if tf(t,d) > 0
  p̂(t | Md) = cft / cs   otherwise   (cs : total number of tokens in the collection)

The probability of producing Q for a given document model Md is:

  p̂(Q | Md) = Π t∈Q p̂(t | Md) × Π t∉Q ( 1 − p̂(t | Md) )
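The query-generation scoring can be sketched in Python as follows. This is a hedged illustration of the Ponte–Croft formulation, not the authors' Java implementation; `stats` is a hypothetical per-term lookup carrying (tf in d, p̂avg, f̄t, cft):

```python
def risk(mean_tf: float, tf_td: int) -> float:
    """Geometric risk: R(t,d) = (1/(1+f̄t)) * (f̄t/(1+f̄t))**tf(t,d)."""
    return (1.0 / (1.0 + mean_tf)) * (mean_tf / (1.0 + mean_tf)) ** tf_td

def p_term(tf_td, dl_d, p_avg, mean_tf, cf_t, cs):
    """Smoothed p(t|Md): geometrically mix the ML estimate with the mean
    probability using the risk; back off to cf_t/cs when tf(t,d) = 0."""
    if tf_td == 0:
        return cf_t / cs
    r = risk(mean_tf, tf_td)
    return (tf_td / dl_d) ** (1.0 - r) * p_avg ** r

def p_query(query_terms, all_terms, stats, dl_d, cs):
    """p(Q|Md) = prod over t in Q of p(t|Md)
               * prod over t not in Q of (1 - p(t|Md))."""
    score = 1.0
    for t in all_terms:
        tf_td, p_avg, mean_tf, cf_t = stats[t]
        p = p_term(tf_td, dl_d, p_avg, mean_tf, cf_t, cs)
        score *= p if t in query_terms else (1.0 - p)
    return score
```

Documents are then ranked by `p_query` in descending order; in practice the product is usually computed as a sum of logarithms to avoid underflow.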
Evaluation
• Perform recall/precision experiments
– Recall/precision results
– Non-interpolated average precision
– Precision figures for the top N documents
• For several values of N
• R-Precision
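The two headline metrics above can be sketched directly from their definitions; these are standard IR formulas, with hypothetical function names:

```python
def average_precision(relevant: set, ranking: list) -> float:
    """Non-interpolated average precision: the mean of the precision
    values taken at the rank of each retrieved relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def r_precision(relevant: set, ranking: list) -> float:
    """R-Precision: precision among the top R results, R = |relevant|."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranking[:r] if doc in relevant) / r
```

For a query with relevant set {d1, d3} and ranking [d1, d2, d3], average precision is (1/1 + 2/3)/2 = 5/6 while R-Precision is 1/2.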
Other Metrics
• Compare the baseline (tf.idf) results to our language model
– Percent change between the two result sets
– I/D
• I : count of queries whose performance improved
• D : count of queries whose performance degraded
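These comparison metrics are straightforward to compute; a minimal sketch, assuming per-query scores for both runs and interpreting D as the queries whose performance got worse:

```python
def percent_change(baseline: float, lm: float) -> float:
    """Percent change of the LM score relative to the baseline score."""
    return 100.0 * (lm - baseline) / baseline

def i_over_d(baseline_scores: list, lm_scores: list) -> tuple:
    """Count queries improved (I) and degraded (D) by the LM run,
    comparing query-by-query against the baseline."""
    improved = sum(1 for b, l in zip(baseline_scores, lm_scores) if l > b)
    degraded = sum(1 for b, l in zip(baseline_scores, lm_scores) if l < b)
    return improved, degraded
```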
Document Repository
Document Source
• Milliyet (2001–2005)
• XML file (1.1 GB)
• 408,304 news articles
• Ready for indexing
• XML Schema ......(FIXME)
Summary
• Indexing and stemming
– Zemberek2 lemmatiser
– Java environment
• Data
– News archive from 2001 to 2005, from Milliyet
• Evaluation
– Methods will be compared according to
performance over recall/precision values
Conclusion
• First language modelling approach to Turkish IR
• The LM approach is:
– Non-parametric
– Built on fewer, more relaxed assumptions
• Expected to perform better than the baseline tf.idf method
Thanks for listening
…
Any Questions?