You Can’t Beat Frequency (Unless You Use
Linguistic Knowledge) – A Qualitative
Evaluation of Association Measures for
Collocation and Term Extraction
Joachim Wermter and Udo Hahn
Jena University
ACL 2006 Regular Conference Paper
Objective
• Compare the performance of frequency, t-test, LSM and LPM methods on
collocation extraction and domain-specific
automatic term recognition
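The slides do not define the t-test baseline. As a minimal sketch, the block below uses the common textbook formulation of the t-score for a two-word candidate, which compares the observed relative frequency with the value expected if the two words were independent. The function name `t_score` and its parameters are illustrative assumptions, not taken from the paper.

```python
import math


def t_score(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """t-score of a bigram (x, y) against the independence null hypothesis.

    f_xy: frequency of the bigram, f_x / f_y: frequencies of its words,
    n: total number of bigram tokens in the corpus.
    All names are hypothetical; this is the textbook formulation only.
    """
    if not 0 < f_xy < n:
        raise ValueError("f_xy must be between 1 and n - 1")
    observed = f_xy / n                    # sample mean of the bigram indicator
    expected = (f_x / n) * (f_y / n)       # mean if x and y were independent
    variance = observed * (1 - observed)   # Bernoulli variance of the indicator
    return (observed - expected) / math.sqrt(variance / n)
```

Ranking candidates by raw `f_xy` gives the frequency baseline; ranking them by `t_score` gives the t-test ranking it is compared with.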
Collocation Extraction
• Extract idioms
• “kick the bucket”
Domain-Specific Term Extraction
• Extract domain-specific phrases
• “mitochondrial inheritance”
Corpus
LSM
• A “linguistic knowledge-based” method for
collocation extraction proposed by the same
authors in another paper
• Assumes that idioms are less modifiable by
supplements
– e.g. “kick the beautiful bucket”
• probability of PNVtriple having Suppk :
• f(x) : frequency of x
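The formula on this slide did not survive extraction. One plausible reading, given that f(x) is a frequency, is a relative-frequency estimate: how often the triple occurs with a given supplement, divided by how often it occurs with any supplement. The sketch below implements only that assumed reading; the function name, the `"<none>"` label for the unmodified triple, and the counts are hypothetical.

```python
def supplement_probabilities(supplement_counts: dict[str, int]) -> dict[str, float]:
    """Relative frequency of each supplement observed with one PNV triple.

    supplement_counts maps a supplement (hypothetically "<none>" for the
    unmodified triple) to how often the triple occurred with it.
    Assumption: probability = f(triple, supplement) / sum of all such counts.
    """
    total = sum(supplement_counts.values())
    if total == 0:
        raise ValueError("the triple has no observed occurrences")
    return {supp: count / total for supp, count in supplement_counts.items()}


# Hypothetical counts: an idiom-like triple that is almost never modified.
probs = supplement_probabilities({"<none>": 9, "beautiful": 1})
```

Under this reading, a triple whose probability mass sits on a single supplement (typically none at all) is hard to modify, which is the property LSM treats as evidence for a collocation.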
LSM
• Modifiability of a PNVtriple
• Probability of a PNVtriple
• Collocation Score
LPM
• A “linguistic knowledge-based” method for automatic
term recognition proposed by the same authors in
another paper
• Assumes that words in a phrase are less
interchangeable
– e.g. “mitochondrion inheritance”, “money inheritance”
• Modifiability of a phrase:
• modk(n-gram) : replace k words
• seli : particular replacement
LPM
• Phrase Score:
Evaluation Criteria
• Compared to the baseline frequency ranking
method, a good ranking function should have
four characteristics:
1. Keep the true positives in the upper portion of
the list
2. Keep the true negatives in the lower portion of
the list
3. Demote true negatives from the upper portion
4. Promote true positives from the lower portion
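The four criteria can be counted mechanically once a baseline ranking, a competing ranking and gold labels are available. The sketch below is a simplification that splits each list into a single upper and lower portion at a configurable cutoff; the slides refer to finer portions (such as the top 1/6), and all names here are hypothetical.

```python
def criteria_counts(baseline, method, is_positive, upper_fraction=0.5):
    """Count candidates satisfying each of the four criteria.

    baseline / method: the same candidates, ranked best-first by the
    frequency baseline and by the method under evaluation.
    is_positive: dict mapping each candidate to its gold label.
    """
    cutoff = int(len(baseline) * upper_fraction)
    base_upper = set(baseline[:cutoff])
    meth_upper = set(method[:cutoff])
    counts = {"kept_tp_upper": 0, "kept_tn_lower": 0,
              "demoted_tn": 0, "promoted_tp": 0}
    for cand in baseline:
        in_base, in_meth = cand in base_upper, cand in meth_upper
        if is_positive[cand]:
            if in_base and in_meth:
                counts["kept_tp_upper"] += 1   # criterion 1
            elif not in_base and in_meth:
                counts["promoted_tp"] += 1     # criterion 4
        else:
            if not in_base and not in_meth:
                counts["kept_tn_lower"] += 1   # criterion 2
            elif in_base and not in_meth:
                counts["demoted_tn"] += 1      # criterion 3
    return counts
```

Candidates that move in the wrong direction (true positives demoted, true negatives promoted) fall outside all four counters, so comparing the totals against the number of true positives and true negatives also shows how much a method hurts.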
Collocation Extraction Results
Automatic Term Recognition
Results
Observations
• CE Criterion 1
– t-test and frequency methods have similar
performance
– LSM promotes some TPs to top 1/6
• ATR Criterion 1
– t-test and frequency methods have similar
performance
– LPM promotes a few TPs to top 1/6
Observations
• CE Criterion 2
– LSM promotes a lot more TNs to upper
portion than t-test method (bad…)
• ATR Criterion 2
– Same as above
Observations
• CE Criterion 3
– LSM demotes a lot more TNs to the lower
portion than t-test
• ATR Criterion 3
– Same as above
Observations
• CE Criterion 4
– LSM promotes more TPs to upper portion
than t-test
• ATR Criterion 4
– Same as above
Conclusion
• LSM and LPM methods are better than the t-test and frequency methods
• Purely statistical methods are worse than
knowledge-based methods