Search Engine Statistics
Beyond the n-gram:
Application to Noun
Compound Bracketing
CoNLL-2005
Preslav Nakov
EECS, Computer Science Division
University of California, Berkeley
Marti Hearst
SIMS
University of California, Berkeley
1
Outline
Introduction
Related Work
Models and Features
2
Introduction
Noun compound bracketing
-> Noun compound interpretation
liver cell antibody
liver cell line
[[liver cell] antibody]
[liver [cell line]]
POS equivalent, different syntactic trees
3
This Paper
A highly accurate unsupervised method for
making bracketing decisions for noun
compounds (NCs)
Current: using bigram estimates to compute
adjacency and dependency scores
Improvement
χ2 measure
a new set of surface features for querying Web search
engines
Evaluate on 2 domains, encyclopedia & bioscience
4
Related Work
NC syntax and semantics
Still active -> Journal of Computer Speech and Language, Special Issue on Multiword Expressions
Adjacency model
Probabilistic dependency model, Lauer (1995)
Data sparseness (use categories instead)
244 NCs from encyclopedia
Inter-annotator agreement 81.5%
Baseline 66.8% -> 77.5%
Adding POS -> state-of-the-art result of 80.7%
5
2003~2005
Keller and Lapata (2003)
Use Web Search Engines for obtaining frequencies
for unseen bigrams
Lapata and Keller (2004) apply this to six NLP tasks, including
disambiguation of NCs
Simpler version (use frequency only) - 78.68%
Girju et al. (2005) supervised (decision tree)
(5 WordNet semantic features)
83.1%
6
Models and Features
Adjacency and dependency model
w1 w2 w3 -> [w1 [w2 w3]] (right bracketing) for two possible reasons:
1. w2 w3 is a compound (modified by w1)
home health care
2. w1 and w2 independently modify w3
adult male rat
Adjacency model checks 1.
(Better) Dependency model checks 2.
Left bracketing -> only 1 choice
[law enforcement] agent
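The two models can be sketched as a comparison of word-pair association scores. This is a minimal illustration, assuming the usual formulation from Lauer (1995): the adjacency model compares (w1, w2) against (w2, w3), while the dependency model compares (w1, w2) against (w1, w3). The function names and the toy counts are hypothetical and are not taken from the slides.

```python
def bracket(assoc, w1, w2, w3, model="dependency"):
    """Return 'left', 'right', or None (tie) for the compound w1 w2 w3.

    assoc(x, y) is any association score for the word pair (x, y),
    e.g. a bigram frequency or a chi-square value.
    """
    left = assoc(w1, w2)
    if model == "adjacency":
        right = assoc(w2, w3)   # is w2 w3 a compound?
    else:
        right = assoc(w1, w3)   # does w1 modify w3 directly?
    if left > right:
        return "left"
    if right > left:
        return "right"
    return None


# Made-up counts, for illustration only.
TOY_COUNTS = {
    ("law", "enforcement"): 50,
    ("enforcement", "agent"): 5,
    ("law", "agent"): 2,
}


def toy_assoc(x, y):
    return TOY_COUNTS.get((x, y), 0)


print(bracket(toy_assoc, "law", "enforcement", "agent", model="adjacency"))
print(bracket(toy_assoc, "law", "enforcement", "agent", model="dependency"))
```

Passing the score as a function keeps the bracketing decision separate from how the score is estimated (raw frequency, probability, or χ2).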
7
Computing Probabilities
Alternative
Calculations
8
χ2 measure
χ2 = N(AD - BC)^2 / ((A+C)(B+D)(A+B)(C+D))
A = #(wi, wj)
B = #(wi) - A
C = #(wj) - A
D = N - A - B - C
N = 8T = 8B pages indexed by Google x 1000 words/page
(Yang and Pedersen, 1997)
χ2 better than MI
9
Example: 蛋包飯 (omelette rice)
#(蛋) = 2067593
#(蛋包) = 2217
#(包) = 10207448
#(包飯) = 3398
#(飯) = 1672224
χ2(包飯) = 750.34 > χ2(蛋包) = 67.32, so right bracketing: [蛋 [包飯]]
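The χ2 score for a word pair can be computed from three counts and a corpus size using the standard 2x2 contingency-table formula. The sketch below is only an illustration: the function name is hypothetical, and the value of N is an assumption. The χ2 slide gives N = 8T, but the 蛋包飯 figures appear to come out only with N near 8 billion, which is an inference here rather than something the slides state.

```python
def chi_square(count_i, count_j, count_ij, n):
    """Chi-square association between words wi and wj.

    count_i, count_j : counts of wi and wj on their own
    count_ij         : count of the bigram "wi wj"
    n                : assumed total number of observations
    """
    a = count_ij
    b = count_i - a
    c = count_j - a
    d = n - a - b - c
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    return numerator / denominator


# Assumption: N = 8 billion (see the note above), not the 8T from the slide.
N_ASSUMED = 8_000_000_000

chi_left = chi_square(2067593, 10207448, 2217, N_ASSUMED)    # 蛋包
chi_right = chi_square(10207448, 1672224, 3398, N_ASSUMED)   # 包飯

print("right bracketing" if chi_right > chi_left else "left bracketing")
```

Integer arithmetic is used until the final division, so the large intermediate products do not lose precision.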
10
Web-Derived Surface (1/2)
Authors sometimes (consciously or not) disambiguate the words
they write by using surface-level markers to suggest the correct
meaning.
Dash (hyphen)
left bracketing
cell cycle analysis -> cell-cycle analysis
right bracketing less reliable
donor T-cell
fiber optics-system
t-cell-depletion
Possessive marker
brain’s stem cells, brain stem’s cells, brain’s stem-cells
Internal capitalization
Plasmodium vivax Malaria, brain Stem cells
disable this feature for Roman numerals and single-letter words
vitamin D deficiency
11
Web-Derived Surface (2/2)
Embedded slashes
leukemia/lymphoma cell
Brackets
growth factor (beta) or (growth factor) beta
(brain) stem cells
A comma, a dot or a colon
“health care, provider” or “lung cancer: patients” (weak indicator)
Dash to an external word
mouse-brain stem cells (weak indicator)
Unfortunately, Web search engines ignore punctuation characters: hyphens, brackets, apostrophes, etc.
Collect them indirectly by post-processing the resulting summaries (up to 1000 results)
Some of the above features are clearly more reliable than others, but we do not try to weight them
Features verifying
Counts returned by SE, page hits as a proxy for n-gram frequencies
from 1000 summaries
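Because the search engine ignores punctuation, the surface features have to be counted in the returned summaries. The sketch below shows one way this post-processing might look, assuming the summaries are already available as plain strings. The function names and regular expressions are illustrative assumptions, and only a subset of the features (dash, possessive, brackets, slash) is covered. Every match counts as one unweighted vote.

```python
import re


def surface_votes(w1, w2, w3, snippets):
    """Count left- and right-predicting surface patterns for w1 w2 w3."""
    a, b, c = (re.escape(w) for w in (w1, w2, w3))
    left_patterns = [
        rf"\b{a}-{b}\s+{c}\b",             # dash: cell-cycle analysis
        rf"\b{a}\s+{b}['’]s\s+{c}\b",      # possessive: brain stem's cells
        rf"\(\s*{a}\s+{b}\s*\)\s*{c}\b",   # brackets: (growth factor) beta
    ]
    right_patterns = [
        rf"\b{a}\s+{b}-{c}\b",             # dash: donor T-cell
        rf"\b{a}['’]s\s+{b}\s+{c}\b",      # possessive: brain's stem cells
        rf"\(\s*{a}\s*\)\s*{b}\s+{c}\b",   # brackets: (brain) stem cells
        rf"\b{a}/{b}\s+{c}\b",             # slash: leukemia/lymphoma cell
    ]

    def count(patterns):
        return sum(
            len(re.findall(p, s, flags=re.IGNORECASE))
            for p in patterns
            for s in snippets
        )

    return count(left_patterns), count(right_patterns)


def decide(left, right):
    """Unweighted vote; None when the evidence is balanced."""
    if left > right:
        return "left"
    if right > left:
        return "right"
    return None


snippets = [
    "cell-cycle analysis of yeast",
    "the Cell-Cycle analysis tool",
    "cell cycle-analysis",
]
left, right = surface_votes("cell", "cycle", "analysis", snippets)
print(left, right, decide(left, right))
```

Matching is case-insensitive, so the internal-capitalization feature would need a separate case-sensitive pass.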
12
Other Web-Derived
Features
Abbreviations
tumor necrosis factor (NF)
tumor necrosis (TN) factor
Concatenation
health care reform -> healthcare, carereform
Wildcard (*)
“health care * reform” <-> “health * care reform”
Reorder
reform health care <-> care reform health
myosin heavy chain, heavy chain myosin
Internal inflection variability
tyrosine kinase activation, tyrosine kinases activation
Switching the first two words
for “adult male rat” we would also expect “male adult rat”
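Several of these features reduce to issuing a pair of competing exact-phrase queries and comparing their counts. The following sketch generates such pairs for a three-word compound. The function name is hypothetical, and the reading of each pair as (left-predicting, right-predicting) is an assumption based on the order of the examples on the slide.

```python
def query_variants(w1, w2, w3):
    """Return {feature: (left_query, right_query)} for the compound w1 w2 w3."""
    return {
        # healthcare reform vs. health carereform
        "concatenation": (f"{w1}{w2} {w3}", f"{w1} {w2}{w3}"),
        # "health care * reform" vs. "health * care reform"
        "wildcard": (f"{w1} {w2} * {w3}", f"{w1} * {w2} {w3}"),
        # "reform health care" vs. "care reform health"
        "reorder": (f"{w3} {w1} {w2}", f"{w2} {w3} {w1}"),
    }


for feature, (left_q, right_q) in query_variants("health", "care", "reform").items():
    print(f'{feature}: "{left_q}" <-> "{right_q}"')
```

Each query would then be sent to the search engine as an exact phrase, and the side with more hits contributes a vote for its bracketing.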
13
New Findings
14
Paraphrases
Warren (1978) proposes
Prepositional paraphrase
stem cells in the brain
cells from the brain stem
Copula paraphrase
office building that/which is a skyscraper
Verbal paraphrase
pain associated with arthritis migraine
Search engines lack linguistic annotations, so use a small set of hand-chosen paraphrases:
associated with, caused by, contained in, derived from,
focusing on, found in, involved in, located at/in, made of,
performed by, preventing, related to and used by/in/for
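The hand-chosen list can be turned into exact-phrase queries for each bracketing. The sketch below assumes the pattern suggested by the examples on this slide: "w3 P w1 w2" supports left bracketing (cells from the brain stem) and "w2 w3 P w1" supports right bracketing (stem cells in the brain). The names are hypothetical, and inflection and determiners are ignored for simplicity.

```python
# The slide's list, with "located at/in" and "used by/in/for" expanded.
PARAPHRASES = [
    "associated with", "caused by", "contained in", "derived from",
    "focusing on", "found in", "involved in", "located at", "located in",
    "made of", "performed by", "preventing", "related to",
    "used by", "used in", "used for",
]


def paraphrase_queries(w1, w2, w3):
    """Return (left_queries, right_queries) for the compound w1 w2 w3."""
    left = [f"{w3} {p} {w1} {w2}" for p in PARAPHRASES]    # [w1 w2] w3
    right = [f"{w2} {w3} {p} {w1}" for p in PARAPHRASES]   # w1 [w2 w3]
    return left, right


left_q, right_q = paraphrase_queries("brain", "stem", "cells")
print(left_q[5])
print(right_q[5])
```

Summing the hit counts over each list gives one paraphrase-based vote per compound.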
15
Evaluations
Lauer’s Dataset (1995)
244 unambiguous 3-noun NCs
Biomedical Dataset (Nakov et al., 2005,
SIG BioLink)
Open NLP tools
Sentence-split, tokenized, POS-tagged and shallow-parsed a set of 1.4 million MEDLINE abstracts (citations between 1994 and 2003)
500 NCs: 361 left, 69 right, 70 ambiguous
16
Experiments
Used MSN Search statistics for the n-grams and the paraphrases (unless the
pattern contained a “*”)
MSN always returned exact numbers
Google for the surface features
Google and Yahoo rounded their page hits,
which generally leads to lower accuracy
(Yahoo was better than Google for these
estimates)
17
Tools Mentioned
UMLS Specialist lexicon
Provides the different spelling variants of biomedical-domain words
http://www.nlm.nih.gov/pubs/factsheets/umlslex.html
Carroll’s morphological tools
http://www.cogs.susx.ac.uk/lab/nlp/carroll/morph.html
18
UMLS Lexicon
{base=AAA
entry=E0000049
cat=noun
variants=metareg
variants=uncount
acronym_of=abdominal aortic aneurysmectomy|E0429482
acronym_of=acne-associated arthritis|E0429483
acronym_of=acquired aplastic anemia|E0429484
acronym_of=acute anxiety attack|E0429485
acronym_of=androgenic anabolic agent|E0429486
acronym_of=aneurysm of ascending aorta
acronym_of=aromatic amino acid|E0356310
acronym_of=acute apical abscess|E0356309
abbreviation_of=abdominal aortic aneurysm|E0006446
}
{base=AAMD
spelling_variant=A.A.M.D.
entry=E0000050
cat=noun
variants=groupuncount
acronym_of=American Association on Mental Deficiency|E0000277
}
19
20
21
22
Conclusions and Future Work
Improved upon the state-of-the-art
approaches to NC bracketing
Future work includes
testing on NCs of more than 3 words
recognizing the ambiguous cases
including determiners and modifiers
applying the method to other NLP problems
refining parser output
Parsers typically assume right bracketing
23