Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing
CoNLL-2005
Preslav Nakov, EECS, Computer Science Division, University of California, Berkeley
Marti Hearst, SIMS, University of California, Berkeley
1
Outline

Introduction
Related Work
Models and Features
2
Introduction

Noun compound bracketing -> noun compound interpretation
liver cell antibody: [[liver cell] antibody]
liver cell line: [liver [cell line]]
Same POS sequence, different syntactic trees
3
This Paper

A highly accurate unsupervised method for making bracketing decisions for noun compounds (NCs)
Current practice: bigram estimates are used to compute adjacency and dependency scores
Improvements:
  χ2 measure
  a new set of surface features for querying Web search engines
Evaluated on 2 domains: encyclopedia and bioscience
4
Related Work

NC syntax and semantics: still an active area -> J. of Computer Speech and Language, Special Issue on Multiword Expressions
Adjacency model
Probabilistic dependency model, Lauer (1995)
  Data sparseness (use categories instead)
  244 NCs from encyclopedia
  Inter-annotator agreement 81.5%
  Baseline 66.8% -> 77.5%
  Adding POS -> state-of-the-art result of 80.7%
5
2003~2005

Keller and Lapata (2003): use Web search engines to obtain frequencies for unseen bigrams
  (2004): applied to six NLP tasks, including disambiguation of NCs
  Simpler version (frequency only): 78.68%
Girju et al. (2005): supervised (decision tree) with 5 WordNet semantic features: 83.1%
6
Models and Features

Adjacency and dependency model
Right bracketing, w1w2w3 -> [w1 [w2w3]], can arise for two reasons:
1. w2w3 is a compound (modified by w1): home health care
2. w1 and w2 independently modify w3: adult male rat
The adjacency model checks 1.
The (better) dependency model checks 2.
Left bracketing -> only 1 choice: [law enforcement] agent
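The two models differ only in which word pair is compared against (w1, w2): the adjacency model uses (w2, w3), the dependency model uses (w1, w3). The following is a minimal sketch of that decision rule, assuming some association function over word pairs (raw bigram counts here; a χ2 score would fit as well). The names `bracket`, `toy_assoc` and `TOY_COUNTS` are hypothetical, and the counts are made up for illustration.

```python
from typing import Callable, Dict, Tuple


def bracket(w1: str, w2: str, w3: str,
            assoc: Callable[[str, str], float],
            model: str = "dependency") -> str:
    """Return 'left', 'right' or 'undecided' for the compound w1 w2 w3."""
    if model not in ("adjacency", "dependency"):
        raise ValueError(f"unknown model: {model}")
    left_score = assoc(w1, w2)
    # Adjacency compares (w1, w2) with (w2, w3);
    # dependency compares (w1, w2) with (w1, w3).
    right_score = assoc(w2, w3) if model == "adjacency" else assoc(w1, w3)
    if left_score > right_score:
        return "left"
    if left_score < right_score:
        return "right"
    return "undecided"


# Made-up bigram counts, for illustration only (not from the paper).
TOY_COUNTS: Dict[Tuple[str, str], int] = {
    ("law", "enforcement"): 900,
    ("enforcement", "agent"): 300,
    ("law", "agent"): 50,
    ("adult", "male"): 200,
    ("male", "rat"): 500,
    ("adult", "rat"): 400,
}


def toy_assoc(a: str, b: str) -> float:
    """Look up a toy bigram count; unseen pairs count as zero."""
    return float(TOY_COUNTS.get((a, b), 0))


if __name__ == "__main__":
    for nc in [("law", "enforcement", "agent"), ("adult", "male", "rat")]:
        for model in ("adjacency", "dependency"):
            print(nc, model, bracket(*nc, toy_assoc, model))
```

Returning "undecided" on ties keeps the ambiguous case visible instead of silently defaulting to one bracketing.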
7
Computing Probabilities

Alternative

Calculations
8
χ2 measure

A = #(wi, wj)
B = #(wi) - A
C = #(wj) - A
D = N - A - B - C
N ≈ 8T (Google's 8B pages x 1000 words/page)
(Yang and Pedersen, 1997)
χ2 better than MI
9
Example: 蛋包飯 (omelette rice)

蛋 (egg): 2067593
蛋包: 2217
包 (wrap): 10207448
包飯: 3398
飯 (rice): 1672224
χ2: 包飯 750.34 > 蛋包 67.32
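The χ2 score can be sketched with the contingency-table form N(AD - BC)^2 / ((A + C)(B + D)(A + B)(C + D)), where A = #(wi, wj) and B, C, D are as defined on the χ2 slide. This is a minimal sketch, not the authors' code: the function names are hypothetical, and N is the slide's rough estimate of 8T words. The absolute scores depend on that choice of N, so only the comparison between the two candidate bigrams is meaningful here.

```python
def chi_square(count_i: int, count_j: int, count_ij: int, total: int) -> float:
    """Chi-square association of two words from their page-hit counts."""
    a = count_ij                 # wi and wj together
    b = count_i - a              # wi without wj
    c = count_j - a              # wj without wi
    d = total - a - b - c        # neither word
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        raise ValueError("degenerate contingency table")
    return total * (a * d - b * c) ** 2 / denominator


# N as estimated on the slide: 8 billion pages x 1000 words per page.
N = 8 * 10**12

# Page-hit counts copied from the example slide.
COUNTS = {"蛋": 2067593, "包": 10207448, "飯": 1672224,
          "蛋包": 2217, "包飯": 3398}


def compare_bigrams() -> str:
    """Compare the two adjacent bigrams of 蛋包飯, adjacency-style."""
    left = chi_square(COUNTS["蛋"], COUNTS["包"], COUNTS["蛋包"], N)
    right = chi_square(COUNTS["包"], COUNTS["飯"], COUNTS["包飯"], N)
    if left > right:
        return "left"
    if right > left:
        return "right"
    return "undecided"


if __name__ == "__main__":
    print(compare_bigrams())
```

Python integers are arbitrary precision, so the products in the numerator and denominator do not overflow before the final division.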
10
Web-Derived Surface Features (1/2)

Authors sometimes (consciously or not) disambiguate the words they write by using surface-level markers to suggest the correct meaning.
Dash (hyphen)
  left bracketing: cell cycle analysis -> cell-cycle
  right bracketing (less reliable): donor T-cell, fiber optics-system, t-cell-depletion
Possessive marker: brain’s stem cells, brain stem’s cells, brain’s stem-cells
Internal capitalization: Plasmodium vivax Malaria, brain Stem cells
  disable this feature on Roman digits and single-letter words: vitamin D deficiency
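The hyphen and possessive markers can be counted with simple pattern matching over text snippets. The sketch below is an illustration under assumptions, not the authors' implementation: `surface_votes` is a hypothetical name, the snippets are plain strings supplied by the caller, and capitalization is left out because the matching is case-insensitive.

```python
import re
from typing import Dict, Iterable


def surface_votes(w1: str, w2: str, w3: str,
                  snippets: Iterable[str]) -> Dict[str, int]:
    """Count hyphen and possessive evidence for each bracketing of w1 w2 w3."""
    e1, e2, e3 = (re.escape(w) for w in (w1, w2, w3))
    patterns = {
        "left": [
            rf"\b{e1}-{e2}\s+{e3}\b",          # cell-cycle analysis
            rf"\b{e1}\s+{e2}['’]s\s+{e3}\b",   # brain stem's cells
        ],
        "right": [
            rf"\b{e1}\s+{e2}-{e3}\b",          # donor T-cell
            rf"\b{e1}['’]s\s+{e2}\s+{e3}\b",   # brain's stem cells
        ],
    }
    votes = {"left": 0, "right": 0}
    for text in snippets:
        for side, side_patterns in patterns.items():
            for pattern in side_patterns:
                votes[side] += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return votes


if __name__ == "__main__":
    print(surface_votes("cell", "cycle", "analysis",
                        ["Cell-cycle analysis was performed."]))
```

Forms that mix markers, such as "brain’s stem-cells" or "t-cell-depletion", match none of these patterns and therefore cast no vote.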
11
Web-Derived Surface Features (2/2)

Embedded slashes: leukemia/lymphoma cell
Brackets: growth factor (beta) or (growth factor) beta; (brain) stem cells
A comma, a dot or a colon: “health care, provider” or “lung cancer: patients” (weak indicator)
Dash to an external word: mouse-brain stem cells (weak indicator)
Unfortunately, Web search engines ignore punctuation characters: hyphens, brackets, apostrophes, etc.
  collect them indirectly, by post-processing the resulting summaries (up to 1000 results)
Some of the above features are clearly more reliable than others, but we do not try to weight them
Verifying the features
  Counts returned by the search engine: page hits as a proxy for n-gram frequencies
  Counts from the 1000 summaries
12
Other Web-Derived Features

Abbreviations: tumor necrosis factor (NF) <-> tumor necrosis (TN) factor
Concatenation: health care reform -> healthcare, carereform
Wildcard (*): “health care * reform” <-> “health * care reform”
Reorder: reform health care <-> care reform health; myosin heavy chain, heavy chain myosin
Internal inflection variability: tyrosine kinase activation, tyrosine kinases activation
Switching: for “adult male rat”, we would also expect “male adult rat”
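The concatenation, wildcard and reorder features each reduce to a pair of exact-phrase queries whose hit counts are compared. The sketch below builds those query pairs; it assumes a caller-supplied `hits` function that returns a page-hit count for a query string (hypothetical, no real search API is called). The left/right assignment follows the reading that a pattern keeping w1 and w2 together supports left bracketing, and one keeping w2 and w3 together supports right bracketing.

```python
from typing import Callable, Dict, Tuple


def feature_queries(w1: str, w2: str, w3: str) -> Dict[str, Tuple[str, str]]:
    """Map each feature to a (left-evidence, right-evidence) query pair."""
    return {
        # healthcare vs. carereform
        "concatenation": (f'"{w1}{w2}"', f'"{w2}{w3}"'),
        # "health care * reform" vs. "health * care reform"
        "wildcard": (f'"{w1} {w2} * {w3}"', f'"{w1} * {w2} {w3}"'),
        # "reform health care" vs. "care reform health"
        "reorder": (f'"{w3} {w1} {w2}"', f'"{w2} {w3} {w1}"'),
    }


def vote(w1: str, w2: str, w3: str,
         hits: Callable[[str], int]) -> Dict[str, str]:
    """Let each feature vote by comparing the hit counts of its two queries."""
    decisions = {}
    for name, (left_query, right_query) in feature_queries(w1, w2, w3).items():
        left, right = hits(left_query), hits(right_query)
        if left > right:
            decisions[name] = "left"
        elif right > left:
            decisions[name] = "right"
        else:
            decisions[name] = "undecided"
    return decisions


if __name__ == "__main__":
    for name, pair in feature_queries("health", "care", "reform").items():
        print(name, pair)
```

Keeping query construction separate from the hit-count lookup makes it easy to swap in cached counts or a different search engine.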
13
New Findings
14
Paraphrases

Warren (1978) proposes paraphrases: stem cells in the brain, cells from the brain stem
Copula paraphrase: office building that/which is a skyscraper
pain associated with arthritis migraine
Search engines lack linguistic annotations -> use a small set of hand-chosen paraphrases:
associated with, caused by, contained in, derived from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to and used by/in/for
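The hand-chosen patterns above can be turned into exact-phrase queries for each bracketing. This is a simplified sketch under assumptions: `paraphrase_queries` is a hypothetical name, determiners and inflection are ignored, and the direction of each template is read off the slide's examples ("cells from the brain stem" keeps w1 w2 together, "stem cells in the brain" keeps w2 w3 together).

```python
from typing import Dict, List, Sequence

# The slide's pattern list, with "located at/in" and "used by/in/for" expanded.
PARAPHRASE_PATTERNS: List[str] = [
    "associated with", "caused by", "contained in", "derived from",
    "focusing on", "found in", "involved in", "located at", "located in",
    "made of", "performed by", "preventing", "related to",
    "used by", "used in", "used for",
]


def paraphrase_queries(w1: str, w2: str, w3: str,
                       patterns: Sequence[str] = PARAPHRASE_PATTERNS
                       ) -> Dict[str, List[str]]:
    """Build exact-phrase paraphrase queries for the compound w1 w2 w3."""
    return {
        # w3 P w1 w2 keeps w1 w2 together: evidence for left bracketing.
        "left": [f'"{w3} {p} {w1} {w2}"' for p in patterns],
        # w2 w3 P w1 keeps w2 w3 together: evidence for right bracketing.
        "right": [f'"{w2} {w3} {p} {w1}"' for p in patterns],
    }


if __name__ == "__main__":
    queries = paraphrase_queries("arthritis", "migraine", "pain")
    print(queries["left"][0])
    print(queries["right"][0])
```

The hit counts of the two query groups can then be summed and compared in the same way as for the other features.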
15
Evaluations

Lauer’s Dataset (1995): 244 unambiguous 3-noun NCs
Biomedical Dataset (Nakov et al., 2005, SIG BioLink)
  Open NLP tools: sentence-split, tokenized, POS-tagged and shallow-parsed a set of 1.4 million MEDLINE abstracts (citations between 1994 and 2003)
  500 NCs: 361 left, 69 right, 70 ambiguous
16
Experiments

Used MSN Search statistics for the n-grams and the paraphrases (unless the pattern contained a “*”)
  MSN always returned exact numbers
Used Google for the surface features
Google and Yahoo rounded their page hits, which generally leads to lower accuracy (Yahoo was better than Google for these estimates)
17
Tools Mentioned

UMLS Specialist lexicon: used to obtain spelling variants of biomedical-domain words
http://www.nlm.nih.gov/pubs/factsheets/umlslex.html
Carroll’s morphological tools
http://www.cogs.susx.ac.uk/lab/nlp/carroll/morph.html
18
UMLS Lexicon

{base=AAA
entry=E0000049
cat=noun
variants=metareg
variants=uncount
acronym_of=abdominal aortic aneurysmectomy|E0429482
acronym_of=acne-associated arthritis|E0429483
acronym_of=acquired aplastic anemia|E0429484
acronym_of=acute anxiety attack|E0429485
acronym_of=androgenic anabolic agent|E0429486
acronym_of=aneurysm of ascending aorta
acronym_of=aromatic amino acid|E0356310
acronym_of=acute apical abscess|E0356309
abbreviation_of=abdominal aortic aneurysm|E0006446}
{base=AAMD
spelling_variant=A.A.M.D.
entry=E0000050
cat=noun
variants=groupuncount
acronym_of=American Association on Mental Deficiency|E0000277}
19
Conclusions and Future Work

Improved upon the state-of-the-art approaches to NC bracketing
Future work includes:
  test on NCs of more than 3 words
  recognize the ambiguous case
  include determiners and modifiers
  apply to other NLP problems
  refine the parser output: parsers typically assume right bracketing
23