Unambiguous + Unlimited = Unsupervised
Using the Web for
Natural Language Processing Problems
Marti Hearst
School of Information, UC Berkeley
Joint work with Preslav Nakov
BYU CS Colloquium, Dec 6, 2007
This research supported in part by NSF DBI-0317510
Natural Language Processing
The ultimate goal: write programs that read and
understand stories and conversations.
This is too hard! Instead we tackle sub-problems.
There have been notable successes lately:
Machine translation is vastly improved
Speech recognition is decent in limited circumstances
Text categorization works with some accuracy
Marti Hearst, BYU CS 2007
How can a machine understand these differences?
Get the cat with the gloves.
How can a machine understand these differences?
Get the sock
from the cat
with the gloves.
Get the glove
from the cat
with the socks.
How can a machine understand these
differences?
Decorate the cake with the frosting.
Decorate the cake with the kids.
Throw out the cake with the frosting.
Throw out the cake with the kids.
Why is this difficult?
Same syntactic structure, different meanings.
Natural language processing algorithms have to
deal with the specifics of individual words.
Enormous vocabulary sizes.
The average English speaker’s vocabulary is around
50,000 words,
Many of these can be combined with many others,
And they mean different things when they do!
How to tackle this problem?
The field was stuck for quite some time.
Hand-enter all semantic concepts and relations
A new approach started around 1990
Get large text collections
Compute statistics over the words in those collections
There are many different algorithms.
Size Matters
Recent realization: bigger is better than smarter!
Banko and Brill ’01: “Scaling to Very, Very Large
Corpora for Natural Language Disambiguation”, ACL
Example Problem
Grammar checker example:
Which word to use?
<principal> <principle>
Solution: use well-edited text and look at which
words surround each use:
I am in my third year as the principal of Anamosa High
School.
School-principal transfers caused some upset.
This is a simple formulation of the quantum mechanical
uncertainty principle.
Power without principle is barren, but principle without
power is futile. (Tony Blair)
Using Very, Very Large Corpora
Keep track of which words are the neighbors of each
spelling in well-edited text, e.g.:
Principal: “high school”
Principle: “rule”
At grammar-check time, choose the spelling best
predicted by the surrounding words.
Surprising results:
Log-linear improvement even to a billion words!
Getting more data is better than fine-tuning algorithms!
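The neighbor-tracking idea can be sketched in a few lines of Python. This is a minimal illustration, not the method of Banko and Brill (who trained machine-learning classifiers): it simply counts nearby words, and the window size, the function names, and the two-sentence corpus are assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

CONFUSION_SET = ("principal", "principle")


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def train(sentences, window=4):
    """Count the words seen within `window` tokens of each spelling."""
    neighbors = defaultdict(Counter)
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, tok in enumerate(tokens):
            if tok in CONFUSION_SET:
                context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                neighbors[tok].update(context)
    return neighbors


def choose(neighbors, context_words):
    """Pick the spelling whose stored neighbors overlap most with the context."""
    def score(spelling):
        return sum(neighbors[spelling][w] for w in context_words)
    return max(CONFUSION_SET, key=score)


# Toy "well-edited" corpus taken from the example sentences on the slide.
corpus = [
    "I am in my third year as the principal of Anamosa High School.",
    "This is a simple formulation of the quantum mechanical uncertainty principle.",
]
model = train(corpus)
guess_a = choose(model, tokenize("the high school"))
guess_b = choose(model, tokenize("the uncertainty"))
```

With more text, the neighbor counts become more reliable, which is the effect the slide describes.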
The Effects of LARGE Datasets
From Banko & Brill ’01
How to Extend this Idea?
This is an exciting result …
BUT relies on having huge amounts of text
that has been appropriately annotated!
How to Avoid Manual Labeling?
“Web as a baseline” (Lapata & Keller 04,05)
Main idea: apply web-determined counts to
every problem imaginable.
Example: for each t in {<principal>, <principle>}
Compute f(w₋₁, t, w₊₁), the web count of t between its left and right neighbor words
The largest count wins
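A minimal sketch of the "largest count wins" rule in Python. The `count` argument is a hypothetical stand-in for a search-engine hit-count lookup, and the numbers in `toy_counts` are invented for illustration.

```python
def pick_by_web_count(left, right, candidates, count):
    """Return the candidate t with the largest count for the phrase 'left t right'."""
    return max(candidates, key=lambda t: count(f"{left} {t} {right}"))


# Hypothetical counts standing in for search-engine page hits.
toy_counts = {"school principal said": 1200, "school principle said": 15}

best = pick_by_web_count(
    "school", "said", ["principal", "principle"],
    lambda query: toy_counts.get(query, 0),
)
```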
Web as a Baseline
Works very well in some cases
machine translation candidate selection
article generation
noun compound interpretation
noun compound bracketing
adjective ordering
Some of these are significantly better than the best supervised algorithm;
others are not significantly different from the best supervised.
But lacking in others
spelling correction
countability detection
prepositional phrase attachment
How to push this idea further?
Using Unambiguous Cases
The trick: look for unambiguous cases to start
Use these to improve the results beyond what co-occurrence statistics indicate.
An Early Example:
Hindle and Rooth, “Structural Ambiguity and Lexical
Relations”, ACL ’90, Comp Ling’93
Problem: Prepositional Phrase attachment
I eat/v spaghetti/n1 with/p a fork/n2.
I eat/v spaghetti/n1 with/p sauce/n2.
Question: does n2 attach to v or to n1?
Using Unambiguous Cases
How to do this with unlabeled data?
First try:
Parse some text into phrase structure
Then compute certain co-occurrences
f(v, n1, p), f(n1, p), f(v, n1)
Problem: results not accurate enough
The trick: look for unambiguous cases:
Spaghetti with sauce is delicious. (pre-verbal)
I eat with a fork. (no direct object)
Use these to improve the results beyond what co-occurrence statistics indicate.
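A rough Python sketch of how counts gathered from unambiguous cases could drive an attachment decision. The raw-count comparison is a simplification chosen for illustration (it is not Hindle and Rooth's actual scoring), and the observation tuples are toy data.

```python
from collections import Counter


def build_counts(unambiguous):
    """Tally (head, preposition) pairs; each item is (site, head, prep)
    where site is "verb" (no direct object) or "noun" (pre-verbal phrase)."""
    counts = {"verb": Counter(), "noun": Counter()}
    for site, head, prep in unambiguous:
        counts[site][(head, prep)] += 1
    return counts


def attach(counts, v, n1, p):
    """Decide an ambiguous (v, n1, p) case from the unambiguous evidence."""
    verb_score = counts["verb"][(v, p)]
    noun_score = counts["noun"][(n1, p)]
    if verb_score == noun_score:
        return None  # no evidence either way
    return "verb" if verb_score > noun_score else "noun"


# Toy observations: "I eat with a fork" seen twice, "Spaghetti with sauce ..." once.
observations = [
    ("verb", "eat", "with"),
    ("verb", "eat", "with"),
    ("noun", "spaghetti", "with"),
]
counts = build_counts(observations)
decision = attach(counts, "eat", "spaghetti", "with")
```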
Unambiguous + Unlimited = Unsupervised
Apply the Unambiguous Case Idea to the Very, Very Large
Corpora idea
The potential of these approaches is not fully realized
Our work (with Preslav Nakov):
Structural Ambiguity Decisions
PP-attachment
Noun compound bracketing
Coordination grouping
Semantic Relation Acquisition
Hypernym (ISA) relations
Verbal relations between nouns
SAT Analogy problems
Applying U + U = U to Structural Ambiguity
We introduce the use of (nearly) unambiguous
features:
Surface features
Paraphrases
Combined with ngrams
From very, very large corpora
Achieve state-of-the-art results without labeled
examples.
Noun Compound Bracketing
(a) [ [ liver cell ] antibody ] (left bracketing)
(b) [ liver [ cell line ] ] (right bracketing)
In (a), the antibody targets the liver cell.
In (b), the cell line is derived from the liver.
Dependency Model
right bracketing: [ w1 [ w2 w3 ] ]
w2 w3 is a compound (modified by w1)
home health care
w1 and w2 independently modify w3
adult male rat
left bracketing: [ [ w1 w2 ] w3 ]
only 1 modificational choice possible
law enforcement officer
Our U + U + U Algorithm
Compute bigram estimates
Compute estimates from surface features
Compute estimates from paraphrases
Combine these scores with a voting algorithm to
choose left or right bracketing.
We use the same general approach for two other
structural ambiguity problems.
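The final voting step might look like the following Python sketch. A simple majority vote is assumed here, and the fallback to "left" on ties is an assumption for illustration rather than something stated on the slide.

```python
from collections import Counter


def vote(predictions, default="left"):
    """Majority vote over per-feature predictions; abstentions (None) are ignored."""
    tally = Counter(p for p in predictions if p in ("left", "right"))
    if tally["left"] == tally["right"]:
        return default  # tie or no evidence: fall back to the assumed default
    return "left" if tally["left"] > tally["right"] else "right"


# One prediction each from bigram estimates, surface features, and paraphrases.
final = vote(["left", "right", "left"])
```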
Using n-grams to make predictions
Say we are trying to distinguish:
[home health] care
home [health care]
Main idea: compare these co-occurrence
probabilities
“home health” vs
“health care”
Computing Bigram Statistics
Dependency Model, Frequencies
Compare #(w1,w2) to #(w1,w3)
Dependency model, Probabilities
Pr(left) = Pr(w1→w2|w2) Pr(w2→w3|w3)
Pr(right) = Pr(w1→w3|w3) Pr(w2→w3|w3)
So we compare Pr(w1→w2|w2) to Pr(w1→w3|w3)
Using n-grams to estimate probabilities
Using page hits as a proxy for n-gram counts
Pr(w1→w2|w2) = #(w1,w2) / #(w2)
#(w2): word frequency; query for “w2”
#(w1,w2): bigram frequency; query for “w1 w2”
smoothed by 0.5
Use χ² to determine if w1 is associated with w2
(thus indicating left bracketing), and same for w1
with w3
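The estimates above can be sketched in Python as follows. This is a minimal sketch, not the authors' code: the `hits` dictionary of page-hit counts is invented for illustration, `total` (the number of indexed pages) is an assumption, and "smoothed by 0.5" is read here as adding 0.5 to each count.

```python
def cond_prob(bigram_hits, word_hits, smooth=0.5):
    """Estimate Pr(w1 -> w2 | w2) = #(w1,w2) / #(w2), adding 0.5 to each count."""
    return (bigram_hits + smooth) / (word_hits + smooth)


def chi_squared(n12, n1, n2, total):
    """Chi-squared association score from a 2x2 contingency table of hit counts."""
    a = n12                  # hits for the bigram "w1 w2"
    b = n1 - n12             # hits for w1 outside the bigram
    c = n2 - n12             # hits for w2 outside the bigram
    d = total - a - b - c    # everything else
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return total * (a * d - b * c) ** 2 / denom


def bracket(w1, w2, w3, hits, total):
    """Dependency model: is w1 more strongly associated with w2 (left) or w3 (right)?"""
    left = chi_squared(hits[(w1, w2)], hits[w1], hits[w2], total)
    right = chi_squared(hits[(w1, w3)], hits[w1], hits[w3], total)
    if left == right:
        return None
    return "left" if left > right else "right"


# Hypothetical page-hit counts, for illustration only.
hits = {
    "law": 1000, "enforcement": 400, "officer": 800,
    ("law", "enforcement"): 300, ("law", "officer"): 20,
}
decision = bracket("law", "enforcement", "officer", hits, total=1_000_000)
```

The same `bracket` function could compare the smoothed probabilities from `cond_prob` instead of the χ² scores; both variants compare the (w1, w2) evidence against the (w1, w3) evidence.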
Web-derived Surface Features
Authors often disambiguate noun compounds using
surface markers, e.g.:
amino-acid sequence → left
brain stem’s cell → left
brain’s stem cell → right
The enormous size of the Web makes these
frequent enough to be useful.
Web-derived Surface Features:
Dash (hyphen)
Left dash
cell-cycle analysis → left
Right dash
donor T-cell → right
Double dash
T-cell-depletion → unusable…
Web-derived Surface Features:
Possessive Marker
Attached to the first word
brain’s stem cell → right
Attached to the second word
brain stem’s cell → left
Combined features
brain’s stem-cell → right
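The dash and possessive features can be sketched in Python as query templates whose hit counts are summed per bracketing. This is an illustrative sketch only: `count` is a hypothetical hit-count lookup, the counts in `toy_counts` are invented, and summing raw counts is one simple way to combine the patterns.

```python
def surface_feature_decision(w1, w2, w3, count):
    """Compare hit counts for surface variants that signal left vs. right bracketing."""
    left_patterns = [
        f"{w1}-{w2} {w3}",     # left dash, as in "cell-cycle analysis"
        f"{w1} {w2}'s {w3}",   # possessive on the second word
    ]
    right_patterns = [
        f"{w1} {w2}-{w3}",     # right dash, as in "donor T-cell"
        f"{w1}'s {w2} {w3}",   # possessive on the first word
        f"{w1}'s {w2}-{w3}",   # combined possessive and dash
    ]
    left = sum(count(q) for q in left_patterns)
    right = sum(count(q) for q in right_patterns)
    if left == right:
        return None
    return "left" if left > right else "right"


# Hypothetical hit counts, for illustration only.
toy_counts = {"brain's stem cell": 40, "brain stem-cell": 25, "brain stem's cell": 3}
decision = surface_feature_decision(
    "brain", "stem", "cell", lambda q: toy_counts.get(q, 0)
)
```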
Web-derived Surface Features:
Capitalization
anycase – lowercase – uppercase
Plasmodium vivax Malaria → left
plasmodium vivax Malaria → left
lowercase – uppercase – anycase
brain Stem cell → right
brain Stem Cell → right
Disable this on:
Roman digits
Single-letter words: e.g. vitamin D deficiency
Web-derived Surface Features:
Embedded Slash
Left embedded slash
leukemia/lymphoma cell → right
Web-derived Surface Features:
Parentheses
Single-word
growth factor (beta) → left
(brain) stem cell → right
Two-word
(growth factor) beta → left
brain (stem cell) → right
Web-derived Surface Features:
Comma, dot, semi-colon
Following the first word
home. health care → right
adult, male rat → right
Following the second word
health care, provider → left
lung cancer: patients → left
Web-derived Surface Features:
Dash to External Word
External word to the left
mouse-brain stem cell → right
External word to the right
tumor necrosis factor-alpha → left
Other Web-derived Features:
Abbreviation
After the second word
tumor necrosis (TN) factor → left
After the third word
tumor necrosis factor (NF) → right
We query for, e.g., “tumor necrosis tn factor”
Problems:
Roman digits: IV, VI
States: CA
Short words: me
Other Web-derived Features:
Concatenation
Consider health care reform
healthcare : 79,500,000
carereform : 269
healthreform: 812
Adjacency model
healthcare vs. carereform
Dependency model
healthcare vs. healthreform
Triples
“healthcare reform” vs. “health carereform”
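Using the counts on this slide, the adjacency and dependency comparisons can be worked through in a short Python sketch. The helper name `compare` is made up for the example; the three counts are the ones listed above.

```python
# Concatenation counts for "health care reform", as listed on the slide.
hits = {"healthcare": 79_500_000, "carereform": 269, "healthreform": 812}


def compare(left_evidence, right_evidence):
    """Return the bracketing supported by the larger count (None on a tie)."""
    if left_evidence == right_evidence:
        return None
    return "left" if left_evidence > right_evidence else "right"


# Adjacency model: w1w2 vs. w2w3.
adjacency = compare(hits["healthcare"], hits["carereform"])
# Dependency model: w1w2 vs. w1w3.
dependency = compare(hits["healthcare"], hits["healthreform"])
```

Both models favor the left bracketing, [ [ health care ] reform ], for these counts.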
Other Web-derived Features:
Reorder
Reorders for “health care reform”
“care reform health” → right
“reform health care” → left
Other Web-derived Features:
Internal Inflection Variability
Vary inflection of second word
tyrosine kinase activation
tyrosine kinases activation
Other Web-derived Features:
Switch The First Two Words
Predict right if we can reorder
adult male rat
as
male adult rat
Paraphrases
The semantics of a noun compound is often made
overt by a paraphrase (Warren,1978)
Prepositional
stem cells in the brain → right
cells from the brain stem → left
Verbal
virus causing human immunodeficiency → left
Copula
office building that is a skyscraper → right
Paraphrases
prepositional paraphrases:
We use: ~150 prepositions
verbal paraphrases:
We use: associated with, caused by, contained in, derived
from, focusing on, found in, involved in, located at/in,
made of, performed by, preventing, related to and used
by/in/for.
copula paraphrases:
We use: is/was and that/which/who
optional elements:
articles: a, an, the
quantifiers: some, every, etc.
pronouns: this, these, etc.
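A small Python sketch of how prepositional paraphrase queries could be generated for a three-word compound. The preposition list is a tiny illustrative subset of the ~150 mentioned above, only articles are used as optional elements, and inflection (e.g. "cells") is ignored; all of these are simplifying assumptions.

```python
PREPOSITIONS = ["in", "from", "of", "for"]   # illustrative subset only
ARTICLES = ["", "the ", "a ", "an "]          # optional element before the noun


def prepositional_paraphrases(w1, w2, w3):
    """Generate paraphrase queries supporting each bracketing of 'w1 w2 w3'."""
    # Left bracketing [[w1 w2] w3]: "w3 PREP (article) w1 w2"
    left = [f"{w3} {p} {a}{w1} {w2}" for p in PREPOSITIONS for a in ARTICLES]
    # Right bracketing [w1 [w2 w3]]: "w2 w3 PREP (article) w1"
    right = [f"{w2} {w3} {p} {a}{w1}" for p in PREPOSITIONS for a in ARTICLES]
    return {"left": left, "right": right}


queries = prepositional_paraphrases("brain", "stem", "cell")
```

Each generated string would be sent as an exact-phrase query, and the hit counts for the "left" and "right" lists compared.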
Paraphrases: pattern (1)
(1) v n1 p n2 → v n2 n1 (noun)
Can we turn “n1 p n2” into a noun compound “n2 n1”?
meet/v demands/n1 from/p customers/n2 →
meet/v the customer/n2 demands/n1
Problem: ditransitive verbs like give
gave/v an apple/n1 to/p him/n2 →
gave/v him/n2 an apple/n1
Solution:
no determiner before n1
determiner before n2 is required
the preposition cannot be to
Paraphrases: pattern (2)
(2) v n1 p n2 → v p n2 n1 (verb)
If “p n2” is an indirect object of v, then it could
be switched with the direct object n1.
had/v a program/n1 in/p place/n2 →
had/v in/p place/n2 a program/n1
Determiner before n1 is required to prevent
“n2 n1” from forming a noun compound.
Paraphrases: pattern (3)
(3) v n1 p n2 → p n2 * v n1 (verb)
“*” indicates a wildcard position (up to
three intervening words are allowed)
Looks for appositions, where the PP has
moved in front of the verb, e.g.
I gave/v an apple/n1 to/p him/n2 →
to/p him/n2 I gave/v an apple/n1
Paraphrases: pattern (4)
(4) v n1 p n2 → n1 p n2 v (noun)
Looks for appositions, where “n1 p n2” has
moved in front of v
shaken/v confidence/n1 in/p markets/n2 →
confidence/n1 in/p markets/n2 shaken/v
Paraphrases: pattern (5)
(5) v n1 p n2 → v PRONOUN p n2 (verb)
If n1 is a pronoun, the attachment is to the verb (Hindle & Rooth, 93)
Pattern (5) substitutes n1 with a dative pronoun
(him or her), e.g.
put/v a client/n1 at/p odds/n2 →
put/v him at/p odds/n2
Paraphrases: pattern (6)
(6) v n1 p n2 → BE n1 p n2 (noun)
BE is typically used with a noun attachment
Pattern (6) substitutes v with a form of to be (is
or are), e.g.
eat/v spaghetti/n1 with/p sauce/n2 →
is spaghetti/n1 with/p sauce/n2
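Patterns (1) through (6) can be summarized as query templates in a Python sketch. This is an illustration under simplifying assumptions: "the" is the only determiner used, no inflection is handled, and the function name and return shape are made up for the example.

```python
def pp_paraphrase_queries(v, n1, p, n2):
    """Return (predicted attachment, query string) pairs for a (v, n1, p, n2) quadruple."""
    queries = []
    if p != "to":  # pattern (1) is skipped for "to" to avoid ditransitive verbs
        queries.append(("noun", f"{v} the {n2} {n1}"))      # (1) v n2 n1
    queries.append(("verb", f"{v} {p} {n2} the {n1}"))      # (2) v p n2 n1
    queries.append(("verb", f"{p} {n2} * {v} {n1}"))        # (3) p n2 * v n1
    queries.append(("noun", f"{n1} {p} {n2} {v}"))          # (4) n1 p n2 v
    for pronoun in ("him", "her"):
        queries.append(("verb", f"{v} {pronoun} {p} {n2}"))  # (5) v PRONOUN p n2
    for be in ("is", "are"):
        queries.append(("noun", f"{be} {n1} {p} {n2}"))      # (6) BE n1 p n2
    return queries


example = pp_paraphrase_queries("eat", "spaghetti", "with", "sauce")
```

Hit counts for the "noun" queries and the "verb" queries would then be compared to choose an attachment.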
Evaluation: Datasets
Lauer Set
244 noun compounds (NCs)
from Grolier’s encyclopedia
inter-annotator agreement: 81.5%
Biomedical Set
430 NCs
from MEDLINE
inter-annotator agreement: 88% (κ = .606)
Co-occurrence Statistics: Lauer set, Bio set
Paraphrase and Surface Features Performance: Lauer Set, Biomedical Set
Individual Surface Features Performance: Bio
Results Lauer
Results: Comparing with Others
Results Bio
(The results tables on these slides are not reproduced in this transcript.)
Results for Noun Compound Bracketing
Introduced search engine statistics that go
beyond the n-gram (applicable to other
tasks)
surface features
paraphrases
Obtained new state-of-the-art results on NC
bracketing
more robust than Lauer (1995)
more accurate than Keller&Lapata (2004)
Prepositional Phrase Attachment
Problem:
(a) Peter spent millions of dollars. (noun attach)
(b) Peter spent time with his family. (verb attach)
Which attachment for quadruple (v, n1, p, n2)?
Results:
Much simpler than other algorithms
As good as or better than best unsupervised,
and better than some supervised approaches
Noun Phrase Coordination
(Modified) real sentence:
The Department of Chronic Diseases and Health
Promotion leads and strengthens global efforts to
prevent and control chronic diseases or disabilities
and to promote health and quality of life.
NC coordination: ellipsis
Ellipsis
car and truck production
means car production and truck production
No ellipsis
president and chief executive
All-way coordination
Securities and Exchange Commission
Results
428 examples from Penn TB
Semantic Relation Detection
Goal: automatically augment a lexical database
Many potential relation types:
ISA (hypernymy/hyponymy)
Part-Of (meronymy)
Idea: find unambiguous contexts which (nearly)
always indicate the relation of interest
Lexico-Syntactic Patterns
Adding a New Relation
Semantic Relation Detection
Lexico-syntactic Patterns:
Should occur frequently in text
Should (nearly) always suggest the relation of interest
Should be recognizable with little pre-encoded
knowledge.
These patterns have been used extensively by
other researchers.
Semantic Relation Detection
What relationship holds between two nouns?
olive oil – oil comes from olives
machine oil – oil used on machines
Assigning the meaning relations between these
terms has been seen as a very difficult problem
Our solution:
Use clever queries against the web to figure out the
relations.
Queries for Semantic Relations
Convert the noun-noun compound into a query of the form:
noun2 that * noun1
“oil that * olive(s)”
This returns search result snippets containing interesting
verbs.
In this case:
Come from
Be obtained from
Be extracted from
Made from
…
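The query-and-extract step can be sketched in Python as follows. This is only an illustration: it collects whatever words fill the wildcard slot rather than identifying verbs with a part-of-speech tagger, and the snippets are invented stand-ins for search results.

```python
import re
from collections import Counter


def relation_query(noun1, noun2):
    """Build the wildcard query "noun2 that * noun1" for the compound 'noun1 noun2'."""
    return f'"{noun2} that * {noun1}"'


def extract_fillers(snippets, noun1, noun2, max_words=3):
    """Count the word sequences that fill the wildcard slot in the snippets."""
    pattern = re.compile(
        rf"\b{re.escape(noun2)} that ((?:\w+\s+){{1,{max_words}}}?){re.escape(noun1)}s?\b",
        re.IGNORECASE,
    )
    fillers = Counter()
    for snippet in snippets:
        for match in pattern.finditer(snippet):
            fillers[match.group(1).strip().lower()] += 1
    return fillers


# Invented snippets standing in for search-engine results.
snippets = [
    "Olive oil is oil that comes from olives.",
    "An oil that is extracted from olive fruit",
    "oil that comes from olives is healthy",
]
query = relation_query("olive", "oil")
fillers = extract_fillers(snippets, "olive", "oil")
top = fillers.most_common(1)[0]
```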
Uncovering Semantic Relations
More examples:
Migraine drug -> treat, be used for, reduce, prevent
Wrinkle drug -> treat, be used for, reduce, smooth
Printer tray -> hold, come with, be folded, fit under,
be inserted into
Student protest -> be led by, be sponsored by, pit, be,
be organized by
Conclusions
Unambiguous + Unlimited = Unsupervised
The enormous size of the web opens new opportunities for
text analysis
There are many words, but they are more likely to appear together
in a huge dataset
This allows us to do word-specific analysis
To counter the labeled-data roadblock, we start with
unambiguous features that we can find naturally.
We’ve applied this to structural and semantic language problems.
These are stepping stones towards sophisticated language
understanding.
Thank you!
http://biotext.berkeley.edu
Supported in part by NSF DBI-0317510