
Using the Web for
Natural Language Processing Problems
Marti Hearst
School of Information, UC Berkeley
Salesforce.com
July 19, 2006
This research supported in part by NSF DBI-0317510
Natural Language Processing
 The ultimate goal: write programs that read and
understand stories and conversations.
 This is too hard! Instead we tackle sub-problems.
 There have been notable successes lately:
 Machine translation is vastly improved
 Decent speech recognition in limited circumstances
 Text categorization works with some accuracy
Automatic Help Desk Translation at MS
Why is text analysis difficult?
 One reason: enormous vocabulary size.
 The average English speaker’s vocabulary is
around 50,000 words,
 Many of these can be combined with many
others,
 And they mean different things when they do!
How can a machine understand these?
 Decorate the cake with the frosting.
 Decorate the cake with the kids.
 Throw out the cake with the frosting.
 Get the sock from the cat with the gloves.
 Get the glove from the cat with the socks.
 It’s in the plastic water bottle.
 It’s in the plastic bag dispenser.
How to tackle this problem?
 The field was stuck for quite some time.
 CYC: hand-enter all semantic concepts and relations
 A new approach started around 1990
 How to do it:
 Get large text collections
 Compute statistics over the words in those collections
 Many different algorithms for doing this.
Size Matters
 Recent realization: bigger is better than smarter!
 Banko and Brill ’01: “Scaling to Very, Very Large
Corpora for Natural Language Disambiguation”, ACL
Example Problem
 Grammar checker example:
Which word to use?
<principal> <principle>
 Solution: look at which words surround each use:
 I am in my third year as the principal of Anamosa High
School.
 School-principal transfers caused some upset.
 This is a simple formulation of the quantum mechanical
uncertainty principle.
 Power without principle is barren, but principle without
power is futile. (Tony Blair)
Using Very, Very Large Corpora
 Keep track of which words are the neighbors of each
spelling in well-edited text, e.g.:
 Principal: “high school”
 Principle: “rule”
 At grammar-check time, choose the spelling best
predicted by the surrounding words.
 Surprising results:
 Log-linear improvement even to a billion words!
 Getting more data is better than fine-tuning algorithms!
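As a concrete (if simplified) illustration of the recipe above, here is a minimal Python sketch that assumes only a corpus of tokenized, well-edited sentences; the two-word window and the helper names are illustrative choices, not the system described in the talk:

    from collections import Counter, defaultdict

    TARGETS = ("principal", "principle")

    def train_neighbor_counts(sentences):
        """sentences: iterable of token lists drawn from well-edited text."""
        counts = defaultdict(Counter)  # spelling -> Counter of nearby words
        for tokens in sentences:
            for i, tok in enumerate(tokens):
                low = tok.lower()
                if low in TARGETS:
                    window = tokens[max(0, i - 2):i] + tokens[i + 1:i + 3]
                    counts[low].update(w.lower() for w in window)
        return counts

    def choose_spelling(counts, left_context, right_context):
        """Pick the spelling whose training-time neighbors best match this context."""
        context = [w.lower() for w in left_context + right_context]
        return max(TARGETS, key=lambda t: sum(counts[t][w] for w in context))

    # counts = train_neighbor_counts(corpus_sentences)
    # choose_spelling(counts, ["the", "high", "school"], ["of"])  -> likely "principal"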
The Effects of LARGE Datasets
 From Banko & Brill ‘01
How to Extend this Idea?
 This is an exciting result …
 BUT relies on having huge amounts of text
that has been appropriately annotated!
How to Avoid Labeling?
 “Web as a baseline” (Lapata & Keller 04,05)
 Main idea: apply web-determined counts to
every problem imaginable.
 Example: for t in {<principal> <principle>}
 Compute f(w1, t, w2)
 The largest count wins
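A sketch of that recipe, assuming some way to ask a search engine for exact-phrase hit counts; get_hit_count() below is a placeholder, not a real API:

    def get_hit_count(phrase: str) -> int:
        """Placeholder for an exact-phrase hit count from a search engine."""
        raise NotImplementedError

    def web_baseline_choice(w1, w2, candidates=("principal", "principle")):
        # f(w1, t, w2): count the exact phrase "w1 t w2" for each candidate t
        return max(candidates, key=lambda t: get_hit_count(f'"{w1} {t} {w2}"'))

    # web_baseline_choice("school", "transfers") compares the hit counts of
    # "school principal transfers" and "school principle transfers".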
Web as a Baseline
 Works very well in some cases:
  machine translation candidate selection
  article generation
  noun compound interpretation
  noun compound bracketing
  adjective ordering
  (significantly better than, or not significantly different from, the best supervised algorithm, depending on the task)
 But lacking in others
 spelling correction
 countability detection
 prepositional phrase attachment
 How to push this idea further?
Using Unambiguous Cases
 The trick: look for unambiguous cases to start
 Use these to improve the results beyond what co-occurrence statistics indicate.
 An Early Example:
 Hindle and Rooth, “Structural Ambiguity and Lexical
Relations”, ACL ’90, Comp Ling’93
 Problem: Prepositional Phrase attachment
  I eat/v spaghetti/n1 with/p a fork/n2.
  I eat/v spaghetti/n1 with/p sauce/n2.
  quadruple: (v, n1, p, n2)
  Question: does n2 attach to v or to n1?
Using Unambiguous Cases
 How to do this with unlabeled data?
 First try:
 Parse some text into phrase structure
 Then compute certain co-occurrences
  f(v, n1, p), f(n1, p), f(v, n1)
 Problem: results not accurate enough
 The trick: look for unambiguous cases:
 Spaghetti with sauce is delicious. (pre-verbal)
  I eat it with a fork. (a PP cannot attach to a pronoun, so the attachment must be to the verb)
 Use these to improve the results beyond what co-occurrence statistics indicate.
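A much-simplified sketch of this bootstrapping idea (not Hindle & Rooth's actual model): harvest counts only from the unambiguous cases, then use those statistics to score the ambiguous ones. The data format, smoothing, and helper names below are illustrative assumptions.

    from collections import Counter

    verb_prep = Counter()   # f(v, p) from unambiguous verb attachments
    noun_prep = Counter()   # f(n1, p) from unambiguous noun attachments
    verb_tot = Counter()
    noun_tot = Counter()

    def observe_unambiguous(v, n1, p, kind):
        """kind: 'verb' (e.g. n1 is a pronoun) or 'noun' (e.g. the NP is pre-verbal)."""
        if kind == "verb":
            verb_prep[(v, p)] += 1
            verb_tot[v] += 1
        else:
            noun_prep[(n1, p)] += 1
            noun_tot[n1] += 1

    def attach(v, n1, p, smoothing=0.5):
        """Decide 'verb' or 'noun' attachment for an ambiguous (v, n1, p) triple."""
        p_verb = (verb_prep[(v, p)] + smoothing) / (verb_tot[v] + 1.0)
        p_noun = (noun_prep[(n1, p)] + smoothing) / (noun_tot[n1] + 1.0)
        return "verb" if p_verb >= p_noun else "noun"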
Unambiguous + Unlimited = Unsupervised
 Apply the Unambiguous Case Idea to the Very, Very Large
Corpora idea
 The potential of these approaches is not fully realized
 Our work:
  Structural Ambiguity Decisions (work with Preslav Nakov)
   PP-attachment
   Noun compound bracketing
   Coordination grouping
  Semantic Relation Acquisition
   Hypernym (ISA) relations
   Verbal relations between nouns
Structural Ambiguity Problems
 Apply the U + U = U idea to structural ambiguity
 Noun compound bracketing
 Prepositional Phrase attachment
 Noun Phrase coordination
 Motivation: BioText project
  In eukaryotes, the key to transcriptional regulation of the Heat Shock Response is the Heat Shock Transcription Factor (HSF).
  Open-labeled long-term study of the subcutaneous sumatriptan efficacy and tolerability in acute migraine treatment.
  BimL protein interact with Bcl-2 or Bcl-XL, or Bcl-w proteins (Immunoprecipitation (anti-Bcl-2 OR Bcl-XL or Bcl-w)) followed by Western blot (anti-EEtag) using extracts human 293T cells co-transfected with EE-tagged BimL and (bcl-2 or bcl-XL or bcl-w) plasmids)
Applying U + U = U to Structural Ambiguity
 We introduce the use of (nearly) unambiguous
features:
 surface features
 paraphrases
 Combined with very, very large corpora
 Achieve state-of-the-art results without labeled
examples.
Noun Compound Bracketing
(a) [ [ liver cell ] antibody ]   (left bracketing)
(b) [ liver [ cell line ] ]       (right bracketing)
In (a), the antibody targets the liver cell.
In (b), the cell line is derived from the liver.
Dependency Model
 right bracketing: [ w1 [ w2 w3 ] ]
  w2 w3 is a compound (modified by w1): home health care
  w1 and w2 independently modify w3: adult male rat
 left bracketing: [ [ w1 w2 ] w3 ]
  only one modificational choice possible: law enforcement officer
Related Work
 Marcus (1980), Pustejovsky et al. (1993), Resnik (1993)
  adjacency model: Pr(w1|w2) vs. Pr(w2|w3)
 Lauer (1995)
  dependency model: Pr(w1|w2) vs. Pr(w1|w3)
 Keller & Lapata (2004)
  use the Web
  unigrams and bigrams
 Girju et al. (2005)
  supervised model
  bracketing in context
  requires WordNet senses to be given
 Our approach:
  Web as data
  χ², n-grams
  paraphrases
  surface features
Computing Bigram Statistics
 Dependency model, frequencies:
  compare #(w1, w2) to #(w1, w3)
 Dependency model, probabilities:
  Pr(left)  = Pr(w1→w2 | w2) · Pr(w2→w3 | w3)
  Pr(right) = Pr(w1→w3 | w3) · Pr(w2→w3 | w3)
 The factor Pr(w2→w3 | w3) is shared, so we compare Pr(w1→w2 | w2) to Pr(w1→w3 | w3)
Probabilities: Estimation
 Using page hits as a proxy for n-gram counts
 Pr(w1w2|w2) = #(w1,w2) / #(w2)
 #(w2)
 #(w1,w2)
word frequency; query for “w2”
bigram frequency; query for “w1 w2”
 smoothed by 0.5
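Putting the last two slides together, a minimal sketch of the dependency-model decision using page hits; hits() is a placeholder for whatever search API is available, and the 0.5 smoothing follows the slide:

    def hits(phrase: str) -> int:
        """Placeholder for an exact-phrase page-hit count."""
        raise NotImplementedError

    def cond_prob(w_a: str, w_b: str) -> float:
        # Pr(w_a -> w_b | w_b) = #(w_a, w_b) / #(w_b), with counts smoothed by 0.5
        return (hits(f'"{w_a} {w_b}"') + 0.5) / (hits(f'"{w_b}"') + 0.5)

    def bracket(w1: str, w2: str, w3: str) -> str:
        # dependency model: left if w1 attaches to w2 more strongly than to w3
        return "left" if cond_prob(w1, w2) > cond_prob(w1, w3) else "right"

    # bracket("home", "health", "care") compares "home health" against "home care".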
Association Models: χ² (Chi-Squared)
 A = #(wi, wj)
 B = #(wi) – #(wi, wj)
 C = #(wj) – #(wi, wj)
 D = N – (A + B + C)
 N = A + B + C + D = 8 trillion (8 billion Web pages × 1,000 words each)
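The slide defines the 2×2 contingency table but not the score itself; the standard χ² statistic for such a table is N(AD − BC)² / ((A+B)(C+D)(A+C)(B+D)). A small sketch using that standard formula (the helper name is not from the talk):

    # N: approximate number of indexed word positions on the Web,
    # 8 billion pages x 1,000 words each (as on the slide)
    N = 8_000_000_000 * 1_000

    def chi_squared(count_ij, count_i, count_j, n=N):
        """chi2 = n * (A*D - B*C)**2 / ((A+B) * (C+D) * (A+C) * (B+D))"""
        A = count_ij              # #(wi, wj)
        B = count_i - count_ij    # #(wi) - #(wi, wj)
        C = count_j - count_ij    # #(wj) - #(wi, wj)
        D = n - (A + B + C)
        return n * (A * D - B * C) ** 2 / ((A + B) * (C + D) * (A + C) * (B + D))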
Web-derived Surface Features
 Authors often disambiguate noun compounds using
surface markers, e.g.:
  amino-acid sequence → left
  brain stem’s cell → left
  brain’s stem cell → right
 The enormous size of the Web makes these
frequent enough to be useful.
Web-derived Surface Features:
Dash (hyphen)
 Left dash
  cell-cycle analysis → left
 Right dash
  donor T-cell → right
  fiber optics-system → should be left…
 Double dash
  T-cell-depletion → unusable…
Web-derived Surface Features:
Possessive Marker
 Attached to the first word
  brain’s stem cell → right
 Attached to the second word
  brain stem’s cell → left
 Combined features
  brain’s stem-cell → right
Web-derived Surface Features:
Capitalization
 don’t-care – lowercase – uppercase
  Plasmodium vivax Malaria → left
  plasmodium vivax Malaria → left
 lowercase – uppercase – don’t-care
  brain Stem cell → right
  brain Stem Cell → right
 Disable this feature for:
  Roman numerals
  Single-letter words, e.g. vitamin D deficiency
Web-derived Surface Features:
Embedded Slash
 Left embedded slash
  leukemia/lymphoma cell → right
Web-derived Surface Features:
Parentheses
 Single-word
  growth factor (beta) → left
  (brain) stem cell → right
 Two-word
  (growth factor) beta → left
  brain (stem cell) → right
Web-derived Surface Features:
Comma, dot, semi-colon
 Following the first word
  home. health care → right
  adult, male rat → right
 Following the second word
  health care, provider → left
  lung cancer: patients → left
Web-derived Surface Features:
Dash to External Word
 External word to the left
  mouse-brain stem cell → right
 External word to the right
  tumor necrosis factor-alpha → left
Web-derived Surface Features:
Problems & Solutions
 Problem: search engines ignore punctuation in
queries
 “brain-stem cell” does not work
 Solution:
 query for “brain stem cell”
 obtain 1,000 document summaries
 scan for the features in these summaries
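A tiny illustration of the snippet-scanning step for one cue, the hyphen feature; the real system combines many such features, and the regexes and voting below are simplified assumptions:

    import re

    def hyphen_vote(snippets, w1, w2, w3):
        """Scan result summaries for 'w1-w2 w3' (left cue) vs. 'w1 w2-w3' (right cue)."""
        left_cue = re.compile(rf"\b{re.escape(w1)}-{re.escape(w2)}\s+{re.escape(w3)}\b", re.I)
        right_cue = re.compile(rf"\b{re.escape(w1)}\s+{re.escape(w2)}-{re.escape(w3)}\b", re.I)
        left = sum(len(left_cue.findall(s)) for s in snippets)
        right = sum(len(right_cue.findall(s)) for s in snippets)
        return "left" if left > right else "right" if right > left else "unknown"

    # snippets would be the ~1,000 result summaries returned for the query "w1 w2 w3";
    # e.g. hyphen_vote(snippets, "cell", "cycle", "analysis") -> "left" if
    # "cell-cycle analysis" dominates in the summaries.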
Other Web-derived Features:
Abbreviation
 After the second word
  tumor necrosis (TN) factor → left
 After the third word
  tumor necrosis factor (NF) → right
 We query for, e.g., “tumor necrosis tn factor”
 Problems:
  Roman numerals: IV, VI
  States: CA
  Short words: me
Other Web-derived Features:
Concatenation
 Consider health care reform
 healthcare : 79,500,000
 carereform : 269
 healthreform: 812
 Adjacency model
 healthcare vs. carereform
 Dependency model
 healthcare vs. healthreform
 Triples
 “healthcare reform” vs. “health carereform”
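A sketch of the concatenation comparison in its dependency flavor; hits() is again a placeholder for an exact-query page count, not a real API:

    def hits(query: str) -> int:
        """Placeholder for a search engine hit count."""
        raise NotImplementedError

    def concatenation_vote(w1: str, w2: str, w3: str) -> str:
        # dependency flavor: is w1 glued to w2 (e.g. "healthcare")
        # more often than to w3 (e.g. "healthreform")?
        return "left" if hits(w1 + w2) > hits(w1 + w3) else "right"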
Other Web-derived Features:
Reorder
 Reorders for “health care reform”
  “care reform health” → right
  “reform health care” → left
Other Web-derived Features:
Internal Inflection Variability
 Vary inflection of second word
 tyrosine kinase activation
 tyrosine kinases activation
Other Web-derived Features:
Switch The First Two Words
 Predict right if we can reorder “adult male rat” as “male adult rat”.
Paraphrases
 The semantics of a noun compound is often made
overt by a paraphrase (Warren,1978)
 Prepositional
  stem cells in the brain → right
  cells from the brain stem → left
 Verbal
  virus causing human immunodeficiency → left
 Copula
  office building that is a skyscraper → right
Paraphrases
 prepositional paraphrases:
 We use: ~150 prepositions
 verbal paraphrases:
 We use: associated with, caused by, contained in, derived
from, focusing on, found in, involved in, located at/in,
made of, performed by, preventing, related to and used
by/in/for.
 copula paraphrases:
 We use: is/was and that/which/who
 optional elements:
 articles: a, an, the
 quantifiers: some, every, etc.
 pronouns: this, these, etc.
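A rough sketch of how such paraphrase queries could be generated for a compound (w1, w2, w3). The preposition and verb lists are tiny samples standing in for the inventories above, and the query templates are simplified:

    PREPS = ["of", "for", "in", "from", "with", "at"]   # stand-in for ~150 prepositions
    VERBS = ["associated with", "caused by", "derived from", "found in", "used for"]

    def bracketing_paraphrases(w1, w2, w3):
        # right bracketing [w1 [w2 w3]]: "w2 w3" is treated as a unit
        right = [f'"{w2} {w3} {p} the {w1}"' for p in PREPS]
        right += [f'"{w2} {w3} {v} {w1}"' for v in VERBS]
        right += [f'"{w2} {w3} that is a {w1}"']
        # left bracketing [[w1 w2] w3]: "w1 w2" is treated as a unit
        left = [f'"{w3} {p} the {w1} {w2}"' for p in PREPS]
        left += [f'"{w3} {v} {w1} {w2}"' for v in VERBS]
        left += [f'"{w3} that is a {w1} {w2}"']
        return left, right

    # For (brain, stem, cell): the right-bracketing queries include "stem cell in the brain",
    # the left-bracketing queries include "cell from the brain stem"; comparing total hit
    # counts for the two sets gives a vote on the bracketing.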
Evaluation: Datasets
 Lauer Set
 244 noun compounds (NCs)
 from Grolier’s encyclopedia
 inter-annotator agreement: 81.5%
 Biomedical Set
 430 NCs
 from MEDLINE
  inter-annotator agreement: 88% (κ = .606)
Evaluation: Experiments
 Exact phrase queries
 Limited to English
 Inflections:
 Lauer Set: Carroll’s morphological tools
 Biomedical Set: UMLS Specialist Lexicon
Co-occurrence Statistics
 Lauer set
 Bio set
Paraphrase and Surface Features Performance
 Lauer Set
 Biomedical Set
Individual Surface Features Performance: Bio
Results Lauer
Results: Comparing with Others
Results Bio
Results for Noun Compound Bracketing
 Introduced search engine statistics that go
beyond the n-gram (applicable to other
tasks)
 surface features
 paraphrases
 Obtained new state-of-the-art results on NC
bracketing
 more robust than Lauer (1995)
 more accurate than Keller&Lapata (2004)
Prepositional Phrase Attachment
(a) Peter spent millions of dollars.   (noun attach)
(b) Peter spent time with his family.  (verb attach)
quadruple: (v, n1, p, n2)
(a) (spent, millions, of, dollars)
(b) (spent, time, with, family)
Noun Phrase Coordination
 (Modified) real sentence:
 The Department of Chronic Diseases and Health
Promotion leads and strengthens global efforts to
prevent and control chronic diseases or disabilities
and to promote health and quality of life.
NC coordination: ellipsis
 Ellipsis
 car and truck production
 means car production and truck production
 No ellipsis
 president and chief executive
 All-way coordination
 Securities and Exchange Commission
NC Coordination: ellipsis
 Quadruple (n1,c,n2,h)
 Penn Treebank annotations
 ellipsis:
(NP car/NN and/CC truck/NN production/NN).
 no ellipsis:
(NP (NP president/NN) and/CC (NP chief/NN
executive/NN))
 all-way: can be annotated either way
 This is a problem a parser must deal with.
 Collins’ parser always predicts ellipsis, but other parsers (e.g. Charniak’s) try to solve it.
Results
428 examples from Penn TB
Semantic Relation Detection
 Goal: automatically augment a lexical database
 Many potential relation types:
 ISA (hypernymy/hyponymy)
 Part-Of (meronymy)
 Idea: find unambiguous contexts which (nearly)
always indicate the relation of interest
Lexico-Syntactic Patterns
Adding a New Relation
Semantic Relation Detection
 Lexico-syntactic Patterns:
 Should occur frequently in text
 Should (nearly) always suggest the relation of interest
 Should be recognizable with little pre-encoded
knowledge.
 These patterns have been used extensively by
other researchers.
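One classic pattern of this kind is “NP_h such as NP_1, NP_2, …”, which (nearly) always signals that each NP_i is a kind of NP_h. A rough regex-based sketch, with intentionally naive noun-phrase matching:

    import re

    NP = r"\w+(?:\s(?!and\b|or\b|such\b)\w+)?"          # one or two words, naively
    SUCH_AS = re.compile(
        rf"(?P<hyper>{NP})\s*,?\s+such as\s+"
        rf"(?P<hypos>{NP}(?:\s*,\s*{NP})*(?:\s*,?\s*(?:and|or)\s+{NP})?)",
        re.IGNORECASE,
    )

    def extract_isa(text):
        """Yield (hyponym, hypernym) pairs from 'X such as Y, Z and W' contexts."""
        for m in SUCH_AS.finditer(text):
            hypernym = m.group("hyper")
            for hyponym in re.split(r"\s*,\s*|\s+(?:and|or)\s+", m.group("hypos")):
                if hyponym:
                    yield hyponym.strip(), hypernym.strip()

    # list(extract_isa("injuries such as bruises, wounds and broken bones"))
    #   -> [('bruises', 'injuries'), ('wounds', 'injuries'), ('broken bones', 'injuries')]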
Semantic Relation Detection
 What relationship holds between two nouns?
 olive oil – oil comes from olives
 machine oil – oil used on machines
 Assigning the meaning relations between these terms has been seen as a very difficult problem
 Our solution:
 Use clever queries against the web to figure out the
relations.
Queries for Semantic Relations
 Convert the noun-noun compound into a query of the form:
  noun2 that * noun1
  e.g., “oil that * olive(s)”
 This returns search result snippets containing interesting verbs.
 In this case:
  come from
  be obtained from
  be extracted from
  made from
  …
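A sketch of the query-and-extract step just described; fetch_snippets() is a placeholder for the search API, and the naive regex stands in for the tagging the real system applies to the snippets:

    import re
    from collections import Counter

    def fetch_snippets(query):
        """Hypothetical: return result summaries for an exact-phrase wildcard query."""
        raise NotImplementedError

    def relation_verbs(noun1, noun2, top_k=5):
        # e.g. relation_verbs("olives", "oil") issues the query "oil that * olives"
        query = f'"{noun2} that * {noun1}"'
        pattern = re.compile(
            rf"{re.escape(noun2)} that ((?:\w+\s+){{1,3}}){re.escape(noun1)}", re.I)
        counts = Counter()
        for snippet in fetch_snippets(query):
            for m in pattern.finditer(snippet):
                counts[m.group(1).strip().lower()] += 1
        return counts.most_common(top_k)   # e.g. "comes from", "is extracted from", ...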
Queries for Semantic Relations
 More examples:
 Migraine drug -> treat, be used for, reduce, prevent
 Wrinkle drug -> treat, be used for, reduce, smooth
 Printer tray -> hold, come with, be folded, fit under,
be inserted into
 Student protest -> be led by, be sponsored by, pit, be,
be organized by
Conclusions
 The enormous size of the web opens new
opportunities for text analysis
 There are many words, but they are more likely to
appear together in a huge dataset
 This allows us to do word-specific analysis
 We’ve applied it to structural and semantic
language problems.
 These are stepping stones towards sophisticated
language understanding.
Thank you!
http://biotext.berkeley.edu
Supported in part by NSF DBI-0317510
Using n-grams to make predictions
 Say we are trying to distinguish:
[home health] care
home [health care]
 Main idea: compare these co-occurrence
probabilities
 “home health” vs
 “health care”
Using n-grams to make predictions
 Use search engine page hits as a proxy for n-gram counts
 Compare Pr(w1→w2 | w2) to Pr(w1→w3 | w3)
  Pr(w1→w2 | w2) = #(w1, w2) / #(w2)
   #(w2): word frequency; query for “w2”
   #(w1, w2): bigram frequency; query for “w1 w2”
Probabilities: Why? (1)
 Why should we use:
 (a) Pr(w1w2|w2), rather than
 (b) Pr(w2w1|w1)?
 Keller&Lapata (2004) calculate:
 AltaVista queries:
 (a): 70.49%
 (b): 68.85%
 British National Corpus:
 (a): 63.11%
 (b): 65.57%
Probabilities: Why? (2)
 Why should we use:
 (a) Pr(w1w2|w2), rather than
 (b) Pr(w2w1|w1)?
 Maybe to introduce a bracketing prior.
 Just like Lauer (1995) did.
 But otherwise, no reason to prefer either one.
 Do we need probabilities? (association is OK)
 Do we need a directed model? (symmetry is OK)
Adjacency & Dependency (2)
 right bracketing: [ w1 [ w2 w3 ] ]
  w2 w3 is a compound (modified by w1)
  w1 and w2 independently modify w3
 adjacency model
  Is w2 w3 a compound? (vs. w1 w2 being a compound)
 dependency model
  Does w1 modify w3? (vs. w1 modifying w2)
Paraphrases: pattern (1)
(1) v n1 p n2 → v n2 n1   (noun attachment)
 Can we turn “n1 p n2” into a noun compound “n2 n1”?
  meet/v demands/n1 from/p customers/n2 → meet/v the customer/n2 demands/n1
 Problem: ditransitive verbs like give
  gave/v an apple/n1 to/p him/n2 → gave/v him/n2 an apple/n1
 Solution:
  no determiner before n1
  determiner before n2 is required
  the preposition cannot be to
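A sketch of how pattern (1) could be turned into a single web query for a (v, n1, p, n2) quadruple, applying the constraints above; hit counting and inflection handling are omitted:

    def pattern1_query(v, n1, p, n2):
        """v n1 p n2 -> "v DET n2 n1" (supports noun attachment), or None if blocked."""
        if p.lower() == "to":              # ditransitive trap: "gave an apple to him"
            return None
        return f'"{v} the {n2} {n1}"'      # determiner before n2, none before n1

    # pattern1_query("meet", "demands", "from", "customers") -> '"meet the customers demands"'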
Paraphrases: pattern (2)
(2) v n1 p n2 → v p n2 n1   (verb attachment)
 If “p n2” is an indirect object of v, then it could be switched with the direct object n1.
  had/v a program/n1 in/p place/n2 → had/v in/p place/n2 a program/n1
 A determiner before n1 is required to prevent “n2 n1” from forming a noun compound.
Paraphrases: pattern (3)
(3) v n1 p n2 → p n2 * v n1   (verb attachment)
 “*” indicates a wildcard position (up to three intervening words are allowed)
 Looks for appositions, where the PP has moved in front of the verb, e.g.
  I gave/v an apple/n1 to/p him/n2 → to/p him/n2 I gave/v an apple/n1
Paraphrases: pattern (4)
(4) v n1 p n2 → n1 p n2 v   (noun attachment)
 Looks for appositions, where “n1 p n2” has moved in front of v
  shaken/v confidence/n1 in/p markets/n2 → confidence/n1 in/p markets/n2 shaken/v
Paraphrases: pattern (5)
(5) v n1 p n2 → v PRONOUN p n2   (verb attachment)
 If n1 is a pronoun → verb attachment (Hindle & Rooth, 93)
 Pattern (5) substitutes n1 with a dative pronoun (him or her), e.g.
  put/v a client/n1 at/p odds/n2 → put/v him at/p odds/n2
Paraphrases: pattern (6)
(6) v n1 p n2 → BE n1 p n2   (noun attachment)
 BE (a form of to be) is typically used with a noun attachment
 Pattern (6) substitutes v with a form of to be (is or are), e.g.
  eat/v spaghetti/n1 with/p sauce/n2 → is spaghetti/n1 with/p sauce/n2