
Better Together:
Large Monolingual, Bilingual and
Multimodal Corpora in Natural
Language Processing
Shane Bergsma
Johns Hopkins University
Fall, 2011
Research Vision
Robust processing of human language requires knowledge beyond what’s in small manually-annotated data sets.
Many NLP successes exploit web-scale raw data:
• Google Translate
• IBM’s Watson
• Things people use every day: spelling correction, speech recognition, etc.
More data is better data
[Figure: results of Banko & Brill, 2001 on a grammar correction task at Microsoft, with accuracy continuing to improve as training data grows]
This Talk
Derive lots of knowledge from web-scale data
and apply to syntax, semantics, discourse:
1) Raw text on the web (Google N-grams)
→ Part 1: Non-referential pronouns
2) Bilingual text (words plus their translations)
→ Part 2: Parsing noun phrases
3) Visual data (labelled online images)
→ Part 3: Learning the meaning of words
Search Engines for NLP
• Early web work: use an Internet search engine to get data [Keller & Lapata, 2003]
– “Britney Spears”: 269,000,000 pages
– “Britany Spears”: 693,000 pages
Search Engines
• Search Engines for NLP: some objections
– Scientific: not reproducible, unreliable
[Kilgarriff, 2007, “Googleology is bad science.”]
– Practical: Too slow for millions of queries
N-grams
• Google N-gram Data [Brants & Franz, 2006]
– N words in sequence + their count on web
– A compressed version of all the text on web
• 24 GB zipped fits on your hard drive
– Enables better features for a range of tasks
[Bergsma et al. ACL 2008, IJCAI 2009, ACL 2010, etc.]
Part 1: Non-Referential Pronouns
E.g. the word “it” in English
• “You can make it in advance.”
– referential (50-75%)
• “You can make it in Hollywood.”
– non-referential (25-50%)
Non-Referential Pronouns
• [Hirst, 1981]: detect non-referential
pronouns, “lest precious hours be lost in
bootless searches for textual referents.”
• Most existing pronoun/coreference
systems just ignore the problem
• A common ambiguity:
– “it” comprises 1% of English tokens
Non-Ref Detection as Classification
• Input:
s = “You can make it in advance”
• Output:
Is “it” a non-referential pronoun in s?
Method: train a supervised classifier to make
this decision on the basis of some features
[Evans, 2001, Boyd et al. 2005, Müller 2006]
A Machine Learning Approach
h(x) = w ∙ x
(predict non-ref if h(x) > 0)
• Typical ‘lexical’ features: binary indicators of context:
x = (previous-word=make, next-word=in, previous-two-words=can+make, …)
• Use training data to learn good values for the
weights, w
– Classifier learns, e.g., to give negative weight to
PPs immediately preceding ‘it’ (e.g. … from it)
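To make the setup concrete, here is a minimal Python sketch of a linear classifier over binary context features; the feature names follow the slide, but the weights and the scoring helper are toy illustrations, not the learned model:

# Sketch: linear classifier h(x) = w . x over binary context
# features. Feature names match the slide; weights are toy values.

def lexical_features(tokens, i):
    """Binary indicator features for the pronoun at position i."""
    return {
        "previous-word=" + tokens[i - 1],
        "next-word=" + tokens[i + 1],
        "previous-two-words=" + tokens[i - 2] + "+" + tokens[i - 1],
    }

def h(features, w):
    """Dot product with weight dict w; predict non-ref if h > 0."""
    return sum(w.get(f, 0.0) for f in features)

tokens = ["You", "can", "make", "it", "in", "advance"]
w = {"previous-word=make": -0.3, "next-word=in": 0.1}  # toy weights
print(h(lexical_features(tokens, 3), w) > 0)           # False here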
Better: Features from the Web
[Bergsma, Lin, Goebel, ACL 2008]
• Convert sentence to a context pattern:
“make ____ in advance”
• Collect counts from the web:
– “make it/them in advance”
• 442 vs. 449 occurrences in Google N-gram Data
– “make it/them in Hollywood”
• 3421 vs. 0 occurrences in Google N-gram Data
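A sketch of the count lookup, with a toy dictionary standing in for the Google N-gram Data (the counts are the ones quoted on the slide):

# Sketch: fill the blank with each candidate and compare web counts.
ngram_counts = {
    "make it in advance": 442, "make them in advance": 449,
    "make it in hollywood": 3421, "make them in hollywood": 0,
}

def filler_counts(pattern, fillers=("it", "them")):
    return {f: ngram_counts.get(pattern.replace("_", f), 0)
            for f in fillers}

print(filler_counts("make _ in advance"))    # {'it': 442, 'them': 449}
print(filler_counts("make _ in hollywood"))  # {'it': 3421, 'them': 0}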
Applying the Web Counts
• How wide should the patterns span?
– We can use all that the Google N-gram Data allows; for the pronoun position above, e.g. the four 4-grams:
You can make _
can make _ in
make _ in advance
_ in advance .
– In total: five 5-grams, four 4-grams, three 3-grams and two bigrams
• What fillers to use? (e.g. it, they/them, any NP?)
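A sketch of enumerating those context patterns; context_patterns() is a hypothetical helper that yields every 2- to 5-gram window spanning the pronoun position:

# Sketch: enumerate every N-gram context pattern (N = 2..5) that
# spans the pronoun position i, as bounded by the N-gram data.

def context_patterns(tokens, i, max_n=5):
    for n in range(2, max_n + 1):
        for start in range(i - n + 1, i + 1):
            if start < 0 or start + n > len(tokens):
                continue  # window falls off the sentence
            window = tokens[start:start + n]
            window[i - start] = "_"
            yield " ".join(window)

tokens = "You can make it in advance .".split()
for p in context_patterns(tokens, 3):
    print(p)  # two bigrams, three 3-grams, ... (as the sentence allows)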
Web Count Features
One log-count feature per context pattern, for each filler:
“it”:
  5-grams: log-cnt(“You can make it in”), log-cnt(“can make it in advance”), log-cnt(“make it in advance .”), ...
  4-grams: log-cnt(“You can make it”), log-cnt(“can make it in”), ...
  ...
“them”:
  5-grams: log-cnt(“You can make them in”), ...
  ...
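A sketch of assembling these features; web_count_features() and its lookup argument are illustrative stand-ins for a real query against the N-gram counts, and patterns could come from the context_patterns() sketch earlier:

import math

# Sketch: one real-valued log-count feature per (filler, pattern)
# pair, mirroring the layout above.

def web_count_features(patterns, lookup, fillers=("it", "them")):
    feats = {}
    for filler in fillers:
        for pattern in patterns:
            ngram = pattern.replace("_", filler)
            feats["log-cnt(" + ngram + ")"] = math.log(lookup(ngram) + 1)
    return feats

toy = {"make it in advance": 442, "make them in advance": 449}
print(web_count_features(["make _ in advance"],
                         lambda g: toy.get(g, 0)))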
A Machine Learning Approach Revisited
h(x) = w ∙ x
(predict non-ref if h(x) > 0)
• Typical features: binary indicators of context:
x = (previous-word=make, next-word=in, previous-two-words=can+make, …)
• New features: real-valued counts in web text:
x = (log-cnt(“make it in advance”), log-cnt(“make them in advance”), log-cnt(“make * in advance”), …)
• Key conclusion: classifiers with web features are
robust on new domains! [Bergsma, Pitler, Lin, ACL 2010]
NADA [Bergsma & Yarowsky, DAARC 2011]
• Non-Anaphoric Detection Algorithm:
– a system for identifying non-referential pronouns
http://code.google.com/p/nada-nonref-pronoun-detector/
• Works on raw sentences; no parsing/tagging
of input needed
• Classifies ‘it’ in up to 20,000 sentences/second
• It works well out-of-domain, thanks to its web count features
Using web counts works great…
but is it practical?
All N-grams in the Google N-gram corpus                          93 GB
Extract N-grams of length-4 only                                 33 GB
Extract N-grams containing it, they, them only                   500 MB
Lower-case, truncate tokens to four characters, replace special
  tokens (e.g. named entities, pronouns, digits) with symbols    189 MB
Encode tokens (6 bytes) and values (2 bytes), store only
  changes from previous line                                     44 MB
gzip resulting file                                              33 MB
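A rough Python sketch of the filtering and normalization stages; the input and output file names are hypothetical, and the byte-level delta encoding is only summarized by the final gzip step:

import gzip
import re

PRONOUNS = {"it", "they", "them"}

def normalize(token):
    """Lower-case, truncate to four chars, collapse special tokens."""
    if re.fullmatch(r"\d+", token):
        return "<NUM>"
    return token.lower()[:4]

def filter_ngrams(lines):
    """Keep only 4-grams containing it/they/them, normalized."""
    for line in lines:
        ngram, count = line.rstrip("\n").rsplit("\t", 1)
        tokens = ngram.split()
        if len(tokens) == 4 and PRONOUNS & {t.lower() for t in tokens}:
            yield " ".join(normalize(t) for t in tokens) + "\t" + count

with open("google-4grams.txt") as src, \
        gzip.open("filtered-4grams.gz", "wt") as out:
    for row in filter_ngrams(src):
        out.write(row + "\n")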
NADA versus Other Systems
[Bar chart: precision, recall, F-score and accuracy (y-axis 35 to 85) for Paice & Husk, Charniak & Elsner, and NADA]
Part 1: Conclusion
• N-gram data better than search engines
• Classifiers with N-gram counts are very
effective, particularly on new domains
• But we needed a large corpus of manually-annotated data to learn how to use the counts
• We’ll see now how bilingual data can provide
the supervision (for some problems)
Part 2: Coordination Ambiguity in NPs
1) [dairy and meat] production
2) [sustainability] and [meat production]
yes: [dairy production] in (1)
no: [sustainability production] in (2)
[Bergsma, Yarowsky & Church, ACL 2011]
• new semantic features from raw web text
and a new approach to using bilingual data
as soft supervision
Coordination Ambiguity
• Words whose POS tags match pattern:
[DT|PRP$] (N.*|J.*) and [DT|PRP$] (N.*|J.*) N.*
• Output: Decide if one NP or two
• Resolving coordination is a classic hard problem
– Treebank doesn’t annotate NP-internal structure
– Modern parsers thus do very poorly on these
decisions (78% Minipar, 79% for C&C parser)
– For training/evaluation, we patched Treebank with
Vadas & Curran ’07 NP annotations
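A sketch of candidate detection as a regular expression over POS tags; joining the tags into one string and treating the determiner slots as optional are choices made for this sketch, not necessarily the paper’s implementation:

import re

PATTERN = re.compile(
    r"((DT|PRP\$) )?(N\S*|J\S*) CC ((DT|PRP\$) )?(N\S*|J\S*) N\S*")

def is_candidate(tags):
    """tags: POS sequence from any tagger, e.g. ['NN','CC','NN','NN']."""
    return PATTERN.search(" ".join(tags)) is not None

# "dairy/NN and/CC meat/NN production/NN"
print(is_candidate(["NN", "CC", "NN", "NN"]))  # True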
One Noun Phrase or Two:
A Machine Learning Approach
Input: “dairy and meat production”→ features: x
x = (…, first-noun=dairy, …
second-noun=meat, …
first+second-noun=dairy+meat, …)
h(x) = w ∙ x
(predict one NP if h(x) > 0)
• Set w via training on annotated training data
using some machine learning algorithm
Leveraging Web-Derived Knowledge
[dairy and meat] production
• If there is only one NP, then it is implicitly talking
about “dairy production”
• Do we see this phrase occurring a lot on the web? [Yes]
sustainability and [meat production]
• If there is only one NP, then it is implicitly talking
about “sustainability production”
• Do we see this phrase occurring a lot on the web? [No]
• Classifier has features for these counts
– But the web can give us more!
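A sketch of that count feature; implicit_np_feature() is a hypothetical helper, and the toy counts only illustrate the contrast between the two examples:

import math

# Sketch: if "[n1 and n2] head" is one NP, the implied phrase
# "n1 head" should itself be common on the web.

def implicit_np_feature(n1, head, lookup):
    return math.log(lookup(n1 + " " + head) + 1)

toy = {"dairy production": 50000}  # stand-in web counts
print(implicit_np_feature("dairy", "production",
                          lambda p: toy.get(p, 0)))          # high
print(implicit_np_feature("sustainability", "production",
                          lambda p: toy.get(p, 0)))          # 0.0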
Features for Explicit Paraphrases
Number the parts: ❶ and ❷ ❸
  dairy❶ and meat❷ production❸
  sustainability❶ and meat❷ production❸

Pattern ❸ of ❶ and ❷:
  ↑ Count(production of dairy and meat)
  ↓ Count(production of sustainability and meat)
Pattern ❷ ❸ and ❶:
  ↓ Count(meat production and dairy)
  ↑ Count(meat production and sustainability)

New paraphrases extending ideas in [Nakov & Hearst, 2005]
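A sketch that generates the paraphrase strings whose web counts become features, for the two patterns above (the dict keys are just labels for this sketch):

def paraphrase_queries(n1, n2, head):
    return {
        "3-of-1-and-2": head + " of " + n1 + " and " + n2,
        "2-3-and-1": n2 + " " + head + " and " + n1,
    }

print(paraphrase_queries("dairy", "meat", "production"))
# {'3-of-1-and-2': 'production of dairy and meat',
#  '2-3-and-1': 'meat production and dairy'}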
[Diagram: human-annotated data (small) plus Google N-gram raw data (HUGE) yield training examples and feature vectors x1, x2, x3, x4; machine learning produces the classifier h(x)]
Using Bilingual Data
• Bilingual data: a rich source of paraphrases
dairy and meat production → producción láctea y cárnica
• Build a classifier which uses bilingual features
– Applicable when we know the translation of the NP
Bilingual “Paraphrase” Features
dairy❶ and meat❷ production❸ vs. sustainability❶ and meat❷ production❸

Pattern ❸ ❶ … ❷ (Spanish):
  Count(producción láctea y cárnica) for “dairy and meat production”; unseen for “sustainability and meat production”
Pattern ❶ … ❸ ❷ (Italian):
  unseen for “dairy and meat production”; Count(sostenibilità e la produzione di carne) for “sustainability and meat production”
Bilingual “Paraphrase” Features (continued)

Pattern ❶- … ❷❸ (Finnish):
  Count(maidon- ja lihantuotantoon) for “dairy and meat production”; unseen for “sustainability and meat production”
[Diagram: human-annotated data (small) plus bilingual data (medium) with features from the translation data yield training examples and feature vectors; machine learning produces a second classifier h(xb)]
Co-Training: [Yarowsky ’95], [Blum & Mitchell ’98]
• Two classifiers over the same training examples: h(xm) uses features from the Google data, h(xb) uses features from the translation data
• Each classifier labels bitext examples (e.g. “coal and steel money”, “rocket and mortar attacks”, “business and computer science”, “the environment and air transport”, “the Bosporus and Dardanelles straits”), and its confident predictions become new training examples for the other, iterating: h(xm)1, h(xb)1, h(xm)2, …
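A minimal co-training loop in the spirit of these slides; the classifier interface (fit/predict/confidence), the number of rounds, and the batch size k are illustrative assumptions, not the paper’s settings:

def cotrain(h_m, h_b, labeled, unlabeled, rounds=10, k=100):
    pool = list(unlabeled)
    data_m, data_b = list(labeled), list(labeled)
    for _ in range(rounds):
        h_m.fit(data_m)  # view 1: monolingual N-gram features
        h_b.fit(data_b)  # view 2: bilingual translation features
        # each view labels its k most confident pool examples
        by_b = sorted(pool, key=h_b.confidence, reverse=True)[:k]
        by_m = sorted(pool, key=h_m.confidence, reverse=True)[:k]
        data_m += [(x, h_b.predict(x)) for x in by_b]  # b teaches m
        data_b += [(x, h_m.predict(x)) for x in by_m]  # m teaches b
        for x in by_b + by_m:
            if x in pool:
                pool.remove(x)
    return h_m, h_b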
[Plot: error rate (%) of the co-trained classifiers h(xb)i and h(xm)i over co-training iterations]
[Bar chart: error rate (%) on the Penn Treebank (PTB), y-axis 0 to 20, for broad-coverage parsers, unsupervised Nakov & Hearst (2005), Pitler et al. (2010) (800 PTB training examples), the new supervised monoclassifier (800 PTB training examples), and the co-trained monoclassifier h(xm)N (2 training examples)]
Part 2: Conclusion
• Knowledge from large-scale monolingual
corpora is crucial for parsing noun phrases
– New paraphrase features
• New way to use bilingual data as soft
supervision to guide the use of monolingual
features
Part 3: Using visual data to learn the
meaning of words
• Large volumes of visual data also reveal meaning (semantics), but in a language-universal way
• Humans label their images as they post them online, providing the word-meaning link
• There are lots of images to work with [from Facebook’s Twitter feed]
[Image grids: English web images labelled cockatoo, turtle, candle; Spanish web images labelled cacatúa, tortuga, vela]
[Bergsma and Van Durme, IJCAI 2011]
Linking bilingual words by web-based
visual similarity
Step 1: Retrieve online images via Google Image
Search (in each lang.), 20 images for each word
– Google competitive with “hand-prepared
datasets” [Fergus et al., 2005]
Step 2: Create Image Feature Vectors
• Color histogram features
• SIFT keypoint features, using David Lowe’s software [Lowe, 2004]
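A sketch of the color-histogram half of the feature vector using numpy; the bin count and per-channel normalization are assumptions of this sketch (the SIFT keypoints came from Lowe’s released software):

import numpy as np

def color_histogram(image, bins=8):
    """image: H x W x 3 RGB array -> concatenated channel histograms."""
    feats = []
    for channel in range(3):
        hist, _ = np.histogram(image[:, :, channel],
                               bins=bins, range=(0, 256))
        feats.append(hist / hist.sum())
    return np.concatenate(feats)

img = (np.random.rand(64, 64, 3) * 255).astype(np.uint8)  # stand-in
print(color_histogram(img).shape)  # (24,)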
Step 3: Compute an Aggregate Similarity for Two Words
• Compute vector cosine similarity between image pairs (e.g. 0.33, 0.55, 0.19, 0.46)
• For each English image, take the best match among the foreign images, then average over all English images
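A sketch of the aggregate similarity computation described above, assuming each image has already been reduced to a feature vector:

import numpy as np

# Aggregate similarity of two words = average, over the English
# images, of the best cosine match among the foreign images.

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def aggregate_similarity(english_imgs, foreign_imgs):
    best = [max(cosine(e, f) for f in foreign_imgs)
            for e in english_imgs]
    return sum(best) / len(best)

# toy usage with random 24-dim feature vectors
eng = [np.random.rand(24) for _ in range(4)]
spa = [np.random.rand(24) for _ in range(4)]
print(aggregate_similarity(eng, spa))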
Output: Ranking of Foreign Translations by Aggregate Visual Similarities

English: rosary
  Spanish: 1. camándula: 0.151, 2. puntaje: 0.140, 3. accidentalidad: 0.139, …
  French: 1. chapelet: 0.213, 2. activité: 0.153, 3. rosaire: 0.150, …

Lots of details in the paper:
• Finding a class of words where this works (physical objects)
• Comparing visual similarity to string similarity (cognate finder)
Task #2: Lexical Semantics from Images
Selectional preference: is noun X a plausible object for verb Y?
• Can you eat “migas”?
• Can you eat “carillon”?
• Can you eat “mamey”?
[Bergsma and Goebel, RANLP 2011]
Conclusion
• Robust NLP needs to look beyond human-annotated data to exploit large corpora
• Size matters:
– Many NLP systems trained on 1 million words
– We use:
• billions of words in bitexts
• trillions of words of monolingual text
• online images: hundreds of billions (×1,000 words each → 100 trillion words!)
Questions + Thanks
• Gold sponsors: [logos]
• Platinum sponsors (collaborators):
– Kenneth Church (Johns Hopkins), Randy Goebel (Alberta),
Dekang Lin (Google), Emily Pitler (Penn), Benjamin Van Durme
(Johns Hopkins) and David Yarowsky (Johns Hopkins)