
Three kinds of web data that can help computers make better sense of human language
Shane Bergsma
Johns Hopkins University
Fall, 2011
Computers that understand language
“William Wilkinson’s ‘An Account of the Principalities of Wallachia and Moldavia’ inspired this author’s most famous novel.”
Research Vision
Robust processing of human language requires knowledge beyond what’s in small manually-annotated data sets.
Derive meaning from real-world data:
1) Raw text on the web
2) Bilingual text (words plus their translations)
→ Part 1: Parsing noun phrases
3) Visual data (labelled online images)
→ Part 2: Learning the meaning of words
Part 1: Parsing Noun Phrases (NPs)
Google: What pages/ads should be returned for the query “washed baby carrots”?
[washed baby] carrots (carrots for washed babies)
vs.
washed [baby carrots] (baby carrots that are washed)
Training a parser via machine learning
washed baby carrots → PARSER (with weights w0) → [washed baby] carrots
TESTER: INCORRECT according to the training data
Training a parser via machine learning
washed baby carrots → PARSER (with weights w1) → washed [baby carrots]
TESTER: CORRECT according to the gold standard
Training corpus:
retired [science teacher]
[social science] teacher
female [bus driver]
[school bus] driver
zebra [hair straightener]
alleged [Canadian lover]
…
More data is better data
[Learning curve for a grammar-correction task, showing accuracy improving as training data grows; Banko & Brill, 2001]
Testing a parser on new data
washed baby smell → PARSER (with final weights wN) → washed [baby smell]
TESTER: INCORRECT
Big Challenge: For parsing NPs, every word matters:
– both parses are grammatical
– we can’t generalize from “washed baby carrots” in training to “washed baby smell” at test time
– having seen washed [baby carrots] in training, the parser wrongly predicts washed [baby smell]
Solution: New sources of data
English Data for Parsing
Human-annotated: Penn (Parse-)Treebank [Marcus et al., 1993]
• 1 MILLION words
Bitexts: Canadian Hansards, etc. [Callison-Burch et al., 2010]
• 1 BILLION words
Web text (N-grams): Google N-gram Data [Brants & Franz, 2006]
• 1 TRILLION words
Task: Parsing NPs with conjunctions
1) [dairy and meat] production
2) [sustainability] and [meat production]
yes: the implicit NP “dairy production” is plausible in (1)
no: the implicit NP “sustainability production” is not plausible in (2)
• Our contributions: new semantic features from raw web text, and a new approach to using bilingual data as soft supervision
[Bergsma, Yarowsky & Church, ACL 2011]
One Noun Phrase or Two: A Machine Learning Approach
Input: “dairy and meat production” → features: x
x = (…, first-noun=dairy, …, second-noun=meat, …, first+second-noun=dairy+meat, …)
h(x) = w ∙ x (predict one NP if h(x) > 0)
• Set w via training on annotated training data using some machine learning algorithm
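A minimal sketch of this classifier in Python (illustrative names and toy weights, not the authors’ code), assuming sparse binary features stored in a dict:

```python
# Sketch of the one-NP-or-two linear classifier: h(x) = w . x over
# sparse binary features. Names and weights are illustrative only.

def extract_features(first, second, head):
    """Features for an NP of the form '<first> and <second> <head>'."""
    return {
        "first-noun=" + first: 1.0,
        "second-noun=" + second: 1.0,
        "first+second-noun=" + first + "+" + second: 1.0,
    }

def h(w, x):
    """Linear score w . x over a sparse feature dict."""
    return sum(w.get(feat, 0.0) * val for feat, val in x.items())

# Toy weight vector; in practice w is set by training on annotated data.
w = {"first+second-noun=dairy+meat": 1.2, "first-noun=sustainability": -0.8}
x = extract_features("dairy", "meat", "production")
print("one NP" if h(w, x) > 0 else "two NPs")
```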
Leveraging Web-Derived Knowledge
[dairy and meat] production
• If there is only one NP, then it is implicitly talking about “dairy production”
• Do we see this phrase occurring a lot on the web? [Yes]
sustainability and [meat production]
• If there is only one NP, then it is implicitly talking about “sustainability production”
• Do we see this phrase occurring a lot on the web? [No]
• Classifier has features for these counts
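As a hedged sketch of what such a count feature might look like in code (the dict below stands in for a web-scale lookup, seeded with the page counts quoted on the next slide):

```python
# Sketch: web-count feature for the implicit phrase. The counts here are
# the page counts quoted in the talk; a real system queries web-scale data.
from math import log

web_counts = {"dairy production": 714000, "sustainability production": 11000}

def implicit_phrase_feature(first, head):
    """Log of the web count of the implicit phrase '<first> <head>'."""
    phrase = first + " " + head
    return {"log-count": log(web_counts.get(phrase, 0) + 1)}

print(implicit_phrase_feature("dairy", "production"))           # high count
print(implicit_phrase_feature("sustainability", "production"))  # low count
```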
Search Engine Page Counts for NLP
• Early web work: use an Internet search engine to get web counts [Keller & Lapata, 2003]
“dairy production”: 714,000 pages
“sustainability production”: 11,000 pages
Problem: Using a search engine is just too inefficient to get data on a large scale
Google N-gram Data for NLP
• Google N-gram Data [Brants & Franz, 2006]
– N words in sequence + their count on the web:
…
dairy producers 22724
dairy production 17704
dairy professionals 204
dairy profits 82
dairy propaganda 15
dairy protein 1268
…
– A compressed version of all the text on the web
– Enables new features/statistics for a range of tasks
[Bergsma et al. ACL 2008, IJCAI 2009, ACL 2010, etc.]
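As a sketch, counts like these can be loaded from the data’s tab-separated “n-gram&lt;TAB&gt;count” lines (the file name below is an assumption):

```python
# Sketch: load bigram counts from an N-gram file with lines like
# "dairy production\t17704". The file name is an assumption.
counts = {}
with open("2gms.txt", encoding="utf-8") as f:
    for line in f:
        ngram, count = line.rstrip("\n").rsplit("\t", 1)
        counts[ngram] = int(count)

print(counts.get("dairy production", 0))
```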
Features for Explicit Paraphrases
❶ and ❷ ❸: dairy and meat production
❶ and ❷ ❸: sustainability and meat production
Pattern “❸ of ❶ and ❷”:
↑ Count(production of dairy and meat)
↓ Count(production of sustainability and meat)
Pattern “❷ ❸ and ❶”:
↓ Count(meat production and dairy)
↑ Count(meat production and sustainability)
New paraphrases extending ideas in [Nakov & Hearst, 2005]
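A sketch of instantiating these two patterns as features (the phrase templates follow the slide; the count function is an assumed hook into the N-gram data):

```python
# Sketch: instantiate the two explicit paraphrase patterns for an NP
# "<n1> and <n2> <head>" and turn their web counts into features.
def paraphrase_features(n1, n2, head, count):
    """count: assumed function mapping a phrase string to its web count."""
    patterns = {
        "3-of-1-and-2": f"{head} of {n1} and {n2}",  # "production of dairy and meat"
        "2-3-and-1": f"{n2} {head} and {n1}",        # "meat production and dairy"
    }
    return {name: count(phrase) for name, phrase in patterns.items()}

feats = paraphrase_features("dairy", "meat", "production",
                            count=lambda p: 0)  # stub count function
print(feats)
```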
Building the monolingual classifier:
Human-Annotated Data (small) + Google N-gram Raw Data (HUGE)
→ Training Examples → Feature Vectors x1, x2, x3, x4 → Machine Learning → Classifier: h(x)
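A sketch of this pipeline with scikit-learn (logistic regression is an assumption; the talk does not name the learner):

```python
# Sketch of the training pipeline: feature dicts -> vectors -> classifier.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

examples = [  # (feature dict, label); 1 = one NP, 0 = two NPs
    ({"first+second-noun=dairy+meat": 1.0, "log-count": 12.1}, 1),
    ({"first-noun=sustainability": 1.0, "log-count": 9.3}, 0),
]
vec = DictVectorizer()
X = vec.fit_transform([feats for feats, _ in examples])
y = [label for _, label in examples]
h = LogisticRegression().fit(X, y)  # the classifier h(x)
```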
Using Bilingual Data
• Bilingual data: a rich source of paraphrases
dairy and meat production ↔ producción láctea y cárnica
• Build a classifier which uses bilingual features
– Applicable when we know the translation of the NP
Bilingual “Paraphrase” Features
❶ and ❷ ❸: dairy and meat production
❶ and ❷ ❸: sustainability and meat production
Pattern “❸ ❶ … ❷” (Spanish):
dairy example: Count(producción láctea y cárnica)
sustainability example: unseen
Pattern “❶ … ❸ ❷” (Italian):
dairy example: unseen
sustainability example: Count(sostenibilità e la produzione di carne)
Bilingual “Paraphrase” Features (continued)
Pattern “❶- … ❷❸” (Finnish):
dairy example: Count(maidon- ja lihantuotantoon)
sustainability example: unseen
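A hedged sketch of how such bilingual pattern features might be encoded, assuming we have the NP’s translated patterns and a predicate telling us whether each was seen in the bilingual data:

```python
# Sketch: binary bilingual "paraphrase" features. `translations` maps a
# language to the translated pattern string; `seen_in_bitext` is an assumed
# predicate over the bilingual data.
def bilingual_features(translations, seen_in_bitext):
    return {
        "pattern-seen:" + lang: 1.0 if seen_in_bitext(phrase) else 0.0
        for lang, phrase in translations.items()
    }

feats = bilingual_features(
    {"es": "producción láctea y cárnica", "fi": "maidon- ja lihantuotantoon"},
    seen_in_bitext=lambda p: True,  # stub predicate
)
print(feats)
```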
Building the bilingual classifier:
Human-Annotated Data (small) + Bilingual Translation Data (medium)
→ Training Examples → Feature Vectors x1, x2, x3, x4 → Machine Learning → Classifier: h(xb)
Co-Training: [Yarowsky ’95], [Blum & Mitchell ’98]
Two views of the same coordination examples:
• h(xm): training examples (e.g., “coal and steel money”, “rocket and mortar attacks”) + features from Google data
• h(xb): the same training examples + features from translation data
• h(xb)1 labels new bitext examples (e.g., “business and computer science”, “the environment and air transport”, “the Bosporus and Dardanelles straits”), which are added to the training examples
• h(xm)1 is then trained on the enlarged set, and the process repeats
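A minimal co-training sketch (illustrative, not the authors’ exact procedure): the two classifiers, each over its own feature view, take turns labeling unlabeled examples, keeping only confident predictions:

```python
# Minimal co-training sketch: the bilingual view h(xb) and the monolingual
# view h(xm) alternately label unlabeled examples for each other.
from sklearn.linear_model import LogisticRegression

def cotrain(Xm, Xb, seed_labels, unlabeled, rounds=5, threshold=0.95):
    """Xm, Xb: feature matrices (two views of the same examples).
    seed_labels: {index: label}; unlabeled: set of indices."""
    labeled = dict(seed_labels)
    for _ in range(rounds):
        for X in (Xb, Xm):  # bilingual view first, then monolingual
            idx = sorted(labeled)
            clf = LogisticRegression().fit(X[idx], [labeled[i] for i in idx])
            for i in list(unlabeled):
                probs = clf.predict_proba(X[i:i + 1])[0]
                if probs.max() >= threshold:  # confident: adopt the label
                    labeled[i] = clf.classes_[probs.argmax()]
                    unlabeled.remove(i)
    return labeled
```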
Error rate (%) of co-trained classifiers
[Figure: error rates of h(xb)i and h(xm)i over co-training iterations i]
Error rate (%) on Penn Treebank (PTB)
[Bar chart, y-axis 0 to 20%, comparing: Broad-coverage Parsers; Nakov & Hearst (2005), unsupervised; Pitler et al. (2010), 800 PTB training examples; New Supervised Monoclassifier, 800 PTB training examples; Co-trained Monoclassifier h(xm)N, 2 training examples]
Part 1: Conclusion
• Knowledge from large-scale monolingual corpora is crucial for parsing noun phrases
– New paraphrase features
• New way to use bilingual data as soft supervision to guide the use of monolingual features
• Next steps: use bilingual data even when we don’t know the translations to begin with
– infer translations jointly with syntax
– i.e., beyond bitexts (1B words), make use of huge (1T+ word) N-gram corpora in English, Spanish, French, …
Part 2: Using visual data to learn the meaning of words
• Large volumes of visual data also reveal word meaning (semantics), but in a language-universal way
• Humans label their images as they post them online, providing the word-meaning link
• There are lots of images to work with [from Facebook’s Twitter feed]
Part 2: Using visual data to learn the meaning of words
Progress in the area of “lexical semantics”
Task #1: learning translations of words into foreign languages using visual data, e.g. “turtle” in English = “tortuga” in Spanish
Main contribution: a totally new approach to building bilingual dictionaries
[Bergsma and Van Durme, IJCAI 2011]
English Web Images: cockatoo, turtle, candle
Spanish Web Images: cacatúa, tortuga, vela
Task #1: Bilingual Lexicon Induction
• Why?
– Needed for automatic machine translation, cross-language information retrieval, etc.
– Poor coverage of human-compiled dictionaries/bitexts
• How to do it with monolingual data only?
– Link words to information that is preserved across languages (clues to common meaning)
Clues to Common Meaning: Spelling
[Koehn & Knight 2002, many others]
natural-natural
higiénico-hygienic
radón-radon
vela-candle
*calle-candle (spelling falsely suggests calle, “street”, rather than vela)
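A sketch of a simple spelling clue, normalized edit distance (the talk cites Koehn & Knight 2002; this particular measure is illustrative, not necessarily theirs):

```python
# Sketch: normalized edit distance as a spelling clue for translation pairs.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def spelling_similarity(a, b):
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

print(spelling_similarity("radón", "radon"))   # high: likely cognates
print(spelling_similarity("vela", "candle"))   # low: spelling alone fails here
```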
Clues to Common Meaning: Images
calle / candle / vela
Visual similarities between the “candle” and “vela” images:
• high contrast
• black background
• glowing flame
Link words by web-based visual similarity
Step 1: Retrieve online images via Google Image Search (in each language), 20 images for each word
– Google is competitive with “hand-prepared datasets” [Fergus et al., 2005]
Step 2: Create Image Feature Vectors
• Color histogram features
• SIFT keypoint features, using David Lowe’s software [Lowe, 2004]
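A sketch of the color-histogram half of this step (the binning details are an assumption; requires Pillow and NumPy):

```python
# Sketch: a simple per-channel RGB color-histogram feature vector.
import numpy as np
from PIL import Image

def color_histogram(path, bins=8):
    """Concatenated per-channel RGB histograms, normalized to sum to 1."""
    rgb = np.asarray(Image.open(path).convert("RGB"), dtype=np.uint8)
    hists = [np.histogram(rgb[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    v = np.concatenate(hists).astype(float)
    return v / v.sum()
```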
Step 3: Compute an Aggregate Similarity for Two Words
• Compute the vector cosine similarity for each pair of images (e.g., 0.33, 0.55, 0.19, 0.46)
• For each English image, keep the best match among the foreign word’s images
• Average these best-match similarities over all English images
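In code, the aggregation might look like this sketch (feature vectors as NumPy arrays):

```python
# Sketch of the aggregate similarity: average, over the English images, of
# each image's best cosine match among the foreign word's images.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def aggregate_similarity(english_vecs, foreign_vecs):
    best = [max(cosine(e, f) for f in foreign_vecs) for e in english_vecs]
    return sum(best) / len(best)
```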
Output: Ranking of Foreign Translations by Aggregate Visual Similarities
English: rosary
Spanish: 1. camándula: 0.151, 2. puntaje: 0.140, 3. accidentalidad: 0.139, …
French: 1. chapelet: 0.213, 2. activité: 0.153, 3. rosaire: 0.150, …
Experiments
• 500-word lists in each language
• Results on all pairs from German, English, Spanish, French, Italian, Dutch
• Avg. Top-N Accuracy: how often is the correct answer among the N most similar words?
– Lots more details in the paper, including how we determine which words are ‘physical objects’
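A sketch of the metric, assuming we know the rank of the correct translation for each source word:

```python
# Sketch: average Top-N accuracy over source words, given the 1-based rank
# of the correct translation in each word's similarity ranking.
def top_n_accuracy(ranks, n):
    return sum(r <= n for r in ranks) / len(ranks)

ranks = [1, 3, 25, 2, 7]          # toy ranks for five source words
print(top_n_accuracy(ranks, 1))   # Top-1
print(top_n_accuracy(ranks, 20))  # Top-20
```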
Average Top-N Accuracy on 14 Language Pairs
[Bar chart: Top-1 and Top-20 accuracy (%), y-axis 0 to 80]
Task #2: Lexical Semantics from Images
Selectional Preference: Is noun X a plausible object for verb Y?
Can you eat “migas”? Can you eat “carillon”? Can you eat “mamey”?
[Bergsma and Goebel, RANLP 2011]
Conclusion
• Robust NLP needs to look beyond human-annotated data to exploit large corpora
• Size matters:
– Most parsing systems are trained on 1 million words
– We use:
• billions of words in bitexts
• trillions of words of monolingual text
• hundreds of billions of online images (× 1000 words each → 100 trillion words!)
Questions + Thanks
• Gold sponsors:
• Platinum sponsors (collaborators):
– Kenneth Church (Johns Hopkins), Randy Goebel (Alberta), Dekang Lin (Google), Emily Pitler (Penn), Benjamin Van Durme (Johns Hopkins) and David Yarowsky (Johns Hopkins)