Meaning from Text: Teaching Computers to Read
Steven Bethard
University of Colorado
Query: “Who is opposing the railroad through Georgia?”
1 en.wikipedia.org/wiki/Sherman's_March_to_the_Sea
…they destroyed the railroads and the manufacturing and agricultural infrastructure of the state…
Henry Clay Work wrote the song Marching Through Georgia…
…
3 www.ischool.berkeley.edu/~mkduggan/politics.html
While the piano piece "Marching Through Georgia" has no words...
Party of California (1882) has several verses opposing the "railroad robbers"...
…
71 www.azconsulatela.org/brazaosce.htm
Azerbaijan, Georgia and Turkey plan to start construction of Kars-Akhalkalaki-Tbilisi-Baku railroad in May, 2007…
However, we’ve witnessed a very strong opposition to this project both in Congress and White House. President George Bush signed a bill prohibiting financing of this railroad…
What went wrong?
Didn’t find some similar word forms (Morphology)
  Finds opposing but not opposition
  Finds railroad but not railway
Didn’t know how words should be related (Syntax)
  Looking for: opposing railroad
  Finds: opposing the “railroad robbers”
Didn’t know that “is opposing” means current (Semantics/Tense)
  Looking for: recent documents
  Finds: Civil War documents
Didn’t know that “who” means a person (Semantics/Entities)
  Looking for: <person> opposing
  Finds: several verses opposing
Teaching Linguistics to Computers
Natural Language Processing (NLP)
  Symbolic approaches
  Statistical approaches
Machine learning overview
Statistical NLP
  Example: Identifying people and places
  Example: Constructing timelines
Early Natural Language Processing
Symbolic approaches
Small domains
Example: SHRDLU block world
  Vocabulary of ~50 words
  Simple word combinations
  Hand-written rules to understand sentences

Person: WHAT DOES THE BOX CONTAIN?
Comp: THE BLUE PYRAMID.
Person: WHAT IS THE PYRAMID SUPPORTED BY?
Comp: THE BOX.
Person: HOW MANY BLOCKS ARE NOT IN THE BOX?
Comp: SEVEN OF THEM.
Recent Natural Language Processing
Large-scale linguistic corpora
  e.g. Penn TreeBank: ~1 million words of syntax, with trees like this one for “George Bush signed the bill”:

  (sentence
    (noun-phrase (proper-noun George) (proper-noun Bush))
    (verb-phrase signed
      (noun-phrase (determiner the) (noun bill))))

Statistical machine learning
  e.g. Charniak parser
  Trained on the TreeBank
  Builds new trees with 90% accuracy
Machine Learning
General approach
  Analyze data
  Extract preferences
  Classify new examples using learned preferences
Supervised machine learning
  Data have human-annotated labels
  e.g. each sentence in the TreeBank has a syntactic tree
  Learns human preferences
Supervised Machine Learning Models
Given:
  An N-dimensional feature space
  Points in that space
  A human-annotated label for each point
Goal:
  Learn a function to assign labels to points
Methods:
  K-nearest-neighbors, support vector machines, etc.
[Figure: a two-dimensional feature space of labeled points, with an unlabeled “?” point to classify]
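To make “learn a function to assign labels to points” concrete, here is a minimal sketch of one of the methods named above, k-nearest-neighbors, in Python; the toy points and labels are invented for illustration.

from collections import Counter
import math

def knn_classify(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest labeled points."""
    # Distance from the query to every labeled training point
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Majority vote among the k closest neighbors
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy two-dimensional feature space (invented for illustration)
points = [(0.0, 0.1), (0.2, 0.0), (0.9, 1.0), (1.0, 0.8)]
labels = ["circle", "circle", "square", "square"]
print(knn_classify(points, labels, (0.1, 0.2)))  # -> circle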
Machine Learning Examples
Character Recognition
  Feature space: 256 pixels (0 = black, 1 = white)
  Labels: A, B, C, …
Cardiac Arrhythmia
  Feature space: age, sex, heart rate, …
  Labels: has arrhythmia, doesn’t have arrhythmia
Mushrooms
  Feature space: cap shape, gill color, stalk surface, …
  Labels: poisonous, edible
… and many more:
http://www.ics.uci.edu/~mlearn/MLRepository.html
Machine Learning and Language
Example: Identifying people, places, organizations (named entities)

However, we’ve witnessed a very strong opposition to this project both in [ORG Congress] and [ORG White House]. President [PER George Bush] signed a bill prohibiting financing of this railroad.

This doesn’t look like that lines-and-dots example!
What’s the classification problem?
What’s the feature space?
Named Entities: Classification
Word-by-word classification: is the word beginning, inside or outside of a named entity?

Word      | Label
in        | Outside
Congress  | Begin-ORG
and       | Outside
White     | Begin-ORG
House     | Inside-ORG
.         | Outside
President | Outside
George    | Begin-PER
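As a sketch of how these Begin/Inside/Outside labels map back to entity spans, here is a small Python function; the helper name bio_to_entities is invented for illustration.

def bio_to_entities(words, labels):
    """Collect (entity type, text) spans from Begin-/Inside-/Outside labels."""
    entities, span_words, span_type = [], [], None
    for word, label in zip(words, labels):
        # Any label other than a matching Inside- closes the open entity
        if not (label.startswith("Inside-") and span_type == label[7:]):
            if span_type is not None:
                entities.append((span_type, " ".join(span_words)))
            span_type, span_words = None, []
        if label.startswith("Begin-"):
            span_type, span_words = label[6:], [word]
        elif label.startswith("Inside-") and span_type is not None:
            span_words.append(word)
    if span_type is not None:
        entities.append((span_type, " ".join(span_words)))
    return entities

words = ["in", "Congress", "and", "White", "House", "."]
labels = ["Outside", "Begin-ORG", "Outside", "Begin-ORG", "Inside-ORG", "Outside"]
print(bio_to_entities(words, labels))  # [('ORG', 'Congress'), ('ORG', 'White House')]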
Named Entities: Clues
The word itself
  U.S. is always a Location
  (though Turkey is not)
Part of speech
  The Locations Turkey and Georgia are nouns
  (though the White of White House is not)
Is the first letter of the word capitalized?
  Bush and Congress are capitalized
  (though the von of von Neumann is not)
Is the word at the start of the sentence?
  In the middle of a sentence, Will is likely a Person
  (but at the start it could be an auxiliary verb)
Named Entities: Clues as Features
Each clue defines part of the feature space
Word      | Part of Speech | Starts Sent | Initial Caps | Label
in        | preposition    | False       | False        | Outside
Congress  | noun           | False       | True         | Begin-ORG
and       | conjunction    | False       | False        | Outside
White     | adjective      | False       | True         | Begin-ORG
House     | noun           | False       | True         | Inside-ORG
.         | punctuation    | False       | False        | Outside
President | noun           | True        | True         | Outside
George    | noun           | False       | True         | Begin-PER
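A minimal sketch, in Python, of extracting the four clues in the table above for each word; it assumes part-of-speech tags are already available from an earlier tagging step.

def word_features(words, pos_tags):
    """One dictionary of clue features per word."""
    return [
        {
            "word": word,
            "part_of_speech": pos,
            "starts_sentence": index == 0,
            "initial_caps": word[:1].isupper(),
        }
        for index, (word, pos) in enumerate(zip(words, pos_tags))
    ]

words = ["President", "George", "Bush", "signed", "the", "bill"]
pos_tags = ["noun", "noun", "noun", "verb", "determiner", "noun"]
print(word_features(words, pos_tags)[0])
# {'word': 'President', 'part_of_speech': 'noun', 'starts_sentence': True, 'initial_caps': True}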
Named Entities: String Features
But machine learning models need numeric features!
  True → 1
  False → 0
  Congress → ?
  adjective → ?
Solution: a binary feature for each word

String        | Numeric Features
destroyed     | 1 0 0 0 0
the           | 0 1 0 0 0
railroads     | 0 0 1 0 0
and           | 0 0 0 1 0
the           | 0 1 0 0 0
manufacturing | 0 0 0 0 1
and           | 0 0 0 1 0
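A minimal sketch of this one-binary-feature-per-word (one-hot) encoding in Python, using the words from the table above; the helper name is invented for illustration.

def one_hot_encode(words):
    """Map each word to a binary vector with a single 1 in that word's column."""
    vocabulary = list(dict.fromkeys(words))  # distinct words, first-occurrence order
    column = {word: i for i, word in enumerate(vocabulary)}
    return [
        [1 if column[word] == i else 0 for i in range(len(vocabulary))]
        for word in words
    ]

sentence = "destroyed the railroads and the manufacturing and".split()
for word, vector in zip(sentence, one_hot_encode(sentence)):
    print(f"{word:13}", vector)
# destroyed     [1, 0, 0, 0, 0]
# the           [0, 1, 0, 0, 0]
# railroads     [0, 0, 1, 0, 0]
# ...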
Named Entities: Review
…[ORG Congress] and [ORG White House]…

Word     | Part of Speech | Starts Sent | Initial Caps | Label
Congress | noun           | False       | True         | Begin-ORG
and      | conjunction    | False       | False        | Outside
White    | adjective      | False       | True         | Begin-ORG
House    | noun           | False       | True         | Inside-ORG

As numeric features:

Word    | Part of Speech | Starts Sent | Initial Caps | Label
1 0 0 0 | 1 0 0          | 0           | 1            | Begin-ORG
0 1 0 0 | 0 1 0          | 0           | 0            | Outside
0 0 1 0 | 0 0 1          | 0           | 1            | Begin-ORG
0 0 0 1 | 1 0 0          | 0           | 1            | Inside-ORG
Named Entities: Features and Models
String features:
  word itself
  part of speech
  starts sentence
  has initial capitalization
How many numeric features?
  N = N_words + N_parts-of-speech + 1 + 1
  N_words ≈ 10,000
  N_parts-of-speech ≈ 50
Need efficient implementations, e.g. TinySVM
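A minimal sketch of how the four string features concatenate into a single numeric vector of size N; the tiny vocabulary and tag set below stand in for the ~10,000 words and ~50 parts of speech.

def encode(word, pos, starts_sentence, vocabulary, pos_tags):
    """One-hot word + one-hot part of speech + two binary clues."""
    word_part = [1 if word == entry else 0 for entry in vocabulary]
    pos_part = [1 if pos == tag else 0 for tag in pos_tags]
    return word_part + pos_part + [int(starts_sentence), int(word[:1].isupper())]

vocabulary = ["in", "Congress", "and", "White", "House"]        # stand-in for ~10,000 words
pos_tags = ["noun", "conjunction", "adjective", "preposition"]  # stand-in for ~50 tags
vector = encode("Congress", "noun", False, vocabulary, pos_tags)
print(vector)       # [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
print(len(vector))  # N = 5 + 4 + 1 + 1 = 11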
Named Entities in Use
We know how to:
  View named entity recognition as classification
  Convert clues to an N-dimensional feature space
  Train a machine learning model
How can we use the model?
Named Entities in Search Engines
Named Entities in Research
TREC-QA
  Factoid question answering
  Various research systems compete
  All use named entity matching
State of the art performance: ~90%
  That’s 10% wrong!
  But good enough for real use
Named entities are a “solved” problem
So what’s next?
Learning Timelines
The top commander of a Cambodian resistance force said Thursday he has sent a team to recover the remains of a British mine removal expert kidnapped and presumed killed by Khmer Rouge guerrillas almost two years ago.
Learning Timelines
The top commander of a Cambodian resistance force said Thursday he has sent a team to recover the remains of a British mine removal expert kidnapped and presumed killed by Khmer Rouge guerrillas almost two years ago.

Temporal relations:
  almost two years ago includes kidnapped
  kidnapped before killed
  almost two years ago includes killed
  killed before sent
  sent before said
  Thursday includes said
  said before recover
Why Learn Timelines?
Timelines are summarization
  1996: Khmer Rouge kidnapped and killed British mine removal expert
  1998: Cambodian commander sent recovery team
  …
Timelines allow reasoning
  Q: When was the expert kidnapped?
  A: Almost two years ago.
  Q: Was the team sent before the expert was killed?
  A: No, afterwards.
Learning Timelines: Classification
Standard questions:
  What’s the classification problem?
  What’s the feature space?
Three different problems:
  Identify times
  Identify events
  Identify links (temporal relations)
Times and Events: Classification
Word-by-word classification

Time features:
  word itself
  has digits
  …
Event features:
  word itself
  suffixes (e.g. -ize, -tion)
  root (e.g. evasion → evade)
  …

Word      | Part of Speech | Label
The       | determiner     | Outside
company   | noun           | Outside
’s        | possessive     | Outside
sales     | noun           | Outside
force     | noun           | Outside
applauded | verb           | Begin-Event
the       | determiner     | Outside
shake     | noun           | Begin-Event
up        | particle       | Inside-Event
.         | punctuation    | Outside
Times and Events: State of the Art
Performance:
  Times: ~90%
  Events: ~80%

Mr Bryza, it's been [Event reported] that Azerbaijan, Georgia and Turkey [Event plan] to [Event start] [Event construction] of Kars-Akhalkalaki-Tbilisi-Baku railroad in [Time May], [Time 2007].

Why are events harder?
  No orthographic cues (capitalization, digits, etc.)
  More parts of speech (nouns, verbs and adjectives)
Temporal Links
Everything so far looked like:
Aaaa [X bb] ccccc [Y dd eeeee] fff [Z gggg]
But now we want this:
  Aaaa [X bb] ccccc [Y dd eeeee] fff ggg
  (with a temporal link between span X and span Y)
Word-by-word classification won’t work!
Temporal Links: Classification
Pairwise classification
  Each event with each time

Saddam Hussein [Time today] [Event sought] [Event peace] on another front by [Event promising] to [Event withdraw] from Iranian territory and [Event release] soldiers [Event captured] during the Iran-Iraq [Event war].

Event     | Time  | Label
sought    | today | During
peace     | today | After
promising | today | During
withdraw  | today | After
release   | today | After
captured  | today | Before
war       | today | Before
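A minimal sketch of how pairwise classification generates one instance per event-time pair, in Python; the classify step mentioned in the comment is a hypothetical stand-in for a trained model.

from itertools import product

def event_time_pairs(events, times):
    """Pair every event with every time; each pair is one classification instance."""
    return list(product(events, times))

events = ["sought", "peace", "promising", "withdraw", "release", "captured", "war"]
times = ["today"]
for event, time in event_time_pairs(events, times):
    # A trained model would predict Before / During / After for each pair
    print(event, "<->", time)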
Temporal Links: Clues
Tense of the event
  said (past tense) is probably Before today
  says (present tense) is probably During today
Nearby temporal expression
  In “said today”, said is During today
  In “captured in 1989”, captured is During 1989
Negativity
  In “People believe this”, believe is During today
  In “People don’t believe this any more”, believe is Before today
Temporal Links: Features
Saddam Hussein [Time today] [Event sought] [Event peace] on another front by [Event promising] to [Event withdraw] from Iranian territory…

Event     | Time  | Tense      | Nearby Time | Negativity | Label
sought    | today | past       | today       | positive   | During
peace     | today | none       | none        | positive   | After
promising | today | present    | none        | positive   | During
withdraw  | today | infinitive | none        | positive   | After

As numeric features:

Event   | Time | Tense   | Nearby Time | Negativity | Label
1 0 0 0 | 0    | 1 0 0 0 | 1 0         | 0          | During
0 1 0 0 | 0    | 0 1 0 0 | 0 1         | 0          | After
0 0 1 0 | 0    | 0 0 1 0 | 0 1         | 0          | During
0 0 0 1 | 0    | 0 0 0 1 | 0 1         | 0          | After
Temporal Links: State of the Art
Corpora with temporal links:
  PropBank: verbs and subjects/objects
  TimeBank: certain pairs of events (e.g. reporting event and event reported)
  TempEval A: events and times in the same sentence
  TempEval B: events in a document and document time
Performance on TempEval data:
  Same-sentence links (A): ~60%
  Document time links (B): ~80%
What will make timelines better?
Larger corpora
  TempEval is only ~50 documents
  Treebank is ~2400
More types of links
  Event-time pairs for all events
  (TempEval only considers high-frequency events)
  Event-event pairs in the same sentence
Summary
Statistical NLP asks:
  What’s the classification problem?
    Word-by-word? Pairwise?
  What’s the feature space?
    What are the linguistic clues?
    What does the N-dimensional space look like?
Statistical NLP needs:
  Learning algorithms efficient when N is very large
  Large-scale corpora with linguistic labels
Future Work: Automate this!
References
Symbolic NLP
  Terry Winograd. 1972. Understanding Natural Language. Academic Press.
Statistical NLP
  Daniel M. Bikel, Richard Schwartz, and Ralph M. Weischedel. 1999. “An Algorithm that Learns What's in a Name.” Machine Learning.
  Kadri Hacioglu, Ying Chen, and Benjamin Douglas. 2005. “Automatic Time Expression Labeling for English and Chinese Text.” In Proceedings of CICLing-2005.
  Ellen M. Voorhees and Hoa Trang Dang. 2005. “Overview of the TREC 2005 Question Answering Track.” In Proceedings of the Fourteenth Text REtrieval Conference.
References
Corpora
  Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. “Building a Large Annotated Corpus of English: The Penn Treebank.” Computational Linguistics, 19:313-330.
  Martha Palmer, Dan Gildea, and Paul Kingsbury. 2005. “The Proposition Bank: A Corpus Annotated with Semantic Roles.” Computational Linguistics, 31(1).
  James Pustejovsky, Patrick Hanks, Roser Saurí, Andrew See, Robert Gaizauskas, Andrea Setzer, Dragomir Radev, Beth Sundheim, David Day, Lisa Ferro, and Marcia Lazo. 2003. “The TIMEBANK Corpus.” In Proceedings of Corpus Linguistics 2003: 647-656.
Feature Windowing (1)
Problem: word-by-word classification gives no context
Solution: include surrounding features

Word      | Part of Speech | Label
The       | determiner     | Outside
company   | noun           | Outside
’s        | possessive     | Outside
sales     | noun           | Outside
force     | noun           | Outside
applauded | verb           | Begin-Event
the       | determiner     | Outside
shake     | noun           | Begin-Event
up        | particle       | Inside-Event
.         | punctuation    | Outside
Feature Windowing (2)
From previous word: features, label
From current word: features
From following word: features
Need special values like !START! and !END!

Word-1 | Word0 | Word+1 | POS-1 | POS0 | POS+1 | Label-1 | Label0
the    | shake | up     | DT    | NN   | PRT   | Outside | Begin
shake  | up    | .      | NN    | PRT  | O     | Begin   | Inside
up     | .     | !END!  | PRT   | O    | !END! | Inside  | Outside
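A minimal sketch of building these windowed features in Python; the !START!/!END! padding follows the slide, and the feature names are illustrative.

def windowed_features(words, pos_tags, labels):
    """Combine features of the previous, current, and following word at each position."""
    padded_words = ["!START!"] + words + ["!END!"]
    padded_pos = ["!START!"] + pos_tags + ["!END!"]
    rows = []
    for i in range(len(words)):
        rows.append({
            "word-1": padded_words[i], "word0": padded_words[i + 1], "word+1": padded_words[i + 2],
            "pos-1": padded_pos[i], "pos0": padded_pos[i + 1], "pos+1": padded_pos[i + 2],
            # At training time the previous label is known; at prediction time it
            # comes from the classifier's own previous decision
            "label-1": labels[i - 1] if i > 0 else "!START!",
        })
    return rows

words = ["the", "shake", "up", "."]
pos_tags = ["DT", "NN", "PRT", "."]
labels = ["Outside", "Begin", "Inside", "Outside"]
print(windowed_features(words, pos_tags, labels)[2])
# {'word-1': 'shake', 'word0': 'up', 'word+1': '.', 'pos-1': 'NN', 'pos0': 'PRT',
#  'pos+1': '.', 'label-1': 'Begin'}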
Evaluation: Precision, Recall, F
precision = (# entities predicted correctly) / (# entities predicted)
recall = (# entities predicted correctly) / (# entities actually present)
F = (2 × precision × recall) / (precision + recall)
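A minimal sketch of these three measures in Python; the counts in the example are invented for illustration.

def f_measure(num_correct, num_predicted, num_actual):
    """Precision, recall, and their harmonic mean F."""
    precision = num_correct / num_predicted
    recall = num_correct / num_actual
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# e.g. 8 of 10 predicted entities correct, with 16 entities actually present
print(f_measure(8, 10, 16))  # (0.8, 0.5, 0.615...)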