SIMS 290-2: Applied Natural Language Processing: Marti Hearst


SIMS 290-2:
Applied Natural Language Processing
Marti Hearst
October 13, 2004
1
Today
Finish hand-built rule systems
Machine Learning approaches to information extraction
Sliding Windows
Rule-learners (older)
Feature-based ML (more recent)
IE tools
2
Two kinds of NE approaches
Knowledge Engineering
rule based
developed by experienced language engineers
make use of human intuition
requires only a small amount of training data
development can be very time-consuming
some changes may be hard to accommodate
Learning Systems
use statistics or other machine learning
developers do not need LE expertise
requires large amounts of annotated training data
some changes may require re-annotation of the entire training corpus
annotators are cheap (but you get what you pay for!)
Adapted from slides by Cunningham & Bontcheva
3
Baseline: list lookup approach
System that recognises only entities stored in its lists (gazetteers)
Advantages – simple, fast, language independent, easy to retarget (just create lists)
Disadvantages – impossible to enumerate all names; collection and maintenance of lists; cannot deal with name variants; cannot resolve ambiguity
Adapted from slides by Cunningham & Bontcheva
4
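To make the list-lookup baseline concrete, here is a minimal Python sketch (not from the original slides); the gazetteer entries and the longest-match-first strategy are illustrative choices:

GAZETTEER = {
    ("New", "York"): "LOCATION",
    ("United", "Nations"): "ORGANIZATION",
    ("Paris",): "LOCATION",
}
MAX_LEN = max(len(entry) for entry in GAZETTEER)

def lookup_entities(tokens):
    """Return (start, end, type) spans, preferring the longest match."""
    spans, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in GAZETTEER:
                spans.append((i, i + n, GAZETTEER[tuple(tokens[i:i + n])]))
                i += n  # longest match wins; skip past it
                break
        else:
            i += 1
    return spans

print(lookup_entities("He flew from New York to Paris".split()))
# [(3, 5, 'LOCATION'), (6, 7, 'LOCATION')]

Note how the disadvantages listed above show up immediately: "NY" or "New York City" would be missed unless every variant is added to the lists.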
Creating Gazetteer Lists
Online phone directories and yellow pages for person and organisation names (e.g., [Paskaleva02])
Location lists
US GEOnet Names Server (GNS) data – 3.9 million locations with 5.37 million names (e.g., [Manov03])
UN site: http://unstats.un.org/unsd/citydata
Global Discovery database from Europa technologies Ltd, UK (e.g., [Ignat03])
Automatic collection from annotated training data
Adapted from slides by Cunningham & Bontcheva
5
Rule-based Example: FACILE
FACILE - used in MUC-7 [Black et al 98]
Uses Inxight’s LinguistiX tools for tagging and morphological analysis
Database for external information, role similar to a gazetteer
Linguistic info per token, encoded as feature vector:
Text offsets
Orthographic pattern (first/all capitals, mixed, lowercase)
Token and its normalised form
Syntax – category and features
Semantics – from database or morphological analysis
Morphological analyses
Example:
(1192 1196 10 T C "Mrs." "mrs." (PROP TITLE) (^PER_CIV_F) (("Mrs." "Title" "Abbr")) NIL)
PER_CIV_F – female civilian (from database)
Adapted from slides by Cunningham & Bontcheva
6
FACILE
Context-sensitive rules written in a special rule notation, executed by an interpreter
Writing rules directly in Perl is too error-prone and hard
Rules of the kind:
A => B\C/D, where:
A is a set of attribute-value expressions and an optional score; the attributes refer to elements of the input token feature vector
B and D are the left and right context respectively and can be empty
B, C, D are sequences of attribute-value pairs and Kleene regular-expression operations; variables are also supported
[syn=NP, sem=ORG] (0.9) =>
\ [norm="university"],
[token="of"],
[sem=REGION|COUNTRY|CITY] / ;
Adapted from slides by Cunningham & Bontcheva
7
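As an illustration only (this is not FACILE's actual interpreter), a stripped-down matcher for rules of the shape A => B\C/D, without scores, Kleene operators, or variables, could look like this in Python. Tokens are feature dictionaries, and each constraint maps an attribute to its set of allowed values:

def matches(token, constraint):
    return all(token.get(attr) in allowed for attr, allowed in constraint.items())

def apply_rule(tokens, left, body, right, assign):
    """Return (start, end, assign) wherever left\\body/right matches."""
    results = []
    for i in range(len(tokens) - len(body) + 1):
        if i < len(left) or i + len(body) + len(right) > len(tokens):
            continue  # not enough room for the contexts
        ok = all(matches(tokens[i - len(left) + j], c) for j, c in enumerate(left))
        ok = ok and all(matches(tokens[i + j], c) for j, c in enumerate(body))
        ok = ok and all(matches(tokens[i + len(body) + j], c) for j, c in enumerate(right))
        if ok:
            results.append((i, i + len(body), assign))
    return results

# The "university of <place>" rule from the previous slide, minus its score:
tokens = [{"norm": "university"}, {"token": "of"}, {"sem": "CITY"}]
print(apply_rule(tokens,
                 left=[], right=[],
                 body=[{"norm": {"university"}},
                       {"token": {"of"}},
                       {"sem": {"REGION", "COUNTRY", "CITY"}}],
                 assign={"syn": "NP", "sem": "ORG"}))
# [(0, 3, {'syn': 'NP', 'sem': 'ORG'})]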
FACILE
# Rule for the mark-up of person names when the first name is not
# present or known from the gazetteers, e.g. 'Mr J. Cass'
[SYN=PROP,SEM=PER, FIRST=_F, INITIALS=_I, MIDDLE=_M,
LAST=_S] #_F, _I, _M, _S are variables, transfer info from RHS
=>
[SEM=TITLE_MIL|TITLE_FEMALE|TITLE_MALE]
\[SYN=NAME, ORTH=I|O, TOKEN=_I]?,
[ORTH=C|A, SYN=PROP, TOKEN=_F]?,
[SYN=NAME, ORTH=I|O, TOKEN=_I]?,
[SYN=NAME, TOKEN=_M]?,
[ORTH=C|A|O,SYN=PROP,TOKEN=_S, SOURCE!=RULE]
#proper name, not recognised by a rule
/;
Adapted from slides by Cunningham & Bontcheva
8
FACILE
Preference mechanism:
The rule with the highest score is preferred
Longer matches are preferred to shorter matches
Results are always one semantic categorisation of the named entity in the text
Evaluation (MUC-7 scores):
Organization: 86% precision, 66% recall
Person: 90% precision, 88% recall
Location: 81% precision, 80% recall
Dates: 93% precision, 86% recall
Adapted from slides by Cunningham & Bontcheva
9
Extraction by Sliding Window
E.g., looking for the seminar location:

GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

CMU UseNet Seminar Announcement
Slide adapted from William Cohen's
11
A Naïve Bayes Sliding Window Model
[Freitag 1997]
… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
prefix: w_{t-m} … w_{t-1}   contents: w_t … w_{t+n}   suffix: w_{t+n+1} … w_{t+n+m}
Estimate Pr(LOCATION|window) using Bayes rule
Try all “reasonable” windows (vary length, position)
Assume independence for length, prefix words, suffix words, content words
Estimate from data quantities like: Pr(“Place” in prefix|LOCATION)
If Pr(LOCATION|“Wean Hall Rm 5409”) is above some threshold, extract it
Other examples of sliding windows: [Baluja et al 2000] (decision tree over individual words & their context)
Slide adapted from William Cohen's
15
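A minimal sketch of this scoring scheme in Python, with made-up probability tables standing in for estimates from annotated training data, and an arbitrary threshold:

import math

class NBWindowScorer:
    """Freitag-style Naive Bayes window scoring (illustrative sketch)."""
    def __init__(self, prior, length_probs, prefix_probs, content_probs, suffix_probs):
        self.prior = prior                  # Pr(LOCATION)
        self.length_probs = length_probs    # Pr(window length | LOCATION)
        self.prefix_probs = prefix_probs    # Pr(word in prefix | LOCATION)
        self.content_probs = content_probs  # Pr(word in contents | LOCATION)
        self.suffix_probs = suffix_probs    # Pr(word in suffix | LOCATION)

    def log_score(self, prefix, contents, suffix):
        # log Pr(LOCATION, window) under the independence assumptions;
        # unseen words get a small smoothing probability
        s = math.log(self.prior) + math.log(self.length_probs.get(len(contents), 1e-6))
        for table, words in ((self.prefix_probs, prefix),
                             (self.content_probs, contents),
                             (self.suffix_probs, suffix)):
            s += sum(math.log(table.get(w, 1e-6)) for w in words)
        return s

    def extract(self, tokens, max_len=4, context=2, threshold=-30.0):
        # try all "reasonable" windows, varying length and position
        found = []
        for i in range(len(tokens)):
            for n in range(1, max_len + 1):
                if i + n > len(tokens):
                    break
                score = self.log_score(tokens[max(0, i - context):i],
                                       tokens[i:i + n],
                                       tokens[i + n:i + n + context])
                if score > threshold:
                    found.append((" ".join(tokens[i:i + n]), score))
        return found

scorer = NBWindowScorer(prior=0.01,
                        length_probs={4: 0.3},
                        prefix_probs={"Place": 0.5, ":": 0.4},
                        content_probs={"Wean": 0.3, "Hall": 0.3, "Rm": 0.2, "5409": 0.1},
                        suffix_probs={"Speaker": 0.4, ":": 0.4})
print(scorer.extract("00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian".split()))
# -> [('Wean Hall Rm 5409', ...)]  only the true span clears the threshold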
Naïve Bayes Sliding Window Results
Domain: CMU UseNet Seminar Announcements
(seminar announcement text as on the previous slide)

Field         F1
Person Name   30%
Location      61%
Start Time    98%

Slide adapted from William Cohen's
16
SRV: a realistic sliding-window classifier IE system [Freitag AAAI ‘98]
What windows to consider?
all windows containing as many tokens as the shortest example, but no more tokens than the longest example
How to represent a classifier? It might:
Restrict the length of the window;
Restrict the vocabulary or formatting used before/after/inside the window;
Restrict the relative order of tokens;
Use inductive logic programming techniques to express all these…
<title>Course Information for CS213</title>
<h1>CS 213 C++ Programming</h1>
Slide adapted from William Cohen's
17
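A tiny sketch of the candidate-window enumeration described above, assuming the minimum and maximum lengths come from the training examples:

def candidate_windows(tokens, min_len, max_len):
    """Yield every (start, end, span) whose length is within the
    range of lengths seen among the training examples."""
    for i in range(len(tokens)):
        for n in range(min_len, max_len + 1):
            if i + n <= len(tokens):
                yield i, i + n, tokens[i:i + n]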
SRV: a rule-learner for sliding-window classification
Primitive predicates used by SRV:
token(X,W), allLowerCase(W), numerical(W), …
nextToken(W,U), previousToken(W,V)
HTML-specific predicates:
inTitleTag(W), inH1Tag(W), inEmTag(W),…
emphasized(W) = “inEmTag(W) or inBTag(W) or …”
tableNextCol(W,U) = “U is some token in the column after the column W is in”
tablePreviousCol(W,V), tableRowHeader(W,T),…
Slide adapted from William Cohen's
18
Automatic Pattern-Learning Systems
[Diagram: training phase – language input plus answers go to a Trainer, which produces a Model; run time – language input plus the Model go to a Decoder, which produces answers]
19
Automatic Pattern-Learning Systems
Pros:
Portable across domains
Tend to have broad coverage
Robust in the face of degraded input
Automatically finds appropriate statistical patterns
System knowledge not needed by those who supply the domain knowledge
Cons:
Annotated training data, and lots of it, is needed
Isn’t necessarily better or cheaper than a hand-built solution
Examples: Riloff et al., AutoSlog (UMass); Soderland, WHISK (UMass); Mooney et al., Rapier (UTexas):
learn lexico-syntactic patterns from templates
20
Rapier
[Califf & Mooney, AAAI-99]
Rapier learns three regex-style patterns for each slot:
Pre-filler pattern
Filler pattern
Post-filler pattern
Slide adapted from Chris Manning's
21
Features for IE Learning Systems
Part of speech: syntactic role of a specific word
Semantic Classes: Synonyms or other related words
“Price” class: price, cost, amount, …
“Month” class: January, February, March, …, December
“US State” class: Alaska, Alabama, …, Washington, Wyoming
WordNet: large on-line thesaurus containing (among other things) semantic classes
Slide adapted from Chris Manning's
22
Rapier rule matching example
“…sold to the bank for an undisclosed amount…”
POS:    vb  pr  det  nn  pr  det  jj  nn
SClass:                                 price

Pre-filler:          Filler:                  Post-filler:
1) tag: {nn,nnp}     1) word: “undisclosed”   1) sem: price
2) list: length 2       tag: jj

“…paid Honeywell an undisclosed price…”
POS:    vb  nnp  det  jj  nn
SClass:                     price

Slide adapted from Chris Manning's
23
Rapier Rules: Details
Rapier rule :=
pre-filler pattern
filler pattern
post-filler pattern
pattern := subpattern +
subpattern := constraint +
constraint :=
Word - exact word that must be present
Tag - matched word must have given POS tag
Class - semantic class of matched word
Can specify disjunction with “{…}”
List length N - between 0 and N words satisfying other constraints
Slide adapted from Chris Manning's
24
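A simplified sketch of matching such constraint lists against tagged tokens (hypothetical code, not Rapier's own matcher); each token is a (word, tag, semantic-class) triple, and a "list" constraint matches zero to N tokens:

def constraint_ok(tok, c):
    word, tag, sem = tok
    if "word" in c and c["word"] != word:
        return False
    if "tag" in c and tag not in c["tag"]:   # disjunction via a set of tags
        return False
    if "class" in c and c["class"] != sem:
        return False
    return True

def match_pattern(tokens, pattern, start=0):
    """Return every position where the pattern can end, matching from start."""
    ends = {start}
    for c in pattern:
        new_ends = set()
        for pos in ends:
            if "list" in c:
                new_ends.add(pos)  # a list may match zero tokens
                for k in range(1, c["list"] + 1):
                    if pos + k <= len(tokens) and all(
                            constraint_ok(t, c) for t in tokens[pos:pos + k]):
                        new_ends.add(pos + k)
                    else:
                        break
            elif pos < len(tokens) and constraint_ok(tokens[pos], c):
                new_ends.add(pos + 1)
        ends = new_ends
    return sorted(ends)

# The rule from the matching example two slides back:
tokens = [("paid", "vb", None), ("Honeywell", "nnp", None),
          ("an", "det", None), ("undisclosed", "jj", None),
          ("price", "nn", "price")]
pre  = [{"tag": {"nn", "nnp"}}, {"list": 2}]
fill = [{"word": "undisclosed", "tag": {"jj"}}]
post = [{"class": "price"}]
for end_pre in match_pattern(tokens, pre, start=1):   # pre-filler from "Honeywell"
    for end_fill in match_pattern(tokens, fill, start=end_pre):
        if match_pattern(tokens, post, start=end_fill):
            print("filler:", tokens[end_pre:end_fill])
# filler: [('undisclosed', 'jj', None)]

(A full matcher would try every starting position; start=1 is hard-coded here for brevity.)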
Rapier’s Learning Algorithm
Input: set of training examples (list of documents annotated with “extract this substring”)
Output: set of rules
Init: Rules = a rule that exactly matches each training example
Repeat several times:
Seed: Select M examples randomly and generate the K most-accurate maximally-general filler-only rules (prefiller = postfiller = match anything)
Grow:
Repeat for N = 1, 2, 3, …
Try to improve the K best rules by adding N words of prefiller or postfiller context
Keep:
Rules = Rules ∪ best of the K rules – subsumed rules
Slide adapted from Chris Manning's
25
Learning example (one iteration)
2 examples:
“… located in Atlanta, Georgia…”
“… offices in Kansas City, Missouri…”
Init
maximally general rules (low precision, high recall)
Grow
maximally specific rules (high precision, low recall)
appropriately general rule (high precision, high recall)
Slide adapted from Chris Manning's
26
Rapier results:
Precision vs. # Training Examples
Slide adapted from William Cohen's
27
Rapier results:
Recall vs. # Training Examples
Slide adapted from William Cohen's
28
Summary: Rule-learning approaches to sliding-window classification
SRV, Rapier, and WHISK [Soderland KDD ‘97]
Representations for classifiers allow restriction of the relationships between tokens, etc.
Representations are carefully chosen subsets of even more powerful representations
Use of these “heavyweight” representations is complicated, but seems to pay off in results
Slide adapted from William Cohen's
29
Successors to MUC
CoNLL: Conference on Computational Natural Language Learning
Different topics each year
2002, 2003: Language-independent NER
2004: Semantic Role recognition
2001: Identify clauses in text
2000: Chunking boundaries
– http://cnts.uia.ac.be/conll2003/ (also conll2004, conll2002…)
– Sponsored by SIGNLL, the Special Interest Group on Natural Language Learning of the Association for Computational Linguistics
ACE: Automated Content Extraction
Entity Detection and Tracking
– Sponsored by NIST
– http://wave.ldc.upenn.edu/Projects/ACE/
Several others recently
See http://cnts.uia.ac.be/conll2003/ner/
30
CoNLL-2003
Goal: identify boundaries and types of named entities
People, Organizations, Locations, Misc.
Experiment with incorporating external resources (gazetteers) and unlabeled data
Data:
Using IOB notation
4 pieces of info for each term: Word, POS, Chunk, EntityType
31
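The training data looks like the following fragment (one token per line, columns for word, POS tag, chunk tag, and entity tag; the sentence is the example used in the shared-task description):

U.N.      NNP  I-NP  I-ORG
official  NN   I-NP  O
Ekeus     NNP  I-NP  I-PER
heads     VBZ  I-VP  O
for       IN   I-PP  O
Baghdad   NNP  I-NP  I-LOC
.         .    O     O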
Details on Training/Test Sets
Reuters Newswire + European Corpus Initiative
Sang and De Meulder, Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, Proceedings of CoNLL-2003
32
Summary of Results
16 systems participated
Machine Learning Techniques
Combinations of Maximum Entropy Models (5) + Hidden
Markov Models (4) + Winnow/Perceptron (4)
Others used once were Support Vector Machines,
Conditional Random Fields, Transformation-Based learning,
AdaBoost, and memory-based learning
Combining techniques often worked well
Features
Choice of features is at least as important as ML method
Top-scoring systems used many types
No one feature stands out as essential (other than words)
Sang and De Meulder, Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, Proceedings of CoNLL-2003
33
[Results table omitted]
Sang and De Meulder, Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, Proceedings of CoNLL-2003
34
Use of External Information
Improvement from using gazetteers vs. unlabeled data nearly equal
Gazetteers less useful for German than English (higher quality)
Sang and De Meulder, Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, Proceedings of CoNLL-2003
35
Precision, Recall, and F-Scores
[Results table omitted; * marks scores that are not significantly different]
Sang and De Meulder, Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, Proceedings of CoNLL-2003
36
Combining Results
What happens if we combine the results of all of the systems?
Used a majority vote of 5 systems for each set
English: F = 90.30 (14% error reduction over the best system)
German: F = 74.17 (6% error reduction over the best system)
Top four systems in more detail …
37
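A minimal sketch of per-token majority voting over system outputs (illustrative only; a real combiner also needs a tie-breaking policy and must keep the IOB spans consistent):

from collections import Counter

def majority_vote(predictions):
    """predictions: one label sequence per system, all the same length."""
    return [Counter(labels).most_common(1)[0][0]
            for labels in zip(*predictions)]

systems = [["I-PER", "O", "I-LOC"],
           ["I-PER", "O", "O"],
           ["I-ORG", "O", "I-LOC"],
           ["I-PER", "I-PER", "I-LOC"],
           ["I-PER", "O", "I-LOC"]]
print(majority_vote(systems))  # ['I-PER', 'O', 'I-LOC']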
Zhang and Johnson
Experimented with the effects of different features
Used a learning method they developed called Robust Risk Minimization
Related to the Winnow method
Used it to predict the class label t_i associated with each token w_i
Estimate P(t_i = c | x_i) for every possible class label c, where x_i is a feature vector associated with token i
x_i can include information about previous tags
Found that the relatively simple, language-independent features get you much of the way
38
Zhang and Johnson
Simple features include:
The tokens themselves, in window of +/- 2
The previous 2 predicted tags
The conjunction of the previous tag and the current token
Initial capitalization of tokens, in window of +/- 2
More elaborate features include:
Word “shape” information: initial caps, all caps, all digits, digits containing punctuation
Token prefix (len 3-4) and suffix (len 1-4)
POS
Chunking info (chunk bag-of-words at current token)
Marked up entities from training data
Other dictionaries
39
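A rough sketch of assembling these features for one token (hypothetical code, not Zhang & Johnson's; the feature-string naming scheme is arbitrary):

def simple_features(tokens, prev_tags, i):
    """Feature strings for token i; prev_tags holds predicted tags for 0..i-1."""
    feats = []
    for d in range(-2, 3):                        # tokens in a +/- 2 window
        w = tokens[i + d] if 0 <= i + d < len(tokens) else "<pad>"
        feats.append("w[%d]=%s" % (d, w))
        feats.append("cap[%d]=%s" % (d, w[:1].isupper()))
    t1 = prev_tags[i - 1] if i >= 1 else "<s>"    # previous 2 predicted tags
    t2 = prev_tags[i - 2] if i >= 2 else "<s>"
    feats += ["t[-2]=" + t2, "t[-1]=" + t1,
              "t[-1]+w[0]=%s+%s" % (t1, tokens[i])]  # tag/token conjunction
    w = tokens[i]                                 # more elaborate features:
    feats += ["pre%d=%s" % (n, w[:n]) for n in (3, 4)]         # prefixes, len 3-4
    feats += ["suf%d=%s" % (n, w[-n:]) for n in (1, 2, 3, 4)]  # suffixes, len 1-4
    return feats

print(simple_features("Gerry Adams spoke in Belfast".split(),
                      ["I-PER", "I-PER", "O", "O"], 4))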
Language independent
[figure omitted]
40
Florian, Ittycheriah, Jing, Zhang
Combined four machine learning algorithms
The best-performing was the Zhang & Johnson RRM
Voting algorithm
– Giving them all equal-weight votes worked well
– So did using the RRM algorithm to choose among them
English F-measure went from 89.94 to 91.63
Did well with the supplied features; did even better with some complex additional features:
The output of 2 other NER systems
– Trained on 1.7M annotated words in 32 categories
A list of gazetteers
Improved English F-measure to 93.9
– (21% error reduction)
41
Effects of Unknown Words
Florian et al. note that German is harder
Has more unknown words
All nouns are capitalized
42
Klein, Smarr, Nguyen, Manning
Standard approach for unknown words is to extract features like suffixes, prefixes, and capitalization
Idea: use all character n-grams, rather than words, as the primary representation
Integrates unknown words seamlessly into the model
Improved results of their classifier by 25%
43
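A minimal sketch of the character n-gram representation, with boundary markers so that prefixes and suffixes are distinguishable from word-internal n-grams:

def char_ngrams(word, n_min=2, n_max=4):
    padded = "<" + word + ">"   # mark word boundaries
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

print(char_ngrams("Grace", 2, 3))
# ['<G', 'Gr', 'ra', 'ac', 'ce', 'e>', '<Gr', 'Gra', 'rac', 'ace', 'ce>']

An unknown word like "Graceland" still shares many of these features with words seen in training, which is why unknown words integrate seamlessly.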
Balancing n-grams with Other Evidence
Example: “morning at Grace Road”
Need the classifier to determine that “Grace” is part of a location rather than a person
Used a Conditional Markov Model (aka Maximum Entropy Model)
Also added other “shape” information:
“20-month” -> d-x
“Italy” -> Xx
44
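The shape mapping can be sketched in a few lines; this version simply collapses runs of upper-case letters, lower-case letters, and digits, which reproduces the two examples above:

import re

def shape(word):
    s = re.sub(r"[A-Z]+", "X", word)   # runs of capitals -> X
    s = re.sub(r"[a-z]+", "x", s)      # runs of lower case -> x
    return re.sub(r"[0-9]+", "d", s)   # runs of digits -> d

print(shape("20-month"))  # d-x
print(shape("Italy"))     # Xx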
Chieu and Ng
Used a Maximum Entropy approach
Estimates probabilities based on the principle of making as few assumptions as possible
But allows specification of constraints between features and outcomes (derived from training data)
Used a rich feature set, like those already discussed
Interesting additional features:
Lists derived from training set
“Global” features: look at how the words appeared elsewhere within the document
The paper doesn’t say which of these features do well
47
Lists Derived from Training Data
UNI (useful unigrams):
Top 20 words that precede instances of that class
Computed using a correlation metric
UBI (useful bigrams): pairs of preceding words
CITY OF, ARRIVES IN
The bigram has a higher probability of preceding the class than the unigram alone
– CITY OF is better evidence than just OF
NCS (useful name class suffixes):
Tokens that frequently terminate a class
– INC, COMMITTEE
48
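An illustrative sketch of deriving a UNI-style list from IOB-labelled training data; it ranks by raw frequency, where the paper uses a correlation metric:

from collections import Counter

def useful_unigrams(tagged_sentences, entity_class, top_k=20):
    """tagged_sentences: lists of (word, IOB-label) pairs."""
    preceding = Counter()
    for sent in tagged_sentences:
        for i in range(1, len(sent)):
            # count the word just before each entity of the class
            if sent[i][1] == "I-" + entity_class and sent[i - 1][1] == "O":
                preceding[sent[i - 1][0].lower()] += 1
    return [w for w, _ in preceding.most_common(top_k)]

sents = [[("flew", "O"), ("to", "O"), ("Paris", "I-LOC")],
         [("arrives", "O"), ("in", "O"), ("Rome", "I-LOC")]]
print(useful_unigrams(sents, "LOC"))  # ['to', 'in']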
Using Other Occurrences within the Document
Zone:
Where is the token from? (headline, author, body)
Unigrams
If UNI holds for an occurrence of w elsewhere
Bigrams
If UBI holds for an occurrence of w elsewhere
Suffix
If NCS holds for an occurrence of w elsewhere
InitCaps
A way to check whether a word is capitalized due to its position in the sentence or not. Also, check the first word in a sequence of capitalized words.
– Even News Broadcasting Corp., noted for its accurate
reporting, made the erroneous announcement.
49
MUC Redux
Task: fill slots of templates
MUC-4 (1992)
All systems hand-engineered
One MUC-6 entry used learning; failed miserably
50
MUC Redux
Fast forward 12 years … now use ML!
Chieu et al. show a machine learning approach that can do as well as most of the hand-engineered MUC-4 systems
Uses state-of-the-art:
– Sentence segmenter
– POS tagger
– NER
– Statistical parser
– Co-reference resolution
Features look at syntactic context
– Use subject-verb-object information
– Use head-words of NPs
Train classifiers for each slot type
Chieu, Hai Leong, Ng, Hwee Tou, & Lee, Yoong Keok (2003). Closing the Gap: Learning-Based Information Extraction Rivaling Knowledge-Engineering Methods. In Proceedings of ACL-03.
52
Best systems took 10.5 person-months of hand-coding!
53
IE Techniques: Summary
Machine learning approaches are doing well, even without comprehensive word lists
Can develop a pretty good starting list with a bit of web page scraping
Features mainly have to do with the preceding and following tags, as well as syntax and word “shape”
The latter is somewhat language dependent
With enough training data, results are getting pretty decent on well-defined entities
ML is the way of the future!
54
IE Tools
Research tools
GATE
– http://gate.ac.uk/
MinorThird
– http://minorthird.sourceforge.net/
Alembic (only NE tagging)
– http://www.mitre.org/tech/alembic-workbench/
Commercial
?? I don’t know which ones work well
55
NE Annotation Tools - GATE
Adapted from slides by Cunningham & Bontcheva
56
NE Annotation Tools - Alembic
Adapted from slides by Cunningham & Bontcheva
57
NE Annotation Tools – Alembic (2)
Adapted from slides by Cunningham & Bontcheva
58
GATE
GATE – University of Sheffield’s open-source infrastructure for language processing
Automatically deals with document formats, saving of results, evaluation, and visualisation of results for debugging
Has a finite-state pattern-action rule language
Has an example rule-based system called ANNIE
ANNIE modified for MUC guidelines – 89.5% f-measure on MUC-7 corpus
Adapted from slides by Cunningham & Bontcheva
59
NE Components
The ANNIE system – a reusable and easily extendable set of components
Adapted from slides by Cunningham & Bontcheva
60
GATE’s Named Entity Grammars
Phases run sequentially and constitute a cascade of FSTs over the pre-processing results
Hand-coded rules applied to annotations to identify NEs
Annotations from format analysis, tokeniser, sentence splitter, POS tagger, and gazetteer modules
Use of contextual information
Finds person names, locations, organisations, dates, addresses
Adapted from slides by Cunningham & Bontcheva
61
Named Entities in GATE
Adapted from slides by Cunningham & Bontcheva
62
Named Entity Coreference
Adapted from slides by Cunningham & Bontcheva
63