Learning in NLP: When can we reduce or avoid annotation cost


Learning in NLP: When can we
reduce or avoid annotation cost?
Tutorial at RANLP 2003
Ido Dagan
Bar Ilan University, Israel
1
Introduction
• Motivations for learning in NLP
– NLP requires huge amounts of diverse types of
knowledge – learning makes knowledge
acquisition more feasible, automatically or semi-automatically
– Much of language behavior is preferential in
nature, so need to acquire both quantitative and
qualitative knowledge
2
Introduction (cont.)
• Apparently, empirical modeling obtains (so
far) mostly “first-order” approximation of
linguistic behavior
– Often, learning models that are more complex
computationally improve results only to a
modest extent
– Often, several learning models obtain
comparable results
– Proper linguistic modeling seems crucial
3
Information Units of Interest - Examples
• Explicit units:
– Documents
– Lexical units: words, terms (surface/base form)
• Implicit (hidden) units – human stipulation:
– Word senses, name types
– Document categories
– Lexical syntactic units: part-of-speech tags
– Syntactic relationships between words – parsing
– Semantic concepts and relationships
Tasks and Applications
• Supervised/classification: identify hidden
units (concepts) of explicit units
– Syntactic analysis, word sense disambiguation,
name classification, categorization, …
• Unsupervised: identify relationships and
properties of explicit units (terms, docs)
– Association, topicality, similarity, clustering
• Combinations
5
Data and Representations
• Frequencies of units
• Co-occurrence frequencies
– Between all relevant types of units (term-doc,
term-term, term-category, sense-term, etc.)
• Representations and modeling
– Sequences
– Feature sets/vectors
6
Characteristics of Learning in NLP
– Very high dimensionality
– Sparseness of data and relevant modeling
– Addressing the basic problems of
language:
• Ambiguity – of concepts and features
– One way to say many things
• Variability
– Many ways to say the same thing
7
Supervised Classification
• Hidden concept is defined by a set of labeled
training examples (category, sense)
• Classification is based on entailment of the
hidden concept by related elements/features
– Example: two senses of “sentence”:
• Sense 1: word, paragraph, description
• Sense 2: judge, court, lawyer
• Single or multiple concepts per example
– Word sense vs. document categories
8
Supervised Tasks and Features
• Typical Classification Tasks:
– Lexical: Word sense disambiguation, target word
selection in translation, name-type classification, accent
restoration, text categorization (notice task similarity)
– Syntactic: POS tagging, PP-attachment, parsing
– Hybrid: anaphora resolution, information extraction
• Features (“feature engineering”):
– Adjacent context: words, POS, …
• In various relationships – distance, syntactic
• possibly generalized to classes
– Other: morphological, orthographic, syntactic
9
Learning to Classify
• Two possibilities for acquiring “entailment”
relationships:
– Manually: by an expert (“rules”)
• time consuming, difficult – “expert system” approach
– Automatically: concept is defined by a set of
training examples
• training quantity/quality
• Training: learn entailment of concept by
features of training examples (a model)
• Classification: apply model to new examples
10
Supervised Learning Scheme
[Diagram] “Labeled” examples → training algorithm → classification model;
new examples + classification model → classification algorithm → classifications
Learning Approaches
• Model-based: define entailment relations and their
strengths by training algorithm
– Statistical/Probabilistic: model is composed of
probabilities (scores) computed from training statistics
– Iterative feedback/search (neural network): start from
some model, classify training examples, and correct
model according to feedback
• Memory-based: no training algorithm or model – classify by
matching new examples to the raw training data (compare to
unsupervised tasks)
12
Motivation of Tutorial Theme: Reducing or
Avoiding Manual Labeling
• Basic supervised setting – requires large manually labeled
training corpora
• Annotation is often very expensive
• Many results rely on standard training materials, which were
assembled through dedicated projects and evaluation
frameworks
– Penn Treebank, Brown Corpus, Semcor, TREC, MUC and SenseEval
evaluations, CoNLL shared tasks.
• Limited applicability for settings not covered by the generic
resources
– Different languages, specialized domains, full scope of word senses,
text categories, …
– Severely hurts industrial applicability
Tutorial Scope
• Obtaining some (noisy) labeled data without manual
annotation
• Exploiting bilingual resources
• Generalizations by unsupervised methods
• Bootstrapping
• Unsupervised clustering as an alternative to supervised classes
• Expectation-Maximization (EM) for detecting underlying
structures/concepts
• Selective sampling
• These approaches are demonstrated for basic statistical and
probabilistic learning models
Some of these approaches might be perceived as unsupervised learning,
though they actually address supervised tasks of identifying externally
imposed classes (“unsupervised” training)
14
Sources
• Major literature sources:
– Foundations of Statistical Natural Language
Processing, by Manning & Schütze, MIT Press,
2000 (2nd printing with corrections)
– Articles (see bibliography)
• Additional slide credits:
– Prof. Shlomo Argamon, Chicago
15
Evaluation
• Evaluation mostly based on (subjective) human
judgment of relevancy/correctness
– In some cases – task is objective (e.g. OCR), or
evaluate by applying mathematical criteria (likelihood)
• Basic measure for classification – accuracy
• Cross validation – different training/test splits
• In many tasks (extraction, multiple classes per instance, …)
most instances are “negative”; hence recall/precision measures
are used, following the information retrieval (IR) tradition
16
Evaluation: Recall/Precision
• Recall: #correct extracted/total correct
• Precision: #correct extracted/total extracted
• Recall/precision curve - by varying the
number of extracted items, assuming the
items are sorted by decreasing score
[Plot: recall/precision curve – precision (0 to 1) on the y-axis against recall (0 to 1) on the x-axis]
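For concreteness, a small Python sketch of these two measures and of how a recall/precision curve is traced by varying the cutoff; the function and variable names are illustrative, not taken from any cited system.

def precision_recall(extracted, gold):
    # Precision = #correct extracted / total extracted; recall = #correct extracted / total correct.
    correct = len(set(extracted) & set(gold))
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

def recall_precision_curve(scored_items, gold):
    # Sort items by decreasing score and record (precision, recall) after each cutoff.
    ranked = [item for item, _ in sorted(scored_items, key=lambda pair: -pair[1])]
    return [precision_recall(ranked[:k], gold) for k in range(1, len(ranked) + 1)]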
Simple Examples for Statistics-based
Classification
• Based on class-feature counts – from
labeled data
• Contingency table:
        C    ~C
  f     a     b
  ~f    c     d
• We will see several examples of simple
models based on these statistics
18
Prepositional-Phrase Attachment
• Simplified version of Hindle & Rooth
(1993)
[MS 8.3]
• Setting: V NP-chunk PP
– Moscow sent soldiers into Afghanistan
– ABC breached an agreement with XYZ
• Motivation for the classification task:
– Attachment is often a problem for (full) parsers
– Augment shallow/chunk parsers
19
Relevant Probabilities
• P(prep|n) vs. P(prep|v)
– The probability of having the preposition prep attached
to an occurrence of the noun n (the verb v).
– Notice: a single feature for each class
• Example: P(into|send) vs. P(into|soldier)
• Decision measured by the likelihood ratio:
λ(v, n, p) = log [ P(prep | v) / P(prep | n) ]
• Positive/negative λ → verb/noun attachment
20
Estimating Probabilities
• Based on attachment counts from a training corpus
• Maximum likelihood estimates:
P(prep | v) = attach_freq(prep, v) / freq(v)
P(prep | n) = attach_freq(prep, n) / freq(n)
• How to count from an unlabeled ambiguous
corpus? (Circularity problem)
• Some cases are unambiguous:
– The road to London is long
– Moscow sent him to Afghanistan
21
Heuristic Bootstrapping and Ambiguous
Counting
1. Produce initial estimates (model) by counting all
unambiguous cases
2. Apply the initial model to all ambiguous cases;
count each case under the resulting attachment if
|λ| is greater than a threshold
• E.g. |λ|>2, meaning one attachment is at least 4 times
more likely than the other
3. Consider each remaining ambiguous case as a
0.5 count for each attachment.
• Likely n-p and v-p pairs would “pop up” in the
ambiguous counts, while incorrect attachments are
likely to accumulate low counts
Example Decision
• Moscow sent soldiers into Afghanistan
P(into | send) = attach_freq(into, send) / freq(send) = 86 / 1742.5 ≈ 0.049
P(into | soldier) = attach_freq(into, soldier) / freq(soldier) = 1 / 1478 ≈ 0.0007
λ(send, soldier, into) = log2 (0.049 / 0.0007) ≈ log2 70
• Verb attachment is 70 times more likely
23
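A minimal Python sketch of this decision, assuming the attachment counts are already available in dictionaries; the counts below are the ones from the worked example, and the function name is illustrative.

import math

def attachment_lambda(prep, verb, noun, attach_freq, freq):
    # lambda(v, n, p) = log2 [ P(prep|v) / P(prep|n) ], with MLE estimates from counts.
    p_prep_v = attach_freq[(prep, verb)] / freq[verb]
    p_prep_n = attach_freq[(prep, noun)] / freq[noun]
    return math.log2(p_prep_v / p_prep_n)

attach_freq = {("into", "send"): 86, ("into", "soldier"): 1}
freq = {"send": 1742.5, "soldier": 1478}          # fractional counts come from ambiguous cases
print(attachment_lambda("into", "send", "soldier", attach_freq, freq))  # ~6.2, i.e. a ratio of ~70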
Hindle & Rooth Evaluation
• H&R results for a somewhat richer model:
– 80% correct if we always make a choice
– 91.7% precision for 55.2% recall, when
requiring |λ|>3 for classification.
• Notice that the probability ratio doesn’t
distinguish between decisions made based
on high vs. low frequencies.
24
Possible Extensions
• Consider a-priori structural preference for “low”
attachment (to noun)
• Consider lexical head of the PP:
– I saw the bird with the telescope
– I met the man with the telescope
• Such additional factors can be incorporated easily,
assuming their independence
• Addressing more complex types of attachments,
such as chains of several PP’s
• Similar attachment ambiguities within noun
compounds: [N [N N]] vs. [[N N] N]
25
Classify by Best Single Feature: Decision List
• Training: for each feature, measure its “entailment score ”
for each class, and register the class with the highest score
– Sort all features by decreasing score
• Classification: for a given example, identify the highest
entailment score among all “active” features, and select the
appropriate class
– Test all features for the class in decreasing score order, until the first
success → output the relevant class
– Default decision: the majority class
• For multiple classes per example: may apply a threshold on
the feature-class entailment score
• Suitable when relatively few strong features indicate class
(compare to manually written rules)
26
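A rough Python sketch of decision-list training and classification under the scheme above; the smoothed log-odds score and the add-constant value are illustrative assumptions, not Yarowsky's exact formulation.

import math
from collections import defaultdict

def train_decision_list(examples, alpha=0.1):
    # examples: list of (feature_set, class_label) pairs; assumes at least two classes.
    counts = defaultdict(lambda: defaultdict(float))
    classes = set()
    for feats, c in examples:
        classes.add(c)
        for f in feats:
            counts[f][c] += 1
    rules = []
    for f, per_class in counts.items():
        total = sum(per_class.values())
        for c in classes:
            p = (per_class[c] + alpha) / (total + alpha * len(classes))  # smoothed P(c | f)
            rules.append((math.log(p / (1 - p)), f, c))                  # "entailment score"
    return sorted(rules, reverse=True)            # decreasing score order

def classify(decision_list, feats, default):
    # The first rule whose feature is active in the example wins; otherwise the majority class.
    for score, f, c in decision_list:
        if f in feats:
            return c
    return default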
Example: Accent Restoration
• (David Yarowsky, 1994): for French and Spanish
• Classes: alternative accent restorations for words in
text without accent marking
– Labeled training generated from accented texts
• Example: côte (coast) vs. côté (side)
• A variant of the general word sense disambiguation
problem - “one sense per collocation” motivates
using decision lists
• Similar tasks (with available training):
– Capitalization restoration in ALL-CAPS text
– Homograph disambiguation in speech synthesis (wind as
noun and verb)
27
Accent Restoration - Features
• Word form collocation features:
– Single words in window: ±1, ±k (20-50)
– Word pairs at <-1,+1>, <-2,-1>, <+1,+2> (complex
features)
– Easy to implement
28
Accent Restoration - Features
• Local syntactic-based features (for Spanish)
– Use a morphological analyzer
– Lemmatized features – generalizing over inflections
– POS of adjacent words as features
– Some word classes (primarily time terms, to help with
tense ambiguity for unaccented words in Spanish)
Accent Restoration – Decision Score
score(f, c) = log [ P(c | f) / P(~c | f) ]
c: class,  f: feature
• Probabilities estimated from training statistics,
taken from a corpus with accents
• Smoothing - add small constant to all counts
• Pruning:
– Remove redundancies for efficiency: remove specific
features that score lower than their generalization
(domingo - WEEKDAY, w1w2 – w1)
– Cross validation: remove features that cause more
errors than correct classifications on held-out data
30
Probabilistic Estimation - Smoothing
• Counts are obtained from a sample of the probability space
• Maximum Likelihood Estimate is proportional to sample counts:
MLE gives 0 probability to unobserved events
• Smoothing discounts observed events, leaving probability
“mass” to unobserved events: a discounted estimate for observed
events and a positive estimate for unobserved events
“Add-1/Add-Constant” Smoothing
p_MLE(x) = c(x) / N
c(x) – the count for event x (e.g. word occurrence)
N – the total count for all x ∈ X (e.g. corpus length)
⇒ p_MLE(x) = 0 for many low-probability events (sparseness)
Smoothing – discounting and redistribution:
p_S(x) = (c(x) + λ) / (N + λ|X|)
λ = 1: Laplace, assuming a uniform prior.
For natural language events: usually λ < 1
32
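A small Python sketch of add-constant smoothing as defined above; the function name and the return convention for unseen events are illustrative.

def add_constant_probs(counts, event_space_size, lam=1.0):
    # p_S(x) = (c(x) + lam) / (N + lam * |X|); every unseen event gets lam / (N + lam * |X|).
    n = sum(counts.values())
    denom = n + lam * event_space_size
    probs = {x: (c + lam) / denom for x, c in counts.items()}
    unseen_prob = lam / denom
    return probs, unseen_prob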
Accent Restoration – Results
• Agreement with accented test corpus for
ambiguous words: 98%
– Vs. 93% for baseline of most frequent form
– Accented test corpus also includes errors
• Worked well for most of the highly ambiguous
cases (see random sample in next slide)
• Results slightly better than Naive Bayes (weighing
multiple features)
– Consistent with related study on binary homograph
disambiguation, where combining multiple features
almost always agrees with using a single best feature
– Incorporating many low-confidence features may
introduce noise that would override the strong features
33
Accent Restoration – Tough Examples
34
Related Application: Anaphora Resolution
(Dagan, Justeson, Lappin, Lease, Ribak 1995)
“The terrorist pulled the grenade from his pocket and
threw it at the policeman” – what does “it” refer to?
• Traditional AI-style approach: manually encoded semantic
preferences/constraints, e.g. an <object – verb> preference
linking a Weapon/Bombs hierarchy (grenade) to a
Cause_movement action hierarchy (throw, drop)
Statistical Approach
• “Semantic” judgment from a corpus (text collection):
<verb–object: throw-grenade>  20 times
<verb–object: throw-pocket>    1 time
• Statistics can be acquired from unambiguous (non-anaphoric)
occurrences in raw (English) corpus (cf. PP attachment)
• Semantic confidence combined with syntactic preferences
⇒ it → grenade
• “Language modeling” for disambiguation
36
Word Sense Disambiguation
• Many words have multiple meanings
– E.g, river bank, financial bank
• Problem: Assign proper sense to each
ambiguous word in text
• Applications:
– Machine translation
– Information retrieval (mixed evidence)
– Semantic interpretation of text
37
Approaches
• Supervised learning:
Learn from a pre-tagged corpus (Semcor, SenseEval)
– all sense-occurrences are hidden – vs. PP and anaphora
• Bilingual-based methods
Obtain sense labels by mapping to another language
• Dictionary-based learning
Learn to distinguish senses based on dictionary entries
• Unsupervised learning
Automatically cluster word occurrences into different senses
38
Using an Aligned Bilingual Corpus
• Goal: get sense tagging cheaply
• Use correlations between phrases in two languages to
disambiguate
E.g., interest =
‘legal share’ (acquire an interest) → German: Beteiligung erwerben
‘attention’ (show interest) → German: Interesse zeigen
• For each occurrence of an ambiguous word, determine which
sense applies according to the aligned translation
• Limited to senses that are discriminated by the other language;
suitable for disambiguation in translation
• Gale, Church and Yarowsky (1992) – Bayesian model
39
Evaluation Settings
• Train and test on pre-tagged (or bilingual) texts
– Difficult to come by
• Artificial data – pseudo-senses – cheap to train
and test: ‘merge’ two words to form an
‘ambiguous’ word with two ‘senses’
– E.g, replace all occurrences of door and of window with
doorwindow and see if the system figures out which is
which
– Useful to develop sense disambiguation methods
40
Performance Bounds
• How good is (say) 83.2%??
• Evaluate performance relative to lower and
upper bounds:
– Baseline performance: how well does the
simplest “reasonable” algorithm do? E.g.,
compare to selecting the most frequent sense
– Human performance: what percentage of the
time do people agree on classification?
• Nature of the senses used impacts accuracy
levels
41
Word Sense Disambiguation
for Machine Translation
I bought soap bars / I bought window bars – which Hebrew translation of “bar”?
sense1 (‘chafisa’) vs. sense2 (‘sorag’)
Counts from a corpus (text collection):
Sense1:
<noun-noun: soap-bar>       20 times
<noun-noun: chocolate-bar>  15 times
Sense2:
<noun-noun: window-bar>     17 times
<noun-noun: iron-bar>       22 times
• Features: co-occurrence within distinguished syntactic relations
• “Hidden” senses – manual labeling required(?)
42
Solution: Dictionary-based Mapping to
Target Language
English(-English)-Hebrew Dictionary:
bar1 ‘chafisa’
bar2 ‘sorag’
soap  ‘sabon’
window  ‘chalon’
Map ambiguous “relations” to the second language (all possibilities)
and count in a Hebrew corpus:
<noun-noun: soap-bar>
  1 → <noun-noun: ‘chafisat-sabon’>   20 times
  2 → <noun-noun: ‘sorag-sabon’>       0 times
<noun-noun: window-bar>
  1 → <noun-noun: ‘chafisat-chalon’>   0 times
  2 → <noun-noun: ‘sorag-chalon’>     15 times
• Exploiting differences in ambiguity between the two languages
• Principle – intersecting redundancies
(Dagan and Itai 1994)
The Selection Model
• Constructed to choose (classify) the right translation
for a complete relation rather than for each individual
word at a time
– since both words in a relation might be ambiguous, having
their translations dependent upon each other
• Assuming a multinomial model, under certain
linguistic assumptions
– The multinomial variable: a source relation
– Each alternative translation of the relation is a possible
outcome of the variable
44
An Example Sentence
• A Hebrew sentence with 3 ambiguous words:
• The alternative translations to English:
45
Example - Relational Representation
46
Selection Model
• We would like to use as a classification score the log
of the odds ratio between the most probable relation i
and all other alternatives (in particular, the second
most probable one j):
ln ( p_i / p_j )
• Estimation is based on smoothed counts
• A potential problem: the odds ratio for probabilities
doesn’t reflect the absolute counts from which the
probabilities were estimated.
– E.g., a count of 3 vs. (smoothed) 0
• Solution: using a one sided confidence interval (lower
bound) for the odds ratio
47
Selection Model (cont.)
• The distribution of the log of the odds ratio (across
samples) converges to a normal distribution
• Selection “confidence” score for a single relation –
the lower bound for the odds ratio:
ln(p_i / p_j)  ≥  ln(n_i / n_j) − Z_{1−α} · sqrt(1/n_i + 1/n_j)  ≡  Conf(i)
• The most probable translation i for the relation is
selected if Conf(i), the lower bound for the log odds
ratio, exceeds θ.
• Notice the roles of θ vs. α, and the impact of n_i, n_j
48
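A minimal Python sketch of this confidence test, assuming the two counts are already smoothed to be positive; the default z value and the threshold are illustrative.

import math

def conf_lower_bound(n_i, n_j, z=1.645):
    # One-sided lower bound for ln(p_i / p_j): ln(n_i / n_j) - z * sqrt(1/n_i + 1/n_j).
    # z = 1.645 corresponds to a 95% one-sided confidence level.
    return math.log(n_i / n_j) - z * math.sqrt(1.0 / n_i + 1.0 / n_j)

def select_translation(n_i, n_j, theta=0.5, z=1.645):
    # Select the most frequent alternative only if the bound exceeds the threshold theta.
    return conf_lower_bound(n_i, n_j, z) >= theta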
Handling Multiple Relations in a
Sentence: Constraint Propagation
1. Compute Conf(i) for each ambiguous source relation.
2. Pick the source relation with the highest Conf(i).
3. If Conf(i) < θ, or if no source relations are left, then stop;
   otherwise, select word translations according to target
   relation i and remove the source relation from the list.
4. Propagate the translation constraints: remove any target
   relation that contradicts the selections made; remove
   source relations that now become unambiguous.
   Go to step 2.
• Notice the similarity to the decision list algorithm
Selection Algorithm Example
50
Evaluation Results
• Results - HebrewEnglish translation:
Coverage: ~70%
Precision within coverage: ~90%
– ~20% improvement over choosing most
frequent translation (95% statistical confidence
for an improvement relative to this common
baseline)
51
Analysis
• Correct selections capture:
– Clear semantic preferences: sign/seal treaty
– Lexical collocation usage: peace treaty/contract
• No selection:
– Mostly: no statistics for any alternative (data
sparseness)
• investigator/researcher of corruption
– Also: similar statistics for several alternatives
– Solutions:
• Consult more features in remote (vs. syntactic) context
prime minister … take position/job
• Class/similarity-based generalizations (corruption-crime)
52
Analysis (cont.)
• Confusing multiple sources (senses) for the same
target relation:
– ‘sikkuy’ (chance/prospect) ‘kattan’ (small/young)
Valid (frequent) target relations:
• small chance - correct
• young prospect – incorrect, due to -
– “Young prospect” is the translation of another Hebrew
expression – ‘tikva’ (hope) ‘zeira’ (young)
• The “soundness” assumption of the multinomial
model is violated:
– Assume counting the generated target relations corresponds
to sampling the source relation, hence assuming a known
1:n mapping (also completeness – another source of errors)
– Potential solutions: bilingual corpus, “reverse” translation
53
Sense Translation Model: Summary
• Classification instance: a relation with multiple words,
rather than a single word at a time, to capture immediate
(“circular”) dependencies.
• Make local decisions, based on a single feature
• Taking into account statistical confidence of decisions
• Constraint propagation for multiple dependent
classifications (remote dependencies)
• Decision-list-style rationale – classifying by a single high-
confidence evidence is simpler, and may work better, than
considering all weaker evidence simultaneously
– Computing statistical confidence for a combination of multiple
events is difficult; easier to perform for each event at a time
• Statistical classification scenario (model) constructed for
the linguistic setting
– Important to identify explicitly the underlying model assumptions,
and to analyze the resulting errors
54
Data Sparseness and Generalization via
Distributional Similarity
<verb–object: ‘hidpis-tikiya’> – which translation?
<verb–object: print-folder>        0 times
<verb–object: print-file_cabinet>  0 times
• Standard approach: “back-off” to single term frequency
• Similarity-based inference:
– folder – similar words: file, directory, record, …
  checked in the relation <verb-object> with print
– file_cabinet – similar words: cupboard, closet, suitcase, …
  checked in the relation <verb-object> with print
Computing Distributional Similarity
Example: folder is similar to file – both occur as objects of
print, erase, open, retrieve, browse, save, …
• Association between word u (“folder”) and its “attributes”
(context words/features) is based on mutual information:
I(u, att) = log2 [ P(att, u) / (P(att) P(u)) ] = log2 [ P(att | u) / P(att) ]
• Similarity between u and v (weighted Jaccard, in [0,1]):
sim(u, v) = Σ_att min[ I(u, att), I(v, att) ] / Σ_att max[ I(u, att), I(v, att) ]
56
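A small Python sketch of these two formulas; the association vectors are assumed to be given as dictionaries mapping attributes to precomputed values, and clipping negative associations to zero is an added assumption of this sketch.

import math

def mutual_info(p_joint, p_att, p_u):
    # I(u, att) = log2 [ P(att, u) / (P(att) P(u)) ] = log2 [ P(att | u) / P(att) ].
    return math.log2(p_joint / (p_att * p_u))

def weighted_jaccard_sim(I_u, I_v):
    # sim(u, v) = sum_att min(I(u,att), I(v,att)) / sum_att max(I(u,att), I(v,att)), over the union.
    atts = set(I_u) | set(I_v)
    lo = sum(min(max(I_u.get(a, 0.0), 0.0), max(I_v.get(a, 0.0), 0.0)) for a in atts)
    hi = sum(max(max(I_u.get(a, 0.0), 0.0), max(I_v.get(a, 0.0), 0.0)) for a in atts)
    return lo / hi if hi else 0.0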
Evaluation
• Results (HebrewEnglish):
• 15% coverage increase, while decreasing precision by 2%
• Accuracy 15% better than back-off to single word
frequency
(Dagan, Marcus and Markovitch 1995)
57
Probabilistic Framework - Smoothing
• Counts are obtained from a sample of the probability space
• Maximum Likelihood Estimate is proportional to sample counts:
MLE gives 0 probability to unobserved events
• Smoothing discounts observed events, leaving probability
“mass” to unobserved events: a discounted estimate for observed
events and a positive estimate for unobserved events
Smoothing Conditional Attribute Probability
• Good-Turing smoothing scheme – discount & redistribute:
P(att | u) = P_d(att | u)      if count(att, u) > 0
P(att | u) = norm(u) · 1       if count(att, u) = 0  (uniform redistribution over unseen attributes)
• Katz’s seminal back-off scheme (speech language modeling):
P(att | u) = norm(u) · P(att)  if count(att, u) = 0
• Similarity-based smoothing (Dagan, Lee, Pereira 1999):
P(att | u) = norm(u) · P_SIM(att | u)  if count(att, u) = 0
where
P_SIM(att | u) = (1 / Σ_{u'} f_SIM(u, u')) · Σ_{u'} f_SIM(u, u') · P(att | u')
Similarity/Distance Functions for Probability
Distributions
• Jensen-Shannon divergence (KL-distance to the average):
A(u, v) = ½ [ D_KL( P_u ‖ (P_u + P_v)/2 ) + D_KL( P_v ‖ (P_u + P_v)/2 ) ]
where  ((P_u + P_v)/2)(att) = ½ [ P_u(att) + P_v(att) ]
Information loss by approximating u and v by their average
f_SIM^A(u, v) = exp(−β · A(u, v))
β controls the relative influence of close vs. remote neighbors
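A small Python sketch of the Jensen-Shannon based weight, assuming the two distributions are given as dictionaries over attributes; the default β is arbitrary.

import math

def kl_divergence(p, q):
    # D_KL(p || q); only used below with q = the average of p and another distribution,
    # so q(a) > 0 wherever p(a) > 0.
    return sum(pa * math.log2(pa / q[a]) for a, pa in p.items() if pa > 0)

def js_divergence(p, q):
    # A(u, v): average KL distance of the two distributions to their average.
    atts = set(p) | set(q)
    avg = {a: 0.5 * (p.get(a, 0.0) + q.get(a, 0.0)) for a in atts}
    return 0.5 * kl_divergence(p, avg) + 0.5 * kl_divergence(q, avg)

def sim_weight(p, q, beta=10.0):
    # f_SIM^A(u, v) = exp(-beta * A(u, v)).
    return math.exp(-beta * js_divergence(p, q))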
• L1 norm:
L(u, v) = Σ_att | P(att | u) − P(att | v) |
f_SIM^L(u, v) = (2 − L(u, v))^β
Sample Results
• Most similar words to “guy”:
Measure | Closest words
A       | guy kid thing lot man mother doctor friend boy son
L       | guy kid lot thing man doctor girl rest son bit
PC      | role people fire guy man year lot today way part
Typical common verb contexts: see get give tell take …
PC: an earlier attempt at similarity-based smoothing
• Several smoothing experiments (A performed best):
– Language modeling for speech (hunt bears/pears)
– Perplexity (predicting test corpus likelihood)
– Data recovery task (similar to sense disambiguation)
Insensitive to the exact value of β
61
Other Forms of Generalization
• Capturing distributional similarity by
statistical clustering (vs. pair-wise similarity)
• Lexical resources of manually defined
generalizations as ontology and thesaurus
– WordNet, WordNet Domains, Roget
– Word meaning (syntagmatic)
– Domain of discourse (paradigmatic)
62
Bootstrapping
• Goals:
– Utilize a minimal amount of (initial) supervision
– Obtain learning from many unlabeled examples (vs.
selective sampling)
• General scheme:
1. Initial supervision – seed examples for training an
initial model, or the seed model itself (indicative features)
2. Classify corpus with seed model
3. Add most confident classifications to training and
iterate.
• Exploits feature redundancy in examples
63
Bootstrapping Decision List for Word
Sense Disambiguation
• Yarowsky (1995) scheme
• Relies on “one sense per collocation”
1. Initialize seed model with all collocations in the sense
dictionary definition – an initial decision list for each
sense.
– Alternatively – train from seed examples
2. Classify all examples with current model
– Any example containing a feature for some sense
3. Compute odds ratio for each feature-sense combination; if
above threshold then add the feature to the decision list
4. Optional: add the “one sense per discourse constraint” to
add/filter examples
5. Iterate (2)
64
“One Sense per Collocation”
65
Results
• Applies the “one sense per discourse” constraint after final
classification – classify all word occurrences in a
document by the majority sense
• Evaluation on ~37,000 examples:
66
Co-training for Name Classification
• Collins and Singer, EMNLP 1999
• A bootstrapping approach that relies on a
(natural) partition of the feature space into two
different sub-spaces
• The bootstrapping process iterates by switching
sub-spaces in each iteration
• Name classification task: person, organization,
location
• Features for decision list:
1. Spelling features (full, contains, cap, non-alpha)
2. Context features (words, syntactic type)
67
Seed Features
• Initialized with 0.9999 score for the decision list
68
DL-CoTrain Algorithm
1. Initialize n=5 (max. #rules learned in an iteration)
2. Initialize the spelling decision list with the seed
3. Classify the corpus by spelling rules (where applicable)
4. Induce a contextual rule list from the classified
examples; keep at most the n most frequent rules
whose score>θ for each class
5. Classify the corpus by contextual rules
6. Induce spelling rules, as in step 4, and add to the list
7. If n<2500 then n ← n+5, and return to step 3.
Otherwise, classify the corpus by the combined list and
induce a final classifier from these labeled examples
69
Co-Train vs. Yarowsky’s Algorithm
• Collins & Singer implemented a cautious version
of Yarowsky’s algorithm, where n rules are added
at each iteration
• Same accuracy (~91%) (as well as a boosting-based bootstrapping algorithm)
• Abney (ACL 2002) provides some analysis for
both algorithms, showing that they are based on
different independence assumptions
70
Classifying by Multiple Features:
Naive Bayes
• e : The example (instance) to be classified
– word occurrence, document
• c : a class, among all possible classes C
– word sense, document category
• f : Feature
– context word, document word
71
Bayes Decision Rule
• Among alternative classes, select c such that P(c|e) is
maximal
• Minimizes the probability of error
• Likelihood ratio view: P(c | e) / P(~c | e)
→ select c for which the ratio is maximal
• Binary classification: classify c if ratio>1, otherwise ~c
• Multiple (0 or more) classes per example: consider each
class as a binary classification (usually with threshold,
due to imprecise estimate)
• Easier to estimate P(e|c) (generative model), therefore
using Bayes rule
72
Conditional Probability and Bayes
Rule
• Conditional probability:
probability of event A given that event B has happened
Pr[A | B] = Pr[A ∩ B] / Pr[B]
Pr[A ∩ B] = Pr[A, B]
Pr[B | A] = Pr[A ∩ B] / Pr[A]
• Bayes rule:
Pr[A | B] = Pr[B | A] · Pr[A] / Pr[B]
Bayes and Independence
Assumption
P(c | e) / P(~c | e)
  = [ P(e | c) · P(c) ] / [ P(e | ~c) · P(~c) ]                     (Bayes)
  = [ Π_{f∈e} P(f | c) / Π_{f∈e} P(f | ~c) ] · [ P(c) / P(~c) ]    (independence assumption)
(posterior odds = likelihood ratio × prior odds)
Log-Likelihood Ratio
• Computational convenience – avoid underflow
score(c) = log [ P(c | e) / P(~c | e) ]
         = Σ_{f∈e} log [ P(f | c) / P(f | ~c) ] + log [ P(c) / P(~c) ]
• Estimate probabilities from a training corpus, by (smoothed)
count ratios
• log [ P(f | c) / P(f | ~c) ] : the “entailment score” of f for c
• Working with this ratio indicates explicitly the feature’s “weight”
within the score, compared to score(c) = Σ_{f∈e} log P(f | c) + log P(c)
– log P(f | c) itself is large for frequent f’s, regardless of discrimination
• Was applied to many classification tasks
75
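A minimal Python sketch of the binary score above, assuming smoothed conditional probabilities are already available in dictionaries; skipping features missing from the maps is an added simplification.

import math

def nb_llr_score(features, p_f_given_c, p_f_given_not_c, prior_c):
    # score(c) = sum_f log P(f|c)/P(f|~c) + log P(c)/P(~c); classify as c if score > threshold.
    score = math.log(prior_c / (1.0 - prior_c))
    for f in features:
        if f in p_f_given_c and f in p_f_given_not_c:
            score += math.log(p_f_given_c[f] / p_f_given_not_c[f])
    return score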
Word Sense Disambiguation
by Naïve Bayes
• Each ambiguous word w is a classification
problem
• e : The example (instance) to be classified
– An occurrence of the ambiguous word
• c : a class, among all possible classes C
– A word sense, among all listed senses
• f : Feature
– A context word or phrase, in near or broad
context, possibly within a syntactic relationship
76
Estimating Probabilities
• Assume a sense-labeled training corpus
• Apply some smoothing to avoid zero counts:
P(f | c) = freq(f, c) / freq(c)
P(c) = freq(c) / freq(w)
• freq(c) – sum over all context positions
• Context features (words) that tend to occur mostly with a
specific sense, and not with the others, contribute high values
to its accumulated score
• Naïve Bayes combines weaker evidence from broad context,
vs. using a single strong collocation feature in a decision list
– Recent classification methods combine evidence more accurately
77
Examples for Significant Features
• Senses of drug (Gale et al. 1992):
‘medication’: prices, prescription, patent,
increase, consumer, pharmaceutical
‘illegal substance’: abuse, paraphernalia, illicit,
alcohol, cocaine, traffickers
78
Clustering and Unsupervised
Disambiguation
• Word sense disambiguation without labeled
training or other information sources
• Cannot label to predefined senses (there are
none), so try to find “natural” senses
• Use clustering methods to divide different
contexts of a word into “sensible” classes
• Other applications of clustering:
– Thesaurus construction, document clustering
– Forming word classes for generalization in
language modeling and disambiguation
79
Clustering
80
Clustering Techniques
• Hierarchical:
– Bottom-up (agglomerative)
– Top-down (divisive)
• Flat (non-hierarchical)
– Usually iterative
– Soft (vs. hard) clustering
• “Degree” of membership
81
Comparison
Hierarchical:
• Preferable for detailed data analysis
• More information provided
• No clearly preferred algorithm
• Less efficient (at least O(n2))
Non-hierarchical:
• Preferable if efficiency is important or there is a lot of data
• K-means is the simplest method and often good enough
• If there is no Euclidean space, can use EM instead
Hierarchical Clustering
[Dendrogram: example hierarchical clustering of the function words
and, but, in, on, with, for, at, from, of, to, as]
Similarity Functions
• sim(c1,c2) = similarity between clusters
– Defined over pairs of individual elements, and
(inductively) over cluster pairs
• Values typically between 0 and 1
For hierarchical clustering, require:
• Monotonicity:
for all possible clusters,
min[sim(c1,c2), sim(c1,c3)] ≥ sim(c1, c2 ∪ c3)
Merging two clusters does not increase similarity
84
Bottom-Up Clustering
Input: set X = {x1,…,xn} of objects
       similarity function sim: 2^X × 2^X → ℝ
for i ← 1 to n do:
    ci ← {xi}
C ← {c1, ..., cn}
j ← n + 1
while |C| > 1 do:
    (ca, cb) ← arg max_(cu,cv) sim(cu, cv)
    cj ← ca ∪ cb; trace the merge to construct the cluster hierarchy
    C ← (C \ {ca, cb}) ∪ {cj}
    j ← j + 1
85
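A naive (unoptimized) Python sketch of the bottom-up procedure above, with single-link or complete-link cluster-pair similarity; it returns the merge history rather than an explicit tree, and the function names are illustrative.

def bottom_up_cluster(objects, sim, linkage="single"):
    # sim(x, y): element-pair similarity; cluster-pair similarity is the max (single link)
    # or min (complete link) over cross-cluster element pairs.
    agg = max if linkage == "single" else min
    clusters = [frozenset([x]) for x in objects]
    history = []                                    # records each merge for the hierarchy
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = agg(sim(x, y) for x in clusters[i] for y in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        history.append((clusters[i], clusters[j], s))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history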
Types of Cluster Pair Similarity
• Single link: similarity of the most similar (nearest) members
• Complete link: similarity of the most dissimilar (remote) members
– a property of the unified cluster
• Group average: average pairwise similarity of members in the
unified cluster
– a property of the unified cluster
– O(n2) through efficient incremental update for cosine similarity
86
Single Link Clustering (“chaining” risk)
87
Complete Link Clustering
• “Global view” yields tighter clusters – usually
more suitable in NLP
88
Non-Hierarchical Clustering
• Iterative clustering:
– Start with an initial (random) set of clusters
– Assign each object to a cluster (or clusters)
– Re-compute cluster parameters
• E.g. centroids: μ(c) = (1/|c|) Σ_{x∈c} x
– Stop when the clustering is “good”
• Q: How many clusters?
89
K-means Algorithm
Input: set X = {x1,…,xn} of objects
       distance measure d: X × X → ℝ
       mean function μ: 2^X → X
Select k initial cluster centers f1, ..., fk
while not finished do:
    for all clusters cj do:
        cj ← { xi | fj = arg min_f d(xi, f) }
    for all means fj do:
        fj ← μ(cj)
Complexity: O(n), assuming a constant number of iterations
90
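A plain Python sketch of K-means for points given as equal-length tuples of numbers, with squared Euclidean distance and mean centroids; the random-sample initialization and the fixed iteration count are simplifying assumptions.

import random

def kmeans(points, k, iters=20, seed=0):
    rnd = random.Random(seed)
    centroids = [list(p) for p in rnd.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
        # Update step: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [points[i] for i in range(len(points)) if assignment[i] == j]
            if members:                      # empty clusters keep their previous centroid
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignment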
K-means Clustering
91
Example
Clustering of words from the NY Times using
co-occurring words:
1. ballot, polls, Gov, seats
2. profit, finance, payments
3. NFL, Reds, Sox, inning, quarterback, score
4. researchers, science
5. Scott, Mary, Barbara, Edward
Buckshot Algorithm
• Often, random initialization for K-means works well
• If not:
– Randomly choose √n points
– Run hierarchical group-average clustering: O((√n)2) = O(n)
– Use those cluster means as starting points for K-means
– O(n) total complexity
• Scatter/Gather (Cutting, Karger, Pedersen, 1993)
– An interactive clustering-based browsing scheme for
text collections and search results (constant-time improvements)
The EM Algorithm
• Soft clustering method to solve
θ* = arg max_θ P(observation | θ)
– Soft clustering – probabilistic association between
clusters and objects (θ – the model parameters)
Note: any occurrence of the data consists of:
– Observable variables: the objects we see
• Words in each context
• Word sequence in POS tagging
– Hidden variables: which cluster generated which object
• Which sense generates each context
• Underlying POS sequence
• Soft clustering can capture ambiguity (of words)
and multiple topics (of documents)
94
Two Principles
Expectation: if we knew θ we could
compute the expectation of the hidden
variables (e.g., the probability of x belonging to some cluster)
Maximization: if we knew the hidden
structure, we could compute the maximum
likelihood value of θ
Algorithm: initialize θ (randomly) and iterate
95
EM for Word-sense “Discrimination”
Cluster the contexts (occurrences) of an ambiguous word,
where each cluster constitutes a “sense”:
• Initialize randomly model parameters P(vj | sk) and P(sk)
• E-step: Compute P(sk | ci) for each context ci and sense sk
• M-step: Re-estimate the model parameters P(vj | sk) and
P(sk), for context words and senses
• Continue as long as the log-likelihood of all corpus contexts
1 ≤ i ≤ I increases (EM is guaranteed to increase it in each
step, until convergence to a maximum, possibly local):
l = Σ_i log Σ_k P(c_i | s_k) P(s_k)
E-Step
• To compute P(c_i | s_k), use the naive Bayes assumption:
P(c_i | s_k) = Π_{v_j ∈ c_i} P(v_j | s_k)
• For each context c_i and each sense s_k, estimate the
posterior probability h_ik = P(s_k | c_i) (an expected
“count” of the sense for c_i), using Bayes rule:
h_ik = P(c_i | s_k) P(s_k) / Σ_{k'} P(c_i | s_k') P(s_k')
97
M-Step
Re-estimate the parameters using maximum-likelihood estimation:
P(v_j | s_k) = Σ_{c_i : v_j ∈ c_i} h_ik / Σ_j Σ_{c_i : v_j ∈ c_i} h_ik
P(s_k) = Σ_i h_ik / Σ_k Σ_i h_ik = Σ_i h_ik / I
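A compact Python sketch of the whole E/M loop described on these slides, assuming bag-of-words contexts and a small add-epsilon smoother in the M-step to keep probabilities positive; the random initialization and fixed iteration count are illustrative choices, not Schütze's exact setup.

import math
import random
from collections import defaultdict

def em_sense_discrimination(contexts, k, iters=20, seed=0, eps=1e-9):
    # contexts: list of word lists; returns P(s_k), P(v_j | s_k) and posteriors h[i][k] = P(s_k | c_i).
    rnd = random.Random(seed)
    vocab = sorted({v for c in contexts for v in c})
    p_sense = [1.0 / k] * k
    p_word = []
    for _ in range(k):                                   # random normalized initialization
        w = {v: rnd.random() + eps for v in vocab}
        z = sum(w.values())
        p_word.append({v: x / z for v, x in w.items()})
    for _ in range(iters):
        # E-step: h_ik = P(s_k | c_i) by Bayes rule, computed in log space for stability.
        h = []
        for c in contexts:
            logs = [math.log(p_sense[s]) + sum(math.log(p_word[s][v]) for v in c)
                    for s in range(k)]
            m = max(logs)
            w = [math.exp(l - m) for l in logs]
            z = sum(w)
            h.append([x / z for x in w])
        # M-step: re-estimate P(s_k) and P(v_j | s_k) from the soft counts.
        for s in range(k):
            p_sense[s] = sum(h[i][s] for i in range(len(contexts))) / len(contexts)
            counts = defaultdict(float)
            for i, c in enumerate(contexts):
                for v in c:
                    counts[v] += h[i][s]
            z = sum(counts.values()) + eps * len(vocab)
            p_word[s] = {v: (counts[v] + eps) / z for v in vocab}
    return p_sense, p_word, h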
Decision Procedure
Assign senses by (same as Naïve Bayes):
s(w_i) = arg max_{s_k} [ log P(s_k) + Σ_{v_j ∈ c_i} log P(v_j | s_k) ]
• Can adjust the pre-determined number of senses k
to get finer or coarser distinctions
– Bank as physical location vs. abstract corporation
• If adding more senses doesn’t increase the log-likelihood
much, then stop
99
Results (Schütze 1998)
Word    Sense                   Accuracy
suit    lawsuit                 95 ± 0
        suit you wear           96 ± 0
motion  physical movement       85 ± 1
        proposal for action     88 ± 13
train   line of railroad cars   79 ± 19
        to teach                55 ± 33
• Works better for topic-related senses (given the
broad-context features used)
• Improved IR performance by 14% – representing
both query and document as senses, and combining
the results with word-based retrieval
100
HMM Part-of-Speech Tagging
• Generally speaking, the POS is the
“grammatical type” of a word:
– Verb, Noun, Adjective, Adverb, Article, …
• We can also include inflection:
– Verbs: tense, number, …
– Nouns: number, proper/common, …
– Adjectives: comparative, superlative, …
– …
• Most commonly used POS sets for English
have 50-80 different tags
101
Task: Part-Of-Speech Tagging
• Goal: Assign the correct part-of-speech to
each word (and punctuation) in a text.
• Example:
Two/CRD old/AJ0 men/NN2 bet/VVD on/PP0 the/AT0 game/NN1 ./PUN
• Learn a local model of POS dependencies,
usually from pre-tagged data
• No parsing
102
Hidden Markov Models
• Assume: the POS (state) sequence is generated as a time-invariant
random process, and each POS randomly generates a word (output symbol)
[Diagram: example HMM with states AT0, AJ0, NN1, NN2, transition
probabilities between them, and output probabilities such as
“the” 0.4 and “a” 0.6 from AT0, and “cat”, “cats”, “men”, “bet” from the noun states]
Definition of HMM for Tagging
• Set of states – all possible tags
• Output alphabet – all words in the language
• State/tag transition probabilities
• Initial state probabilities: the probability of
beginning a sentence with tag t (t0 → t)
• Output probabilities – producing word w at state t
• Output sequence – observed word sequence
• State sequence – underlying tag sequence
104
HMMs For Tagging
• First-order (bigram) Markov assumptions:
– Limited horizon: a tag depends only on the previous tag
P(t_{i+1} = t^k | t_1 = t^{j1},…,t_i = t^{ji}) = P(t_{i+1} = t^k | t_i = t^j)
– Time invariance: no change over time
P(t_{i+1} = t^k | t_i = t^j) = P(t_2 = t^k | t_1 = t^j) = P(t^j → t^k)
• Output probabilities:
– Probability of getting word w^k for tag t^j: P(w^k | t^j)
– Assumption: not dependent on other tags or words!
105
Combining Probabilities
• Probability of a tag sequence:
P(t_1 t_2 … t_n) = P(t_1) P(t_1→t_2) P(t_2→t_3) … P(t_{n-1}→t_n)
Assuming a starting tag t_0:
               = P(t_0→t_1) P(t_1→t_2) P(t_2→t_3) … P(t_{n-1}→t_n)
• Probability of a word sequence and tag sequence:
P(W, T) = Π_i P(t_{i-1}→t_i) P(w_i | t_i)
106
Training from Labeled Corpus
• Labeled training = each word has a POS tag
• Thus:
P_MLE(t^j) = C(t^j) / N
P_MLE(t^j→t^k) = C(t^j, t^k) / C(t^j)
P_MLE(w^k | t^j) = C(t^j : w^k) / C(t^j)
• Smoothing applies as usual
107
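A short Python sketch of these counts and MLE ratios over a tagged corpus; the sentence-start pseudo-tag and the omission of smoothing are simplifying assumptions of this sketch.

from collections import defaultdict

def train_hmm_mle(tagged_sentences, start="<s>"):
    # tagged_sentences: list of [(word, tag), ...]; returns P(t), P(t -> t'), P(w | t).
    tag_count, trans_count, emit_count = defaultdict(int), defaultdict(int), defaultdict(int)
    for sent in tagged_sentences:
        prev = start
        tag_count[start] += 1
        for w, t in sent:
            tag_count[t] += 1
            trans_count[(prev, t)] += 1
            emit_count[(t, w)] += 1
            prev = t
    n = sum(c for t, c in tag_count.items() if t != start)
    p_tag = {t: c / n for t, c in tag_count.items() if t != start}
    p_trans = {pair: c / tag_count[pair[0]] for pair, c in trans_count.items()}
    p_emit = {pair: c / tag_count[pair[0]] for pair, c in emit_count.items()}
    return p_tag, p_trans, p_emit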
Viterbi Tagging
• Most probable tag sequence given the text:
T* = arg max_T P_m(T | W)
   = arg max_T P_m(W | T) P_m(T) / P_m(W)     (Bayes’ theorem)
   = arg max_T P_m(W | T) P_m(T)              (W is constant for all T)
   = arg max_T Π_i [ m(t_{i-1}→t_i) m(w_i | t_i) ]
   = arg max_T Σ_i log [ m(t_{i-1}→t_i) m(w_i | t_i) ]
• Exponential number of possible tag sequences – use
dynamic programming for efficient computation
108
[Worked example: Viterbi trellis over three words w1 w2 w3 and tags t1 t2 t3,
with accumulated log-scores at each node, using the following costs]
−log m (transitions)   t1    t2    t3
t0 →                   2.3   1.7   1
t1 →                   1.7   1     2.3
t2 →                   0.3   3.3   3.3
t3 →                   1.3   1.3   2.3
−log m (outputs)       w1    w2    w3
t1                     0.7   2.3   2.3
t2                     1.7   0.7   3.3
t3                     1.7   1.7   1.3
Viterbi Algorithm
1. D(0, START) = 0
2. for each tag t ≠ START do: D(0, t) = −∞
3. for i ← 1 to N do:
   a. for each tag tj do:
      D(i, tj) ← max_k [ D(i-1, tk) + lm(tk→tj) + lm(wi | tj) ]
      Record best(i, j) = k which yielded the max
4. log P(W, T) = max_j D(N, tj)
5. Reconstruct the path from max_j backwards
Where: lm(.) = log m(.) and D(i, tj) = max joint log-probability of the state
and word sequences up to position i, ending at tj.
Complexity: O(Nt^2 N)
110
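A runnable Python sketch of Viterbi decoding with log probabilities; treating missing table entries as log 0 (i.e. no smoothing) and the start pseudo-tag are assumptions of this sketch.

import math

def viterbi(words, tags, log_trans, log_emit, start="<s>"):
    # log_trans[(t, t')] = log m(t -> t'); log_emit[(t, w)] = log m(w | t).
    NEG = float("-inf")
    lt = lambda a, b: log_trans.get((a, b), NEG)
    le = lambda t, w: log_emit.get((t, w), NEG)
    D = [{t: lt(start, t) + le(t, words[0]) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        D.append({})
        back.append({})
        for t in tags:
            best_prev = max(tags, key=lambda p: D[i - 1][p] + lt(p, t))
            D[i][t] = D[i - 1][best_prev] + lt(best_prev, t) + le(t, words[i])
            back[i][t] = best_prev
    # Reconstruct the best tag sequence from the back-pointers.
    last = max(tags, key=lambda t: D[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path)), D[-1][last]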
Three Basic Problems
1. Compute the probability of a text (observation): P_m(W_{1,N})
• language modeling – evaluate alternative texts and models
2. Compute the maximum probability tag (state) sequence:
arg max_{T_{1,N}} P_m(T_{1,N} | W_{1,N})
• tagging/classification
3. Compute the maximum likelihood model:
arg max_m P_m(W_{1,N})
• training / parameter estimation
111
Compute Text Probability
• Recall: P(W, T) = Π_i P(t_{i-1}→t_i) P(w_i | t_i)
• Text probability: need to sum P(W,T) over
all possible sequences – an exponential
number
• Dynamic programming approach – similar
to the Viterbi algorithm
• Will be used also for estimating model
parameters from an untagged corpus
112
Forward Algorithm
Define: A_i(k) = P(w_{1,k}, t_k = t^i);   N_t – total number of tags
1. Initialization: for i = 1 to N_t: A_i(1) = m(t_0→t^i) m(w_1 | t^i)
2. Induction: for k = 2 to N, for j = 1 to N_t:
   A_j(k) = [ Σ_i A_i(k-1) m(t^i→t^j) ] m(w_k | t^j)
3. Then: P_m(W_{1,N}) = Σ_i A_i(N)
Complexity = O(N_t^2 N) (like Viterbi, with Σ instead of max)
113
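A minimal Python sketch of the forward pass with raw (unscaled) probabilities; for long sequences a scaling or log-space version would be needed, which is omitted here.

def forward_probability(words, tags, p_trans, p_emit, start="<s>"):
    # A[t] holds A_i(k) = P(w_1..k, t_k = t); missing table entries count as probability 0.
    pt = lambda a, b: p_trans.get((a, b), 0.0)
    pe = lambda t, w: p_emit.get((t, w), 0.0)
    A = {t: pt(start, t) * pe(t, words[0]) for t in tags}
    for w in words[1:]:
        A = {t: sum(A[s] * pt(s, t) for s in tags) * pe(t, w) for t in tags}
    return sum(A.values())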
Forward Algorithm
[Trellis diagram: forward probabilities A_i(k) for tags t1…t5 over words
w1 w2 w3, with arcs weighted by m(t^i→t^j); the final column sums to P_m(W_{1,3})]
Backward Algorithm
Define: B_i(k) = P(w_{k+1,N} | t_k = t^i)
1. Initialization: for i = 1 to N_t: B_i(N) = 1
2. Induction: for k = N-1 to 1, for j = 1 to N_t:
   B_j(k) = Σ_i m(t^j→t^i) m(w_{k+1} | t^i) B_i(k+1)
3. Then: P_m(W_{1,N}) = Σ_i m(t_0→t^i) m(w_1 | t^i) B_i(1)
Complexity = O(N_t^2 N)
Backward Algorithm
[Trellis diagram: backward probabilities B_i(k) for tags t1…t5 over words
w1 w2 w3, with arcs weighted by m(t^i→t^j), combining to P_m(W_{1,3})]
Estimation from Untagged Corpus: EM –
Expectation-Maximization
1. Start with some initial model
2. Compute the probability of (virtually) each state
sequence given the current model
3. Use this probabilistic tagging to produce
probabilistic counts for all parameters, and use
these probabilistic counts to estimate a revised
model, which increases the likelihood of the
observed output W in each iteration
4. Repeat until convergence
Note: No labeled training required. Initialize by
lexicon constraints regarding possible POS for
each word (cf. “noisy counting” for PP’s)
117
Notation
• a_ij = estimate of P(t^i→t^j)
• b_jk = estimate of P(w^k | t^j)
• A_i(k) = P(w_{1,k}, t_k = t^i)      (from the Forward algorithm)
• B_i(k) = P(w_{k+1,N} | t_k = t^i)   (from the Backward algorithm)
118
Estimating transition probabilities
Define p_k(i,j) as the probability of traversing the arc t^i→t^j at
time k, given the observations:
p_k(i,j) = P(t_k = t^i, t_{k+1} = t^j | W)
         = P(t_k = t^i, t_{k+1} = t^j, W) / P(W)
         = A_i(k) a_ij b_{j w_{k+1}} B_j(k+1) / Σ_{r=1}^{N_t} A_r(k) B_r(k)
         = A_i(k) a_ij b_{j w_{k+1}} B_j(k+1) / Σ_{r=1}^{N_t} Σ_{s=1}^{N_t} A_r(k) a_rs b_{s w_{k+1}} B_s(k+1)
Expected transitions
• Define g_i(k) = P(t_k = t^i | W), then:
g_i(k) = Σ_{j=1}^{N_t} p_k(i,j)
• Now note that:
– Expected number of transitions from tag i = Σ_{k=1}^{N} g_i(k)
– Expected number of transitions from tag i to tag j = Σ_{k=1}^{N} p_k(i,j)
Re-estimation of Maximum Likelihood
Parameters
• a'_ij = (expected # of transitions from tag i to j) / (expected # of transitions from tag i)
        = Σ_{k=1}^{N} p_k(i,j) / Σ_{k=1}^{N} g_i(k)
• b'_ik = (expected # of observations of w^k for tag i) / (expected # of transitions from tag i)
        = Σ_{r: w_r = w^k} Σ_{j=1}^{N_t} p_r(i,j) / Σ_{k=1}^{N} g_i(k)
EM Algorithm
1. Choose initial model = <a,b,g(1)>
2. Repeat until results don’t improve (much):
1. Compute pk based on current model, using
Forward & Backwards algorithms to compute
A and B (Expectation for counts)
2. Compute new model <a’,b’,g’(1)>
(Maximization of parameters)
Note: Output likelihood is guaranteed to
increase in each iteration, but might
converge to a local maximum!
122
Initialize Model by Dictionary
Constraints
• Training should be directed to correspond to the
linguistic perception of POS (recall local max)
• Achieved by a dictionary with possible POS for
each word
• Word-based initialization:
– P(w|t) = 1 / #of listed POS for w, for the listed POS;
and 0 for unlisted POS
• Class-based initialization (Kupiec, 1992):
– Group all words with the same possible POS into a
‘metaword’
– Estimate parameters and perform tagging for
metawords
– Frequent words are handled individually
123
Some extensions for HMM POS tagging
• Higher-order models: trigrams, possibly
interpolated with bigrams
• Incorporating text features:
– Output prob = P(wi,fj | tk) where f is a vector of
features (capitalized, ends in –d, etc.)
– Features useful to handle unknown words
• Combining labeled and unlabeled training
(initialize with labeled then do EM)
124
Selective Sampling in Training
• Motivation – reduce amount of needed
training by smartly selecting most
informative examples
125
Approaches
• Early results – use classification score as
certainty measure – choose for training the
least certain examples
– But possibly the example is indeed ambiguous
for the classification model
• Argamon, Dagan – JAIR 1999: extending
the Query By Committee approach to
probabilistic classification
– Richer modeling of example “usefulness”
127
Some Other Potential Directions
• Parsing:
– Syntactic structure learned from tree-banks
• Unsupervised training didn’t succeed much (EM –
the Inside/Outside algorithm)
– Crucial lexical dependency information –
typically also learned from the tree-bank
– Potential to utilize unsupervised learning of
lexical co-occurrence from untagged corpus!
140
Human Input: Class Labels vs. Modeling
• Text Categorization
– Typically – “black box” learning from labeled training of
entailment from terms (the features) to categories
– Potential of human seeding and feedback of (meaningful)
terms rather than class-labels
• May be more efficient with respect to the amount of human effort
• Better control of category scope via “glass box” approach
• Similar situations in Information Extraction (IE)
– People define or review extraction patterns
• Generally: optimizing human input of class labels
vs. features/rules (the model itself)
– Relevancy of rule/knowledge-based models to NLP
– Inadequacy of strict “black box” supervised learning?
141
Modeling Semantic Variability
• Semantic variability in supervised tasks
– IE patterns – all possible ways to express a
semantic relationship
– Similarly, QA question types (where, when,…)
• Unsupervised learning of variability
– Frequent patterns in an IE class
– Learning paraphrases from repeated descriptions
of the same fact
• Parallel, comparable, or stand-alone corpora
• Potential contribution for many applications
142
Conclusions
• Different ways to reduce annotation – some
help more, some less, a lot to explore
• The traditional scheme of supervised
learning from examples isn’t necessarily
optimal for NLP
• A critical question – what’s the best way to
combine human input with machine learning
• Reducing human supervision is critical for
NLP applicability – in research and industry
143