Final review
LING 572
Fei Xia
03/07/06
Misc
• Parts 3 and 4 were due at 6am today.
• Presentation: email me the slides by 6am on 3/9
• Final report: email me by 6am on 3/14.
• Group meetings: 1:30-4:00pm on 3/16.
Outline
• Main topics
• Applying to NLP tasks
• Tricks
Main topics
Main topics
• Supervised learning
– Decision tree
– Decision list
– TBL
– MaxEnt
– Boosting
• Semi-supervised learning
– Self-training
– Co-training
– EM
– Co-EM
Main topics (cont)
• Unsupervised learning
– The EM algorithm
– The EM algorithm for PM models
• Forward-backward
• Inside-outside
• IBM models for MT
• Others
– Two dynamic models: FSA and HMM
– Re-sampling: bootstrap
– System combination
– Bagging
Main topics (cont)
• Homework
– Hw1: FSA and HMM
– Hw2: DT, DL, CNF, DNF, and TBL
– Hw3: Boosting
• Project:
– P1: Trigram (learn to use Carmel, relation between
HMM and FSA)
– P2: TBL
– P3: MaxEnt
– P4: Bagging, boosting, system combination, SSL
Supervised learning
A classification problem
District | House type   | Income | Previous Customer | Outcome
Suburban | Detached     | High   | No                | Nothing
Suburban | Semidetached | High   | Yes               | Respond
Rural    | Semidetached | Low    | No                | Respond
Urban    | Detached     | Low    | Yes               | Nothing
…
Classification and estimation problems
• Given
– x: input attributes
– y: the goal
– training data: a set of (x, y)
• Predict y given a new x:
– y is a discrete variable → classification problem
– y is a continuous variable → estimation problem
Five ML methods
• Decision tree
• Decision list
• TBL
• Boosting
• MaxEnt
Decision tree
• Modeling: tree representation
• Training: top-down induction, greedy
algorithm
• Decoding: find the path from root to a leaf
node, where the tests along the path are
satisfied.
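As a concrete illustration of the decoding step, here is a minimal sketch (not from the original slides); the node structure, attribute names, and splits are hypothetical, loosely based on the customer-response example above.

```python
# A minimal sketch of decision-tree decoding: follow the path from the root
# to a leaf, taking the branch whose test is satisfied at each internal node.
# The tree structure and attribute names below are hypothetical.

class Node:
    def __init__(self, attr=None, branches=None, label=None):
        self.attr = attr          # attribute tested at this node (None for a leaf)
        self.branches = branches  # dict: attribute value -> child Node
        self.label = label        # class label stored at a leaf

def decode(tree, x):
    """Return the label of the leaf reached by instance x (a dict of attributes)."""
    node = tree
    while node.label is None:
        node = node.branches[x[node.attr]]
    return node.label

# Hypothetical tree for the customer-response data above.
tree = Node(attr="District", branches={
    "Suburban": Node(attr="Previous Customer",
                     branches={"No": Node(label="Nothing"),
                               "Yes": Node(label="Respond")}),
    "Rural": Node(label="Respond"),
    "Urban": Node(label="Nothing"),
})

print(decode(tree, {"District": "Suburban", "Previous Customer": "Yes"}))  # Respond
```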
Decision tree (cont)
• Main algorithms: ID3, C4.5, CART
• Strengths:
– Ability to generate understandable rules
– Ability to clearly indicate best attributes
• Weaknesses:
– Data splitting
– Trouble with non-rectangular regions
– The instability of top-down induction → bagging
Decision list
• Modeling: a list of decision rules
• Training: greedy, iterative algorithm
• Decoding: find the 1st rule that applies
• Each decision is based on a single piece of
evidence, in contrast to MaxEnt, boosting, TBL
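A minimal sketch of decision-list decoding, returning the class of the first rule that applies; the rules and the default class are hypothetical.

```python
# A minimal sketch of decision-list decoding: return the class of the first
# rule whose condition the instance satisfies; fall back to a default class.
# The rules themselves are hypothetical.

def dl_decode(rules, default, x):
    for condition, label in rules:   # rules are in the order learned during training
        if condition(x):
            return label
    return default

rules = [
    (lambda x: x["Income"] == "High" and x["Previous Customer"] == "No", "Nothing"),
    (lambda x: x["District"] == "Rural", "Respond"),
]
print(dl_decode(rules, "Nothing",
                {"Income": "Low", "District": "Rural", "Previous Customer": "No"}))
# Respond (matched by the 2nd rule, not the default)
```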
TBL
• Modeling: a list of transformations (similar
to decision rules)
• Training:
– Greedy, iterative algorithm
– The concept of current state
• Decoding: apply every transformation to
the data
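A minimal sketch of TBL decoding: initialize the current state, then apply every learned transformation in order; the initializer and the POS-style transformation are hypothetical.

```python
# A minimal sketch of TBL decoding: initialize each token's label, then apply
# every learned transformation, in the order learned, to the current state.
# The initializer and the transformation below are hypothetical POS-style rules.

def tbl_decode(tokens, initial_label, transformations):
    labels = [initial_label(tok) for tok in tokens]        # the current state
    for trigger, old, new in transformations:
        for i in range(len(tokens)):
            if labels[i] == old and trigger(tokens, labels, i):
                labels[i] = new                             # change cur_y
    return labels

# "Change NN to VB when the previous word is 'to'."
transformations = [
    (lambda toks, labs, i: i > 0 and toks[i - 1] == "to", "NN", "VB"),
]
print(tbl_decode(["to", "walk"], lambda t: "NN", transformations))  # ['NN', 'VB']
```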
TBL (cont)
• Strengths:
– Minimizing error rate directly
– Ability to handle non-classification problems
• Dynamic problem: POS tagging
• Non-classification problem: parsing
• Weaknesses:
– Transformations are hard to interpret as they interact
with one another
– Probabilistic TBL: TBL-DT
Boosting
[Diagram: the training sample is reweighted at each iteration; the base learner (ML) produces classifiers f1, f2, …, fT, which are combined into the final classifier f]
Boosting (cont)
• Modeling: combining a set of weak classifiers
to produce a powerful committee.
• Training: learn one classifier at each iteration
• Decoding: use the weighted majority vote of
the weak classifiers
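A minimal sketch of the decoding step, assuming AdaBoost-style weak classifiers that output ±1 and per-classifier weights produced by training; the classifiers and weights below are hypothetical.

```python
# A minimal sketch of boosting decoding: the final prediction is the sign of
# the weighted sum of the weak classifiers' votes (AdaBoost-style, labels ±1).
# The weak classifiers and their weights are hypothetical.

def boosted_predict(weak_classifiers, alphas, x):
    score = sum(a * h(x) for h, a in zip(weak_classifiers, alphas))
    return 1 if score >= 0 else -1

hs = [lambda x: 1 if x[0] > 0 else -1,
      lambda x: 1 if x[1] > 0 else -1]
alphas = [0.7, 0.3]
print(boosted_predict(hs, alphas, (2.0, -1.0)))  # 1: the heavier classifier wins
```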
Boosting (cont)
• Strengths
– It comes with a set of theoretical guarantees (e.g., bounds on training error and test error).
– It only needs to find weak classifiers.
• Weaknesses:
– It is susceptible to noise.
– The actual performance depends on the data
and the base learner.
MaxEnt
• The task: find p* such that
  $p^* = \arg\max_{p \in P} H(p)$
  where
  $P = \{\, p \mid E_p f_j = E_{\tilde{p}} f_j,\; j \in \{1, \dots, k\} \,\}$
• If p* exists, it has the form
  $p(x) = \frac{1}{Z} \exp\Big( \sum_{j=1}^{k} \lambda_j f_j(x) \Big)$
MaxEnt (cont)
• If p* exists, then
  $p^* = \arg\max_{q \in Q} L(q)$
  where
  $Q = \Big\{\, q \;\Big|\; q(x) = \frac{1}{Z} \exp\Big( \sum_{j=1}^{k} \lambda_j f_j(x) \Big) \text{ for some } \lambda_1, \dots, \lambda_k \,\Big\}$
  $L(q) = \sum_{x} \tilde{p}(x) \log q(x)$
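A minimal sketch of evaluating the exponential form above over a small finite event space; the feature functions and weights are hypothetical.

```python
import math

# A minimal sketch of p(x) = exp(sum_j lambda_j * f_j(x)) / Z, computed over a
# small finite event space. The features and weights below are hypothetical.

def maxent_distribution(events, features, lambdas):
    scores = [math.exp(sum(l * f(x) for f, l in zip(features, lambdas)))
              for x in events]
    Z = sum(scores)                      # the normalizer
    return {x: s / Z for x, s in zip(events, scores)}

events = ["a", "b", "c"]
features = [lambda x: 1.0 if x == "a" else 0.0,
            lambda x: 1.0 if x in ("a", "b") else 0.0]
lambdas = [0.5, 1.0]
print(maxent_distribution(events, features, lambdas))
# probabilities sum to 1; "a" gets the largest mass
```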
MaxEnt (cont)
• Training: GIS, IIS
• Feature selection:
– Greedy algorithm
– Select one (or more) at a time
• In general, MaxEnt achieves good
performance on many NLP tasks.
Common issues
• Objective function / Quality measure:
– DT, DL: e.g., information gain
– TBL, Boosting: minimize training errors
– MaxEnt: maximize entropy while satisfying
constraints
Common issues (cont)
• Avoiding overfitting
– Use development data
– Two strategies:
• stop early
• post-pruning
Common issues (cont)
• Missing attribute values:
– Assume a “blank” value
– Assign the most common value among all “similar” examples in the training data
– (DL, DT): Assign a fraction of the example to each possible class.
• Continuous-valued attributes
– Choosing thresholds by checking the training data
Common issues (cont)
• Attributes with different costs
– DT: Change the quality measure to include
the costs
• Continuous-valued goal attribute
– DT, DL: each “leaf” node is marked with a real
value or a linear function
– TBL, MaxEnt, Boosting: ??
Comparison of supervised learners
                | DT         | DL                     | TBL                              | Boosting                     | MaxEnt
Probabilistic   | PDT        | PDL                    | TBL-DT                           | Confidence                   | Y
Parametric      | N          | N                      | N                                | N                            | Y
Representation  | Tree       | Ordered list of rules  | Ordered list of transformations  | List of weighted classifiers | List of weighted features
Each iteration  | Attribute  | Rule                   | Transformation                   | Classifier & weight          | Feature & weight
Data processing | Split data | Split data*            | Change cur_y                     | Reweight (x,y)               | None
Decoding        | Path       | 1st rule               | Sequence of rules                | Calc f(x)                    | Calc f(x)
Semi-supervised Learning
Semi-supervised learning
• Each learning method makes some
assumptions about the problem.
• SSL works when those assumptions are
satisfied.
• SSL could degrade the performance when
mistakes reinforce themselves.
SSL (cont)
• We have covered four methods: self-training, co-training, EM, and co-EM
Co-training
• The original paper: (Blum and Mitchell, 1998)
– Two “independent” views: split the features into two
sets.
– Train a classifier on each view.
– Each classifier labels data that can be used to train
the other classifier.
• Extension:
– Relax the conditional independence assumptions
– Instead of using two views, use two or more
classifiers trained on the whole feature set.
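A minimal sketch of a Blum-and-Mitchell-style co-training loop, assuming scikit-learn is available and the two views are stored as aligned arrays; the learner choice, the growth size k, and the number of rounds are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# A minimal co-training sketch: one classifier per feature view; in each round,
# both classifiers label their most confident unlabeled examples, which then
# grow the shared labeled set. Learner, k, and rounds are hypothetical choices.

def co_train(X1, X2, y, U1, U2, rounds=5, k=2):
    c1 = c2 = None
    for _ in range(rounds):
        if len(U1) == 0:
            break
        c1 = LogisticRegression().fit(X1, y)   # classifier for view 1
        c2 = LogisticRegression().fit(X2, y)   # classifier for view 2
        chosen, new_labels = [], []
        for clf, U in ((c1, U1), (c2, U2)):
            probs = clf.predict_proba(U)
            picks = np.argsort(-probs.max(axis=1))[:k]   # most confident examples
            chosen.extend(picks.tolist())
            new_labels.extend(clf.classes_[probs.argmax(axis=1)[picks]].tolist())
        chosen = np.array(chosen)
        # each classifier's confident labels grow the shared labeled set
        # (an example picked by both views is simply added twice in this sketch)
        X1 = np.vstack([X1, U1[chosen]])
        X2 = np.vstack([X2, U2[chosen]])
        y = np.concatenate([y, np.array(new_labels)])
        keep = np.setdiff1d(np.arange(len(U1)), chosen)
        U1, U2 = U1[keep], U2[keep]
    return c1, c2

rng = np.random.RandomState(0)
X1, X2 = rng.randn(10, 3), rng.randn(10, 3)    # two views of 10 labeled examples
y = (X1[:, 0] > 0).astype(int)
U1, U2 = rng.randn(20, 3), rng.randn(20, 3)    # two views of 20 unlabeled examples
c1, c2 = co_train(X1, X2, y, U1, U2)
```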
Unsupervised learning
Unsupervised learning
• EM is a method of estimating parameters
in the MLE framework.
• It finds a sequence of parameters that
improve the likelihood of the training data.
The EM algorithm
• Start with an initial estimate $\theta^0$
• Repeat until convergence:
– E-step: calculate
  $Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log P(x_i, y \mid \theta)$
– M-step: find
  $\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^t)$
The EM algorithm (cont)
• The optimal solution for the M-step exists
for many classes of problems.
→ A number of well-known methods are special cases of EM.
• The EM algorithm for PM models
– Forward-backward algorithm
– Inside-outside algorithm
–…
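As a concrete instance of the E-step/M-step loop above, here is a minimal sketch of EM for a 1D two-component Gaussian mixture (one of the special cases mentioned in these slides); the data and the initialization are hypothetical.

```python
import numpy as np

# A minimal EM sketch for a 1D two-component Gaussian mixture. The data and
# initial estimates are hypothetical; convergence is checked on the log-likelihood.

def gaussian(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm(x, iters=100, tol=1e-6):
    w = np.array([0.5, 0.5])                    # mixture weights
    mu = np.array([x.min(), x.max()])           # initial means
    var = np.array([1.0, 1.0])                  # initial variances
    prev_ll = -np.inf
    for _ in range(iters):
        # E-step: responsibilities P(component | x_i, theta^t)
        joint = w * np.vstack([gaussian(x, mu[k], var[k]) for k in range(2)]).T
        ll = np.log(joint.sum(axis=1)).sum()    # log-likelihood of the data
        resp = joint / joint.sum(axis=1, keepdims=True)
        # M-step: re-estimate theta to maximize Q(theta, theta^t)
        n_k = resp.sum(axis=0)
        w = n_k / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / n_k
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return w, mu, var

x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(5, 1, 200)])
print(em_gmm(x))
```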
Other topics
FSA and HMM
• Two types of HMMs:
– State-emission and arc-emission HMMs
– They are equivalent
• We can convert an HMM into a WFA
• Modeling: Markov assumption
• Training:
– Supervised: counting
– Unsupervised: forward-backward algorithm
• Decoding: Viterbi algorithm
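A minimal sketch of Viterbi decoding for a state-emission HMM, with the model stored as plain dictionaries; the toy tag set, probabilities, and observations are hypothetical.

```python
# A minimal Viterbi sketch for a state-emission HMM: delta[t][s] is the score
# of the best path ending in state s after emitting the first t+1 symbols.
# The toy model (states, probabilities, observations) is hypothetical.

def viterbi(obs, states, start_p, trans_p, emit_p):
    delta = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        delta.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, delta[t - 1][r] * trans_p[r][s] * emit_p[s][obs[t]]) for r in states),
                key=lambda pair: pair[1])
            delta[t][s], back[t][s] = score, prev
    # follow back-pointers from the best final state
    best = max(states, key=lambda s: delta[-1][s])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ("N", "V")
start_p = {"N": 0.6, "V": 0.4}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"flies": 0.4, "time": 0.6}, "V": {"flies": 0.7, "time": 0.3}}
print(viterbi(["time", "flies"], states, start_p, trans_p, emit_p))
```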
Bootstrap
[Diagram: B bootstrap samples, each passed to the learner (ML) to produce f1, f2, …, fB, which are combined into f]
Bootstrap (cont)
• A method of re-sampling:
– One original sample → B bootstrap samples
• It has a strong mathematical background.
• It is a method for estimating standard
errors, bias, and so on.
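A minimal sketch of the re-sampling idea: draw B bootstrap samples with replacement from one original sample and use them to estimate the standard error of a statistic; the data and the choice of B are hypothetical.

```python
import random
import statistics

# A minimal bootstrap sketch: one original sample -> B bootstrap samples,
# each drawn with replacement, used here to estimate the standard error of
# the mean. The data and the choice B = 1000 are hypothetical.

def bootstrap_standard_error(sample, statistic, B=1000, seed=0):
    rng = random.Random(seed)
    replicates = []
    for _ in range(B):
        resample = [rng.choice(sample) for _ in sample]   # same size, with replacement
        replicates.append(statistic(resample))
    return statistics.stdev(replicates)

data = [2.1, 3.4, 1.9, 4.2, 2.8, 3.7, 2.5]
print(bootstrap_standard_error(data, statistics.mean))
```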
System combination
[Diagram: different learners ML1, ML2, …, MLB produce f1, f2, …, fB, which are combined into f]
System combination (cont)
$f(x) = g(f_1(x), \dots, f_m(x))$
• Hybridization: combine substructures to
produce a new one.
– Voting
– Naïve Bayes
• Switching: choose one of the fi(x)
– Similarity switching
– Naïve Bayes
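A minimal sketch of f(x) = g(f1(x), …, fm(x)) with g chosen to be a simple majority vote; the component classifiers are hypothetical.

```python
from collections import Counter

# A minimal sketch of system combination by voting: f(x) is the class that
# receives the most votes from the component classifiers, which are hypothetical.

def combine_by_vote(classifiers, x):
    votes = [f(x) for f in classifiers]
    return Counter(votes).most_common(1)[0][0]

classifiers = [lambda x: "Respond" if x["Income"] == "High" else "Nothing",
               lambda x: "Respond" if x["District"] == "Rural" else "Nothing",
               lambda x: "Nothing"]
print(combine_by_vote(classifiers, {"Income": "High", "District": "Rural"}))  # Respond
```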
Bagging
[Diagram: bagging = bootstrap + system combination; the learner (ML) is trained on each bootstrap sample to produce f1, f2, …, fB, which are combined into f]
Bagging (cont)
• It is effective for unstable learning
methods:
– Decision tree
– Regression tree
– Neural network
• It does not help stable learning methods
– K-nearest neighbors
Relations
Relations
• WFSA and HMM
• DL, DT, TBL
• EM, EM for PM
WFSA and HMM
[Diagram: Start → HMM → Finish]
Add a “Start” state and a transition from “Start” to any state in HMM.
Add a “Finish” state and a transition from any state in HMM to “Finish”.
DL, CNF, DNF, DT, TBL
[Diagram: relations among k-DL, k-CNF, k-DNF, k-DT, and k-TBL]
The EM algorithm
[Diagram: the generalized EM, the EM algorithm, and the EM algorithm for PM models, with instances: forward-backward, inside-outside, IBM models, Gaussian mixtures]
Solving an NLP problem
Issues
• Modeling: represent the problem as a formula and
decompose the formula into a function of parameters
• Training: estimate model parameters
• Decoding: find the best answer given the parameters
• Other issues:
– Preprocessing
– Postprocessing
– Evaluation
–…
Modeling
• Generative vs. discriminative models
$P(F, E)$, $P(F \mid E)$, $P(E \mid F)$
• Introducing hidden variables
$P(F \mid E) = \sum_{a} P(F, a \mid E)$
• The order of decomposition
$P(F, a \mid E) = P(a \mid E) \cdot P(F \mid a, E)$
$P(F, a \mid E) = P(F \mid E) \cdot P(a \mid F, E)$
Modeling (cont)
• Approximation / assumptions
  $P(a \mid E) = \prod_{i} P(a_i \mid E, a_1^{i-1}) \approx \prod_{i} P(a_i \mid a_{i-1})$
• Final formulae and types of parameters
  $P(F \mid E) = \frac{P(m \mid l)}{(l+1)^m} \prod_{j=1}^{m} \sum_{i=0}^{l} P(f_j \mid e_i)$
Modeling (cont)
• Using classifiers for non-classification
problem
– POS tagging
– Chunking
– Parsing
Training
• Objective functions:
– Maximize likelihood: EM
– Minimize error rate: TBL
– Maximum entropy: MaxEnt
– …
• Supervised, semi-supervised, unsupervised:
– Ex: Maximize likelihood
• Supervised: simple counting
• Unsupervised: EM
Training (cont)
• At each iteration:
– Choose one attribute / rule / weight / … at a time and never change it later: DT, DL, TBL
– Update all the parameters at each iteration: EM
• Choosing “untrained” parameters (e.g., thresholds): use development data.
– Ex: the minimal “gain” required to continue iterating
Decoding
• Dynamic programming:
– CYK for PCFG
– Viterbi for HMM
• Dynamic problem:
– Decode from left to right
– Features only look at the left context
– Keep top-N hypotheses at each position
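A minimal sketch of the left-to-right decoding described above: at each position, extend the surviving hypotheses, score them with features that look only at the left context, and keep the top-N; the scorer and tag set are hypothetical.

```python
import math

# A minimal left-to-right beam-decoding sketch: at each position, extend every
# surviving hypothesis with every tag, score it using only the left context,
# and keep the top-N hypotheses. The scorer and tag set are hypothetical.

def beam_decode(tokens, tags, score, N=3):
    beam = [([], 0.0)]                      # (tag sequence so far, log-score)
    for i in range(len(tokens)):
        candidates = [(hist + [t], s + score(tokens, i, hist, t))
                      for hist, s in beam for t in tags]
        candidates.sort(key=lambda h: h[1], reverse=True)
        beam = candidates[:N]               # keep top-N hypotheses at this position
    return beam[0][0]

# Hypothetical scorer: prefer "V" after "to", otherwise prefer "N".
def score(tokens, i, hist, tag):
    if i > 0 and tokens[i - 1] == "to":
        return math.log(0.9) if tag == "V" else math.log(0.1)
    return math.log(0.7) if tag == "N" else math.log(0.3)

print(beam_decode(["to", "walk"], ["N", "V"], score))  # ['N', 'V']
```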
Preprocessing
• Sentence segmentation
• Sentence alignment (for MT)
• Tokenization
• Morphing
• POS tagging
• …
Post-processing
• System combination
• Casing (MT)
• …
Evaluation
• Use standard training/test data if possible.
• Choose appropriate evaluation measures:
– WSD: for what applications?
– Word alignment: F-measure vs. AER. How does it affect the MT result?
– Parsing: F-measure vs. dependency link
accuracy
Tricks
Tricks
• Algebra
• Probability
• Optimization
• Programming
Algebra
The order of sums:
$\sum_{x_1} \sum_{x_2} \cdots \sum_{x_n} f(x_1, \dots, x_n) = \sum_{x_n} \sum_{x_{n-1}} \cdots \sum_{x_1} f(x_1, \dots, x_n)$

Pulling out constants:
$\sum_{x_1} \sum_{x_2} \cdots \sum_{x_n} c \, f(x_1, \dots, x_n) = c \sum_{x_1} \sum_{x_2} \cdots \sum_{x_n} f(x_1, \dots, x_n)$
Algebra (cont)
The order of sums and products:
$\sum_{x_1 \in X_1} \sum_{x_2 \in X_2} \cdots \sum_{x_n \in X_n} \prod_{i=1}^{n} f_i(x_i) = \prod_{i=1}^{n} \sum_{x_i \in X_i} f_i(x_i)$

The order of log and product / sum:
$\log \prod_i f_i = \sum_i \log f_i$
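A worked two-variable instance of the sum-product identity above (the kind of rearrangement used, e.g., in the IBM Model 1 formula earlier):

```latex
\sum_{x_1} \sum_{x_2} f_1(x_1) f_2(x_2)
  = \sum_{x_1} f_1(x_1) \Big( \sum_{x_2} f_2(x_2) \Big)
  = \Big( \sum_{x_1} f_1(x_1) \Big) \Big( \sum_{x_2} f_2(x_2) \Big)
```

The left-hand side sums over |X1|·|X2| terms; the right-hand side needs only |X1| + |X2|.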
Probability
Introducing a new random variable:
$p(x \mid \theta) = \sum_{y} p(x, y \mid \theta) = \sum_{y} p(y \mid \theta) \, p(x \mid y, \theta)$

The order of decomposition:
$P(x, y, z) = P(x) \, P(y \mid x) \, P(z \mid x, y)$
$P(x, y, z) = P(y) \, P(z \mid y) \, P(x \mid y, z)$

More general cases:
$P(A_1) = \sum_{A_2, \dots, A_n} P(A_1, \dots, A_n)$
$P(A_1, \dots, A_n) = \prod_{i=1}^{n} P(A_i \mid A_1, \dots, A_{i-1})$
Probability (cont)
Bayes Rule:
$p(y \mid x) = \frac{p(y) \, p(x \mid y)}{p(x)}$

Source-channel model:
$\arg\max_{y} p(y \mid x) = \arg\max_{y} p(y) \, p(x \mid y)$
Probability (cont)
Normalization:
$p(x) = \frac{Ct(x)}{\sum_{x'} Ct(x')}$

Jensen’s inequality:
$E[\log p(x)] \le \log E[p(x)]$
Optimization
• When there is no analytical solution, use an iterative approach.
• If the optimal solution to g(x) is hard to
find, look for the optimal solution to a
(tight) lower bound of g(x).
Optimization (cont)
• Using Lagrange multipliers:
– Constrained problem: maximize f(x) with the constraint that g(x) = 0
– Unconstrained problem: maximize f(x) – λg(x)
• Take first derivatives to find the stationary points.
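A small worked example of the recipe above, maximizing entropy subject to a normalization constraint (consistent with the MaxEnt slides):

```latex
\text{maximize } H(p) = -\sum_{i=1}^{n} p_i \log p_i
\quad \text{subject to} \quad \sum_{i=1}^{n} p_i = 1

\Lambda(p, \lambda) = -\sum_i p_i \log p_i - \lambda \Big( \sum_i p_i - 1 \Big)

\frac{\partial \Lambda}{\partial p_i} = -\log p_i - 1 - \lambda = 0
\;\Rightarrow\; p_i = e^{-1-\lambda}
\;\Rightarrow\; p_i = \frac{1}{n} \ \text{after imposing} \ \textstyle\sum_i p_i = 1
```

With no feature constraints, the maximum-entropy distribution is uniform.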
Programming
• Using/creating a good package:
– Tutorial, sample data, well-written code
– Multiple levels of code
• Core ML algorithm: e.g., TBL
• Wrapper for a task: e.g., POS tagger
• Wrapper to deal with input, output, etc.
Programming (cont)
• Good practice:
– Write notes and create wrappers (all the commands should be stored in the notes or, even better, in a wrapper script)
– Use standard directory structures:
• src/, include/, exec/, bin/, obj/, docs/, sample/, data/, result/
– Give meaningful filenames to important code: e.g., build_trigram_tagger.pl rather than aaa100.exec
– Give meaningful function and variable names
– Don’t use global variables
Final words
• We have covered a lot of topics: 5+4+3+4
• It takes time to digest, but at least we
understand the basic concepts.
• The next step: applying them to real
applications.