
Introduction
LING 572
Fei Xia
Week 1: 1/3/06
Outline
• Course overview
• Problems and methods
• Mathematical foundation
– Probability theory
– Information theory
Course overview
Course objective
• Focus on statistical methods that produce state-of-the-art results
• Questions to ask about each algorithm:
– How does the algorithm work: input, output, steps?
– What kinds of tasks can the algorithm be applied to?
– How much data is needed?
• Labeled data
• Unlabeled data
General info
• Course website:
– Syllabus (incl. slides and papers): updated every week.
– Message board
– ESubmit
• Office hour: W: 3-5pm.
• Prerequisites:
– Ling570 and Ling571.
– Programming: C, C++, or Java; Perl is a plus.
– Introduction to probability and statistics
Expectations
• Reading:
– Papers are online: who doesn’t have access to a printer?
– Reference book: Manning & Schutze (MS)
– Finish reading before class. Bring your questions to class.
• Grade:
– Homework (3): 30%
– Project (6 parts): 60%
– Class participation: 10%
– No quizzes, exams
Assignments
Hw1: FSA and HMM
Hw2: DT, DL, and TBL.
Hw3: Boosting
– No coding.
– Bring the finished assignments to class.
Project
P1: Method 1 (Baseline): Trigram
P2: Method 2: TBL
P3: Method 3: MaxEnt
P4: Method 4: choose one of four tasks.
P5: Presentation
P6: Final report
Methods 1-3 are supervised methods.
Method 4: bagging, boosting, semi-supervised learning, or system combination.
P1 is an individual task, P2-P6 are group tasks.
A group should have no more than three people.


– Use ESubmit.
– You will need to use others’ code and write your own code.
Summary of Ling570
• Overview: corpora, evaluation
• Tokenization
• Morphological analysis
• POS tagging
• Shallow parsing
• N-grams and smoothing
• WSD
• NE tagging
• HMM
Summary of Ling571
• Parsing
• Semantics
• Discourse
• Dialogue
• Natural language generation (NLG)
• Machine translation (MT)
570/571 vs. 572
• 572 focuses more on statistical approaches.
• 570/571 are organized by tasks; 572 is
organized by learning methods.
• I assume that you know
– The basics of each task: POS tagging, parsing, …
– The basic concepts: PCFG, entropy, …
– Some learning methods: HMM, FSA, …
An example
• 570/571:
– POS tagging: HMM
– Parsing: PCFG
– MT: Model 1-4 training
• 572:
– HMM: forward-backward algorithm
– PCFG: inside-outside algorithm
– MT: EM algorithm
– All are special cases of the EM algorithm, one method of unsupervised learning.
Course layout
• Supervised methods
– Decision tree
– Decision list
– Transformation-based learning (TBL)
– Bagging
– Boosting
– Maximum Entropy (MaxEnt)
Course layout (cont)
• Semi-supervised methods
– Self-training
– Co-training
• Unsupervised methods
– EM algorithm
• Forward-backward algorithm
• Inside-outside algorithm
• EM for PM models
Outline
• Course overview
• Problems and methods
• Mathematical foundation
– Probability theory
– Information theory
Problems and methods
Types of ML problems
• Classification problem
• Estimation problem
• Clustering
• Discovery
• …
A learning method can be applied to one or
more types of ML problems.
We will focus on the classification problem.
Classification problem
• Given a set of classes and data x, decide
which class x belongs to.
• Labeled data:
– A labeled data set is a collection of pairs (xi, yi).
– xi is a list of attribute values.
– yi is a member of a pre-defined set of classes.
Examples of classification problem
• Disambiguation:
– Document classification
– POS tagging
– WSD
– PP attachment given a set of other phrases
• Segmentation:
– Tokenization / Word segmentation
– NP Chunking
Learning methods
• Modeling: represent the problem as a
formula and decompose the formula into a
function of parameters
• Training stage: estimate the parameters
• Test (decoding) stage: find the answer
given the parameters
Modeling
• Joint vs. conditional models:
– P(data, model)
– P(model | data)
– P(data | model)
• Decomposition:
– Which variable conditions on which variable?
– What independence assumptions are made?
An example of different modeling
P(A, B, C) = P(A) P(B|A) P(C|A, B) \approx P(A) P(B|A) P(C|B)
P(A, B, C) = P(B) P(C|B) P(A|B, C) \approx P(B) P(C|B) P(A|B)
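To make the effect of the independence assumption concrete, here is a minimal Python sketch (the probability tables are made up for illustration, not from the slides). It compares the exact chain-rule decomposition of P(A,B,C) with the approximation that replaces P(C|A,B) by P(C|B):

# Hypothetical probability tables for binary A, B, C.
P_A = {0: 0.6, 1: 0.4}                      # P(A)
P_B_given_A = {0: {0: 0.7, 1: 0.3},         # P(B|A)
               1: {0: 0.2, 1: 0.8}}
P_C_given_AB = {(0, 0): {0: 0.9, 1: 0.1},   # P(C|A,B)
                (0, 1): {0: 0.5, 1: 0.5},
                (1, 0): {0: 0.8, 1: 0.2},
                (1, 1): {0: 0.4, 1: 0.6}}

def p_c_given_b(c, b):
    # P(C|B), obtained by marginalizing A out of P(A) P(B|A) P(C|A,B).
    num = sum(P_A[a] * P_B_given_A[a][b] * P_C_given_AB[(a, b)][c] for a in P_A)
    den = sum(P_A[a] * P_B_given_A[a][b] for a in P_A)
    return num / den

a, b, c = 1, 0, 1
exact  = P_A[a] * P_B_given_A[a][b] * P_C_given_AB[(a, b)][c]
approx = P_A[a] * P_B_given_A[a][b] * p_c_given_b(c, b)
print(exact, approx)   # equal only if C is conditionally independent of A given B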
Training
• Objective functions:
– Maximize likelihood:
\hat{\theta}_{ML} = \arg\max_\theta P(data | \theta)
– Minimize error rate
– Maximum entropy
– ….
• Supervised, semi-supervised, unsupervised:
– Ex: Maximize likelihood
• Supervised: simple counting
• Unsupervised: EM
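For the supervised case, "simple counting" means relative-frequency (maximum-likelihood) estimates. A minimal Python sketch, using a hypothetical toy tagged corpus rather than real course data:

from collections import Counter, defaultdict

# Hypothetical labeled data: (word, tag) pairs.
data = [("the", "DET"), ("dog", "N"), ("barks", "V"),
        ("the", "DET"), ("cat", "N"), ("sleeps", "V")]

tag_counts = Counter(tag for _, tag in data)
word_tag_counts = defaultdict(Counter)
for word, tag in data:
    word_tag_counts[tag][word] += 1

total = sum(tag_counts.values())
P_tag = {t: c / total for t, c in tag_counts.items()}                  # P(tag)
P_word_given_tag = {t: {w: c / tag_counts[t] for w, c in wc.items()}   # P(word|tag)
                    for t, wc in word_tag_counts.items()}

print(P_tag["N"])                    # 2/6
print(P_word_given_tag["N"]["dog"])  # 1/2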
Decoding
• DP algorithm
– CYK for PCFG
– Viterbi for HMM
–…
• Pruning:
– TopN: keep the top N hyps at each node.
– Beam: keep hyps whose weights >= beam * max_weight.
– Threshold: keep hyps whose weights >= threshold.
–…
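A minimal Python sketch of the three pruning strategies; the hypotheses and weights below are made up for illustration:

# Hypothetical (hypothesis, weight) pairs at one node of the search.
hyps = [("h1", 0.50), ("h2", 0.30), ("h3", 0.12), ("h4", 0.05), ("h5", 0.03)]

def prune_topN(hyps, N):
    # Keep the N highest-weighted hypotheses.
    return sorted(hyps, key=lambda h: h[1], reverse=True)[:N]

def prune_beam(hyps, beam):
    # Keep hypotheses whose weight >= beam * max_weight.
    max_weight = max(w for _, w in hyps)
    return [(h, w) for h, w in hyps if w >= beam * max_weight]

def prune_threshold(hyps, threshold):
    # Keep hypotheses whose weight >= threshold.
    return [(h, w) for h, w in hyps if w >= threshold]

print(prune_topN(hyps, 2))
print(prune_beam(hyps, 0.2))
print(prune_threshold(hyps, 0.1))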
Outline
• Course overview
• Problems and methods
• Mathematical foundation
– Probability theory
– Information theory
Probability Theory
Probability theory
• Sample space, event, event space
• Random variable and random vector
• Conditional probability, joint probability,
marginal probability (prior)
Sample space, event, event space
• Sample space (Ω): a collection of basic
outcomes.
– Ex: toss a coin twice: {HH, HT, TH, TT}
• Event: an event is a subset of Ω.
– Ex: {HT, TH}
• Event space (2^Ω): the set of all possible events.
Random variable
• The outcome of an experiment need not
be a number.
• We often want to represent outcomes as
numbers.
• A random variable is a function that
associates a unique numerical value with
every outcome of an experiment.
• A random variable is a function X: Ω → R.
• Ex: toss a coin once: X(H)=1, X(T)=0
Two types of random variable
• Discrete random variable: X takes on only
a countable number of distinct values.
– Ex: Toss a coin 10 times. X is the number of
tails that are noted.
• Continuous random variable: X takes on
an uncountable number of possible values.
– Ex: X is the lifetime (in hours) of a light bulb.
Probability function
• The probability function of a discrete variable X is a
function which gives the probability p(xi) that the
random variable equals xi, i.e., p(xi) = P(X = xi).
0 \le p(x_i) \le 1
\sum_{x_i} p(x_i) = 1
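A quick Python check of the two constraints on a small, hypothetical pmf:

# Hypothetical pmf of a discrete random variable X.
pmf = {0: 0.5, 1: 0.25, 2: 0.25}

assert all(0 <= p <= 1 for p in pmf.values())  # 0 <= p(x_i) <= 1
assert abs(sum(pmf.values()) - 1.0) < 1e-9     # sum over x_i of p(x_i) = 1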
Random vector
• A random vector is a finite-dimensional vector of random variables: X = [X1, …, Xn].
• P(x) = P(x1, x2, …, xn) = P(X1=x1, …, Xn=xn)
• Ex: P(w1, …, wn, t1, …, tn)
Three types of probability
• Joint prob: P(x,y)= prob of x and y
happening together
• Conditional prob: P(x|y) = prob of x given a
specific value of y
• Marginal prob: P(x) = prob of x, summed over all possible values of y
Common equations
P(A) = \sum_B P(A, B)
P(A, B) = P(A) P(B|A) = P(B) P(A|B)
P(B|A) = P(A, B) / P(A)
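These identities are easy to verify on a small joint distribution; a minimal Python sketch with made-up numbers:

# Hypothetical joint distribution P(A, B) over binary A and B.
P = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}

def P_A(a):
    # Marginal: P(A) = sum over B of P(A, B)
    return sum(p for (x, _), p in P.items() if x == a)

def P_B_given_A(b, a):
    # Conditional: P(B|A) = P(A, B) / P(A)
    return P[(a, b)] / P_A(a)

# Product rule: P(A, B) = P(A) * P(B|A)
assert abs(P[(1, 1)] - P_A(1) * P_B_given_A(1, 1)) < 1e-9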
More general cases
P(A_1) = \sum_{A_2, …, A_n} P(A_1, …, A_n)
P(A_1, …, A_n) = \prod_{i=1}^{n} P(A_i | A_1, …, A_{i-1})
Information Theory
Information theory
• It is the use of probability theory to
quantify and measure “information”.
• Basic concepts:
– Entropy
– Joint entropy and conditional entropy
– Cross entropy and relative entropy
– Mutual information and perplexity
Entropy
• Entropy is a measure of the uncertainty associated
with a distribution.
H(X) = -\sum_x p(x) \log p(x)
• The lower bound on the number of bits it takes to
transmit messages.
• An example:
– Display the results of horse races.
– Goal: minimize the number of bits to encode the results.
An example
• Uniform distribution: p_i = 1/8
H(X) = 8 * (-(1/8) \log_2(1/8)) = 3 bits
• Non-uniform distribution: (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64)
H(X) = -((1/2) \log(1/2) + (1/4) \log(1/4) + (1/8) \log(1/8) + (1/16) \log(1/16) + 4 * (1/64) \log(1/64)) = 2 bits
– A corresponding code: (0, 10, 110, 1110, 111100, 111101, 111110, 111111)
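The arithmetic above, and the fact that the expected length of the listed code matches the entropy, can be checked in a few lines of Python (a sketch, with the distributions copied from this example):

import math

def entropy(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

uniform = [1/8] * 8
skewed  = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
code    = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

print(entropy(uniform))                               # 3.0 bits
print(entropy(skewed))                                # 2.0 bits
print(sum(p * len(c) for p, c in zip(skewed, code)))  # expected code length: 2.0 bits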
Entropy of a language
• The entropy of a language L:
H(L) = -\lim_{n \to \infty} (1/n) \sum_{x_{1n}} p(x_{1n}) \log p(x_{1n})
• If we make certain assumptions that the language is “nice”, then the entropy can be calculated as:
H(L) = -\lim_{n \to \infty} (1/n) \log p(x_{1n})
Joint and conditional entropy
• Joint entropy:
H(X, Y) = -\sum_x \sum_y p(x, y) \log p(x, y)
• Conditional entropy:
H(Y|X) = -\sum_x \sum_y p(x, y) \log p(y|x)
       = H(X, Y) - H(X)
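A minimal Python sketch (with a hypothetical joint distribution) that checks H(Y|X) = H(X,Y) - H(X):

import math

# Hypothetical joint distribution p(x, y).
P = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}

def marginal_x(P):
    px = {}
    for (x, _), p in P.items():
        px[x] = px.get(x, 0.0) + p
    return px

def H_joint(P):
    return -sum(p * math.log2(p) for p in P.values() if p > 0)

def H_X(P):
    return -sum(p * math.log2(p) for p in marginal_x(P).values() if p > 0)

def H_Y_given_X(P):
    px = marginal_x(P)
    return -sum(p * math.log2(p / px[x]) for (x, _), p in P.items() if p > 0)

assert abs(H_Y_given_X(P) - (H_joint(P) - H_X(P))) < 1e-9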
Cross Entropy
• Entropy:
H(X) = -\sum_x p(x) \log p(x)
• Cross Entropy:
H_c(X) = -\sum_x p(x) \log q(x)
• Cross entropy is a distance measure between
p(x) and q(x): p(x) is the true probability; q(x) is
our estimate of p(x).
H_c(X) \ge H(X)
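A minimal Python sketch (with hypothetical p and q) illustrating that the cross entropy never falls below the entropy:

import math

p = [0.5, 0.25, 0.25]   # hypothetical true distribution
q = [0.4, 0.4, 0.2]     # hypothetical estimate of p

H  = -sum(pi * math.log2(pi) for pi in p)               # H(X)
Hc = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))   # H_c(X)
print(H, Hc)
assert Hc >= H - 1e-9   # equality only when q = p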
Cross entropy of a language
• The cross entropy of a language L:
H(L, q) = -\lim_{n \to \infty} (1/n) \sum_{x_{1n}} p(x_{1n}) \log q(x_{1n})
• If we make certain assumptions that the language is “nice”, then the cross entropy can be calculated as:
H(L, q) = -\lim_{n \to \infty} (1/n) \log q(x_{1n})
Relative Entropy
• Also called Kullback-Leibler distance:
KL(p || q) = \sum_x p(x) \log_2 (p(x) / q(x)) = H_c(X) - H(X)
• Another distance measure between prob
functions p and q.
• KL distance is asymmetric (not a true
distance):
KL(p || q) \ne KL(q || p)
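A minimal Python sketch (reusing the hypothetical p and q from the cross-entropy sketch above) showing the asymmetry:

import math

def KL(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]
print(KL(p, q), KL(q, p))   # two different non-negative values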
Relative entropy is non-negative
\forall z > 0: \log z \le z - 1   (natural log; changing the base only rescales by a positive constant)

-KL(p || q) = -\sum_x p(x) \log (p(x) / q(x))
            = \sum_x p(x) \log (q(x) / p(x))
            \le \sum_x p(x) ((q(x) / p(x)) - 1)
            = \sum_x (q(x) - p(x))
            = \sum_x q(x) - \sum_x p(x) = 1 - 1 = 0

Therefore KL(p || q) \ge 0.
Mutual information
• It measures how much information X and Y have in common:
I(X; Y) = \sum_x \sum_y p(x, y) \log [p(x, y) / (p(x) p(y))]
        = H(X) + H(Y) - H(X, Y)
        = I(Y; X)
• I(X;Y)=KL(p(x,y)||p(x)p(y))
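A minimal Python sketch (hypothetical joint distribution) computing I(X;Y) both from the definition and from the entropies:

import math

# Hypothetical joint distribution p(x, y).
P = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}

px, py = {}, {}
for (x, y), p in P.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

# From the definition: sum over x, y of p(x,y) log [p(x,y) / (p(x) p(y))]
I_def = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in P.items() if p > 0)

# From entropies: H(X) + H(Y) - H(X,Y)
def H(d):
    return -sum(p * math.log2(p) for p in d.values() if p > 0)
I_ent = H(px) + H(py) - H(P)

print(I_def, I_ent)   # the two agree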
Perplexity
• Perplexity is 2^H.
• Perplexity is the weighted average number
of choices a random variable has to make.
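Continuing the horse-race example: with H = 2 bits, the perplexity is 2^2 = 4, i.e., the outcome is on average as uncertain as a uniform choice among four alternatives. A minimal Python sketch:

import math

def entropy(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

skewed = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
print(2 ** entropy(skewed))   # perplexity = 2^H = 4.0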
Summary
• Course overview
• Problems and methods
• Mathematical foundation
– Probability theory
– Information theory
– M&S Ch 2
Next time
• FSA
• HMM: M&S Ch 9.1 and 9.2