Transcript learning

Machine Learning
Sudeshna Sarkar
IIT Kharagpur
Oct 17, 2006
Learning methodologies
 Learning from labelled data (supervised learning), e.g. classification, regression, prediction, function approximation
 Learning from unlabelled data (unsupervised learning), e.g. clustering, visualization, dimensionality reduction
 Learning from sequential data, e.g. speech recognition, DNA data analysis
 Associations
 Reinforcement learning
Unsupervised Learning
 Clustering: grouping similar instances
 Example applications
 Clustering items based on similarity
 Clustering users based on interests
 Clustering words based on similarity of usage
Inductive Learning Methods
 Find Similar
 Decision Trees
 Naïve Bayes
 Bayes Nets
 Support Vector Machines (SVMs)
 All support:
 “Probabilities” - graded membership; comparability across categories
 Adaptive - over time; across individuals
Find Similar
 Aka, relevance feedback
 Rocchio: classifier parameters are a weighted combination of the feature weights in the positive and negative examples (a "centroid"):

$$w_j = \frac{1}{n}\sum_{i \in rel} x_{i,j} \;-\; \frac{1}{N-n}\sum_{i \in non\text{-}rel} x_{i,j}$$

 New items are classified using: $\sum_j w_j x_j > 0$
 Use all features, idf weights
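As a concrete illustration, here is a minimal sketch of the Find Similar / Rocchio rule in Python (NumPy is assumed; the tf-idf-style feature vectors are made up for the example):

```python
import numpy as np

def rocchio_train(X, y):
    """Rocchio / Find Similar: the weight vector is the difference between
    the positive-class centroid and the negative-class centroid.
    X: (N, d) array of feature weights (e.g. tf-idf), y: binary labels."""
    pos = X[y == 1]
    neg = X[y == 0]
    return pos.mean(axis=0) - neg.mean(axis=0)   # centroid difference

def rocchio_classify(w, x):
    """Classify a new item as relevant if sum_j w_j * x_j > 0."""
    return int(np.dot(w, x) > 0)

# toy usage with made-up tf-idf vectors
X = np.array([[0.9, 0.1, 0.0],
              [0.8, 0.2, 0.1],
              [0.1, 0.9, 0.7],
              [0.0, 0.8, 0.9]])
y = np.array([1, 1, 0, 0])
w = rocchio_train(X, y)
print(rocchio_classify(w, np.array([0.7, 0.2, 0.1])))   # -> 1
```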
Decision Trees
 Learn a sequence of tests on features, typically
using top-down, greedy search
 Binary (yes/no) or continuous decisions
[Figure: an example tree that tests f1 at the root and f7 below it; the leaves give class probabilities P(class) = 0.9, 0.6 and 0.2.]
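A minimal sketch of top-down greedy tree induction, here via scikit-learn's DecisionTreeClassifier (an assumed library choice; the binary features f1…f7 and the synthetic labels are invented for illustration):

```python
# Greedy top-down induction of a small tree over binary features f1..f7.
from sklearn.tree import DecisionTreeClassifier, export_text
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 7))      # 7 binary features f1..f7
y = (X[:, 0] & X[:, 6]).astype(int)        # class depends on f1 and f7

tree = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(X, y)
print(export_text(tree, feature_names=[f"f{i}" for i in range(1, 8)]))
```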
Naïve Bayes
 Aka, binary independence model
 Maximize: Pr(Class | Features)

$$P(class \mid \vec{x}) = \frac{P(\vec{x} \mid class)\; P(class)}{P(\vec{x})}$$

 Assume features are conditionally independent - math easy; surprisingly effective

[Figure: naïve Bayes model with class node C and feature nodes x1, x2, x3, …, xn.]
Bayes Nets
 Maximize: Pr (Class | Features)
 Does not assume independence of features; models dependencies among them

[Figure: Bayes net over class node C and feature nodes x1, x2, x3, …, xn.]
Support Vector Machines
 Vapnik (1979)
 Binary classifiers that maximize margin
 Find hyperplane separating positive and negative examples
 Optimization for maximum margin: $\min \|\vec{w}\|$ subject to $\vec{w}\cdot\vec{x} + b \ge +1$ for positive examples and $\vec{w}\cdot\vec{x} + b \le -1$ for negative examples; the resulting margin is $2/\|\vec{w}\|$, and the training points on the margin are the support vectors
 Classify new items using $\mathrm{sign}(\vec{w}\cdot\vec{x} + b)$
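A minimal sketch of the maximum-margin formulation using scikit-learn's linear SVC (an assumed library choice; the six training points are made up). A large C approximates the hard-margin problem stated above:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],         # positive examples
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.0]])  # negative examples
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]

print("margin width:", 2.0 / np.linalg.norm(w))   # 2 / ||w||
print("support vectors:", clf.support_vectors_)
print("prediction:", clf.predict([[0.5, 0.5]]))   # sign(w.x + b)
```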
Support Vector Machines
 Extendable to:
 Non-separable problems (Cortes & Vapnik, 1995)
 Non-linear classifiers (Boser et al., 1992)
 Good generalization performance
 OCR (Boser et al.)
 Vision (Poggio et al.)
 Text classification (Joachims)
Cross-Validation
 Estimate the accuracy of a hypothesis induced
by a supervised learning algorithm
 Predict the accuracy of a hypothesis over
future unseen instances
 Select the optimal hypothesis from a given set
of alternative hypotheses
 Pruning decision trees
 Model selection
 Feature selection
 Combining multiple classifiers (boosting)
Holdout Method
 Partition the data set D = {(v1,y1),…,(vn,yn)} into a training set Dt and a validation (holdout) set Dh = D\Dt

[Figure: the data split into Training Dt and Validation D\Dt.]

$$acc_h = \frac{1}{h} \sum_{(v_i, y_i) \in D_h} \delta\big(I(D_t, v_i),\, y_i\big)$$

where I(Dt, vi) is the output of the hypothesis induced by learner I trained on data Dt for instance vi, h = |Dh|, and δ(i, j) = 1 if i = j and 0 otherwise.

Problems:
• makes insufficient use of the data
• training and validation set are correlated
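A minimal sketch of the holdout estimate, assuming scikit-learn's fit/predict API for the learner I and synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def holdout_accuracy(X, y, learner, holdout_fraction=0.3, seed=0):
    Xt, Xh, yt, yh = train_test_split(X, y, test_size=holdout_fraction,
                                      random_state=seed)
    h = learner.fit(Xt, yt)                 # hypothesis induced on D_t
    return np.mean(h.predict(Xh) == yh)     # 1/h * sum of delta(I(Dt,vi), yi)

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(holdout_accuracy(X, y, DecisionTreeClassifier()))
```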
Cross-Validation
 k-fold cross-validation splits the data set D into k mutually
exclusive subsets D1,D2,…,Dk
 Train and test the learning algorithm k times; each time it is trained on D\Di and tested on Di

[Figure: the data set split into folds D1, D2, D3, D4; each of the k runs holds out a different fold Di for testing and trains on the rest.]

$$acc_{cv} = \frac{1}{n} \sum_{(v_i, y_i) \in D} \delta\big(I(D \setminus D_i, v_i),\, y_i\big)$$
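A minimal sketch of k-fold cross-validation along the lines of the formula above (scikit-learn's KFold is an assumed convenience; any learner with fit/predict would do):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def cv_accuracy(X, y, learner, k=4):
    correct = 0
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True,
                                     random_state=0).split(X):
        h = learner.fit(X[train_idx], y[train_idx])   # train on D \ Di
        correct += np.sum(h.predict(X[test_idx]) == y[test_idx])  # test on Di
    return correct / len(y)                           # acc_cv = 1/n * sum(...)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)
print(cv_accuracy(X, y, DecisionTreeClassifier()))
```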
Cross-Validation
 Uses all the data for training and testing
 Complete k-fold cross-validation splits the dataset of size m in all $\binom{m}{m/k}$ possible ways (choosing m/k instances out of m)
 Leave n-out cross-validation sets n instances aside for
testing and uses the remaining ones for training (leave
one-out is equivalent to n-fold cross-validation)
 Leave one out is widely used
 In stratified cross-validation, the folds are stratified so
that they contain approximately the same proportion of
labels as the original data set
Bootstrap
 Samples n instances uniformly from the data set
with replacement
 Probability that any given instance is not chosen after n samples is $(1 - 1/n)^n \approx e^{-1} \approx 0.368$
 The bootstrap sample is used for training; the remaining (never-chosen) instances are used for testing
 $acc_{boot} = \frac{1}{b} \sum_{i=1}^{b} \left(0.632\,\epsilon_{0i} + 0.368\,acc_s\right)$, where $\epsilon_{0i}$ is the accuracy on the test data of the i-th bootstrap sample, $acc_s$ is the accuracy estimate on the training set, and b is the number of bootstrap samples
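A minimal sketch of the .632 bootstrap estimator described above (scikit-learn and the synthetic data are assumptions of the example):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bootstrap_632(X, y, learner, b=20, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    acc = 0.0
    for _ in range(b):
        idx = rng.integers(0, n, size=n)              # sample n with replacement
        oob = np.setdiff1d(np.arange(n), idx)         # instances never chosen
        h = learner.fit(X[idx], y[idx])
        eps0 = np.mean(h.predict(X[oob]) == y[oob])   # accuracy on held-out data
        acc_s = np.mean(h.predict(X[idx]) == y[idx])  # accuracy on training set
        acc += 0.632 * eps0 + 0.368 * acc_s
    return acc / b

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = (X[:, 0] - X[:, 2] > 0).astype(int)
print(bootstrap_632(X, y, DecisionTreeClassifier()))
```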
Wrapper Model
[Figure: the wrapper model - the input features feed a feature subset search, which repeatedly calls the induction algorithm and a feature subset evaluation step to score candidate subsets.]
Wrapper Model
 Evaluate the accuracy of the inducer for a given subset of
features by means of n-fold cross-validation
 The training data is split into n folds, and the induction algorithm
is run n times. The accuracy results are averaged to produce the
estimated accuracy.
 Forward selection:
Starts with the empty set of features and greedily adds the feature that most improves the estimated accuracy
 Backward elimination:
Starts with the set of all features and greedily removes the worst feature
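A minimal sketch of wrapper-based forward selection: each candidate subset is scored by the cross-validated accuracy of the inducer, and the best-scoring feature is added greedily (scikit-learn and the simple stopping rule are assumptions of the sketch):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def forward_selection(X, y, learner, n_folds=5):
    remaining, selected = set(range(X.shape[1])), []
    best_acc = 0.0
    while remaining:
        # score every candidate subset "selected + {f}" by n-fold CV accuracy
        scores = {f: cross_val_score(learner, X[:, selected + [f]], y,
                                     cv=n_folds).mean()
                  for f in remaining}
        f, acc = max(scores.items(), key=lambda kv: kv[1])
        if acc <= best_acc:          # stop when no feature improves accuracy
            break
        selected.append(f)
        remaining.remove(f)
        best_acc = acc
    return selected, best_acc

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 8))
y = (X[:, 1] + X[:, 4] > 0).astype(int)
print(forward_selection(X, y, DecisionTreeClassifier(max_depth=3)))
```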
Bagging
 For each trial t = 1, 2, …, T create a bootstrap sample of size N
 Generate a classifier Ct from the bootstrap sample
 The final classifier C* outputs the class that receives the majority of votes among the Ct

[Figure: T bootstrap training sets, each used to train a classifier C1, C2, …, CT; a new instance is classified by taking the majority vote of the Ct in C*.]
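A minimal sketch of bagging with decision trees as the base classifier; the helper names and synthetic data are invented for illustration (scikit-learn assumed):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_train(X, y, T=25, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    classifiers = []
    for _ in range(T):
        idx = rng.integers(0, N, size=N)          # bootstrap sample of size N
        classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return classifiers

def bagging_predict(classifiers, x):
    votes = [c.predict(x.reshape(1, -1))[0] for c in classifiers]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]              # majority vote -> C*(x)

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(int)
ensemble = bagging_train(X, y)
print(bagging_predict(ensemble, np.array([1.0, 2.0, 0.0, 0.0, 0.0])))
```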
Bagging
 Bagging requires "unstable" classifiers such as decision trees or neural networks:
”The vital element is the instability of the
prediction method. If perturbing the learning
set can cause significant changes in the
predictor constructed, then bagging can
improve accuracy.” (Breiman 1996)
Naïve Bayes Learner
Assume target function f: X → V, where each instance x is described by attributes ⟨a1, a2, …, an⟩. The most probable value of f(x) is:

$$v = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n) = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)} = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)$$

Naïve Bayes assumption (attributes are conditionally independent given the class):

$$P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$$
Bayesian classification
 The classification problem may be formalized using
a-posteriori probabilities:
 P(C|X) = prob. that the sample tuple
X=<x1,…,xk> is of class C.
 E.g. P(class=N | outlook=sunny,windy=true,…)
 Idea: assign to sample X the class label C such
that P(C|X) is maximal
Estimating a-posteriori probabilities
 Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
 P(X) is constant for all classes
 P(C) = relative freq of class C samples
 C such that P(C|X) is maximum =
C such that P(X|C)·P(C) is maximum
 Problem: computing P(X|C) is infeasible!
Naïve Bayesian Classification
 Naïve assumption: attribute independence
P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
 If i-th attribute is categorical:
P(xi|C) is estimated as the relative freq of
samples having value xi as i-th attribute in class
C
 If i-th attribute is continuous:
P(xi|C) is estimated using a Gaussian density
function
 Computationally easy in both cases
NB Classifier Example
EnjoySport example: estimating P(xi|C)
Outlook    Temperature  Humidity  Windy  Class
sunny      hot          high      false  N
sunny      hot          high      true   N
overcast   hot          high      false  P
rain       mild         high      false  P
rain       cool         normal    false  P
rain       cool         normal    true   N
overcast   cool         normal    true   P
sunny      mild         high      false  N
sunny      cool         normal    false  P
rain       mild         normal    false  P
sunny      mild         normal    true   P
overcast   mild         high      true   P
overcast   hot          normal    false  P
rain       mild         high      true   N

Class priors: P(P) = 9/14, P(N) = 5/14

outlook:     P(sunny|P) = 2/9     P(sunny|N) = 3/5
             P(overcast|P) = 4/9  P(overcast|N) = 0
             P(rain|P) = 3/9      P(rain|N) = 2/5
temperature: P(hot|P) = 2/9       P(hot|N) = 2/5
             P(mild|P) = 4/9      P(mild|N) = 2/5
             P(cool|P) = 3/9      P(cool|N) = 1/5
humidity:    P(high|P) = 3/9      P(high|N) = 4/5
             P(normal|P) = 6/9    P(normal|N) = 1/5
windy:       P(true|P) = 3/9      P(true|N) = 3/5
             P(false|P) = 6/9     P(false|N) = 2/5
NB Classifier Example (cont’d)
 Given a training set, we can compute the
probabilities
Outlook      P    N
sunny        2/9  3/5
overcast     4/9  0
rain         3/9  2/5

Temperature  P    N
hot          2/9  2/5
mild         4/9  2/5
cool         3/9  1/5

Humidity     P    N
high         3/9  4/5
normal       6/9  1/5

Windy        P    N
true         3/9  3/5
false        6/9  2/5
NB Classifier Example (cont’d)
Predict whether sport is enjoyed on a day with conditions <sunny, cool, high, strong>, i.e. compare P(v | o=sunny, t=cool, h=high, w=strong) for v ∈ {P, N} using the training data (strong wind corresponds to windy = true in the table). Each conditional is estimated from counts, e.g.

P(strong | P) = (# days of enjoying sport with strong wind) / (# days of enjoying sport) = 3/9

$$P(P)\, P(sunny \mid P)\, P(cool \mid P)\, P(high \mid P)\, P(strong \mid P) \approx 0.005$$
$$P(N)\, P(sunny \mid N)\, P(cool \mid N)\, P(high \mid N)\, P(strong \mid N) \approx 0.021$$

Since 0.021 > 0.005, naïve Bayes predicts N: sport is not enjoyed on this day.
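The arithmetic above can be checked directly; a short script reproducing the two products (values as in the slide):

```python
# Worked check of the slide's numbers, with the fractions written out.
p_yes = (9/14) * (2/9) * (3/9) * (3/9) * (3/9)   # P(P)P(sunny|P)P(cool|P)P(high|P)P(strong|P)
p_no  = (5/14) * (3/5) * (1/5) * (4/5) * (3/5)   # P(N)P(sunny|N)P(cool|N)P(high|N)P(strong|N)
print(round(p_yes, 4), round(p_no, 4))           # 0.0053 0.0206
print("prediction:", "P" if p_yes > p_no else "N")
# Normalising: P(N | sunny, cool, high, strong) ≈ 0.021 / (0.021 + 0.005) ≈ 0.80
```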
The independence hypothesis…
 … makes computation possible
 … yields optimal classifiers when satisfied
 … but is seldom satisfied in practice, as attributes
(variables) are often correlated.
 Attempts to overcome this limitation:
 Bayesian networks, that combine Bayesian reasoning
with causal relationships between attributes
 Decision trees, that reason on one attribute at a time,
considering the most important attributes first
The Naïve Bayes Algorithm
Naïve_Bayes_Learn (examples)
for each target value vj
estimate P(vj)
for each attribute value ai of each attribute a
estimate P(ai | vj )
Classify_New_Instance (x)
$$v = \arg\max_{v_j \in V} P(v_j) \prod_{a_i \in x} P(a_i \mid v_j)$$

Typical estimation of P(ai | vj) (the m-estimate):

$$P(a_i \mid v_j) = \frac{n_c + m\,p}{n + m}$$

where n is the number of examples with v = vj, nc the number of examples with v = vj and a = ai, p a prior estimate for P(ai | vj), and m the weight given to the prior.
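A minimal sketch of Naïve_Bayes_Learn / Classify_New_Instance with the m-estimate, reusing the weather data from the earlier example (the choices m = 1 and p = 0.5 are assumptions of the sketch):

```python
from collections import Counter, defaultdict

data = [  # (outlook, temperature, humidity, windy) -> class
    (("sunny", "hot", "high", "false"), "N"), (("sunny", "hot", "high", "true"), "N"),
    (("overcast", "hot", "high", "false"), "P"), (("rain", "mild", "high", "false"), "P"),
    (("rain", "cool", "normal", "false"), "P"), (("rain", "cool", "normal", "true"), "N"),
    (("overcast", "cool", "normal", "true"), "P"), (("sunny", "mild", "high", "false"), "N"),
    (("sunny", "cool", "normal", "false"), "P"), (("rain", "mild", "normal", "false"), "P"),
    (("sunny", "mild", "normal", "true"), "P"), (("overcast", "mild", "high", "true"), "P"),
    (("overcast", "hot", "normal", "false"), "P"), (("rain", "mild", "high", "true"), "N"),
]

def naive_bayes_learn(examples):
    class_counts = Counter(v for _, v in examples)
    attr_counts = defaultdict(Counter)   # (attr_index, class) -> counts of values
    for x, v in examples:
        for i, a in enumerate(x):
            attr_counts[(i, v)][a] += 1
    return class_counts, attr_counts

def classify(x, class_counts, attr_counts, m=1.0, p=0.5):
    total = sum(class_counts.values())
    best, best_score = None, -1.0
    for v, n in class_counts.items():
        score = n / total                       # estimate of P(vj)
        for i, a in enumerate(x):
            nc = attr_counts[(i, v)][a]
            score *= (nc + m * p) / (n + m)     # m-estimate of P(ai|vj)
        if score > best_score:
            best, best_score = v, score
    return best

counts = naive_bayes_learn(data)
print(classify(("sunny", "cool", "high", "true"), *counts))   # -> "N"
```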
Bayesian Belief Networks
 Naïve Bayes assumption of conditional independence too
restrictive
 But it is intractable without some such assumptions
 A Bayesian belief network (Bayesian net) describes conditional
independence among subsets of variables (attributes),
combining prior knowledge about dependencies among variables
with observed training data.
 Bayesian Net
 Node = variables
 Arc = dependency
 DAG, with direction on arc representing causality
Bayesian Networks:
Multi-variables with Dependency
 Bayesian Belief network (Bayesian net) describes conditional
independence among subsets of variables (attributes):
combining prior knowledge about dependencies among
variables with observed training data.
 Bayesian Net
 Node = variables and each variable has a finite set of mutually
exclusive states
 Arc = dependency
 DAG, with direction on arc representing causality
 Variable A with parents B1, …., Bn has a conditional probability
table P (A | B1, …., Bn)
Bayesian Belief Networks
[Figure: Bayes net in which Age, Occ and Income are parents of "Buy X", which in turn is the parent of "Interested in Insurance".]

•Age, Occupation and Income determine if the customer will buy this product.
•Given that the customer buys the product, whether there is interest in insurance is now independent of Age, Occupation, Income.
•P(Age, Occ, Inc, Buy, Ins) = P(Age) P(Occ) P(Inc) P(Buy | Age, Occ, Inc) P(Ins | Buy)

Current state of the art: given the structure and probabilities, existing algorithms can handle inference with categorical values and a limited representation of numerical values.
General Product Rule
$$P(x_1, \ldots, x_n \mid M) = \prod_{i=1}^{n} P(x_i \mid Pa_i, M), \qquad Pa_i = \mathrm{parents}(x_i)$$
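A minimal sketch of the product rule applied to the Buy/Insurance network from the previous slide; all CPT numbers here are made up for illustration:

```python
# Joint = product of each node's conditional given its parents (product rule).
def joint(age, occ, inc, buy, ins, cpt):
    """P(Age,Occ,Inc,Buy,Ins) = P(Age)P(Occ)P(Inc)P(Buy|Age,Occ,Inc)P(Ins|Buy)."""
    return (cpt["age"][age] * cpt["occ"][occ] * cpt["inc"][inc]
            * cpt["buy"][(age, occ, inc)][buy] * cpt["ins"][buy][ins])

cpt = {  # made-up conditional probability tables
    "age": {"young": 0.6, "old": 0.4},
    "occ": {"tech": 0.5, "other": 0.5},
    "inc": {"high": 0.3, "low": 0.7},
    "buy": {(a, o, i): {"yes": 0.8 if i == "high" else 0.2,
                        "no":  0.2 if i == "high" else 0.8}
            for a in ("young", "old") for o in ("tech", "other")
            for i in ("high", "low")},
    "ins": {"yes": {"yes": 0.7, "no": 0.3}, "no": {"yes": 0.1, "no": 0.9}},
}
print(joint("old", "tech", "high", "yes", "yes", cpt))   # 0.4*0.5*0.3*0.8*0.7
```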
Nodes as Functions
A node in BN is a conditional distribution function
[Figure: nodes A and B are the parents of node X; X takes the values l, m, h.]

Conditional probability table P(X | A, B):

      a,b   ~a,b   a,~b   ~a,~b
l     0.1   0.7    0.4    0.2
m     0.3   0.2    0.4    0.5
h     0.6   0.1    0.2    0.3

For example, P(X | A=a, B=b) = (0.1, 0.3, 0.6) over (l, m, h).

•input: the parents' state values
•output: a distribution over its own values
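A minimal sketch of a node as a conditional distribution function, using the CPT reconstructed above: given the parents' states it returns a distribution over the node's own values:

```python
cpt_X = {  # (A, B) -> distribution over X in {l, m, h}
    ("a", "b"):   {"l": 0.1, "m": 0.3, "h": 0.6},
    ("~a", "b"):  {"l": 0.7, "m": 0.2, "h": 0.1},
    ("a", "~b"):  {"l": 0.4, "m": 0.4, "h": 0.2},
    ("~a", "~b"): {"l": 0.2, "m": 0.5, "h": 0.3},
}

def node_X(a_state, b_state):
    """Input: parents' state values. Output: P(X | A=a_state, B=b_state)."""
    return cpt_X[(a_state, b_state)]

print(node_X("a", "b"))   # {'l': 0.1, 'm': 0.3, 'h': 0.6}
```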
Special Case : Naïve Bayes
[Figure: naïve Bayes as a special case - hypothesis node h with children e1, e2, …, en.]

$$P(e_1, e_2, \ldots, e_n, h) = P(h)\, P(e_1 \mid h) \cdots P(e_n \mid h)$$
Inference in Bayesian Networks
[Figure: Bayes net over Age, Income, Living Location, House Owner, Newspaper Preference, Voting Pattern, ….]

Example query: how likely are elderly rich people to buy DallasNews?
P(paper = DallasNews | Age > 60, Income > 60k)
Bayesian Learning
[Figure: example network over Burglary, Earthquake, Alarm, Call and Newscast, together with a table of N data cases over the variables B, E, A, C, N (e.g. ⟨~b, e, a, c⟩, ⟨b, ~e, ~a, ~c⟩, …).]

Input: fully or partially observable data cases
Output: parameters AND also structure
Learning methods:
 EM (Expectation Maximisation):
using the current approximation of the parameters to estimate the filled-in data
using the filled-in data to update the parameters (ML)
 Gradient Ascent Training
 Gibbs Sampling (MCMC)
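As a small illustration of the ML parameter-update step for fully observed data cases, CPT entries can be estimated by relative frequencies; the Burglary/Earthquake/Alarm variable names follow the example above, while the data cases themselves are invented:

```python
from collections import Counter

cases = [  # (b, e, a) triples: burglary, earthquake, alarm (made-up data)
    (False, True, True), (True, False, False), (False, False, False),
    (True, True, True), (False, True, True), (False, False, False),
]

joint = Counter((b, e, a) for b, e, a in cases)    # counts of full configurations
parent = Counter((b, e) for b, e, _ in cases)      # counts of parent configurations

def p_alarm_given(b, e):
    """ML estimate of P(A=true | B=b, E=e) from the observed cases."""
    return joint[(b, e, True)] / parent[(b, e)] if parent[(b, e)] else None

print(p_alarm_given(False, True))   # 2/2 = 1.0 in this toy data
```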