Introduction to Machine Learning


Lecture Slides for
ETHEM ALPAYDIN
© The MIT Press, 2010
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e
Parametric Estimation
- Parametric (single global model)
  - Advantage: it reduces the problem of estimating a probability density function (pdf), discriminant, or regression function to estimating the values of a small number of parameters.
  - Disadvantage: this assumption does not always hold, and we may incur a large error if it does not.
- Semiparametric (small number of local models)
  - Mixture model (Chapter 7)
  - The density is written as a disjunction (OR) of a small number of parametric models.
Nonparametric Estimation
- Assumption:
  - Similar inputs have similar outputs
  - Functions (pdf, discriminant, regression) change smoothly
- Keep the training data; "let the data speak for itself"
- Given x, find a small number of closest training instances and interpolate from these
- Nonparametric methods are also called memory-based or instance-based learning algorithms.
Density Estimation
- Given the training set X = {x^t}_t drawn iid (independent and identically distributed) from p(x)
- Divide the data into bins of size h
- Histogram estimator (Figure 8.1):

  \hat{p}(x) = \frac{\#\{x^t \text{ in the same bin as } x\}}{N h}
PS. In probability theory and statistics, a sequence or other collection of random variables is
independent and identically distributed (i.i.d.) if each random variable has the same
probability distribution as the others and all are mutually independent.
(http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables)
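A minimal NumPy sketch of the histogram estimator above; the function name histogram_estimate, the bin origin x0, and the toy sample are illustrative assumptions, not part of the slides.

```python
import numpy as np

def histogram_estimate(x, data, h, x0=0.0):
    """Histogram density estimate: the number of training points that fall
    in the same bin as x, divided by N*h.  Bins are [x0 + m*h, x0 + (m+1)*h)."""
    data = np.asarray(data, dtype=float)
    same_bin = np.floor((data - x0) / h) == np.floor((x - x0) / h)
    return same_bin.sum() / (data.size * h)

# toy usage with a synthetic sample
rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=1.0, size=200)
print(histogram_estimate(2.0, sample, h=0.5))
```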
Histogram estimator

\hat{p}(x) = \frac{\#\{x^t \text{ in the same bin as } x\}}{N h}
Density Estimation
- Given the training set X = {x^t}_t drawn iid from p(x)
- x is always at the center of a bin of size h
- Naive estimator:

  \hat{p}(x) = \frac{\#\{x - h/2 < x^t \le x + h/2\}}{N h}

  or, equivalently,

  \hat{p}(x) = \frac{1}{N h} \sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right), \qquad
  w(u) = \begin{cases} 1 & \text{if } |u| < 1/2 \\ 0 & \text{otherwise} \end{cases}

  (Figure 8.2)
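The second form of the naive estimator translates directly into code. A short NumPy sketch follows; the name naive_estimate is an illustrative choice.

```python
import numpy as np

def naive_estimate(x, data, h):
    """Naive estimator: p_hat(x) = (1/(N*h)) * sum_t w((x - x^t)/h),
    with w(u) = 1 if |u| < 1/2 and 0 otherwise, i.e. x is the centre
    of its own bin of width h."""
    data = np.asarray(data, dtype=float)
    inside = np.abs((x - data) / h) < 0.5   # w((x - x^t)/h)
    return inside.sum() / (data.size * h)
```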
Naive estimator: h = 2, h = 1, h = 0.5

\hat{p}(x) = \frac{\#\{x - h/2 < x^t \le x + h/2\}}{N h}
Kernel Estimator
- Kernel function, e.g., the Gaussian kernel:

  K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^2}{2}\right)

- Kernel estimator (Parzen windows), Figure 8.3:

  \hat{p}(x) = \frac{1}{N h} \sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)

- If K is Gaussian, then \hat{p} is smooth and has all derivatives (being continuously differentiable, its derivative can be computed at every point).
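A minimal NumPy sketch of the Parzen-window estimate with the Gaussian kernel above; parzen_estimate is an illustrative name.

```python
import numpy as np

def parzen_estimate(x, data, h):
    """Kernel (Parzen window) density estimate with a Gaussian kernel:
    p_hat(x) = (1/(N*h)) * sum_t K((x - x^t)/h)."""
    data = np.asarray(data, dtype=float)
    u = (x - data) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return K.sum() / (data.size * h)
```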
Kernel estimator

\hat{p}(x) = \frac{1}{N h} \sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)
k-Nearest Neighbor Estimator
- Instead of fixing the bin width h and counting how many instances fall inside it, fix the number of nearest neighbors k and measure the bin width they require:

  \hat{p}(x) = \frac{k}{2 N d_k(x)}

  where d_k(x) is the distance from x to its kth closest training instance.
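A short NumPy sketch of the one-dimensional k-NN density estimate; knn_density_estimate is an illustrative name.

```python
import numpy as np

def knn_density_estimate(x, data, k):
    """k-NN density estimate: p_hat(x) = k / (2 * N * d_k(x)), where d_k(x)
    is the distance from x to its k-th closest training instance."""
    data = np.asarray(data, dtype=float)
    d_k = np.sort(np.abs(data - x))[k - 1]
    return k / (2.0 * data.size * d_k)
```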
k-Nearest Neighbor Estimator

\hat{p}(x) = \frac{k}{2 N d_k(x)}
Generalization to Multivariate Data
- Kernel density estimator:

  \hat{p}(x) = \frac{1}{N h^d} \sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)

  with the requirement that \int_{\mathbb{R}^d} K(x)\, dx = 1

- Multivariate Gaussian kernel:
  - Spheric:

    K(u) = \left(\frac{1}{\sqrt{2\pi}}\right)^{d} \exp\!\left(-\frac{\lVert u \rVert^{2}}{2}\right)

  - Ellipsoid:

    K(u) = \frac{1}{(2\pi)^{d/2}\, \lvert S \rvert^{1/2}} \exp\!\left(-\frac{1}{2}\, u^{T} S^{-1} u\right)
Nonparametric Classification
- Application: estimate p(x | C_i) and use Bayes' rule
- Kernel estimator:

  \hat{p}(x \mid C_i) = \frac{1}{N_i h^d} \sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right) r_i^t, \qquad
  \hat{P}(C_i) = \frac{N_i}{N}

  where r_i^t = 1 if x^t \in C_i and 0 otherwise.

- The discriminant function:

  g_i(x) = \hat{p}(x \mid C_i)\, \hat{P}(C_i)
         = \frac{1}{N h^d} \sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right) r_i^t
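A minimal NumPy sketch of the kernel-based discriminants g_i(x); the function kernel_discriminants and the integer-label encoding are illustrative assumptions.

```python
import numpy as np

def kernel_discriminants(x, data, labels, h, n_classes):
    """g_i(x) = (1/(N * h^d)) * sum_t K((x - x^t)/h) * r_i^t, where
    r_i^t = 1 if x^t belongs to class i (labels are integers 0..n_classes-1).
    Choosing argmax_i g_i(x) gives the kernel-based classifier."""
    data = np.asarray(data, dtype=float)        # shape (N, d)
    labels = np.asarray(labels)
    N, d = data.shape
    u = (np.asarray(x, dtype=float) - data) / h
    K = np.exp(-0.5 * np.sum(u**2, axis=1)) / (2.0 * np.pi) ** (d / 2)
    return np.array([K[labels == i].sum() for i in range(n_classes)]) / (N * h**d)
```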
Nonparametric Classification
k-NN estimator

- For the special case of the k-NN estimator:

  \hat{p}(x \mid C_i) = \frac{k_i}{N_i V^k(x)}

  where
  - k_i is the number of neighbors, out of the k nearest, that belong to C_i
  - V^k(x) is the volume of the d-dimensional hypersphere centered at x with radius r = \lVert x - x_{(k)} \rVert:

    V^k = r^d c_d, with c_d the volume of the unit sphere in d dimensions.

  For example:
  - d = 1: V^k = r\, c_1 = 2r
  - d = 2: V^k = r^2 c_2 = \pi r^2
  - d = 3: V^k = r^3 c_3 = \tfrac{4}{3} \pi r^3
Nonparametric Classification
k-NN estimator

- From

  \hat{p}(x) = \frac{k}{N V^k(x)}, \qquad
  \hat{P}(C_i) = \frac{N_i}{N}, \qquad
  \hat{p}(x \mid C_i) = \frac{k_i}{N_i V^k(x)}

- Then

  \hat{P}(C_i \mid x) = \frac{\hat{p}(x \mid C_i)\, \hat{P}(C_i)}{\hat{p}(x)} = \frac{k_i}{k}
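A short NumPy sketch of the resulting k-NN classifier, which returns the posteriors k_i / k; knn_posterior, Euclidean distance, and the integer-label encoding are illustrative assumptions.

```python
import numpy as np

def knn_posterior(x, data, labels, k, n_classes):
    """Posterior estimates P_hat(C_i | x) = k_i / k, where k_i is the number
    of the k nearest neighbours of x that belong to class i
    (labels are integers 0..n_classes-1)."""
    data = np.asarray(data, dtype=float)        # shape (N, d)
    labels = np.asarray(labels)
    dist = np.linalg.norm(data - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(dist)[:k]
    return np.bincount(labels[nearest], minlength=n_classes) / k
```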
Condensed Nearest Neighbor
- Time/space complexity of k-NN is O(N)
- Find a subset Z of X that is small and is accurate in classifying X (Hart, 1968):

  E'(Z \mid X) = E(X \mid Z) + \lambda \lvert Z \rvert

  The error function trades off E(X | Z), the error on X when only the stored instances in Z are used, against \lvert Z \rvert, the cardinality of Z (the cost of storing Z).
Condensed Nearest Neighbor
- Incremental algorithm: add an instance if needed (see the sketch below)
- A greedy method and a local search
- The results depend on the order of the training instances.
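A minimal sketch of the incremental idea, assuming 1-NN classification and Euclidean distance; condensed_nn is an illustrative name, and this is not the book's exact pseudocode.

```python
import numpy as np

def condensed_nn(data, labels):
    """Incremental condensed nearest neighbour in the spirit of Hart (1968),
    using 1-NN and Euclidean distance: add an instance to the subset Z only
    if the current Z misclassifies it, and pass over the data until no more
    additions are made.  The result depends on the visiting order."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    Z = [0]                                   # start from the first instance
    changed = True
    while changed:
        changed = False
        for t in range(len(data)):
            dist = np.linalg.norm(data[Z] - data[t], axis=1)
            if labels[Z][np.argmin(dist)] != labels[t]:
                Z.append(t)                   # Z misclassifies x^t: keep it
                changed = True
    return data[Z], labels[Z]
```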
Nonparametric Regression
- Smoothing models (a nonparametric regression estimator is also called a smoother):
  - Running mean smoother (regressogram vs. naive estimator)
  - Kernel smoother
  - Running line smoother
- Regressogram (Figure 8.7):

  \hat{g}(x) = \frac{\sum_{t=1}^{N} b(x, x^t)\, r^t}{\sum_{t=1}^{N} b(x, x^t)}

  where

  b(x, x^t) = \begin{cases} 1 & \text{if } x^t \text{ is in the same bin as } x \\ 0 & \text{otherwise} \end{cases}
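A minimal NumPy sketch of the regressogram; regressogram, the bin origin x0, and returning NaN for empty bins are illustrative choices.

```python
import numpy as np

def regressogram(x, train_x, train_r, h, x0=0.0):
    """Regressogram: average of the outputs r^t whose inputs x^t fall in the
    same bin as x.  Bins are [x0 + m*h, x0 + (m+1)*h); empty bins give NaN."""
    train_x = np.asarray(train_x, dtype=float)
    train_r = np.asarray(train_r, dtype=float)
    b = np.floor((train_x - x0) / h) == np.floor((x - x0) / h)
    return train_r[b].mean() if b.any() else float("nan")
```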
Regressogram

\hat{g}(x) = \frac{\sum_{t=1}^{N} b(x, x^t)\, r^t}{\sum_{t=1}^{N} b(x, x^t)}
Running Mean Smoother
- Running mean smoother (naive estimator): Figure 8.8

  \hat{g}(x) = \frac{\sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right) r^t}{\sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right)}

  where

  w(u) = \begin{cases} 1 & \text{if } |u| < 1 \\ 0 & \text{otherwise} \end{cases}
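A short NumPy sketch of the running mean smoother; running_mean_smoother and the NaN convention for empty neighborhoods are illustrative choices.

```python
import numpy as np

def running_mean_smoother(x, train_x, train_r, h):
    """Running mean: average of the r^t whose x^t satisfy |x - x^t|/h < 1;
    returns NaN if the neighbourhood is empty."""
    train_x = np.asarray(train_x, dtype=float)
    train_r = np.asarray(train_r, dtype=float)
    w = np.abs((x - train_x) / h) < 1.0
    return train_r[w].mean() if w.any() else float("nan")
```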
Running mean smoother

\hat{g}(x) = \frac{\sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right) r^t}{\sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right)}
Kernel Smoother
- Kernel smoother: Figure 8.9

  \hat{g}(x) = \frac{\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right) r^t}{\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)}

  where K(\cdot) is the Gaussian kernel

  K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^2}{2}\right)
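A short NumPy sketch of the kernel smoother with the Gaussian kernel; kernel_smoother is an illustrative name.

```python
import numpy as np

def kernel_smoother(x, train_x, train_r, h):
    """Kernel smoother: Gaussian-weighted average of the r^t, with weights
    K((x - x^t)/h)."""
    train_x = np.asarray(train_x, dtype=float)
    train_r = np.asarray(train_r, dtype=float)
    u = (x - train_x) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return (K * train_r).sum() / K.sum()
```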
Kernel smoother

\hat{g}(x) = \frac{\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right) r^t}{\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)}
Running Line Smoother
- Instead of taking an average and giving a constant fit at a point, we can take one more term of the Taylor expansion into account and calculate a linear fit:

  F(x) = F(a) + \frac{F'(a)}{1!}(x - a) + \frac{F''(a)}{2!}(x - a)^2 + \cdots

- Running line smoother (Figure 8.10):
  - Use the data points in the neighborhood, as defined by h or k
  - Fit a local regression line
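A minimal NumPy sketch of a running line smoother that fits a local least-squares line within a window of width h; running_line_smoother and the fallback for sparse neighborhoods are illustrative assumptions.

```python
import numpy as np

def running_line_smoother(x, train_x, train_r, h):
    """Fit a least-squares line to the training pairs whose inputs lie within
    h of x, then evaluate that local line at x.  Falls back to a local mean
    (or NaN) when there are too few points for a line."""
    train_x = np.asarray(train_x, dtype=float)
    train_r = np.asarray(train_r, dtype=float)
    near = np.abs(train_x - x) < h
    if near.sum() < 2:
        return train_r[near].mean() if near.any() else float("nan")
    slope, intercept = np.polyfit(train_x[near], train_r[near], deg=1)
    return slope * x + intercept
```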
Running line smoother
How to Choose k or h?
- When k or h is small:
  - Single instances matter
  - Bias is small, variance is large
  - Undersmoothing
  - High complexity
- As k or h increases:
  - We average over more instances
  - Variance decreases but bias increases
  - Oversmoothing
  - Low complexity
- Cross-validation is used to fine-tune k or h.
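A minimal sketch of choosing h by k-fold cross-validation, using a running mean smoother as the model being tuned; the names (choose_h_by_cv, n_folds) and the squared-error criterion are illustrative assumptions, and the same idea applies to tuning k.

```python
import numpy as np

def running_mean(x, tx, tr, h):
    """Running mean smoother used as the model being tuned."""
    w = np.abs((x - tx) / h) < 1.0
    return tr[w].mean() if w.any() else float("nan")

def choose_h_by_cv(train_x, train_r, candidate_hs, n_folds=5, seed=0):
    """Return the candidate h with the lowest mean squared error on held-out
    folds (simple k-fold cross-validation)."""
    train_x = np.asarray(train_x, dtype=float)
    train_r = np.asarray(train_r, dtype=float)
    idx = np.random.default_rng(seed).permutation(len(train_x))
    folds = np.array_split(idx, n_folds)
    cv_error = []
    for h in candidate_hs:
        fold_error = []
        for val in folds:
            fit = np.setdiff1d(idx, val)
            preds = np.array([running_mean(x, train_x[fit], train_r[fit], h)
                              for x in train_x[val]])
            ok = ~np.isnan(preds)             # ignore empty neighbourhoods
            fold_error.append(np.mean((preds[ok] - train_r[val][ok]) ** 2))
        cv_error.append(np.mean(fold_error))
    return candidate_hs[int(np.argmin(cv_error))]
```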
Kernel estimate for various bin lengths for a two-class problem. Plotted are the conditional densities,
p(x|Ci). It seems that the top one oversmooths and the bottom undersmooths, but whichever is the best
will depend on where the validation data points are.