Microarray Class Classification by Machine Learning Methods


CZ5225: Modeling and Simulation in Biology
Lecture 7: Microarray Class Classification by Machine Learning Methods
Prof. Chen Yu Zong
Tel: 6874-6877
Email: [email protected]
http://bidd.nus.edu.sg
Room 07-24, level 8, S16,
National University of Singapore
Machine Learning Method
Inductive learning: example-based learning.
[Diagram: positive examples and negative examples, each characterized by a set of descriptors.]
Machine Learning Method
Feature vectors: each example is described by a feature vector of descriptor values, e.g.
A=(1, 1, 1)
B=(0, 1, 1)
C=(1, 1, 1)
D=(0, 1, 1)
E=(0, 0, 0)
F=(1, 0, 1)
[Diagram: positive and negative examples represented by their feature vectors.]
Machine Learning Method
Feature vectors in input space: each feature vector (A-F above) corresponds to a point in the input space.
[Diagram: the vectors A-F plotted as points in a 3-dimensional input space with axes X, Y, Z.]
Machine Learning Method
Vector A = (a1, a2, a3, …, aN)
The task of machine learning is thus transformed into finding a border line that optimally separates the known positive and negative samples in a training set.
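As a minimal sketch of this idea in code: the slides do not state which of A-F are positive and which negative, so the labelling below is assumed purely for illustration, and a simple linear classifier (scikit-learn's Perceptron) is used to find a separating border.

```python
import numpy as np
from sklearn.linear_model import Perceptron

# Feature vectors A-F from the slide
X = np.array([
    [1, 1, 1],   # A
    [0, 1, 1],   # B
    [1, 1, 1],   # C
    [0, 1, 1],   # D
    [0, 0, 0],   # E
    [1, 0, 1],   # F
])
y = np.array([+1, +1, +1, +1, -1, -1])  # assumed labels: A-D positive, E-F negative

clf = Perceptron().fit(X, y)
print("weights:", clf.coef_, "bias:", clf.intercept_)
print("training predictions:", clf.predict(X))   # should reproduce y
```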
Classifying Cancer Patients vs. Healthy Patients from Microarray
Patient_X = (gene_1, gene_2, gene_3, …, gene_N)
Each patient is labelled either cancerous or healthy. N (the number of dimensions) is normally larger than 2, so we can't visualize the data.
Classifying Cancer Patients vs. Healthy Patients from Microarray
For simplicity, pretend that we are only looking at the expression levels of 2 genes.
[Scatter plot: Gene_1 expression level vs. Gene_2 expression level, each ranging from -5 (down-regulated) to 5 (up-regulated), with the cancerous and healthy samples marked.]
Classifying Cancer Patients vs. Healthy Patients from Microarray
Question: How can we build a classifier for this data?
[Same scatter plot of Gene_1 vs. Gene_2 expression levels for the cancerous and healthy samples.]
Classifying Cancer Patients vs. Healthy Patients from Microarray
Simple classification rule:
IF gene_1 < 0 AND gene_2 < 0 THEN person = healthy
IF gene_1 > 0 AND gene_2 > 0 THEN person = cancerous
[Scatter plot of Gene_1 vs. Gene_2 expression levels with the healthy and cancerous regions indicated.]
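A sketch of this two-gene threshold rule as code (the expression values passed in below are made up for illustration):

```python
def classify(gene_1, gene_2):
    """Simple two-gene threshold rule from the slide."""
    if gene_1 < 0 and gene_2 < 0:
        return "healthy"
    if gene_1 > 0 and gene_2 > 0:
        return "cancerous"
    return "undetermined"   # points the rule does not cover

print(classify(-2.3, -1.1))  # healthy
print(classify( 3.0,  4.2))  # cancerous
```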
Classifying Cancer Patients vs. Healthy Patients from Microarray
If we move away from our simple example with 2 genes to a realistic case with, say, 5000 genes, the rule becomes:
IF gene_1 < 0 AND gene_2 < 0 AND … AND gene_5000 < Y THEN person = healthy
IF gene_1 > 0 AND gene_2 > 0 AND … AND gene_5000 > W THEN person = cancerous
This gets a little complicated and unwieldy:
1. What will these rules look like?
2. How will we find them?
Classifying Cancer Patients vs. Healthy Patients from Microarray
Reformulate the previous rule as a separating line.
SIMPLE RULE:
• If a data point lies to the 'left' of the line, then 'healthy'.
• If a data point lies to the 'right' of the line, then 'cancerous'.
[Scatter plot of Gene_1 vs. Gene_2 expression levels with a straight line separating the healthy and cancerous samples.]
It is easier to generalize this line to 5000 genes than it is a list of rules. It is also easier to solve mathematically.
Extension to More Than 2 Genes (Dimensions)
• Line in 2D: x1C1 + x2C2 = T
• If we had 3 genes, and needed to build a 'line' in 3-dimensional space, then we would be seeking a plane.
  Plane in 3D: x1C1 + x2C2 + x3C3 = T
• If we were looking in more than 3 dimensions, the 'plane' is called a hyperplane. A hyperplane is simply a generalization of a plane to dimensions higher than 3.
  Hyperplane in N dimensions: x1C1 + x2C2 + x3C3 + … + xNCN = T
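As a sketch, the hyperplane test x1C1 + … + xNCN = T is just a dot product compared against a threshold, so it scales directly to thousands of genes. The coefficients, threshold, and expression values below are placeholders, and which side of the hyperplane corresponds to which class depends on the data.

```python
import numpy as np

N = 5000
rng = np.random.default_rng(0)
C = rng.normal(size=N)        # placeholder hyperplane coefficients C1..CN
T = 0.0                       # placeholder threshold
x = rng.normal(size=N)        # one patient's expression profile x1..xN

score = x @ C                 # x1*C1 + x2*C2 + ... + xN*CN
label = "cancerous" if score > T else "healthy"
print(score, label)
```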
Classification Methods (1)
Classification Methods (2)
Classification Methods (3)
K Nearest Neighbor Method
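The slides for this method are shown as figures; as a sketch of k-nearest-neighbour classification on toy two-gene expression data (the values are invented for illustration), using scikit-learn:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy two-gene expression profiles (invented values), 1 = cancerous, 0 = healthy
X = np.array([[ 2.0,  3.0], [ 3.5,  2.5], [ 2.5,  4.0],    # cancerous
              [-2.0, -3.0], [-3.0, -2.5], [-2.5, -1.5]])   # healthy
y = np.array([1, 1, 1, 0, 0, 0])

# Each new sample is assigned the majority class among its 3 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[1.0, 2.0], [-1.0, -2.0]]))  # -> [1 0]
```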
Classification Methods (4)
Classification Methods (5)
SVM
What is SVM?
• Support vector machines are a machine learning method (learning by examples, statistical learning) that classifies objects into one of two classes.
Advantages of SVM:
• Accommodates diversity of class members (no "racial discrimination").
• Low over-fitting risk.
• Easier to find "optimal" parameters for better class differentiation performance.
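As a sketch of basic SVM usage with scikit-learn's SVC on invented two-gene data (the values and labels are placeholders for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Invented two-gene training data: 1 = cancerous, 0 = healthy
X = np.array([[ 2.0,  3.0], [ 3.5,  2.5], [ 2.5,  4.0],
              [-2.0, -3.0], [-3.0, -2.5], [-2.5, -1.5]])
y = np.array([1, 1, 1, 0, 0, 0])

svm = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors:\n", svm.support_vectors_)
print("prediction for a new patient:", svm.predict([[1.0, 1.5]]))
```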
Classification Methods (5)
SVM Method
[Diagram: protein family members and nonmembers separated by a border; after projecting the data into a higher-dimensional space, a new border separates the two groups.]
Classification Methods (5)
SVM Method
[Diagram: the new border between protein family members and nonmembers is determined by the support vectors, the samples closest to the border.]
What is a Good Decision Boundary?
• Consider a two-class, linearly separable classification problem.
• Many decision boundaries exist!
– The Perceptron algorithm can be used to find such a boundary.
– Different algorithms have been proposed.
• Are all decision boundaries equally good?
[Diagram: Class 1 and Class 2 training points.]
Examples of Bad Decision Boundaries
[Two diagrams: Class 1 and Class 2 points separated by poorly placed decision boundaries.]
Large-margin Decision Boundary
• The decision boundary should be as far away from the data of both classes as possible.
– We should maximize the margin, m.
– The distance between the origin and the line w·x = k is k/||w||.
[Diagram: Class 1 and Class 2 separated by a decision boundary with margin m.]
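A quick numerical check of that distance formula (the normal vector w and the value k below are chosen arbitrarily): the point of the line w·x = k closest to the origin lies along w, and its distance is k/||w||.

```python
import numpy as np

w = np.array([3.0, 4.0])      # arbitrary normal vector, ||w|| = 5
k = 10.0                      # the line w.x = k

# Closest point of the line to the origin lies along the direction of w
x_closest = k * w / (w @ w)
print(np.linalg.norm(x_closest))   # 2.0
print(k / np.linalg.norm(w))       # 2.0, i.e. k / ||w||
```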
SVM Method
[Diagram: the new border between protein family members and nonmembers, with the support vectors highlighted.]
SVM Method
When the border line is nonlinear, a non-linear transformation (the use of a kernel function) maps the data into a space where a linear border can be used.
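A sketch of the kernel idea: for the degree-2 polynomial kernel K(x, y) = (x·y)², the kernel value equals a dot product taken after an explicit non-linear map φ, so a linear border in the mapped space corresponds to a non-linear border in the original space. The map below is one standard choice for 2-dimensional inputs, used here only for illustration.

```python
import numpy as np

def K(x, y):
    """Polynomial kernel of degree 2."""
    return (x @ y) ** 2

def phi(x):
    """Explicit feature map for 2-D inputs such that phi(x).phi(y) = K(x, y)."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(K(x, y))                # 1.0
print(phi(x) @ phi(y))        # 1.0 (same value, computed via the mapping)
```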
Mathematical Algorithm of SVM
A hyperplane is defined by its normal vector w.
The Euclidean distance between the parallel hyperplanes w·x = k and w·x = 0 is
||x* - x0*|| = k/||w||
where x* and x0* are the closest points on the two hyperplanes.
Maximize the distance between the hyperplane and the closest point, subject to the classes being separated:
max_w min_i y_i(w·x_i)/||w||   subject to   y_i(w·x_i) >= 1 for all i
The margin, or separation between the classes, is 2/||w||.
Mathematical Algorithm of SVM
An equivalent formulation is
min (1/2)||w||^2   subject to   y_i(w·x_i + b) >= 1 for all i
where b is an offset term for the hyperplane.
When the data are not separable:
min (1/2)||w||^2 + C Σ_i ξ_i^2   subject to   y_i(w·x_i + b) >= 1 - ξ_i and ξ_i >= 0 for all i.
The classification function is f(x) = w·x + b, and the predicted class is y = sign(f(x)).
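A sketch of how the slack variables behave for a given hyperplane (w and b below are chosen arbitrarily): ξ_i = max(0, 1 - y_i f(x_i)) is zero for points classified correctly with enough margin, and grows as points fall inside the margin or onto the wrong side.

```python
import numpy as np

w = np.array([1.0, 1.0])
b = -1.0
X = np.array([[ 2.0,  2.0],    # well inside the positive side
              [ 0.8,  0.5],    # inside the margin
              [-1.0, -1.0]])   # well inside the negative side
y = np.array([+1, +1, -1])

f = X @ w + b                       # f(x) = w.x + b
xi = np.maximum(0.0, 1.0 - y * f)   # slack for each training point
print(f)    # [ 3.   0.3 -3. ]
print(xi)   # [ 0.   0.7  0. ]
```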
Mathematical Algorithm of SVM
min (1/2)||w||^2 + C Σ_i ξ_i^2   subject to   y_i(w·x_i + b) >= 1 - ξ_i and ξ_i >= 0 for all i
with ξ_i = 1 - y_i f(x_i) (taken as 0 when negative).
Equivalently:
min_{f in F}  Σ_i (1 - y_i f(x_i))^2  +  (1/C) ||Pf||^2
where the first term is the empirical error, the second term measures the complexity of the classifier, and C controls the tradeoff between them.
Mathematical Algorithm of SVM
Nonlinear decision boundaries: map the data to a higher-dimensional feature space,
φ : R^n → R^m where m > n,   x → φ(x),
and construct a linear classifier in this space:
min (1/2)||w||^2 + C Σ_i ξ_i^2   subject to   y_i(w·φ(x_i) + b) >= 1 - ξ_i and ξ_i >= 0 for all i
with f(x) = w·φ(x) + b.
This can be written as
f(x) = Σ_i y_i α_i K(x_i, x) + b,   where   K(x, y) = φ(x)·φ(y).
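A sketch verifying the dual form f(x) = Σ_i y_i α_i K(x_i, x) + b with scikit-learn: in a fitted SVC, dual_coef_ holds the products y_i·α_i for the support vectors, so the decision function can be rebuilt by hand and compared. The data and labelling below are invented for illustration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)      # a non-linear (XOR-like) labelling

svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

x_new = np.array([0.5, 0.7])
# dual_coef_[0, j] = y_j * alpha_j for the j-th support vector
f_manual = sum(svm.dual_coef_[0, j] * rbf(sv, x_new)
               for j, sv in enumerate(svm.support_vectors_)) + svm.intercept_[0]
print(f_manual, svm.decision_function([x_new])[0])   # the two values agree
```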
SVM Performance Measure
• Sensitivity P+ = TP/(TP+FN): accuracy for positive samples.
• Specificity P- = TN/(TN+FP): accuracy for negative samples.
• Overall prediction accuracy Q = (TP + TN)/(TP + TN + FP + FN).
• Matthews correlation coefficient
  C = (TP*TN - FN*FP) / sqrt((TP+FN)(TP+FP)(TN+FN)(TN+FP))
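A sketch computing these measures from confusion-matrix counts (the helper name and the example counts are arbitrary):

```python
from math import sqrt

def performance(TP, TN, FP, FN):
    sensitivity = TP / (TP + FN)                      # P+
    specificity = TN / (TN + FP)                      # P-
    accuracy    = (TP + TN) / (TP + TN + FP + FN)     # Q
    mcc = (TP * TN - FN * FP) / sqrt(
        (TP + FN) * (TP + FP) * (TN + FN) * (TN + FP))
    return sensitivity, specificity, accuracy, mcc

print(performance(TP=40, TN=45, FP=5, FN=10))
```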
Why SVM Works?
• The feature space is often very high dimensional. Why don't we have the curse of dimensionality?
• A classifier in a high-dimensional space has many parameters and is hard to estimate.
• Vapnik argues that the fundamental problem is not the number of parameters to be estimated. Rather, the problem is the flexibility of a classifier.
• Typically, a classifier with many parameters is very flexible, but there are also exceptions.
– Let xi = 10^-i, where i ranges from 1 to n. The one-parameter classifier sign(sin(αx)) can classify all xi correctly for every possible combination of class labels on the xi (by a suitable choice of α).
– This 1-parameter classifier is therefore very flexible.
Why SVM Works?
• Vapnik argues that the flexibility of a classifier should be characterized not by its number of parameters, but by its capacity.
– This is formalized by the "VC-dimension" of a classifier.
• Consider a linear classifier in two-dimensional space.
• If we have three training data points, no matter how those points are labeled, we can classify them perfectly.
VC-dimension
• However, if we have four points, we can find a labeling such that the linear classifier fails to be perfect.
• We can see that 3 is the critical number.
• The VC-dimension of a linear classifier in a 2D space is 3 because, for 3 points in the training set, perfect classification is always possible irrespective of the labeling, whereas for 4 points perfect classification can be impossible.
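A sketch illustrating this claim: for the three (non-collinear) points below, a linear classifier (a Perceptron, used here as a stand-in) can realize every non-trivial labelling, while for four points the XOR labelling cannot be fit perfectly. The point coordinates are chosen arbitrarily for illustration.

```python
import numpy as np
from itertools import product
from sklearn.linear_model import Perceptron

def fits_perfectly(X, y):
    """Can a linear classifier reproduce this labelling on the training points?"""
    clf = Perceptron(max_iter=1000).fit(X, y)
    return np.array_equal(clf.predict(X), y)

X3 = np.array([[0, 0], [1, 0], [0, 1]])            # three non-collinear points
ok3 = all(fits_perfectly(X3, np.array(lab))
          for lab in product([0, 1], repeat=3) if len(set(lab)) == 2)
print("every labelling of 3 points realised:", ok3)                      # True

X4 = np.array([[0, 0], [1, 1], [1, 0], [0, 1]])
y_xor = np.array([0, 0, 1, 1])                      # the XOR labelling
print("XOR labelling of 4 points realised:", fits_perfectly(X4, y_xor))  # False
```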
VC-dimension
• The VC-dimension of the nearest neighbor classifier is infinity, because no matter how many points you have, you get perfect classification on the training data.
• The higher the VC-dimension, the more flexible a classifier is.
• VC-dimension, however, is a theoretical concept; in practice, the VC-dimension of most classifiers is difficult to compute exactly.
– Qualitatively, if we think a classifier is flexible, it probably has a high VC-dimension.