Introduction to Bioinformatics: Lecture VIII
Classification and Supervised Learning
Jarek Meller
Division of Biomedical Informatics,
Children’s Hospital Research Foundation
& Department of Biomedical Engineering, UC
Outline of the lecture
• Motivating story: correlating inputs and outputs
• Learning with a teacher
• Regression and classification problems
• Model selection, feature selection and generalization
• k-nearest neighbors and some other classification algorithms
• Phenotype fingerprints and their applications in medicine
Web watch: an on-line biology textbook by JW Kimball
Dr. J. W. Kimball's Biology Pages
http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/
Story #1: B-cells and DNA editing, Apolipoprotein B and RNA editing
http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/R/RNA_Editing.html#apoB_gene
Story #2: ApoB, cholesterol uptake, LDL and its endocytosis
http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/E/Endocytosis.html#ldl
Complex patterns of mutations in genes related to cholesterol transport and
uptake (e.g. LDLR, ApoB) may lead to an elevated level of LDL in the blood.
Correlations and fingerprints
Instead of building an underlying molecular model, which is often difficult to decipher, one may
simply try to find correlations between inputs and outputs. If measurements
of certain attributes correlate with molecular processes, underlying genomic
structures, phenotypes, disease states, etc., one can use such attributes as
indicators of these “hidden” states and make predictions for new cases.
Consider, for example, elevated levels of low density lipoprotein (LDL)
particles in the blood as an indicator (fingerprint) of atherosclerosis.
Correlations and fingerprints: LDL example
[3D scatter plot]
Healthy cases: blue; heart attack or stroke within 5 years from the exam: red (simulated data);
x – LDL; y – HDL; z – age (see the study by Westendorp et al., Arch Intern Med. 2003, 163(13):1549).
LDL example: 2D projection
[2D projection of the same data: HDL (y-axis, 20–100) vs. LDL (x-axis, 0–300)]
LDL example: regression with binary output and
1D projection for classification
[Plot: binary class label (y-axis, about -1 to 2) vs. LDL (x-axis, 0–300), with a regression curve fitted to the binary output]
Unsupervised vs. supervised learning
In the case of unsupervised learning the goal is to “discover” structure in the
data and group (cluster) similar objects, given a similarity measure. In the case
of supervised learning (or learning with a teacher) a set of examples with
class assignments (e.g. healthy vs. diseased) is given, and the goal is to
find a representation of the problem in some feature (attribute) space that
provides a proper separation of the imposed classes. Such representations,
with the resulting decision boundaries, may subsequently be used to make
predictions for new cases.
[Illustration: decision boundaries separating Class 1, Class 2 and Class 3 in feature space]
Choice of the model, problem representation and feature
selection: another simple example
[Illustration: two toy classification problems – adults vs. children plotted by height and weight; females (F) vs. males (M) plotted by estrogen and testosterone levels]
Gene expression example again: JRA clinical classes
[Heat map: 242 genes across 45 individuals (33 patients + 12 controls), with columns annotated by clinical class (Pauci, Poly, Systemic, Spondyloarthropathy, Control), MTX treatment status and X-ray findings; 105 genes with significantly lower expression in poly-articular JRA, 137 genes with significantly higher expression in poly-articular JRA]
Picture: courtesy of B. Aronow
Advantages of prior knowledge vs. problems with class assignment (e.g. in clinical practice)
[Illustration: protein structures – GLOBINS, FixL, PYP – with an uncertain class assignment marked “??”]
Prior knowledge – the same class despite low sequence similarity; this suggests that a distance
based on sequence similarity alone is not sufficient, and adding structure-derived features
might help (the “good model” question again).
Three phases in supervised learning protocols
• Training data: examples with class assignments are given.
• Learning:
i) an appropriate model (or representation) of the problem needs to be
selected in terms of attributes, distance measure and classifier type;
ii) adaptive parameters in the model need to be optimized to provide
correct classification of the training examples (e.g. by minimizing the
number of misclassified training vectors).
• Validation: cross-validation, independent control sets and other
measures of “real” accuracy and generalization should be used to
assess the success of the model and the training phase (finding the
trade-off between accuracy and generalization is not trivial); a minimal
sketch of this protocol follows below.
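As a hedged illustration of the three phases, the sketch below (in Python, assuming NumPy and a simple nearest-centroid classifier chosen purely for brevity) splits labeled examples into training and validation sets, fits the model on the training part, and reports accuracy on the held-out part; the data and the classifier are hypothetical placeholders, not the lecture's prescribed method.

```python
import numpy as np

def fit_nearest_centroid(X, y):
    # Learning phase: the "adaptive parameters" here are simply the class centroids.
    classes = np.unique(y)
    return {c: X[y == c].mean(axis=0) for c in classes}

def predict(centroids, X):
    # Assign each point to the class with the closest centroid (Euclidean distance).
    labels = list(centroids)
    dists = np.stack([np.linalg.norm(X - centroids[c], axis=1) for c in labels], axis=1)
    return np.array(labels)[dists.argmin(axis=1)]

# Training data phase: hypothetical feature matrix (e.g. age, LDL, HDL) and class labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 1] > 0).astype(int)        # toy rule standing in for real class assignments

# Hold out part of the data for validation.
idx = rng.permutation(len(X))
train, valid = idx[:70], idx[70:]

model = fit_nearest_centroid(X[train], y[train])

# Validation phase: accuracy on examples not seen during learning.
acc = (predict(model, X[valid]) == y[valid]).mean()
print(f"held-out accuracy: {acc:.2f}")
```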
Training set: LDL example again
• A set of objects (here patients) x_i, i = 1, …, N, is given. For each patient a set of
features (attributes and the corresponding measurements on these attributes) is
given too. Finally, for each patient we are given the class C_k, k = 1, …, K, to which
he/she belongs.
Age  LDL  HDL  Sex  Class
41   230  60   F    healthy (0)
32   120  50   M    stroke within 5 years (1)
45   90   70   M    heart attack within 5 years (1)

{ x_i, C_k }, i = 1, …, N
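As a small hedged sketch, the training set above could be encoded as a feature matrix X and a label vector y; the numeric encoding of Sex (F=0, M=1) and the use of NumPy are assumptions made purely for illustration.

```python
import numpy as np

# Each row is one patient x_i: [age, LDL, HDL, sex (F=0, M=1)] -- encoding is an assumption.
X = np.array([
    [41, 230, 60, 0],
    [32, 120, 50, 1],
    [45,  90, 70, 1],
], dtype=float)

# Class labels C_k: 0 = healthy, 1 = stroke or heart attack within 5 years of the exam.
y = np.array([0, 1, 1])

N, n_features = X.shape   # N = 3 patients, 4 features each
```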
Optimizing adaptable parameters in the model
• Find a model y(x; w) that describes the objects of each class as a
function of the features and adaptive parameters (weights) w.
• Prediction: given x (e.g. LDL = 240, age = 52, sex = male), assign a class C
(e.g. if y(x; w) > 0.5 then C = 1, i.e. the patient is likely to suffer a stroke or
heart attack in the next 5 years); see the sketch below.
[Illustration: decision function y(x; w) separating the two classes]
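A minimal sketch of this thresholding rule, assuming a linear model y(x; w) = w·x + b with hypothetical, hand-picked weights; in practice w and b would come from the training phase described earlier, and the HDL value below is an added assumption not given on the slide.

```python
import numpy as np

def y(x, w, b):
    # Linear decision function y(x; w) = w . x + b
    return float(np.dot(w, x) + b)

def predict_class(x, w, b, threshold=0.5):
    # Assign C = 1 (event within 5 years) if the model output exceeds the threshold.
    return 1 if y(x, w, b) > threshold else 0

# Hypothetical weights for [age, LDL, HDL, sex]; real values are learned from data.
w = np.array([0.01, 0.004, -0.01, 0.1])
b = -0.6

x_new = np.array([52, 240, 45, 1])   # age=52, LDL=240, HDL=45 (assumed), sex=male (1)
print(predict_class(x_new, w, b))    # prints 1 -> predicted stroke/heart attack within 5 years
```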
Examples of machine learning algorithms for
classification and regression problems
• Linear perceptron, Least Squares
• LDA/FDA (Linear/Fisher Discriminant Analysis) (simple linear cuts, kernel non-linear generalizations)
• SVM (Support Vector Machines) (optimal, wide-margin linear cuts, kernel non-linear generalizations)
• Decision trees (logical rules)
• k-NN (k-Nearest Neighbors) (simple, non-parametric)
• Neural networks (general non-linear models, adaptivity, “artificial brain”)
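As a hedged illustration, several of the classifiers listed above share the same fit/predict interface; scikit-learn is assumed here purely for convenience and is not prescribed by the lecture, and the toy data stand in for a real training set.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Toy training data standing in for the LDL example (features and labels are made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

classifiers = {
    "LDA": LinearDiscriminantAnalysis(),
    "SVM (linear kernel)": SVC(kernel="linear"),
    "Decision tree": DecisionTreeClassifier(max_depth=3),
    "3-NN": KNeighborsClassifier(n_neighbors=3),
}

for name, clf in classifiers.items():
    clf.fit(X, y)                # learning: optimize adaptive parameters
    acc = clf.score(X, y)        # training accuracy (not a measure of generalization!)
    print(f"{name}: training accuracy = {acc:.2f}")
```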
Training accuracy vs. generalization
Model complexity, training set size and generalization
[Plot: a set of data points (“data 1”) on x in 0–8, fitted with linear, cubic and 7th-degree polynomials]
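The effect in the figure can be reproduced with a short hedged sketch; NumPy's polyfit is assumed, and since the original data points behind the plot are not available, a toy noisy sample is generated instead.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy noisy data standing in for "data 1" (the original points are not available).
x = np.linspace(0, 8, 9)
y = 0.8 * x + rng.normal(scale=0.6, size=x.size)

for degree in (1, 3, 7):
    coeffs = np.polyfit(x, y, degree)        # least-squares polynomial fit
    y_fit = np.polyval(coeffs, x)
    train_err = np.mean((y_fit - y) ** 2)    # error on the training points only
    print(f"degree {degree}: training MSE = {train_err:.3f}")

# The 7th-degree fit has the lowest training error, but between the data points it
# oscillates; low training error does not imply good generalization.
```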
Similarity measures
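As a hedged sketch, two similarity/distance measures commonly used with such classifiers are Euclidean distance and a correlation-based distance (often used for gene expression profiles); the selection below is illustrative, not an exhaustive list from the lecture.

```python
import numpy as np

def euclidean_distance(a, b):
    # Standard Euclidean distance between two feature vectors.
    return float(np.linalg.norm(a - b))

def correlation_distance(a, b):
    # 1 - Pearson correlation: small when the two profiles vary together.
    return float(1.0 - np.corrcoef(a, b)[0, 1])

a = np.array([41.0, 230.0, 60.0])
b = np.array([32.0, 120.0, 50.0])
print(euclidean_distance(a, b))
print(correlation_distance(a, b))
```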
k-nearest neighbors as a simple algorithm for classification
• Given a training set of N objects with known class assignments and k < N,
assign each new object (not included in the training set) to one of the classes
based on the classes of its k nearest neighbors in the training set.
• A simple, non-parametric method that works surprisingly well, especially
for low-dimensional problems.
• Note, however, that the choice of the distance measure may again have a
profound effect on the results.
• The optimal k is found by trial and error.
k-nearest neighbor algorithm
Step 1: Compute the distances from the new point to all training points and take the k closest neighbors.
Step 2: Assign the class by simple majority vote: the new point is assigned to the class that occurs
most often among those k neighbors. A minimal sketch of both steps is given below.
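A minimal sketch of k-NN in Python; Euclidean distance is assumed as the similarity measure, the tiny [LDL, HDL] training set is made up for illustration, and ties are broken by whichever class is counted first, purely as an implementation convenience.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 1: compute distances from the new point to all training points
    # and take the k closest neighbors (Euclidean distance assumed).
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]

    # Step 2: simple majority vote among the k nearest neighbors.
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy training set: [LDL, HDL] with labels 0 = healthy, 1 = event within 5 years.
X_train = np.array([[120, 60], [110, 70], [230, 40], [250, 35], [90, 55]], dtype=float)
y_train = np.array([0, 0, 1, 1, 0])

print(knn_predict(X_train, y_train, np.array([240.0, 42.0]), k=3))   # -> 1
```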