Transcript Jun-csi7162

Two Classifiers for
Bioinformatics
Course: CSI7162
Present to
Dr. Stan Matwin
Presented by
Jun Ouyang
1
Road Map





Introduction
Discovery of regulatory connections in
Microarray data (Gene level)
Classifying protein fingerprints (Protein
Level)
Conclusion
Research Challenges
2
Introduction to Bioinformatics




An emerging interdisciplinary research area
Interface between biological and computational
sciences
Computational management of all kinds of
biological information
Research scope of bioinformatics:

DNA  mRNA  protein  protein interactions 
informational pathways  informational networks 
cells  tissues or networks of cells  an organism 
populations  ecologies (hierarchical biological
information)
www.ee.nthu.edu.tw/bschen/files/Bioinformatics.ppt
3
Introduction to Bioinformatics




DNA level:
 DNA sequence alignment; gene prediction; gene
evolution;…
RNA level:
 Study of gene expression; transcription mechanism; posttranscription modification;…
Protein level:
 protein 2D and 3D structure prediction; protein active site
prediction; protein-protein interactions; protein-DNA
interactions;…
System level: (pathways, networks)
 Genome (gene-to-gene interactions)
 Ex: use gene chips to study gene regulatory network
 Proteome (protein-protein interactions)
 Ex: use protein chips to study protein interaction network
www.ee.nthu.edu.tw/bschen/files/Bioinformatics.ppt
4
Discovery of Regulatory
Connections in Microarray Data
(Gene level)






What is a microarray?
Obtaining microarray data
Definition of regulatory relations between sets of genes
(class labels)
Reversible Jump Markov Chain Monte Carlo (RJMCMC)
Algorithm to learn dynamic Bayesian network
Use dynamic Bayesian network classifiers to predict
regulatory relations
Conclusion (1)
5
What is a Microarray?



A kind of gene chip used to discover gene function
or gene expression patterns
Allow these patterns to be studied in parallel
Example:
In each location, a
known probe
(cDNA) is placed
with cDNA from a
certain sample

For example, cDNA
from cancerous and
healthy cells with
different probes
(known strands of
cDNA)
Colour indicates the relative abundance of a labeled
cDNA, meaning the gene has been activated
6
Obtaining Microarray Data
What are the steps?
[1] Choose cell population
[or sample for diagnosis]
[2] mRNA extraction, purify
[3] Fluorescent label cDNA
[4] Combine different strands of
cDNA on microarray
6
time
[5] Scan data over time
[6] Interpret time series
to study gene
regulation over time
7
Regulatory Relations
between Sets of Genes


Goal of this work is to identify which genes regulate
each other
We are interested in two types of gene regulation:
 Co-regulation


Two genes are perfectly co-regulated when their relative
abundance functions w.r.t time have the same first-order
derivatives
Control-regulation

Two genes are inversely co-regulated, or control-regulated,
when their relative abundance functions w.r.t time have first-order
derivatives which are inverses of one another
8
Microarray Data Representation


From the microarray, we obtain a time series
describing the gene interactions
At different moments in time the microarray would
show a different colour depending on which gene is
active
9
Discretization and Classifier
Construction





We must discretize the time signal in order to
facilitate learning with a Markov model
For each point in time, the sample value is set to
change, local minimum or local maximum.
These features are used to learn a dynamic
Bayesian classifier using a variant of the Reversible
Jump Markov Chain Monte Carlo (RJMCMC)
technique
The classifier then can identify gene interactions as
co-regulatory and controlled-regulatory
Details provided later…
10
Discretization of Continuous
Measurements
Re-encoding of data using 2
binary variables for each of
the 3 possible values
Time series represented as change,
local minima, local maxima
11
Monte Carlo Principle





If we take a sample every 1/100th of a second and we
measure for 10 minutes, we get 60000 samples per gene
We need a method for reducing the number of samples
without destroying the pertinent details
Given a very large set X and a distribution p(x) over it
We draw an i.i.d. set of N samples
We can then approximate the distribution using these
samples
p(x)
1
p N ( x) 
N
N
1( x
i 1
(i )
 x)
X
 p(x)
N 
An Introduction to Markov Chain Monte Carlo, Teg Grenager
12
Dynamic Bayesian Network
Classifiers


A Bayesian network: a statistical model for capturing
the direct dependencies between discrete stochastic
variables
Train dynamic Bayesian networks to discover
relations between sets of genes
Co-regulated
Control-regulated
time

time
The Bayesian networks are trained using a variant
of the RJMCMC sampling algorithm
13
Learning with Probabilistic
Network Classifiers

Objective function of network classifier
P( L(C ) | X , G,  * , D)   l ( P(c | x, G,  * , d ); ,  )
d D
D: the learning database
X: a set of predictor variables
L(C): a score function
l ( y, ,  ) : a modif. step function
 : the span of indeterminacy
 : the bounding probability
G: a DAG structure.
C: a classification variable
D=(C,X): the learning database is
separated into X and C
Goal: sample models from the above target distribution
P(L(C)|D).
14
Experiments Result
Controlledregulated
Co-regulated
Lag(-1,-2)
Valid
SWI4
CLB2
Y
CDC6
CLB2
Y
AGA1
CLB2
Y
ASH1
CLB2(SWI5)
Y
CDC45
CLB2(MCM1) Y
Target
gene
CLB1
Lag(0)
Valid
Close
CLB2
Y
Y
BUD4
CLB2
Y
Y
CDC47
CDC28
N
N
CTS1
FUS1
MFA2
SWI5(MCM1) N (?)
SWI5
Y
CLN3(CLB2)
N(Y)
N
Testing result of yeast cell-cycle expression dataset
15
Conclusion (1)



A new approach to discovering supposed
regulatory relations between genes
Processing techniques capture dynamic
relations between sets of genes
The results obtained from the microarray data
are promising
16
Classifying Protein
Fingerprints
Motivation
 Task and Data Representation
 Data Preprocessing and Classification
Methods
 Experimentation and Results
 Conclusion (2)

17
Motivation

The need for automated protein fingerprint
labeling



Protein fingerprint: group of amino acid motifs
used to depict protein families
These fingerprints may be useful in grouping
proteins together
Improve on PRECIS (an annotation tool)


This tool performs poorly: 40% error rate
Classifies fingerprints using simple heuristics
18
Task and Data Representation


Goal: replace PRECIS’s handcrafted
heuristics with classification models
extracted from data.
Three distinct kinds of fingerprints



Fingerprint itself
Its component motifs (motif is a common
sequence of amino acids)
Protein
19
Task and Data Representation

Fingerprint





Number of motifs (nmt)
Number of proteins (npr)
True positive rate
Partial positive rate
Motif





Motif length (average, std, etc.)
Motif coverage ( average, stdev, etc.)
Motif entropy ( average, stdev, etc.)
Motif entropy ( average, stdev, etc.)
Intermotif distance (average, stdev, etc.)
20
Task and Data Representation

Protein sequence

SWISS-PROT ID: fraction of proteins with ID




LHS: frac of proteins whose length>=3|4 chars
frac of proteins with common first 1|2|3|4 chars
in LHS
entropy of LHS averaged over first 1|2|3|4chars
RHS: frac of proteins with a common RHS (species)
entropy of RHS taken as a unit
CC-belongs: sequence belongs to family
CC-contains: sequence contains domain
21
Data Preprocessing


Dealt with missing values using a
technique based on KNN
Considered several feature selection
algorithms


Ranking based on information gain
I(X,Y)=H(X)-H(X|Y)= H(Y)-H(Y|X)
Ranking based on mutual information
 H ( X )  H (Y )  H ( X , Y ) 
U ( X , Y )  2

H ( X )  H (Y )


22
Classification Methods


This work compares the performance of
several machine learning algorithms when
combined with a feature selection method
ML Algorithms considered:

Logic-based learning algorithm


Decision trees and rules (J48 and C5.0, etc)
Density-estimation based learners

NBayes, IBL, Lindisc, MLPS, SVM-RBF
23
Experimentation and Results
Method
Defaults
PRECIS
SVM-RBF
RandomForest
C5.0boost
MLP
IBL
Lindisc
LTree
J48
Part
Nbayes
Parameters
G=0.05,C=50
I=100,K=6
B=10,C=0.1
H=10
K=10
C=0.05
C=0.01
C=0.05
K
CV error
45.60.
39.55
14.06
14.59
15.13
15.13
15.47
15.80
16.27
16.48
19.97
23.20
HO error
46.19
40.28
14.65
17.46
18.59
16.62
19.44
17.18
17.46
19.15
21.69
27.07
Best
performance
Error rates on the full 45-feature set
CV: Cross-validation, HO: Holdout
24
Experimentation and Results
Method
Parameters
SVM-RBF
RandomForest
C5.0boost
MLP
IBL
Lindisc
LTree
J48
Part
Nbayes
G=0.05,C=50
I=100,K=6
B=10,C=0.1
H=10
K=10
C=0.05
C=0.01
C=0.05
K
Feature
selector
ReliefF
InfoGain
ReliefF
ReliefF
SymmU
ReliefF
SymmU
SymmU
CFS
CFS
#features CV error
HO error
36
40
32
40
32
40
32
32
7-10
7-10
14.08
16.61
16.90
16.90
17.46
17.18
18.59
19.72
18.03
23.66
14.09
14.19
14.79
14.86
14.93
15.40
15.53
15.53
17.35
18.02
Cross-validation and Holdout error rates after feature
selection
25
Conclusion (2)


SVM does not seem to benefit from the
feature selection process (feature selection
only removed 9 features!)
Using a SVM-RBF learned classifier
achieves a 26% improvement in accuracy
over PRECIS
26
Research Challenges

First paper


Validate the new approach on real data
sets (only simulated data was used)
Second paper


Correcting data imbalance to increase
accuracy
Incorporate available data from other
databases
27
References

M. Egmont-Petersen. W. de Jonge, A. Siebes. "Discovery of regulatory connections in microarray
data," In Proceedings of the 8th European Conference on Principles and Practice of Knowledge
Discovery in Databases (PKDD) 2004:149-160

Melanie Hilario, Alex Mitchell, Jee-Hyub Kim, Paul Bradley, Terri K. Attwood: Classifying Protein
Fingerprints. PKDD 2004: 197-208

M. Egmont-Petersen. "Discovering possible co-relations and control-regulations between gene pairs in
time series microarray data using salient dynamic features", Presented at the working group
Bioinformatics, Symposium 2004.

M. Egmont-Petersen. "Feature selection by Markov Chain Monte Carlo Sampling - a Bayesian
approach," In Structural, Syntactic, and Statistical Pattern Recognition, Proceedings of the Joint IAPR
Workshops SSPR 2004 and SPR 2004, Lecture Notes in Computer Science 3138, Eds. A. Fred et al.,
pp. 1034-1042, 2004.

Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B.
“Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by
microarray hybridization,” Molecular Biology of the Cell, Vol. 9, No. 12, pp. 3273-3297, 1998.

Green PJ. “Reversible jump Markov chain Monte Carlo computation and Bayesian model
determination,” Biometrika, Vol. 82, No. 4, pp. 711-732, 1995.

An Introduction to Markov Chain Monte Carlo. Teg Grenager. July 1, 2004. Agenda

www.ee.nthu.edu.tw/bschen/files/Bioinformatics.ppt
28
Q&A
Thank you!
29