
Statistical Classification
for Gene Analysis based
on Micro-array Data
Fan Li & Yiming Yang
[email protected]
In collaboration with Judith Klein-Seetharaman
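Principles of cDNA microarray
[Figure: cDNA microarray workflow. DNA clones go through PCR and purification and are printed onto the array by robot; the treated and reference samples are reverse-transcribed and labeled with fluorescent dyes; the targets are hybridized to the microarray; excitation by Laser 1 and Laser 2 produces the emission that goes to computer analysis. Source: G. Gibson et al.]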
Microarray data: what does it look like?
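Expression matrix
[Figure: expression matrix with genes G1, G2, …, GN-1, GN along one axis and experiments Exp 1, Exp 2, Exp 3, …, Exp i, …, Exp M along the other. Reading along the experiments gives the expression level of a gene across treatments; reading along the genes gives the expression profiles of genes in a certain condition.]
Typical examples
• Conditions: heat shock, G phase in cell cycle, etc.
• Samples: liver cancer patient, normal person, etc.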
AML/ALL micro-array dataset


This dataset can be downloaded from http://genomewww.standford.edu/clustering
Matrix
• Each row: a gene
• Each column: a patient (a sample)
• Each patient belongs to one of two disease types: AML (acute myeloid leukemia) or ALL (acute lymphoblastic leukemia)
• The 72 patient samples are further divided into a training set (27 ALLs and 11 AMLs) and a test set (20 ALLs and 14 AMLs). The whole dataset covers 7129 probes from 6817 human genes.
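A minimal sketch of this layout, assuming NumPy. The sizes come from the slide; the expression values are random placeholders and the column ordering (training patients first, ALL before AML) is an assumption made only for illustration.

```python
import numpy as np

N_PROBES = 7129            # rows: probes (from the slide)
N_TRAIN, N_TEST = 38, 34   # 27 ALL + 11 AML, and 20 ALL + 14 AML

rng = np.random.default_rng(0)
# Placeholder values only; the real expression levels come from the dataset.
X = rng.normal(size=(N_PROBES, N_TRAIN + N_TEST))

# One label per column (patient): 0 = ALL, 1 = AML. Ordering is assumed.
y_train = np.array([0] * 27 + [1] * 11)
y_test = np.array([0] * 20 + [1] * 14)
y = np.concatenate([y_train, y_test])

# Split the columns into the training and test patients.
X_train, X_test = X[:, :N_TRAIN], X[:, N_TRAIN:]
```

Keeping genes as rows matches the slide; many classifiers expect samples as rows, in which case the matrix is simply transposed.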
Published work on AML/ALL




• Classification task: gene expression -> {AML, ALL}
• Techniques: Support Vector Machines (SVM), Rocchio-style and logistic regression classifiers
• Main finding: classifiers perform better with a small subset (8) of genes than with thousands
• Implication: many genes are irrelevant or redundant?
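As an illustration of the simplest of these techniques, here is a sketch of a Rocchio-style (nearest-centroid) classifier on a small gene subset. The data are synthetic, the use of Euclidean distance is an assumption, and the function names are hypothetical; this is not the published setup.

```python
import numpy as np

def fit_centroids(X, y):
    """Return the class labels and one mean expression profile per class.

    X is samples x genes; y holds integer class labels.
    """
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(X, classes, centroids):
    """Assign each sample to the class whose centroid is nearest."""
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

rng = np.random.default_rng(0)
y_train = np.array([0] * 27 + [1] * 11)   # 0 = ALL, 1 = AML (sizes from the slide)
X_train = rng.normal(size=(38, 8))        # 8 genes, synthetic values
X_train[y_train == 1] += 4.0              # planted class difference

classes, centroids = fit_centroids(X_train, y_train)
train_pred = predict(X_train, classes, centroids)
```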
Possible Relationship (Hypothesis)
[Figure: hypothesized graph linking the disease node to Gene1 through Gene8.]
How can we find such a structure?

• Find the most informative genes (“primary” ones)
  • Statistical feature selection (brief)
• Find the genes related (or “similar”) to the primary ones
  • Unsupervised clustering (detailed), based on statistical patterns of genes distributed over microarrays
• Bayes network for causal reasoning (future direction)
Possible Relationship (Hypothesis)
[Figure, repeated: hypothesized graph linking the disease node to Gene1 through Gene8.]
Feature selection




• Feature selection
  • Choose a small subset of input variables (a few instead of 7000+ genes, for example)
• In text categorization
  • Features = words in documents
  • Output variables = subject categories of a document
• In protein classification
  • Features = amino acid motifs …
  • Output variables = protein categories
• In genome micro-array data
  • Features = “useful” genes
  • Output variables = whether a patient is diseased or not
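A minimal sketch of filter-style feature selection: score every gene independently, then keep the top few. The signal-to-noise statistic used here, (mean0 - mean1) / (std0 + std1), is one common choice and is an assumption for illustration; the data are synthetic and the function names are hypothetical.

```python
import numpy as np

def signal_to_noise(X, y):
    """Score each gene (row of X) by (mean0 - mean1) / (std0 + std1)."""
    a, b = X[:, y == 0], X[:, y == 1]
    return (a.mean(axis=1) - b.mean(axis=1)) / (a.std(axis=1) + b.std(axis=1))

def top_k_genes(X, y, k):
    """Indices of the k genes with the largest absolute score."""
    scores = signal_to_noise(X, y)
    return np.argsort(-np.abs(scores))[:k]

rng = np.random.default_rng(0)
y = np.array([0] * 10 + [1] * 10)   # two classes of 10 samples each
X = rng.normal(size=(50, 20))       # 50 genes x 20 samples, synthetic
X[7, y == 1] += 5.0                 # plant one informative gene

selected = top_k_genes(X, y, 3)
```

Because each gene is scored on its own, a filter like this cannot tell that two high-scoring genes carry the same information.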
Feature selection on micro-array (AML vs ALL)




• Golub-Slonim: GS-ranking (filtering method)
• Ben-Dor: TNoM-ranking (filtering method)
• Isabelle Guyon: Recursive SVM (wrapper method)
  • Selected 8 genes (out of 1000+ in that dataset)
  • Accuracy 100%
• Our work (Fan & Yiming) (best)
  • Selected 3 genes (using ridge regression)
  • Accuracy 100%
Feature selection experiments already done on this micro-array data
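The slides do not say how ridge regression was used to pick genes. One plausible scheme, sketched here purely as an assumption, is to fit a ridge model to the class labels and rank genes by the absolute size of their weights. The data are synthetic and deliberately small (the real problem has far more genes than samples).

```python
import numpy as np

def ridge_weights(X, y, lam=1.0):
    """Closed-form ridge solution w = (X^T X + lam*I)^-1 X^T y.

    X is samples x genes; columns and labels are centered first.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

rng = np.random.default_rng(1)
n, p = 60, 10                            # synthetic sizes, not the AML/ALL data
y = np.array([-1.0] * 30 + [1.0] * 30)   # class labels coded as -1 / +1
X = rng.normal(size=(n, p))
X[:, 4] += 2.0 * y                       # plant one informative gene

w = ridge_weights(X, y)
ranking = np.argsort(-np.abs(w))         # genes ordered by |weight|
```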

The 3 genes we found

• Id1882: CST3 Cystatin C (amyloid angiopathy and cerebral hemorrhage), M27891_at

• Id6201: INTERLEUKIN-8 PRECURSOR, Y00787_at

• Id4211: VIL2 Villin 2 (ezrin), X51521_at
Some analysis of the results we obtained


The first two genes are strongly correlated with each
other.
The third gene is very different from the first two
genes.

• 1st gene + 2nd gene performs poorly (10/34 errors)

• 1st gene + 3rd gene performs well (1/34 errors)
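The kind of check behind "strongly correlated" can be sketched with a Pearson correlation matrix. The three profiles below are synthetic stand-ins (a near-copy and an unrelated profile), not the actual expression values of the three genes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200                               # synthetic samples, not the 72 patients
g1 = rng.normal(size=n)
g2 = g1 + 0.1 * rng.normal(size=n)    # nearly a copy of g1 (redundant)
g3 = rng.normal(size=n)               # unrelated to g1

corr = np.corrcoef(np.vstack([g1, g2, g3]))   # 3 x 3 Pearson correlation matrix
r12, r13 = corr[0, 1], corr[0, 2]
```

A pair like g1 and g2 adds little beyond either gene alone, which is one way to read the gap between the two error rates above.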
Question: As the next step, can we find more gene-gene relationships?
Several techniques are available:
• Clustering
• Bayesian network learning
• Independent component analysis
• …
Clustering Analysis in micro-array data


Clustering methods have already been widely used to
find similar genes or common binding sites from
micro-array data.
A lot of different clustering algorithms…
• Hierarchical clustering
• K-means
• SOM
• CAST
• …
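As one concrete instance from this list, here is a small k-means sketch over gene expression profiles (rows = genes). The farthest-point initialization is a simplification chosen to keep the example deterministic, and the data are synthetic; this is not the setup used in the experiments.

```python
import numpy as np

def kmeans(X, k, n_iter=20):
    """Plain k-means on the rows of X; returns (labels, centers)."""
    # Farthest-point initialization: start from the first row, then keep
    # adding the row farthest from the centers chosen so far.
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)

    for _ in range(n_iter):
        # Assign each row to its nearest center (squared Euclidean distance).
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of the rows assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=0.5, size=(20, 5))    # 20 genes, 5 samples
group_b = rng.normal(loc=10.0, scale=0.5, size=(20, 5))
profiles = np.vstack([group_a, group_b])

labels, centers = kmeans(profiles, k=2)
```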
An example of hierarchical clustering analysis (from Spellman et al.)
Our clustering experiment on AML/ALL dataset

Our clustering is run over the 1000 genes most relevant to the disease.
The feature-selection curve
Our clustering result on the top 1000 genes
Some analysis of the clustering result

• The first two genes always fall in the same cluster (cluster 1 in hierarchical clustering, cluster 2 in k-means clustering).

• The third gene never falls in the same cluster as the first two genes (it is in cluster 23 in hierarchical clustering and in cluster 1 in k-means clustering).

• This agrees with our previous analysis.
Disadvantage of Clustering

However…

• It cannot reveal the internal relationships inside one cluster.
• It cannot reveal the relationships between clusters.
• Genes connected to each other may not be in the same cluster.
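One way the last point can arise, sketched with made-up numbers and assuming Euclidean distance between profiles: a gene whose expression is the mirror image of another is perfectly (negatively) correlated with it, yet the two are far apart and would land in different clusters.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical gene A across 5 samples
b = -a                                     # gene B: mirror image of A
c = a + 0.5                                # gene C: same pattern as A, shifted

dist_ab = np.linalg.norm(a - b)            # large: A and B look dissimilar
dist_ac = np.linalg.norm(a - c)            # small: A and C look similar
corr_ab = np.corrcoef(a, b)[0, 1]          # yet A and B are perfectly anti-correlated
```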
Clustering vs Bayesian network learning (copied from David K. Gifford, Science, Vol. 293, Sept. 2001)
A counterexample to clustering analysis
Bayesian network learning


• Thus a Bayesian network seems a much better technique if we want to model the relationships among genes.
• Researchers have run experiments and constructed Bayesian networks from microarray data.
  • They found that a few genes have many connections with other genes.
  • They used prior biological knowledge to validate the learned edges (interactions between genes) and found them reasonable.
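Finding the "few genes with many connections" amounts to counting node degrees in the learned graph. A minimal sketch, assuming the network is available as a list of directed edges; the gene names, edges, and the degree threshold are hypothetical.

```python
from collections import Counter

# Hypothetical learned edges (parent, child); gene names are placeholders.
edges = [
    ("geneA", "geneB"), ("geneA", "geneC"), ("geneA", "geneD"),
    ("geneB", "geneD"), ("geneE", "geneA"),
]

# Degree = number of edges touching a gene, regardless of direction.
degree = Counter()
for parent, child in edges:
    degree[parent] += 1
    degree[child] += 1

# Genes with at least 3 connections (threshold chosen for illustration).
hubs = [gene for gene, d in degree.most_common() if d >= 3]
```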
An example of a Bayesian network

Part of the Bayesian network that Nir Friedman constructed. There are 800 genes (nodes) in total in the graph, all of them cell-cycle regulated genes.
Our plan in genetic regulatory network
construction
There are several possible ways:

• Use feature selection to make the network learning task more robust and less computationally costly.
• Learn gene regulatory networks on microarray datasets with disease labels (so that we may find pathways relevant to a specific disease).
• Use ICA to find hidden variables (hidden layers) and check their consistency with the Bayes network learning result.
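A rough sense of why feature selection cuts the cost of network learning: the number of candidate directed edges grows with the square of the number of genes. The count below is only a proxy for search cost, an assumption rather than a statement about any particular learning algorithm.

```python
def candidate_edges(n_genes):
    """Number of ordered gene pairs, i.e. possible directed edges."""
    return n_genes * (n_genes - 1)

all_probes = candidate_edges(7129)   # every probe in the AML/ALL dataset
top_1000 = candidate_edges(1000)     # after keeping the top 1000 genes
reduction = all_probes / top_1000    # how many times fewer candidate edges
```

Restricting to the top 1000 genes leaves roughly one fiftieth of the candidate edges.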
Our plan in genetic regulatory network
construction

• Use prior biological knowledge about gene networks, such as “network motifs”. The following example is copied from Shai S. Shen-Orr, Nature Genetics, 2002. Previous network learning algorithms have not considered these characteristics.
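The slide's figure is not reproduced here. As an illustration of what a motif check could look like, this sketch counts feed-forward loops (X regulates Y, Y regulates Z, and X also regulates Z), one of the motif types described in that line of work. The edge list and function name are hypothetical.

```python
from itertools import permutations

def feed_forward_loops(edges):
    """Return all (x, y, z) with edges x->y, y->z and x->z."""
    edge_set = set(edges)
    nodes = {n for e in edges for n in e}
    return sorted(
        (x, y, z)
        for x, y, z in permutations(nodes, 3)
        if (x, y) in edge_set and (y, z) in edge_set and (x, z) in edge_set
    )

# Hypothetical regulatory edges (regulator, target).
edges = [("X", "Y"), ("Y", "Z"), ("X", "Z"), ("Z", "W")]
loops = feed_forward_loops(edges)
```

Trying every ordered triple is fine for a toy graph; a network with hundreds of genes would call for iterating over edges instead.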
References
• Friedman, N., Linial, M., Nachman, I. Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7:601-620, 2000.
• Guyon, I. et al. Gene selection for cancer classification using support vector machines. Machine Learning, 46:389-422.
• Eisen, M.B. et al. Cluster analysis and display of genome-wide expression patterns. PNAS, 95:14863-14868, 1998.
• Ben-Dor, A., Shamir, R., and Yakhini, Z. Clustering gene expression patterns. Journal of Computational Biology, 6(3/4):281-297, 1999.