Data Mining and Bioinformatics: Some Challenges
Download
Report
Transcript Data Mining and Bioinformatics: Some Challenges
Data Mining and Bioinformatics:
Some Challenges
Qiang Yang, Computer
Science and
Engineering
HKUST
Thanks:
HKUST RPC Project
-Ben Niu,
-Can Yang
-Prof. Hannah Xue
-Prof. W. Yu
http://www.cse.ust.hk/~qyang/
1
State of Art: DM for Bio
We know how to classify biological
sequences
We know how to cluster biological
entities
SVM, Neural Nets, Decision Trees, Rules
Bi-clustering, K-means, hierarchical
We know how to select features
PCA, LDA, SVM-RFE
http://www.cse.ust.hk/~qyang/
2
Data Mining: Challenges in Bio
1.
Non-traditional Feature Selection
2.
Explainable and Accurate Data Mining Methods
3.
NN, SVM Rules?
Transfer Learning
4.
When the number of attributes >> number of samples?
Highly imbalanced
Can knowledge learned from one set of samples help data
mining on another sample?
Exploiting the network structure
Individual i.i.d type of classification vs social networks?
http://www.cse.ust.hk/~qyang/
3
Challenge 1: Non-traditional Feature
Selection: Question: which (few) genes lead to diseases?
# of attributes >> # samples
Dataset name
Train
Test
Attributes
ALL-AML
leukemia
38
34
7129
‘Molecular Classification of Cancer: Class Discovery and Class
Prediction by Gene Exressoin Monitoring’, Science, Vol. 286, 1999.
Breast cancer
78
19
24481
‘Gene Expression Profiling Predicts Clinical Outcome of Breast Cancer’,
Nature, Vol. 415, 2002.
Central nervous
system
30
30
7192
‘Prediction of Central Nervous System Embryonal Tumour Outcome
Based on Gene Expression’, Nature, Vol. 415, 2002.
MLL_Leukemia
57
15
12582
‘MLL Translocations Specify A Distinct Gene Expression Profile that
Distinguishes A Unique Leukemia’, Nature Genetics, Vol. 30, 2002.
Large B-cell
Lymphoma
24
23
4026
‘Distinct types of diffuse large B-cell lymphoma identified by gene
expression profiling’. Nature, Vol. 403, 2000.
# of positive << # negative
KDD CUP 2002 task 2
Yeast Gene Regulation Prediction
Source
Train
Test
Positive set
Negative set
122
2896
1489
Traditional feature selection methods fail: overfitting, singularity of covariance matrix
http://www.cse.ust.hk/~qyang/
4
Non-traditional Feature Selection (2)
Some potential solutions
‘Characterization of a family of algorithms for generalized
discriminant analysis on undersampled problems’, Journal of
Machine Learning Research. Vol. 6, 2005.
Singularity problem is solved by splitting the subspace into the
regular and the irregular parts.
Irregular part (null space) of the within-class scatter matrix is fully
utilized to extract the discriminant info.
‘Two-Dimensional PCA: A New Approach to AppearanceBased Face Representation and Recognition’, IEEE
Transactions on Pattern Analysis and Machine Intelligence,
Vol. 26, 2004.
High dimensional data in 2D arrays are projected directly onto
the subspaces.
Size of covariance matrix can be reduced significantly.
Singularity is avoided.
http://www.cse.ust.hk/~qyang/
5
Non-traditional Feature Selection (3)
Other approaches:
Manifold learning
Manifold learning methods, e.g., Isomap, LLE, maintain the
local patterns of distribution during transform,
Extract features suitable for k-NN classifiers
Can be used to reduce the dimensionality of Bio. Data.
Semi-supervised learning
What if we have 10% labeled data, but the rest 90% are
unlabelled?
Build clusters around the labeled samples.
Samples in the same cluster are labeled as from the same
class, assuming they follow the normal distributions.
http://www.cse.ust.hk/~qyang/
6
Challenge 2: Explainable and
Accurate Data Mining Methods
Current methods, such as SVMs,
discriminant analysis, neural
networks, are ‘black box’ models.
The learned knowledge is hard to
understand by biologists.
Some potential solutions
Logic based method, e.g.,
decision trees and variants may be
better in giving the ‘IF-THEN’ like
rules that explicitly define the
epigenetic logics in cancer and
stem cell development.
DNA methylation rules can be
learned by using SVM based
recursive feature elimination and
fuzzy logics.
[Gene selection for cancer
classification using support vector
machines’, Machine Learning,
2002.]
http://www.cse.ust.hk/~qyang/
7
Epigenetic Analysis: A Case Study
Epigenetic events dominate
the growth of cancer and
embryonic stem cells
Genes can be turned on/ off
through Cytosine methylation
or Histone modifications
These two type of cells are
of great importance
The logics of DNA
methylation underlie the
cells’ behaviors
Traditional methods, SVMs,
ANNs are
‘black box’ models
Knowledge are trained
connection weights, or
Support Vectors.
Hard to understand for
biologists
Wish to Know: Methylation
status of CpG sites
CpG islands/ promoter regions
in DNA sequence
Cancer prediction
http://www.cse.ust.hk/~qyang/
8
Adaptive Cascade Sharing Trees
(ACS4) Niu et al. 2007 (tmr’s talk)
Objective: learn human
understandable rules that
define the epigenetic process in
cancer and embryonic stem cells
Idea:
Adaptively partition the
numeric attributes into a
set of the linguistic domains,
e.g., ‘high’, ‘very high’,
‘Medium’, ‘Low’, ‘Very Low’
Method: clustering
Train a committee of trees
to select the most salient
features and predict by
voting
Method: tree learning
http://www.cse.ust.hk/~qyang/
9
ACS4 method (2)
http://www.cse.ust.hk/~qyang/
10
ACS4 method (3)
Dataset:
37 hESC, 33 non-hESC, 24 cancer cell lines, 9 normal cell lines.
1,536 attributes
Result
Just 2 attributes are enough to separate the 3 cell types
No need of 40 attributes by using fisher’s score in [1].
Wet lab cost can be reduced by testing on 2 attributes only, instead of 40.
Accuracy is better, except when compared with SVM, but SVM cannot tell us ‘why’.
Rules can be easily understood to biologist to conceive new biological experiments
seeking in wet lab proof.
40 attributes: [1] ‘Human embryonic stem cells have a unique epigenetic signature‘, Genome Research, Vol. 16, 2006
11
http://www.cse.ust.hk/~qyang/
Challenge 3: Transfer Learning
In real life, data are hard to obtain
Can we leverage the knowledge learned in one
task/domain/data set for prediction of another?
Biological experiments are expensive
However, biological data are related
Humans often do this: having learned one language, find it
easier to learn another
In Web mining, having learned to classify one web site, use
the abstract knowledge to help classify another web site
Challenge: can we leverage the knowledge learned
from one data set to classify/cluster/predict another?
http://www.cse.ust.hk/~qyang/
12
Transfer Learning (Examples)
Problem: how to Propagate the classification
knowledge?
Difficulty: old and new data may have different distributions
t
0
Night time period
t
1
Day time period
http://www.cse.ust.hk/~qyang/
Time
13
Transfer Learning to Classify Web
[Dai,
et al, 2007] 20 newsgroups (20,000
documents, 20 data sets)
Old
comp.graphics (comp)
comp.os.mis-windows.misc (comp)
sci.crypt (sci)
sci.electronics (sci)
SRAA (A.
New
comp.sys.ibm (?)
comp.windows.x (?)
sci.med (?)
sci.space (?)
McCallum, 70,000 articles)
Old
sim-auto (auto)
sim-aviation (aviation)
http://www.cse.ust.hk/~qyang/
New
real-auto (?)
real-aviation (?)
14
Document-word co-occurrence
[Dai, et al. 2007]
Old
Di
Knowledge
transfer
New
Do
http://www.cse.ust.hk/~qyang/
15
Transfer Learning: Related Works
Semi-supervised Learning
[Zhu, Survey, Blum and Mitchell “co-training”, Nigam et al,
“EM-based”, Zeng et al “clustering”, Joachims,
“transductive”]
Distributions of training and test data are usually
assumed to be the same
Multi-task Learning
[Caruana, MLJ]
multiple Dis exist
Domain specific knowledge jointly learned to benefit each
other.
Focused on how multiple tasks helping each other
Semi-supervised
Clustering
Same distribution assumption, but can be relaxed
when must-links are few
http://www.cse.ust.hk/~qyang/
16
Transfer to Classify Web
Co-clustering
is applied between words and
out-of-domain documents (new tasks)
Word clustering is constrained by the labels
of in-domain (Old) documents
The word clustering part in both domains
serve as a bridge
http://www.cse.ust.hk/~qyang/
17
A Biological Transfer Problem
‘Promoter prediction analysis on the whole human
genome’, Nature Biotechnology, Vol. 22, 2004.
Most of the promoter prediction programs are effective on
individual chromosomes, e.g., Chr21, Chr22,
But inadequate to generalize to the whole genome scale
only 65% of accuracy rate on average too low
Can we build a unifying model for transferring the
learned knowledge to other chromosomes
to predict across the whole genome?
to cluster other genes and protein arrays?
to classify related sequences?
http://www.cse.ust.hk/~qyang/
18
Challenge 4: Exploiting the network
structure
We are short of labeled data
The matrix structures are very sparse if we only
have several hundred samples and a huge number
of attributes
Classification accuracy cannot be improved much
Gene expression data: tens or low hundreds of samples,
but tens of thousands of attributes (?)
Accuracy ~ less than 80%
Challenge: can we leverage the network
structure?
http://www.cse.ust.hk/~qyang/
19
Social Network Mining
Citation (Paper 2)
Title
Conference Name
Very large scale computational
analysis of gene and social networks.
Social networks: a social structure
made of nodes (individuals or
organizations) tied by one or more
specific types of relations.
Relations: financial exchanges, friends,
web links, disease transmission
(epidemiology), or gene interactions.
Small world phenomenon: chain of
social acquaintances required to
connect on arbitrary person to another
arbitrary person anywhere in the world
is generally short, five to seven
separation steps are sufficient.
Centrality Eigenvector: measure the
importance of node in a network.
Author (Paper1)
• Collective Classification
• Collective Recommendation
http://www.cse.ust.hk/~qyang/
20
Social Net Mining: Engineering
meets Science
‘Empirical Analysis of an Evolving Social Network’, Science, Vol. 311,
2006.
A dynamic social network comprising 43,553 students, faculty, and staff at a
large University.
Interactions between individuals are inferred from time-stamped e-mail
headers recorded over one academic year and are matched with affiliations
and attributes.
Findings:
when two students are in the same class, they are on average 3 times more likely
to interact if they also share an acquaintance
Netflix Challenge and KDDCUP 2007
Blog Evolution (NEC Work)
Can we deduce the actors’ roles/functions/attributes from the topology
of the network?
http://www.cse.ust.hk/~qyang/
21
Using Network Structure in Biology
‘Adaptive Response of a Gene
Network to Environmental Changes
by Fitness-Induced Attractor
Selection’, Plos One, 2006
The gene network is formulated as
differential equations, given some
initial state the network stabilized at
some attractors, corresponding to
the different cell types.
The complex dynamics of the gene
networks can explain the high
diversity of the species.
Given some perturbations, how will
the state of the gene networks
change to adjust the levels of gene
expression to environment factors?
The dynamics of a gene network are
described by differential equations, e.g., a
simplified network involving only two gene
nodes is formulated as:
where m1 and m2 are the gene
expression levels.
S(A) and D(A) are the rate coefficients of
synthesis and degradation. They depend
on A, which represents cellular activity.
g1 and g2 represent the noises in gene
expression.
http://www.cse.ust.hk/~qyang/
22
‘Adaptive Response of a Gene Network to Environmental Changes
by Fitness-Induced Attractor Selection’, PlosOne, 2006.
Given the initial condition, the
gene expression levels stabilize
at the attractors determined by
the coefficients of equations.
Real world gene networks can
be much more complex,
involving thousands of genes,
leading to the complex patterns
of attractors and cell activities.
http://www.cse.ust.hk/~qyang/
23
Conclusions
I have listed four challenges
1.
2.
3.
4.
Non-traditional Feature Selection
Explainable and Accurate Data Mining Methods
Transfer Learning
Exploiting the network structure
There are more…
Solving these challenges requires biologists
and computer scientists to work hand in
hand
http://www.cse.ust.hk/~qyang/
24