Data Mining and Bioinformatics: Some Challenges


Pattern Detection and Co-methylation
Analysis of Epigenetic Features in Human
Embryonic Stem Cells
Ben Niu, Qiang Yang, Jinyan Li, Hong Xue, Simon Chikeung Shiu, Weichuan Yu, Huiqing Liu, Sankar Kumar Pal
HKPolyU
Computational Epigenetics

An emerging and exciting area that combines the state of the art in
Machine learning
Molecular biology

Aims to understand the epigenetic process in gene transcriptional regulation
Aims to add this knowledge to the medical arsenal for treating human diseases.

The Research

Human Epigenome Project (HEP): the next wave after the Human Genome Project (HGP)
Started in 2003 after completion of the Human Genome Project.
HEP aims to identify the epigenetic markers associated with human diseases.
‘Journal of Epigenetics’ has been released: the first journal dedicated to communications in epigenetics, started in 2006.
Series of publications in highly cited journals in 2005-07:
Nature: Focus issue on epigenetics, Nature Reviews Genetics, April 2007.
Cell: Special issue on epigenetics, Cell, February 2007.
J. Bioinformatics: We are jointly invited to write a review paper on computational epigenetics for the Journal of Bioinformatics.
The Industry

Epigenetics opens a rapidly growing market of
epigenetic medical services (diagnostics, drugs)

According to a 2007 report by MarketResearch, as shown in the figure,
the global market for epigenetic applications (i.e., drugs + diagnostic
services) will reach 4 billion US$ by 2012; the annual growth rate at
present is 60.4%.
[Figure: global market (million US$) by year, 2005 to 2012; y-axis from 0 to 4500; annotated ‘Promising direction!’]
What we know

Basically:
Genes can be turned on/off through cytosine methylation or
histone modifications, a reversible process
Epigenetic events are heritable and can change the cell’s
phenotype without altering its DNA sequence

Functionally:
Epigenetic events dominate the growth of cancer and embryonic stem cells
These two types of cells are of great medical interest
Cancer is the leading cause of human death
hESCs are the answer to regenerative treatments
For these two points see: Nature Insight: Epigenetics, Vol. 447, 2007.
What we don’t know

The logic by which DNA methylation underlies
cells’ behaviors remains unclear
How DNA methylation orchestrates the products of the
molecular machinery for cell functions is also unclear
In the context of epigenetics, we need to
address two issues:
What are the rules of DNA methylation that distinguish
cancer, normal, and human ES cells from each other?
What are the interaction patterns of the genes in
these cells, and what is the role of methylation in
coordinating the activities of genes?
State of the art in Methylation Analysis

SVMs and ANNs have been successfully applied to predict
epigenetic events, for example:

Methylation status of CpG sites
CpG islands / promoter regions in DNA sequence
‘CpG island mapping by epigenome prediction’, PLoS Computational Biology, Vol. 3(6), 2007.
‘Promoter prediction analysis on the whole human genome’, Nature Biotechnology, Vol. 22, 2004.
Cancers
‘Computational prediction of methylation status in human genomic sequence’, PNAS, Vol. 103(28), 2006.
‘Tumour class prediction and discovery by microarray-based DNA methylation analysis’, NAR, Vol. 30, 2002.

Co-regulation analysis through clustering

Clustering of methylation arrays
Marjoram P, Chang J, Laird PW, Siegmund KD: Cluster analysis for DNA methylation profiles having a detection threshold. BMC Bioinformatics, Vol. 7, 2006.
2 Problems
1. Traditional methods such as SVMs and ANNs are ‘black box’ models
The knowledge extracted is characterized by
connection weights and support vectors,
which are hard for biologists to understand
2. Investigate the co-methylation patterns in
Cancer cells
Human embryonic stem cells (hESCs)
Co-methylation analysis can help to uncover the
hidden pathways, leading to new drug design
Methodology

Two computational methods proposed
1. Adaptive Cascade Sharing Trees (ACS4) for problem 1
To learn human-understandable DNA
methylation rules
2. Adaptive clustering for problem 2
To highlight the orchestration of genes for
function through the methylation mechanism
ACS4 method (1)

Promoters are regulatory elements located upstream (5’)
of the transcription start site (TSS).
Methylation of promoter CpGs remodels the
chromatin structure for gene expression
[Figure labels: methylated CpG; methyl-binding proteins (MeCP); methyltransferase; histone deacetylases (HDAC)]
ACS4 method (2)



Methylation levels of promoters can be
measured using microarrays
Each spot on the array corresponds to a
promoter CpG site.
The methylation intensity is a numerical
value between 0 and 1.
ACS4 method (3)


Objective: learn human-understandable rules that
define the epigenetic process in cancer and
embryonic stem cells
Idea:
Adaptively partition the numeric attributes into a
set of linguistic domains, e.g., ‘Very High’, ‘High’,
‘Medium’, ‘Low’, ‘Very Low’.
Train a committee of trees to select the most
salient features and predict through voting.
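
The partitioning idea above can be sketched as follows. The slides do not say how the adaptive partition is computed, so this sketch swaps in plain equal-frequency (quantile) binning as a stand-in; the function names `quantile_cut_points` and `to_linguistic` are hypothetical.

```python
import bisect

# Linguistic domains named on the slide, ordered from lowest to highest.
LABELS = ["Very Low", "Low", "Medium", "High", "Very High"]

def quantile_cut_points(values, n_bins=5):
    """Return n_bins - 1 cut points so each bin holds roughly the same
    number of training values (assumption: equal-frequency binning,
    not the authors' adaptive scheme)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]

def to_linguistic(value, cut_points, labels=LABELS):
    """Map a methylation intensity in [0, 1] to its linguistic label."""
    return labels[bisect.bisect_right(cut_points, value)]

# Hypothetical methylation intensities for one CpG site across samples.
intensities = [0.05, 0.1, 0.2, 0.3, 0.45, 0.5, 0.6, 0.7, 0.85, 0.95]
cuts = quantile_cut_points(intensities)
print([to_linguistic(v, cuts) for v in (0.05, 0.5, 0.95)])
```

A learned rule can then test a label such as ‘High’ instead of a raw numeric threshold, which is what makes the rules readable.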
ACS4 method (4)
ACS4 method (5)
ACS4 method (6)
ACS4 method (7)




We have learned k rules
Given a testing sample, compute p_i
Rules are weighted according to their coverage,
i.e., the number of matched samples
Overall prediction is made by voting across the rules.
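
A minimal sketch of coverage-weighted voting, under stated assumptions: the slide does not define p_i, so a rule here simply matches a sample or does not, and each matching rule votes with its coverage. The `Rule` class, the attribute names `site_A` and `site_B`, the class names, and the coverage counts are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Rule:
    conditions: dict  # attribute name -> required linguistic label
    label: str        # class predicted when the rule matches
    coverage: int     # number of training samples the rule matched

def matches(rule, sample):
    """A rule matches when every one of its conditions holds for the sample."""
    return all(sample.get(attr) == value
               for attr, value in rule.conditions.items())

def predict(rules, sample):
    """Sum the coverage of all matching rules per class; the largest total wins.
    Returns None when no rule matches."""
    votes = defaultdict(float)
    for rule in rules:
        if matches(rule, sample):
            votes[rule.label] += rule.coverage
    if not votes:
        return None
    return max(votes, key=votes.get)

# Hypothetical rule set with made-up coverage counts.
rules = [
    Rule({"site_A": "High"}, "class_1", coverage=30),
    Rule({"site_A": "Low", "site_B": "Low"}, "class_2", coverage=8),
    Rule({"site_B": "Low"}, "class_1", coverage=5),
]
print(predict(rules, {"site_A": "Low", "site_B": "Low"}))
```

In the last call two rules match with conflicting classes, and the rule with the larger coverage decides the outcome.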
ACS4 method (8)

Dataset:
37 hESC, 33 non-hESC (24 cancer cell lines, 9 normal cell lines).
1,536 attributes

Result:
Just 2 attributes are enough to separate the 3 cell types
No need for the 40 attributes selected by Fisher’s score in [1].
Wet-lab cost can be reduced by testing 2 attributes only, instead of 40.
Accuracy is better, except when compared with SVM, but SVM cannot tell us ‘why’.
Rules are easily understood by biologists, who can use them to conceive new
biological experiments seeking wet-lab proof.
[1] ‘Human embryonic stem cells have a unique epigenetic signature’, Genome Research, Vol. 16, 2006
ACS4:Biological interpretation(1)

Example:
IF PI3-504 is ‘High’ THEN hESC
IF PI3-504 is ‘Low’ AND NPY-1009 is ‘Low’ THEN Normal
IF PI3-504 is ‘Low’ AND NPY-1009 is ‘High’ THEN Cancer

ACS4:Biological interpretation(2)

The two marker genes

PI3 (PI 3-kinases): activate cell growth, proliferation,
differentiation, motility, intracellular trafficking
Down-regulated in hESCs
Maintains a stable state
Keeps cells from growth, proliferation, differentiation…

Neuropeptide Y (NPY): signal protein produced by nerves
Experiment shows that deficiency of NPY causes immune defects
[Immunology: Stress and Immunity, Science, Vol. 311, 2006.]
Consistent with our computational result
ACS4: Biological interpretation(3)

Example:

IF PI3-504 is ‘High’ THEN hESC
PI3 gene is silenced to maintain a stable cell context in hESCs

IF PI3-504 is ‘Low’ AND NPY-1009 is ‘Low’ THEN Normal
Normal cells can grow, and grow safely with immune defenses

IF PI3-504 is ‘Low’ AND NPY-1009 is ‘High’ THEN Cancer
Cancer cells grow, and grow out of control, due to the immune
deficiency
Adaptive clustering (1)

Co-methylation of genes is important
because we want to know how genes
work together in the epigenetic framework
Clustering should reflect the true distribution
of the gene space,
assuming data are normally distributed, which is
usually the case in real-world applications
Fisher’s criterion is computed to validate the
result of clustering and choose the best one.

Adaptive clustering (2)

For embryonic and cancer cells we
optimally cluster the 1536 genes
For each round of clustering with k-means, we start from
a different number of initial centers.
The candidate clustering result with the largest Fisher’s
discriminant score qualifies for further analysis.
Each cluster of genes can be functionally related and
participate in the same pathway of DNA methylation.
By further analysis of the sequences, we can find the
characteristic binding sites for each cluster of genes and discover
previously unknown epigenetic binding factors.
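
The selection loop described above can be sketched as below. This is an illustration under assumptions, not the authors' implementation: the slides do not give the form of Fisher's criterion, so the sketch uses a Fisher-style variance ratio (between-cluster over within-cluster scatter, each divided by its degrees of freedom), Euclidean distance, and a deterministic farthest-point initialisation chosen only for reproducibility. All function names are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def kmeans(points, k, iters=100):
    # Farthest-point initialisation: start at the first point, then keep
    # adding the point farthest from all centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = mean(members)
    return labels, centers

def fisher_score(points, labels, centers):
    # Assumed criterion: (between / (k - 1)) / (within / (n - k)).
    # Requires 1 < k < n and at least one cluster with more than one point.
    n, k = len(points), len(centers)
    grand = mean(points)
    within = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    between = sum(labels.count(j) * sq_dist(centers[j], grand)
                  for j in range(k))
    return (between / (k - 1)) / (within / (n - k))

def select_best_k(points, candidate_ks):
    """Cluster once per candidate k and keep the highest-scoring result."""
    best = None
    for k in candidate_ks:
        labels, centers = kmeans(points, k)
        score = fisher_score(points, labels, centers)
        if best is None or score > best[0]:
            best = (score, k, labels)
    return best[1], best[2]
```

Calling `select_best_k(profiles, range(2, 61))` on a list of per-gene methylation profiles would scan 2 to 60 clusters; that range is itself an assumption about how the candidate counts were chosen.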
Adaptive clustering (3)

For cancer and hESCs, 41 and 59 clusters,
respectively, generate the best separation
So, 41 and 59 functional domains are
thought to underlie the 1536 genes.
Adaptive clustering (4)

In experiments:


The distance measure d is based on Pearson’s
correlation score.
N = 60.
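
The slide says only that d is based on Pearson's correlation; a common choice, assumed here, is d = 1 - r, so that perfectly co-methylated profiles are at distance 0 and perfectly anti-correlated profiles at distance 2. The function names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)

def pearson_distance(x, y):
    """Assumed form of the distance: d = 1 - r, ranging from 0 to 2."""
    return 1.0 - pearson_r(x, y)

# Two hypothetical methylation profiles that rise and fall together.
print(pearson_distance([0.1, 0.5, 0.9], [0.2, 0.6, 1.0]))
```

With this distance, two genes cluster together when their methylation levels move in step across samples, regardless of their absolute levels.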
Adaptive clustering (5)

For hESCs, the clusters of co-methylated genes, e.g.,
MAGEA1, STK23, EFNB1, MKN3, TMEFF2, AR, FMR1, are mostly
related to hESC activities such as differentiation, self-renewal, and migration.
Adaptive clustering (6)

For cancer cells, the clusters of co-methylated genes, e.g.,
RASGRF1, MYC, and CFTR, are highly involved in cell apoptosis, DNA
repair, tumour suppression, and ion transport, which are typically
the immunological activities of cells against DNA damage.
Adaptive clustering (7)

In particular, we discover:
Gene CFTR (7q31), long a focus of medical research, is co-methylated
with MT1A (16q13) and KCNK4 (11q13).
CFTR defects contribute to the disease cystic fibrosis (CF).
One in twenty-two people of European descent carries one gene for
CF, making it the most common lethal genetic disease among such
people, still with no cure at the present time.
The CFTR and KCNK4 proteins form ion channels across cell
membranes, while MT1A proteins bind the ions as
transporters. All three are involved in the transport of ions across
the cell membrane, so they are functionally related.
They can participate in the same pathway, the breakdown of which
can explain the process of tumorigenesis
Adaptive clustering (8)

To summarize:
Co-methylation occurs widely across the whole
genome
It dominates the growth and development of
various types of cells
Different cells exhibit different patterns of co-methylation
Our adaptive clustering algorithm can naturally
capture the group-wise activities in these cells.
Conclusion


Genome-wide epigenetic analysis: a promising direction for
research and industry
The logic of DNA methylation can be learned and interpreted
using our proposed ACS4 algorithm
Just 2 attributes are good enough to separate the 3 cell types
No need for the 40 attributes selected by Fisher’s score in the Genome Research paper.
Wet-lab cost is significantly reduced by testing just 2 attributes
instead of 40, which is more cost-effective.
More accurate, by adaptively partitioning the attribute domain
The knowledge learned is human understandable and helps biologists
design wet-lab tests for further investigation
Adaptive clustering
Epigenetic events are highly active in cancer and hESCs.
Functionally related genes are co-methylated.
Patterns of co-methylation differ markedly between cancer and hESCs,
highlighting the versatile roles of epigenetic events in cell function.
Thanks!