H. Huang Research Group in Computational Biology


Translational Bioinformatics
Haiyan Huang
May 12, 2010
Credits of some slides:
Atul Butte, Stanford U
Russ Altman, Stanford U
Background: bioinformatics in the postgenomic era
• DNA level
– gene annotation, functional elements
identification
• mRNA level
– transcription regulation, gene coexpression, pathways
• Protein level
– protein structure/function; protein-protein interaction
• Systems Biology
– complex interactions in biological
systems
• Translational bioinformatics
– integrative information retrieval for
aiding disease diagnosis
Central Dogma of Biology
What is Translational
Bioinformatics?
• (by Russ Altman, MD, PhD) Using the toolkit of
bioinformatics to understand the relationship of
molecular information to diseases and
symptoms, in order to improve diagnosis,
prognosis, and therapy.
What is translational
bioinformatics?
(by Atul Butte, MD, PhD)
• Translational bioinformatics
– Development of analytic, storage, and interpretive methods
– Optimize the transformation of increasingly voluminous
genomic and biological data into diagnostics and
therapeutics for the clinician
(Research on the development of novel techniques for the
integration of biological and clinical data)
• End product of translational bioinformatics
– Newly found knowledge from these integrative efforts that
can be disseminated to a variety of stakeholders, including
biomedical scientists, clinicians, and patients
Why translational bioinformatics?
• There is an increasing call for translational medicine:
Universities, Congress, NIH, and elsewhere: “What
did we get for our money?”
• Incredible amounts of publicly-available data
– GenBank: Hundreds of organisms have been completely
sequenced
– GEO and ArrayExpress hold numerous samples from
thousands of experiments
– NCBI dbGaP for the interaction of genotype and
phenotype. Such studies include genome-wide association
studies, medical sequencing, association between
genotype and non-clinical traits, etc.
Example I
Example II
Example III
Translational Bioinformatics
Tasks
• Using high-throughput genomic measurements
for improved diagnosis/prognosis/therapy
– New classifications of disease based on molecular
markers
– Identify new drug targets based on molecular profiling
of disease
• Understanding disease pathology and genetic
pathways in complex multigenic disorders
– Create systems for physician decision support using
genetic information
Transforming Public Gene Expression
Repositories into a Disease Diagnosis
Database
Reference:
Huang H, Liu C, Zhou XJ (2010). Bayesian Approach to Transforming
Public Gene Expression Repositories into Disease Diagnosis
Databases. Proc Natl Acad Sci.USA. 2010 Apr 13;107(15):6823-8.
Example project in translational
bioinformatics
Transforming public expression repositories into a
disease diagnosis database
• The public microarray data grows 1.5-fold per year
– NCBI Gene Expression Omnibus (GEO): > 330,000 experiments
– EBI ArrayExpress: > 115,000 experiments
• Largest database systematically documenting the genome-wide molecular basis of diseases
– heart disease, mental illness, infectious disease, and a wide
variety of cancers.
Joint work with X.J. Zhou (USC), J.C. Liu (USC)
Example project in translational
bioinformatics
Transforming public expression repositories into a
disease diagnosis database
• The public microarray data grows 1.5-fold per year
– NCBI Gene Expression Omnibus (GEO): > 270,000 experiments
– EBI ArrayExpress: > 115,000 experiments
• An unprecedented opportunity to study human diseases
• Expression-based-diagnosis would be particularly useful
when the potential disease is not obvious or when the
disease lacks biochemical diagnostic tests
Our goal: turning the NCBI GEO into an
automated disease diagnosis system
• Thus far, no effective method is available for this
purpose. Existing approaches have been
– of limited scale, i.e., within single laboratories,
– targeting specific types of disease,
– lacking the integration of the heterogeneous datasets
(i.e., from different experimental sources, with diverse
phenotypes, containing the information in different
formats ).
Integrating public repositories
involves “combining the expression
data and the text information”
Expression data
Microarray Data
Phenotype information
Phenotype Concepts (e.g. diseases, perturbations, tissues )
in Unified Medical Language System (UMLS)
Three challenges towards our goal
• The gene expression data from different
laboratories cannot be compared directly due to
platform differences and systematic variation
• The disease and phenotype annotations of
datasets are heterogeneous and embedded in
text, and thus not in a workable format
• The disease diagnosis approach must robustly
characterize a query expression profile by
jointly utilizing the large amount of noisy
genomic and phenotypic data
Collecting microarray data sets
Adenocarcinoma
Asthma
…
Glaucoma
1. We initially collected 421 human microarray datasets on
the platforms U95, U133, and U133 Plus 2 from NCBI GEO.
• These three major platforms share a large number of
overlapping genes (8,358 genes)
2. We further selected 100 datasets (9169 arrays) having
subset types:
• disease state
• normal, control, non-tumor, healthy, or benign
This serves as the initial database to test our framework
Method highlights
• Data Preparation
– Standardizing the expression data to remove cross-lab and cross-platform incompatibilities (challenge 1).
– Phenotypically annotating the collected human
microarray experiments by Unified Medical Language
System (UMLS) (challenge 2).
• Bayesian disease inference (challenge 3)
• Bayesian belief network for refining the results
(challenge 3)
Method highlights
• Data Preparation
– Standardizing the expression data to remove cross-lab and cross-platform incompatibilities (challenge 1).
[Figure: a disease profile and a control profile are combined into a log-rank-ratio standardized profile]
– Phenotypically annotating the collected human
microarray experiments by Unified Medical Language
System (UMLS) (challenge 2).
• Bayesian disease inference (challenge 3)
• Bayesian belief network for refining the results
(challenge 3)
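The slide names a log-rank-ratio standardized profile without giving the formula. The following is a minimal sketch of one plausible reading, not the authors' exact procedure: genes are ranked within each array, ranks are averaged over the disease arrays and over the control arrays of the same dataset, and the log of the ratio is taken per gene. The function names, the averaging step, and the toy numbers are all assumptions made for illustration.

```python
import numpy as np

def rank_within_array(values):
    """Rank genes within one array: 1 = lowest expression, n = highest.
    Ties are broken by position (an assumption; the slide does not say)."""
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks

def log_rank_ratio(disease_arrays, control_arrays):
    """Per-gene log ratio of mean within-array rank, disease over control.

    Both inputs have shape (n_arrays, n_genes) and come from the same dataset.
    """
    disease_ranks = np.array([rank_within_array(a) for a in disease_arrays])
    control_ranks = np.array([rank_within_array(a) for a in control_arrays])
    return np.log(disease_ranks.mean(axis=0) / control_ranks.mean(axis=0))

# Toy data with 3 genes. The second "lab" measures the same samples on a
# different scale (a monotone transformation of the first lab's values).
disease_lab1 = np.array([[5.0, 1.0, 3.0]])
control_lab1 = np.array([[1.0, 5.0, 3.0]])
profile_lab1 = log_rank_ratio(disease_lab1, control_lab1)
profile_lab2 = log_rank_ratio(disease_lab1 * 100 + 7, control_lab1 * 100 + 7)
```

Because ranks are unchanged by any monotone rescaling of an array, the two toy profiles coincide, which is the property that makes rank-based profiles comparable across laboratories and platforms.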
Method highlights
• Are the cross-dataset comparisons by standardized profiles consistent with the phenotype annotations?
Data Standardization: an example
Figure. (a) Scatterplot of expression profiles on the same samples, GSM21236 (GDS817: Breast Cancer cells MDA-MB-436) versus GSM21242 (GDS820: Breast Cancer cells MDA-MB-436); (b) Scatterplot of expression rank profiles of the same samples as in (a); (c) Scatterplot of expression ratio profiles GSM21236/GSM21240 (from GDS817) versus GSM21242/GSM21246 (from GDS820) in log scale; (d) Scatterplot of expression rank ratio profiles of the same sample pairs as in (c) in log scale. It is obvious that log rank ratio profiles of the same sample pairs are comparable across platforms.
Figure 1. (a) Scatterplot of original expression profiles on the same biological samples, GSM21236 (in GDS817) versus GSM21242 (in GDS820); (b) Scatterplot of standardized profiles GSM21236/GSM21240 (in GDS817) versus GSM21242/GSM21246 (in GDS820); (c) Scatterplot of original expression profiles on different biological samples, GSM31102 (in GDS855) versus GSM31127 (in GDS854); (d) Scatterplot of standardized profiles GSM31102/GSM31092 (in GDS855) versus GSM31127/GSM31117 (in GDS854).
Standardized profiles are consistent
with the phenotype annotations
Red Curve: different labs, same type of biological samples, same platform
(Pearson correlations between GDS1372 and GDS1665)
Blue curve: different labs, different biological samples, same platform
(Pearson correlations between GDS1665 and GDS1917)
Before
Standardization
After
Standardization
Method highlights
• Data Preparation
– Standardizing the expression data to remove cross-lab and cross-platform incompatibilities (challenge 1).
– Phenotypically annotating the collected human
microarray experiments by Unified Medical Language
System (UMLS) concepts (challenge 2).
• UMLS also provides the language processing tool MetaMap
to enable the automated mapping of text onto UMLS
concepts
• Processing the metadata has been one of the major efforts in
recent Translational Bioinformatics research
• Bayesian disease inference (challenge 3)
• Bayesian belief network for refining the results
(challenge 3)
Unified Medical Language
System (UMLS)
• UMLS consists of three major components:
– Metathesaurus: > 1 million biomedical concepts from over 100 data sources.
– Semantic Network: defines relationships between concepts.
– Lexical resources: natural language processing tools that can map text onto the concepts.
UMLS Metathesaurus
• Cluster synonymous terms into a single UMLS concept
• Choose the preferred term
• Assign the unique identifier
Example: UMLS concept C0001403, preferred term "Addison's disease"
Term                                        Source             Source ID
Addison's disease                           SNOMED CT          363732003
Addison's Disease                           MedlinePlus        T1233
Addison Disease                             MeSH               D000224
Bronzed Disease                             SNOMED Intl 1998   DB-70620
Deficiency; corticorenal, primary           ICPC2-ICD10        MTHU021575
Primary Adrenal Insufficiency               MeSH               D000224
Primary hypoadrenalism syndrome, Addison    MedDRA             10036696
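The clustering of synonyms under one identifier can be pictured as a small lookup structure. This is only an illustrative sketch: the pairing of each term with a source vocabulary and source code follows the order in which they appear on the slide, and the dictionary layout and the function name `build_term_index` are hypothetical, not the Metathesaurus file format.

```python
# One Metathesaurus-style concept: synonymous terms from several source
# vocabularies clustered under a single concept unique identifier (CUI).
# The (term, source, code) pairing is read off the slide in order (assumption).
concept = {
    "cui": "C0001403",
    "preferred_term": "Addison's disease",
    "atoms": [
        ("Addison's disease", "SNOMED CT", "363732003"),
        ("Addison's Disease", "MedlinePlus", "T1233"),
        ("Addison Disease", "MeSH", "D000224"),
        ("Bronzed Disease", "SNOMED Intl 1998", "DB-70620"),
        ("Deficiency; corticorenal, primary", "ICPC2-ICD10", "MTHU021575"),
        ("Primary Adrenal Insufficiency", "MeSH", "D000224"),
        ("Primary hypoadrenalism syndrome, Addison", "MedDRA", "10036696"),
    ],
}

def build_term_index(concepts):
    """Map every lower-cased synonym to the CUI of the concept it belongs to."""
    index = {}
    for entry in concepts:
        for term, _source, _code in entry["atoms"]:
            index[term.lower()] = entry["cui"]
    return index

term_index = build_term_index([concept])
```

With such an index, any of the listed spellings resolves to the same identifier, for example `term_index["bronzed disease"]`.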
Dataset UMLS annotation construction
http://www.ncbi.nlm.nih.gov/projects/geo/gds/gds_browse.cgi?gds=563
1. Take summary in dataset
2. Take PMID
3. Download MeSH headings from PubMed by PMID as follows:
MH - Muscle Proteins/*genetics
MH - Muscle, Skeletal/cytology/*pathology/physiology/physiopathology
MH - Muscular Dystrophy, Duchenne/*genetics/*pathology
MH - Oligonucleotide Array Sequence Analysis
… etc.
4. Parse both summary and MeSH headings by MetaMap to UMLS concepts as follows:
C0013264 Muscular Dystrophy, Duchenne
C0752352 Muscular Disorders, Atrophic
C0242692 Skeletal muscle structure
C0027868 Neuromuscular Diseases
… etc.
(1) Nervous system disorder
(2) Neuromuscular Diseases
(3) Myopathy
(4) Musculoskeletal Diseases
(5) Congenital, Hereditary, and Neonatal Diseases
and Abnormalities
(6) Genetic Diseases, Inborn
(7) Genetic Diseases, X-Linked
(8) Muscular Disorders, Atrophic
(9) Muscular Dystrophies
(10) Muscular Dystrophy, Duchenne
Table 1. The phenotype annotation set for the NCBI GEO dataset GDS563
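The MeSH-heading step of the annotation pipeline can be sketched as a small parser for the `MH` lines shown on the slide. This is a hedged illustration only: the function name `parse_mesh_headings` and the output layout are hypothetical, and the subsequent MetaMap mapping to UMLS concepts is deliberately not shown because it depends on an external tool.

```python
import re

# Matches lines such as "MH - Muscle Proteins/*genetics".
MH_LINE = re.compile(r"^MH\s*-\s*(.+)$")

def parse_mesh_headings(medline_text):
    """Extract MeSH descriptors and qualifiers from MEDLINE-style MH lines.

    A leading '*' marks a major topic; it is stripped from the names and
    recorded in the 'major' flag instead.
    """
    headings = []
    for line in medline_text.splitlines():
        match = MH_LINE.match(line.strip())
        if not match:
            continue
        descriptor, *qualifiers = match.group(1).split("/")
        headings.append({
            "descriptor": descriptor.lstrip("*").strip(),
            "qualifiers": [q.lstrip("*").strip() for q in qualifiers],
            "major": descriptor.startswith("*")
                     or any(q.startswith("*") for q in qualifiers),
        })
    return headings

record = """\
MH - Muscle Proteins/*genetics
MH - Muscle, Skeletal/cytology/*pathology/physiology/physiopathology
MH - Muscular Dystrophy, Duchenne/*genetics/*pathology
MH - Oligonucleotide Array Sequence Analysis
"""
headings = parse_mesh_headings(record)
```

The descriptor strings obtained this way, together with the dataset summary, are the text that would then be passed to MetaMap.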
Processing the metadata
• Aronson, A.R. (2001) Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp: 17-21.
• Has been one of the major efforts in recent Translational Bioinformatics research
– Butte AJ, Kohane IS. (2006) Creation and implications of a phenome-genome network. Nat Biotechnol. 2006 Jan;24(1):55-62.
– Shah NH, Jonquet C, Chiang AP, Butte AJ, Chen R, Musen MA. (2009) Ontology-driven indexing of public datasets for translational bioinformatics. BMC Bioinformatics. 2009 Feb 5;10 Suppl 2:S1.
Goal: to integrate the large amount of data on various diseases to build a diagnosis
database such that users can rapidly search the disease profiles for expression
similarities and, further, for disease annotations for the query sample of interest.
We consider our disease diagnosis question as a classification problem, where
each UMLS concept represents an individual class, and all of the classes are
organized in a hierarchy.
The outcome of our analysis is to categorize a standardized query dataset into
several classes in the hierarchy. This general setting is known as hierarchical
multilabel classification (HMC) in the machine learning field.
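A defining feature of hierarchical multilabel classification is that a label set should respect the hierarchy: assigning a class implies assigning all of its ancestors. The sketch below illustrates that constraint on a toy hierarchy. The parent links are hypothetical and chosen only for illustration (they are not actual UMLS relations), and the helper names are assumptions.

```python
# Hypothetical fragment of a concept hierarchy: child -> list of parents.
# A concept may have several parents, so the structure is a DAG, not a tree.
PARENTS = {
    "Muscular Dystrophy, Duchenne": ["Muscular Dystrophies",
                                     "Genetic Diseases, X-Linked"],
    "Muscular Dystrophies": ["Muscular Disorders, Atrophic"],
    "Muscular Disorders, Atrophic": ["Neuromuscular Diseases"],
    "Genetic Diseases, X-Linked": ["Genetic Diseases, Inborn"],
    "Neuromuscular Diseases": [],
    "Genetic Diseases, Inborn": [],
}

def with_ancestors(labels, parents):
    """Return the label set closed under the 'parent' relation."""
    closed = set()
    stack = list(labels)
    while stack:
        node = stack.pop()
        if node not in closed:
            closed.add(node)
            stack.extend(parents.get(node, []))
    return closed

def is_consistent(labels, parents):
    """True if every assigned class has all of its parents assigned too."""
    return all(p in labels for c in labels for p in parents.get(c, []))

full_annotation = with_ancestors({"Muscular Dystrophy, Duchenne"}, PARENTS)
```

In this toy hierarchy a single specific label expands to six classes, which mirrors how one disease dataset ends up with a whole set of increasingly general annotations.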
Building Bayesian classifier for each
disease class
We aim to infer $P(Q_{x,k} \mid s_{x,1},\dots,s_{x,M},\, e_{1,k},\dots,e_{M,k})$.
[Figure: database phenotype group M]
Difficulties include:
• The association strength between different microarrays and the same
disease class can vary greatly;
• The distribution of the similarity scores $s_{x,i}$ is non-standard.
To compute $P(Q_{x,k} \mid s_{x,1},\dots,s_{x,M},\, e_{1,k},\dots,e_{M,k})$, we need to
model
$P(s_{x,1},\dots,s_{x,M} \mid Q_{x,k},\, e_{1,k},\dots,e_{M,k})$ or
$P(s_{x,1},\dots,s_{x,M} \mid Q_{x,k},\, T_{1,k},\dots,T_{M,k})$ and $P(T_{1,k},\dots,T_{M,k} \mid e_{1,k},\dots,e_{M,k})$
Log-linear regression
Instead of modeling $P_{\mathrm{Alter}}$ and $P_{\mathrm{Null}}$ separately, we choose to
model the ratio
$$\frac{P_{\mathrm{Alter}}(S^x_{i,1},\dots,S^x_{i,n_i})}{P_{\mathrm{Null}}(S^x_{i,1},\dots,S^x_{i,n_i})},$$
guided by the following properties:
When all the rest are the same,
1. Larger means of scores should give larger ratios;
2. Bigger variances should give larger ratios;
3. Less skewness and kurtosis should give smaller ratios.
$$\log\left(\frac{P_{\mathrm{Alter}}(S^x_{i,1},\dots,S^x_{i,n_i})}{P_{\mathrm{Null}}(S^x_{i,1},\dots,S^x_{i,n_i})}\right) = C_0 + C_1 \cdot \mathrm{Mean}(S^x_{i,1},\dots,S^x_{i,n_i}) + C_2 \cdot \mathrm{Var}(S^x_{i,1},\dots,S^x_{i,n_i}) + C_3 \cdot \mathrm{Skewness}(S^x_{i,1},\dots,S^x_{i,n_i}),$$
where
$$\mathrm{Skewness} = \frac{\sqrt{n(n-1)}}{n-2} \cdot \frac{m_3}{m_2^{3/2}}.$$
Note that $m_r$ is the $r$th moment about the mean.
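The log-linear model of the likelihood ratio can be evaluated with a few lines of code. The sketch below is illustrative only: the coefficient values are invented, the unbiased variance estimator (`ddof=1`) is an assumption because the slide does not say which estimator is used, and the skewness follows the adjusted sample skewness built from the central moments $m_2$ and $m_3$.

```python
import numpy as np

def sample_skewness(scores):
    """Adjusted sample skewness: sqrt(n(n-1))/(n-2) * m3 / m2**1.5.

    m_r is the r-th moment about the mean. Needs n >= 3 and non-constant scores.
    """
    x = np.asarray(scores, dtype=float)
    n = x.size
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    return np.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5

def log_likelihood_ratio(scores, c0, c1, c2, c3):
    """log(P_Alter / P_Null) modeled as a linear function of mean, variance, skewness."""
    x = np.asarray(scores, dtype=float)
    return c0 + c1 * x.mean() + c2 * x.var(ddof=1) + c3 * sample_skewness(x)

# Hypothetical coefficients; in practice they would be fitted to training data.
COEFFS = dict(c0=-1.0, c1=4.0, c2=2.0, c3=0.5)

symmetric_scores = [0.1, 0.2, 0.3, 0.4, 0.5]   # toy similarity scores
skew = sample_skewness(symmetric_scores)
value = log_likelihood_ratio(symmetric_scores, **COEFFS)
```

For the symmetric toy scores the skewness term vanishes, so the value is driven by the mean and variance terms alone.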
Making the annotation predictions
For a query profile x,
we will diagnose it with UMLS concept $U_k$ if
$$\frac{P(Q_{x,k}=1 \mid S^x_{1,1},\dots,S^x_{1,n_1},\dots,S^x_{M,1},\dots,S^x_{M,n_M},\, e_{1,k},\dots,e_{M,k})}{P(Q_{x,k}=1)} \ge \lambda.$$
In our study, we set λ to be 4.5.
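The decision rule compares the posterior probability of a concept with its prior. A minimal sketch follows, assuming the posteriors have already been computed; the probabilities are made up, and treating the comparison as "greater than or equal to" is an assumption because the relation symbol is not legible on the slide.

```python
LAMBDA = 4.5  # threshold reported on the slide

def diagnose(posteriors, priors, threshold=LAMBDA):
    """Return the concepts whose posterior-to-prior ratio reaches the threshold."""
    return [concept for concept, posterior in posteriors.items()
            if posterior / priors[concept] >= threshold]

# Hypothetical probabilities for two concepts.
priors = {"Myopathy": 0.02, "Asthma": 0.03}
posteriors = {"Myopathy": 0.30, "Asthma": 0.04}
diagnosed = diagnose(posteriors, priors)
```

Using a ratio rather than the raw posterior means that a rare concept can still be reported when the evidence raises its probability many times over its baseline.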
Refining the inferred annotations
1. The UMLS concepts A, B, C, D, …, I are
organized in a directed acyclic graph (DAG),
with each node corresponding to a concept.
2. Given a query profile, and for each
concept, we use $\hat{A}, \hat{B}, \dots$ to denote the
obtained Bayesian annotation.
We note that for some nodes, the first-round
prediction is missing.
By associating (conditional) probabilities
with the DAG, we formulate our problem as a Bayesian
Belief Network (BBN).
We implemented a popular exact inference method, variable
elimination, to infer
$P(A, B, C, D, \dots, I \mid \hat{A}, \hat{B}, \dots, \hat{I}, \mathrm{DAG})$
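To make the refinement idea concrete, here is a toy two-node network (parent concept A, child concept B) in which the true labels are hidden and the first-round calls are noisy observations of them. The sketch uses brute-force enumeration of all label combinations rather than variable elimination; on such a tiny network both give the same exact posterior. Every probability below is hypothetical, and a missing first-round prediction is represented by `None`.

```python
from itertools import product

P_A = 0.2                            # prior P(A = 1) (hypothetical)
P_B_GIVEN_A = {1: 0.5, 0: 0.0}       # P(B = 1 | A): B cannot hold without its parent A
P_HAT_GIVEN_TRUE = {1: 0.8, 0: 0.1}  # P(first-round call = 1 | true label)

def bernoulli(p_one, value):
    """Probability of a binary value given the probability of 1."""
    return p_one if value == 1 else 1.0 - p_one

def joint(a, b, a_hat, b_hat):
    """Joint probability of true labels and observed calls (None = call missing)."""
    p = bernoulli(P_A, a) * bernoulli(P_B_GIVEN_A[a], b)
    if a_hat is not None:
        p *= bernoulli(P_HAT_GIVEN_TRUE[a], a_hat)
    if b_hat is not None:
        p *= bernoulli(P_HAT_GIVEN_TRUE[b], b_hat)
    return p

def posterior_marginals(a_hat, b_hat):
    """Exact P(A = 1 | calls) and P(B = 1 | calls) by enumerating all (A, B)."""
    weights = {(a, b): joint(a, b, a_hat, b_hat)
               for a, b in product((0, 1), repeat=2)}
    total = sum(weights.values())
    p_a = sum(w for (a, _), w in weights.items() if a == 1) / total
    p_b = sum(w for (_, b), w in weights.items() if b == 1) / total
    return p_a, p_b

# The parent's first-round call is missing; the child was called positive.
p_a, p_b = posterior_marginals(a_hat=None, b_hat=1)
```

In this toy case the positive call on the child raises the belief in the parent even though the parent itself has no first-round prediction, which is the kind of correction the network is meant to provide.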
Validation Results
Diagnosed Results for GDS563
(1) Nervous system disorder
(2) Neuromuscular Diseases
(3) Myopathy
(4) Musculoskeletal Diseases
(5) Congenital, Hereditary, and Neonatal Diseases
and Abnormalities
(6) Genetic Diseases, Inborn
(7) Genetic Diseases, X-Linked
(8) Muscular Disorders, Atrophic
(9) Muscular Dystrophies
(10) Muscular Dystrophy, Duchenne
Table 1. The phenotype annotation set for the NCBI GEO dataset GDS563
Case study GDS2251: Comparison of myeloid
leukemia cells to normal monocytes.
• Subset 1: normal
• Subset 2: myeloid leukemia
UMLS annotation:
1. Leukocytes, Mononuclear
2. monocyte
3. Bone Marrow Cells
4. Dysmyelopoietic Syndromes
5. Myeloid Leukemia
6. Nonlymphocytic Leukemia, Acute
7. Leukemia, Myelocytic, Acute
8. Immunoproliferative Disorders
9. Lymphoproliferative Disorders
10. Lymphoblastic Leukemia
11. Myeloid Cells
12. Phagocytes
13. leukemia
14. Malignant Neoplasms
…etc.
Bayesian prediction: Precision 59.1%, recall 92.9%
1. Leukocytes, Mononuclear
2. monocyte
3. Bone Marrow Cells
4. Dysmyelopoietic Syndromes
5. Myeloid Leukemia
6. Nonlymphocytic Leukemia, Acute
7. Leukemia, Myelocytic, Acute
8. Immunoproliferative Disorders
9. Lymphoproliferative Disorders
10. Lymphoblastic Leukemia
11. Myeloid Cells
12. Phagocytes
13. Leukemia
15. Stem cells
16. Hematopoietic stem cells
17. Bone Marrow
18. Myeloid Progenitor Cells
19. Chromosomal translocation
20. Chromosome abnormality
21. neutrophil
22. Myeloproliferative disease
23. granulocyte
[Figure: hierarchy graph whose nodes are labeled with the concept numbers above]
• Chromosome abnormality and translocation are highly related to myeloid leukemia.
• “Bone Marrow” is not in the original set of annotations but “Bone Marrow Cells” is. Therefore, “Bone Marrow” should be a correct annotation.
• Myeloproliferative disease may evolve into myeloid leukemia.
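The precision and recall quoted for this case study can be checked from the two concept lists. The sketch below assumes the lists as they can be read from the slide and treats the 14 listed annotations as the complete reference set, which is an assumption (the slide ends that list with "etc."), although it does reproduce the reported figures.

```python
annotated = [
    "Leukocytes, Mononuclear", "monocyte", "Bone Marrow Cells",
    "Dysmyelopoietic Syndromes", "Myeloid Leukemia",
    "Nonlymphocytic Leukemia, Acute", "Leukemia, Myelocytic, Acute",
    "Immunoproliferative Disorders", "Lymphoproliferative Disorders",
    "Lymphoblastic Leukemia", "Myeloid Cells", "Phagocytes", "leukemia",
    "Malignant Neoplasms",
]
# Predictions: the first 13 annotated concepts plus 9 further concepts.
predicted = annotated[:13] + [
    "Stem cells", "Hematopoietic stem cells", "Bone Marrow",
    "Myeloid Progenitor Cells", "Chromosomal translocation",
    "Chromosome abnormality", "neutrophil", "Myeloproliferative disease",
    "granulocyte",
]

def precision_recall(predicted_terms, reference_terms):
    """Precision and recall of a predicted concept set, ignoring letter case."""
    pred = {t.lower() for t in predicted_terms}
    ref = {t.lower() for t in reference_terms}
    hits = len(pred & ref)
    return hits / len(pred), hits / len(ref)

precision, recall = precision_recall(predicted, annotated)
precision_pct = round(100 * precision, 1)   # 13 of 22 predictions are annotated
recall_pct = round(100 * recall, 1)         # 13 of 14 annotations are recovered
```

Note that some of the "false" predictions, such as "Bone Marrow", are arguably correct, so the precision computed this way understates the quality of the prediction.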
Results: Leave-One-out Cross Validation
Our method achieved an overall accuracy of 95%
(precision 82% and recall 20%)
The problem is analogous to other biological
hierarchical multilabel classification problems, such
as gene function prediction, for which the best
reported performance in the mouse model is a precision
of 41% at a recall of 20% (L. Pena-Castillo et al., Genome Biol, 2008).
Reduce the number of datasets
Further accumulation of datasets would increase
the power of our method.
Discussions
• Power of this type of approach will
increase daily with the continuous and
rapid accumulation of genomics data in
the public repositories.
• Our diagnosis system is also promising in
its potential to reveal unexpected disease
connections, and further to construct novel
phenome networks.
• Better tools for text-mining are needed!
Drug-disease connectivity map
UMLS concepts:
Bone Marrow Cells
Dysmyelopoietic Syndromes
Hematopoietic stem cells
Immunoproliferative Disorders
Leukemia, Myelocytic, Acute
Lymphoproliferative Disorders
Myeloid Leukemia
Nonlymphocytic Leukemia, Acute
Stem cells
leukemia
The authors’ writing style affects the UMLS annotations.
Ongoing Work I
• To further improve the method's predictive
power (Ci-Ren Jiang)
– (focusing on the second stage) Applying a
different approach for a more thorough
collaborative error correction along the
disease hierarchy
– The previous Bayesian Belief Network model is
time-consuming and only allows one-way
information exchange
One Example
Draft Idea
Some Results
Ongoing Project II
• Focusing on particular disease diagnosis by
collaborating with specialized medical
doctors/researchers, e.g., comparing bipolar
disorder vs schizophrenia (Wayne Lee; Dr. Fei
Wang, MD, PhD, Psychiatry, Yale University)
– To investigate the usefulness/effectiveness of high-throughput molecular information in distinguishing
between bipolar disorder and schizophrenia
– To compare existing clinical predictors with predictors
from microarray data
Acknowledgements
– Dr. X. Jasmine Zhou (Molecular and Computational
Biology, USC)
– Dr. Jim Chun-Chih Liu (Molecular and Computational
Biology, USC)
– Dr. Ming-Chih Kao (Stanford Hospital, PhD, MD)
– Dr. Ci-Ren Jiang (UCB)
– Wayne Lee (UCB)
– Dr. Fei Wang (Yale U)
Thank You!
Barutcuoglu et al 2006
• Start with independently trained classifiers
(hard-margin linear SVMs) for each class without
thresholding the outputs
• Assume the aggregate classifier outputs to have
Gaussian distributions for positive and negative
examples
• Design a Bayesian hierarchical combination
scheme to allow collaborative error-correction
over all nodes
Clus-HMC (Vens et al 2008)
• Apply predictive clustering tree (PCT) to
hierarchical multi-label classification
• The example labels are represented with
Boolean components
• Weighted Euclidean distance is used to measure
similarities
• The class weights decrease with the depth of the
class in the hierarchy
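The distance used by Clus-HMC can be illustrated on Boolean label vectors. The sketch below assumes an exponentially decaying weight `W0 ** depth` with `W0 = 0.75`; the slide only states that weights decrease with depth, so the exact weighting scheme and the toy depths are assumptions.

```python
import math

W0 = 0.75  # hypothetical decay factor; weight of a class = W0 ** depth

def class_weights(depths, w0=W0):
    """Weight per class, decreasing with the class's depth in the hierarchy."""
    return [w0 ** d for d in depths]

def weighted_euclidean(labels_a, labels_b, weights):
    """Weighted Euclidean distance between two Boolean (0/1) label vectors."""
    return math.sqrt(sum(w * (a - b) ** 2
                         for w, a, b in zip(weights, labels_a, labels_b)))

# Toy chain hierarchy with three classes at depths 1, 2, and 3.
DEPTHS = [1, 2, 3]
WEIGHTS = class_weights(DEPTHS)

# Disagreement on the shallow class versus disagreement on the deep class.
d_shallow = weighted_euclidean([1, 0, 0], [0, 0, 0], WEIGHTS)
d_deep = weighted_euclidean([1, 1, 1], [1, 1, 0], WEIGHTS)
```

A disagreement near the root therefore counts for more than a disagreement on a very specific class, which matches the intuition that confusing broad disease categories is the more serious error.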
Clus-HMC (Vens et al 2008)
References
• Zafer Barutcuoglu, Robert E. Schapire and Olga
G. Troyanskaya (2006), “Hierarchical multi-label
prediction of gene function,” Bioinformatics 22,
830-836
• Celine Vens, Jan Struyf, Leander Schietgat,
Saso Dzeroski, and Hendrik Blockeel (2008),
“Decision trees for hierarchical multi-label
classification,” Machine Learning 73, 185-214
Background: bioinformatics
Bioinformatics is an interdisciplinary research area which may
be defined as the interface between biological and mathematical
(i.e., math, statistics, computing) sciences.
In bioinformatics research,
• Biological data are often high dimensional, complex and noisy,
e.g.,
– NCBI GEO contains the microarray data produced by thousands of
research teams (cross-platform/laboratory variations )
– Some investigations may rely upon a variety of experimental
technologies (heterogeneous data types)
• The demands on statisticians are substantial
– precise understanding of the underlying biological principles
– strong analytical, modelling and data manipulation skills