Systems Biology Approach to Biomarker Discovery


Integrative Omics for Cancer Biology
Xiang Zhang, PhD
Department of Chemistry
Center for Regulatory and Environmental Analytical Metabolomics
University of Louisville, Louisville, KY 40292
Systems Biology
is a field of biology that aims at a systems-level understanding of
biological processes, in which many parts are connected to
one another and work together. It attempts to create predictive
models of cells, organs, biochemical processes and complete
organisms.
• Integrative systems biology
Extracting biological knowledge from the ‘omics through integration
• Predictive systems biology
Predicting the future of a biosystem using ‘omics knowledge, e.g. in-silico biosystems
Davidov, E.; Clish, C. B.; Oresic, M.; Zhang, X; et al. Omics: A Journal of Integrative Biology. 2004, 8, 267-288.
Clish, C. B.; Davidov, E.; Oresic, M.; Zhang, X; et al. Omics: A Journal of Integrative Biology. 2004, 8, 3-13.
Omics Space
Differential omics is
the beginning of
Systems Biology
molecule
cell
tissue
organism
…
Differential Proteomics &
Metabolomics
1. Differential proteomics and metabolomics are the qualitative and
quantitative comparison of the proteome and metabolome under
different conditions, with the aim of unraveling complex biological
processes
2. They can be used to study any scientific phenomenon that may
change the proteome and/or metabolome of a living system.
NIH
• Cancer Biomarker Discovery
• Nano-medicine
• Environment
• Food and nutrition
• Preventative medicine
Biomarker Discovery is a Major
Research Field of Differential Omics
Biomarkers are naturally occurring biomolecules useful
for measuring the prognosis and/or progress of diseases
and therapies.
• These substances may be normally present in small amounts
in the blood or other tissues
• When the amounts of these substances change, they may
indicate disease.
• Valid biomarkers should
  - demonstrate drug activity sooner
  - facilitate clinical trial design by defining patient populations
  - optimize dosing for safety and efficacy
  - be sensitive and easy to assay to speed drug development
What Types of Change Are
Expected?
[Figure: expected changes, split into "protein structure unchanged" and "protein structure is changed", with the labels degradation, posttranslational modification, concentration, and sequence (mutation)]
• Sensing structural change is a major element of comparative proteomics
• Most metabolomics work focuses on concentration change only.
Challenges in Proteomics
• Sample complexity
  - About 25K protein-coding genes are present in human. The IPI human database (v3.25) has 67,250 entries, which could generate about 10^6 to 10^8 peptides
  - More than one hundred post-translational modifications (PTMs) could happen in a proteome
• Large protein concentration difference
  - 10^7 to 10^8 in human cells, and at least 10^12 in human plasma
  - Dynamic range of an LC-MS is about 10^4 to 10^6
• The top 12 high-abundance proteins constitute approximately 95% of total protein mass of plasma/serum
  - Albumin, IgG, Fibrinogen, Transferrin, IgA, IgM, Haptoglobin, alpha 2-Macroglobulin, alpha 1-Acid Glycoprotein, alpha 1-Antitrypsin and HDL (Apo A-I & Apo A-II).
• Dynamic system, large subject variation
[Figure: Body fluid profiling: biomarker platform. A concentration scale labeled g/ml, ng/ml, pg/ml, with generic sample prep. for high concentration compounds and focused sample prep. for low concentration compounds]
Challenges in Metabolomics
•Metabolites have a wide range of molecular
weights and large variations in concentration
•The metabolome is much more dynamic
than the proteome and genome, which makes the
metabolome more time sensitive
•Metabolites can be either polar or nonpolar,
as well as organic or inorganic molecules.
This makes the chemical separation a key
step in metabolomics
•Metabolites have diverse chemical structures, which
makes identification using MS an
extreme challenge
[Figure: chemical structure of cholesta-3,5-diene]
Differential Omics
biomarker discovery
[Figure: molecules A, B, C, D, …, Z profiled in eight samples (S1 to S8), grouped into Diseased and Healthy]
Informatics Platform
[Figure: informatics platform workflow built around a LIMS. Boxes: Sample information, Experiment design, Experiment execution, Raw data transformation, Spectrum deconvolution, Peak alignment, Peak normalization, Quality control, Significance test, Regulated peaks, Pattern recognition, Cluster loadings, Molecular identification, Regulated molecules, Unidentified molecules (targeted tandem MS), Molecular validation, Correlation, Interaction, Pathway modeling, Molecular networks, Protein Function, Knowledge assembling, and a data re-examination loop]
Roadmap
Systems Biology
Differential omics
1. Experimental design
2. Molecular identification
3. Data preprocessing
4. Statistical significance test
5. Pattern recognition
6. Molecular networks
MDLC Platforms
• MudPIT, i.e. SCX followed by RP
  - The proteome is split into 10-20X more fractions
  - There is carry-over between fractions
  - LC fractions generally are still too complex for MS
• Affinity Selection
  - Avidin selection of Cys-containing peptides
  - Cu-IMAC for His-containing peptides
  - Ga-IMAC for phosphorylated peptides
  - Lectins for glycosylated peptides
[Figure: sample workflow with the labels Sample, Digestion, Affinity Selection, SCX, fractions F1 and F2, and RPC-MS]
Qiu, R.; Zhang, X. and Regnier, F. E. J. Chromatogr. B. 2007, 845, 143-150.
Wang, S.; Zhang, X.; and Regnier, F. E. J. Chromatogr. A 2002, 949, 153-162.
Regnier, F. E.; Amini, A.; Chakraborty, A.; Geng, M.; Ji, J.; Sioma, C.; Wang, S.; and Zhang, X. LC/GC 2001, 19(2), 200-213.
Geng, M.; Zhang, X.; Bina, M.; and Regnier, F. E. J. Chromatogr. B 2001, 752, 293-306.
In-Gel Stable Isotope Labeling
a sample gel based platform
•Avoiding gel-to-gel variability
•Only labeling K-containing peptides
•Accurate quantification
Asara, J. M.; Zhang, X.; Zheng, B.; Christofk, H. H.; Wu, N.; Cantley, L. C. Nature Protocols, 2006, 1, 46-51.
Asara, J. M.; Zhang, X.; Zheng, B.; Christofk, H. H.; Wu, N.; Cantley, L. C. J. Proteome Res., 2006, 5, 155-163.
Ji, J.; Chakraborty, A.; Geng M.; Zhang, X.; Amini, A.; Bina, M.; and Regnier, F. E. J. Chromatogr. A 2000, 745, 197-210.
Roadmap
Systems Biology
Differential omics
1. Experimental design
2. Molecular identification
protein identification
metabolite identification
3. Data preprocessing
4. Statistical significance test
5. Pattern recognition
6. Molecular networks
Protein Identification
database searching
The database searching approach uses a protein database to
find a peptide for which a theoretically predicted spectrum best
matches experimental data.
[Figure: database search workflow with the labels Protein, Peptide, Mass, and matched peptide]
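The database searching idea can be sketched in a few lines. This is a deliberately simplified illustration, not the method of any of the search engines named in this talk: it matches only on peptide (precursor) mass, whereas real engines score full MS/MS fragment spectra. The database name `toy_protein`, the 10 ppm tolerance, and the cleavage rule (after K or R, not before P) are assumptions made for the example.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide


def tryptic_digest(protein):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        last = i == len(protein) - 1
        if last or (aa in "KR" and protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    return peptides


def peptide_mass(peptide):
    """Theoretical monoisotopic mass of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def search(database, observed_mass, tol_ppm=10.0):
    """Return (protein, peptide) pairs whose mass matches within tol_ppm."""
    hits = []
    for name, sequence in database.items():
        for pep in tryptic_digest(sequence):
            mass = peptide_mass(pep)
            if abs(mass - observed_mass) / mass * 1e6 <= tol_ppm:
                hits.append((name, pep))
    return hits


# Hypothetical one-protein database built from two peptides shown in the slides.
database = {"toy_protein": "LSPLGEEMRTHLAPYSDELR"}
observed = peptide_mass("LSPLGEEMR")  # stand-in for a measured mass
print(search(database, observed))
```

Matching on mass alone produces many false candidates in a real proteome, which is why the predicted fragment spectrum is compared with the experimental one.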
Protein Identification
database searching
More than 20 algorithms have been developed, including:
• Sequest
• Spectrum Mill
• Mascot
• X! Tandem
• OMSSA
1. About 20% of tandem MS spectra could provide confident peptide identification
2. < 50% of peptides can be identified by all algorithms
Zhang, X.; Oh, C.; Riley, C. P.; Buck, C. Current Proteomics 2007, 4, 121-130.
Protein Identification
de novo sequencing
De novo sequencing reconstructs the partial or complete sequence of a peptide directly from its MS/MS spectrum.
Performance of de novo methods is limited by low mass accuracy, mass equivalence, and completeness of fragmentation.
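A minimal sketch of the ladder-reading idea behind de novo sequencing, and of why the three limits named above matter. It assumes an idealized, complete series of singly charged b-ions; the function names and tolerances are illustrative and do not come from Lutefisk, novoHMM, Peaks, or any other tool.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728


def ladder_for(peptide):
    """Idealized b-ion ladder (starting at the bare proton) for a known peptide."""
    ladder, total = [PROTON], PROTON
    for aa in peptide:
        total += RESIDUE_MASS[aa]
        ladder.append(total)
    return ladder


def read_ladder(b_ions, tol=0.01):
    """Translate each gap between consecutive ladder peaks into residue candidates."""
    sequence = []
    for low, high in zip(b_ions, b_ions[1:]):
        gap = high - low
        matches = sorted(aa for aa, m in RESIDUE_MASS.items() if abs(m - gap) <= tol)
        # "?" marks a gap that matches no single residue, e.g. a missing fragment.
        sequence.append("/".join(matches) if matches else "?")
    return sequence


print(read_ladder(ladder_for("ALK")))            # I and L cannot be told apart
print(read_ladder(ladder_for("ALK"), tol=0.05))  # at lower accuracy, K and Q merge
```

Leucine and isoleucine have identical masses (mass equivalence), K and Q differ by only about 0.036 Da (mass accuracy), and a missing peak leaves a gap that no single residue explains (completeness of fragmentation).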
Pevtsov, S.; Fedulova, I.; Mirzaei, H.; Buck, C.; Zhang, X. Journal of Proteome Research. 2006, 5, 3018-3028.
Fedulova, I.; Ouyang, Z.; Buck, C.; Zhang, X. The Open Spectroscopy Journal 2007, 1, 1-8.
Incorporating Peptide Separation
Information for Protein Identification
structure of pattern classifier
[Figure: artificial neural network pattern classifier. Example peptide sequences (VSFLSALEEYTKK, LSPLGEEMR, DYVSQFEGSALGKQLNLK, DSGRDYVSQFEGSALGK, AKPALEDLRQGLLPVLESFK, DLATVYVDVLKDSGR, THLAPYSDELR, QGLLPVLESFKVSFLSALEEYTK, VQPYLDDFQKK, QGLLPVLESFK) pass through feature extraction (Feature 1, Feature 2, Feature 3, ..., Feature N) into an input layer (x_l), a hidden layer (weights w^h, outputs y_m), and an output layer (weights w^o, outputs z_n). Other labels: Flow through, Partition, Elution.]
Oh, C.; Zak, S. H.; Mirzaei, H.; Regnier, F. E.; Zhang, X. Bioinformatics 2007, 23, 114-118.
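Training the ANNs with a Genetic
Algorithm
[Figure: genetic algorithm training loop with the steps Encoding (each chromosome holds the network parameters w^h_{ji}, w^o_{kj}, t^h_j, t^o_k), Initial population (initial candidate solutions), Selection, Crossover, Mutation, and Best chromosome (optimal solution)]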
Oh, C.; Zak, S. H.; Mirzaei, H.; Regnier, F. E.; Zhang, X. Bioinformatics 2007, 23, 114-118.
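The training scheme on this slide can be sketched as follows. This is a toy illustration under stated assumptions, not the published implementation: the network size (2 inputs, 2 hidden units, 1 output), the tournament selection, one-point crossover, Gaussian mutation, elitism, all numeric settings, and the four training points are invented for the example. Only the idea of encoding weights and thresholds into a chromosome and evolving it comes from the slide.

```python
import math
import random

N_IN, N_HID = 2, 2
# Chromosome layout: hidden weights, output weights, hidden thresholds, output threshold.
N_GENES = N_IN * N_HID + N_HID + N_HID + 1


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(chrom, x):
    """Forward pass of the small network encoded by one chromosome."""
    wh = chrom[:N_IN * N_HID]
    wo = chrom[N_IN * N_HID:N_IN * N_HID + N_HID]
    th = chrom[N_IN * N_HID + N_HID:N_IN * N_HID + 2 * N_HID]
    to = chrom[-1]
    hidden = [sigmoid(sum(wh[j * N_IN + i] * x[i] for i in range(N_IN)) - th[j])
              for j in range(N_HID)]
    return sigmoid(sum(wo[j] * hidden[j] for j in range(N_HID)) - to)


def loss(chrom, data):
    """Mean squared error over the training set."""
    return sum((predict(chrom, x) - y) ** 2 for x, y in data) / len(data)


def train(data, pop_size=30, generations=60, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(N_GENES)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=lambda c: loss(c, data))
        history.append(loss(pop[0], data))
        children = [pop[0][:]]  # elitism: the best chromosome always survives
        while len(children) < pop_size:
            # Selection: two tournaments of three candidates each.
            a = min(rng.sample(pop, 3), key=lambda c: loss(c, data))
            b = min(rng.sample(pop, 3), key=lambda c: loss(c, data))
            cut = rng.randrange(1, N_GENES)   # one-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: perturb each gene with probability 0.1.
            child = [g + rng.gauss(0, 0.3) if rng.random() < 0.1 else g for g in child]
            children.append(child)
        pop = children
    best = min(pop, key=lambda c: loss(c, data))
    return best, history


# Hypothetical peptide features scaled to [0, 1]; labels 0 and 1 stand for two fractions.
data = [((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.8, 0.9), 1), ((0.9, 0.7), 1)]
best, history = train(data)
print(history[0], loss(best, data))
```

Because the best chromosome is copied unchanged into each new generation, the best loss can never get worse from one generation to the next.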
Protein Identification Using Multiple
Algorithms and Predicted Peptide
Separation in HPLC
PIUMA architecture
[Figure: PIUMA workflow. Raw LC/MS/MS data are converted to processed MS/MS data in mzData or mzXML format. Database searching (Sequest, X! Tandem, Mascot) and de novo sequencing (Lutefisk, novoHMM, Peaks) each produce a peptide list; unmatched spectra are passed on, ending in an unknown modification search. Consensus, machine learning, and chromatography modeling based validation lead to the protein list and the report. Color legend: existing algorithms, algorithms to be developed, method descriptions.]
Oh, C.; Zak, S. H.; Mirzaei, H.; Regnier, F. E. and Zhang, X. Bioinformatics, 2007, 23, 114-118.
Zhang, X.; Oh, C.; Riley, C. P.; Buck, C. Current Proteomics 2007, 4, 121-130.
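One simple way to combine several identification algorithms is majority voting per spectrum. The sketch below is only an assumption about what a consensus step could look like; the slides do not specify how PIUMA forms its consensus. The engine names, spectrum IDs, and assignments are made up for illustration.

```python
from collections import Counter


def consensus(identifications, min_votes=2):
    """Keep, per spectrum, the peptide reported by at least min_votes engines.

    identifications: {engine_name: {spectrum_id: peptide}}
    """
    votes = {}
    for per_spectrum in identifications.values():
        for spectrum_id, peptide in per_spectrum.items():
            votes.setdefault(spectrum_id, Counter())[peptide] += 1
    accepted = {}
    for spectrum_id, counter in votes.items():
        peptide, n = counter.most_common(1)[0]
        if n >= min_votes:
            accepted[spectrum_id] = peptide
    return accepted


# Hypothetical outputs of three search engines for three spectra.
identifications = {
    "engine_a": {"scan_1": "LSPLGEEMR", "scan_2": "THLAPYSDELR"},
    "engine_b": {"scan_1": "LSPLGEEMR", "scan_2": "VQPYLDDFQKK", "scan_3": "QGLLPVLESFK"},
    "engine_c": {"scan_1": "LSPLGEEMR", "scan_2": "THLAPYSDELR"},
}
print(consensus(identifications))
```

Spectra on which the engines disagree or that only one engine identifies are dropped here; in a fuller pipeline they would go to a later validation step.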
Roadmap
Systems Biology
Differential omics
1. Experimental design
2. Molecular identification
3. Data preprocessing
Spectrum deconvolution
Quality control
Alignment
Normalization
4. Statistical significance test
5. Pattern recognition
6. Molecular networks
Spectrum Deconvolution
GISTool, single sample analysis
1. To differentiate signals arising from real analytes from
signals arising from contaminants or instrument noise
2. To reduce data dimensionality, which benefits downstream
statistical analysis.
Functionality
•Smoothing and centralization
•Peak cluster detection
•Charge recognition
•De-isotope
•Peak identification at LC level
•Doublet recognition
•Doublet quantification
GISTool Algorithm
Deconvoluting MS spectra
[Figure: single sample analysis. Deconvolution of an MS spectrum, intensity (%) vs. m/z from 747 to 751, in which the isotope clusters of a +3 peptide (748.6354, 3+) and a +2 peptide (748.9694, 2+) overlap. Labeled peaks: 748.64, 748.97, 749.29, 749.47, 749.62, 749.97, 750.50.]
Zhang, X.; Hines, W.; Adamec, J.; Asara, J.; Naylor, S.; and Regnier, F. E. J. Am. Soc. Mass
Spectrom. 2005, 16, 1181-1191.
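The charge recognition step can be illustrated with the peak labels from the spectrum on this slide. This is a sketch of the general principle (isotope peaks of a z-charged ion are spaced about 1.00335/z apart), not the GISTool code; the grouping of the labeled peaks into two clusters is an assumption read off the figure labels.

```python
ISOTOPE_SPACING = 1.00335  # approx. mass difference between 13C and 12C, in Da
PROTON = 1.00728


def charge_from_cluster(mz_values):
    """Estimate the charge state from the mean spacing of an isotope cluster."""
    mz = sorted(mz_values)
    gaps = [b - a for a, b in zip(mz, mz[1:])]
    mean_gap = sum(gaps) / len(gaps)
    return round(ISOTOPE_SPACING / mean_gap)


def neutral_mass(mono_mz, charge):
    """De-charge a monoisotopic m/z value to the neutral monoisotopic mass."""
    return charge * (mono_mz - PROTON)


# Assumed assignment of the labeled peaks to the two overlapping clusters.
cluster_a = [748.64, 748.97, 749.29]
cluster_b = [748.97, 749.47, 749.97]

z_a = charge_from_cluster(cluster_a)
z_b = charge_from_cluster(cluster_b)
print(z_a, neutral_mass(748.6354, z_a))
print(z_b, neutral_mass(748.9694, z_b))
```

Once charge and monoisotopic peak are known, each cluster collapses to a single neutral mass, which is how deconvolution reduces data dimensionality.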
Quality Assessment / Control
Biological Sample QA/C
• protein assay
Experimental Data QA/C
• 2D K-S test
• Percentile of detected peaks
• Percentile of aligned peaks
• Retention time variance vs. retention time
• m/z variance vs. retention time
• Frequency distribution of RT & m/z variance
[Figure: (1) D value, from 0.02 to 0.08, vs. sample ID, from 1 to 10; (2) retention time variation (%), from 0 to 5, vs. retention time (min), from 20 to 60]
Zhang, X.; Asara, J. M.; Adamec, J.; Ouzzani, M.; and Elmagarmid, A. K.
Bioinformatics, 2005, 21, 4054-4059.
Data Alignment
To recognize peaks of the same molecule occurring in different
samples from the thousands of peaks detected during the
course of an experiment.
1. MS to MS data alignment
•Referenced alignment
•Blind alignment
•Quality depending on the information of peak detection
2. MS to MS/MS data alignment
•Depends on experimental design
LC-MS Data Alignment
XAlign software for proteomics & metabolomics
data
• Detecting median sample
M_j = \sum_i I_{i,j} M_{i,j} / \sum_i I_{i,j}
T_j = \sum_i I_{i,j} T_{i,j} / \sum_i I_{i,j}
D_i = \sum_{j=1}^{s} |T_{i,j} - \mu_j|
[Figure: retention time difference (min), from -0.8 to 0.8, vs. retention time (min), from 10 to 70]
• Aligning samples to the median sample
[Figure: intensity of aligned peaks (sample 2) vs. intensity of aligned peaks (sample 1), both from 10 to 10000; fitted line y = 1.3636x + 16.511, R^2 = 0.9475]
Zhang, X.; Asara, J. M.; Adamec, J.; Ouzzani, M.; and Elmagarmid, A. K.
Bioinformatics, 2005, 21, 4054-4059.
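The median-sample step can be sketched as follows. The slide does not define µ_j, so this example assumes it is the intensity-weighted mean retention time T_j of peak j; the data layout (one list of `(intensity, retention_time)` pairs per sample, already matched by peak) and the numbers are invented for illustration and are not taken from XAlign.

```python
def weighted_mean_rt(samples):
    """Intensity-weighted mean retention time T_j of each peak j across samples."""
    n_peaks = len(samples[0])
    means = []
    for j in range(n_peaks):
        total_intensity = sum(s[j][0] for s in samples)
        means.append(sum(s[j][0] * s[j][1] for s in samples) / total_intensity)
    return means


def median_sample(samples):
    """Return the index of the sample with the smallest summed RT deviation D_i."""
    mu = weighted_mean_rt(samples)  # assumption: mu_j is taken to be T_j
    distances = [sum(abs(s[j][1] - mu[j]) for j in range(len(mu))) for s in samples]
    return distances.index(min(distances)), distances


# Three hypothetical samples, two peaks each: (intensity, retention time in min).
samples = [
    [(100.0, 10.0), (100.0, 20.0)],
    [(100.0, 10.2), (100.0, 20.2)],
    [(100.0, 10.4), (100.0, 20.4)],
]
index, distances = median_sample(samples)
print(index, distances)
```

The sample closest to the consensus retention times is a natural alignment reference, because every other sample then needs the smallest total shift.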
Chromatogram of Serum Analyzed on GCxGC/TOF-MS
GCxGC-MS Data Alignment
metabolite component of human serum
•Four dimensions
•1535 peaks have
been detected
GCxGC/TOF-MS Data Alignment
MSort software for metabolomics
Criteria for alignment
•1st dim. rt
•2nd dim. rt
•spec. correlation
Features
*peak entry merging
*cont. exclusion
Oh, C.; Huang, X.; Buck, C.; Regnier, F. E. and Zhang, X. J. Chromatogr. A. 2008, 1179, 205-215
Analysis Results of MAlign
53 standard acids
[Figure: (1) the number of rows in the alignment table and (2) peak area, each plotted against the number of peak entries in a row of the alignment table (1 to 16)]
8 [OA + FA] samples and 8 [AA + FA] samples
derivatization reagent: (N-Methyl-N-t-butyldimethylsilyl)-trifluoroacetamide (MTBSTFA)
Oh, C.; Huang, X.; Buck, C.; Regnier, F. E. and Zhang, X. J. Chromatogr. A. 2008, 1179, 205-215
Normalization
To reduce concentration effect and experimental variance to make the data comparable.
Methods
1. Log linear model x_{ij} = a_i \cdot r_j \cdot e_{ij}
2. Reference sample normalization
3. Auto-scaling
4. Constant mean / trimmed constant mean
5. Constant median / trimmed constant median
[Figure: intensity, from 0 to 8000, vs. peak index, from 0 to 1000]
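As an illustration of one listed method, the sketch below applies constant median normalization and then measures the coefficient of variation (CV) of each peak across samples. The choice of the mean of the sample medians as the common target and the two toy samples are assumptions for the example.

```python
from statistics import mean, median, stdev


def constant_median_normalize(samples):
    """Scale each sample so that all samples share the same median intensity."""
    medians = [median(s) for s in samples]
    target = mean(medians)  # assumed common target
    return [[x * target / m for x in s] for s, m in zip(samples, medians)]


def peak_cv(samples):
    """Coefficient of variation of each aligned peak across samples."""
    return [stdev(peak) / mean(peak) for peak in zip(*samples)]


# Two hypothetical samples of three aligned peaks; the second was loaded at
# twice the concentration, which is the effect normalization should remove.
samples = [[100.0, 200.0, 400.0], [200.0, 400.0, 800.0]]
normalized = constant_median_normalize(samples)

print(peak_cv(samples))
print(peak_cv(normalized))
```

In this idealized case the only difference between the samples is a global scale factor, so the CV drops to zero; real data keep a residual CV from experimental and biological variance.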
CV Distribution of Peak Intensities
human serum sample
[Figure: intensity variation shown as frequency histograms and relative peak number (%) curves over CV from 0.0 to 1.0. Before normalization: 20.7%; after normalization: 17.3%.]
Log linear model:
x_{ij} = a_i \cdot r_j \cdot e_{ij}
\log(x_{ij}) = \log(a_i) + \log(r_j) + \log(e_{ij})
Roadmap
Systems Biology
Differential omics
1. Experimental design
2. Molecular identification
3. Data preprocessing
4. Statistical significance test
5. Pattern recognition
6. Molecular networks
Statistical Significance Tests
To find individual peaks for which there are significant
differences between groups.
Methods
1. Pair-wise t-test (diff. mean?)
2. Mann-Whitney U test (diff. median?)
3. Kolmogorov-Smirnov test (diff. population?)
4. Kruskal-Wallis analysis of variance
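A per-peak test can be sketched with the Mann-Whitney U test from the list above, paired with a log fold change. The exact p-value here is obtained by enumerating every relabeling of the pooled values, which is only practical for small groups; that enumeration, the base-2 logarithm, and the intensity values are choices made for this example.

```python
from itertools import combinations
from math import log2


def u_statistic(x, y):
    """Number of (x, y) pairs with x > y, counting ties as one half."""
    return sum((a > b) + 0.5 * (a == b) for a in x for b in y)


def mann_whitney_exact(x, y):
    """Two-sided exact p-value by enumerating all group relabelings."""
    pooled = list(x) + list(y)
    n, m = len(x), len(y)
    center = n * m / 2
    observed = abs(u_statistic(x, y) - center)
    count = total = 0
    for idx in combinations(range(n + m), n):
        chosen = set(idx)
        gx = [pooled[i] for i in chosen]
        gy = [pooled[i] for i in range(n + m) if i not in chosen]
        total += 1
        count += abs(u_statistic(gx, gy) - center) >= observed - 1e-12
    return count / total


def log2_fold_change(case, control):
    """log2 of the ratio of group mean intensities."""
    return log2((sum(case) / len(case)) / (sum(control) / len(control)))


# Hypothetical intensities of one peak in four control and four case samples.
control = [10.0, 12.0, 11.0, 13.0]
case = [40.0, 44.0, 42.0, 50.0]
p_value = mann_whitney_exact(case, control)
fold = log2_fold_change(case, control)
print(p_value, fold)
```

Running this for every aligned peak gives one (fold change, p-value) pair per peak, which is the input for a volcano plot.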
Statistical Significance Tests
metabolome of great blue heron fertilized eggs
contaminated by PCBs (polychlorinated biphenyls)
[Figure: volcano plot of p (-log), from 0 to 8, vs. fold change (log), from -3 to 3, with fold change = I_c / I_n. Blue line: p = 0.05. Dashed line: fold change (log) = 0, separating down-regulated from up-regulated peaks.]
Roadmap
Systems Biology
Differential omics
1. Experimental design
2. Molecular identification
3. Data preprocessing
4. Statistical significance test
5. Pattern recognition
6. Molecular networks
Clustering or Classification
The resulting pattern recognition provides the first glimpse of
improved understanding of the underlying biology.
Unsupervised Methods
Principal component analysis (PCA)
Clustering objects on subsets of attributes (COSA)
Supervised Methods
Linear Discriminant Analysis (LDA)
Support vector machine (SVM)
Artificial neural network (ANN)
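To make the unsupervised case concrete, here is a small dependency-free sketch of PCA that finds only the first principal component by power iteration on the covariance matrix. Power iteration is a simplification chosen for brevity (a full PCA would normally use an eigendecomposition or SVD), and the two-peak intensity table is invented.

```python
def first_principal_component(rows, iterations=200):
    """Return the leading PCA direction and the sample scores along it."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d  # starting vector; must not be orthogonal to the answer
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    scores = [sum(c[k] * v[k] for k in range(d)) for c in centered]
    return v, scores


# Hypothetical intensities of two peaks in three healthy and three diseased samples.
healthy = [(1.0, 2.0), (2.0, 4.1), (1.5, 3.0)]
diseased = [(8.0, 16.0), (9.0, 18.2), (8.5, 17.0)]
direction, scores = first_principal_component(healthy + diseased)
print(direction)
print(scores)
```

No group labels are given to the algorithm, yet the scores along the first component separate the two groups, which is the kind of first glimpse that clustering provides.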
Cross Species Comparison
27 of the 28 control humans and all 8 control rats cluster into one group
11 of the 14 diseased humans and all diseased rats cluster into a second group
Differential Metabolomics of
Human Blood
breast cancer samples vs. control samples
Roadmap
Systems Biology
Differential omics
1. Experimental design
2. Protein identification
3. Data preprocessing
4. Statistical significance test
5. Pattern recognition
6. Molecular networks
correlation network
interaction network
regulation network
pathway analysis
Molecular Correlation Analysis
pairwise correlation of proteins and metabolites
[Figure: molecules A, B, C, D, …, Z profiled in eight samples (S1 to S8), grouped into Healthy and Diseased]
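Pairwise correlation analysis can be sketched as below: compute the Pearson correlation for every pair of molecules across the samples and keep the strongly correlated pairs as network edges. The 0.8 threshold, the choice of Pearson correlation, and the four toy profiles over eight samples are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_network(profiles, threshold=0.8):
    """Return edges (molecule_a, molecule_b, sign) with |r| >= threshold."""
    edges = []
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if abs(r) >= threshold:
            edges.append((a, b, "positive" if r > 0 else "negative"))
    return edges


# Hypothetical abundance profiles of four molecules across samples S1 to S8.
profiles = {
    "A": [1, 2, 3, 4, 5, 6, 7, 8],
    "B": [2, 4, 6, 8, 10, 12, 14, 16],
    "C": [8, 7, 6, 5, 4, 3, 2, 1],
    "D": [3, 1, 4, 1, 5, 9, 2, 6],
}
for edge in correlation_network(profiles):
    print(edge)
```

Computing the network separately for healthy and diseased samples and comparing the edge lists would show which relationships change with disease state.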
Molecular Correlation Network
an example of drug effect on disease state
• Reveals important relationships among the various components
• Complementary to abundance level information
• Provides information about the biochemical processes underlying the disease or drug response
[Figure: molecular correlation network. Example node labels: ApoE_1, ApoA1_6, FetuinA_2, Plasminogen, ALB, HDL, BUN, GLUC, C52:2 TG, C34:2 PC, C18:2 LPC, C20:4 CE, C24:1 SPM, lactate, valine, leucine, glutamine, tyrosine, phenylalanine, a-glucose, and numbered labels such as L-5a and L-5b. Legend: Lipids (LCMS), NMR (DE), NMR (CPMG), diffusion, Proteins, Peptides, Clinical; edges mark positive or negative correlation; node marks indicate higher or lower in treated group.]
Clish, C. B.; Davidov, E.; Oresic, M.; Plasterer, T.; Lavine, G.; Londo, T. R.; Meys, M.; Snell, P.; Stochaj, W.; Adourian, A.;
Zhang, X.; Morel, N.; Neumann, E.; Verheij, E.; Vogels, J, T.W.E.; Havekes, L. M.; Afeyan, N.; Regnier, F. E.; Greef, J.;
Naylor, S. Omics: A Journal of Integrative Biology 2004, 8, 3-13.
SysNet: Interactive Visual Data
Mining of Molecular Correlation
Network
An interactive integration and visualization environment for
molecular correlation of ‘omics data.
• Integrating molecular expression information generated in different ‘omics
• Visualizing molecular correlation in interactive mode
• Enabling time course data visualization and analysis
• Automatically organizing molecules based on their expression pattern in time course.
Zhang, M.; Ouyang, Q.; Stephenson, A.; Salt, D.; Kane, D. M.; Burgner J.; Buck, C. and Zhang, X.
BMC Systems Biology. Accepted.
Biomarker Verification
• Wet-lab verification
  - AQUA
  - MRM
  - Antibody
• In-silico verification
  - tracing lineage
  - pathway analysis
Automated Lineage Tracing
•Developed based on dynamic slicing
techniques used in debugging
•Applicable to any arbitrary
function
Zhang, M.; Zhang, X.; Zhang, X. and Prabhakar, S. 33rd International
Conference on Very Large Data Bases (VLDB 2007), 2007.
Lineage Tracing
•Tracing of fine-grained lineage
through run-time analysis
Analysis Software
•Interested in identifying the
connections between input and
output data for a program
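The goal of connecting each output to the inputs that produced it can be illustrated with a small value wrapper that carries its provenance through arithmetic. This is a stand-in technique (operator overloading), not the dynamic-slicing approach of the cited VLDB paper; the class name `Traced` and the peak names are hypothetical.

```python
class Traced:
    """A number that remembers which input records it was derived from."""

    def __init__(self, value, lineage):
        self.value = value
        self.lineage = frozenset(lineage)

    def _combine(self, other, op):
        # Merge lineage when both operands are traced; plain numbers add none.
        if isinstance(other, Traced):
            return Traced(op(self.value, other.value), self.lineage | other.lineage)
        return Traced(op(self.value, other), self.lineage)

    def __add__(self, other):
        return self._combine(other, lambda a, b: a + b)

    __radd__ = __add__  # addition is commutative, so sum() works

    def __mul__(self, other):
        return self._combine(other, lambda a, b: a * b)

    __rmul__ = __mul__

    def __truediv__(self, other):
        return self._combine(other, lambda a, b: a / b)


# Hypothetical peak intensities entering an analysis program.
raw = {"peak_1": 100.0, "peak_2": 50.0, "peak_3": 30.0, "peak_4": 75.0}
inputs = {name: Traced(value, {name}) for name, value in raw.items()}

# A derived quantity, e.g. a candidate biomarker ratio.
ratio = (inputs["peak_1"] + inputs["peak_2"]) / inputs["peak_3"]
print(ratio.value, sorted(ratio.lineage))
```

A reported biomarker value can then be traced back to the exact raw peaks it depends on, so a suspicious input can be checked without rerunning the whole analysis.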
Summary
• The informatics platform developed in my group can be used to analyze
protein and metabolite profiling data to differentiate disease and
normal samples for biomarker discovery
• Groups identified using clustering analysis reflected the phenotypic
categories of cancer and control samples, the animal and human
subjects, etc. with a high degree of accuracy
• SysNet, using an interactive visual data mining approach,
integrates omics data into a single environment, which
enables biologists to perform data mining
• Lineage tracing technology is an efficient and effective approach for
in-silico biomarker verification. This technique will significantly
reduce the false discovery rate (FDR) of biomarker discovery
Acknowledgements
Irina Fedulova
Dr. Hamid Mirzaei
Dr. Cheolhwan Oh
Sergey E. Pevtsov
Ouyang Qi
Alan Stephenson
Mingwu Zhang
Dr. John Burger
Dr. Michael D. Kane
Dr. Fred E. Regnier
Dr. David Salt
Dr. Mohammad Sulma
Dr. Daniel Raftery
Dr. Sunil Prabhakar
Dr. David Clemmer
Dr. John Asara
Dr. Mu Wang
Dr. Jake Chen
Dr. Steve Valentine
Dr. Steve Naylor
Postdoc Positions
Posting Title: Industrial Postdoctoral Fellow - Bioinformatician
Work Location: University of Louisville, KY
Job Type: Full time
Starting Date: Position immediately available
Job Description: Predictive Physiology and Medicine (PPM) Inc. is an exciting
health and life sciences company based in Bloomington, Indiana focused on
developing analytical systems for the individualized health and wellness industry.
We have an immediate opening for a postdoctoral fellow. The successful
candidate will develop bioinformatics systems for mass spectrometry based
quantitative proteomics and metabolomics.
Requirements: The position requires a bioinformatician with a strong
computational background. Priority will be given to candidates with a PhD in
bioinformatics, computer science, statistics, engineering, or computational physics.
The successful candidate should have a strong understanding of statistics and
pattern recognition. Programming skills in Matlab, Microsoft .NET, or Java to
accomplish analyses are required. Experience in analyzing biological data is not
required; however, interest in multidisciplinary research is a must.