The Illinois Bio-Grid: A Software Framework for Industry


Illinois Bio-Grid
Grid Computing
The Illinois Bio-Grid
Alexander B. Schilling, Ph.D.
University of Chicago
Proteomics Core Lab
[email protected]
Outline
• Bio-Medical Informatics
– Show how computability is growing exponentially
• Illinois Bio-Grid
– Describe this Grid founded at DePaul
• IBG Workbench
– Describe these grid enabled BioInformatics tools
• Mass Spec Toolkit in Cactus
– Describe plans to implement tools for spectral interpretation in Cactus
BioInformatics and Computability
• Growth of data in GenBank is exponential and doesn't show signs of slowing down yet.
– Source: GenBank/NCBI
• Compute time to process data growing equivalently
– Twice Moore's law
• Biologists don't have access to supercomputers for everyday work
• Grid computing gives Biologists more computing power affordably
Illinois Bio-Grid
• A consortium of
– Educational Institutions
– National Labs
– Private Industry
– City & State entities
– Museums
Goals
1. Provide an infrastructure of computational (and other)
resources to Biological and Medical researchers
2. Provide an infrastructure of computational (and other)
resources to Computer Scientists working on BioMedical
problems
3. Provide a tool suite of BioMedical software for BioMedical
researchers to use on the IBG computational resources
– Also for open source distribution worldwide
4. Provide an environment for CS researchers to work with
BioMedical researchers
5. Try to solve some computationally intense BioMedical
Informatics problems
6. Create a workbench of BioMedical software modules in open
source distribution to facilitate more rapid BioMedical
Informatics research by researchers worldwide
Illinois Bio-Grid Infrastructure
• DePaul
• Chicago Technology Park (Supercomputing Center of Chicago)
• Argonne MCS
• U Chicago
• Canadian NRC
• Field Museum
• IIT
Bio-Grid Workbench
• Consists of many applications important to Biological and
Medical Researchers
• All Grid enabled to provide enhanced computational power
• Genomics
• Proteomics
• Phylogenetics
• Computational Fluid Dynamics / Medical Imaging
• Cell membrane modeling
• Data Modeling LSG-RG in GGF Reference Implementation
Genomics and Proteomics 1
• Homology Searching
– Searching for proteins with the same evolutionary "ancestor"
– Smith-Waterman / Blast / FastA
– Database against database searches (instead of single sequence
against database searches)
– Allow groups of input sequences to search for homologous sequences to
all in the set
• Mass Spec Data Interpretation
– Ionize peptides and fragment them inside mass spectrometer
– Measure mass/charge ratio (m/z) of peptide ions and fragments
– Interpret resulting spectra
[Figure: mass spectrum, all scans, 0.0-0.5 min (#1-#10); intensity (0 to 1250) vs. m/z (400 to 2000); labeled peaks at m/z 441.1, 562.1, 1163.9, 1249.9, 1305.9, 1420.8, 1479.9, 1640.0, 1780.8, 1882.9]
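The database-against-database homology search listed on this slide can be sketched with a basic Smith-Waterman local alignment score. This is a minimal illustration, not the IBG implementation: the scoring values (match, mismatch, linear gap) and the function names are assumptions, and production searches typically use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b.

    The scoring parameters are illustrative assumptions, not values
    taken from the slides.
    """
    # prev holds the previous row of the dynamic-programming matrix;
    # the first row is all zeros.
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]  # first column is always zero
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap
            left = curr[j - 1] + gap
            # Local alignment: a cell never drops below zero.
            cell = max(0, diag, up, left)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def database_vs_database(queries, targets):
    """Score every query against every target (all-against-all search)."""
    return {(q, t): smith_waterman_score(q, t)
            for q in queries for t in targets}


if __name__ == "__main__":
    print(database_vs_database(["ACGT", "GGACGTGG"], ["ACGT", "TTTT"]))
```

Each (query, target) pair is scored independently of the others, which is why an all-against-all search splits naturally across grid nodes.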
Genomics and Proteomics 2
• Mass Spec Based Protein Identification
– Conduct “In Silico” Digestion of protein database
– Predict mass/charge ratio (m/z) of all possible peptide ions resulting from the digested database
– Search actual ions in spectra against predicted ions
– Return identifications of proteins based on scoring match
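The four steps above can be sketched as follows. This is a simplified illustration under stated assumptions: digestion uses the common trypsin rule (cleave after K or R unless the next residue is P), only intact singly charged peptide ions are predicted rather than fragment ions, and the score is a plain count of matched peaks. The function names, the tolerance, and the toy database are hypothetical.

```python
# Monoisotopic residue masses in daltons (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def tryptic_digest(protein: str) -> list:
    """In silico digestion: cleave after K or R, but not before P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mz(peptide: str, charge: int = 1) -> float:
    """Predicted m/z of the intact peptide ion at the given charge."""
    neutral = sum(MONO[aa] for aa in peptide) + WATER
    return (neutral + charge * PROTON) / charge


def identify(observed_mz, database, tol: float = 0.5):
    """Rank proteins by how many observed peaks match a predicted peptide.

    database maps a protein name to its sequence. The peak-count score
    is an assumption chosen for brevity, not a published scoring scheme.
    """
    scores = {}
    for name, seq in database.items():
        predicted = [peptide_mz(p) for p in tryptic_digest(seq)]
        scores[name] = sum(
            any(abs(obs - pred) <= tol for pred in predicted)
            for obs in observed_mz
        )
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    db = {"P1": "AAKGGR", "P2": "CCKSSR"}  # toy database
    peaks = [peptide_mz("AAK"), peptide_mz("GGR")]
    print(identify(peaks, db))
```

Because each database protein is digested and scored independently, the loop in `identify` is the natural unit of work to distribute across a grid.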
Genomics and Proteomics 3
• Predict 3D Protein folding given sequence of amino acids
• Solution to Schrödinger equation is intractable
• Search space of possible folds is immense
• Current methods of searching
– ab-initio
– AI
– Lego
– Monte Carlo
– Lattice
• On Grids can run multiple searches
– In parallel
– In series
• On Grids can run at higher resolutions
Phylogenetics
• Sequence various taxa (individuals or species)
– Frequently sequence mitochondrial DNA
– Mitochondrial DNA much like prokaryote DNA
• Compare sequences
– Form hypothetical evolutionary tree
– Each branch is a mutation
– Shows mutations from hypothetical ancestor
• Search space is immense
– Runs for 6 months on a single processor
– Then crashes!
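To give a sense of how immense the search space is, the number of distinct unrooted bifurcating tree topologies for n taxa is (2n - 5)!!, a standard combinatorial result. The sketch below simply evaluates that formula; the function names are illustrative.

```python
def double_factorial(n: int) -> int:
    """Return n!! = n * (n - 2) * (n - 4) * ... down to 1 or 2."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def unrooted_tree_count(taxa: int) -> int:
    """Number of unrooted bifurcating tree topologies for `taxa` leaves."""
    if taxa < 3:
        raise ValueError("need at least 3 taxa")
    return double_factorial(2 * taxa - 5)


if __name__ == "__main__":
    for n in (5, 10, 20, 50):
        print(n, unrooted_tree_count(n))
```

Ten taxa already give about two million candidate topologies, so exhaustive search is out of reach well before data sets reach realistic sizes.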
Computational Fluid Dynamics / Medical Imaging
• Monitor and collect real time CAT scan data
– Arterial blood flow
• Use Grid to interpret data
– Use Computational Fluid Dynamics to model blood flow
– Produce real time imaging
– Locate aneurysms and other anomalies
– Aid in diagnosis and decision making for surgical procedures
– Non-invasive
Cell membrane modeling
• Run simulations using both
– Configurational Bias Monte Carlo Method (CBMC)
– Molecular Dynamics (MD)
• Current simulations being done involve the properties of
cholesterol in lipid membranes
– Cholesterol is known to be an essential component of mammalian cell
membranes
– Its exact role is not well understood
• Previous simulations have been run
– Up to 1600 lipid or cholesterol molecules
– And 52,000 water molecules
• We're increasing these simulations by
– An order of magnitude in the physical dimensions
– And 2 to 3 orders of magnitude in time
Data Modeling
• Data Modeling LSG-RG in GGF Reference Implementation
– Automatic Data Synchronization
– Flagging "dirty" data
– Flagging data sources (including versioning)
IBG Workbench
[Diagram: layered architecture, top to bottom]
• Applications: Phylogenetic Trees, Mass Spec Proteomics, Homology Searching, Membrane Modeling, CFD
• DB Access
• Grid Services (Middleware)
• Grid Fabric (Resources)
The Purpose of Mass Spectrometry in Proteomics
• Identify and sequence all proteins involved in an organism’s
biology.
• Use this knowledge to identify proteins (or peptides) that can be
used to study and understand different biological states.
• Correlate protein expression levels to biological function. Use
protein or peptide biomarkers to identify disease states in
patients.
• Use the structure of the relevant proteins as targets for
developing new therapeutic techniques (drugs, etc.).
Mass Spectrometers in Proteomics
• Mass spectrometers measure the masses of proteins and peptides by moving their ions through the instrument in a controlled way.
• Proteins can be degraded using enzymes and the peptides produced can be analyzed by the mass spectrometer.
• A MS/MS instrument can cause the peptide ions to fragment into smaller pieces which can be used to deduce the peptide’s sequence.
• Once the sequence of the peptides has been determined, the protein’s complete sequence can be reassembled from the peptide sequences.
• The intensity of peaks can be used to determine the expression level of a protein in a sample.
• Samples from healthy and diseased tissue can be compared to locate biomarkers for disease.
The MS/MS Experiment Produces Multidimensional Data
• Chromatograms (Time vs Intensity)
• Precursor Ion Spectra of Peptides (Mass vs Intensity)
• Product Ion Spectra of Peptides (Precursor Mass, Mass vs Intensity)
[Figure: three panels: MS TIC chromatogram; +MS precursor ion spectrum; +MS2(1535.8) product ion spectrum (MS/MS of m/z 1535.8) with labeled b6, y5, y6, y7 and y11 fragment ions; spectra plotted as intensity vs. m/z]
What the tandem mass spectrum of a peptide looks like.
[Figure: peptide backbone with side chains R1 to R4, showing the cleavage sites that produce each fragment ion]
• Y-ions from C to N terminus (Y1, Y2, Y3)
• B-ions from N to C terminus (B1, B2, B3)
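The b- and y-ion series can be predicted directly from a peptide sequence. The sketch below assumes singly charged fragments and unmodified residues and uses standard monoisotopic masses; the function name and the example peptide are illustrative.

```python
# Monoisotopic residue masses in daltons (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide: str):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[i - 1] is the b_i ion (first i residues from the N terminus);
    y_ions[i - 1] is the y_i ion (last i residues from the C terminus).
    """
    residues = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        # b ion: N-terminal residues plus one proton.
        b_ions.append(sum(residues[:i]) + PROTON)
        # y ion: C-terminal residues plus water plus one proton.
        y_ions.append(sum(residues[-i:]) + WATER + PROTON)
    return b_ions, y_ions


if __name__ == "__main__":
    b, y = fragment_ions("PEPTIDE")
    for i, (bi, yi) in enumerate(zip(b, y), start=1):
        print(f"b{i} = {bi:.4f}   y{i} = {yi:.4f}")
```

For a peptide of n residues, b_i and y_(n-i) come from the same cleavage, so their m/z values always sum to the neutral peptide mass plus two protons. That complementarity is a useful consistency check when interpreting a spectrum.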
Important Issues In Computation for Proteomics
• DeNovo Sequencing
– Many computationally efficient algorithms exist
– Many times algorithms produce incorrect results very quickly!
– Issue of posttranslational modifications introduces complexity into interpretations
– Much data must be discarded to accommodate workstation based computational capacity
– A strong desire exists to use intensity data as well as mass data in interpretations
• Database Search (Protein ID)
– Most packages are commercial, few open source (BLAST based only)
– The more posttranslational modifications you allow for, the longer the searches take. Area is ripe for parallelism.
– Serious problems with false positive identifications
  • Many researchers are actively working to address this problem
  • Could be reduced by more front end interpretation before search
  • Could combine spectra from multiple MS types before search instead of correlating ID results after searches
• Datamining
– What do you do with all the identifications? Systems Biology!
  • Create models for signal pathways using protein id and expression data
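The core step of de novo sequencing is reading residues off the mass differences between consecutive peaks of one ion series. The sketch below illustrates that step only. It assumes a clean, complete ladder of singly charged ions from a single series, and the function name and tolerance are hypothetical.

```python
# Monoisotopic residue masses in daltons (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(peaks, tol: float = 0.02):
    """Map each gap between consecutive peaks to candidate residues.

    Returns one string per gap: the matching residue letters joined by
    "/" (for example "L/I" for the isobaric pair), or "?" if no single
    residue fits within the tolerance.
    """
    ordered = sorted(peaks)
    calls = []
    for low, high in zip(ordered, ordered[1:]):
        delta = high - low
        matches = [aa for aa, mass in MONO.items()
                   if abs(delta - mass) <= tol]
        calls.append("/".join(matches) if matches else "?")
    return calls


if __name__ == "__main__":
    # y1..y4 ions of the peptide GASK, computed from the table above.
    y_ladder = [147.11280, 234.14483, 305.18194, 362.20340]
    print(read_ladder(y_ladder))
```

The same sketch shows how a fast algorithm can return a wrong answer: if the peak between G and A is missing, the combined gap of 128.05857 Da is indistinguishable from a single Q (128.05858 Da), and `read_ladder` will report Q.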
GridProt: A Cactus Based Proteomics Tool Kit
Thorns:
GridMass – handles basic data extraction, chromatographic peak integration, mass detection
GridTAG – partial sequence mass tag extraction
GridID – grid based database search using mass spec data
GridDeNS – grid based denovo sequencing
Visualization – OpenDX
Data Storage – mzXML and HDF-5
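As an illustration of the chromatographic peak integration that GridMass is described as handling, the sketch below integrates intensity over a retention-time window with the trapezoidal rule. It is not the GridMass code: the function name, the plain-list inputs, and the choice of integration rule are all assumptions.

```python
def integrate_peak(times, intensities, start: float, end: float) -> float:
    """Trapezoidal area under the chromatogram for start <= t <= end.

    times and intensities are parallel sequences, with times ascending.
    Only sample points inside the window are used; no interpolation is
    done at the window edges.
    """
    points = [(t, y) for t, y in zip(times, intensities)
              if start <= t <= end]
    area = 0.0
    for (t0, y0), (t1, y1) in zip(points, points[1:]):
        area += 0.5 * (y0 + y1) * (t1 - t0)
    return area


if __name__ == "__main__":
    t = [0.0, 1.0, 2.0, 3.0, 4.0]
    y = [0.0, 10.0, 20.0, 10.0, 0.0]
    print(integrate_peak(t, y, 0.0, 4.0))
```

Peak areas computed this way are the kind of intensity data the earlier slides propose using alongside mass data.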
Conclusions
• Illinois Bio-Grid
– Excellent resource for Biological and Medical researchers
• IBG Workbench
– Excellent software architecture for compute intensive applications
– Will be source of BioMedical Informatics software sharing for a plethora
of different research areas
– Will be source of workbench tools for researchers creating software in other
related Informatics areas
• Cactus is an ideal platform for HPC of Mass Spec data
– Modular thorns allow generalization for MS, specialization for Proteomics
– Ideal base for open source, extendable software ready for HPC as
Proteomics data sets grow.
• http://facweb.cs.depaul.edu/bioinformatics
• http://facweb1.cs.depaul.edu/~dangulo
Acknowledgements
University of Chicago
Howard Hughes Medical Institute
Ben May Cancer Center
Pfizer Inc.
Illinois Biogrid:
Dave Angulo, DePaul University
Gregor von Laszewski, ANL
Kevin Drew, Tim Freeman