Bioinformatics in Glycobiology


Bioinformatics:
Glycomics
A post-genomic paradigm
Ram Sasisekharan
Biological Engineering Division
Massachusetts Institute of Technology
Cambridge, MA
Outline
• Overview – “omics”
• Systems Biology
• Pre- and Post-Genomic bioinformatics
• Issues with Glycomics
• Addressing the Challenges
• New Research Models for Post Genomics Bioinformatics – the Glue Grants
• Consortium for Functional Glycomics
• Conclusions
Central dogma in biology and the age of the “’omics”
[Diagram: Genotype (DNA Sequence) – RNA – Translation of Protein Sequence – Posttranslational Modification – PHENOTYPE. Field labels: Genomics, Proteomics, Glycomics (an emerging paradigm).]
Information Content in Biological Systems
• Sequence
– DNA: Nucleotides
– Protein: Amino Acids
– Carbohydrates: Monosaccharides
• Interactions between molecules
• Biological Activity
– Molecular
– Cellular
– Tissue
• Structure
– Secondary
– Tertiary
– Quaternary
Systems Biology
• Model: Bayesian Networks, Boolean Networks
• Manipulate: Molecular Genetics, Chemical Genetics, Cell Engineering
• Mine: Bioinformatics
• Measure: Biochemistry, Imaging, Bioelectronics
What is Bioinformatics ?
• Assimilation
• Cataloging
• Classification of biological information for
– Model Creation
– Prediction of behavior of a biological system for a given set of inputs
[Flow diagram, labels in slide order:]
• Data Acquisition: Web Interface, Data Curation
• Knowledge base
• Tools for Data Analysis: Statistical Analysis, Comparison, Scoring functions
• Data Storage and Data Dissemination: Database Infrastructure, Database Design, Search Engines, Simple/Advanced Queries
• Model Creation: Network of relationships between structural and functional attributes of biological macromolecules
• Prediction: System behavior to a set of inputs
Landmarks in Genomics
1970s Advent of DNA sequencing
1980s DNA sequencing automated
1990s Era of Bioinformatics: rapid computational manipulation, storage and dissemination of sequence information
1995 First whole genome sequenced
1999 First human chromosome sequenced
2001 Draft of human genome
Evolving framework for Bioinformatics
Pre-genomics Bioinformatics
• Representing sequence information - single alphabet code
– DNA: {A,T,G,C}
– Proteins: {A,C,D-I,K-N,P-T,V,W,Y}
– Carbohydrates: not well defined
• Storing Information – simple flat file databases
– Sequence Databases – GenBank, SwissProt: Flat file databases without
any annotation or structuring of gene and protein sequence information
– Structure Databases – Protein Data Bank: Flat file database. Structural
annotations such as the classification of structural superfamilies (SCOP) were
created from PDB entries
– Biological Activity – there was no real database that catalogued the
important biological roles of biopolymers. Part of this information was
stored as additional text fields in the sequence and structure databases
Limited development of bioinformatics platforms for carbohydrates
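To make the single alphabet codes above concrete, here is a minimal Python sketch that expands the compact notation used on the slide (for example `D-I`) into individual letters and checks a sequence against the resulting alphabet. The function names `expand_alphabet` and `is_valid` are illustrative and not part of any existing tool.

```python
from string import ascii_uppercase


def expand_alphabet(spec: str) -> list[str]:
    """Expand a compact spec such as 'A,C,D-I' into individual letters."""
    letters = []
    for token in spec.split(","):
        token = token.strip()
        if "-" in token:
            # A range like 'D-I' covers every letter from D to I inclusive.
            start, end = token.split("-")
            i, j = ascii_uppercase.index(start), ascii_uppercase.index(end)
            letters.extend(ascii_uppercase[i:j + 1])
        else:
            letters.append(token)
    return letters


DNA = expand_alphabet("A,T,G,C")
PROTEIN = expand_alphabet("A,C,D-I,K-N,P-T,V,W,Y")


def is_valid(sequence: str, alphabet: list[str]) -> bool:
    """True if every character of the sequence belongs to the alphabet."""
    return set(sequence) <= set(alphabet)


print(len(DNA), len(PROTEIN))   # 4 20
print("".join(PROTEIN))         # ACDEFGHIKLMNPQRSTVWY
```

No comparable one-line alphabet is given for carbohydrates, which is the gap the later slides address.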
Evolving framework for Bioinformatics
Post-genomics Bioinformatics – Proteomics, Glycomics
• Types of information
– Data sets from high throughput experiments – Microarray, Mass spectrometry and other analytical tools
– Data sets from diverse experiments – mouse models to study the biological macromolecule in vivo, sensitive assays for studying interactions between proteins in a biological pathway
• Types of Databases – Complex relational databases
– Relational databases store different attributes obtained from high throughput experimental data and relationships between these attributes
Increasing awareness of the importance of carbohydrates in fundamental biological functions, yet little development of bioinformatics applications to represent, store and manipulate information in carbohydrates
Glycomics
Types of Carbohydrates
Branched Sugars: N-Glycans
N-glycan diversity
[Figure labels: Asn-X-Thr/Ser, Nascent Polypeptide, OST, Cytosol, ER, Golgi]
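The Asn-X-Thr/Ser label in the figure refers to the N-glycosylation sequon. As a small illustration, the sketch below scans a protein sequence for candidate sequons. The exclusion of proline at position X is the commonly cited rule and is not stated on the slide; the function name `find_sequons` and the test sequence are made up for this example. A sequon match is a necessary condition only and does not show that a site is actually glycosylated.

```python
import re

# N-X-S/T, where X is any residue except proline (assumed rule, see lead-in).
# The lookahead keeps matches zero-width after N so overlapping sequons are found.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(protein: str) -> list[int]:
    """Return 0-based positions of Asn residues that start a candidate sequon."""
    return [m.start() for m in SEQUON.finditer(protein)]


# Made-up sequence: NGT and NAS match, NPS is rejected because X is proline.
print(find_sequons("MNGTNPSANAS"))  # [1, 8]
```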
Types of Carbohydrates
Linear Sugars: Glycosaminoglycans
GAGs are the most acidic and information-dense linear sugars
Representing Information in Carbohydrates
Proteins and DNA – Backbone is mostly fixed; variations in building blocks are
primarily due to variations in the side chain R groups
R = 4 (DNA bases)
R = 20 (amino acids)
In the case of carbohydrates, there are variations in the chemical configuration
of the monosaccharide building blocks, in the linkage between monosaccharides, and
in the exocyclic substitutions (R groups), thereby making them
highly information dense – both linear and branched sugar structures
X: SO3- ; Y: Ac/SO3- – variation in the
chemical configuration (I/G) and exocyclic
sulfation pattern gives 32 building blocks,
in comparison with 20 amino acids and 4
bases.
High information density makes representation of information content in
carbohydrates a challenging task – simple alphabetic codes don’t efficiently
capture the information content
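The count of 32 building blocks can be reproduced with a short enumeration. This sketch assumes one binary choice for the uronic acid (I or G), three O-sulfation sites that are each either unsubstituted or sulfated (the X positions), and one N position that is either acetylated or sulfated (the Y position), which gives 2 × 2³ × 2 = 32. The site names `2X`, `6X`, `3X` and `NY` are labels chosen for this example.

```python
from itertools import product

URONIC = ("I", "G")  # iduronic or glucuronic configuration

# Assumed substitution sites: three X positions (H or SO3-), one Y position (Ac or SO3-).
SITE_STATES = {
    "2X": ("H", "SO3-"),
    "6X": ("H", "SO3-"),
    "3X": ("H", "SO3-"),
    "NY": ("Ac", "SO3-"),
}


def building_blocks():
    """Yield every combination of uronic configuration and site substitution."""
    names = list(SITE_STATES)
    for uronic, *subs in product(URONIC, *SITE_STATES.values()):
        yield {"uronic": uronic, **dict(zip(names, subs))}


blocks = list(building_blocks())
print(len(blocks))  # 32
print(blocks[0])    # {'uronic': 'I', '2X': 'H', '6X': 'H', '3X': 'H', 'NY': 'Ac'}
```

Describing a single block already takes five fields, which is why a one-letter code per block becomes unwieldy.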
Carbohydrate – Protein Interaction:
• Carbohydrate – protein interactions are key in modulating cell-cell communication
• Glycosylation on cell surface proteins acts as a recognition motif for proteins on multiple cell types, including immune cells and pathogens
• Because the interactions are multivalent, binding between a single carbohydrate and a lectin is weak and thus hard to characterize
Multivalent interactions between carbohydrates and proteins
complicate the relationship between these interacting species
Glycosaminoglycan Paradigm
[Figure labels: IL-8, TGF-β, FGF, INF-, VEGF, TNF, Chemokine, Enzymes, Integrins, Pathogens]
• Cell surface proteoglycans comprise long GAG polysaccharides that provide the cell with a “sugar coat”
• GAGs interact with a multitude of signaling molecules in a sequence-specific manner and modulate important biological processes
• Different GAG sequences have differential affinities for a particular signaling molecule, and this gradient in affinity plays a key role in “analog” regulation of biological function
Characterization of Carbohydrate Structures
Complex Biosynthesis
[Figure labels: NDST, Epimerase, 2OST, 6OST, 3OST]
– Biosynthesis is not template-based and involves several enzymes
– There are multiple isoforms of these enzymes with different substrate specificities, further increasing the diversity of structures
– It is not possible at this time to amplify tissue-derived carbohydrates due to their complex biosynthesis, so only low amounts of biological sample are available
Characterization of Carbohydrate Structures
Challenges in Isolation and Purification
– Due to the chemical heterogeneity, it is difficult to obtain sufficient amounts of pure, homogeneous samples.
– Often the sample analyzed is a mixture, therefore the sequence information in many cases cannot be fully determined – a non-deterministic system
Partial information on carbohydrate structure due to limitations
in their structural characterization poses significant
challenges in storing and manipulating information content in
carbohydrates
Advancing Glycomics – Key Issues
• Representing Information in Carbohydrates is complicated –
alphabetic codes are too cumbersome to handle information
density
• Dealing with analysis of low amounts of tissue derived
material
• As a result of the challenges in the structural analysis of
carbohydrates, there is a need to develop tools to represent
and manipulate partial/non-deterministic information on
carbohydrate structures
• “Analog” regulation of biological function by carbohydrates
poses a challenge in providing functional attributes to specific
carbohydrate structures
Addressing the Challenges
Representing information in Carbohydrates – HSGAG as model system
• Property encoded nomenclature (PEN)
– Numerical scheme that optimally allocates bits to encode “properties” and the identity of the building block of biopolymers
– Facilitates the use of mathematical operations to manipulate the information.
Features of PEN framework
• Derived from an internal logic that is based on the chemical
nature of the building block
• Uses a numerical system and mathematical operations to
perform manipulations
• Can be easily extended to encode more variations either by
using more bits or higher numerical base due to the
flexibility of the number system
• Facilitates comparison of “properties” directly since property
encoded is a function of the chemical identity of the building
block.
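The idea of allocating bits to properties can be sketched in a few lines of Python. The 5-bit layout below (one bit each for iduronic configuration, 2-O, 6-O and 3-O sulfation, and N-sulfation versus N-acetylation) is a hypothetical layout chosen for illustration; it is not the published PEN code table, and all function names are invented for this example. What it shows is the general point of the slide: once properties are bits, composition and comparison reduce to arithmetic and bitwise operations.

```python
# Hypothetical bit layout for one disaccharide building block (assumption).
IDURONIC, S2, S6, S3, NS = 0b10000, 0b01000, 0b00100, 0b00010, 0b00001
SULFATE_MASK = S2 | S6 | S3 | NS


def encode(iduronic: bool, s2: bool, s6: bool, s3: bool, n_sulfated: bool) -> int:
    """Pack the five binary properties of a building block into one integer."""
    code = 0
    for flag, bit in ((iduronic, IDURONIC), (s2, S2), (s6, S6), (s3, S3), (n_sulfated, NS)):
        if flag:
            code |= bit
    return code


def sulfate_count(code: int) -> int:
    """Number of sulfate groups = number of set sulfation bits."""
    return bin(code & SULFATE_MASK).count("1")


def acetate_count(code: int) -> int:
    """In this layout the N position is acetylated whenever it is not sulfated."""
    return 0 if code & NS else 1


def shared_sulfation(a: int, b: int) -> int:
    """Sulfation sites present in both blocks, obtained with a bitwise AND."""
    return a & b & SULFATE_MASK


chain = [encode(True, True, True, False, True), encode(False, False, False, False, False)]
print(" ".join(f"{c:02X}" for c in chain))                          # 1D 00
print(sum(map(sulfate_count, chain)), sum(map(acetate_count, chain)))  # 3 1
```

Encoding a further property in such a scheme means adding a bit, which mirrors the extensibility point in the list above.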
Dealing with low sample amounts – Sensitive MALDI-MS analysis
• Matrix – Caffeic acid
• Complex with Basic peptide
– (RG)nR detected
• Laser-induced ionization leads to formation of molecular ions
• Mass of saccharide is obtained to an accuracy of <1 Dalton by subtracting the mass of the peptide from the mass of the complex
• Accurately determines masses of picomole amounts of sample, typical of biologically important HSGAG oligosaccharides
Applications of PEN
Mass – Composition relationship
The length and the number of sulfates and acetates of an HLGAG oligomer can be unambiguously assigned for oligomers up to tetradecasaccharide
Applications of PEN: PEN-MALDI Sequencing Strategy
[Flowchart: Formalism (hexadecimal/binary notation based mass-line) and MALDI-MS – All Possible Sequences – iterative Experimental constraints (composition, chemical, enzymatic) – Unique Solution]
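The strategy of starting from all possible sequences and converging on a unique solution can be illustrated with a toy elimination loop. Everything specific here is an assumption made for the sketch: the 5-bit block layout, the composition (two disaccharide units, three sulfates, one acetate), and the four constraints, which stand in for chemical or enzymatic observations. None of it reproduces actual experimental rules.

```python
from itertools import product

# Hypothetical 5-bit layout for one disaccharide unit (assumption for this
# sketch, not the published PEN code table).
IDURONIC, S2, S6, S3, NS = 0b10000, 0b01000, 0b00100, 0b00010, 0b00001
SULFATE_MASK = S2 | S6 | S3 | NS


def sulfates(code: int) -> int:
    return bin(code & SULFATE_MASK).count("1")


def acetates(code: int) -> int:
    return 0 if code & NS else 1


def candidates(n_units: int, n_sulfates: int, n_acetates: int):
    """All sequences of n_units blocks that match the measured composition."""
    for seq in product(range(32), repeat=n_units):
        if (sum(map(sulfates, seq)) == n_sulfates
                and sum(map(acetates, seq)) == n_acetates):
            yield seq


# Made-up constraints standing in for chemical/enzymatic results.
constraints = [
    ("unit 1 is a 2-O-sulfated iduronic acid",
     lambda s: (s[0] & (IDURONIC | S2)) == (IDURONIC | S2)),
    ("unit 2 carries no sulfate", lambda s: sulfates(s[1]) == 0),
    ("unit 1 is 6-O-sulfated", lambda s: bool(s[0] & S6)),
    ("unit 2 is a glucuronic acid", lambda s: not (s[1] & IDURONIC)),
]

remaining = list(candidates(2, 3, 1))
print("composition only:", len(remaining))
for label, keep in constraints:
    # Each round discards every candidate that contradicts one observation.
    remaining = [s for s in remaining if keep(s)]
    print(f"{label}: {len(remaining)} left")

print([tuple(f"{c:02X}" for c in s) for s in remaining])  # [('1D', '00')]
```

Composition alone leaves many candidates; each added constraint shrinks the list until one sequence remains.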
Sequencing HSGAGs: Example
New Research Models for Post Genomics Bioinformatics – the Glue Grants
• Alliance for Cellular Signaling (AfCS): To understand as completely as possible the relationships between sets of inputs and outputs in signaling cells that vary both temporally and spatially, i.e. how cells interpret signals in a context-dependent manner
• Cell Migration Consortium: To accelerate progress in cell migration-related research by fostering interdisciplinary research activities and producing novel reagents and information
• Consortium for Functional Glycomics: Define the paradigms by which carbohydrate binding proteins function in cellular communication
• Inflammation and Host Response Consortium: It is designed to acquire new scientific knowledge about the biological basis for different outcomes in injured patients.
Consortium for Functional Glycomics
Organization of the Core Facilities
Data Organization: Databases
Data Storage: Database Design
• Overview
– Classification of data
  • 6 key identifiers (name tags for data) – CBP ID, GT ID, Carb ID, Project ID, Microarray ID, Mouse Strain ID
  • Data Fields – provide structure to the type of data being entered. Selection of the appropriate data fields depends on what kinds of data will be entered
– Linking data
  • Data fields pertaining to a specific attribute are stored in a table
  • Each table will be linked to other tables via common data fields or identifiers.
  • The data tables and their links form an “Ontology Diagram”
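As a rough illustration of tables linked through shared identifiers, the sketch below builds three tables in an in-memory SQLite database and joins them on CBP ID and Carb ID. The table names, column names, the `interaction` link table, and the value `Project001` are assumptions for this example and do not describe the Consortium's actual schema; the IDs `CBP001`, `GB0001`, `SP00001` and `Carb0001` are the sample values shown on the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cbp (
    cbp_id    TEXT PRIMARY KEY,   -- key identifier: CBP ID
    genbank   TEXT,
    swissprot TEXT
);
CREATE TABLE carbohydrate (
    carb_id  TEXT PRIMARY KEY,    -- key identifier: Carb ID
    notation TEXT
);
-- Hypothetical link table: rows connect a CBP to a carbohydrate via their IDs.
CREATE TABLE interaction (
    cbp_id     TEXT REFERENCES cbp(cbp_id),
    carb_id    TEXT REFERENCES carbohydrate(carb_id),
    project_id TEXT
);
""")

conn.execute("INSERT INTO cbp VALUES (?, ?, ?)", ("CBP001", "GB0001", "SP00001"))
conn.execute("INSERT INTO carbohydrate VALUES (?, ?)", ("Carb0001", "placeholder-notation"))
conn.execute("INSERT INTO interaction VALUES (?, ?, ?)", ("CBP001", "Carb0001", "Project001"))

# Follow the links: which carbohydrates does each CBP interact with?
rows = conn.execute("""
    SELECT cbp.cbp_id, cbp.genbank, carbohydrate.carb_id
    FROM cbp
    JOIN interaction  ON interaction.cbp_id = cbp.cbp_id
    JOIN carbohydrate ON carbohydrate.carb_id = interaction.carb_id
""").fetchall()
print(rows)  # [('CBP001', 'GB0001', 'Carb0001')]
```

The joins work only because both tables carry the same identifier, which is the role the key identifiers play in the design described above.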
Data Storage: Database Ontology
[Ontology diagram. Data categories: CBP-Carbohydrate Interaction, CBP, Expression, Mouse Studies. Contributor labels: Core C, Core D, Core F, Core G, Core H, PI]
Data Storage: Relational Databases
[Figure: example of linked tables, with sample values as shown on the slide]
Author – Name: XYZ, …; Email: [email protected]; Institution: ABC; …
Protein – CBP ID: CBP001; GenBank: GB0001; SwissProt: SP00001; PDB: 1XXX; ...
Protein Expression – Cell Line: BL21; Gel Image: Img.jpg; cDNA clone: GB0002; ...
Carbohydrate – Carb ID: Carb0001; Structure: notation; Mass Spec: MS-1.jpg; NMR: NMR.jpg; ...
Structure – CBP001; Carb0001; PDB: 1XYZ; ...
Author – Name: MNO, …; Email: [email protected]; Institution: IJK; …
Link labels between tables: “characterized”, “was expressed using”, “interacts with”, “characterized using”
Other labels: Characterization, Carb DB
Sample Ontology from AfCS
Conclusions
• In the post-genomics era, high-throughput experimental methods are generating large data sets pertaining to multiple sequence, structure and functional attributes of genes and proteins – Transition from Traditional Biology → Information-driven “Systems Biology”
• With constantly increasing computational power, there has been a big leap in development of bioinformatics tools to deal with large data sets
• Increasing awareness of the role of carbohydrates in fundamental biological processes modulating cell-cell and cell-matrix interactions
• Development of bioinformatics applications for carbohydrates has many challenges due to their complexity and heterogeneity
• Addressing these challenges would enable the development of bioinformatics for glycobiology to provide a more comprehensive description of the “state” of a biological system and to better predict the “response” of a biological system to a given set of “inputs”