Document Here - What is BioInformatics?


Bioinformatics &
Computational Biology
Thanks to Mark Gerstein (Yale)
& Eric Green (NIH)
for many borrowed & modified PPTs
Drena Dobbs
Iowa State University
1
What is Bioinformatics?
(& What is Computational Biology?)
Wikipedia:
• Bioinformatics & computational biology involve the
use of techniques from mathematics, informatics,
statistics, and computer science (& engineering) to
solve biological problems
Gerstein:
• (Molecular) Bioinformatics is conceptualizing biology in
terms of molecules & applying “informatics”
techniques - derived from disciplines such as mathematics,
computer science, and statistics - to organize and
understand information associated with these molecules,
on a large scale
2
What is the Information?
Biological Sequences, Structures, Processes
Central Dogma
of Molecular Biology
• DNA sequence
-> RNA
-> Protein
-> Phenotype
• Molecules
  - Sequence, Structure, Function
• Processes
  - Mechanism, Specificity, Regulation
Central Paradigm
for Bioinformatics
• Genomic (DNA) Sequence
-> mRNAs & other RNA sequences
-> Protein sequences
-> RNA & Protein Structures
-> RNA & Protein Functions
-> Phenotype
• Large Amounts of Information
  - Standardized
  - Statistical
(Modified from Mark Gerstein; idea from D Brutlag, Stanford; graphics from S Strobel)
3
Explosion of "Omes" & "Omics!"
Genome, Transcriptome, Proteome
• Genome - the complete collection
of DNA (genes and "non-genes") of
an organism
• Transcriptome - the complete
collection of RNAs (mRNAs &
others) expressed in an organism
• Proteome - the complete
collection of proteins expressed
in an organism
4
Genome = Constant
Transcriptome & Proteome = Variable
• Genome - the complete collection
of DNA (genes and "non-genes") of
an organism
• Transcriptome - the complete
collection of RNAs (mRNAs &
others) expressed in an organism*
• Proteome - the complete
collection of proteins expressed in
an organism*
* Note:
Although the DNA is "identical" in all
cells of an organism, the
sets of RNAs or proteins
expressed in different
cells & tissues of a single
organism vary greatly
and depend on variables
such as environmental
conditions, age,
developmental stage,
disease state, etc.
5
Molecular Biology Information:
DNA & RNA Sequences
Functions:
• Genetic material
• Information transfer (mRNA)
• Protein synthesis (tRNA/mRNA)
• Catalytic & regulatory activities
(some very new!)
DNA sequence:
atggcaattaaaattggtatcaatggttttggtcgtat
gcacaacaccgtgatgacattgaagttgtaggtattaa
atggcttatatgttgaaatatgattcaactcacggtcg
aaagatggtaacttagtggttaatggtaaaactatccg
Gcaaacttaaactggggtgcaatcggtgttgatatcgctttaactg
atgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagtt
Information:
• 4 letter alphabet
  - DNA nucleotides: AGCT
  - RNA sequence has "U" instead of "T"
• ~ 1,000 base pairs in a small gene
• ~ 3 x 10^9 bp in a genome (human)
Modified from Mark Gerstein
• Where are the genes?
• Which DNA sequences encode mRNA?
• Which DNA sequences are "junk"?
• Which RNA sequences encode protein?
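The points above about the 4-letter alphabet can be tried out on the first line of the DNA sequence shown on this slide. This is only a minimal sketch: the function names are made up for illustration, and rewriting T as U assumes the strand shown is the coding strand.

```python
from collections import Counter

# First line of the DNA sequence shown on the slide.
dna = "atggcaattaaaattggtatcaatggttttggtcgtat"


def base_counts(seq: str) -> Counter:
    """Count how often each letter of the DNA alphabet occurs."""
    return Counter(seq.upper())


def to_rna(seq: str) -> str:
    """Rewrite a DNA string in the RNA alphabet (U in place of T).

    Assumption: the input is the coding strand, so no reverse
    complement is taken.
    """
    return seq.upper().replace("T", "U")


counts = base_counts(dna)
rna = to_rna(dna)

print(len(dna), "bases:", dict(counts))
print(rna)

# Scale comparison using the slide's round numbers.
SMALL_GENE_BP = 1_000
HUMAN_GENOME_BP = 3 * 10**9
print(HUMAN_GENOME_BP // SMALL_GENE_BP, "small-gene lengths per human genome")
```

Answering "Where are the genes?" takes far more than letter counting, but every sequence analysis tool starts from string operations of this kind.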
6
Molecular Biology Information:
Protein Sequences
Functions: Most cellular functions are performed or facilitated
by proteins
• Biocatalysis
• Cofactor transport/storage
• Mechanical motion/support
• Immune protection
• Regulation of growth and differentiation
Protein sequences:
d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTT
d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTS
d4dfra_ ISLIAALAVDRVIGMENAMPWN LPADLAWFKRNTL
d3dfr__ TAFLWAQDRDGLIGKDGHLPWHLPDDLHYFRAQTV
Information:
• 20 letter alphabet (amino acids)
  - ACDEFGHIKLMNPQRSTVWY (but not BJOUXZ)
• ~ 300 aa in an average protein (in bacteria)
• ~ 3 x 10^6 known protein sequences
Modified from Mark Gerstein
• What is this protein?
• Which amino acids are most
important -- for folding, activity,
interaction with other proteins?
• Which sequence variations are
harmful (or beneficial)?
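A rough way to see how similar two of the sequences on this slide are is to check them against the 20-letter alphabet and count matching positions. This is a sketch only: the helper names are hypothetical, and the position-by-position comparison is not a real alignment (no gaps, no scoring matrix). It works here only because the d1dhfa_ and d8dfr__ fragments happen to have the same length.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20-letter alphabet from the slide


def is_valid_protein(seq: str) -> bool:
    """True if seq is non-empty and uses only the 20 standard letters."""
    return bool(seq) and set(seq.upper()) <= AMINO_ACIDS


def count_identities(a: str, b: str) -> int:
    """Count positions where two equal-length sequences have the same residue."""
    if len(a) != len(b):
        raise ValueError("ungapped comparison needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b))


# Two of the sequence fragments shown on the slide.
d1dhfa = "LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTT"
d8dfr = "LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTS"

same = count_identities(d1dhfa, d8dfr)
print(f"{same} of {len(d1dhfa)} positions identical "
      f"({100 * same / len(d1dhfa):.0f}%)")
```

Positions that stay identical across related proteins are natural first candidates when asking which amino acids matter most for folding or activity.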
7
Molecular Biology Information:
Macromolecular Structures
DNA/RNA/Protein Structures
• How does a protein (or RNA)
sequence fold into an active
3-dimensional structure?
• Can we predict structure
from sequence?
• Can we predict function from
structure (or perhaps, from
sequence alone?)
Modified from Mark Gerstein
8
We don't yet understand the protein folding
code - but we try to engineer proteins anyway!
Modified from Mark Gerstein
9
Molecular Biology Information:
Biological Processes
Functional Genomics
• How do patterns of gene
expression determine
phenotype?
• Which genes and proteins are
required for differentiation
during development?
• How do proteins interact in
biological networks?
• Which genes and pathways have
been most highly conserved
during evolution?
10
On a Large Scale?
Whole Genome
Sequencing
Genome sequences now
accumulate so quickly that,
in less than a week, a single
laboratory can produce
more bits of data than
Shakespeare managed in a
lifetime, although the latter
make better reading.
-- G A Petsko, Nature 401: 115-116 (1999)
Modified from Mark Gerstein
11
Automated Sequencing for Genome Projects
Another recent improvement: rapid & high resolution separation of
fragments in capillaries instead of gels (E Yeung, Ames Lab, ISU)
More recently?
Pyro-sequencing
454 sequencing http://www.454.com/
$1,000 genomes?
Modified from Eric Green
12
1st Draft Human Genome - "Finished" in 2001
Modified from Eric Green
13
Human Genome Sequencing
Two approaches:
• Public (government) - International Consortium
(6 countries, NIH-funded in US)
  - "Hierarchical" cloning & BAC-by-BAC sequencing
  - Map-based assembly
• Private (industry) - Celera (Craig Venter)
  - Whole genome random "shotgun" sequencing
  - Computational assembly
(took advantage of public maps & sequences, too)
Guess which human genome they sequenced?
Craig's (Science May 2007)
How many genes?
~ 20,000
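The idea behind "shotgun" sequencing with computational assembly can be illustrated with a toy greedy overlap merger. This is a sketch under strong assumptions, not Celera's method: reads are error-free, all come from the same strand, and the toy sequence has no repeats. The function names and the example reads are invented for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the best overlap is shorter than min_len.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to join; several contigs remain
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]
    return contigs


# Three overlapping "reads" drawn from a made-up 15-base sequence.
reads = ["ACGTTAGC", "ATGGCGTA", "GCGTACGTT"]
print(greedy_assemble(reads))
```

Real genomes break the greedy approach quickly: repeated sequences create false overlaps, which is one reason the map-based and shotgun strategies were combined in practice.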
14
Public Sequencing - International Consortium
Modified from Eric Green
15
Comparison of Sequenced Genome Sizes
Plants? Some have much larger genomes than human!
Modified from Eric Green
16
"Complete" Human Genome Sequence - What next?
from Eric Green
17
Next Step after the
Sequence?
Understanding Gene
Function on a Genomic
Scale
• Expression Analysis
• Structural Genomics
• Protein Interactions
• Pathway Analysis
• Systems Biology
Evolutionary Implications of:
• Introns & Exons
• Intergenic Regions as "Gene Graveyard"
Modified from Mark Gerstein
18
Interpreting the Human Genome Sequence!
from Eric Green
19
Comparative Genomics:
compare entire genomic sequences
from Eric Green
20
Comparing Genomes: Functional Elements
from Eric Green
21
Gene Expression Data:
the Transcriptome (& Proteome)
MicroArray Data
Yeast Expression Data:
• Levels for all 6,000 genes!
• Experiments to investigate
how genes respond to
changes in environment or
how patterns of expression
change in normal vs
cancerous tissue
Modified from Mark Gerstein (courtesy of J Hager)
ISU's Biotechnology Facilities
include state-of-the-art
Microarray & Proteomics
instrumentation
22
Other
Whole-Genome
Experiments
Systematic Knockouts:
Make "knockout" (null)
mutations in every gene
- one at a time - and
analyze the resulting
phenotypes!
For yeast:
6,000 KO mutants!
Modified from Mark Gerstein
2-hybrid Experiments:
For each (and every)
protein, identify every
other protein with which it
interacts!
For yeast: 6000 x 6000 / 2
~ 18M interactions!!
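The "~ 18M" figure is a round-number estimate of how many protein pairs would need testing. A short check of the arithmetic, assuming exactly 6,000 proteins (the variable names are illustrative only):

```python
from math import comb

n_proteins = 6_000

# The slide's back-of-the-envelope estimate: n * n / 2.
rough_estimate = n_proteins * n_proteins // 2   # 18,000,000

# Exact number of unordered pairs of two different proteins: n * (n - 1) / 2.
distinct_pairs = comb(n_proteins, 2)            # 17,997,000

# If each protein is also tested against itself (homodimers), add n.
pairs_with_self = distinct_pairs + n_proteins   # 18,003,000

print(f"{rough_estimate:,} vs {distinct_pairs:,} vs {pairs_with_self:,}")
```

All three counts agree to within 0.02%, so the rounded figure is a fair summary of the scale of the experiment.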
23
Molecular Biology Information:
Integrating Data
• Understanding the function of genomes
requires integration of many diverse and
complex types of information:
  - Metabolic pathways
  - Regulatory networks
  - Whole organism physiology
  - Evolution, phylogeny
  - Environment, ecology
  - Literature (MEDLINE)
Modified from Mark Gerstein
24
Storing & Analyzing Large-scale Information:
Exponential Growth of Data Matched by
Development of Computer Technology
CPU vs Disk & Net
• Both the increase in
computer speed and the
ability to store large
amounts of information on
computers have been
crucial
• Improved computing
resources have been a
driving force in
Bioinformatics
ISU's supercomputer "CyBlue" is among the
100 most powerful in the world
Modified from Mark Gerstein
(Internet picture adapted from D Brutlag, Stanford)
25
from Mark Gerstein
Weber Cartoon
26
Challenges in Organizing & Understanding High-throughput Data:
Redundancy and Multiplicity
• Different sequences can have the
same structure
• Organism has many similar genes
• Single gene may have multiple
functions
• Genes and proteins function in genetic
and regulatory pathways
• How do we organize all this
information so that we can make
sense of it?
Integrative Genomics:
genes <> structures <> functions <> pathways <> expression
<> regulatory systems <> …
Modified from Mark Gerstein
27
"Simple" example? Proteins
Molecular Parts = Conserved Domains
Modified from Mark Gerstein
28
"Parts List" approach to bike maintenance:
How many roles
can these play?
How flexible and
adaptable are they
mechanically?
What are the
shared parts (bolt,
nut, washer, spring,
bearing), unique
parts (cogs,
levers)? What are
the common parts - types of parts
(nuts & washers)?
Where are
the parts
located?
Modified from Mark Gerstein
29
World of protein structures is also finite,
providing a valuable simplification!
(human) genes 1, 2, 3, … : ~20,000 genes
~2,000 folds
(T. pallidum) genes 1, 2, 3, … : ~2,000 genes
Global Surveys of a Finite
Set of Parts from Many
Perspectives
Same logic for pathways, functions,
sequence families, blocks, motifs....
Modified from Mark Gerstein
Functions picture from www.fruitfly.org/~suzi (Ashburner); Pathways picture from ecocyc.pangeasystems.com/ecocyc (Karp, Riley). Related
resources: COGS, ProDom, Pfam, Blocks, Domo, WIT, CATH, Scop....
30
BUT, what actually happens in cells & in
whole organisms is much more complex!
providing a challenging complication!!
Exploring the Virtual Cell at ISU
Virtual Cell projects elsewhere...
NCBI's Bookshelf - a great resource!
31
So, having a list of parts is not
enough!
BIG QUESTION?
How do parts work together to form a
functional system?
SYSTEMS BIOLOGY
What is a system? Macromolecular complex, pathway,
network, cell, tissue, organism, ecosystem…
32
Is this Bioinformatics?
(#1, with Answers)
• Creating digital libraries - YES
  - Automated bibliographic search and textual comparison
  - Knowledge bases for biological literature
• Motif discovery using Gibbs sampling - YES
• Methods for structure determination - YES
  - Computational X-ray crystallography
  - NMR structure determination
    - Distance Geometry
• Metabolic pathway simulation - YES
Modified from Mark Gerstein
33
Is this Bioinformatics? #2
• Gene identification by sequence inspection - YES
  - Prediction of splice sites, promoters, etc.
• DNA methods in forensics - YES
• Modeling populations of organisms - YES
  - Ecological Modeling
• Genomic sequencing methods - YES
  - Assembling contigs
  - Physical and genetic mapping
• Linkage analysis - YES
  - Linking specific genes to various traits
Modified from Mark Gerstein
34
Is this Bioinformatics? #3
• Rational drug design
• RNA structure prediction
• Protein structure prediction
YES
• Radiological image processing
  - Computational representations for human anatomy (e.g., Visible Human)
• Artificial life simulations
  - Artificial immunology
  - Virtual cells
Modified from Mark Gerstein
Maybe
Yes
35
So, this is Bioinformatics
What is it good for?
36
EXAMPLES OF
BIOINFORMATICS RESEARCH
A few general ones
&
a few personal favorites!
37
Designing New Drugs
• Understanding how proteins bind other molecules
• Structural modeling & ligand docking
• Designing inhibitors or modulators of key proteins
Modified from Mark Gerstein
(Figures adapted from Olsen Group Docking Page at Scripps, Dyson NMR Group Web
page at Scripps, and from Computational Chemistry Page at Cornell Theory Center)
38
Finding homologs of "new" human genes
Modified from Mark Gerstein
39
Finding WHAT?
Homologs - "same genes" in different organisms
(actually, orthologs)
• Human vs. Mouse vs. Yeast
  - Much easier to do experiments on yeast to determine function
  - Often, function of an ortholog in at least one organism is known
Best Sequence Similarity Matches to Date Between Positionally Cloned
Human Genes and S. cerevisiae Proteins
Human Disease | MIM # | Human Gene | GenBank Acc# for Human cDNA | BLASTX P-value | Yeast Gene | GenBank Acc# for Yeast cDNA | Yeast Gene Description
--------------------------------------------------------------------------------
Hereditary Non-polyposis Colon Cancer | 120436 | MSH2 | U03911 | 9.2e-261 | MSH2 | M84170 | DNA repair protein
Hereditary Non-polyposis Colon Cancer | 120436 | MLH1 | U07418 | 6.3e-196 | MLH1 | U07187 | DNA repair protein
Cystic Fibrosis | 219700 | CFTR | M28668 | 1.3e-167 | YCF1 | L35237 | Metal resistance protein
Wilson Disease | 277900 | WND | U11700 | 5.9e-161 | CCC2 | L36317 | Probable copper transporter
Glycerol Kinase Deficiency | 307030 | GK | L13943 | 1.8e-129 | GUT1 | X69049 | Glycerol kinase
Bloom Syndrome | 210900 | BLM | U39817 | 2.6e-119 | SGS1 | U22341 | Helicase
Adrenoleukodystrophy, X-linked | 300100 | ALD | Z21876 | 3.4e-107 | PXA1 | U17065 | Peroxisomal ABC transporter
Ataxia Telangiectasia | 208900 | ATM | U26455 | 2.8e-90 | TEL1 | U31331 | PI3 kinase
Amyotrophic Lateral Sclerosis | 105400 | SOD1 | K00065 | 2.0e-58 | SOD1 | J03279 | Superoxide dismutase
Myotonic Dystrophy | 160900 | DM | L19268 | 5.4e-53 | YPK1 | M21307 | Serine/threonine protein kinase
Lowe Syndrome | 309000 | OCRL | M88162 | 1.2e-47 | YIL002C | Z47047 | Putative IPP-5-phosphatase
Neurofibromatosis, Type 1 | 162200 | NF1 | M89914 | 2.0e-46 | IRA2 | M33779 | Inhibitory regulator protein
Choroideremia | 303100 | CHM | X78121 | 2.1e-42 | GDI1 | S69371 | GDP dissociation inhibitor
Diastrophic Dysplasia | 222600 | DTD | U14528 | 7.2e-38 | SUL1 | X82013 | Sulfate permease
Lissencephaly | 247200 | LIS1 | L13385 | 1.7e-34 | MET30 | L26505 | Methionine metabolism
Thomsen Disease | 160800 | CLC1 | Z25884 | 7.9e-31 | GEF1 | Z23117 | Voltage-gated chloride channel
Wilms Tumor | 194070 | WT1 | X51630 | 1.1e-20 | FZF1 | X67787 | Sulphite resistance protein
Achondroplasia | 100800 | FGFR3 | M58051 | 2.0e-18 | IPL1 | U07163 | Serine/threonine protein kinase
Menkes Syndrome | 309400 | MNK | X69208 | 2.1e-17 | CCC2 | L36317 | Probable copper transporter
Modified from Mark Gerstein
40
Comparative Genomics
Genome/Transcriptome/Proteome/Metabolome
Databases, statistics
• Occurrence of specific
genes or features in a
genome
  - How many kinases in yeast?
• Compare Tissues
  - Which proteins are expressed in
    cancer vs normal tissues?
• Diagnostic tools
• Drug target discovery
Modified from Mark Gerstein
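A question like "How many kinases in yeast?" comes down to counting annotated features in a database. The sketch below applies a keyword search to a small excerpt of the yeast gene descriptions from the human/yeast homolog table earlier in these slides. The excerpt and the function name are illustrative only. A real answer needs the complete genome annotation, and free-text keyword matching is a crude stand-in for curated functional classes.

```python
# Excerpt of yeast genes and descriptions from the homolog table
# (not the whole yeast genome).
yeast_annotations = {
    "MSH2": "DNA repair protein",
    "YCF1": "Metal resistance protein",
    "CCC2": "Probable copper transporter",
    "GUT1": "Glycerol kinase",
    "SGS1": "Helicase",
    "TEL1": "PI3 kinase",
    "SOD1": "Superoxide dismutase",
    "YPK1": "Serine/threonine protein kinase",
    "IPL1": "Serine/threonine protein kinase",
}


def genes_matching(annotations: dict, keyword: str) -> list:
    """Return gene names whose description contains keyword (case-insensitive)."""
    keyword = keyword.lower()
    return sorted(gene for gene, description in annotations.items()
                  if keyword in description.lower())


kinases = genes_matching(yeast_annotations, "kinase")
print(len(kinases), "of", len(yeast_annotations), "excerpted genes:", kinases)
```

The same counting logic, run over two expression data sets instead of one annotation table, is the starting point for comparing cancer and normal tissues.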
41
Molecular Recognition:
Analyzing & Predicting Macromolecular Interfaces
(in DNA, RNA & protein complexes)
Drena Dobbs, GDCB
Jae-Hyung Lee
Michael Terribilini
Jeff Sander
Pete Zaback
Vasant Honavar, Com S
Feihong Wu
Cornelia Caragea
Robert Jernigan, BBMB
Taner Sen
Andrzej Kloczkowski
Kai-Ming Ho, Physics
42
Designing Zinc Finger DNA-binding proteins to
recognize specific sites in genomic DNA
Drena Dobbs, GDCB
Jeff Sander
Pete Zaback
Dan Voytas, GDCB
Fengli Fu
Les Miller, ComS
Vasant Honavar, ComS
Keith Joung, Harvard
Structure & function of human telomerase:
Predicting structure & functional sites in a clinically
important but "recalcitrant" RNP
Cell Biologist:
www.intl-pag.org/
Biochemist:
www.chemicon.com
Imagined structure:
Lingner et al (1997) Science 276: 561-567.
How would a systems biologist study telomerase?
44
Resources for Bioinformatics &
Computational Biology
• Wikipedia: Bioinformatics
• NCBI - National Center for Biotechnology Information
• ISCB - International Society for Computational Biology
• JCB - Jena Center for Bioinformatics
• UBC - Bioinformatics Links Directory
45
ISU Resources & Experts
ISU Research Centers & Graduate Training Programs:
BCB - Bioinformatics & Computational Biology
Baker Center - Bioinformatics & Biological Statistics
CIAG - Center for Integrated Animal Genomics
CILD - Computational Intelligence, Learning &
Discovery
ISU Facilities:
Biotech - Instrumentation Facilities
CIAG - Center for Integrated Animal Genomics
PSI - Plant Sciences Institute
PSI Centers
46
For fun: DNA Interactive: "Genomes"
A tutorial on genomic sequencing, gene structure,
gene prediction
Howard Hughes Medical Institute (HHMI)
Cold Spring Harbor Laboratory (CSHL)
http://www.dnai.org/c/index.html
47