Bioinformatics: Overview


Mark Gerstein, Yale University
gersteinlab.org/courses/452
(last edit in fall 2005)
1 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu
BIOINFORMATICS
Introduction
Biological
Data
+
Computer
Calculations
Bioinformatics
What is Bioinformatics?
Core
• (Molecular) Bio - informatics
• One idea for a definition?
Bioinformatics is conceptualizing biology in terms of
molecules (in the sense of physical-chemistry) and
then applying “informatics” techniques (derived
from disciplines such as applied math, CS, and
statistics) to understand and organize the
information associated with these molecules, on a
large scale.
• Bioinformatics is a practical discipline with many
applications.
What is the Information?
Molecular Biology as an Information Science
• Central Dogma of Molecular Biology
 DNA -> RNA -> Protein -> Phenotype -> DNA
• Molecules
 Sequence, Structure, Function
• Processes
 Mechanism, Specificity, Regulation
• Central Paradigm for Bioinformatics
 Genomic Sequence Information -> mRNA (level) -> Protein Sequence -> Protein Structure -> Protein Function -> Phenotype
• Large Amounts of Information
 Standardized
 Statistical
• Most cellular functions are performed or facilitated by proteins.
• Primary biocatalyst
• Cofactor transport/storage
• Mechanical motion/support
• Genetic material
• Information transfer (mRNA)
• Protein synthesis (tRNA/mRNA)
• Some catalytic activity
• Immune protection
• Control of growth/differentiation
(idea from D Brutlag, Stanford, graphics from S Strobel)
Molecular Biology Information - DNA
 Coding or Not?
 Parse into genes?
 4 bases: AGCT
 ~1 K in a gene,
~2 M in genome
 ~3 Gb Human
atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgca
gcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatac
atggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtg
aaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatcca
gcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattc
ttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaact
ggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgca
ggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgt
gttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact
gcaactcaaaaaactgtggatggtccatcagctaaagactggcgcggcggccgcggtgca
tcacaaaacatcattccatcttcaacaggtgcagcgaaagcagtaggtaaagtattacct
gcattaaacggtaaattaactggtatggctttccgtgttccaacgccaaacgtatctgtt
gttgatttaacagttaatcttgaaaaaccagcttcttatgatgcaatcaaacaagcaatc
aaagatgcagcggaaggtaaaacgttcaatggcgaattaaaaggcgtattaggttacact
gaagatgctgttgtttctactgacttcaacggttgtgctttaacttctgtatttgatgca
gacgctggtatcgcattaactgattctttcgttaaattggtatc . . .
. . .
caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaa
caacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtgg
cgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgctt
gctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgg
gttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact
acaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaacc
aatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtc
ggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaa
aaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg
• Raw DNA Sequence
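The questions on this slide (coding or not, how to parse into genes) start from treating the raw sequence as a string over the 4-base alphabet. A minimal sketch, assuming the standard genetic code and reading frame 1; the function names are illustrative and not from the lecture:

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def gc_content(dna: str) -> float:
    """Fraction of G and C bases in a DNA string."""
    counts = Counter(dna.upper())
    return (counts["G"] + counts["C"]) / len(dna)


def translate(dna: str) -> str:
    """Translate reading frame 1; trailing bases that do not fill a codon are ignored."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


# First 21 bases of the sequence shown on the slide.
fragment = "atggcaattaaaattggtatc"
print(translate(fragment))
print(round(gc_content(fragment), 3))
```

Real gene finding needs much more than this (all six frames, start/stop signals, statistics), but the string view is the starting point.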
Molecular Biology Information:
Protein Sequence
 ACDEFGHIKLMNPQRSTVWY
but not BJOUXZ
• Strings of ~300 aa in an average protein (in bacteria),
~200 aa in a domain
• ~200 K known protein sequences
d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI
d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI
d4dfra_ ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTL--------NKPVIMGRHTWESI
d3dfr__ TAFLWAQDRDGLIGKDGHLPWH-LPDDLHYFRAQTV--------GKIMVVGRRTYESF

d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI
d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI
d4dfra_ ISLIAALAVDRVIGMENAMPW-NLPADLAWFKRNTLD--------KPVIMGRHTWESI
d3dfr__ TAFLWAQDRNGLIGKDGHLPW-HLPDDLHYFRAQTVG--------KIMVVGRRTYESF

d1dhfa_ VPEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP
d8dfr__ VPEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP
d4dfra_ ---G-RPLPGRKNIILS-SQPGTDDRV-TWVKSVDEAIAACGDVP------EIMVIGGGRVYEQFLPKA
d3dfr__ ---PKRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLDQ----ELVIAGGAQIFTAFKDDV

d1dhfa_ -PEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP
d8dfr__ -PEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP
d4dfra_ -G---RPLPGRKNIILSSSQPGTDDRV-TWVKSVDEAIAACGDVPE-----.IMVIGGGRVYEQFLPKA
d3dfr__ -P--KRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLD----QELVIAGGAQIFTAFKDDV
• 20 letter alphabet
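One simple number to read off aligned rows like those on this slide is percent identity. A minimal sketch, assuming columns with a gap in either row are skipped (other conventions exist); the function name is illustrative:

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent of identical residues over columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    compared = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not compared:
        return 0.0
    matches = sum(x == y for x, y in compared)
    return 100.0 * matches / len(compared)


# Rows copied from the first alignment block on the slide.
d1dhfa = "LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI"
d8dfr = "LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI"
d4dfra = "ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTL--------NKPVIMGRHTWESI"

print(round(percent_identity(d1dhfa, d8dfr), 1))   # close pair
print(round(percent_identity(d1dhfa, d4dfra), 1))  # more distant pair
```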
Molecular Biology Information:
Macromolecular Structure
• DNA/RNA/Protein
 Almost all protein
(RNA Adapted From D Soll Web Page,
Right Hand Top Protein from M Levitt web page)
Molecular Biology Information:
Protein Structure Details
• Statistics on Number of XYZ triplets
 200 residues/domain -> 200 CA atoms, separated by 3.8 A
 Avg. Residue is Leu: 4 backbone atoms + 4 sidechain atoms, 150 cubic A
• => ~1500 xyz triplets (=8x200) per protein domain
 10 K known domain, ~300 folds
ATOM      1  C   ACE     0       9.401  30.166  60.595  1.00 49.88      1GKY  67
ATOM      2  O   ACE     0      10.432  30.832  60.722  1.00 50.35      1GKY  68
ATOM      3  CH3 ACE     0       8.876  29.767  59.226  1.00 50.04      1GKY  69
ATOM      4  N   SER     1       8.753  29.755  61.685  1.00 49.13      1GKY  70
ATOM      5  CA  SER     1       9.242  30.200  62.974  1.00 46.62      1GKY  71
ATOM      6  C   SER     1      10.453  29.500  63.579  1.00 41.99      1GKY  72
ATOM      7  O   SER     1      10.593  29.607  64.814  1.00 43.24      1GKY  73
ATOM      8  CB  SER     1       8.052  30.189  63.974  1.00 53.00      1GKY  74
ATOM      9  OG  SER     1       7.294  31.409  63.930  1.00 57.79      1GKY  75
ATOM     10  N   ARG     2      11.360  28.819  62.827  1.00 36.48      1GKY  76
ATOM     11  CA  ARG     2      12.548  28.316  63.532  1.00 30.20      1GKY  77
ATOM     12  C   ARG     2      13.502  29.501  63.500  1.00 25.54      1GKY  78
...
ATOM   1444  CB  LYS   186      13.836  22.263  57.567  1.00 55.06      1GKY1510
ATOM   1445  CG  LYS   186      12.422  22.452  58.180  1.00 53.45      1GKY1511
ATOM   1446  CD  LYS   186      11.531  21.198  58.185  1.00 49.88      1GKY1512
ATOM   1447  CE  LYS   186      11.452  20.402  56.860  1.00 48.15      1GKY1513
ATOM   1448  NZ  LYS   186      10.735  21.104  55.811  1.00 48.41      1GKY1514
ATOM   1449  OXT LYS   186      16.887  23.841  56.647  1.00 62.94      1GKY1515
TER    1450      LYS   186                                              1GKY1516
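The "CA atoms separated by 3.8 A" figure can be checked against the 1GKY excerpt on this slide. A minimal sketch, assuming whitespace-separated fields (adequate for these two records, though real PDB files use fixed-width columns); the helper names are illustrative:

```python
import math

# The two CA records (SER 1 and ARG 2) from the 1GKY excerpt, trailing ID fields omitted.
records = [
    "ATOM 5 CA SER 1 9.242 30.200 62.974 1.00 46.62",
    "ATOM 11 CA ARG 2 12.548 28.316 63.532 1.00 30.20",
]


def parse_atom(line: str) -> dict:
    """Split one ATOM record into named fields (whitespace-delimited assumption)."""
    f = line.split()
    return {
        "serial": int(f[1]),
        "name": f[2],
        "residue": f[3],
        "resnum": int(f[4]),
        "xyz": tuple(float(v) for v in f[5:8]),
    }


def distance(p, q) -> float:
    """Euclidean distance between two xyz triplets, in Angstroms."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


ca1, ca2 = (parse_atom(r) for r in records)
print(round(distance(ca1["xyz"], ca2["xyz"]), 2))  # consecutive CA-CA distance

# Slide arithmetic: 200 residues x 8 atoms per average residue.
print(200 * 8)  # 1600 triplets, which the slide rounds to ~1500
```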
Molecular Biology Information:
Whole Genomes
• The Revolution Driving Everything
Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F.,
Kerlavage, A. R., Bult, C. J., Tomb, J. F., Dougherty, B. A., Merrick, J. M., McKenney, K.,
Sutton, G., Fitzhugh, W., Fields, C., Gocayne, J. D., Scott, J., Shirley, R., Liu, L. I., Glodek, A.,
Kelley, J. M., Weidman, J. F., Phillips, C. A., Spriggs, T., Hedblom, E., Cotton, M. D.,
Utterback, T. R., Hanna, M. C., Nguyen, D. T., Saudek, D. M., Brandon, R. C., Fine, L. D.,
Fritchman, J. L., Fuhrmann, J. L., Geoghagen, N. S. M., Gnehm, C. L., McDonald, L. A.,
Small, K. V., Fraser, C. M., Smith, H. O. &
Venter, J. C. (1995). "Whole-genome random sequencing and assembly of Haemophilus influenzae Rd."
Science 269: 496-512.
(Picture adapted from TIGR website, http://www.tigr.org)
• Integrative Data
1995, HI (bacteria): 1.6 Mb & 1600 genes done
1997, yeast: 13 Mb & ~6000 genes for yeast
1998, worm: ~100Mb with 19 K genes
1999: >30 completed genomes!
2003, human: 3 Gb & 100 K genes...
"Genome sequences now accumulate so quickly that, in less than a week, a single
laboratory can produce more bits of data than Shakespeare managed in a
lifetime, although the latter make better reading."
-- G A Petsko, Nature 401: 115-116 (1999)
Genomes highlight the Finiteness of the “Parts” in Biology
1995: Bacteria, 1.6 Mb, ~1600 genes [Science 269: 496]
1997: Eukaryote, 13 Mb, ~6K genes [Nature 387: 1]
1998: Animal, ~100 Mb, ~20K genes [Science 282: 1945]
2000?: Human, ~3 Gb, ~100K genes [???]
(figure labels: ‘98 spoof; real thing, Apr ‘00)
Gene Expression Datasets: the Transcriptome
Young/Lander, Chips, Abs. Exp.
Brown, microarray, Rel. Exp. over Timecourse
Also: SAGE; Samson and Church, Chips; Aebersold, Protein Expression
Snyder, Transposons, Protein Exp.
Array Data
Yeast Expression Data in Academia: levels for all 6000 genes!
Can only sequence genome once but can do an infinite variety of these array experiments
at 10 time points, 6000 x 10 = 60K floats
telling signal from background
(courtesy of J Hager)
Other Whole-Genome Experiments
Systematic Knockouts
Winzeler, E. A., Shoemaker, D. D., Astromoff, A., Liang, H., Anderson, K.,
Andre, B., Bangham, R., Benito, R., Boeke, J. D., Bussey, H., Chu, A. M.,
Connelly, C., Davis, K., Dietrich, F., Dow, S. W., El Bakkoury, M., Foury, F.,
Friend, S. H., Gentalen, E., Giaever, G., Hegemann, J. H., Jones, T., Laub, M.,
Liao, H., Davis, R. W. & et al. (1999). Functional characterization of the
S. cerevisiae genome by gene deletion and parallel analysis. Science 285, 901-6
2 hybrids, linkage maps
Hua, S. B., Luo, Y., Qiu, M., Chan, E., Zhou, H. & Zhu, L. (1998).
Construction of a modular yeast two-hybrid cDNA library from human EST clones
for the human genome protein linkage map. Gene 215, 143-52
For yeast: 6000 x 6000 / 2 ~ 18M interactions
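The "6000 x 6000 / 2 ~ 18M" estimate for yeast is a count of unordered protein pairs. A minimal sketch of the exact arithmetic; the function name and the `include_self` switch are illustrative assumptions:

```python
def unordered_pairs(n: int, include_self: bool = False) -> int:
    """Number of unordered pairs among n proteins, optionally counting self-pairs."""
    return n * (n + 1) // 2 if include_self else n * (n - 1) // 2


n_genes = 6000
print(unordered_pairs(n_genes))                     # distinct pairs, no self-interactions
print(unordered_pairs(n_genes, include_self=True))  # allowing homodimer-style self-pairs
print(n_genes * n_genes // 2)                       # the slide's round-number approximation
```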
Molecular Biology Information:
Other Integrative Data
• Information to understand genomes
 Metabolic Pathways (glycolysis), traditional biochemistry
 Regulatory Networks
 Whole Organisms Phylogeny, traditional zoology
 Environments, Habitats, ecology
 The Literature (MEDLINE)
• The Future....
(Pathway drawing from P Karp’s EcoCyc, Phylogeny
from S J Gould, Dinosaur in a Haystack)
Large-scale Information:
GenBank Growth
GenBank Data
Year    Base Pairs    Sequences
1982    680338        606
1983    2274029       2427
1984    3368765       4175
1985    5204420       5700
1986    9615371       9978
1987    15514776      14584
1988    23800000      20579
1989    34762585      28791
1990    49179285      39533
1991    71947426      55627
1992    101008486     78608
1993    157152442     143492
1994    217102462     215273
1995    384939485     555694
1996    651972984     1021211
1997    1160300687    1765847
1998    2008761784    2837897
1999    3841163011    4864570
2000    8604221980    7077491
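One way to summarize the GenBank growth figures is an average doubling time. A minimal sketch, assuming smooth exponential growth between the first (1982) and last (2000) entries of the table, which is only an approximation; the function name is illustrative:

```python
import math


def doubling_time(v_start: float, v_end: float, years: float) -> float:
    """Average doubling time in years, assuming exponential growth from v_start to v_end."""
    return years * math.log(2) / math.log(v_end / v_start)


# First and last rows of the GenBank table.
bp_1982, bp_2000 = 680_338, 8_604_221_980
seq_1982, seq_2000 = 606, 7_077_491

print(round(doubling_time(bp_1982, bp_2000, 2000 - 1982), 2))    # base pairs
print(round(doubling_time(seq_1982, seq_2000, 2000 - 1982), 2))  # sequence entries
```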
Large-scale Information:
Exponential Growth of Data Matched by
Development of Computer Technology
• CPU vs Disk & Net
 As important as the increase in computer speed has been, the ability to store large amounts of information on computers is even more crucial
• Driving Force in Bioinformatics
[Figures: Internet Hosts; Num. Protein Domain Structures (Structures in PDB), 1979-1995; CPU Instruction Time (ns), 1980-1995]
(Internet picture adapted from D Brutlag, Stanford)
Number of Papers
PubMed publications with title “microarray”
[Figure: papers per year and cumulative, 1998-2004]
Features per Slide
[Figure: features per chip, transistors vs. oligo features]
Bioinformatics is born!
(courtesy of Finn Drablos)
Weber Cartoon
Organizing Molecular Biology Information:
Redundancy and Multiplicity
Core
• Different Sequences Have the Same Structure
• Organism has many similar genes
• Single Gene May Have Multiple Functions
• Genes are grouped into Pathways
• Genomic Sequence Redundancy due to the Genetic Code
• How do we find the similarities?.....
Integrative Genomics: genes <-> structures <-> functions <-> pathways <-> expression levels <-> regulatory systems <-> ….
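The redundancy due to the genetic code can be made concrete by grouping the 64 codons by the amino acid they encode. A minimal sketch, assuming the standard genetic code; the variable names are illustrative:

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

codons_for = defaultdict(list)
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    codons_for[aa].append("".join(codon))

# Leu is encoded by six codons, Met and Trp by one each.
for aa in "LMW":
    print(aa, len(codons_for[aa]), codons_for[aa])
```

Different DNA strings can therefore encode the same protein, which is one reason comparisons are often done at the protein level.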
Molecular Parts = Conserved
Domains, Folds, &c
Extra
A Parts List Approach to Bike Maintenance
Extra
How many roles can these play?
How flexible and adaptable are they mechanically?
What are the shared parts (bolt, nut, washer, spring, bearing), unique parts (cogs, levers)?
What are the common parts - types of parts (nuts & washers)?
Where are the parts located?
Vast Growth in (Structural) Data...
but number of Fundamentally New (Fold) Parts Not Increasing that Fast
[Figure legend: Total in Databank; New Submissions; New Folds]
Global Surveys of a Finite Set of Parts from Many Perspectives
World of Structures is even more Finite, providing a valuable simplification
[Figure: numbered parts lists (1, 2, 3, … 20, …); labels: ~100000 genes (human), ~1000 genes (T. pallidum), ~1000 folds]
Same logic for pathways, functions, sequence families, blocks, motifs....
Functions picture from www.fruitfly.org/~suzi (Ashburner); Pathways picture from
ecocyc.pangeasystems.com/ecocyc (Karp, Riley). Related resources: COGS, ProDom,
Pfam, Blocks, Domo, WIT, CATH, Scop....
General Types of “Informatics” techniques in Bioinformatics
• Databases
 Building, Querying
 Object DB
• Text String Comparison
 Text Search
 1D Alignment
 Significance Statistics
 Alta Vista, grep
• Finding Patterns
 AI / Machine Learning
 Clustering
 Datamining
• Geometry
 Robotics
 Graphics (Surfaces, Volumes)
 Comparison and 3D Matching (Vision, recognition)
• Physical Simulation
 Newtonian Mechanics
 Electrostatics
 Numerical Algorithms
 Simulation
Bioinformatics as New Paradigm for
Scientific Computing
Core
• Physics
 Prediction based on physical principles
 EX: Exact Determination of Rocket Trajectory
 Emphasizes: Supercomputer, CPU
• Biology
 Classifying information and discovering unexpected relationships
 EX: Gene Expression Network
 Emphasizes: networks, “federated” database
Bioinformatics, Genomic Surveys
vs.
Chemical Understanding, Mechanism, Molecular Biology
Statistical Analysis
vs.
Classical Physics
Bioinformatics Topics -- Genome Sequence
• Finding Genes in Genomic DNA
 introns
 exons
 promoters
• Characterizing Repeats in Genomic DNA
 Statistics
 Patterns
• Duplications in the Genome
 Large scale genomic alignment
Bioinformatics Topics -- Protein Sequence
• Sequence Alignment
 non-exact string matching, gaps
 How to align two strings optimally via Dynamic Programming
 Local vs Global Alignment
 Suboptimal Alignment
 Hashing to increase speed (BLAST, FASTA)
 Amino acid substitution scoring matrices
• Multiple Alignment and Consensus Patterns
 How to align more than one sequence and then fuse the result in a consensus representation
 Transitive Comparisons
 HMMs, Profiles
 Motifs
• Scoring schemes and Matching statistics
 How to tell if a given alignment or match is statistically significant
 A P-value (or an e-value)?
 Score Distributions (extreme val. dist.)
 Low Complexity Sequences
• Evolutionary Issues
 Rates of mutation and change
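The "align two strings optimally via Dynamic Programming" topic can be sketched with a global (Needleman-Wunsch style) score calculation. This is a minimal sketch: the match, mismatch, and gap values are placeholder assumptions, whereas real protein alignment would use a substitution matrix as listed on the slide.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b with a linear gap penalty."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "ACGT"))
print(global_alignment_score("ACGT", "AGT"))
```

Keeping only two rows gives the score in memory proportional to one sequence length; recovering the alignment itself needs the full table or a traceback structure.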
Bioinformatics Topics -- Sequence / Structure
• Secondary Structure “Prediction”
 via Propensities
 Neural Networks, Genetic Alg.
 Simple Statistics
 TM-helix finding
 Assessing Secondary Structure Prediction
• Tertiary Structure Prediction
 Fold Recognition
 Threading
 Ab initio
• Function Prediction
 Active site identification
• Structure Prediction: Protein v RNA
• Relation of Sequence Similarity to Structural Similarity
Topics -- Structures
• Basic Protein Geometry and Least-Squares Fitting
 Distances, Angles, Axes, Rotations
   • Calculating a helix axis in 3D via fitting a line
 LSQ fit of 2 structures
 Molecular Graphics
• Calculation of Volume and Surface
 How to represent a plane
 How to represent a solid
 How to calculate an area
 Docking and Drug Design as Surface Matching
 Packing Measurement
• Structural Alignment
 Aligning sequences on the basis of 3D structure.
 DP does not converge, unlike sequences, what to do?
 Other Approaches: Distance Matrices, Hashing
 Fold Library
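The "Distances, Angles" part of basic protein geometry reduces to a few vector operations on xyz triplets. A minimal sketch in plain Python; the helper names are illustrative, and the sign of the torsion angle depends on convention, so it should be checked against a reference before use:

```python
import math


def sub(p, q):
    return tuple(a - b for a, b in zip(p, q))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def norm(u):
    return math.sqrt(dot(u, u))


def angle(p, q, r):
    """Angle p-q-r in degrees, with q at the vertex."""
    u, v = sub(p, q), sub(r, q)
    cos_t = max(-1.0, min(1.0, dot(u, v) / (norm(u) * norm(v))))
    return math.degrees(math.acos(cos_t))


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the p1-p2 axis (0 = cis, 180 = trans)."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    m1 = cross(n1, tuple(c / norm(b2) for c in b2))
    return math.degrees(math.atan2(dot(m1, n2), dot(n1, n2)))


print(angle((1, 0, 0), (0, 0, 0), (0, 1, 0)))
print(abs(dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0))))
```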
Topics -- Databases
• Relational Database Concepts and how they interface with Biological Information
 Keys, Foreign Keys
 SQL, OODBMS, views, forms, transactions, reports, indexes
 Joining Tables, Normalization
   • Natural Join as "where" selection on cross product
   • Array Referencing (perl/dbm)
 Forms and Reports
 Cross-tabulation
• Protein Units?
 What are the units of biological information?
   • sequence, structure
   • motifs, modules, domains
 How classified: folds, motions, pathways, functions?
• Clustering and Trees
 Basic clustering
   • UPGMA
   • single-linkage
   • multiple linkage
 Other Methods
   • Parsimony, Maximum likelihood
 Evolutionary implications
• Visualization of Large Amounts of Information
• The Bias Problem
 sequence weighting
 sampling
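The statement that a natural join is a "where" selection on a cross product can be checked directly. A minimal sketch using Python's built-in sqlite3 module; the `protein` and `fold` tables and their rows are hypothetical, made up only for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE protein (protein_id TEXT PRIMARY KEY, organism TEXT);
    CREATE TABLE fold (protein_id TEXT REFERENCES protein(protein_id), fold_name TEXT);
    INSERT INTO protein VALUES ('p1', 'yeast'), ('p2', 'worm'), ('p3', 'yeast');
    INSERT INTO fold VALUES ('p1', 'TIM barrel'), ('p2', 'globin'), ('p1', 'Rossmann');
""")

# Natural join: rows are matched on the shared column name (protein_id).
natural = con.execute(
    "SELECT protein_id, organism, fold_name "
    "FROM protein NATURAL JOIN fold ORDER BY protein_id, fold_name"
).fetchall()

# The same result as a WHERE selection on the full cross product.
cross_where = con.execute(
    "SELECT protein.protein_id, organism, fold_name "
    "FROM protein, fold WHERE protein.protein_id = fold.protein_id "
    "ORDER BY protein.protein_id, fold_name"
).fetchall()

print(natural == cross_where)
print(natural)
```

Here `protein_id` is the key of `protein` and a foreign key in `fold`; protein `p3` has no fold row, so it drops out of both results.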
Topics -- Genomics
• Expression Analysis
 Time Courses clustering
 Measuring differences
 Identifying Regulatory Regions
• Large scale cross referencing of information
• Function Classification and Orthologs
• The Genomic vs. Single-molecule Perspective
• Genome Comparisons
 Ortholog Families, pathways
 Large-scale censuses
 Frequent Words Analysis
 Genome Annotation
 Trees from Genomes
 Identification of interacting proteins
• Structural Genomics
 Folds in Genomes, shared & common folds
 Bulk Structure Prediction
• Genome Trees
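Clustering of expression time courses can be sketched with single-linkage agglomerative clustering. A minimal sketch: the gene names and expression values below are made up for illustration, and Euclidean distance is one assumed choice among several (correlation-based distances are also common).

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def single_linkage(profiles: dict, n_clusters: int) -> list:
    """Merge the two closest clusters until n_clusters remain.

    Cluster-to-cluster distance is the distance between their closest members.
    """
    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters.pop(j)  # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return clusters


# Hypothetical expression levels at 4 time points.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [1.1, 2.1, 2.9, 4.2],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [3.9, 3.1, 2.2, 0.8],
}
print(single_linkage(profiles, 2))
```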
Topics -- Simulation
• Molecular Simulation
 Geometry -> Energy -> Forces
 Basic interactions, potential energy functions
 Electrostatics
 VDW Forces
 Bonds as Springs
 How structure changes over time?
   • How to measure the change in a vector (gradient)
 Molecular Dynamics & MC
 Energy Minimization
• Parameter Sets
• Number Density
• Poisson-Boltzmann Equation
• Lattice Models and Simplification
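The chain "Geometry -> Energy -> Forces" with "Bonds as Springs" can be illustrated for a single bond: a harmonic energy, its negative gradient as the force, and steepest-descent energy minimization. A minimal sketch; the spring constant, equilibrium length, and step size are hypothetical values in arbitrary units, not a real force-field parameter set. It uses `math.dist`, available from Python 3.8.

```python
import math

K_BOND = 100.0  # spring constant (hypothetical, arbitrary units)
R0 = 1.5        # equilibrium bond length (hypothetical)


def bond_energy(p, q):
    """Harmonic bond energy E = 0.5 * k * (r - r0)^2."""
    r = math.dist(p, q)
    return 0.5 * K_BOND * (r - R0) ** 2


def bond_force_on_p(p, q):
    """Force on atom p, i.e. minus the gradient of E with respect to p."""
    r = math.dist(p, q)
    scale = -K_BOND * (r - R0) / r
    return tuple(scale * (a - b) for a, b in zip(p, q))


def minimize(p, q, step=0.001, n_steps=200):
    """Steepest descent: move each atom a small step along the force acting on it."""
    for _ in range(n_steps):
        fp = bond_force_on_p(p, q)  # the force on q is equal and opposite
        p = tuple(a + step * f for a, f in zip(p, fp))
        q = tuple(b - step * f for b, f in zip(q, fp))
    return p, q


p, q = minimize((0.0, 0.0, 0.0), (2.5, 0.0, 0.0))
print(round(math.dist(p, q), 4), round(bond_energy(p, q), 8))
```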
Bioinformatics
Spectrum
Major Application I:
Designing Drugs
Core
• Understanding How Structures Bind Other Molecules (Function)
• Designing Inhibitors
• Docking, Structure Modeling
(From left to right, figures adapted from Olsen Group Docking Page at Scripps, Dyson NMR Group Web page at Scripps, and from
Computational Chemistry Page at Cornell Theory Center).
Major Application II: Finding Homologs
Core
Major Application II:
Finding Homologues
• Find Similar Ones in Different Organisms
• Human vs. Mouse vs. Yeast
 Easier to do Expts. on latter!
(Section from NCBI Disease Genes Database Reproduced Below.)
Best Sequence Similarity Matches to Date Between Positionally Cloned
Human Genes and S. cerevisiae Proteins

Human Disease                          MIM #   Human  GenBank Acc#    BLASTX    Yeast    GenBank Acc#    Yeast Gene Description
                                               Gene   (Human cDNA)    P-value   Gene     (Yeast cDNA)
Hereditary Non-polyposis Colon Cancer  120436  MSH2   U03911          9.2e-261  MSH2     M84170          DNA repair protein
Hereditary Non-polyposis Colon Cancer  120436  MLH1   U07418          6.3e-196  MLH1     U07187          DNA repair protein
Cystic Fibrosis                        219700  CFTR   M28668          1.3e-167  YCF1     L35237          Metal resistance protein
Wilson Disease                         277900  WND    U11700          5.9e-161  CCC2     L36317          Probable copper transporter
Glycerol Kinase Deficiency             307030  GK     L13943          1.8e-129  GUT1     X69049          Glycerol kinase
Bloom Syndrome                         210900  BLM    U39817          2.6e-119  SGS1     U22341          Helicase
Adrenoleukodystrophy, X-linked         300100  ALD    Z21876          3.4e-107  PXA1     U17065          Peroxisomal ABC transporter
Ataxia Telangiectasia                  208900  ATM    U26455          2.8e-90   TEL1     U31331          PI3 kinase
Amyotrophic Lateral Sclerosis          105400  SOD1   K00065          2.0e-58   SOD1     J03279          Superoxide dismutase
Myotonic Dystrophy                     160900  DM     L19268          5.4e-53   YPK1     M21307          Serine/threonine protein kinase
Lowe Syndrome                          309000  OCRL   M88162          1.2e-47   YIL002C  Z47047          Putative IPP-5-phosphatase
Neurofibromatosis, Type 1              162200  NF1    M89914          2.0e-46   IRA2     M33779          Inhibitory regulator protein
Choroideremia                          303100  CHM    X78121          2.1e-42   GDI1     S69371          GDP dissociation inhibitor
Diastrophic Dysplasia                  222600  DTD    U14528          7.2e-38   SUL1     X82013          Sulfate permease
Lissencephaly                          247200  LIS1   L13385          1.7e-34   MET30    L26505          Methionine metabolism
Thomsen Disease                        160800  CLC1   Z25884          7.9e-31   GEF1     Z23117          Voltage-gated chloride channel
Wilms Tumor                            194070  WT1    X51630          1.1e-20   FZF1     X67787          Sulphite resistance protein
Achondroplasia                         100800  FGFR3  M58051          2.0e-18   IPL1     U07163          Serine/threonine protein kinase
Menkes Syndrome                        309400  MNK    X69208          2.1e-17   CCC2     L36317          Probable copper transporter
Major Application II:
Finding Homologues (cont.)
• Cross-Referencing, one thing to another thing
• Sequence Comparison and Scoring
• Analogous Problems for Structure Comparison
• Comparison has two parts:
(1) Optimally Aligning 2 entities to get a Comparison Score
(2) Assessing Significance of this score in a given Context
• Integrated Presentation
 Align Sequences
 Align Structures
 Score in a Uniform Framework
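The second part of a comparison, assessing the significance of a score in context, can be illustrated with a shuffling test. A minimal sketch: the ungapped identity count is a stand-in for a real alignment score, and the shuffle count and seed are arbitrary assumptions. Analytic approaches based on score distributions are the usual alternative in practice.

```python
import random


def identity_score(a, b) -> int:
    """Ungapped score: number of positions where the two sequences agree."""
    return sum(x == y for x, y in zip(a, b))


def empirical_p_value(a: str, b: str, n_shuffles: int = 500, seed: int = 0) -> float:
    """Fraction of shuffled versions of b that score at least as well as b itself.

    Shuffling keeps the composition of b but destroys its order, giving a
    background distribution of scores for unrelated sequences.
    """
    rng = random.Random(seed)
    observed = identity_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if identity_score(a, letters) >= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)  # add-one so the estimate is never exactly 0


# First 44 aligned residues of d1dhfa_ and d8dfr__ from the protein sequence slide.
seq1 = "LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ"
seq2 = "LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ"
print(identity_score(seq1, seq2), empirical_p_value(seq1, seq2))
```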
Major Application III:
Overall Genome Characterization
Core
• Overall Occurrence of a Certain Feature in the Genome
 e.g. how many kinases in Yeast
• Compare Organisms and Tissues
 Expression levels in Cancerous vs Normal Tissues
• Databases, Statistics
(Clock figures, yeast v. Synechocystis,
adapted from GeneQuiz Web Page, Sander Group, EBI)
What do you get from large-scale datamining? Global statistics on the population of proteins
EX-1: Occurrence of functions per fold & interactions per fold over all genomes
EX-2: Occurrence of 1-4 salt bridges in genomes of thermophiles v mesophiles
[Figure: LOD value (0.00 to 0.70) for EK(3) and EK(4) by genome (MP, MG, EC, SC, HP, SS, HI, MT, MJ, AF, AA, OT), ordered by physiological temperature in C; Mesophile: 10 to 45; Thermophile: 65, 85, 83, 95, 98]
(Bioinfo-1)
[next class joins intro & seqs.]
End of class 2005,10.23
Are They or Aren’t They
Bioinformatics? (#1)
• Digital Libraries
 Automated Bibliographic Search of the biological literature and Textual Comparison
 Knowledge bases for biological literature
• Motif Discovery Using Gibbs Sampling
• Methods for Structure Determination
 Computational Crystallography
• Refinement
 NMR Structure Determination
• Distance Geometry
• Metabolic Pathway Simulation
• The DNA Computer
Are They or Aren’t They
Bioinformatics? (#1, Answers)
• (YES?) Digital Libraries
 Automated Bibliographic Search and Textual Comparison
 Knowledge bases for biological literature
• (YES) Motif Discovery Using Gibbs Sampling
• (NO?) Methods for Structure Determination
 Computational Crystallography
• Refinement
 NMR Structure Determination
• (YES) Distance Geometry
• (YES) Metabolic Pathway Simulation
• (NO) The DNA Computer
Are They or Aren’t They
Bioinformatics? (#2)
• Gene identification by sequence inspection
 Prediction of splice sites
• DNA methods in forensics
• Modeling of Populations of Organisms
 Ecological Modeling
• Genomic Sequencing Methods
 Assembling Contigs
 Physical and genetic mapping
• Linkage Analysis
 Linking specific genes to various traits
Are They or Aren’t They
Bioinformatics? (#2, Answers)
• (YES) Gene identification by sequence inspection
 Prediction of splice sites
• (YES) DNA methods in forensics
• (NO) Modeling of Populations of Organisms
 Ecological Modeling
• (NO?) Genomic Sequencing Methods
 Assembling Contigs
 Physical and genetic mapping
• (YES) Linkage Analysis
 Linking specific genes to various traits
Are They or Aren’t They
Bioinformatics? (#3)
• RNA structure prediction
 Identification in sequences
• Radiological Image Processing
 Computational Representations for Human Anatomy (visible human)
• Artificial Life Simulations
 Artificial Immunology / Computer Security
 Genetic Algorithms in molecular biology
• Homology modeling
• Determination of Phylogenies Based on Nonmolecular Organism Characteristics
• Computerized Diagnosis based on Genetic Analysis (Pedigrees)
Are They or Aren’t They
Bioinformatics? (#3, Answers)
• (YES) RNA structure prediction
 Identification in sequences
• (NO) Radiological Image Processing
 Computational Representations for Human Anatomy (visible human)
• (NO) Artificial Life Simulations
 Artificial Immunology / Computer Security
 (NO?) Genetic Algorithms in molecular biology
• (YES) Homology modeling
• (NO) Determination of Phylogenies Based on Nonmolecular Organism Characteristics
• (NO) Computerized Diagnosis based on Genetic Analysis (Pedigrees)
Further Thoughts in 2005 on the
"Boundary of Bioinformatics"
• Issues that were uncovered
 Does topic stand alone?
 Is bioinformatics acting as tool?
 How does it relate to lab work?
• Relationship to other disciplines
 Medical informatics
 Synthetic biology
 Systems biology
• Biological question is important, not the specific technique -- but it has to be computational
 Using computers to understand biology vs using biology to inspire computation
• Some new ones (2005)
 Disease modeling [are you modeling molecules?]
 Enzymology (kinetics and rates?) [is it a simulation or is it interpreting 1 expt.?]
 Genetic algs used in gene finding, HMMs used in gene finding
   • vs. Genetic algs used in speech recognition, HMMs used in speech recognition
 Semantic web used for representing biological information