RISE AND FALL OF GENE FAMILIES Dynamics of Their Expansion
Genomics and Bioinformatics
The "new" biology
What is genomics
Genome
All the DNA contained in the cell of an organism
Genomics
The comprehensive study of the interactions and functional dynamics of whole sets of genes and their products. (NIAAA, NIH)
A "scaled-up" version of genetics research in which scientists can look at all of the genes in a living creature at the same time. (NIGMS, NIH)
Which organism's genome was sequenced first?
Genome sequencing chronology
Year  Organism                   Significance                Genome size (bp)  Number of genes
1977  Bacteriophage φX174        First genome ever!          5,386             11
1981  Human mitochondria         First organelle             16,500            37
1995  Haemophilus influenzae Rd  First free-living organism  1,830,137         ~3,500
1996  Saccharomyces cerevisiae   First eukaryote             12,086,000        ~6,000
http://www.ncbi.nlm.nih.gov/ICTVdb/Images/Ackerman/Phages/Microvir/238-27_1.jpg
http://www.alsa.org/research/article.cfm?id=822
http://www.waterscan.co.yu/images/virusi-bakterije/Haemophilus%20influenzae.jpg
http://www.biochem.wisc.edu/yeastclub/buddingyeast(color).jpg
Genome sequencing chronology
Year  Organism                Significance                  Genome size (bp)  Number of genes
1998  Caenorhabditis elegans  First multicellular organism  97,000,000        ~19,000
1999  Human chromosome 22     First human chromosome        49,000,000        673
2000  Arabidopsis thaliana    First plant genome            150,000,000       ~25,000
2001  Human                   First human genome            3,000,000,000     ~30,000
http://www.sih.m.u-tokyo.ac.jp/chem1.gif
http://lter.kbs.msu.edu/Biocollections/Herbarium/Images/ARBTH3H.jpg
Genome sequencing projects (as of 1/26/2007)
Sequencing strategies: Hierarchical shotgun sequencing
http://www.bio.davidson.edu/courses/GENOMICS/method/shotgun.html
Genome size range
What is in the genomes? Why is there such a big difference?
[Figure: genome size ranges in bp on a log scale (10^4 to 10^11) for plasmids, viruses, bacteria, fungi, plants, algae, insects, mollusks, bony fish, amphibians, reptiles, birds, and mammals]
Information contents in a genome
Gene
Protein coding genes
RNA genes
Regulatory elements
Gene expression control
Chromatin remodeling
Matrix attachment sites
“Non-functional” elements
Selfish elements
“Junk” DNA
??
The “central dogma” of molecular biology
Central dogma
DNA (replication) -> transcription -> RNA -> translation -> Protein
Expanded “central dogma” of molecular biology
A more comprehensive view
DNA (replication) -> transcription -> RNA -> translation -> Protein -> Metabolite -> Phenotype
New disciplines due to the advance in genomics
Omics
DNA (replication): Genomic DNA sequences; Structural genomics
RNA (transcription): Transcript seq, Microarray data, Cis-elements, TF binding sites, Epigenetic regulation; Transcriptomics
Protein (translation): Shotgun protein seq, Subcellular location, Post-translational mod, Protein interaction, Protein structure; Proteomics
Metabolite: Metabolite concn, Metabolic flux; Metabolomics
Phenotype: Genetic interactions, Systematic KO, Disease information
Nature omics gateway
http://www.nature.com/omics/subjects/index.html
Three perspectives of our biological world
The cellular level, the individual, the tree of life
~3 x 10^4 genes
~10^14 cells per individual
2-100 x 10^6 species
Rosenzweig et al., 2002. Conservation Biol.
Image: http://www.tolweb.org/tree/
Image: http://www.olympusfluoview.com/gallery/cells/hela/helacells.html
Further complications
Cell-cell interactions
Cell types
Environmental conditions
Developmental programming
Interactions at the organismal level
Interactions at the population, ecosystem level
Definition of bioinformatics
Bioinformatics
Research, development, or application of
Computational tools and approaches for expanding the use of
Biological, medical, behavioral or health data, including those to
Acquire, store, organize, archive, analyze, or visualize such data.
Computational biology
The development and application of
Data-analytical and theoretical methods, mathematical modeling and
computational simulation techniques to
The study of biological, behavioral, and social systems
Q: What kinds of data are we talking about?
http://www.bisti.nih.gov/
Example: Sequence assembly
Cut into ~150 kb pieces
Clone into Bacterial Artificial Chromosomes (BACs)
Map to determine the order of the BAC clones (golden/tiling path)
Shear a BAC clone randomly
Sequence the fragments
Assemble sequence reads
http://www.bio.davidson.edu/courses/GENOMICS/method/shotgun.html
Sequence assembly
Challenges
The presence of gaps
Due to incomplete coverage
Sequencing error and quality issue: worse at the end of reactions
So can’t rely on perfectly identical sequences all the time
Sequences derived from one strand of DNA
Need to take orientations of reads into account
Non-random sequencing of DNA
Presence of repeats
Correct layout
Mis-assembly
http://www.cbcb.umd.edu/research/assembly_primer.shtml
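The link between incomplete coverage and gaps can be made concrete with a rough calculation. The sketch below assumes reads land uniformly at random (an idealization that the "non-random sequencing" point above warns against), so the number of reads covering a base is approximately Poisson with mean c = N * L / G. The function names and the read length are illustrative assumptions, not values from the slides.

```python
import math

def coverage(num_reads: int, read_len: int, genome_len: int) -> float:
    """Average depth c = N * L / G."""
    return num_reads * read_len / genome_len

def fraction_uncovered(c: float) -> float:
    """Poisson approximation: P(a base is hit by zero reads) = exp(-c)."""
    return math.exp(-c)

if __name__ == "__main__":
    bac_len = 150_000   # ~150 kb BAC insert, as in the slides
    read_len = 500      # hypothetical read length
    for n in (300, 1500, 3000):
        c = coverage(n, read_len, bac_len)
        missing = fraction_uncovered(c) * bac_len
        print(f"{n} reads: {c:.0f}x coverage, about {missing:.0f} bases uncovered")
```

Under this model the uncovered fraction falls off exponentially with depth, which is why shotgun projects typically sequence each clone several times over.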
Overlap-layout consensus
The relationships between reads can be represented as a graph
Nodes (vertices): reads
Edges (lines): connecting “overlapping reads”
[Figure: reads 1 to 4 laid out along the genome, and the corresponding overlap graph with nodes 1 to 4]
Goal: identifying a path through that graph that visits each node
exactly once
http://en.wikipedia.org/wiki/Image:Hamilton_path.gif
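As a toy illustration of the graph view, the sketch below builds an overlap graph from a handful of error-free, same-strand reads and searches by brute force for a path that visits each read exactly once. The reads, the minimum overlap length, and all function names are hypothetical; real assemblers cannot enumerate permutations and must also handle sequencing errors, read orientation, and repeats.

```python
from itertools import permutations

def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def build_graph(reads, min_len=3):
    """Map each ordered pair of overlapping reads (i, j) to its overlap length."""
    graph = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = overlap(a, b, min_len)
                if k:
                    graph[(i, j)] = k
    return graph

def hamiltonian_layout(reads, min_len=3):
    """Return (read order, merged sequence) for the Hamiltonian path with the
    largest total overlap, or None if no such path exists."""
    graph = build_graph(reads, min_len)
    best = None
    for order in permutations(range(len(reads))):
        steps = list(zip(order, order[1:]))
        if all(step in graph for step in steps):
            total = sum(graph[step] for step in steps)
            if best is None or total > best[0]:
                best = (total, order)
    if best is None:
        return None
    order = best[1]
    seq = reads[order[0]]
    for prev, cur in zip(order, order[1:]):
        seq += reads[cur][graph[(prev, cur)]:]   # append the non-overlapping tail
    return list(order), seq

if __name__ == "__main__":
    reads = ["CGTTAGC", "ATGCGTA", "CGTACGT"]   # shuffled reads from ATGCGTACGTTAGC
    print(hamiltonian_layout(reads))
```

The brute-force search is factorial in the number of reads, so it only serves to show what "visiting each node exactly once" means on a small example.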
Example: Gene prediction
How can we identify functional elements in the genomes?
How can we assign functions to these elements?
How can we determine/predict the structures of these elements?
How can we reconstruct networks describing the relationships
and dynamics between these elements?
How can we link genotypes to phenotypes?
Characteristics of protein-coding genes
Similarity to other genes
Assuming there is some level of conservation.
Substitutions that change amino acids vs. those that do not.
http://www.mun.ca/biology/scarr/MGA2_03-20.html
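The distinction between substitutions that change an amino acid and those that do not can be sketched with the standard genetic code. The example below compares two aligned, gap-free coding sequences codon by codon; the sequences and function names are made up for illustration, and this crude count is not a substitute for a proper dN/dS estimator.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify(codon_a: str, codon_b: str) -> str:
    """Label a codon pair as identical, synonymous, or nonsynonymous."""
    if codon_a == codon_b:
        return "identical"
    if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
        return "synonymous"
    return "nonsynonymous"

def count_changes(seq_a: str, seq_b: str) -> dict:
    """Count codon-level changes between two in-frame sequences of equal length."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3:
        raise ValueError("sequences must be aligned, in frame, and of equal length")
    counts = {"identical": 0, "synonymous": 0, "nonsynonymous": 0}
    for i in range(0, len(seq_a), 3):
        counts[classify(seq_a[i:i + 3], seq_b[i:i + 3])] += 1
    return counts

if __name__ == "__main__":
    # Hypothetical sequences: Met-Ala-Lys vs. Met-Ala-Arg
    print(count_changes("ATGGCTAAA", "ATGGCCAGA"))
```

An excess of synonymous over nonsynonymous changes between related sequences is one signal that a region is under selection to preserve a protein.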
Hidden Markov Model and gene finding
Goal:
Choose a path that maximizes the probability that you will enjoy the trip
(or the other way around if you wish)
How is the probability determined?
p = p(EL-CHI)*p(CHI-MAD) = 0.5*0.4 = 0.2
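The trip calculation multiplies one transition probability per leg, which is the Markov chain part of a hidden Markov model. The sketch below reproduces the EL-CHI-MAD product and picks the most probable route by enumeration. Only the 0.5 and 0.4 values come from the slide; the extra stops (DET, MIL) and their probabilities are invented for illustration. A full HMM would also include emission probabilities, and gene finders use dynamic programming (the Viterbi algorithm) instead of enumerating paths.

```python
from math import prod

# Transition probabilities between stops. Only EL->CHI (0.5) and CHI->MAD (0.4)
# are from the slide; all other entries are hypothetical.
TRANSITIONS = {
    "EL":  {"CHI": 0.5, "DET": 0.5},
    "CHI": {"MAD": 0.4, "MIL": 0.6},
    "DET": {"MAD": 0.1, "MIL": 0.9},
}

def path_probability(path):
    """Product of the transition probabilities along consecutive stops."""
    return prod(TRANSITIONS[a][b] for a, b in zip(path, path[1:]))

def all_paths(start, steps):
    """Every path that makes exactly `steps` transitions from `start`."""
    if steps == 0:
        return [[start]]
    paths = []
    for nxt in TRANSITIONS.get(start, {}):
        for tail in all_paths(nxt, steps - 1):
            paths.append([start] + tail)
    return paths

def best_path(start, steps):
    """Most probable path and its probability, found by brute-force enumeration."""
    best = max(all_paths(start, steps), key=path_probability)
    return best, path_probability(best)

if __name__ == "__main__":
    print(path_probability(["EL", "CHI", "MAD"]))
    print(best_path("EL", 2))
```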
Example: Sequence alignment
Align retinol-binding protein and β-lactoglobulin
>RBP
MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRL
LNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPN
GLPPEAQKIVRQRQEELCLARQYRLIV
>lactoglobulin
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN
GECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKA
LPMHIRLSFNPTQLEEQCHI
  1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG  50 RBP
  1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD.  44 lactoglobulin

 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE  97 RBP
 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK  93 lactoglobulin

 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin

137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin
Goal of PSA
Find an alignment between 2 sequences with the maximum score
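One standard way to find the maximum-scoring global alignment is dynamic programming (Needleman-Wunsch). The sketch below computes only the optimal score, under a simple assumed scoring scheme (match +1, mismatch -1, linear gap -1) that is not taken from the slides. Protein alignments like the one above would normally use a substitution matrix such as BLOSUM or PAM and affine gap penalties.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Optimal global alignment score of a and b by dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)
    return score[-1][-1]

if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

Each cell holds the best score for aligning the two prefixes ending there, so the bottom-right cell is the score of the best full alignment; recovering the alignment itself would require a traceback, which is omitted here.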
Extreme value distribution
Normal vs. extreme value distribution
[Figure: probability (y-axis, 0 to 0.40) versus x (-5 to 5) for the normal distribution and the extreme value distribution]
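The two curves can be compared numerically. The sketch below assumes the standard Gumbel form for the extreme value distribution (location 0, scale 1); the slides do not state the parameterization, so treat that choice and the function names as assumptions. The point it illustrates is the heavier right tail of the extreme value distribution, which matters when judging whether a high alignment score is surprising.

```python
import math

def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of the normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def gumbel_pdf(x: float, mu: float = 0.0, beta: float = 1.0) -> float:
    """Density of the Gumbel (type I extreme value) distribution for maxima."""
    z = (x - mu) / beta
    return math.exp(-z - math.exp(-z)) / beta

def normal_tail(x: float) -> float:
    """P(X >= x) for a standard normal variable."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def gumbel_tail(x: float, mu: float = 0.0, beta: float = 1.0) -> float:
    """P(X >= x) for a Gumbel variable."""
    z = (x - mu) / beta
    return 1.0 - math.exp(-math.exp(-z))

if __name__ == "__main__":
    for x in (0, 2, 4):
        print(x, f"{normal_tail(x):.2e}", f"{gumbel_tail(x):.2e}")
```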
Example: Microarray
A solid support (e.g. a membrane or glass slide) on which DNA of
known sequence is deposited in a grid-like fashion
http://shadygrove.umbi.umd.edu/microarray/Microarray.gif
Microarray data analysis
A simplified pipeline
http://www.microarray.lu/images/overview_1.jpg
What’s in the cel files
Intensities of perfect and mismatch probes
#### Dimension of the data matrix
library(affy)      # provides pm(); assumes M is an AffyBatch (e.g. from ReadAffy())
nrow(M); ncol(M)
### Perfect match
pm <- pm(M)        # perfect match intensities
dim(pm)            # dimension of the pm matrix
pm[1:5,]           # the first five rows
summary(pm)        # summary stat for the pm matrix

Output of pm[1:5,]:
      GSM131151.CEL  GSM131152.CEL  GSM131153.CEL  GSM131160.CEL  GSM131161.CEL  GSM131162.CEL
[1,]          252.5          267.0          349.0          424.8          213.5          237.8
[2,]          138.0          129.8          147.5          335.5          215.3          142.3
[3,]          172.3          155.5          174.8          411.8          241.0          128.3
[4,]          163.3          142.8          155.5          494.3          225.5          119.5
[5,]          259.5          257.3          245.3          505.5          308.8          217.0

Output of summary(pm) (first four arrays):
          GSM131151.CEL  GSM131152.CEL  GSM131153.CEL  GSM131160.CEL
Min.               56.3           67.5           69.5           96.0
1st Qu.           144.3          143.3          157.3          303.6
Median            212.5          215.0          234.8          414.5
Mean              423.1          437.5          458.4          648.2
3rd Qu.           383.5          397.8          426.0          637.0
Max.            39818.5        39268.0        28628.0        24854.5
Probe intensity behaviors between arrays
Distributions vary widely between experiments
### Summarize the intensity
par(mfrow=c(1,2))  # get a plotting region with 1 row, 2 col
hist(M)            # generate log2 histograms
boxplot(M)         # generate log2 boxplots
[Figure: histograms and boxplots of log intensity for each array]
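The between-array differences are visible even in the five perfect-match rows printed above. As a rough, language-neutral check, the Python sketch below takes those thirty values and compares the median log2 intensity per array. It uses only the numbers shown on the slide; the function name is an assumption, and five probes are far too few for a real analysis.

```python
import math
from statistics import median

ARRAYS = ["GSM131151.CEL", "GSM131152.CEL", "GSM131153.CEL",
          "GSM131160.CEL", "GSM131161.CEL", "GSM131162.CEL"]

# First five perfect-match probes (rows) across the six arrays (columns).
PM_ROWS = [
    [252.5, 267.0, 349.0, 424.8, 213.5, 237.8],
    [138.0, 129.8, 147.5, 335.5, 215.3, 142.3],
    [172.3, 155.5, 174.8, 411.8, 241.0, 128.3],
    [163.3, 142.8, 155.5, 494.3, 225.5, 119.5],
    [259.5, 257.3, 245.3, 505.5, 308.8, 217.0],
]

def log2_medians(rows, names):
    """Median log2 intensity for each array (column)."""
    columns = zip(*rows)
    return {name: median(math.log2(v) for v in col)
            for name, col in zip(names, columns)}

if __name__ == "__main__":
    for name, value in log2_medians(PM_ROWS, ARRAYS).items():
        print(f"{name}: {value:.2f}")
```

Systematic shifts like the one in GSM131160.CEL are the reason arrays are normalized before expression values are compared.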
Example: Identification of cis-elements
The on-off switches and rheostats of a cell, operating at the gene level.
They control whether and how vigorously genes are transcribed into RNAs.
http://genomicsgtl.energy.gov/science/generegulatorynetwork.shtml
Motif model: Position Frequency Matrix (PFM)
f_{b,i}: frequency of base b at the i-th position
D’haeseleer (2006) Nature Biotech. 24:423
Motif model: Position Weight Matrix (PWM)
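Suppose p_A = p_T = 0.32 and p_G = p_C = 0.18 (Arabidopsis thaliana)
$$W_{b,i} = \ln \frac{(n_{b,i} + p_b)/(N + 1)}{p_b}$$
where n_{b,i} is the count of base b at position i, p_b is the background frequency of base b, and N is the number of aligned sites (N = 8 in the example below).
Position Frequency Matrix
      1     2     3     4     5
A     8     0     4     4     2
T     0     0     0     2     2
G     0     8     4     2     2
C     0     0     0     0     2
Position Weight Matrix
      1     2     3     4     5
A   1.1  -2.2   0.4   0.4  -0.2
T  -2.2  -2.2  -2.2  -0.2  -0.2
G  -2.2   1.6   1.0   0.3   0.3
C  -2.2  -2.2  -2.2  -2.2   0.3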
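The conversion from counts to weights can be checked with a few lines of code. The sketch below applies the log-ratio with a background-weighted pseudocount, ln(((n + p_b) / (N + 1)) / p_b), to the frequency matrix on this slide and scores a candidate site by summing weights. The function names and the test sites are illustrative assumptions, and small rounding differences from the printed matrix are possible.

```python
import math

BACKGROUND = {"A": 0.32, "T": 0.32, "G": 0.18, "C": 0.18}

# Position frequency matrix from the slide: counts per base at positions 1 to 5.
PFM = {
    "A": [8, 0, 4, 4, 2],
    "T": [0, 0, 0, 2, 2],
    "G": [0, 8, 4, 2, 2],
    "C": [0, 0, 0, 0, 2],
}

def pfm_to_pwm(pfm, background):
    """W[b][i] = ln(((n_bi + p_b) / (N + 1)) / p_b), with N the number of sites."""
    n_sites = sum(counts[0] for counts in pfm.values())
    return {
        base: [math.log(((n + background[base]) / (n_sites + 1)) / background[base])
               for n in counts]
        for base, counts in pfm.items()
    }

def score_site(site: str, pwm) -> float:
    """Sum of position-specific weights for a candidate site."""
    return sum(pwm[base][i] for i, base in enumerate(site))

if __name__ == "__main__":
    pwm = pfm_to_pwm(PFM, BACKGROUND)
    for base in "ATGC":
        print(base, [round(w, 1) for w in pwm[base]])
    print(round(score_site("AGAAA", pwm), 2))
```

A site scoring well above zero resembles the motif more than the background, so sliding this score along a promoter is a simple way to flag candidate binding sites.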
Example: Cis-regulatory logic
Based on a high-confidence set of binding sites:
3,353 interactions between 116 regulators and 1,296 promoters
Harbison et al. (2004) Nature 431:99
Identification of putative cis elements
Pearson's correlation coefficient as the similarity measure.
k-means clustering to identify co-regulated genes.
Motifs identified only with AlignACE
Beer and Tavazoie (2004) Cell 117:185
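As a minimal sketch of the similarity measure, the code below computes Pearson's correlation coefficient between expression profiles and finds the profile most similar to a query gene. The gene names and expression values are hypothetical, and the clustering step itself is not shown.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def most_similar(query, profiles):
    """Name of the profile with the highest correlation to the query gene."""
    others = {name: p for name, p in profiles.items() if name != query}
    return max(others, key=lambda name: pearson(profiles[query], others[name]))

if __name__ == "__main__":
    # Hypothetical expression values across four conditions.
    profiles = {
        "geneA": [1.0, 2.0, 3.0, 4.0],
        "geneB": [2.1, 3.9, 6.2, 8.0],   # rises with geneA
        "geneC": [4.0, 3.0, 2.0, 1.0],   # falls as geneA rises
    }
    print(most_similar("geneA", profiles))
```

Because the coefficient ignores scale and offset, two genes with very different absolute expression levels can still be grouped together if their profiles rise and fall in step.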
Bayesian network
Bayes' theorem
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
Bayesian network
$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{parents}(X_i))$$
Charniak (1991) Bayesian networks without tears
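Both formulas can be exercised on the smallest possible network, a single parent and a single child. In the sketch below a transcription factor being bound (parent) influences whether a gene is expressed (child). The network structure and every probability are invented for illustration and are not taken from the cited studies.

```python
# Hypothetical two-node network: tf_bound -> gene_on.
P_TF = {True: 0.2, False: 0.8}                  # P(tf_bound)
P_GENE_GIVEN_TF = {                             # P(gene_on | tf_bound)
    True:  {True: 0.9, False: 0.1},
    False: {True: 0.1, False: 0.9},
}

def joint(tf_bound: bool, gene_on: bool) -> float:
    """Chain-rule factorization: P(tf, gene) = P(tf) * P(gene | tf)."""
    return P_TF[tf_bound] * P_GENE_GIVEN_TF[tf_bound][gene_on]

def posterior_tf(gene_on: bool) -> float:
    """Bayes' theorem: P(tf_bound | gene_on) = P(gene_on, tf_bound) / P(gene_on)."""
    evidence = sum(joint(tf, gene_on) for tf in (True, False))
    return joint(True, gene_on) / evidence

if __name__ == "__main__":
    print(joint(True, True))
    print(posterior_tf(True))
```

With more nodes the same product runs over every variable given its parents, which is what keeps the joint distribution tractable when most variables have only a few parents.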
Final example: Relationships between sequences
Sanger and colleagues (1950s): 1st sequence
Insulin from various mammals
Trees
An acyclic, un-directed graph with nodes and edges
[Figure: example trees with nodes labeled A to I and branch lengths, one drawn along a time axis and one with a scale bar of one unit; labeled parts: external branch, internal branch, operational taxonomic unit, ancestral taxonomic units]
Li 1997. Molecular Evolution. p101
Enumerating trees
Suppose there are n OTUs (n ≥ 3)
Bifurcating rooted trees:
$$N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$$
Unrooted trees:
$$N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$$
For 10 OTUs
3.4 x 10^7 possible rooted trees
2.0 x 10^6 possible unrooted trees
http://w3.uniroma1.it/cogfil/philotrees.jpg
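The tree counts for 10 OTUs can be reproduced directly from the standard formulas for bifurcating trees, N_R = (2n-3)! / (2^(n-2) (n-2)!) and N_U = (2n-5)! / (2^(n-3) (n-3)!). The function names in this sketch are assumptions.

```python
from math import factorial

def rooted_trees(n: int) -> int:
    """Number of bifurcating rooted trees for n OTUs (n >= 2)."""
    if n < 2:
        raise ValueError("need at least 2 OTUs")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def unrooted_trees(n: int) -> int:
    """Number of bifurcating unrooted trees for n OTUs (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 OTUs")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

if __name__ == "__main__":
    for n in (3, 4, 5, 10, 20):
        print(n, rooted_trees(n), unrooted_trees(n))
```

The counts grow faster than exponentially, which is why exhaustive search over tree topologies is only feasible for a handful of sequences.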
Impacts of genomics and bioinformatics
New ways to ask and answer questions?
Hypothesis driven vs. data driven
A matter of scale
A matter of integration
Quantitative emphasis
Multidisciplinary approaches
How is genomics different from genetics?
Whole genome approach versus a few genes
Investigations into the structure and function of very large numbers of genes undertaken in a simultaneous fashion.
Genetics looks at single genes, one at a time, as a snapshot.
Genomics tries to look at all the genes as a dynamic system, over time, and to determine how they interact and influence biological pathways and physiology in a much more global sense.
The END
...