RISE AND FALL OF GENE FAMILIES Dynamics of Their Expansion

Genomics and Bioinformatics
The "new" biology
What is genomics?

Genome
- All the DNA contained in the cell of an organism

Genomics
- The comprehensive study of the interactions and functional dynamics of whole sets of genes and their products. (NIAAA, NIH)
- A "scaled-up" version of genetics research in which scientists can look at all of the genes in a living creature at the same time. (NIGMS, NIH)

Which organism’s genome was sequenced first?
Genome sequencing chronology

Year  Organism                   Significance                Genome size (bp)  Number of genes
1977  Bacteriophage φX174        First genome ever!          5,386             11
1981  Human mitochondria         First organelle             16,500            37
1995  Haemophilus influenzae Rd  First free-living organism  1,830,137         ~3,500
1996  Saccharomyces cerevisiae   First eukaryote             12,086,000        ~6,000

Images:
http://www.ncbi.nlm.nih.gov/ICTVdb/Images/Ackerman/Phages/Microvir/238-27_1.jpg
http://www.alsa.org/research/article.cfm?id=822
http://www.waterscan.co.yu/images/virusi-bakterije/Haemophilus%20influenzae.jpg
http://www.biochem.wisc.edu/yeastclub/buddingyeast(color).jpg
Genome sequencing chronology (continued)

Year  Organism                Significance                  Genome size (bp)  Number of genes
1998  Caenorhabditis elegans  First multicellular organism  97,000,000        ~19,000
1999  Human chromosome 22     First human chromosome        49,000,000        673
2000  Arabidopsis thaliana    First plant genome            150,000,000       ~25,000
2001  Human                   First human genome            3,000,000,000     ~30,000

Images:
http://www.sih.m.u-tokyo.ac.jp/chem1.gif
http://lter.kbs.msu.edu/Biocollections/Herbarium/Images/ARBTH3H.jpg
Genome sequencing projects (as of 1/26/2007)
Sequencing strategies: Hierarchical shotgun sequencing
http://www.bio.davidson.edu/courses/GENOMICS/method/shotgun.html
Genome size range

What is in the genomes? Why is there such a big difference in size?
[Figure: genome size ranges on a log scale from 10^4 to 10^11 bp for plasmids, viruses, bacteria, fungi, plants, algae, insects, mollusks, bony fish, amphibians, reptiles, birds, and mammals]
Information contents in a genome

Gene
- Protein coding genes
- RNA genes

Regulatory elements
- Gene expression control
- Chromatin remodeling
- Matrix attachment sites

“Non-functional” elements
- Selfish elements
- “Junk” DNA
- ??
The “central dogma” of molecular biology

Central dogma
DNA (replication) -> [transcription] -> RNA -> [translation] -> Protein
Expanded “central dogma” of molecular biology

A more comprehensive view
DNA (replication) -> [transcription] -> RNA -> [translation] -> Protein -> Metabolite -> Phenotype
New disciplines due to the advance in genomics

Omics
DNA (replication) -> [transcription] -> RNA -> [translation] -> Protein -> Metabolite -> Phenotype

- DNA (structural genomics): genomic DNA sequences
- RNA (transcriptomics): transcript seq, microarray data, cis-elements, TF binding sites, epigenetic regulation
- Protein (proteomics): shotgun protein seq, subcellular location, post-translational mod, protein interaction, protein structure
- Metabolite (metabolomics): metabolite concn, metabolic flux
- Phenotype: genetic interactions, systematic KO, disease information

Nature omics gateway
http://www.nature.com/omics/subjects/index.html
Three perspectives of our biological world

The cellular level, the individual, the tree of life
- ~3 x 10^4 genes
- ~10^14 cells per individual
- 2-100 x 10^6 species (Rosenzweig et al., 2002. Conservation Biol.)
Image: http://www.tolweb.org/tree/
Image: http://www.olympusfluoview.com/gallery/cells/hela/helacells.html
Further complications

- Cell-cell interactions
- Cell types
- Environmental conditions
- Developmental programming
- Interactions at the organismal level
- Interactions at the population, ecosystem level
Definition of bioinformatics

Bioinformatics
- Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

Computational biology
- The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

Q: What kinds of data are we talking about?
http://www.bisti.nih.gov/
Example: Sequence assembly

1. Cut the genome into ~150 kb pieces
2. Clone the pieces into Bacterial Artificial Chromosomes (BACs)
3. Map the BAC clones to determine their order (golden/tiling path)
4. Shear a BAC clone randomly
5. Sequence the fragments
6. Assemble the sequence reads
http://www.bio.davidson.edu/courses/GENOMICS/method/shotgun.html
Sequence assembly

Challenges
- The presence of gaps
  - Due to incomplete coverage
- Sequencing errors and quality issues: worse at the end of reactions
  - So we can’t rely on perfectly identical sequences all the time
- Each read is derived from one strand of DNA
  - Need to take the orientations of reads into account
- Non-random sequencing of DNA
- Presence of repeats
[Figure: correct layout vs. mis-assembly]
http://www.cbcb.umd.edu/research/assembly_primer.shtml
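The orientation issue listed above can be sketched in a few lines of Python. This is only an illustration, not the method of any particular assembler: the function names, the `min_len` threshold, and the example reads are all assumptions made here. The idea is that a read may come from either strand, so an overlap test tries the read both as given and reverse-complemented.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def overlap_length(left, right, min_len=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def best_oriented_overlap(left, right, min_len=3):
    """Try `right` as given and reverse-complemented; report the longer overlap."""
    forward = overlap_length(left, right, min_len)
    reverse = overlap_length(left, reverse_complement(right), min_len)
    if forward >= reverse:
        return forward, "forward"
    return reverse, "reverse"

# Hypothetical reads: the second read was sequenced from the opposite strand.
print(best_oriented_overlap("ATTGCCA", "CGATGGC"))
```

Exact suffix/prefix matching is used only to keep the sketch short; as the list above notes, real reads contain errors, so practical tools score approximate overlaps instead.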
Overlap-layout consensus

The relationships between reads can be represented as a graph
- Nodes (vertices): reads
- Edges (lines): connecting “overlapping reads”
[Figure: reads 1-4 laid out along the genome, and the corresponding overlap graph]

Goal: identify a path through the graph that visits each node exactly once (a Hamiltonian path)
http://en.wikipedia.org/wiki/Image:Hamilton_path.gif
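A toy version of the overlap-layout idea can be written as follows. It is a sketch under simplifying assumptions: reads are error-free, all come from the same strand, and the Hamiltonian path is found by brute force over permutations, which is only feasible for a handful of reads. The reads, the `min_len` threshold, and all function names are hypothetical.

```python
from itertools import permutations

def overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def build_overlap_graph(reads, min_len=3):
    """Directed graph: key (i, j) means read i overlaps read j; value is the overlap length."""
    graph = {}
    for i, left in enumerate(reads):
        for j, right in enumerate(reads):
            size = overlap(left, right, min_len) if i != j else 0
            if size:
                graph[(i, j)] = size
    return graph

def find_layout(reads, graph):
    """Brute-force search for an order that visits every read exactly once."""
    for order in permutations(range(len(reads))):
        if all(step in graph for step in zip(order, order[1:])):
            return list(order)
    return None

def merge_reads(reads, order, graph):
    """Consensus for error-free reads: append the non-overlapping tail of each read."""
    contig = reads[order[0]]
    for a, b in zip(order, order[1:]):
        contig += reads[b][graph[(a, b)]:]
    return contig

# Hypothetical reads, listed out of order as in the slide's figure.
reads = ["GGCATTCA", "ATGCGGCA", "TTCAGGAC", "GGACTTAA"]
graph = build_overlap_graph(reads)
order = find_layout(reads, graph)
print(order, merge_reads(reads, order, graph))
```

The permutation search grows factorially with the number of reads, which is one reason practical assemblers rely on heuristics rather than an exhaustive Hamiltonian path search.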
Example: Gene prediction

- How can we identify functional elements in the genomes?
- How can we assign functions to these elements?
- How can we determine/predict the structures of these elements?
- How can we reconstruct networks describing the relationships and dynamics between these elements?
- How can we link genotypes to phenotypes?
Characteristics of protein-coding genes

Similarity to other genes
- Assuming there is some level of conservation.
- Substitutions that change amino acids (nonsynonymous) vs. those that don’t (synonymous).
http://www.mun.ca/biology/scarr/MGA2_03-20.html
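The distinction between substitutions that change the amino acid and those that do not can be illustrated with the standard genetic code. The sketch below compares two aligned, in-frame coding sequences codon by codon. The sequences and the function name are hypothetical, and counting whole codons is a simplification: it does not estimate substitution rates the way dedicated methods do.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order for each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_codon_differences(seq_a, seq_b):
    """Count differing codons that keep (synonymous) or change (nonsynonymous) the amino acid."""
    counts = {"synonymous": 0, "nonsynonymous": 0}
    for i in range(0, min(len(seq_a), len(seq_b)) - 2, 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue
        same_aa = CODON_TABLE[codon_a] == CODON_TABLE[codon_b]
        counts["synonymous" if same_aa else "nonsynonymous"] += 1
    return counts

# Hypothetical sequences: CTT -> CTC keeps Leu, GAA -> GAT changes Glu to Asp.
print(classify_codon_differences("ATGCTTGAA", "ATGCTCGAT"))
```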
Hidden Markov Model and gene finding

Goal:
- Choose a path that maximizes the probability that you will enjoy the trip (or the other way around if you wish)

How is the probability determined?
- The probability of a path is the product of the probabilities of its steps:
  p = p(EL-CHI) * p(CHI-MAD) = 0.5 * 0.4 = 0.2
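The trip calculation can be written out as a small Python sketch. Only the two step probabilities for EL-CHI and CHI-MAD come from the slide; the alternative route through "DET" and its probabilities are hypothetical, added so that there is something to compare against. An HMM gene finder applies the same product rule to paths through hidden states, typically with dynamic programming rather than by listing every path.

```python
from math import prod

# Step probabilities. Only EL->CHI and CHI->MAD are taken from the slide.
TRANSITIONS = {
    ("EL", "CHI"): 0.5,
    ("CHI", "MAD"): 0.4,
    ("EL", "DET"): 0.5,   # hypothetical
    ("DET", "MAD"): 0.1,  # hypothetical
}

def path_probability(path):
    """Probability of a path = product of the probabilities of its steps."""
    return prod(TRANSITIONS[step] for step in zip(path, path[1:]))

def best_path(paths):
    """Return the candidate path with the highest probability."""
    return max(paths, key=path_probability)

candidates = [["EL", "CHI", "MAD"], ["EL", "DET", "MAD"]]
for path in candidates:
    print(path, path_probability(path))
print("best:", best_path(candidates))
```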
Example: Sequence alignment

Align retinol-binding protein and β-lactoglobulin
>RBP
MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRL
LNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPN
GLPPEAQKIVRQRQEELCLARQYRLIV
>lactoglobulin
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN
GECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKA
LPMHIRLSFNPTQLEEQCHI
(| marks identical residues)

  1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG  50 RBP
          |||  |         |         |     ||||  |
  1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD.  44 lactoglobulin

 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE  97 RBP
      | |   |   |        |  |    ||  |    ||        |
 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK  93 lactoglobulin

 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
     || ||         |          ||||  |                |
 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin

137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
      |       |     |      ||          | || |
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin
Goal of pairwise sequence alignment (PSA)

Find an alignment between 2 sequences with the maximum score
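One standard way to find a maximum-scoring global alignment is Needleman-Wunsch dynamic programming; the slides do not say which algorithm or scoring scheme they use, so the sketch below is an assumption on both counts. The match, mismatch, and gap scores are arbitrary illustrative values, whereas protein alignments normally use a substitution matrix and more elaborate gap penalties.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch: return (best score, aligned a, aligned b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))

best, top, bottom = global_align("GAT", "GT")
print(best)
print(top)
print(bottom)
```

When several alignments share the maximum score, the traceback returns just one of them, preferring a diagonal step over a gap.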
Extreme value distribution
[Figure: normal vs. extreme value distribution; probability (0 to 0.40) plotted against x (-5 to 5)]
Example: Microarray

A solid support (e.g. a membrane or glass slide) on which DNA of known sequence is deposited in a grid-like fashion
http://shadygrove.umbi.umd.edu/microarray/Microarray.gif
Microarray data analysis

A simplified pipeline
http://www.microarray.lu/images/overview_1.jpg
What’s in the cel files

Intensities of perfect match and mismatch probes

library(affy)        # Bioconductor package that provides pm() for AffyBatch objects
# M is assumed to be an AffyBatch read from the CEL files, e.g. M <- ReadAffy()

#### Dimension of the data matrix
nrow(M); ncol(M)
### Perfect match
pm <- pm(M)          # perfect match intensities
dim(pm)              # dimension of the pm matrix
pm[1:5,]             # the first five rows
summary(pm)          # summary stat for the pm matrix

Output of pm[1:5,]:

     GSM131151.CEL GSM131152.CEL GSM131153.CEL GSM131160.CEL GSM131161.CEL GSM131162.CEL
[1,]         252.5         267.0         349.0         424.8         213.5         237.8
[2,]         138.0         129.8         147.5         335.5         215.3         142.3
[3,]         172.3         155.5         174.8         411.8         241.0         128.3
[4,]         163.3         142.8         155.5         494.3         225.5         119.5
[5,]         259.5         257.3         245.3         505.5         308.8         217.0

Output of summary(pm) (first four arrays):

 GSM131151.CEL      GSM131152.CEL      GSM131153.CEL      GSM131160.CEL
 Min.   :   56.3    Min.   :   67.5    Min.   :   69.5    Min.   :   96.0
 1st Qu.:  144.3    1st Qu.:  143.3    1st Qu.:  157.3    1st Qu.:  303.6
 Median :  212.5    Median :  215.0    Median :  234.8    Median :  414.5
 Mean   :  423.1    Mean   :  437.5    Mean   :  458.4    Mean   :  648.2
 3rd Qu.:  383.5    3rd Qu.:  397.8    3rd Qu.:  426.0    3rd Qu.:  637.0
 Max.   :39818.5    Max.   :39268.0    Max.   :28628.0    Max.   :24854.5
Probe intensity behaviors between arrays
Distributions vary widely between experiments

### Summarize the intensity
par(mfrow=c(1,2))    # get a plotting region with 1 row, 2 col
hist(M)              # generate log2 histograms
boxplot(M)           # generate log2 boxplots

[Figure: histogram and boxplot of log intensity for each array]

Example: Identification of cis-elements


- The on-off switches and rheostats of a cell operating at the gene level.
- They control whether and how vigorously genes will be transcribed into RNAs.
http://genomicsgtl.energy.gov/science/generegulatorynetwork.shtml
Motif model: Position Frequency Matrix (PFM)

f_{b,i}: frequency with which base b occurs at the i-th position
D’haeseleer (2006) Nature Biotech. 24:423
Motif model: Position Weight Matrix (PWM)
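Suppose p_A = p_T = 0.32 and p_G = p_C = 0.18 (Arabidopsis thaliana)

W_{b,i} = \ln \frac{(n_{b,i} + p_b) / (N + 1)}{p_b}

where n_{b,i} is the count of base b at position i, N is the number of sites (8 in the matrix below), and p_b is the background frequency of base b.

Position Frequency Matrix
      1     2     3     4     5
A     8     0     4     4     2
T     0     0     0     2     2
G     0     8     4     2     2
C     0     0     0     0     2

Position Weight Matrix
      1     2     3     4     5
A   1.1  -2.2   0.4   0.4  -0.2
T  -2.2  -2.2  -2.2  -0.2  -0.2
G  -2.2   1.6   1.0   0.3   0.3
C  -2.2  -2.2  -2.2  -2.2   0.3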
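The conversion from counts to weights can be checked with a short Python sketch. It assumes the weight is ln[((n + p_b) / (N + 1)) / p_b], with n the count of base b at a position, N = 8 sites, and the background frequencies given on the slide. Under that assumption it reproduces the tabulated weights to within about 0.05. The `score_site` helper, which sums weights along a candidate site, is an illustrative addition and not something shown on the slide.

```python
from math import log

BACKGROUND = {"A": 0.32, "T": 0.32, "G": 0.18, "C": 0.18}

# Counts from the Position Frequency Matrix (5 positions, 8 sites).
PFM = {
    "A": [8, 0, 4, 4, 2],
    "T": [0, 0, 0, 2, 2],
    "G": [0, 8, 4, 2, 2],
    "C": [0, 0, 0, 0, 2],
}

def pfm_to_pwm(pfm, background):
    """Convert counts to log-odds weights, using p_b as a pseudocount."""
    n_sites = sum(counts[0] for counts in pfm.values())
    return {
        base: [
            log(((n + background[base]) / (n_sites + 1)) / background[base])
            for n in counts
        ]
        for base, counts in pfm.items()
    }

def score_site(pwm, site):
    """Sum the position-specific weights along a candidate site (illustrative)."""
    return sum(pwm[base][i] for i, base in enumerate(site))

pwm = pfm_to_pwm(PFM, BACKGROUND)
for base in "ATGC":
    print(base, [round(w, 1) for w in pwm[base]])
print(round(score_site(pwm, "AGAAT"), 2))
```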
Example: Cis-regulatory logic

Based on a high-confidence set of binding sites:
- 3,353 interactions between
- 116 regulators and
- 1,296 promoters
Harbison et al. (2004) Nature 431:99
Identification of putative cis elements



- Pearson's correlation coefficient as the similarity measure.
- k-means clustering to identify co-regulated genes.
- Motifs identified only with AlignACE
Beer and Tavazoie (2004) Cell 117:185
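As a minimal illustration of the similarity measure, the sketch below computes Pearson's correlation coefficient between expression profiles. The gene names and expression values are hypothetical, and the clustering step itself is not shown.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical expression profiles across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],   # same shape as geneA, different scale
    "geneC": [4.0, 3.0, 2.0, 1.0],   # opposite trend
}

print(pearson(profiles["geneA"], profiles["geneB"]))
print(pearson(profiles["geneA"], profiles["geneC"]))
```

Because the coefficient ignores scale and offset, geneA and geneB count as highly similar even though their absolute expression levels differ, which suits grouping genes by the shape of their response.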
Bayesian network

Bayes' theorem
P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}

Bayesian network
P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{parents}(X_i))
Charniak (1991) Bayesian networks without tears
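Both formulas can be seen at work in the smallest possible network, one parent and one child. The network (a transcription factor state influencing whether a gene is on) and every probability in it are hypothetical values chosen for illustration.

```python
# Hypothetical two-node network: TF_active -> gene_on
P_TF = {True: 0.3, False: 0.7}                   # P(TF_active)
P_GENE_GIVEN_TF = {                              # P(gene_on | TF_active)
    True: {True: 0.9, False: 0.1},
    False: {True: 0.2, False: 0.8},
}

def joint(tf, gene):
    """Network factorization: P(TF, gene) = P(TF) * P(gene | TF)."""
    return P_TF[tf] * P_GENE_GIVEN_TF[tf][gene]

def posterior_tf_active(gene):
    """Bayes' theorem: P(TF | gene) = P(gene | TF) P(TF) / P(gene)."""
    evidence = sum(joint(tf, gene) for tf in (True, False))  # P(gene)
    return joint(True, gene) / evidence

print(joint(True, True))
print(posterior_tf_active(True))
```

Observing that the gene is on raises the probability that the factor is active from the prior of 0.3 to 0.27 / 0.41, roughly 0.66.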
Final example: Relationships between sequences

- Sanger and colleagues (1950s): first protein sequence
- Insulin from various mammals
Trees

An acyclic, undirected graph with nodes and edges
[Figure: two trees relating taxa A-E through ancestral nodes F-I, with branch lengths, a time axis, and a scale bar of one unit; labeled parts: external branch, internal branch, operational taxonomic unit, ancestral taxonomic units]
Li 1997. Molecular Evolution. p101
Enumerating trees

Suppose there are n OTUs (n ≥ 3)
- Bifurcating rooted trees:
  N_R = \frac{(2n - 3)!}{2^{n-2} \, (n - 2)!}
- Unrooted trees:
  N_U = \frac{(2n - 5)!}{2^{n-3} \, (n - 3)!}

For 10 OTUs
- 3.4 x 10^7 possible rooted trees
- 2.0 x 10^6 possible unrooted trees
http://w3.uniroma1.it/cogfil/philotrees.jpg
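The counts for 10 OTUs can be verified directly. The sketch assumes the standard formulas N_R = (2n - 3)! / (2^(n-2) (n - 2)!) for bifurcating rooted trees and N_U = (2n - 5)! / (2^(n-3) (n - 3)!) for unrooted trees; the function names are chosen here for illustration.

```python
from math import factorial

def rooted_trees(n):
    """Number of bifurcating rooted trees for n OTUs (n >= 2)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def unrooted_trees(n):
    """Number of bifurcating unrooted trees for n OTUs (n >= 3)."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

for n in (3, 4, 5, 10):
    print(n, rooted_trees(n), unrooted_trees(n))
```

For n = 10 this gives 34,459,425 rooted and 2,027,025 unrooted trees, matching the rounded figures above. The number of unrooted trees for n + 1 OTUs equals the number of rooted trees for n, since a root can be treated as one extra taxon.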
Impacts of genomics and bioinformatics

New ways to ask and answer questions?
- Hypothesis driven vs. data driven
- A matter of scale
- A matter of integration
- Quantitative emphasis
- Multidisciplinary approaches

How is genomics different from genetics?
- Whole genome approach versus a few genes
- Investigations into the structure and function of very large numbers of genes undertaken in a simultaneous fashion.
- Genetics looks at single genes, one at a time, as a snapshot.
- Genomics is trying to look at all the genes as a dynamic system, over time, and determine how they interact and influence biological pathways and physiology, in a much more global sense.
The END

...