Genome evolution: a sequence

Download Report

Transcript Genome evolution: a sequence

Genome Evolution © Amos Tanay, The Weizmann Institute
Genome evolution
Lecture 1: Introduction (and some
background on Markov Processes)
Amos Tanay, Ziskind 204, ext 3579
‫עמוס תנאי‬
[email protected]
http://www.wisdom.weizmann.ac.il/~atanay/GenomeEvo/
Genome Evolution © Amos Tanay, The Weizmann Institute
Evolution
Code
Genome
Genome Evolution © Amos Tanay, The Weizmann Institute
(Most of us) don’t think we evolved to be optimal..
Talking and eating…
Retina in human and octopus
octopus
human
retina
Nerve fibers
The Appendix
Evolving tiny iphone fingers?
Genome Evolution © Amos Tanay, The Weizmann Institute
But when looking at the genome..
• Very commonly,
evolution is assigned
with super-powers
• Any non random pattern
in the genome is
assumed to have a
“meaning”
• ..because otherwise
“evolution would have
eliminated it”
• In a way – the current
attitude to evolution
among many biologists
is semi-religious
Mathematics helps demystifying the inner works of evolution
Genome Evolution © Amos Tanay, The Weizmann Institute
The Lamarckian view – directed evolution
Jean Baptiste Lamarck
Genome Evolution © Amos Tanay, The Weizmann Institute
The Darwinian view – natural selection
Darwin
Genome Evolution © Amos Tanay, The Weizmann Institute
Statistics, Genetics and Molecular Biology
Frequency of recessive allele (blue flower color) in “desert snow” flowers (Lynanthus parruae)
FSR  0.1589
0.717
0.005
0.657
Haldane
FRT  0.3299
0.000 0.000
0.032
0.573
Fischer
0.000
0.009
0.302
0.000
0.007
0.002
0.004
0.000
0.000
0.504
0.005
0.008
0.126
0.000
0.339
0.000
H  0.4995
0.010
0.106
0.224
0.068
0.000
H  0.0272
Dobzhansky
0.014
0.411
H  0.3062
Mayr
Genome Evolution © Amos Tanay, The Weizmann Institute
Modeling evolution – take I
Blue allele
A
A
Generations/time
a
Yellow allele
a
Modeling the dynamics of allele frequencies
AA
Generations/time
AA
Aa
aa
Aa
aa
Modeling the dynamics of allele frequencies
Genome Evolution © Amos Tanay, The Weizmann Institute
Modeling evolution – take I
Blue allele
A
Generations/time
a
Yellow allele
A
a
Modeling the dynamics of allele frequencies
t
t+1
Genome Evolution © Amos Tanay, The Weizmann Institute
Try it at home I:
1. Simulate a population of 10,000 “genomes”, all having the same
genotype AA
2. Replace 10 individuals with an “aA” allele
3. Simulate a new generation: synthesize 20,000 random pairs and
select 10,000 out of them
4. Plot the fraction of the “a” allele in the population
5. Repeat the experiment many times: can you say something
(empirical/analytic) about the probability of ending up with a
population lacking “a” completely?
Genome Evolution © Amos Tanay, The Weizmann Institute
Statistics, Genetics and Molecular Biology
The code – Genomic sequences
…ACGAATAGCAAATGGGCAGATGGCAGTCTAGATCGAAAGCATGAAACTAGATAGCAT…
Monod
Jacob
Crick
The machine – Protein networks in cells
Genome Evolution © Amos Tanay, The Weizmann Institute
Statistics, Genetics and Molecular Biology
A. Quantitative description of
populations their genes and their
evolution
1920
Fischer
1930
1950
B. Molecular understanding of the
mechanisms that store, transmit
and process biological information
Haldane
Dobzhansky
Crick
A+B = The (only?) real quantitative theory of biology!
Mayr
Monod
Jacob
Genome Evolution © Amos Tanay, The Weizmann Institute
Neutral Evolution
Selectionists: Mutations are occurring by chance - some
get selected and these are the changes we see between
genomes
Kimura et al.: Most of the changes between
genomes are neutral - not a result of selection!
…ACGAATAGCAAATGGGCAGATGGCAGTCTAGATCGAAAGCATGAAACTAGATAGCAT…
…ACGAATAGCAAATGGGCAGATGGCAGTCTAGATCGAAAGCATGAAACTAGATAGCAT…
…ACGAATAGCAAAAGGGCAGATGGCATTCTAGATCGAAAGCATGAAACTAGATAGCAT…
…ACGAATAGCAAATGGGCAGATGGCAGTCTAGATCGAAAGCATGAAACTAGATAGCAT…
Kimura
…ACGAATAGCAAATGGGCAGATGGCAGTCTAGATCGAAAGCATGAAACTAGATAGCAT…
Genome Evolution © Amos Tanay, The Weizmann Institute
Neutral Evolution
Kimura’s analytic achievement was the solution of a certain class of Partial
Differential Equations that describe the dynamic of allele frequencies under
neutral evolution


1 
V ( x) ( x, t )
 ( x, t )   M ( x) ( x, t ) 
t
x
2 x 2
But we can try and understand the essence of neutral evolution even without
fancy mathematics:
Neutral changes
Along the path are fixated
Last common ancestor
t=1
Coalescent
time
t=n
Genome Evolution © Amos Tanay, The Weizmann Institute
Modeling evolution – take II
One letter:
t=1
t=2
t=3
t=4
t=5
A
A
A
G
G
t=6
t=7
G
Markov
A
pAG
A Markov Chain:
A
pGA
G
A set of states (A,C,G,T)
Transition probabilities

 1 P0 i
P01

 i 0
1 P1i
 P10

i 1

..
..

..
..

 P
Pn 0
n0



C
P02 ..
P0 n
P12 ..
P1 n
.. ..
..
.. ..
..
Pn 0 .. 1 Pni

in











T
Genome Evolution © Amos Tanay, The Weizmann Institute
Modeling evolution – take II
Pij (t )  0,
 P (t )  1
ij
Kolmogorov
j
 P (t ) P (h)  P (t  h)
ik
kj
ij
t, h  0
A Markov
process:
k
1 i  j
lim Pij (t )  
t 0
0 i  j
Theorem:
1  Pii (t )
 Pii ' (0)  lim
 qii
t 0
t
Pij (t )
Pij ' (0)  lim
 qij
t 0
t
P' (t )  QP (t )  P(t )  exp( Qt )
A set of states
(A,C,G,T)
Transition rates
rates
Genome Evolution © Amos Tanay, The Weizmann Institute
From hundreds to billions loci….
Genome = many
independent
nucleotides
x1
Universal
x2
x3
x4
x5
x6
Q
1960
Multiple copies of the same Markov process
1970
1980
Protein analysis
Phylogenetic reconstruction
1990
2000
2010
Genome Evolution © Amos Tanay, The Weizmann Institute
From hundreds to billions loci….
Genome = many
independent
nucleotides
x1
Universal
x2
x3
x4
x5
x6
Q
Multiple copies of the same Markov process
1960
1970
1980
1990
2000
2010
Genome Evolution © Amos Tanay, The Weizmann Institute
“Its conserved”
Don’t change!
Warn you!
…ACGAATAGCAAATGGGCAGATGGCAGTCTAGATCGAAAGCATGAAACTAGATAGCAT…
The everlasting power of conservation
Conserved ==
Functional?
Conserved ==
because it was selected for?
Genome Evolution © Amos Tanay, The Weizmann Institute
Mutations are not simple
Mutations
cannot be
simple!
DNA replication – wiki version..
Try it at home II:
1. Download human-chimp pairwise alignment:
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/vsPanTro2/
2. Count how many time a human T is align to chimp A,C,G,T
3. Recount, this time classifying according to the bases before and
after (16 possibilities)
4. Do you see differences in the fraction of conserved T’s?
Genome Evolution © Amos Tanay, The Weizmann Institute
Evolution of an AAAA..AA sequence
X’t
Maximum
Likelihood
flanking
aware
model
Xt
Xt+
1
X’’t
Xt
Xt+
1
Maximum
Likelihood
independent
loci model
Movie removed to save space
Simulation based on real parameters from yeast genomes
Genome Evolution © Amos Tanay, The Weizmann Institute
Evolution of an AAAA..AA sequence
X’t
Maximum
Likelihood
flanking
aware
model
Xt
Xt+
1
X’’t
Xt
Xt+
1
Maximum
Likelihood
independent
loci model
Movie removed to save space
Simulation based on real parameters from yeast genomes
Genome Evolution © Amos Tanay, The Weizmann Institute
What happened? stationary distributions
TCT
TTT
TAT
TGT
TTT
Stationary distribution condition
pTTT ( pTTT TCT  pTTT TAT  pTTT TGT )
 pTAT pTAT TTT  pTCT pTCT TTT  pTGT pTGT TTT
QTXT
CTT
Theorem: (under reasonable
conditions..) - A Markov process will
converge to a stationary distribution
Theorem: the stationary distribution
will be uniform iff the transition
probabilities are symmetric
QCXT
CTT
Genome Evolution © Amos Tanay, The Weizmann Institute
Try it at home III:
1. Download a human chromosome:
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/
2. Count the number of appearances of each word on 6 DNA bases
3. Now do it again on windows of 1MB.
4. Are the statistics always the same? How can this be tested? What
can explain possible differences?
Genome Evolution © Amos Tanay, The Weizmann Institute
Example: CpG islands
CG
1% diverged
40% diverged
>99% of the
genome: CpG is
methylated
<1% of the
genome CpG is
unmethylated
CG
CG
CA
TG
CA
TG
CpG islands: high CpG stationary distribution
Due to variable mutability
Got very famous: (3.4M hits in google..) – and are still very confusing
May play a role in cancer – no optimality here
Genome Evolution © Amos Tanay, The Weizmann Institute
More complex example: codon bias
nucleotides
Amino-acids
Ala/A
GCU, GCC, GCA, GCG
Leu/L
UUA, UUG, CUU, CUC, CUA, CUG
Arg/R
CGU, CGC, CGA, CGG, AGA, AGG
Lys/K
AAA, AAG
Asn/N
AAU, AAC
Met/M
AUG
Asp/D
GAU, GAC
Phe/F
UUU, UUC
Cys/C
UGU, UGC
Pro/P
CCU, CCC, CCA, CCG
Gln/Q
CAA, CAG
Ser/S
UCU, UCC, UCA, UCG, AGU, AGC
Glu/E
GAA, GAG
Thr/T
ACU, ACC, ACA, ACG
Gly/G
GGU, GGC, GGA, GGG
Trp/W
UGG
His/H
CAU, CAC
Tyr/Y
UAU, UAC
AUU, AUC, AUA
Val/V
GUU, GUC, GUA, GUG
AUG
STOP
UAG, UGA, UAA
Ile/I
START
The genetic code
is degenerate:
More than one
codon (triplet) for
each amino acid
Codon bias: codons are not appearing uniformly in the genome
Popular theory: codon usage is optimized to maximize protein yield
Genome Evolution © Amos Tanay, The Weizmann Institute
Evolution
Genome
(Genetic
code)
CODE1
CODE2
(Codon
bias)
Function
Function2
(Protein coding sequence)
(Translation control)
Genome Evolution © Amos Tanay, The Weizmann Institute
The sequence context affect
the probability of a short
insertion or deletion within
>100 fold
Insertion/Deletion usually break
the gene frame and destroy the
derived protein
Sequence context
Around deletions l=2
Translating garbage
A
C
G
T
Codons are selected to reduce the
potential of frame shifting mutations!
Distance from
deletion junction
Sequence context
Around inserts l=2
Relative Number of
Insert susceptible loci
Relative Number of
delete susceptible loci
Randomized exons vs. Real exons
A
C
G
T
Insert length
Distance from insert
Deletion length
Genome Evolution © Amos Tanay, The Weizmann Institute
Evolution
Mutability
Genome
(Genetic
code)
CODE1
CODE2
(Codon
bias)
Function
Function2
(Protein coding sequence)
(Translation control)
Genome Evolution © Amos Tanay, The Weizmann Institute
Genome sequences are something new
and exciting
We can approach them using the old
techniques and methodologies…
• Focusing on one feature at a time
• Looking at the static picture
• Boring ourselves to death
Or we can take new bold
approaches…
This is what Lamark, Darwin, Myer,
Crick or Kimura would have
done…
Jacob
Genome Evolution © Amos Tanay, The Weizmann Institute
Humans and Chimps
~5-7 million years
3X109
{ACGT}
3X109
{ACGT}
Genome alignment
• Where are the “important” differences?
• How did they happen?
Genome Evolution © Amos Tanay, The Weizmann Institute
9%
1.2%
0.8%
3%
1.5%
0.5%
Human
Chimp
Gorilla
Orangutan
Gibbon
Baboon
Macaque
Where are the
“important”
differences?
How did new
features were
gained?
Marmoset
0.5%
Genome Evolution © Amos Tanay, The Weizmann Institute
Antibiotic resistance: Staphylococcus aureus
Timeline for the evolution of bacterial resistance in an S. aureus
patient (Mwangi et al., PNAS 2007)
•Skin based
•killed 19,000 people in the US during 2005 (more than AIDS)
•Resistance to Penicillin: 50% in 1950, 80% in 1960, ~98% today
•2.9MB genome, 30K plasmid
How do bacteria become resistant to antibiotics?
Can we eliminate resistance by better treatment protocols, given understanding of the evolutionary process?
Genome Evolution © Amos Tanay, The Weizmann Institute
Ultimate experiment: sequence the entire genome of the evolving S. aureus
Mutations
Resistance to Antibiotics
Vanco.
Rifampi
Oxacili
Dapto.
20/7
1
0.012
0.75
0.01
20/9
4
16
25
0.05
1/10
6
16
0.75
0.05
6/10
8
16
1.5
1.0
13/10
8
16
0.75
1.0
1 2 3 4-6 7 8 9 10 11 12 13 14 15…18
S. Aureus got found just few “right” mutations and survived multi-antibiotics
Genome Evolution © Amos Tanay, The Weizmann Institute
Yeast Genome duplication
• The budding yeast S. cerevisiae genome have extensive duplicates
• We can trace a whole genome duplication by looking at yeast
species that lack the duplicates (K. waltii, A. gosypii)
• Only a small fraction (5%) of the yeast genome remain duplicated
Genome Evolution © Amos Tanay, The Weizmann Institute
•
How can an organism tolerate genome duplication and massive gene loss?
•
Is this critical in evolving new functionality?
Genome Evolution © Amos Tanay, The Weizmann Institute
“Junk” and ultraconservation
Baker’s yeast
12MB
~6000 genes
The worm
c.elegans
100MB ~20,000 genes
Humans
3GB ~27,000 genes
1 cell
~1000 cells
~50 trillions cells
Genome Evolution © Amos Tanay, The Weizmann Institute
From: Lynch 2007
Genome Evolution © Amos Tanay, The Weizmann Institute
intergenic
exon
intron exon intron exon
intron
ENCODE Data
exon
intergenic
Genome Evolution © Amos Tanay, The Weizmann Institute
Things you need to know or catch up with:
•
•
•
Basic discrete probability, std
distributions
Basic graphs and combinatorics
A clue on biology will be helpful,
not vital
What you’ll learn:
•
•
•
•
•
Foundations: Population genetics
Foundations: Molecular evolution
inference in graphical models
comparative genomics
Intro to genome organization and key concepts in evolution
Books:
Graur and Li, Molecular Evolution
Lynch, Origin of genome architecture
Hartl and Clark, Population genetics
Durbin et al. Biological sequence analysis
Karlin and Taylor, Markov Processes
Freidman and Koller draft textbook (handouts)
Papers as we go along..
N. Friedman
D. Koller
BN and
beyond
Genome Evolution © Amos Tanay, The Weizmann Institute
Course duties
• 4 exercises, 40% of the grade
• 1 Genomic exercise (in pairs) for 10% of the grade
– Compare two genomes of your choice: mammals, worms, flies,
yeasts, bacteria, plants
• Exam: 60%
– Master key computational results
Genome Evolution © Amos Tanay, The Weizmann Institute
Probabilities
•
Our probability space:
– DNA/Protein sequences: {A,C,G,T}
– Time/populations
•
Queries:
– If a locus have an A at time t, what is the chance it will be C at time t+1?
– If a locus have an A in an individual from population P, what is the chance it will
be C in another individual from the same population?
– What is the chance to find the motif ACGCGT anywhere in a random individual
of population P? what is the chance it will remain the same after 2m years?
P(   )
P( )
Conditional Probability:
P(  |  ) 
Chain Rule:
P(   )  P(  |  ) P( )
Bayes Rule:
P(  |  ) 
A
B
P( |  ) P(  )
P( )
Genome Evolution © Amos Tanay, The Weizmann Institute
Modeling processes
Modeling
the process
t
Comparing end points
(probability distribution
Genome Evolution © Amos Tanay, The Weizmann Institute
Poisson process
0
1
2
3
4
0
1
2
3
Random walk
-1
Markov chain
A
Brownian motion
B
t
C
D
Discrete time
T=1
T=2
T=3
T=4
T=5
Continuous time
X  R1 , R 2
Genome Evolution © Amos Tanay, The Weizmann Institute
Markov chains
General Stochastic process:
X s, Pr( state, time)
The Markov property: u  t  s, knowing X t makes X s and X u indepdent
A set of states: Finite or Countable. (e.g., Integers, {A,C,G,T})
Discrete time: T=0,1,2,3,….
P( x, s; t , A)  Pr( X t A | X s  x)
Transition probability
P( x, s; t , A)  P( x, t  s, A)
Stationary transition probabilities
Pij  Pr( X t 1 j | X t  i )
( X t1 h , X t 2 h,..., X tn h) ~ ( X t1 , X t 2,..., X tn)
One step transitions
Stationary process
Genome Evolution © Amos Tanay, The Weizmann Institute
Markov chains
4 Nucleotides
The loaded coin
A
B
T=1
A
G
C
T
A
B
T=2
A
G
C
T
A
B
T=3
A
G
C
T
A
B
T=4
A
G
C
T
20 Amino Acids
pab
ARNDCEQGHILKMFPSTWYV
ARNDCEQGHILKMFPSTWYV
1-pab
A
B
pba
1-pba
ARNDCEQGHILKMFPSTWYV
ARNDCEQGHILKMFPSTWYV
Genome Evolution © Amos Tanay, The Weizmann Institute
Markov chains

Transition matrix P:
 1 P0 i
P01

 i 0
1 P1i
 P10

i 1

..
..

..
..

 P
Pn 0
n0



P02 ..
P0 n
P12 ..
P1 n
.. ..
..
.. ..
..
Pn 0 .. 1 Pni

in











A discrete time Markov chain is completely defined given an initial condition
and a probability matrix.
The Markov chain graph G is defined on the states. We connect (a,b)
whenever Pab>0
( P )T x
Matrix power
Distribution after T time steps given x as an initial condition
Genome Evolution © Amos Tanay, The Weizmann Institute
Spectral decomposition
T=1
A
T=2
A
paa
B
pab

T=3
A
paa paa  pab pba
B
paa pab  pab pbb
P
 1 P0 i
P01

 i 0
1 P1i
 P10

i 1

..
..

..
..

 P
Pn 0
n0



Right,left eigenvector:
Px  x, xP  x
When an eigen-basis exists
x (1) , x ( 2 ) ,..., x ( n )
We can find right eigenvectors:
 (1) ,..,  ( n )
And left eigenvectors:
With the eigenvalue spectrum:
P02 ..
P0 n
P12
P1 n
..
.. ..
..
.. ..
..
Pn 0 .. 1 Pni

in











 (1) ,.., ( n )
1 ,.., n
Which are bi-orthogonal:
( , )  k 1ik jk   ij
And define the spectral decomposition:
P  
(1)
(1)
n
Genome Evolution © Amos Tanay, The Weizmann Institute
Spectral decomposition
P  
PT  ()T  ...  T 
 1T

0
T
 
 
0

0
T2

0
0 
0 

  
 Tn 


To compute transition probabilities:
O(|E|)*T ~ O(N2)*T per initial condition
T matrix multiplications to preprocess for time T
Using spectral decomposition:
O(Spectral pre-process) + 2 matrix multiplications per condition
Genome Evolution © Amos Tanay, The Weizmann Institute
Convergence
Spec(P) = P’s eignvalues, 1  2...
1 = largest, always = 1.
A Markov Chain is irreducible if its underlying graph is connected. In that
case there is a single eigenvalue that equals 1.
What does the left eigenvector corresponding to 1 represent?
Fixed point:
out  (i , j )E Pi Pij  ( j ,i )E Pj Pji  in
2 = second largest eigenvalue. Controlling the rate of process convergence