Transcript ppt6

Genome Evolution. Amos Tanay 2012
Genome evolution
Lecture 5: Selection vs. mutation/recombination. Species
Mutation-Selection balance
When an allele is weakly deleterious, mutations can play a major role in driving allele
frequencies
Genotype:         AA     Aa       aa
Fitness:          1      1 - hs   1 - s
Frequency (HW):   p^2    2pq      q^2
Mutation: A -> a at rate $\mu$
New allele frequency, without mutation:
$$p' = \frac{p^2 w_{11} + pq\,w_{12}}{p^2 w_{11} + 2pq\,w_{12} + q^2 w_{22}}$$
New allele frequency, assuming mutation:
$$p' = (1-\mu)\,\frac{p^2 + pq(1-hs)}{p^2 + 2pq(1-hs) + q^2(1-s)}$$
What is the equilibrium frequency of the deleterious allele?
$$h = 0:\quad q' = \sqrt{\mu/s}$$
$$h > 0:\quad q' = \frac{\mu}{hs} \quad (\text{ignoring } q^2 \text{ terms, } q \ll 1)$$
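The recursion with mutation can be iterated numerically to check the two equilibrium frequencies. The following is a minimal sketch, assuming one-way mutation A -> a at rate mu applied after selection and the fitness values 1, 1 - hs, 1 - s from the slide; the function names and the parameter values are illustrative choices, not part of the lecture.

```python
import math

def next_q(q, s, h, mu):
    """One generation: selection on genotypes, then mutation A -> a at rate mu."""
    p = 1.0 - q
    # Mean fitness under Hardy-Weinberg proportions
    mean_w = p * p + 2 * p * q * (1 - h * s) + q * q * (1 - s)
    # Frequency of A after selection
    p_after_selection = (p * p + p * q * (1 - h * s)) / mean_w
    # A fraction mu of the A alleles mutates to a
    return 1.0 - (1.0 - mu) * p_after_selection

def equilibrium_q(s, h, mu, generations=20000, q0=0.0):
    """Iterate the recursion long enough to reach the equilibrium frequency of a."""
    q = q0
    for _ in range(generations):
        q = next_q(q, s, h, mu)
    return q

# Illustrative parameter values (assumptions, chosen for fast convergence)
q_recessive = equilibrium_q(s=0.5, h=0.0, mu=1e-4)
q_partial = equilibrium_q(s=0.1, h=0.5, mu=1e-5)
print(q_recessive, math.sqrt(1e-4 / 0.5))   # compare with sqrt(mu/s)
print(q_partial, 1e-5 / (0.5 * 0.1))        # compare with mu/(hs)
```

For h = 0 the iteration should settle on sqrt(mu/s); for h > 0 it should land close to mu/(hs), with a small deviation coming from the ignored q^2 terms.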
Mutation-Selection balance: Huntington disease
A neurological genetic disease that appears after age 35
It results from a dominant mutation. How does this disease survive in the human population?
Although it may be fatal, the fitness is not very low due to the late age of onset (estimated w12 = 0.81)
Frequency in human populations: 70 per million (Europe) to 1 per million (Africa)
h > 0, so with hs = 1 - w12 = 0.19 we can estimate the mutation rate at the Huntington locus as $\mu = hs\,q'$:
from $10^{-6} \times (1-0.81) = 1.9\times10^{-7}$ to $70\times10^{-6} \times (1-0.81) = 1.3\times10^{-5}$
Recall: $h = 0:\ q' = \sqrt{\mu/s}$; $h > 0:\ q' = \mu/(hs)$
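The arithmetic of this estimate can be checked directly. The snippet below is a small sketch that, like the slide, plugs the population frequencies in as q' and uses hs = 1 - w12; the function name is an illustrative choice.

```python
def mutation_rate_estimate(q, w12):
    """Estimate mu = h*s*q, where h*s = 1 - w12 is the fitness cost in heterozygotes."""
    return (1.0 - w12) * q

low = mutation_rate_estimate(q=1e-6, w12=0.81)    # 1 per million
high = mutation_rate_estimate(q=70e-6, w12=0.81)  # 70 per million
print(f"{low:.2e} to {high:.2e}")  # prints: 1.90e-07 to 1.33e-05
```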
Mutation-Selection balance: Haldane-Muller
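The average fitness of the population, given recurrent mutations at rate $\mu$ at a locus with negative fitness s.
Recall the equilibrium frequencies: $h = 0:\ q' = \sqrt{\mu/s}$; $h > 0:\ q' = \mu/(hs)$ (below, $\hat{q}$ denotes this equilibrium frequency and $\hat{p} = 1 - \hat{q}$)
Assume perfect recessivity (h=0):
$$\bar{w} = 1 - \hat{q}^2 s = 1 - \frac{\mu}{s}\,s = 1 - \mu$$
Assuming partial dominance (h>0):
$$\bar{w} = 1 - 2\hat{p}\hat{q}hs - \hat{q}^2 s = 1 - 2\left(1 - \frac{\mu}{hs}\right)\frac{\mu}{hs}\,hs - \left(\frac{\mu}{hs}\right)^2 s \approx 1 - 2\mu$$
The Haldane-Muller principle: the effect of mutation on the average population fitness depends only on the mutation rate, not on the fitness of the alleles!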
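The principle can be checked numerically by plugging the equilibrium frequencies into the mean fitness for several values of s. This is a minimal sketch using the approximate equilibria sqrt(mu/s) and mu/(hs); the function names and the chosen values of mu, s and h are assumptions for illustration.

```python
import math

def mean_fitness_recessive(mu, s):
    """Mean fitness at equilibrium for h = 0: 1 - q^2 * s with q = sqrt(mu/s)."""
    q = math.sqrt(mu / s)
    return 1.0 - q * q * s

def mean_fitness_partial(mu, s, h):
    """Mean fitness at equilibrium for h > 0: 1 - 2pq*hs - q^2 * s with q = mu/(hs)."""
    q = mu / (h * s)
    p = 1.0 - q
    return 1.0 - 2 * p * q * h * s - q * q * s

mu = 1e-6
for s in (0.01, 0.1, 0.5):
    load_recessive = 1.0 - mean_fitness_recessive(mu, s)
    load_partial = 1.0 - mean_fitness_partial(mu, s, h=0.3)
    # The reduction in mean fitness stays near mu (h = 0) or 2*mu (h > 0) for every s
    print(s, load_recessive, load_partial)
```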
Overdominance
A SNP affecting the beta-globin gene makes the encoded protein defective. The resulting red blood cells are curved and elongated, and are removed from the circulation
Individuals homozygous for the mutation will usually die from anemia without intensive care
Heterozygous individuals have mild anemia, but cope better with the malaria parasite Plasmodium falciparum (maybe because infected red cells become sickled)
[Figure: maps of the (historical) malaria distribution and of sickle-cell anemia; source: wiki]
Other types of selection
Different fitness for different individuals, e.g., male vs. female
For example, male genes that take up female resources in mammals
This was suggested to lead to the phenomenon of imprinting, in which cells express only the maternal or the paternal allele
Imprinted genes behave much like haploid genes
Other types of selection
Frequency- and density-dependent selection: when fitness depends on the frequency of the allele or on the population size.
Fecundity selection: different reproductive potential for different mating pairs.
Effects of a heterogeneous environment
Effects that apply directly to the haplotype: gametic selection/meiotic drive (e.g., killing the reproductive potential of your homologous chromosome)
Sexual selection: males advertising their reproductive potential, or confronting other males
Kin selection (“origin of altruism”)
Recombination and selection
Linkage and selection
Linkage interferes with the purging of deleterious mutations and reduces the efficiency of positive selection!
[Figure: linked loci carrying beneficial and weakly deleterious alleles. Labels: selective sweep, or hitchhiking effect, or genetic draft (Gillespie); Hill-Robertson effect]
Linkage and selection
The variance in allele frequency is used to define the effective population size:
$$V(p) = \frac{p(1-p)}{2N_e}$$
Simplistically, assume a neutral locus is evolving such that a selective sweep affects a fully linked locus at rate $\rho$. A sweep will fix the allele with probability p, and we further assume that the sweep happens instantly:
$$V(p) = p(1-p)\left(\frac{1}{2N_e} + \rho\right)$$
$$N_l = \frac{N_e}{1 + 2N_e\rho}$$
This is very rough, but it demonstrates the basic intuition here: sweeps reduce the effectiveness of selection in a way that can be quantified as a reduction in the effective population size.
$$N_l = \frac{N_e}{1 + 2N_e\rho C}$$
C: the average frequency of the neutral allele after the sweep
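To get a feeling for the size of this effect, the reduced effective population size can be tabulated for a few sweep rates. This is a rough sketch, assuming the formula reads N_l = N_e / (1 + 2 N_e * rate * C), where the sweep rate per generation is the symbol lost from the slide; the function name and the numbers are illustrative only.

```python
def linked_effective_size(n_e, sweep_rate, c=1.0):
    """Effective size at a linked neutral locus: N_e / (1 + 2 * N_e * sweep_rate * C).

    sweep_rate: rate of linked selective sweeps per generation (assumed symbol)
    c: average frequency of the neutral allele after the sweep (C on the slide)
    """
    return n_e / (1.0 + 2.0 * n_e * sweep_rate * c)

for rate in (0.0, 1e-6, 1e-4, 1e-2):
    print(rate, linked_effective_size(n_e=10_000, sweep_rate=rate))
```

Once 2 * N_e * rate * C is much larger than 1, the result is close to 1 / (2 * rate * C) and no longer depends on N_e.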
Cost of sex
Wasting half of your genes on non-reproductive individuals
Selective advantage of an asexual gene = 2 fold!
Still, sex is prevalent among complex species
It even persists when both asexual and sexual reproduction are available, as in S. cerevisiae:
• Mating locus MAT, types a and alpha
• Haploids grow quickly when all is well
• Mating occurs when times are rough
• Meiosis takes the diploid back to haploids…
Benefits of sexual reproduction
Fighting genetic draft: clearing deleterious mutations
•Can this add up to a factor of 2?
•(Alexey Kondrashov's theory: epistasis among deleterious alleles makes sex beneficial)
Buffering variation
DNA repair through recombination (even in somatic tissues)
Fighting mutation interference: more effective/rapid adaptation
•The red queen hypothesis
Morran et al., Running with the Red Queen: Host-Parasite Coevolution Selects for Biparental Sex. Science, 8 July 2011, vol. 333 no. 6039, 216-218
What is a species?
• Multiple definitions..
• Free flow of genetic information within a population
• Weak (or zero) flow of information across species barriers
We change the Wright-Fisher or Moran model by removing the assumption of random mixing. Instead, we can assume that subpopulations are more likely to mate among themselves. Different models are possible; all end up increasing the genetic distance between subpopulations.
[Figure labels: Strain 1, Strain 2; Species 1, Species 2]
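One of the possible models can be sketched as a Wright-Fisher simulation with two subpopulations that exchange only a fraction of their gene copies each generation. This is a minimal sketch and not the model used in the lecture: the two-deme setup, the symmetric migration parameter, the function name and all numbers are assumptions made for illustration.

```python
import random

def simulate_two_demes(n=50, migration=0.0, generations=100, p0=0.5, seed=0):
    """Wright-Fisher drift in two demes of n diploids each.

    migration: fraction of each deme's parents drawn from the other deme
               (0.5 means both demes sample from one common pool: random mixing).
    Returns the final allele frequencies [p_deme1, p_deme2].
    """
    rng = random.Random(seed)
    p = [p0, p0]
    for _ in range(generations):
        # Parental pool of each deme after migration
        pool = [(1 - migration) * p[0] + migration * p[1],
                (1 - migration) * p[1] + migration * p[0]]
        # Binomial sampling of 2n gene copies for the next generation
        p = [sum(rng.random() < f for _ in range(2 * n)) / (2 * n) for f in pool]
    return p

def mean_divergence(migration, replicates=20):
    """Average absolute frequency difference between the demes over replicates."""
    total = 0.0
    for seed in range(replicates):
        a, b = simulate_two_demes(migration=migration, seed=seed)
        total += abs(a - b)
    return total / replicates

print("isolated demes:", mean_divergence(0.0))
print("random mixing: ", mean_divergence(0.5))
```

With no migration the two demes drift apart independently, so their allele frequencies should differ far more than under random mixing.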
Speciation
The emergence of new species is called speciation
It is well accepted that speciation is driven by the formation of reproductive
barriers
Allopatric speciation – occurs through geographical separation
Parapatric speciation – occurs without geographical separation but with weak
flow of genetic information
Sympatric speciation – occurs while information is flowing
Barriers can be genetic, physical, or behavioral
Allopatric speciation
“Finally, then, I suppose that a large number of closely allied
or representative species... were originally formed in parts
formerly isolated" (Darwin)
Åland Islands, Glanville fritillary population:
same species
Charis Butterflies in South America:
different species
Reproductive barriers
Factors that limit gene flow:
Geography
Habitat
Sexual preferences
Season
Pollinator
Many factors can contribute to form a barrier:
Physical incompatibility,
Hybrid sterility (mule),
pre-zygotic infertility
post-zygotic lethality
Sympatric speciation
Following Darwin, and prior to population genetics and genetics in general, evolutionary biologists considered sympatric speciation the leading factor generating new species.
The idea was that species adapt to niches while co-existing in the same habitat
Sympatric speciation is, however, difficult to explain using standard population genetics of interbreeding populations.
Mayr (and Dobzhansky) made strong arguments suggesting that allopatric speciation is the major (or only) driver of biodiversity
Evidence for sympatric speciation
Studies of cichlid fish species in African lakes showed incredible diversity: 500 endemic species in Lake Victoria, up to 1000 in Lake Malawi
The history of some of these lakes may have included massive dry-out and geographical separation..
In a smaller lake (shown here is Barombi Mbo in Cameroon), dry-out is geographically unlikely
Several species (7) with a probable common ancestor do suggest sympatry
Selection vs. drift
Direct selection on the barrier
Indirect selection
The character selected for causes the barrier
The character selected for affects genes that also cause the barrier
Hitchhiking
Drift
Plain drift
Bottlenecks drive deleterious alleles to fixation
Reinforcement
Dobzhansky’s scenario:
partial separation in allopatry
Hybrids are unfit (if they are dead we already have species)
Selection creates a reproductive barrier
Many theoretical limitations, but solutions exist
Still controversial
Species trees
Speciation is irreversible! (with some minor exceptions – think parasites)
We end up with a branching process: forming a tree
[Figure: branching tree over time, from strains (Strain 1, Strain 2) to species (Species 1 to Species 4), with an extinction event; the tips are at the present time]
Facts on trees
•A tree is a connected graph without cycles
•We will use directed trees: each edge/lineage has a direction (time)
•Directed acyclic graph (DAG): a directed graph without cycles
•A binary tree: each node has one or 0 parents (incoming edges) and two or 0 children (outgoing edges)
•A binary tree on n extant species will have n-1 inner nodes (prove)
•Each inner node partitions a binary tree into three disconnected parts (up, left, right)
•The root of the tree is the only node without parents
•Topological order: a permutation of the nodes such that each node appears after
its parents
•BFS/DFS
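Two of these facts are easy to check on a small example: the n-1 count of inner nodes, and a topological order obtained by BFS from the root. This is a minimal sketch in which a binary tree is written as nested pairs with species names at the leaves; the representation, the function names and the example species are illustrative assumptions.

```python
from collections import deque

def count_nodes(tree):
    """Return (number of leaves, number of inner nodes) of a nested-pair binary tree."""
    if isinstance(tree, str):          # a leaf: an extant species
        return 1, 0
    left, right = tree
    left_leaves, left_inner = count_nodes(left)
    right_leaves, right_inner = count_nodes(right)
    return left_leaves + right_leaves, left_inner + right_inner + 1

def topological_order(tree):
    """BFS from the root: every node is listed after its parent."""
    order = []
    queue = deque([tree])
    while queue:
        node = queue.popleft()
        order.append(node)
        if not isinstance(node, str):  # inner node: enqueue both children
            queue.extend(node)
    return order

tree = ((("human", "chimp"), "mouse"), ("dog", "cat"))
leaves, inner = count_nodes(tree)
print(leaves, inner)                   # n leaves, n - 1 inner nodes
print(topological_order(tree)[0] == tree)
```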
Evolutionary inference
We can usually observe only the extant populations
But we want to infer the history of the evolutionary process
-What did the ancestral populations/species look like? (nodes in the tree)
-What was the evolutionary process that turned an ancestral genome into an extant one? (edges in the tree)
So we will develop methods for inference: estimating the values of missing
variables based on partial observations
Do we need inference?
Getting direct evidence on the evolutionary history is only partially possible:
The fossil record has probably given us more evolutionary understanding than any other resource (definitely more than genomes)
But it cannot teach us much about evolution at the genome level, and we cannot use it to learn how to read the genome itself
New technologies promise to sequence
the genome of extinct species (mammoth,
Neanderthals). But this is inherently
limited by material availability
Why do we have a chance with inference?
We are trying to infer the past based on the present. Does this make any sense
at all?
The past is correlated with the present
Low substitution probability: A (past) -> B (present), with Pr(B | A)
High correlation between A (past) and B (present): COV(A, B)
$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)}$$
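A small numeric example shows how a low substitution probability turns the present observation into strong evidence about the past. This is a sketch under stated assumptions: a four-letter alphabet, a prior Pr(A) over the ancestral nucleotide, and a substitution model in which the nucleotide stays the same with probability 1 - sub_prob and otherwise changes to one of the other three with equal probability. None of these choices come from the slide.

```python
def posterior_past(observed, prior, sub_prob):
    """Pr(A | B = observed) by Bayes' rule: Pr(B | A) * Pr(A) / Pr(B)."""
    others = len(prior) - 1

    def likelihood(present, past):
        # Pr(B = present | A = past) under the assumed substitution model
        return 1.0 - sub_prob if present == past else sub_prob / others

    joint = {a: likelihood(observed, a) * prior[a] for a in prior}
    pr_b = sum(joint.values())          # Pr(B), summing over all possible pasts
    return {a: joint[a] / pr_b for a in prior}

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
post = posterior_past("A", uniform, sub_prob=0.1)
print(post)
```

With a uniform prior, the probability that the past equals the observed nucleotide is 1 - sub_prob, so the lower the substitution probability, the more the present tells us about the past.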
Maximum parsimony
If we assume that the traits on the tree are changing slowly,
then the ancestral trait is usually the same as the extant one
For each ancestral node we have evidence coming in from 3 directions, and almost always two of them should agree
Formally: given a tree T, and observations (from some alphabet) Si on the extant species:
1) Compute the minimal number of changes along the tree
2) Find the possible values at each ancestral node given an evolutionary scenario involving the minimal number of changes
[Figure: small trees with leaves labeled C and A and unknown ancestral nodes (?); one assignment requires 1 substitution, another 2 substitutions]
Computing the parsimony score
Maximum Parsimony Algorithm (following Fitch 1971):
Start with D=0, up_set[i] a bitvector for each node
Up(i):
    if(extant(i)) { up_set[i] = Si; return }
    Up(right(i)); Up(left(i))
    up_set[i] = up_set[right(i)] ∩ up_set[left(i)]
    if(up_set[i] = ∅) {
        D += 1
        up_set[i] = up_set[right(i)] ∪ up_set[left(i)]
    }
Compute the minimal number of changes by calling Up(root)
[Figure: tree with leaves S1, S2, S3 and ancestral nodes (?) carrying up_set[4] and up_set[5]]
Parsimony “inference”
[Figure: tree with leaves S1, S2, S3 and ancestral nodes (?); labels up_set[3], down_set[4], down_set[5]]
Set[i] = up_set[i] ∩ down_set[i]
Algorithm (following Fitch 1971):
Up(i):
    if(extant(i)) { up_set[i] = Si; return }
    Up(right(i)); Up(left(i))
    up_set[i] = up_set[right(i)] ∩ up_set[left(i)]
    if(up_set[i] = ∅) {
        D += 1
        up_set[i] = up_set[right(i)] ∪ up_set[left(i)]
    }
Down(i):
    down_set[i] = up_set[sib(i)] ∩ down_set[par(i)]
    if(down_set[i] = ∅) {
        down_set[i] = up_set[sib(i)] ∪ down_set[par(i)]
    }
    if(extant(i)) return
    Down(left(i)); Down(right(i))
Algorithm:
    D = 0
    Up(root)
    down_set[root] = ∅
    Down(right(root))
    Down(left(root))
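The two passes can be written out as a short runnable program. This is a sketch that follows the slide's pseudocode for a single character, with Python sets standing in for the bitvectors; the Node class, the function names and the three-leaf example tree are illustrative assumptions.

```python
class Node:
    """Binary tree node; leaves carry an observed state, inner nodes have two children."""
    def __init__(self, state=None, left=None, right=None):
        self.state = state
        self.left = left
        self.right = right
        self.up_set = frozenset()
        self.down_set = frozenset()

    def is_leaf(self):
        return self.left is None

def up(node):
    """Bottom-up pass: fill up_set and return the number of changes in the subtree."""
    if node.is_leaf():
        node.up_set = frozenset([node.state])
        return 0
    changes = up(node.right) + up(node.left)
    common = node.right.up_set & node.left.up_set
    if common:
        node.up_set = common
    else:                                   # children disagree: one change, take the union
        node.up_set = node.right.up_set | node.left.up_set
        changes += 1
    return changes

def down(node, parent, sibling):
    """Top-down pass: evidence reaching the node from its sibling and from above."""
    common = sibling.up_set & parent.down_set
    node.down_set = common if common else (sibling.up_set | parent.down_set)
    if not node.is_leaf():
        down(node.left, node, node.right)
        down(node.right, node, node.left)

def fitch(root):
    """Run both passes and return the parsimony score D."""
    score = up(root)
    root.down_set = frozenset()
    if not root.is_leaf():
        down(root.right, root, root.left)
        down(root.left, root, root.right)
    return score

# Example: leaves C and A under one ancestral node, with a third leaf A as its sibling
inner = Node(left=Node("C"), right=Node("A"))
root = Node(left=inner, right=Node("A"))
score = fitch(root)
print(score, sorted(root.up_set), sorted(inner.up_set & inner.down_set))
```

In this example one change is enough: the ancestral node above C and A is ambiguous after the up pass, and the down pass resolves it to A.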
Genomic sequencing
In its first 100 years, evolutionary theory was about
organismal traits
Starting from the 1960’s, molecular traits became
available (mostly looking at proteins)
Since the 1990’s, and to its full extent today, we can
cheaply sequence whole genomes
It is expected that within a few years, technology will routinely allow the study of whole genomes in large population samples.
For example: the 3 billion dollar human genome project can now be done by a single lab within a few weeks for $5,000, and the price is dropping rapidly
The 1000 genomes project
Sequencing technology is rapidly evolving:
Illumina GAII (here at WIS)
~40,000,000 reads of ~36bp on each, 5k-10k$
Jan 2010: 300 million reads, 150bpx2…
Genome evolution: nucleotides are not simple traits
[Figure: types of sequence change: deletion, insertion, point mutation (substitution), duplication (GGAACC to GGAAGGAACC)]
We transform nucleotides to traits using alignment
An alignment specifies which positions in two or more genomes represent the same
“trait” – assuming they are the outcome of a single genealogy
As we can see, this need not be well defined (e.g. duplications), but we will usually have to assume it is.
A basic pairwise alignment optimization problem is solved using dynamic programming
Pairwise alignment: find the alignment minimizing the number (or some linear cost) of mismatches (including deletion/insertion characters)
Affine gap pairwise alignment: find the alignment minimizing the cost of mismatches +
the cost of gaps (fixed cost for a new gap, another cost for a gap character)
(see any standard text on comp-genomics)
The alignment dynamic programming graph (for reference)
a.k.a: Smith-Waterman, Needleman-Wunsch
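[Figure: dynamic programming grid with Species 1 (A T C T G A T C, index i) on one axis and Species 2 (T G C A T A C, index j) on the other; edges are labeled Match/Mismatch]
Global Alignment (initialize cell 0,0 to 0):
$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases}$$
Local Alignment:
$$s_{i,j} = \max \begin{cases} 0 \\ s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases}$$
How can we align all of the query to part of the database?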
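The two recurrences differ only in the extra 0 option and in where the final score is read. This is a minimal sketch that computes scores only (no traceback), assuming a simple scoring function delta with +1 for a match and -1 for a mismatch or a gap character; the function name and the scoring values are illustrative choices, not the ones used in the lecture.

```python
def alignment_score(v, w, local=False, match=1, mismatch=-1, gap=-1):
    """Best alignment score of v and w by dynamic programming.

    local=False: global alignment, score read from the last cell.
    local=True:  local alignment, scores floored at 0, best cell anywhere.
    """
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Global alignment: leading gaps are paid for along the borders
        for i in range(1, n + 1):
            s[i][0] = s[i - 1][0] + gap
        for j in range(1, m + 1):
            s[0][j] = s[0][j - 1] + gap
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            delta = match if v[i - 1] == w[j - 1] else mismatch
            cell = max(s[i - 1][j - 1] + delta,   # match/mismatch
                       s[i - 1][j] + gap,         # v[i-1] against a gap
                       s[i][j - 1] + gap)         # w[j-1] against a gap
            if local:
                cell = max(0, cell)               # restart the alignment here
            s[i][j] = cell
            best = max(best, cell)
    return best if local else s[n][m]

print(alignment_score("ATCTGATC", "TGCATAC"))
print(alignment_score("TTTACGT", "GGGACGA", local=True))
```

Aligning all of a query to part of a database would need a third variant, in which only gaps before and after the query are free; it is not implemented in this sketch.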
Multiple alignment
The problem: given a set of sequences (each from a different species), find their optimal multiple alignment.
Multiple alignment cost: many possible definitions. In most of these the problem is NP-hard.
In fact, we should be looking for the complete evolutionary history of these sequences
Therefore, the optimal alignment should in principle define the genealogy of each
nucleotide, such that these histories are reasonable
In practice, multiple alignment algorithms are using heuristics based on these ideas.
Designing and implementing a really principled version of these algorithms is not easy
1. Pairwise alignment (distances)
2. Build a “guide tree”
3. Align from leaves to root, each time a pair
(sequences or profiles)
…ACGAATAGCAGATGGGCAGATGGCAGTCTAGATCGAAAGCATGAAACTAGATAGAT…
…ACGTTTAGCAAATGGGCAGATGGCAGTCTAGA-----AGCATGAGACTAGATAGAT…
…ACGAATAGCAAAT------ATGCCAGTCTAGATCGAAAGCATGCCACTAGATAGAT…
Genome alignment
Given a set of genomes, each consisting of several billion nucleotides, the problem becomes computationally intensive
Heuristics are used to search for pieces of alignment (BLAST)
Pieces are then combined into chains of large fragments
Genome alignment can be projected onto some reference genome. Complex situations with duplications, large deletions and insertions require complex solutions and are routinely ignored