Welcome to Comp 665 - UNC Computational Systems Biology
Effective Population Size
• Real populations don’t satisfy the Wright-Fisher
model.
• In particular, real populations exhibit reproductive
structure, either due to geography or societal
constraints.
• The number of descendants in a generation depends
on many factors (health, disease, etc.), as opposed to
the implicit Poisson model.
• Population size isn’t fixed, but changes over time
4/11/2016
Comp 790– Continuous-Time Coalescence
1
Sanity Check
• When the Wright-Fisher model, or the basic
coalescent, is used to model a real population, the
size of the population (2N) cannot be taken literally.
• For example, many human genes have an MRCA less
than 200,000 years ago.
• If we consider one generation per 20 years then
there have been 200,000/20 = 10,000 generations
• Recall the average time to MRCA is 2 in population-scaled
time (units of 2N generations), so 2 × 2N = 10,000
generations. This gives 2N = 10,000/2 = 5,000, and an
effective population size of N = 2,500
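The arithmetic in this sanity check can be reproduced with a short sketch. The function name and its defaults are illustrative assumptions; the 20-year generation time and the mean scaled time to MRCA of 2 are the values quoted in the bullets.

```python
def effective_size_from_tmrca(tmrca_years, years_per_generation=20.0, scaled_tmrca=2.0):
    """Return (2N, N) implied by an MRCA age under the basic coalescent.

    One unit of scaled time is 2N generations, so the number of
    generations back to the MRCA is scaled_tmrca * 2N.
    """
    generations = tmrca_years / years_per_generation
    two_n = generations / scaled_tmrca
    return two_n, two_n / 2.0


two_n, n = effective_size_from_tmrca(200_000)
print(two_n, n)  # 5000.0 2500.0
```

Changing `years_per_generation` shows how sensitive the estimate is to the assumed generation time.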
Effective Population Size
• Without an estimate of an MRCA one can still use coalescence
to find the effective population size.
• Recall for the discrete coalescent, the expected time for two
genes to find an MRCA was $E(T_2) = 2N$
• Thus, $N_e = E(T_2)/2$
• This equation would be applied after tracing many paths of
gene pairs, and $E(T_2)$ would be measured in actual generations
rather than the normalized notion of time used in the
continuous coalescent (where t = 1.0 corresponds to 2N
generations)
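As a rough illustration of estimating the effective size from pairwise coalescence times, the sketch below simulates pairs of lineages under the discrete Wright-Fisher model and applies the relation between the effective size and E(T2). The function names, the population size of 50, and the seed are assumptions chosen for the example, not values from the slides.

```python
import random


def sample_pair_coalescence_time(two_n, rng):
    """Generations back until two gene lineages pick the same parent."""
    t = 1
    # Each generation back, the second lineage picks the first lineage's
    # parent with probability 1/(2N); otherwise go back one more generation.
    while rng.randrange(two_n) != 0:
        t += 1
    return t


def estimate_ne(pair_times):
    """Effective size from pairwise times in generations: mean(T2) / 2."""
    return sum(pair_times) / len(pair_times) / 2.0


rng = random.Random(0)   # fixed seed so the run is repeatable
true_n = 50              # hypothetical diploid population size
times = [sample_pair_coalescence_time(2 * true_n, rng) for _ in range(5000)]
print(estimate_ne(times))  # close to 50; the exact value depends on the seed
```

Because pairwise coalescence times are highly variable (roughly geometric), many gene pairs are needed before the average settles near 2N.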
Moran Model
• In 1958 Moran proposed an alternative to the Wright-Fisher
model where reproductive generations overlap
• The central idea is that each epoch represents two events:
the loss of one gene and its replacement by another
• This rules out multiple coalescent events between epochs
Moran Formulation
• Probability that 2 genes share a common ancestor in
the previous generation, P(T2=1) is:
$$P(T_2 = 1) = \frac{1}{N(2N-1)}$$
because only one of the $\binom{2N}{2} = N(2N-1)$ pairs has a
common ancestor
• Gives a geometric distribution with parameter $\frac{1}{N(2N-1)}$
and a natural time scale of $N(2N-1)$
(compared to $2N$ for the Wright-Fisher model)
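The Moran coalescence probability can be checked by brute-force enumeration. This sketch assumes one common convention for a Moran step, namely that the replaced gene is always distinct from the reproducing gene; the function name and the population size are illustrative.

```python
from fractions import Fraction
from itertools import permutations


def moran_pair_coalescence_probability(two_n):
    """Exact one-step coalescence probability for a fixed pair of genes.

    Enumerates every ordered (parent, replaced) choice of two distinct
    genes and counts the choices in which genes 0 and 1 are the parent
    and its new offspring (in either order).
    """
    events = list(permutations(range(two_n), 2))
    merging = [e for e in events if set(e) == {0, 1}]
    return Fraction(len(merging), len(events))


n = 5  # hypothetical diploid population size, so 2N = 10 genes
p = moran_pair_coalescence_probability(2 * n)
print(p)      # 1/45, i.e. 1 / (N(2N-1))
print(1 / p)  # 45, the Moran time scale, versus 2N = 10 for Wright-Fisher
```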
Moran Use
• When adjusted for differences in time scale, the basic
“continuous” coalescent holds for the Moran model as well
• The Moran model often leads to a more tractable computation
than the Wright-Fisher model
• The basic “continuous coalescent” is robust to the actual
population model, whether it is haploid or diploid, Wright-Fisher
or Moran. It is therefore commonly used as a first-order
approximation for making estimates about population
structure, such as how many variations one should expect in
a sample of size N and how long such divergences have
existed
Dirty Details
• Thus far, we’ve considered very simple, and
admittedly oversimplified models of biological
and genetic processes.
• Next we’ll discuss many of the biological
realities that the coalescent model either
crudely approximates, or entirely ignores
• We also want to move from our simple
gene-centric view to a more complete view of the organism
Terminology
• Gene: A unit of information transferred from one generation to
the next.
• Allele: An alternative form of a gene; information that comes
in two or more forms.
• SNP: (acronym for Single Nucleotide Polymorphism) A
position in a DNA sequence at which more than one of the
4 nucleotides (A, C, G, T) is found. The variants at a SNP are
one type of allele
• Haplotype: A subsequence of DNA that includes only
positions known to vary (SNPs)
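To make the SNP and haplotype definitions concrete, here is a minimal sketch that finds the variable positions in a set of aligned sequences and keeps only those positions. The sequences are made up for illustration, and the helper names are assumptions rather than anything from the lecture.

```python
def snp_positions(sequences):
    """Indices where aligned, equal-length sequences show more than one nucleotide."""
    return [i for i, column in enumerate(zip(*sequences)) if len(set(column)) > 1]


def haplotypes(sequences):
    """Each sequence restricted to the positions known to vary."""
    positions = snp_positions(sequences)
    return ["".join(seq[i] for i in positions) for seq in sequences]


# Made-up aligned sequences, for illustration only.
sample = ["ACGTAC", "ACCTAC", "ACGTTC"]
print(snp_positions(sample))  # [2, 4]
print(haplotypes(sample))     # ['GA', 'CA', 'GT']
```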
Causes of Genetic Variation
• Mutation: Changes in the genetic material of an
organism. Events that actually modify genes
potentially generating new alleles
• Recombination: A process in which new gene
combinations are introduced
– Crossovers, Gene-conversion, Lateral Gene Transfer
• Structural Rearrangement: Modifications that
impact the number of old gene copies and their
relative orderings
– Insertions, Deletions, Inversions
Mutations
• There are many ways of altering a gene, some common and
some rare
– Environmental exposure (radiation, chemical, etc.)
– Random events (faulty DNA replication, other malfunctions of
biochemical machinery)
• Many mutations affect cells of higher organisms without
genetic ramifications (mutations of the so-called somatic
cells), but they may be important to the organism (e.g., lead to
cancer)
• Mutations of the germline (gamete) cells are those of genetic
interest because they impact the life of genes, as opposed to
their protective organism
Sequence Organization
• The DNA sequence is broken into several independent
segments organized into structures called chromosomes
• Chromosomes vary between different organisms. The DNA
molecule may be circular or linear, and can contain from
10,000 to 1,000,000,000 nucleotides.
• Simple single-cell organisms (prokaryotes, cells without nuclei
such as bacteria) generally have smaller circular
chromosomes, although there are many exceptions.
• More complicated cells (eukaryotes, with nuclei) have linear
DNA molecules that are broken into segments and wound
around special proteins. The aggregates are called
chromosomes.
Monoploid Number
• The number of fragments that DNA is broken into leads to a
distinct number of chromosomes. The number is called the
monoploid number.
Organism      Unique Chromosomes
Human         23
Chimpanzee    24
Mouse         20
Dog           39
Horse         32
Donkey        31
Hare          23
Diploidy and Polyploidy
• Having only one copy of DNA is a risky proposition, since the
loss of a single functional gene could lead to a bad outcome
• Evolution has addressed this obvious shortcoming by
incorporating a mostly redundant copy of the entire sequence
in most cells
• The haploid number is the number of chromosomes in a
gamete of an individual.
• Nearly all mammals are diploid and receive a homologous
sequence from each parent
• Many plants carry more than 2 copies of their sequence; 4
and 8 are typical, and the number can vary between
subspecies.
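The relationship between the monoploid number and the number of chromosomes in a cell is simple multiplication, sketched below. The dictionary repeats a few values from the Monoploid Number slide; the function name and the tetraploid example (monoploid number 7) are hypothetical.

```python
# Monoploid numbers as listed on the Monoploid Number slide.
MONOPLOID = {"Human": 23, "Chimpanzee": 24, "Mouse": 20, "Dog": 39}


def chromosomes_per_cell(monoploid_number, ploidy=2):
    """Chromosomes in a typical (non-gamete) cell carrying `ploidy` copies."""
    return monoploid_number * ploidy


print(chromosomes_per_cell(MONOPLOID["Human"]))  # 46
print(chromosomes_per_cell(7, ploidy=4))         # 28, a hypothetical tetraploid plant
```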
Crossover Recombination
• In the formation of gametes (sperm and ovum), homologous
DNA strands are combined in a process called crossover
• This effectively combines the prefix of one sequence
with the suffix of another
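The prefix-plus-suffix description of crossover maps directly onto string slicing. This is a minimal sketch of a single crossover between two aligned sequences; the function name, the breakpoint parameter, and the toy sequences are assumptions made for illustration.

```python
def crossover(seq_a, seq_b, breakpoint):
    """Single crossover: prefix of seq_a joined to the suffix of seq_b."""
    if len(seq_a) != len(seq_b):
        raise ValueError("homologous sequences are assumed aligned and of equal length")
    if not 0 <= breakpoint <= len(seq_a):
        raise ValueError("breakpoint out of range")
    return seq_a[:breakpoint] + seq_b[breakpoint:]


maternal = "AAAAAAAA"  # toy sequences; real homologs differ at only a few sites
paternal = "CCCCCCCC"
print(crossover(maternal, paternal, 3))  # AAACCCCC
```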
Gene Conversion Recombination
• The DNA sequence is transferred from one copy (which
remains unchanged) to another, whose sequence is altered.
• Results from the repair of damaged DNA as described
by the Double Strand Break Repair Model.
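In contrast to crossover, gene conversion overwrites only an interior tract of one copy and leaves the donor untouched. The sketch below models that with slicing; the function name, the tract coordinates, and the toy sequences are illustrative assumptions, and the molecular repair mechanism itself is not modeled.

```python
def gene_conversion(donor, recipient, start, end):
    """Return the recipient with donor[start:end] copied over the same interval."""
    if len(donor) != len(recipient):
        raise ValueError("sequences are assumed aligned and of equal length")
    if not 0 <= start <= end <= len(donor):
        raise ValueError("tract coordinates out of range")
    return recipient[:start] + donor[start:end] + recipient[end:]


donor = "AAAAAAAA"
recipient = "CCCCCCCC"
print(gene_conversion(donor, recipient, 2, 5))  # CCAAACCC
print(donor)                                    # AAAAAAAA (unchanged)
```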
Lateral Gene Transfer
• Any process in which an organism incorporates genetic
material from another organism without being the offspring
of that organism.
• Lateral (horizontal) gene transfer is a confounding factor in
inferring phylogenetic trees based on sequences.
• One of the most prevalent forms of recombination in
“early” evolution
Structural Rearrangements
• Large-scale structural changes (deletions/insertions/inversions)
may occur in a population.
Wi’07
Vineet Bafna
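The three kinds of structural rearrangement named above can be sketched as operations on an ordered list of gene labels. This is a deliberate simplification: it tracks gene order only and ignores strand orientation, and the function names and labels are hypothetical.

```python
def delete(genes, start, end):
    """Remove the segment genes[start:end]."""
    return genes[:start] + genes[end:]


def insert(genes, position, segment):
    """Insert a segment before the given position."""
    return genes[:position] + list(segment) + genes[position:]


def invert(genes, start, end):
    """Reverse the order of the segment genes[start:end]."""
    return genes[:start] + genes[start:end][::-1] + genes[end:]


chromosome = ["a", "b", "c", "d", "e"]
print(delete(chromosome, 1, 3))      # ['a', 'd', 'e']
print(insert(chromosome, 2, ["x"]))  # ['a', 'b', 'x', 'c', 'd', 'e']
print(invert(chromosome, 1, 4))      # ['a', 'd', 'c', 'b', 'e']
```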