Recombination Mapping

Why
• A fundamental problem in human genetics today is locating and identifying the specific gene responsible for a given genetic disease. However, the disease is just a phenotype, and the gene responsible for that phenotype might be very different from what we would expect.
– For instance, Lesch-Nyhan syndrome’s most spectacular manifestation is self-mutilating behavior. The Lesch-Nyhan gene codes for hypoxanthine-guanine phosphoribosyltransferase, which helps salvage nucleotides derived from the breakdown of nucleic acids.
• So, we need to reduce the number of candidate genes to a manageable level.
• Using the naturally occurring recombination process to map genes remains the best way to localize the gene responsible for a genetic disease. The goal is to reduce the amount of DNA that needs to be searched to a small region, a few million base pairs or so. Below that level, molecular tools need to be employed.
Markers for Mapping
• What makes a good marker:
– co-dominant (so homozygotes and heterozygotes can
be distinguished)
– many alleles at each locus (so most people will be
heterozygous and different from each other)
– many loci well distributed throughout the genome
– easy to detect, especially with automated machinery
• No system is perfect
Marker Systems
• Originally, genetic markers were visible phenotypes and blood groups. There simply aren’t enough markers available, and many of them are dominant. Also, very few people display visible phenotypes that can be attributed to single genes.
– Before the advent of molecular markers, very few genes had been mapped, and most of them were on the X.
• Protein electrophoresis. Isozymes are enzymes that have different electrophoretic mobility because they are produced by different alleles at the same gene.
– They are usually co-dominant, but frequently form dimers that can confuse interpretation.
– However, no more than 100 have ever been described, and many of these are not very polymorphic.
– Each enzyme requires a unique set of reaction conditions, which makes automation difficult.
[Figure: Isozymes of lactate dehydrogenase (LDH)]
More Marker Systems
• Restriction Fragment Length Polymorphisms (RFLPs). The original DNA-based marker system.
– These markers are (usually) single nucleotide polymorphisms which create or destroy a restriction site (a 6-8 bp sequence that can be cut by a restriction enzyme). Thus, they have only 2 alleles per locus.
– The original detection technique, Southern blotting, was expensive, time-consuming and finicky (and radioactive too).
• Microsatellites (also called Simple Sequence Repeats: SSRs or Short Tandem Repeats: STRs). Short repeats of 2-5 bp in a tandem array. During replication, DNA polymerase occasionally “stutters”: it increases or decreases the number of repeats, which creates new alleles.
– Lots of loci well scattered throughout the genome. Most loci have multiple alleles that are easily distinguishable.
– Detected by PCR followed by electrophoresis.
– Electrophoresis needs to be high resolution: it must easily detect length differences of 2 bp.
Single Nucleotide Polymorphisms
• Single Nucleotide
Polymorphisms (SNPs).
Which of the 4 possible
nucleotides is present at an
exact position in the DNA.
– The current method of choice.
– Each locus has a maximum of 4
alleles (with 2 being the usual
case).
– There are very large numbers of
SNP loci, often several per gene
even within exons.
– Detection can be done with
assays that don’t require
electrophoresis and so are very
fast and easy to automate.
– At present there are
approximately 12 million human
SNPs recorded in the NCBI
database.
Fingerprinting Markers
• Fingerprinting markers are used to distinguish the DNA of one person from another. Not generally useful for mapping.
– Criminal investigations
– Paternity tests
– Body identification
• Major Histocompatibility Locus (MHC), also called Human Leukocyte Antigen (HLA). The main gene locus involved in the immune system’s ability to distinguish self from non-self.
– Lots of haplotypes, but all at one location on chromosome 6.
• Minisatellites, also called Variable Number Tandem Repeats (VNTRs).
– Longer than microsatellites: 10-60 bp.
– Many loci (about 1000 known), but mostly clustered near telomeres.
– No general method of finding them.
CODIS
• CODIS (Combined DNA Index System) is the marker system used by the FBI and foreign police agencies for DNA-based identification.
– Based on Short Tandem Repeats (STRs).
• The FBI currently uses a set of 13 markers, located on many different chromosomes, plus a marker for distinguishing the X and Y chromosomes.
– The European Union uses a somewhat different set of markers, and there are proposals to add and drop several of the current CODIS markers. The FBI’s plan is to expand from 13 to 18 markers soon.
• All are 4 or 5 bp repeats, which PCR-amplify better than 2 bp repeats and are easier to tell apart.
• The markers aren’t associated with any disease genes or other visible phenotypes.
• Detected with commercially available kits, with PCR amplification products run on a DNA sequencing machine, which gives precise band sizes (which are easily compared between labs).
• CODIS markers are multiplexed: several different loci are run on the same electrophoresis gel lane. PCR primers are chosen to give different, non-overlapping sizes to the amplified bands.
STR Alleles
• Alleles are named by the number of complete repeats they have. Some variant alleles have a partial repeat: the number of bases in the partial repeat is used after the decimal point. For example, the TH01 locus has an allele called 9.3 that is common in Caucasians. It has 9 complete repeats plus another partial repeat that has only 3 bases in it.
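The naming rule can be sketched in a few lines of Python. This is only an illustration: the function name `str_allele_name` is made up here, and the example assumes a 4 bp repeat unit and counts only the bases in the repeat region, not the flanking sequence.

```python
def str_allele_name(total_bases: int, repeat_length: int) -> str:
    """Name an STR allele from the length of its repeat region.

    total_bases: bases in the repeat region only (flanking sequence excluded)
    repeat_length: size of one repeat unit, e.g. 4 for a tetranucleotide STR
    """
    complete_repeats, partial_bases = divmod(total_bases, repeat_length)
    if partial_bases == 0:
        return str(complete_repeats)
    # Partial repeat: the leftover base count goes after the decimal point.
    return f"{complete_repeats}.{partial_bases}"


# 9 complete 4 bp repeats plus 3 extra bases = 39 bases in the repeat region
print(str_allele_name(39, 4))  # 9.3
print(str_allele_name(36, 4))  # 9
```

Note that the ".3" is a count of bases, not a fraction of a repeat, so 9.3 is not the same thing as "nine and three tenths" repeats.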
Some CODIS Markers for 10 Random
Individuals
[Figure panels: D1S80 and D21S11]
Probability of Identity
• The fundamental question with fingerprinting: what is the chance
that two unrelated individuals will have the same genotype?
(Probability of identity, Pi)
– More alleles at any given locus improves the chances of not having
unrelated people matching.
– Since loci are genetically independent, Pi for several loci together is just
the product of the individual Pi’s.
– For perspective: there are about $7 \times 10^{9}$ people living today, which means there are about $25 \times 10^{18}$ possible pairs of individuals. To be sure that you don’t misidentify someone, you need a Pi that is much less than 1 in $25 \times 10^{18}$, that is, much less than about $4 \times 10^{-20}$.
• Study done by National Institute of Standards and Technology
(NIST) in 2012.
– Examined 1036 unrelated individuals from the US, divided into these groups:
Caucasian, African-American, Hispanic, and Asian. Ethnicity was self-identified,
a procedure that obviously has some issues.
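The product rule and the pair count can be checked with a short Python sketch. The per-locus values below are hypothetical placeholders (13 loci, each with Pi = 0.05); they are not taken from the NIST study, and the variable names are only for illustration.

```python
from math import comb, prod

world_population = 7 * 10**9
possible_pairs = comb(world_population, 2)  # n(n-1)/2, about 2.45e19

# Hypothetical per-locus probabilities of identity (NOT measured values).
per_locus_pi = [0.05] * 13

# Independent loci: the combined Pi is the product of the per-locus Pi values.
combined_pi = prod(per_locus_pi)

# Expected number of unrelated pairs worldwide that match by chance.
expected_matching_pairs = combined_pi * possible_pairs

print(f"possible pairs:            {possible_pairs:.2e}")
print(f"combined Pi:               {combined_pi:.2e}")
print(f"expected matching pairs:   {expected_matching_pairs:.2e}")
```

If the expected number of matching pairs is well below 1, a coincidental match anywhere in the population is unlikely; with these placeholder values it is not, which shows why both the number of loci and their individual Pi values matter.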
Probability of Identity for Individual Loci
• This table shows the probability that two people of the same ethnicity share the same genotype at specific loci.
• Range is about 0.5% to 20%, depending on ethnicity and locus.
Pi with different marker sets and ethnic groups
Mutations in STR Loci
• STR loci have a high mutation rate relative to base change mutations (SNPs). This phenomenon produces multiple alleles, which is very useful for easy identification of individuals. However, it also complicates paternity tests and other relationship studies.
• Mutations are counted in situations where both parents and their child have been tested, it is clear that they are the real parents, and the child contains an allele not found in either parent. Data from the American Association of Blood Banks.
– For 19 alleles, examined in roughly 500,000 cases, mutation rates are between 0.1% and 0.3% in most cases.
CODIS Issues
• NIST works to understand unusual variants by sequencing them when they are reported.
• Variant alleles. The more individuals are tested, the more new, rare variants appear.
– Different numbers of repeat units as well as partial repeats.
– Sometimes a large change in repeat number moves a band out of the expected range on the gel.
Images from
http://www.cstl.nist.gov/biotech/strbase/pub_pres/Kline_DuckKey2005.pdf
More Problems
• Null alleles and drop-outs. No amplification occurs with a specific locus:
– Appears as if the subject were a homozygote.
– Can be caused by a mutation in one of the primer sites, or the deletion of the entire locus.
– This event is detected when two different sets of primers are used to amplify the same locus: one set produces a band and the other doesn’t.
• Tri-allelic cases. Sometimes due to duplications of the locus (including trisomy 21), sometimes due to mosaic tissue (or even mixed samples).
Images from
http://www.cstl.nist.gov/biotech/strbase/pub_pres/Kline_DuckKey2005.pdf
Ethnicity Prediction
• Some loci have very different frequencies in different ethnic groups.
• However, self-reported ethnicity isn’t very reliable.
• And, ethnicity isn’t a well-defined concept anyway.
• Mutation rates in STRs: identity by state (2 people have the same allele) vs. identity by descent (2 people have inherited an allele from the same common ancestor).
• SNPs and Alu element insertions are more stable than STRs and
probably work better for ethnicity prediction.
• A related issue: linkage to disease genes. A DNA profile may give
information about susceptibility to diseases.
– Note that many disease genes were mapped using STR markers
Gene Mapping
Recombination Basics
• In prophase of meiosis I, homologous chromosomes synapse (pair up) and crossing over occurs. The chromosomes break at approximately the same location and are rejoined to each other. This is called crossing over or recombination.
– The recombinase enzyme complex catalyzes this reaction.
• A crossing over event has 2 possible outcomes:
– Crossover: genetic markers outside the site of crossing over switch chromosomes. This is what we usually think of.
– Gene conversion: markers outside the site of crossing over stay on the same homologues, but a short region of DNA at the site is made homozygous: one allele is replaced by another allele.
More Basics
• Recombination appears in the offspring’s phenotype as exchange of marker genes on either side of the crossover.
• Thus, to detect crossing over we examine two marker genes. The parent we are observing must be heterozygous for both genes.
– If both dominant alleles are on one homologue and both recessives are on the other, the alleles are in coupling phase.
– If one dominant and one recessive are on each homologue, the alleles are in repulsion phase.
– Coupling and repulsion can also be used to describe relationships between codominant markers.
• The marker alleles in an offspring are either in the Parental configuration (same as they were in the parents) or in the Recombinant configuration (marker exchange has occurred).
Map Distances
• Crossing over occurs at random along the chromosome, which means that the closer 2 genes are, the less frequently recombination occurs between them. This is the basis for mapping.
• Recombination Fraction (RF or theta or θ) is the percentage of recombinant gametes produced.
– One complicating factor when looking at offspring: meiosis occurs in both parents.
• RF is never more than 50%, because only 2 of the 4 chromatids are involved in any one crossover.
• 1% recombination = 1 map unit = 1 centiMorgan (cM), but only for short distances.
• For longer distances, double crossovers decrease the observed recombination frequency.
– Two crossovers between marker genes leave the markers in the parental configuration: there is no way to tell there were any crossovers.
• Double crossovers should occur at a frequency predictable from the distances between genes, but there is also interference, which affects the chance of a crossover (CO) in any interval.
– Interference: one crossover inhibits the occurrence of another nearby.
Mapping Function
• We want a gene map to be calibrated in map units that accurately reflect the frequency of crossovers between genes. The equation used to convert the observed recombination fraction into map units is called the mapping function.
• For a simple model of randomly placed crossovers and no interference, Haldane’s function works well:
w = - ½ ln(1-2θ) , where w is map distance and θ is the observed proportion of recombinants
– This expression produces the curve on the previous slide.
• Interference complicates things, and a variety of functions can be used. Kosambi’s function is a common one:
w = ¼ ln[(1+2θ) / (1-2θ)]
• Interference has been estimated for human genes, and it seems to be a very small effect. For a 10 cM interval, only 0.01% of the potential crossovers are inhibited by interference.
• Also, from a practical point of view, the main value of recombination mapping is finding a small region of DNA to search with molecular tools. Worrying about interference seems (to me) to be a lot of work for very little benefit.
• Further, it is clear that a crossover is not equally probable at every nucleotide: at the level of the DNA sequence, recombination primarily occurs at hot spots with very little in between.
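To get a feel for how the Haldane and Kosambi functions differ, here is a minimal Python sketch. It assumes w comes out in Morgans, so the result is multiplied by 100 to give centiMorgans; the function names are illustrative, and both formulas are only defined for θ below 0.5.

```python
import math


def haldane_cm(theta: float) -> float:
    """Haldane map distance in cM (random crossovers, no interference)."""
    return -0.5 * math.log(1 - 2 * theta) * 100


def kosambi_cm(theta: float) -> float:
    """Kosambi map distance in cM (allows for some interference)."""
    return 0.25 * math.log((1 + 2 * theta) / (1 - 2 * theta)) * 100


for theta in (0.01, 0.10, 0.20, 0.30, 0.40):
    print(f"theta={theta:.2f}  "
          f"Haldane={haldane_cm(theta):6.1f} cM  "
          f"Kosambi={kosambi_cm(theta):6.1f} cM")
```

For small θ both functions give nearly θ × 100, which matches the rule that 1% recombination is 1 cM only over short distances. As θ approaches 0.5 both distances grow without bound.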
Chiasmata
• Crossing over is visible in the microscope as chiasmata (which is the plural form of chiasma).
• It is possible to count chiasmata. Each one counts as 50 map units (one crossover between 2 of the 4 DNA molecules at prophase of meiosis I).
• In male meiosis (testicular biopsy), one study showed an average of 50.6 chiasmata per cell. Multiplying by 50, this gives 2530 cM as the length of the genetic map in males.
• In female meiosis (between 16 and 24 weeks of fetal life), an average of 70.3 chiasmata per cell were seen. This gives a female map of 3515 cM.
• Recombination mapping has given estimates of 2590 cM for males and 4281 cM for females.
• So, females have more crossovers and a larger map than males. The total map length in humans is about 3000 cM.
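The chiasma arithmetic is easy to reproduce. In this small sketch the constant and function names are made up for illustration, and averaging the two sexes is only an assumption about where a round figure like 3000 cM could come from.

```python
CM_PER_CHIASMA = 50  # one chiasma involves 2 of the 4 chromatids


def map_length_cm(mean_chiasmata_per_cell: float) -> float:
    """Genetic map length implied by an average chiasma count per cell."""
    return mean_chiasmata_per_cell * CM_PER_CHIASMA


male = map_length_cm(50.6)
female = map_length_cm(70.3)
sex_averaged = (male + female) / 2  # assumption: simple average of the two

print(f"male:   {male:.0f} cM")
print(f"female: {female:.0f} cM")
print(f"sex-averaged: {sex_averaged:.0f} cM")
```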
LOD Score Mapping
• The general problems with mapping genes in humans: small families, uncontrolled matings, uncertain paternity.
– Thus you can’t set up a test cross, where one parent is a heterozygote and the other is homozygous for other alleles, and count parental and recombinant offspring.
• Given a pedigree (a family), the LOD score method involves determining the probability (the likelihood) of that family at different values of θ, the recombination fraction.
• Then, the method allows you to combine the evidence from different families by adding their LOD scores, even if some information about them is missing or ambiguous. Also, each family can start with different parental arrangements of markers, and can have different numbers and types of children.
• The LOD score method is an example of a maximum likelihood procedure.
• The point of the maximum likelihood procedure is to estimate the value of a parameter that can’t be directly observed, in this case the recombination fraction.
• The likelihood (probability) of an observed set of data (the phenotypes seen in a family, in this case) is calculated as a function of that parameter.
• The parameter value that gives the maximum likelihood is taken as the best estimate of the parameter.
LOD Procedure
1. Start with a model of inheritance for the gene of interest: an
equation that gives the expected frequency of various types of
offspring given an arbitrary value of θ.
2. Using a form of the binomial expansion, determine the likelihood of
your data (family) at a number of different values of θ: L(θ)
3. Determine the odds ratio: the likelihood at each value of θ divided by
the likelihood at θ = 0.5 (unlinked).
– The LOD score is the base 10 logarithm of the odds ratio. This is the log of the
odds, the LOD score for each value of θ.
4. At each value of θ, add the LOD scores across families. This is the beauty
of logarithms: they can be added. Thus, data from many small
families can be added to achieve a statistically significant value for θ.
Statistical significance
• A LOD score of 3.0 for some value
of θ is considered the threshold for
accepting that the two genes are
linked, with a 5% chance of a false
positive (p = 0.05).
• A LOD score of -2 is considered
evidence for the genes not being
linked.
• Generally more than one value of θ will go over the 3.0 level. The θ with the highest LOD score is the point estimate of the true map distance. All adjacent θ values whose LOD score is no more than 1 below the maximum value are considered the “support interval”, the region in which the true linkage value is found.
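The decision rules above can be sketched as a small Python helper. The LOD table here is hypothetical (made-up numbers for illustration only), and `summarize_lod` is not a standard function; it just applies the "maximum minus 1" rule to a table of summed LOD scores.

```python
def summarize_lod(lod_by_theta: dict[float, float], drop: float = 1.0):
    """Return (best theta, max LOD, support interval) from a LOD table."""
    thetas = sorted(lod_by_theta)
    best = max(thetas, key=lambda t: lod_by_theta[t])
    max_lod = lod_by_theta[best]
    cutoff = max_lod - drop
    lo = hi = thetas.index(best)
    # Walk outward from the peak while adjacent values stay above the cutoff.
    while lo > 0 and lod_by_theta[thetas[lo - 1]] >= cutoff:
        lo -= 1
    while hi < len(thetas) - 1 and lod_by_theta[thetas[hi + 1]] >= cutoff:
        hi += 1
    return best, max_lod, (thetas[lo], thetas[hi])


# Hypothetical summed LOD scores at each theta (illustration only).
lods = {0.01: -1.2, 0.05: 2.1, 0.10: 3.4, 0.15: 3.1,
        0.20: 2.5, 0.30: 1.2, 0.40: 0.3}

best, max_lod, interval = summarize_lod(lods)
verdict = "linked" if max_lod >= 3.0 else "not significant"
print(best, max_lod, interval, verdict)
```

With this table the peak is at θ = 0.10 and the support interval runs from 0.10 to 0.20, because the scores at 0.05 and 0.30 fall more than 1 unit below the maximum.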
Developing a Model
• We will use an example of two
heterozygotes mating. We want to
estimate the recombination distance
between genes A and B, which both
show complete dominance.
• Both parents produce recombinant
and parental gametes, which we can
combine using a Punnett square.
• θ is the proportion of recombinant
gametes. Since there are two
recombinant gametes, each has a
proportion of 1/2 θ.
• 1- θ is the proportion of parental
gametes. Each of the two parental
gametes has a proportion of 1/2(1- θ).
Gametes:
Parental:
A B
a b
Recombinant:
A b
a B
Punnett Squares with Frequency Equations
• The next step is to create
equations showing the
frequency of each phenotype of
offspring. This is most easily
done using a Punnett square.
• For each cell, the equations for
the gamete frequencies are
multiplied together.
• Then all cells with the same
phenotype are added together.
• Final result: 4 equations
showing the expected
frequency of each phenotype as
a function of Ɵ (the proportion
of recombinant gametes, the
map distance).
• Note that the sum of the 4
equations is 1.0.
Punnett square with equations
for the frequency of each type of
offspring. The equations are
generated by multiplying the
gamete frequencies together.
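As a sketch of what the four equations look like, assume both parents carry A B on one homologue and a b on the other (coupling phase). The phase is not stated in the text, so this is an assumption. Multiplying the gamete frequencies cell by cell and collecting cells with the same phenotype gives:

```latex
\begin{aligned}
P(aa\,bb)   &= \tfrac{1}{4}(1-\theta)^2 \\
P(A\_\,bb)  &= \tfrac{1}{4}\left[1-(1-\theta)^2\right] \\
P(aa\,B\_)  &= \tfrac{1}{4}\left[1-(1-\theta)^2\right] \\
P(A\_\,B\_) &= \tfrac{1}{4}\left[2+(1-\theta)^2\right]
\end{aligned}
```

The four expressions sum to 1 for any θ, and at θ = 0.5 they reduce to the familiar 1/16, 3/16, 3/16 and 9/16 of an unlinked dihybrid cross.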
Expected Frequencies at Different Values of θ
• Once the equations for phenotype frequencies as a function of recombination frequency have been generated, it is easy to substitute in different values. This generates a table of expected frequencies of the phenotypes.
• Range: RF = 0.0 is completely linked, to RF = 0.5, which is unlinked.
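A table like this can be generated with a few lines of Python instead of a spreadsheet. The sketch assumes both parents are A B / a b (coupling phase) with complete dominance at both genes; the phase is an assumption, and the function name is illustrative.

```python
def phenotype_frequencies(theta: float) -> dict[str, float]:
    """Expected offspring phenotype frequencies for AB/ab x AB/ab.

    Assumes coupling phase in both parents and complete dominance.
    """
    k = (1 - theta) ** 2  # chance that both gametes are parental, times 4 / 4
    return {
        "A_ B_": (2 + k) / 4,
        "A_ bb": (1 - k) / 4,
        "aa B_": (1 - k) / 4,
        "aa bb": k / 4,
    }


print("theta  A_ B_   A_ bb   aa B_   aa bb")
for theta in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
    freqs = phenotype_frequencies(theta)
    print(f"{theta:.1f}    " + "  ".join(f"{v:.4f}" for v in freqs.values()))
```

At θ = 0.0 only two phenotypes appear (3/4 A_ B_ and 1/4 aa bb), and at θ = 0.5 the row is the 9:3:3:1 ratio expected for unlinked genes.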
Likelihood of a Family
• Likelihood functions determine the probability of the observed data in terms of the parameter being estimated.
• For LOD scores, a version of the binomial expansion is used.
• The binomial describes the probability of families with two different phenotypes: $(p + q)^n = 1$
– p = probability of a normal child
– q = probability of a mutant child
– n = total number of children
– each term describes a different family composition
– the exponents on p and q represent the number of children with each phenotype.
• Consider a family of 3 children whose parents are heterozygous for a recessive genetic disease.
– p = chance of normal child = 3/4
– q = chance of mutant child = 1/4
• $p^3 + 3p^2q + 3pq^2 + q^3 = 1$
• Here, $p^3$ is a family of 3 normal children, $3p^2q$ is 2 normal plus 1 affected, $3pq^2$ is 1 normal plus 2 affected, and $q^3$ is 3 affected.
• Chance of 2 normal + 1 affected is described by the term $3p^2q$. Thus, $3 \times (3/4)^2 \times 1/4 = 27/64$.
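The binomial terms for the 3-child family can be checked exactly with Python's `fractions` module. This is a small sketch; `family_probability` is an illustrative name, and the defaults are the 3/4 and 1/4 from the example.

```python
from fractions import Fraction
from math import comb


def family_probability(n_normal: int, n_affected: int,
                       p: Fraction = Fraction(3, 4),
                       q: Fraction = Fraction(1, 4)) -> Fraction:
    """One term of (p + q)^n: the chance of this exact family composition."""
    n = n_normal + n_affected
    return comb(n, n_normal) * p**n_normal * q**n_affected


for normal in range(3, -1, -1):
    affected = 3 - normal
    print(f"{normal} normal, {affected} affected: "
          f"{family_probability(normal, affected)}")
```

The line for 2 normal and 1 affected prints 27/64, and the four terms add up to exactly 1.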
Multinomial Distribution
• The multinomial distribution extends the binomial to
more than two phenotypes. It is very simple: just add
more components to each term.
– For example, for 4 phenotypes, $C\,p^2 q^1 r^3 s^1$ (where C is some coefficient) describes the probability of a family of 7 children, where 2 of them have the “p” phenotype, 1 has the “q” phenotype, 3 have the “r” phenotype, and 1 has the “s” phenotype.
• The coefficients in front of each term represent the
number of possible families of the given composition.
For the binomial we can calculate the coefficients using
Pascal’s triangle (or a useful formula).
• However, for LOD score mapping we don’t need to
bother with the coefficients because they get divided out.
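A quick numerical check shows why the coefficient can be ignored. In this sketch the first probability set is hypothetical (chosen only so that it sums to 1), the second is the 9:3:3:1 unlinked ratio, and the function names are illustrative.

```python
from math import factorial, prod


def multinomial_coefficient(counts):
    """Number of distinct birth orders for a family with these counts."""
    return factorial(sum(counts)) // prod(factorial(c) for c in counts)


def likelihood(counts, probs, with_coefficient=True):
    core = prod(p ** c for p, c in zip(probs, counts))
    return multinomial_coefficient(counts) * core if with_coefficient else core


counts = (2, 1, 3, 1)                        # the 7-child family above
probs_a = (0.60, 0.15, 0.15, 0.10)           # hypothetical p, q, r, s
probs_b = (0.5625, 0.1875, 0.1875, 0.0625)   # 9:3:3:1, i.e. unlinked

ratio_full = likelihood(counts, probs_a) / likelihood(counts, probs_b)
ratio_core = (likelihood(counts, probs_a, False)
              / likelihood(counts, probs_b, False))

print("coefficient C:", multinomial_coefficient(counts))
print("ratio with C:   ", ratio_full)
print("ratio without C:", ratio_core)
```

The coefficient depends only on the family composition, not on the probabilities, so it appears in both the numerator and the denominator of the likelihood ratio and cancels.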
Likelihood Ratio
• Using a spreadsheet, we first calculate the
expected frequency of each type of offspring at
different values of θ.
• Then we use the data from actual families to
calculate the likelihood of each family at each
value of θ.
• Then we take the likelihood ratio: divide the
likelihood at each θ by the likelihood at θ = 0.50
(i.e. unlinked).
• Then we take the logarithm (base 10) of each
likelihood ratio. This is the LOD score.
Example
• Consider a family of 7 children:
– A_ B_ : 4 children
– A_ bb : 2 children
– aa B_ : 0 children
– aa bb : 1 child
• The expression we will use to determine likelihood L(θ) is $p^4 q^2 r^0 s^1$, where p, q, r, and s are the probabilities of the 4 types of offspring (A_ B_, A_ bb, aa B_, and aa bb) at different values of θ.
• The likelihood ratio L(θ) / L(0.5) is obtained by dividing each L(θ) value by the unlinked likelihood L(0.5), which is 0.00021997 for this family.
• The LOD score is the base 10 logarithm of the likelihood ratio.
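The whole calculation for this family fits in a short Python sketch. It assumes both parents are A B / a b (coupling phase) with complete dominance; the phase is an assumption, though it does reproduce the L(0.5) value quoted above. Function names are illustrative.

```python
import math


def phenotype_probs(theta):
    """(A_ B_, A_ bb, aa B_, aa bb) for AB/ab x AB/ab; phase is assumed."""
    k = (1 - theta) ** 2
    return ((2 + k) / 4, (1 - k) / 4, (1 - k) / 4, k / 4)


def likelihood(counts, theta):
    # Multinomial coefficient omitted: it cancels in the likelihood ratio.
    return math.prod(p ** n for p, n in zip(phenotype_probs(theta), counts))


def lod(counts, theta):
    ratio = likelihood(counts, theta) / likelihood(counts, 0.5)
    # A ratio of 0 means the family is impossible at this theta.
    return math.log10(ratio) if ratio > 0 else float("-inf")


family = (4, 2, 0, 1)  # A_ B_, A_ bb, aa B_, aa bb
print(f"L(0.5) = {likelihood(family, 0.5):.8f}")
for theta in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"theta={theta:.1f}  LOD={lod(family, theta):.3f}")
```

At θ = 0 the LOD is minus infinity, because the two A_ bb children cannot occur without recombination under this model. The LOD at θ = 0.5 is 0 by definition.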
Maximum LOD Score
• The LOD score data for this family shows that a recombination
frequency of 0.3 is the most likely.
• However, the maximum LOD score is only 0.133, far less than the
value of 3.0 needed to demonstrate linkage.
• More data from other families is needed. LOD scores for each value
of Ɵ can be added together.
– It typically requires about 20 families to prove linkage.
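Adding LOD scores across families is just a sum at each θ. The per-family tables in this sketch are hypothetical numbers chosen for illustration, not real data, and three families are shown only to keep the example short.

```python
thetas = (0.1, 0.2, 0.3, 0.4)

# Hypothetical per-family LOD scores at each theta (illustration only).
family_lods = [
    {0.1: -0.30, 0.2: 0.05, 0.3: 0.13, 0.4: 0.10},
    {0.1: 0.90, 0.2: 0.75, 0.3: 0.50, 0.4: 0.22},
    {0.1: 0.40, 0.2: 0.55, 0.3: 0.45, 0.4: 0.20},
]

# Sum the LOD scores at each theta across all families.
total = {t: sum(fam[t] for fam in family_lods) for t in thetas}
best_theta = max(total, key=total.get)

for t in thetas:
    print(f"theta={t:.1f}  summed LOD={total[t]:.2f}")
print("best theta:", best_theta,
      "| reaches 3.0:", total[best_theta] >= 3.0)
```

Each family alone is weak evidence, and even the sum of these three stays below 3.0, which is why many families are usually needed.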