Combined Deficiency of Vitamin-K-Dependent Clotting Factors Type 2

IB404 - 18 - Human genome 4 – March 28
1. As you can imagine, there has been an extraordinary amount of work performed with the
human genome sequence, so we can only touch on a few examples. One of the simplest and most
obvious is that having it makes positional cloning of genes considerably easier. The basic idea is
that if you have a large family, or several families, with a mutant gene causing a disease, then it is
possible to map that mutant gene by linkage mapping to a region of a chromosome. This is
now readily possible using the physical maps with many molecular sequence markers developed
in the 1990s. Mapping can get down to perhaps 1Mbp, and then one searches for candidate loci in
the region, and sequences all exons after amplifying them using PCR primers designed to the
flanking introns, looking for mutations. The genome sequence
has hugely accelerated this last step.
2. One example is the VKORC1 gene, which encodes an
enzyme called vitamin K epoxide reductase. This enzyme
reduces and recycles vitamin K, after it has been used by
vitamin-K-dependent carboxylase to add a carbon dioxide
or carboxyl group to certain glutamic acid residues in
certain proteins. The major context of this enzyme cycle in
us is that some of the proteins modified are blood clotting
factors, and this modification is required for their activation.
The drug warfarin (coumadin) in small amounts is used to
treat heart attack and stroke victims to thin their blood, and
in large amounts to poison rats. Edward Doisy, who purified
vitamin K and got a Nobel in 1943, got his BS here.
3. A German group mapped a mutation in a family
with Combined Deficiency of Vitamin-K-Dependent Clotting Factors Type 2 to a 16 Mbp
region of chromosome 16. In an effort to narrow the
region down a little more they noted that there are a
series of genes in this region that are orthologous
and syntenic with a series of genes in the previously
mapped warfarin resistance locus region in rats and
mice. This narrowed the search to 4 Mbp, but there
were still 130 annotated genes with ~1100 exons.
4. So they sequenced all ~1100 exons from all patients and parents in their families, plus in
patients with resistance to warfarin treatment, plus in susceptible and resistant rats! The only gene
that had mutations changing amino acids in all patients and resistant rats was a three-exon gene
they call VKORC1, for vitamin K epoxide reductase complex subunit 1. In the homozygous
patients with clotting disorders they found a single base change (C>T) causing an arginine to
change to a tryptophan (arrows in top panel on next slide). Warfarin resistance is inherited
in a dominant fashion, and in four resistant patients they found four different heterozygous mutations,
each again causing single amino acid changes, while the resistant rats were also heterozygous at
yet another amino acid (“N” in bottom two panels). Thus amazingly all these mutations are
simple single-base changes causing so-called missense mutations, that is, changes in single
amino acids. You can imagine that many other kinds of mutations are possible, including
frameshifts, stop codons (called nonsense mutations), small indels, splice junction mutations,
promoter mutations, and others of no obvious effects, and these have been found in abundance in
other genes, e.g. hundreds of mutations causing cystic fibrosis have been identified.
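The missense/nonsense distinction above can be made concrete with a small sketch. It assumes the standard genetic code and that the reported C>T change is read on the coding strand; the function and variable names are illustrative and not from the original study. It also checks which arginine codon can become tryptophan through a single C>T change.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_substitution(ref_codon, alt_codon):
    """Label a single-codon change as synonymous, nonsense, or missense."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":  # '*' marks a stop codon
        return "nonsense"
    return "missense"


def c_to_t_variants(codon):
    """Yield every codon reachable by changing exactly one C to T."""
    for pos, base in enumerate(codon):
        if base == "C":
            yield codon[:pos] + "T" + codon[pos + 1:]


# Which arginine (R) codons can become tryptophan (W) by one C>T change?
arg_codons = [codon for codon, aa in CODON_TABLE.items() if aa == "R"]
arg_to_trp = [
    (ref, alt)
    for ref in arg_codons
    for alt in c_to_t_variants(ref)
    if CODON_TABLE[alt] == "W"
]

print(arg_to_trp)                            # [('CGG', 'TGG')]
print(classify_substitution("CGG", "TGG"))   # missense
print(classify_substitution("CGA", "TGA"))   # nonsense
```

Under these assumptions only CGG can give an arginine-to-tryptophan change by a single C>T substitution, while the same change in the neighbouring arginine codon CGA would instead create a stop codon.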
5. Human genes often have paralogs that were duplicated early in vertebrate evolution, often
four of them (derived from two polyploidization events in early chordates), but ranging from
none to many. VKORC1 turns out to have a single paralog in vertebrates, called VKORC1-Like1.
In fact this turns out to be a more conserved protein evolutionarily, but its function is unknown.
Presumably it has a related function because there is a single gene in other animals (except it has
been lost from nematodes). I found a single gene in three protists, implying that this is an ancient
gene, but lost from plants and fungi and other protists.
There are also several retro- or processed pseudogenes in the human and rodent genomes
for each paralog.
6. The public genome paper identified ~300 paralogs for ~1000 human disease genes,
and these might be involved in related genetic diseases. Similarly, 18 novel paralogs
of previous drug target genes were identified, which might then be new drug
targets, e.g. another serotonin receptor subtype.
[Figure: distance tree of VKORC1 and VKORC1L1 proteins; scale bar = 10% corrected distance.
VKORC1 clade: mouse, rat, human, chicken, Xenopus frog, pufferfish, zebrafish.
VKORC1L1 clade: mouse, rat, human, chicken, Xenopus frog, pufferfish, zebrafish.
Other deuterostomes: Ciona intestinalis, Ciona savignyi, Strongylocentrotus purpuratus (sea urchin).
Insects: D. melanogaster, D. pseudoobscura, Anopheles gambiae, Bombyx mori, Apis mellifera.
Kinetoplastid protists: Trypanosoma brucei, Trypanosoma cruzi, Leishmania major.]
7. The molecular markers used for these kinds of mapping projects are called sequence tagged
sites or STSs, and most commonly are microsatellites, that is, strings of repeats of di-, tri-, tetra- or penta-nucleotides where the repeat number varies. These are amplified using PCR primers
designed to the flanking unique DNA, and length variants scored on a gel or chromatograph.
This is also the technology used for forensic DNA work. Here is a dinucleotide microsatellite.
Allele1
Allele2
Allele3
Allele4
Primers
ACGGTCGATATGATAGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCG------------------TACCGCATATGTCATG
ACGGTCGATATGATAGCGCGCGCGCGCGCGCGCGCGCGCGCG------------------------TACCGCATATGTCATG
ACGGTCGATATGATAGCGCGCGCGCGCGCGCGCGCGCGCG--------------------------TACCGCATATGTCATG
ACGGTCGATATGATAGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGTACCGCATATGTCATG
ACGGTCGATATGATAG
TACCGCATATGTCATG
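Scoring a microsatellite amounts to measuring the amplified fragment between the two primers and converting its length to a repeat count. Here is a minimal sketch using the two flanking sequences from the figure; the repeat counts chosen for the example alleles are hypothetical, and the helper names are made up for illustration.

```python
import re

LEFT_FLANK = "ACGGTCGATATGATAG"   # left primer site, from the figure
RIGHT_FLANK = "TACCGCATATGTCATG"  # right primer site, from the figure


def make_allele(repeat_count, unit="CG"):
    """Build the amplified fragment for an allele with the given repeat count."""
    return LEFT_FLANK + unit * repeat_count + RIGHT_FLANK


def count_repeats(amplicon, unit="CG"):
    """Return the number of repeat units between the two flanking sequences."""
    pattern = (
        re.escape(LEFT_FLANK)
        + "((?:" + re.escape(unit) + ")+)"
        + re.escape(RIGHT_FLANK)
    )
    match = re.fullmatch(pattern, amplicon)
    if match is None:
        raise ValueError("amplicon is not flank + repeats + flank")
    return len(match.group(1)) // len(unit)


# Hypothetical individual carrying two alleles with different repeat counts.
genotype = [make_allele(12), make_allele(16)]
for fragment in genotype:
    print(len(fragment), "bp ->", count_repeats(fragment), "repeats")
# 56 bp -> 12 repeats
# 64 bp -> 16 repeats
```

On a gel or chromatograph only the fragment length is observed, so in practice the repeat count is inferred as (length minus the two flank lengths) divided by the repeat unit length, which is what the regular expression does here on the sequence itself.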
8. More recently the sequencing of multiple human genomes has led to identification of millions
of single nucleotide polymorphisms or SNPs, that is, places where human DNA commonly
varies by a single base pair, roughly 1/1000bp or 3 million per genome copy. These can be typed
or identified by various technologies and also used as genetic markers. One goal was to use these
to map more complicated genetic traits, such as the basis for predisposition to heart attack or
various cancers, that is, quantitative traits affected by alleles of several genes.
It turns out this is hard to do technically and financially, so the HapMap project was undertaken
to identify all the common haplotypes of multiple SNPs, hoping that this will simplify the
requirement for typing of millions of SNPs. Of course, the difficulty is that haplotypes can be
long or short, and for any particular region there could be many different haplotypes, all vestiges
of our relatively young history as a species. And these are only the common ones, the assumption
being that these common diseases will involve common haplotypes, in contrast to the rare
single-gene disorders, like the VKORC1 clotting deficiency, that involve rare mutations.
9. Here are haplotypes of SNPs across a 100 kbp region in (A) and a 500 kbp region encoding 7
genes in (B). The SNPs were typed in ~400 individuals, and the resultant frequencies of
haplotypes are shown in A. B shows the lengths of different haplotype blocks. Picking 2-3 SNPs
to type for each haplotype block simplifies the required analysis down to fewer than 1 million SNPs.
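The idea of typing only a few SNPs per haplotype block can be sketched as follows. The haplotypes, their counts, and the function names are all hypothetical; the brute-force search is just one simple way to pick tag SNPs, not the method used by the HapMap project.

```python
from collections import Counter
from itertools import combinations


def haplotype_frequencies(chromosomes):
    """Return the frequency of each distinct haplotype among the chromosomes."""
    counts = Counter(chromosomes)
    total = len(chromosomes)
    return {hap: n / total for hap, n in counts.items()}


def smallest_tag_set(common_haplotypes):
    """Find the fewest SNP positions that tell all common haplotypes apart."""
    n_snps = len(common_haplotypes[0])
    for size in range(1, n_snps + 1):
        for positions in combinations(range(n_snps), size):
            signatures = {tuple(h[p] for p in positions) for h in common_haplotypes}
            if len(signatures) == len(common_haplotypes):
                return positions
    return tuple(range(n_snps))


# Hypothetical block of 6 SNPs typed on 20 chromosomes (10 diploid individuals).
chromosomes = ["AGCTAG"] * 10 + ["AGTTCG"] * 6 + ["CACTAG"] * 4

freqs = haplotype_frequencies(chromosomes)
print(freqs)                               # {'AGCTAG': 0.5, 'AGTTCG': 0.3, 'CACTAG': 0.2}
print(smallest_tag_set(sorted(freqs)))     # (0, 2)
```

In this toy block, typing SNPs 0 and 2 is enough to tell which of the three common haplotypes a chromosome carries, so the other four SNPs need not be typed.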
10. One of the promises of the human genome project was that it might eventually help reveal the
genetic contribution to complex polygenic diseases like heart disease, diabetes, obesity, and
various predispositions to cancers and other diseases. Because these are polygenic and hence
difficult to map (except as QTLs or quantitative trait loci as is done in many species, but only
identifies broad regions of the genome), there was a hope that typing or characterizing the SNP
and haplotype patterns of sufficient numbers of patients might allow Genome Wide Association
of SNP variants and haplotypes with susceptibility to disease (or reaction to drugs, known as
pharmacogenomics; or other traits, like height, intelligence, or longevity). This is the promise of
“personal genomics”, that by determining one’s SNP genotype one might obtain information
about one’s own health. Ultimately, of course, complete diploid genomes would be the best, but it
is still too expensive to do on the scale needed for GWA studies.
To be statistically effective, given the huge numbers of SNPs being sampled, typically 500,000-600,000, representing all the major haplotype blocks in our genome, the numbers of patients and
controls in these GWA studies need to be in the thousands, preferably approaching 10,000.
Given that each genotyping costs about $500, these are very expensive studies, but many have
now been performed. This is why NIH is pushing so hard to get the cost of sequencing entire
diploid genomes down to $1000, so GWAS can be repeated using entire genome sequences.
Essentially one asks, is there a high probability of a particular SNP, usually between 1-10% of the
alleles in the population, being associated with the disease or trait? There is no hypothesis, no
candidate genes, it is a random undirected search of the entire genome with no preconceived
ideas about what one might find.
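The per-SNP question asked in a GWAS, and the reason so many patients are needed, can be illustrated with a minimal sketch. It assumes a simple allele-count chi-square test and a Bonferroni correction for the number of SNPs tested; both are common choices but are assumptions here, since these notes do not say which statistics the published studies used. The allele counts are hypothetical.

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table."""
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold after correcting for n_tests comparisons."""
    return alpha / n_tests


# Hypothetical SNP: 5,000 cases and 5,000 controls, so 10,000 alleles per group.
chi2, p = allelic_chi_square(1500, 8500, 1000, 9000)
threshold = bonferroni_threshold(500_000)

print(round(chi2, 2))           # 114.29
print(p < threshold)            # True
print(-math.log10(threshold))   # about 7

# Rough genotyping cost at the figures quoted above: 10,000 people at ~$500 each.
study_cost = 10_000 * 500
print(study_cost)               # 5000000
```

With half a million tests, a SNP has to reach a p-value of around one in ten million to stand out under this correction, which is why only large allele-frequency differences or very large cohorts give significant results.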
11. Here is an example of a GWAS for
metabolic traits, the kinds of
measures doctors routinely make from
blood tests these days, from the top,
triglycerides, HDL, LDL, CRP,
glucose, insulin, BMI, diastolic and
systolic blood pressure. The dots each
represent a SNP, with the
chromosomes alternating in black and
grey on each line. The Y axis is the log
score indicating divergence from
background, with the red line around
log6 showing statistical significance.
The vertical blue lines are regions
previously associated with these traits.
Some traits have many significant
SNPs/regions (haplotypes), while
others, like the bottom blood pressure
tests, have none. Most earlier
significant results were confirmed, but
some were not. This is the kind of
messiness one expects from tests in
different populations.
12. While several hundred such studies have now been published since 2005, and many
interesting associations of SNP haplotypes with diseases and traits have been identified, there are
several major problems with these GWAS studies. One is that SNPs seldom contribute more than
a few percent to the genetic variation for a trait, leaving a big question mark as to what the rest of
the genetic heritability of these traits is determined by. In this metabolic study you can see the
problem in this diagram. These SNPs only explain 2-20% of the known genetic variation.
Two obvious possibilities are that synergistic combinations of SNPs and haplotypes are what is
important (epistasis), or alternatively that alleles rarer than 1% in the population but with
larger effects are important, which will require whole genome sequencing to find. In
addition, there are other kinds of genetic variation (four slides down) that might be important.
13. A second problem is that when the SNPs are examined in detail, they seldom are what you
would hope, that is, they are seldom non-synonymous changes in exons of genes that encode
proteins. Instead most are outside of genes, and at best identify regions of the genome of interest,
with neighboring genes now being targeted for study. Nonetheless, they have given researchers
hundreds of new leads in trying to identify new genes involved in these many polygenic traits.
14. Some GWAS studies have not been successful, even though we know that the trait has a
considerable genetic basis, most famously for “general intelligence” or g. The most recent study
used the extremes of the range of g in ~10,000 children in the UK, and found only a few weak
associations (blue dots), with none significant at log 6 (below).
15. Several GWAS have now attempted to determine the genetic contribution to long life. Clearly
there is a major environmental component to long life as well, but surely there must be genetic
contributions. One controversial study from Europe was just published. They compared 801
centenarians (median age at death of 104 years) with 914 reasonably matched controls. As we
already know from many other studies including previous aging GWAS, the APOE gene has the
most significant effect, with 100% prediction of aging (see Manhattan plot below). This
apolipoprotein E has long been associated with Alzheimer’s disease, with several alleles
predisposing, and others protecting. Having the protective alleles is essential to becoming a
centenarian. But they extended this to a suite of 281 SNPs, which they call a “genetic model” of
aging, which together have good predictive power for becoming a centenarian, and confirmed the
model by repeating the analysis in two other cohorts of centenarians, including a group of 60 who
lived to a mean age of 107 years. More refined analysis allowed separation or clustering of these
SNPs into three separate groupings, suggesting three different ways to have a long-lived genetic
inheritance. Some of the SNPs in other aging-associated genes are highlighted in the figure too.
16. For 5 years it has been possible to get yourself genotyped this way, by various “personal
genomics” companies. The most famous is 23andMe, based in California, which for $300-500 will
genotype ~500,000 SNPs from a spit sample you mail to them. Then you get access to all your
raw data (the two nucleotides, A, C, G, T you have at each SNP position) on their website (useful
for checking newly published GWAS), plus an interpretation of the results for traits, disease
predisposition, and drug metabolism (primarily variants in p450 enzymes). These are all
explained, with the relative importance of genetic and environmental factors made clear.
I did this for my family, and the results are quite fascinating, even if not highly predictive, and as
expected, not devastating. For some they have been quite instructive, for example Francis Collins
discovered he has a predisposition for diabetes and made a radical lifestyle change as a result,
while for others they can be devastating, for example Sergey Brin (co-founder of Google and
husband of the co-founder of 23andMe, Anne Wojcicki), discovered that he carries a rare
LRRK2 mutation that substantially raises his risk of Parkinson’s disease.
The FDA does not approve of this kind of personal genetic testing, hence it has to be sold as
“entertainment”, and there is a major debate in the research, medical, and ethics community about
how readily available these kinds of genetic results should be. The doctors, of course, want to
control it all, making even more money. I and many others think we should all be free to find out
our own genetic inheritance, even if it means unpleasant discoveries, like your father is not really
your biological father, or that you have a high likelihood of early Alzheimer’s disease.
Some other companies doing personal genetic testing are Navigenics, Knome, and DecodeMe,
although they have more restrictive policies on sharing your own data with you, for example,
requiring that you consult with a doctor or genetic counselor, at considerably higher cost.
17. A related aspect of all this is that from the SNPs on your Y chromosome (for males, tracing the paternal line) or your
mitochondrial DNA (for everyone, tracing the maternal line), your ancestry can be figured out with considerable resolution. This
is because despite most genetic variation being between individuals, there are also residual
differences in SNP frequencies between races and population groups and geographic regions.
23andMe will also provide these results, and even goes further today, identifying close and distant
relatives who have also been genotyped by them, and allowing you to ask them to make contact
with you (I’ve yet to ask any of my fourth and fifth cousins, or to respond to the ~10 requests for
contact I’ve had – they seem too distant to me - my close relatives are all in Scotland still).
18. This distance phylogenetic tree shows
how this holds up across the entire world,
for SNPs on all chromosomes. CEU is
North Americans of European descent;
CHB is Chinese, JPT is Japanese, YRI is
Nigerians, and TGN and GDP are
Tongans and Papua New Guineans (the
study was confirming that Polynesians
derived from SE Asia). Note that while
most genetic divergence is indeed
between individuals (the long lines to
each individual colored dot), there is a
common root to each group. The longer
branch to YRI shows the divergence of
the Out-of-Africa grouping.
19. But SNPs are only one kind of genetic variation between humans. The other major kind is
called copy number variations or CNVs, and includes all insertions and deletions and
duplications, from single bases to Mbp. It is much harder to assay all of these, and it was not until
we had entire genomes from individuals that we realized that if you count up all the bases
involved in these CNVs we differ by up to 0.5%, and perhaps more. From an evolutionary
perspective, however, we count each of these CNVs as a single event.
A GWAS in the UK of 19,000 individuals for 3,432 polymorphic CNVs longer than 500 bp was
recently published, looking at eight diseases previously examined extensively by the same group
using SNPs. They found just three CNVs associated with Crohn’s disease and diabetes, but all
three regions had already been identified in their SNP GWAS, indicating that at least large CNVs
are not responsible for the missing genetic heritability, at least for these diseases.
20. The same may not be true for a wide array of neurological problems, from schizophrenia to
autism, where associations of large unique (non-inherited) CNVs are now being made. The
strategy used to find them is quite remarkable. The researchers argued that these individuals are
unlikely to reproduce effectively, hence their genetic defects might be unique (having occurred in
one of their parents’ germlines), and large, affecting many genes. Indeed they found in many
cases that individuals with major neurological defects have an excess of large CNVs versus
controls, and most of these are unique and novel to them. Figuring out which genes in single or
triple copies are the problem is a major task, something yet to be completely resolved even for
Down Syndrome or trisomy-21, where an entire chromosome is present in three copies. Given
that over 50% of genes are expressed in the brain, it is not surprising that having single or three
copies of several genes might cause neurological problems.
21. Today most effort is directed towards sequencing entire diploid genomes. Venter did his using
Sanger sequencing for ~$100m, James Watson was sequenced by 454 Life Sciences for ~$1m,
then a series of Chinese, Nigerian, and other individuals were sequenced with Illumina for $200,000 and
dropping, now down to around $5,000. Meanwhile Complete Genomics with other re-sequencing
methods claims to have it down to $3000, and two 1000-genome projects are underway here and
in China. Eventually the idea is to get to $1000 per genome and do GWAS again with 10,000
patients and controls, this time hopefully finding the missing genetic variation!
The bottom line from the comparisons of these genomes is that most non-Africans differ from
each other by about 3 million SNPs and 300,000 CNVs, while African groups differ from each
other, and everyone else, by around 4 million SNPs and about 400,000 CNVs.
A recent paper described four different Khoisan men, and Bishop Desmond Tutu, revealing that
indeed the San are the most divergent extant humans, even amongst themselves.
22. Finally, two years ago a group including Leroy Hood in Seattle did the obvious, and
sequenced the genomes of two children with homozygous recessive genetic conditions that were
previously unexplained, plus both parents - family-based sequencing. By eliminating all the
common variants already known from the SNP and CNV databases, they were able to quickly
narrow the search down to just a few candidate genes in which a child had inherited obvious
potentially deleterious mutations from each parent, and quickly managed to identify the genetic
basis for each condition. Furthermore they were able to show that each child inherited about 70
novel mutations from their parents, the first direct estimate of our actual mutation rate. They were
also able to reveal the complete recombination pattern for every chromosome, showing 1-3
crossovers for short chromosomes and many for long ones, such as #4 below. The power gained
by sequencing all four individuals was needed to get complete resolution here.
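The filtering logic behind family-based sequencing can be sketched in a few lines. This is a simplified illustration, not the published pipeline: it assumes the variant lists have already been restricted to potentially deleterious changes, and the variant IDs, gene names, and data layout (a dict mapping each variant to the number of copies carried) are all hypothetical.

```python
from collections import defaultdict


def recessive_candidate_genes(child, mother, father, common_variants, gene_of):
    """Genes where the child carries rare variants inherited from both parents."""
    from_mother = defaultdict(set)
    from_father = defaultdict(set)
    for variant, copies in child.items():
        # Skip variants the child lacks and those already known to be common.
        if copies == 0 or variant in common_variants:
            continue
        gene = gene_of[variant]
        if mother.get(variant, 0) > 0:
            from_mother[gene].add(variant)
        if father.get(variant, 0) > 0:
            from_father[gene].add(variant)

    candidates = set()
    for gene in from_mother.keys() & from_father.keys():
        maternal, paternal = from_mother[gene], from_father[gene]
        # Same variant from both parents, present in two copies in the child.
        homozygous = any(child[v] == 2 for v in maternal & paternal)
        # Two different variants in the same gene, one from each parent.
        compound = any(a != b for a in maternal for b in paternal)
        if homozygous or compound:
            candidates.add(gene)
    return candidates


# Hypothetical family: number of copies of each variant carried by each person.
child = {"v1": 2, "v2": 1, "v3": 1, "v4": 1, "v5": 1}
mother = {"v1": 1, "v2": 1, "v4": 1, "v5": 1}
father = {"v1": 1, "v3": 1, "v4": 1}
common_variants = {"v4"}  # already listed in the SNP/CNV databases
gene_of = {"v1": "GENE_A", "v2": "GENE_B", "v3": "GENE_B",
           "v4": "GENE_C", "v5": "GENE_D"}

result = recessive_candidate_genes(child, mother, father, common_variants, gene_of)
print(sorted(result))   # ['GENE_A', 'GENE_B']
```

Removing the common variants first is what makes the search tractable: of the five variants in this toy family, only two genes remain where the child could have lost both working copies.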