gene structure - School of Life Sciences
Transcript gene structure - School of Life Sciences
IB404 - 22 - Other vertebrates – April 11
1.The mouse genome is around 2.5 Gbp, versus 2.9 for human, primarily because deletions were
more frequent in the rodent lineage, removing more “junk” DNA (transposons and pseudogenes).
2. The rodent/primate split is thought to be about 75 MYA. Note that this means that the placental
mammals radiated into the lineages that became the major orders before the dinosaurs went
extinct roughly 67 Myr ago at the Cretaceous/Tertiary asteroid impact boundary. Marsupials are
perhaps 140 Myr diverged, while egg-laying monotremes are perhaps 170 Myr old.
3. Simple DNA comparisons of the mouse and human sequences indicate that ±40% remains in
common between them, the rest having been removed by deletion or added by lineage-specific
insertion of transposons and pseudogenes. In this 40%, some 600,000 well-conserved,
reciprocally-unique, bits of sequence in the range of 50-200bp long are present. Many are exons
of genes, but there are many others. The vast majority of these show great microsynteny in large
blocks, e.g. this 500 kb block on mouse chromosome 12 compared to a microsyntenic 600 kb
block on human chromosome 14. Note that mouse is “uniformly” shorter than human.
4. On a genome-wide scale, the mouse and human genomes have suffered something like 300
chromosomal rearrangements if one looks only at blocks over 300 kb in size, so there could be
still more smaller ones. The result is that mice have only 20 acrocentric chromosomes and
rather radical reshuffling of microsyntenic blocks. Other mammal genomes are less radically
rearranged relative to the human genome, although there are some other peculiarities, e.g.
the muntjac or Asian barking deer has just five chromosomes, while horses have 32!
It seems that rates of rearrangement were higher in the mouse lineage.
Note also that the X has not suffered translocations with or
transpositions to other chromosomes.
5. The gene/protein content of the mouse and human
genomes are expectedly very similar. On the grandest scale
they reveal another version of the “quartered-pie” figure we
saw for human, with roughly 25% of proteins being
vertebrate-specific, animal- (metazoan) specific, 25%
eukaryote-specific, and the final 25% universal.
6. The tiny rodent-specific fraction reflects
lineage-specific expansions of certain
families, primarily in olfaction, taste,
immunity, and reproduction. The most radical
example is the olfactory receptors expressed
in the nasal epithelium. Mouse has about
1000 genes and 200 pseudogenes, while we
have about 300 genes and 700 pseudogenes.
These pseudogenes have decayed due to lack
of selection for function, that is, they have
accumulated in-frame stop codons and
7. Other examples include the p450
cytochromes involved in detoxification,
which are considerably expanded in mouse,
perhaps dealing with a wider variety of toxins
mice encounter in all the plants they eat, or
removing odorants (right).
8. The genome paper compared ~12,000 clearly orthologous genes for their Ka/Ks ratios. Ks is
around 0.60, that is, about two-thirds of the synonymous sites have changed, while the Ka rate is
generally low at 0.07, since these orthologs average 80% amino acid identity. But, examining
mouse-specific gene family expansions revealed high Ka rates (0.87) for many paralogous
relationships, showing that new duplicates evolve rapidly, under positive or relaxed selection.
9. To examine gene structure they used ~1300
orthologs for which the entire gene structure is
confidently known, e.g. from cDNAs. Gene
structures, that is numbers of exons and introns,
are overwhelmingly the same, with just 22
instances of an intron difference in roughly
10,000 exon/intron comparisons. These 22 are
all intron losses, so intron gain seems to have
come to a complete halt in mammals.
10. They also generated the same kind of figure
of percent DNA identity for the “average”
gene as for the Drosophila comparisons (Fig a).
As expected, exons are well conserved at about
85% DNA identity, while introns are like noncoding flanking DNA at about 67% (similar to
the Ks value). Notice the spikes of identity for
the ATG start codon, the intron boundaries, and
the polyadenylation signal (AATAAA) at the
end of the 3’UTR. More detailed analysis
shows the neutrality of third codon positions
(b) and the consensus intron splice site
conservation (c, d).
11. The mouse has also allowed us to understand the long AT-rich gene deserts better. It turns
out that the regulatory regions of these few genes can be huge, e.g. in this example of the
human DAC locus which encodes a protein expressed in many tissues and involved in
development of brain, limbs, and sensory organs. The coding exons plus introns is 430 kb, but the
upstream potential regulatory region is 900 kb, with another 1300 kb downstream to the next
gene, with lots of conservation in mouse. Comparisons with Xenopus frogs and three fish
(Zebrafish, Tetraodon, and Takifugu) reveal just 32 highly conserved regions in the promoter,
first intron, and downstream of the coding exons. 7 of 9 of these candidate enhancers, when
tested for function upstream of a “minimal” heatshock promoter in lacZ reporter constructs in
transgenic mice, give appropriate partial expression patterns. So this gene is really 2 Mbp long.
12. Recall that around 600,000 50-200bp pieces of the mouse and human genome could be well
aligned, and this is roughly 5% of the genome. But only about 225,000 of these are coding
exons, comprising about 1.5% of the genome. The 29-mammal paper we considered on Monday
reveals what more can be learned about the other 3.5% of this 5% of conserved sequence. For the
purpose of the exam, I want you to master the highlights from the abstract of that paper. As
noted near the end of the discussion, this kind of comparative analysis is the only way to identify
all the conserved functional regions of our genome. A similar point is made below. You might ask
what the remaining non-conserved, non-tranposon 50% of our genome is doing - we don’t know!
13. Here are the last paragraphs from the public mouse genome sequencing consortium paper.
Comparative genome analysis is perhaps the most powerful tool for understanding biological
function. Its power lies in the fact that evolution's crucible is a far more sensitive instrument than
any other available to modern experimental science: a functional alteration that diminishes a
mammal's fitness by one part in 10,000 is undetectable at the laboratory bench, but is lethal from
the standpoint of evolution. Comparative analysis of genomes should thus make it possible to
discern, by virtue of evolutionary conservation, biological features that would otherwise escape
our notice. In this way, it will play a crucial role in our understanding of the human genome and
thereby help lay the foundation for biomedicine in the twenty-first century.
The initial sequence of the mouse genome reported here is merely a first step in this intellectual
programme. The sequencing of many additional mammalian and other vertebrate genomes will
be needed to extract the full information hidden within our chromosomes (e.g. the 29 mammal
paper, and the 10,000 vertebrate genomes project). As we begin to understand the common
elements shared among species, it will become possible to approach the even harder challenge of
identifying and understanding the functional differences that make each species unique.
14. Large comparative studies are now being performed
across all these mammals, beginning to fulfill the promise
of the mouse genome paper. Here’s just one example of the
possible insights, concerning involvement of genes in
short versus long age. They coded each mammal with a
genome sequence according to their lifespan in captivity,
from short in red, to medium in purple, to long in blue.
They then compared ~10,000 well-behaved orthologous
genes/proteins from these 25 species, looking for ones that
showed an unusual pattern, exemplified in the figure. This
particular amino acid in a particular protein shows repeated
changes in short lived species (changes are inferred to have
occurred on branches with red dots). In contrast, this amino
acids is almost always S, with one conservative change to
T in medium- and long-lived species, implying that being S
is important for old age. They performed this analysis
across a total of ~5 million amino acids, and then classified
the proteins with this pattern by GO categories.
Surprisingly, instead of highlighting such GO categories as
DNA repair or oxidative metabolism, the major categories
were fatty acid synthesis and vitamin C binding, requiring
contortions about how saturated fatty acids absorb reactive
oxidative products by becoming unsaturated?? There will
be many surprises as these analyses become sophisticated.
15. David Clayton will hopefully give you an outline of the chicken genome in his comparison
with his zebra finch genome. Bird genome architecture is very different, with several macrochromosomes and scores of mini- or micro-chromosomes. With a genome around 1.2 Gbp,
largely because of fewer retrotransposons (almost no SINEs, and just one LINE that is not very
active, hence also few retropseudogenes), and intermediate in distance from mammals to fish,
chicken provides an additional testing ground for conserved sequences in mammals. Anything
that is recognizably the same sequence between the two has clearly been maintained by selection
(exons, non-coding RNAs, and regulatory regions).
For both Gene Robinson and David Clayton’s guest lectures next week, they will not provide
handouts, so just take a few notes to make sure you get the high points of their lectures. I will
have their powerpoint files and will ask one short question from each on the exam, focusing on
the highlights. Please make a point of attending these lectures, there is nothing worse for a guest
lecturer than to have only a few students turn up.
16. A frog genome has also been sequenced representing the
amphibians. The obvious frog would be Xenopus laevis, the African
clawed frog (left), which is the major model system used in research,
especially for their eggs which serve as monster cells in which to
express all sorts of receptors , transmitters, and other membrane proteins
for in vitro assays of their ligands and functions. But its genome is huge,
so instead X. tropicalis (right) was sequenced, with a genome around 1.7
Here’s a compilation of the Gprotein coupled receptor (GPCRs)
complement for several genomes.
The “rhodopsin” family includes the
odorant receptors, hence the huge set
of them in mouse, but notice that
Xenopus has a remarkable number
of them too, especially compared to
chicken (which nevertheless have a
lot more than some had expected
since olfaction is not a big deal in
chickens, but it is in some other
birds, like vultures). The last three
species are a puffer fish, amphioxus,
and a sea squirt.
17. Amongst fish, the common models in labs are the zebra fish and the Japanese medaka fish,
but puffer fish of the order Tetraodontiformes have long been known to have the smallest
vertebrate genomes. And the smallest amongst these are the family Tetraodontidae or smooth
puffer fish. Around 1992, Sydney Brenner, having passed along the torch on Caenorhabditis
nematodes, decided to push a particular marine species, originally Fugu, now Takifugu, rubripes
whose genome was estimated at 370 Mbp, or roughly 1/8th of the human genome (below).
Cloning and sampling of parts of the genome revealed that Fugu genes are relatively small and
close together, because their introns and intergenic regions do not have all the transposons and
other junk DNA of the human genome. Combined with the ~450 Myr divergence from human
(below), this genome promised to allow detection of human genes/exons by comparative
methods, so Brenner pushed it as an early “model” vertebrate genome that was doable and
17. The sequence reveals that transposons are only 15% of their genome. Yet annotation
revealed±33,000 genes, with the upper limit estimated at ±40,000 genes. Comparing these with
the human annotation showed that the vast majority, roughly 75%, have good human homologs
(not necessarily orthologs as with mouse). In the reverse comparison, roughly 8000 human genes
did not have a Fugu homolog, suggesting that these are either highly derived or novel in some
way, or lost from fish.
18. As indicated by the relative lack of transposons, the intron sizes of these genes (red) are far
shorter than human genes (blue), with only about 500 large introns over 10kb compared with
about 12,000 such large introns in human genes. The number of introns is roughly comparable,
but when individual genes are examined, while most have the same structure, there are many
instances of introns missing
in one or the other species some of these may be gains, but
most are losses (inferred by their
being present in homologous
genes in other organisms, such
as basal chordates or insects).
Introns are thought to be lost by
recombination with a reverse
transcribed cDNA copy of the
mRNA. Notice again that both
species do have a peak of short
introns around 100bp in size.
19. As a result, most genes
are smaller in Fugu (right),
however some are similar
sized, and a wacky few are
actually larger than their
human homologs, which
shows the idiosyncrasy of
molecular biology. This
example below is a 176 kb
gene that encodes a RNAbinding protein - all human
and fly genes in this family
are less than 50 kb.
However, notice the very
compact flanking genes.
20. So why does Fugu seem to have
more genes than mammals? The
answer seems once again to be yet
another whole-genome duplication
in the ancestor of all ray-finned bony
fish or teleosts, a tetraploidization.
A major signal of this is that Fugu
has 7 HOX complexes. The genome
of the zebra fish also has 7 HOX
complexes, however one different
complex in each fish seems to be
missing, so the ancestral teleost
fish had 8 complexes, compared
with 4 in other vertebrates. Similar
results are observed for many gene
families, but clearly the only 50%
greater numbers of Fugu versus
mammalian genes means that many
of the duplicated genes have been
lost from Fugu. (P is puffer fish, Z is
zebra fish, M is mouse, where the
four complexes are A, B, C, D)
21. There is a rough draft of a shark genome, plus these authors sequenced large numbers of
lamprey ESTs (unfortunately both the lamprey and hagfish genomes are repeat-rich and large,
hence have yet to be sequenced). Using these data, they attempted to determine the relative
timing of the 2R events, the two polyploidizations at the base of all vertebrates that led to our
having four HOX complexes, etc., and while they cannot be certain about this, it is clear that the
shark is post-2R, and their best estimates are that one polyploidization occurred before agnathans,
and one after the agnathans (assuming agnathans are monophyletic, which seems to be true).
Hence the image below summarizing the story.
22. Finally, the sequencing of the
genomes of multiple basal chordates
allows us to see another clear signal of
the two polyploidization events in that
for each of 17 hypothetical ancestral
chordate chromosomes there were
four vertebrate chromosomes (a-d in
top image), the pieces of which are
now spread around in humans (e.g. for
ancestral chordate linkage group 8,
below). Obviously there were also
many gene losses amongst these
duplicates, and this previously made
the story murky, except for the HOX
complex (although they also have
internal gene losses from each