Transposons, transposable elements, or jumping genes

IB404 - 17 - Human genome 3 – Mar 26
1. Transposons, transposable elements, or jumping genes comprise half of our genome, so it
is necessary that we learn something about them. They come in two major flavors: those that
move by an RNA intermediate, also known as RNA transposons, retrotransposons, or Class I elements,
and those that move as a DNA molecule, also known as DNA transposons or Class II elements.
2. Retrotransposons are the major class in mammalian genomes. They in turn come in three
major kinds. The first are retroviral-like transposons, and as their name implies they have 6-10
kbp genomes resembling those of retroviruses such as HIV. Thus they encode a reverse
transcriptase (RT) that copies their RNA genome into a DNA copy, an integrase that integrates this
DNA copy into our chromosomes, a Gag protein that forms a capsid around the complex of
RNA, RT, and integrase, and a protease that cleaves these other three proteins from a long
precursor protein. There are now anti-HIV drugs that target each of these proteins.
Unlike true retroviruses, which can cross-infect between animals, they do not encode the
envelope protein that is part of the membrane surrounding a virion such as that of HIV. The
coding region is flanked by Long Terminal Repeats (LTRs) of 500-2000 bp, and these LTRs help
define these transposons. Similar transposons are found in most genomes, e.g. the Gypsy and
Copia transposons and many others in Drosophila, and the Ty1 and Ty2 retrotransposons in yeast.
3. Non-LTR retrotransposons are the second major kind, and as their name implies, do not
have the LTRs at their ends. They encode only the reverse transcriptase and an endonuclease.
At their 3’ ends they typically have a long A-rich stretch. Copies of these kinds of transposons
in the genome often have their 5’ ends truncated, because the mechanism by which they are
integrated into the genome is combined
with their reverse-transcription, and if the
latter is interrupted, then they lose their 5’
ends. This figure is for a particular Drosophila
element called R2 that specifically integrates
into the 28S rDNA repeats.
4. Again, non-LTR retrotransposons are found
in most animal genomes, but their prevalence
in the human genome is extraordinary. A
single type, known as LINE1 for Long
Interspersed Nuclear Element 1, has a full
length of about 8 kbp, and makes up about
20% of our DNA. It is not hard to imagine
how a few master copies of such an element,
when abundantly transcribed, could lead to
the integration of many copies throughout the
genome. Our genome appears to have
suffered numerous such events in the past
couple of hundred million years, and movement
of these LINEs is ongoing.
5. The third major kind are the SINEs or Short Interspersed Nuclear Elements. In the human
genome this kind is best represented by the Alu element, so called because the bacterial
restriction enzyme AluI recognizes target sites near the ends of these elements, and in raw digests of
human DNA these elements can be seen as a ~300 bp band. These do not encode any proteins,
but have an A-rich 3’ end. There are more than a million copies in our genome, and they appear
to have been produced from master copies in bouts over the past hundred million years, and again
some are still actively moving in us each generation. Thus some copies are highly diverged from
their consensus, because they were produced from a master copy long ago and have mutated
individually since then, while others are very similar to their consensus, implying that they were
formed relatively recently. As with the LINE1 element, this process is ongoing, and indeed
polymorphic Alu insertions have been discovered in humans, sometimes defining particular
lineages, e.g. Native North Americans.
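The idea of a copy's divergence from its consensus can be made concrete with a small calculation. The following is a minimal Python sketch, assuming the copy and the consensus have already been aligned to the same length, with "-" marking gaps. The function name and the toy sequences are hypothetical illustrations, not real Alu sequences, and real analyses use dedicated alignment tools rather than this simple count.

```python
def percent_divergence(copy: str, consensus: str) -> float:
    """Percent of mismatched positions between two pre-aligned sequences.

    Assumption: both strings are already aligned and of equal length;
    columns where either sequence has a gap ('-') are skipped.
    """
    if len(copy) != len(consensus):
        raise ValueError("sequences must be pre-aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(copy.upper(), consensus.upper()):
        if a == "-" or b == "-":
            continue  # ignore gap columns
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * mismatches / compared


# Toy 20-bp sequences (hypothetical, for illustration only).
consensus = "ACGTACGTACGTACGTACGT"
young_copy = "ACGTACGTACGTACGTACGA"  # 1 difference from the consensus
old_copy = "TCGTACCTACGTAAGTACGA"    # 4 differences from the consensus

print(percent_divergence(young_copy, consensus))  # 5.0
print(percent_divergence(old_copy, consensus))    # 20.0
```

A copy with low divergence would be interpreted as recently formed, and one with high divergence as an old insertion that has accumulated its own mutations.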
6. The Alu element is thought to have arisen as a fusion of two copies derived from the 7SL
non-coding RNA, a component of the signal recognition particle (recall all the pseudogene copies of these
and other non-coding RNAs that carry their RNA Pol III promoter sequences with them). Like
these and other retropseudogenes, Alus are thought to result from the action of the reverse
transcriptase of LINE1 elements in our genome, although if this is the case then Alu has achieved
a particularly efficient way to “parasitize” the LINE1 system. Strangely, LINE1 elements
usually integrate into AT-rich portions of our genome and comprise most of the long AT-rich
deserts, while the relatively GC-rich Alu elements usually integrate into GC-rich regions of the
genome. It is not understood how Alu elements manage to bias the integration preference of the
LINE1 system this way, but Alus are clearly far more abundant in gene-rich regions of the
genome (revisit slide 6, point 15 from last lecture and the associated figure to see this trend).
7. An example of a chromosomal region showing GC and AT isochores. This is the 4p22-4p15
region of chromosome 4. Note the isochores of high GC content enriched in SINEs (Alus)
and genes, contrasted with isochores of low GC content enriched for LINEs but with only a few
genes. These genes in the AT-rich deserts are often very large (not shown here).
8. The DNA-mediated elements are particularly diverse in the human genome, as they are
elsewhere. Most transposons in bacterial genomes are of this kind, and famous ones in
Drosophila include the P element used for transformation and mutagenesis (I did my postdoc in
Madison on it, and then spent the first 15 years here working on mariner transposons). The
various kinds all share a common structure, with a single gene encoding a transposase, flanked by
inverted terminal repeat (ITR) sequences. Each different kind encodes a different kind of
transposase, although many of these are distantly related to each other, while also having
different sequence and length ITRs. For example, the mariner family of transposons I studied
generally are 1.3 kb long, with 30 bp ITRs and a ~1000 bp ORF encoding a ~300 amino acid
transposase. This is one of the smaller kinds, with others reaching several kbp long. The basic
mechanism of transposition is that the transposase protein encoded by each element recognizes
the ITRs of a copy, brings them together, cleaves them from the flanking DNA, cleaves a suitable
target elsewhere in the genome, and inserts the actual DNA molecule of the transposon into the
target. Often the host is fooled into restoring the element at its original position, because it
repairs the excision site by copying from the sister chromatid.
[Figure label: Transposase ORF]
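The defining structural feature described above, a transposase gene flanked by inverted terminal repeats, can be checked computationally: the start of the element should match the reverse complement of its end. The following is a minimal Python sketch that only finds perfect ITRs beginning exactly at the element's ends. The function names and the toy element are hypothetical, and real ITRs are often imperfect, so a practical tool would need to tolerate mismatches.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string containing only A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def terminal_inverted_repeat_length(element: str, max_len: int = 50) -> int:
    """Length of the longest perfect inverted repeat shared by the two ends.

    The first n bases of the element are compared with the first n bases of
    its reverse complement, which is the reverse complement of its last n bases.
    """
    element = element.upper()
    rc = reverse_complement(element)
    limit = min(max_len, len(element) // 2)
    n = 0
    while n < limit and element[n] == rc[n]:
        n += 1
    return n


# Toy element (hypothetical): a 10-bp ITR, an internal region, and the
# reverse complement of the same ITR at the other end.
left_itr = "TTAGGTTGGC"
internal = "GATTACAGATTACAG"
toy_element = left_itr + internal + reverse_complement(left_itr)

print(terminal_inverted_repeat_length(toy_element))  # 10
```

For a mariner-sized element one would expect a value near the 30 bp ITR length mentioned above, allowing for mismatches that this perfect-match sketch does not handle.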
9. Transposons are recognized as belonging to these various classes and kinds based on sequence
similarity and structure; however, that only works for recently formed copies (in the past few
million years). For older copies it is necessary to generate consensus sequences that represent
what they once looked like (up to about 200 Myr ago), while eventually the sequences of
individual copies change so much that it becomes impossible to recognize them as being derived
from a transposon. Roughly 45% of our genome can currently be recognized as being
transposons, but perhaps as much as another 20% is such ancient transposon copies that they can
no longer be recognized as such.
10. This is a summary table of the transposon content of our genome, taken from the public
paper, where they are discussed extensively. Celera’s WGS assembly stumbled on a lot of these
and did not represent them well, leaving gaps with NNNNNNNNNs instead, so the Celera paper did not
discuss them much. After all, they were only after the genes. While the non-autonomous versions of
some transposons are simply internally deleted versions of them that sometimes outnumber the
normal copies, remember that the SINEs are not simply versions of LINEs; rather, they evolved
separately, although they are apparently dependent on LINEs for activity.
Amongst the DNA transposon fossils are one kind of P element, and three kinds of mariners,
plus many other kinds. But as far as we can tell, none of the DNA transposons are still active, so
they truly are molecular fossils, remnants of horizontal transfers from other species.
11. One can break down the copies of transposons into age classes according to their percent
divergence from their consensus, with roughly 4% divergence representing 25 Myr. The
remarkable result is that the youngest DNA transposons in our genome are about 50 Myr old,
while the youngest LTR retrotransposons are about 25 Myr. LINE elements appear to have been
continuously active in our genome, while SINEs show an explosion of activity in the past 100
Myr, although both these classes have also become relatively quiescent in the past 10 Myr.
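The rule of thumb quoted above (roughly 4% divergence per 25 Myr) amounts to a simple linear conversion between divergence and age. A minimal Python sketch follows, assuming a constant rate and no correction for multiple substitutions at the same site; both are simplifying assumptions, and the function names are hypothetical.

```python
# Rule of thumb from the text: roughly 4% divergence corresponds to 25 Myr.
MYR_PER_PERCENT = 25.0 / 4.0  # 6.25 Myr per 1% divergence (assumed linear)


def divergence_to_age_myr(percent_divergence: float) -> float:
    """Approximate age in Myr from percent divergence from the consensus."""
    if percent_divergence < 0:
        raise ValueError("divergence cannot be negative")
    return percent_divergence * MYR_PER_PERCENT


def age_myr_to_divergence(age_myr: float) -> float:
    """Approximate percent divergence expected for a copy of a given age."""
    if age_myr < 0:
        raise ValueError("age cannot be negative")
    return age_myr / MYR_PER_PERCENT


print(divergence_to_age_myr(4.0))   # 25.0
print(age_myr_to_divergence(50.0))  # 8.0
print(age_myr_to_divergence(25.0))  # 4.0
```

Under this approximation, the youngest DNA transposons (about 50 Myr old) would show roughly 8% divergence from their consensus, and the youngest LTR retrotransposons (about 25 Myr old) roughly 4%.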
12. While there is no question that transposons are primarily selfish genetic elements making
copies of themselves at the expense of the host, and most RNAi systems and other host genome
protection systems like the RIP system of Neurospora crassa appear to have evolved as defenses
against them, transposons occasionally become useful to their hosts. A classic example is that
flies do not have the normal short telomeric repeats generated by telomerase; instead, their
telomeres are maintained by the faithful transposition of two kinds of LINEs into them. In our
genome the most famous example is the RAG genes that encode the two recombinases that help
generate the diversity of antibodies in B cells and receptors in T cells of our adaptive immune
system. These appear to have been derived from a transposon perhaps 450 Myr ago when we
were cartilaginous fish.
13. The public paper recognized about 40 additional “domesticated” transposon copies that are
now functional genes in our genome. We worked on several of these, including seven derived
from various Tigger DNA transposons in our genome, but their functions are not known. The
SETMAR gene is particularly interesting as it is a chimeric or fusion gene, formed when an
exon encoding a SET domain (which is involved in methylation of the lysines in histones and
hence the “histone code” that controls which genes are available for transcription) was joined to an exon
encoding a transposase domain from a mariner transposon about 50 Myr ago.