genes are islands in a sea of transposons

Download Report

Transcript genes are islands in a sea of transposons

IB404 - 14 - Other plants - March 10
1. The next plant genome was the 450 Mbp genome of rice, Oryza sativa. Several groups
contributed to this effort, including two large companies, Syngenta and Monsanto, who
produced WGS drafts, a WGS draft by a Chinese genome center, and detailed clone-by-clone
efforts by the Japanese. Several conclusions are worth noting:
A. Despite about at least 200 Myr divergence between these two major lineages of angiosperms
(dicots and monocots), there is still considerable microsynteny between their genes.
B. The rice genome is much larger because of many transposon insertions between genes, that
is, genes are islands in a sea of transposons.
C. The rice and other grass/cereal genomes are essentially colinear, being around 70 Myr old,
so rice serves as a model for the others, including maize/corn at 3 Gbp and wheat at 16 Gbp.
2. An enormous effort has gone into sequencing the maize genome, but at ~2.5 Gbp it is nearly
the same size as ours, except even more of it is transposons, and these transposons form long
stretches of seemingly barren genomic landscape. Like rice, but even more so, genes are islands
in this sea of transposons. Various strategies were employed to separate out the gene islands and
sequence only them, but eventually the genome was completed, using old-fashioned clone-byclone approach involving sequencing ~17,000 BACs, and published in 2009. It is 85%
transposons, and the authors predict around 32,000 genes. The transposon numbers are simply
amazing.
3. Like Arabidopsis, rice, and sorghum, the genes are mostly on the chromosome arms of the 10
metacentric chromosomes, with massive pericentromeric regions completely dominated by
transposons and other repeats. The first line is exons, the rest are various transposons.
4. Since these are all reasonably young flowering plants or angiosperms, we expect to see most
genes shared, especially amongst the three grains, but even across to Arabidopsis the number of
shared genes is high, with only a few losses in each lineage, and of course some rapidly
evolving lineage-specific genes. This Venn diagram is simplified by looking only at gene
families, of which there are ~12,000 per genome (recall that animals have ~10,000 gene
families).
5. The soybean (Glycine max)
genome was published in
2010. It also took a long time
because it is relatively large at
around 1 Gbp with tons of
transposon repeats, and again
has a lot of duplicated genes.
The genome is in 20
chromosomes, and extensive
genetic and physical maps
allowed assigning most
scaffolds to these
chromosomes.
In this image of them, genes
are in blue, transposons in
green, yellow, and light blue,
while the upper grey layer is
“none of the above”. Notice
that the genes are mostly at the
ends of the chromosome arms,
with the pericentromeric
transposon repeats making up
most of the genome.
Centromeres are pink.
6. The ~46,000 genes reveal two cycles of polyploidization, estimated by these authors at 13 and
60 Myr ago (which is completely different from the diagram at the end of the last lecture??). In
this histogram, the ~31,000 genes that are paralogous are shown as pairs, with the younger
polyploidization event having a Ks mode of ~0.15 (15% of possible synonymous changes have
occurred), whereas the older event has a Ks mode of ~0.5. The remaining ~15,000 genes are
singletons, presumably because their paralog was lost.
7. One of the most important features of soybeans is their production of lipids, with soybean oil
being one of the major products. They tried to annotate all the genes possibly involved in lipid
metabolism, and came up with 1,157. In every category there are more than in Arabidopsis,
presumably due to retention of gene duplicates after the ~13 MYA polyploidization.
Other analyses include an extensive discussion of the genes implicated in facilitating and
regulating the formation of nodules, the root structures peculiar to legumes which harbor
nitrogen-fixing bacteria.
8. Just to show the complexity of these genomes, here are their pie charts of the many different
kinds of transcription factors found in these two plant genomes (5,673 genes in 63 families for
soybean, fully 12% of their gene catalog). Remember that the MADS family genes are the major
regulators of plant form. They do have homeodomain/homeobox TFs, but they have different
functions in plants. Soybean is on the left, with roughly double the number of each class of TF.
9. Several other plants have been sequenced, including
sorghum, grape, and Populus, and more recently cucumber and
strawberry. Our own Ray Ming in Plant Biology led sequencing
of the papaya genome, starting when he was working in Hawaii
generating transgenic strains resistant to viral infection. They
did a fairly rough draft 3X WGS genome assembly with lots of
gaps and a total size around 400 Mbp.
Unlike all these other plants, it does not appear to have undergone a recent polyploidization, and
hence has only ~13,000 genes. Indeed, this is close to the estimate for what the ancestral
angiosperm genome looked like.
10. The Populus genome is
reminiscent of the soybean
genome, with ~45,000
protein-coding genes, and a
relatively recent wholegenome duplication or
polyploidization event
contributing ~8000 paralog
pairs that persist in this
genome in large
microsyntenic chunks.
Populus is being promoted
as a biofuel.
11. Many plants have large
genomes, especially amongst
the gymnosperms like pines
(15-25 Gbp), so they are
unlikely to be completely
sequenced any time soon.
Instead, large scale
sequencing of cDNAs, plus
massive scale EST projects
are being undertaken,
allowing comparisons across
all plants. Here is a
phylogenomic tree based on
large numbers of proteins
from both genomes and
transcriptomes, rooted with
mosses, ferns, and liverworts
at the bottom. The authors
used the tree and protein
dataset to conclude that
several aspects of microRNA
biology only evolved in the
angiosperms, and might have
contributed to their diversity.