IB404 - 11 - D. melanogaster 3 - 22 Feb
1. The genome of D. pseudoobscura was completed by the Baylor HGSC in 2005, and while
their manuscript focused on analyses of chromosome rearrangements (about 900 were estimated),
it also suggested that the comparison supports ~2,000 new genes, plus refinement of thousands of
others. Thus it is clear that there are many more genes in the fly genome, perhaps ~15,000.
2. Their analyses of Ka/Ks ratios indicated relatively few instances of genes with signatures of
positive selection, but even this congeneric comparison might be too distant to find these.
3. An interesting analysis was a summary of the kinds of change seen in different parts of a
“typical/averaged” gene, taken across all ~10,000 confident 1:1 orthologs identified. Focus on the
top of the green band to understand it - this is effectively the percent nucleotide identity.
C. 50 bp before transcript start site.
D. Entire 5’ UTR aligned at TS.
E. Entire 5’ UTR aligned at ATG.
F. 5’ end of first exon aligned at ATG.
G. 3’ end of coding exons aligned at intron donor site.
H. Intron aligned at donor site.
I. Intron aligned at acceptor site.
J. 5’ end of internal coding exons aligned at intron acceptor site.
K. 3’ end of coding region aligned at STOP codon.
L. 3’ UTR aligned at STOP.
M-O. 3’ UTR and flanking DNA.
P. Genome-wide average.
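The kind of identity profile summarized in that figure can be sketched as follows: for each position relative to a shared landmark (such as the ATG), count the percentage of ortholog pairs in which the two species carry the same nucleotide. The function name, the toy sequences, and the decision to skip gapped positions are illustrative assumptions, not the procedure used in the D. pseudoobscura paper.

```python
def percent_identity_profile(pairs):
    """Per-position percent identity across aligned sequence pairs.

    `pairs` is a list of (seq_a, seq_b) tuples, each already aligned and
    anchored at the same landmark, so index i means the same offset in
    every pair. Positions with a gap ('-') in either sequence are skipped.
    """
    longest = max(len(a) for a, _ in pairs)
    profile = []
    for i in range(longest):
        same = total = 0
        for a, b in pairs:
            if i >= len(a) or i >= len(b):
                continue  # this pair does not extend to offset i
            if a[i] == "-" or b[i] == "-":
                continue
            total += 1
            same += a[i] == b[i]
        profile.append(100.0 * same / total if total else None)
    return profile


# Hypothetical toy data: three ortholog pairs aligned at the start codon.
toy_pairs = [
    ("ATGGCC", "ATGGCT"),
    ("ATGAAA", "ATGAAG"),
    ("ATGTTC", "ATGCTC"),
]
profile = percent_identity_profile(toy_pairs)
print([round(p, 1) for p in profile])
```

In this toy input the lowest identity falls at index 5, a third codon position, which is the sort of pattern expected inside coding regions.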
12 Drosophila in 2007. To refine these analyses along the lines of the ~10 yeast species
comparisons, an additional 10 Drosophila genomes were sequenced. They range from very close
sibling species, like D. simulans (also cosmopolitan) and D. sechellia (restricted to one fruit on
the Seychelles islands in the Indian Ocean), all the way out to three species in the sister subgenus,
confusingly called Drosophila, including D. grimshawi, a representative of the ~800 Hawaiian
species. It is worthwhile to learn most of these species names, as they are intensively studied.
Cosmopolitan human commensal, Out-of-Africa generalist
Specialist in Seychelles islands
Cosmopolitan human commensal, Out-of-Africa generalist
Afro-tropical savannah generalist
West African tropical specialist
Pan-tropical species - Japanese
Western USA, sibling species, favorites of Theodosius Dobzhansky for microadaptation
South/Central American generalist
SW desert specialist on cactus
Larger dark flies, from Asia, but now cosmopolitan with humans
Representative of the ~800 picture-winged Hawaiian flies
These genomes are about the same size (~200 Mbp) and contain similar numbers of genes,
around 15,000-20,000. The analyses focused on patterns of change. The simplest is that there are
around 7,000 1:1 single-copy orthologs and around 5,000 conserved homologs. The remaining
genes are more patchily distributed across species or too rapidly evolving to show convincing
similarities (like my chemoreceptors, of which each has a couple hundred). Note that the category
of “lineage-specific” genes in D. melanogaster is tiny, because the initial annotation was
conservative, while the “patchy homologues” is absent because all comparisons are with D. mel.
Recall that the comparison of D. melanogaster with D. pseudoobscura indicated that there had
been many hundreds of rearrangements between their chromosomes. Here they show this
progression with increasing phylogenetic distance. These are “synteny” plots comparing two
chromosome arms (the left and right arms of chromosome 2 in D. mel., called Muller elements B
and C to make them equivalent across all species, which have different numbering conventions).
Amazingly these are almost all intra-chromosomal rearrangements (mostly inversions), with
precious few inter-chromosomal rearrangements (transpositions and translocations - presumably
because they are more disruptive). Even inversions crossing a centromere are rare (blue lines).
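One simple way to quantify how scrambled a chromosome arm is between two species is to count breakpoints, that is, neighbouring gene pairs in the reference that are no longer neighbours in the same relative orientation in the other species. The sketch below assumes genes are numbered by their order in the reference and carry a negative sign when inverted; the function name and gene orders are hypothetical, and this is not the method used for the synteny plots.

```python
def count_breakpoints(order):
    """Count breakpoints in a signed gene order relative to 1, 2, ..., n.

    `order` lists genes as they appear in the second species, numbered by
    their position in the reference; a negative sign marks an inverted gene.
    An adjacency (a, b) is conserved when b == a + 1, which also covers the
    inverted case such as (-3, -2). The ends are framed with 0 and n + 1.
    """
    framed = [0] + list(order) + [len(order) + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if b - a != 1)


reference = [1, 2, 3, 4, 5, 6, 7, 8]
one_inversion = [1, 2, -5, -4, -3, 6, 7, 8]  # genes 3..5 inverted as a block

print(count_breakpoints(reference))      # 0
print(count_breakpoints(one_inversion))  # 2
```

Because a single inversion creates at most two breakpoints, half the breakpoint count gives a rough lower bound on the number of inversions separating two gene orders.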
They attempted to look for positively selected genes using the phylogeny and multiple species
comparisons to improve power of analysis. However there are two major caveats. First, these
species evolve too rapidly to include all of them, so only the six melanogaster group species
with D. ananassae as root could be used. Second, few genes actually showed Ka/Ks ratios above 1,
with, of course, the vast majority being far less than 1. This is because even if there are some
positively selected amino acid changes in a protein, they will be swamped by all the other
conservative positions.
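A small arithmetic sketch shows why a gene-wide Ka/Ks rarely exceeds 1 even when some codons are under positive selection. It treats the gene-wide value as a codon-weighted average of per-codon ratios, which is a simplification, and the codon counts and ratios are invented for illustration.

```python
def gene_wide_omega(site_classes):
    """Codon-weighted average Ka/Ks given (number_of_codons, omega) classes."""
    total_codons = sum(n for n, _ in site_classes)
    return sum(n * omega for n, omega in site_classes) / total_codons


# Hypothetical protein of 300 codons: 10 codons under positive selection
# (omega = 3) and 290 under purifying selection (omega = 0.1).
classes = [(10, 3.0), (290, 0.1)]
omega = gene_wide_omega(classes)
print(round(omega, 3))  # 0.197
```

Even with ten strongly selected codons, the averaged ratio stays far below 1, which is why whole-gene Ka/Ks is a conservative test.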
This figure sorts the 8,510 1:1 orthologs in these six species by GO category, showing more
rapid evolution in “Defense response” at the top, as well as “Unknown” and “Other biological
process”. Presumably the latter include many environmentally relevant genes that we don’t yet
know much about.
By undertaking detailed studies of the patterns of conservation and divergence across subsets or
all of these species they could identify evolutionary signatures allowing recognition of refined
features of these genomes, using again D. melanogaster as reference:
1. ~150 new protein genes recognized by patterns of synonymous codon positions changing
while non-synonymous codon positions generally stay the same, prevalence of conservative
amino acid changes (e.g. I-L-M), plus conservation of reading frame (indels are multiples of
three).
2. ~500 new exons of existing genes, especially alternatively-spliced and small exons.
3. ~300 spurious gene predictions in D. melanogaster with the reverse kind of evidence, that is,
frameshifting indels in other species and changes in all three codon positions equally frequent.
4. Many unusual gene structures, the most remarkable being ~150 instances of conserved codons
after a stop codon, which they show result from read-through of some stop codons.
5. ~300 new candidate non-coding RNA genes, by virtue of conservation of folding of the RNA.
6. 35 new microRNAs to add to the 75 known, by conserved hairpin structures.
7. Many pre-transcriptional regulatory motifs (enhancers and silencers) with high confidence.
8. Post-transcriptional regulatory motifs, e.g. miRNA-binding regions in 3’ UTRs of transcripts.
Here are actual examples of A. protein coding exon conservation or lack thereof, B. conservation
of base-pairing in folded portions of a non-coding RNA, C. similarly for microRNA hairpin.
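Two of the protein-coding signatures in the list above (indels in multiples of three, and changes concentrated at synonymous codon positions) can be checked on a pairwise alignment with a few lines of code. This is a minimal sketch under stated assumptions: the reference sequence is in frame and ungapped, third codon positions stand in loosely for synonymous sites, and the function names and sequences are hypothetical rather than taken from the 12-genomes analysis.

```python
def indel_lengths(aligned_seq):
    """Lengths of each run of gap characters ('-') in an aligned sequence."""
    lengths, run = [], 0
    for ch in aligned_seq:
        if ch == "-":
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths


def frame_preserving(aligned_seq):
    """True when every indel is a multiple of three nucleotides long."""
    return all(n % 3 == 0 for n in indel_lengths(aligned_seq))


def changes_by_codon_position(ref, other):
    """Count mismatches at codon positions 1, 2 and 3 of the reference.

    Assumes `ref` starts in frame and contains no gaps; gapped columns in
    `other` are ignored.
    """
    counts = [0, 0, 0]
    for i, (a, b) in enumerate(zip(ref, other)):
        if b != "-" and a != b:
            counts[i % 3] += 1
    return counts


# Hypothetical alignment: codons ATG GCT AAA CTG GAA in the reference.
ref = "ATGGCTAAACTGGAA"
coding_like = "ATGGCC---TTGGAG"   # one 3-nt deletion, changes mostly at position 3
spurious_like = "ATGGC-TAAACTGGA"  # 1-nt deletion shifts the reading frame

print(frame_preserving(coding_like), changes_by_codon_position(ref, coding_like))
print(frame_preserving(spurious_like))
```

A real exon looks like the first case (frame kept, changes skewed toward the third position), while a spurious prediction looks like the second, with frameshifting indels and changes spread evenly across positions.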
Scaling of comparative genomics power. Finally, given such huge sets of analyses with 12
species of varying phylogenetic distance, they were able to ask which types of comparisons work
best for which features of genomes. The three plots below attempt to ask how pairwise and
multiple species comparisons of different phylogenetic distance or total tree branch length,
respectively, yield known ncRNAs, miRNAs, and regulatory motifs. Clearly pairwise
comparisons in blue are less powerful than multiple species comparisons, more so for ncRNAs
and less so for miRNAs (apparently because the latter are so highly conserved they are easier to
find). Notice that very close comparisons are essentially useless for motif discovery, presumably
because there are not enough background changes for conserved motifs to stand out against. They
also looked at exon discovery and found that for long exons even close pairwise comparisons are
fine, but for short exons distant and multiple comparisons are needed. 9 more genomes have been
sequenced as of today.
Flies versus vertebrates. These two schematics show the levels of molecular divergence
between these 12 fly species, compared with vertebrates back to fish. The measure is essentially
synonymous changes, which are presumably evolving close to neutral rates. The top line is
pairwise comparisons, showing that comparisons across just the melanogaster species subgroup
span half the distance across all placental mammals, comparisons out to D. ananassae are
equivalent to placental versus marsupial or even monotreme, while comparisons across the two
subgenera are similar to those across all tetrapods. Going
out to the next fly genus would be equivalent to all vertebrates, which is roughly 50 Myr versus
500 Myr, thus flies evolved molecularly ~10X faster, as indicated in that schematic at the end of
the C. elegans 2 lecture.
The second line is multi-species comparisons, showing that including all 12 flies is like including
all 20 species of mammals out to the monotremes. Taking this further in the next two lectures,
comparing orders of endopterygote or metamorphosing insects is equivalent to going through all
chordates and indeed all deuterostomes. Thus insects evolve roughly 10X faster molecularly.
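The ~10X figure is simple arithmetic: if two clades reach the same neutral divergence in different amounts of time, the ratio of their rates is the inverse ratio of those times. The sketch below uses the rough 50 Myr and 500 Myr values quoted above; the function name and the arbitrary divergence unit are assumptions for illustration.

```python
def relative_rate(divergence_a, myr_a, divergence_b, myr_b):
    """Ratio of per-Myr divergence rates of clade A relative to clade B."""
    return (divergence_a * myr_b) / (divergence_b * myr_a)


d = 1.0  # same divergence in both clades, arbitrary units (it cancels)
fold = relative_rate(d, 50, d, 500)
print(fold)  # 10.0
```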
modENCODE 2010. An exhaustive effort was made to confirm and extend these inferences by
experimental work, called the modENCODE project for Model Organism Encyclopedia of DNA
Elements, following a similar study of 1% of the human genome. It was done for both D.
melanogaster and C. elegans. It comprised exhaustive studies of chromatin structures, plus
epigenetics and histone modifications, DNA replication, and RNA transcription. It succeeded in tripling the
number of nucleotides in the genome for which a role is known, largely by adding regulatory
regions of various kinds. Just as a few examples, they added another ~2000 protein-coding and
non-coding genes, modified or added ~53,000 exons in existing gene models, recognized ~2000
non-coding transcripts that might represent additional genes, and extensively documented all the
short- and micro-RNAs being produced from this genome. Because the functions of these non-coding
transcripts are seldom known, they are commonly called the “dark matter” of genomes.
modENCODE totals. Bars show the contribution of sub-regions of the genome; the line above shows
cumulative coverage of the genome. While coding exons (left) occupy 23% of the unique
non-repetitive regions of the genome (blue), and 34% of the region of the genome conserved
across these 12 Drosophila species and mosquitoes (red), when you add in all the other regions
they document, including 5’ and 3’ UTRs, non-coding RNAs, TF and other protein-bound regions,
chromatin domains, and introns, 90% of the genome is accounted for!
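A cumulative coverage line of this kind can be computed by adding annotation tracks one at a time and measuring the union of all intervals seen so far, so that overlapping features are not counted twice. The following is a minimal sketch with half-open intervals on a made-up 100 bp genome; the track names, coordinates, and function names are hypothetical and are not modENCODE data or code.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open [start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def cumulative_coverage(tracks, genome_length):
    """Fraction of the genome covered after each track is added in turn."""
    seen, out = [], []
    for name, intervals in tracks:
        seen = merge_intervals(seen + [list(iv) for iv in intervals])
        covered = sum(end - start for start, end in seen)
        out.append((name, covered / genome_length))
    return out


# Hypothetical toy annotation on a 100 bp genome.
tracks = [
    ("coding exons", [(0, 20), (50, 55)]),
    ("UTRs", [(18, 25), (55, 60)]),
    ("introns", [(25, 50)]),
]
coverage = cumulative_coverage(tracks, genome_length=100)
for name, fraction in coverage:
    print(f"{name}: {fraction:.0%}")
```

Taking the union rather than summing track lengths matters because many features overlap, for example a protein-bound region sitting inside an intron.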