Gene duplication and rearrangement


The evolution of expression patterns
in the Arabidopsis genome
Todd Vision
Department of Biology
University of North Carolina at Chapel Hill
Driving forces in genome evolution
• Proximate vs. ultimate explanations
• Deleterious mutations are frequent and
selection cannot effectively act on all of them
– Substitutions
– Insertions and deletions
– Duplications
– Transpositions
• Cellular processes will be affected by this rain
of mutations
• At the molecular level, we must entertain
ultimate explanations that do not invoke
adaptation
An example: Codon bias
• Genes differ in how often they use
the preferred codon for a given amino acid,
thereby affecting
– Translational efficiency
– Translational accuracy
• The strongest codon bias is typically seen in
short, highly expressed genes under strong
purifying selection
• Realized codon bias is a balance between
selection for preferred codons and a continual
rain of mutations toward unpreferred codons
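One way to make "realized codon bias" concrete is to measure, for a coding sequence, the fraction of codons that are the preferred one among their synonyms. The sketch below is illustrative only: the preferred codons in `PREFERRED` are hypothetical placeholders (real preferred sets are species-specific and are not given on the slides), and only two amino acids are covered.

```python
from collections import Counter

# Hypothetical preferred codons for two amino acids (an assumption for
# illustration; not taken from the talk).
PREFERRED = {"L": "CTC", "K": "AAG"}
SYNONYMS = {
    "L": {"TTA", "TTG", "CTT", "CTC", "CTA", "CTG"},
    "K": {"AAA", "AAG"},
}

def preferred_codon_fraction(cds):
    """Fraction of codons (for amino acids in SYNONYMS) that are preferred."""
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    codons = [cds[i:i + 3].upper() for i in range(0, usable, 3)]
    counts = Counter(codons)
    preferred = total = 0
    for amino_acid, family in SYNONYMS.items():
        total += sum(counts[codon] for codon in family)
        preferred += counts[PREFERRED[amino_acid]]
    return preferred / total if total else float("nan")

# Codons: CTC CTC TTA AAG AAA -> 3 preferred out of 5
fraction = preferred_codon_fraction("CTCCTCTTAAAGAAA")
```

Under this measure, a gene under strong selection for preferred codons would score near 1, while mutation toward unpreferred codons pulls the score down.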
What are the consequences of
mutational rain on the
regulatory networks that
modulate gene expression?
Outline
• Arabidopsis gene expression (MPSS)
• Two evolutionary issues in the evolution
of expression profiles:
– Physical clustering of co-expressed genes
– Divergence of duplicated genes
Digital expression profiling
• “Bar-code” counting raises fewer concerns about
cross-hybridization, probe selection, background
hybridization, etc.
• Serial Analysis of Gene Expression (SAGE)
– Count occurrence of 10-12 bp mRNA signatures
– Long SAGE: 21-22 bp signatures
– Uses conventional sequencing technology
• Massively Parallel Signature Sequencing (MPSS)
– Count occurrence of 17-20 bp mRNA signatures
– Cloning and sequencing is done on microbeads
– Commercialized by Lynx Therapeutics
MPSS library construction
Brenner et al., PNAS 97:1665-70.
[Figure: library construction workflow. Step labels recovered from the figure:]
• Extract mRNA from tissue
• Convert to cDNA
• Cut w/ Sau3A (GATC)
• 5’: add standard primer
• 3’: add unique 32 bp tag and standard primer (added by cloning)
• PCR
• Remove 3’ primer and expose single-stranded unique tag
(digest, 3’ to 5’ exonuclease)
• Anneal to beads coated with unique anti-tag
(32 bp, complementary to tag on mRNA)
• Add linker
• Sort by FACS to remove ‘empty’ beads
The result of the library construction is a
set of microbeads. Each bead contains
many DNA molecules, all derived from the
3’ end of a single transcript.
Beads are loaded in a monolayer on a
microscope slide for the sequencing of
17 – 20 bp from the 5’ end.
MPSS Sequencing
Brenner et al., Nat. Biotech. 18:630-4.
[Figure: sequencing cycle. Labels: adaptors NNNX, NNXN, NXNN, XNNN, each with RS and code X1 to X4; 13 bp; 9 bp]
• Add adaptors
• Sequence by hybridization: 16 cycles for 4 bp
• Digest with Type IIS enzyme to uncover next 4 bases
• Repeat cycle in steps of four bases; overhang is shifted by four
bases in each round
MPSS Sequencing
Each bead provides a signature of 17-20 bp
Tag #     Signature Sequence    # of Beads (Frequency)
1         GATCAATCGGACTTGTC     2
2         GATCGTGCATCAGCAGT     53
3         GATCCGATACAGCTTTG     212
4         GATCTATGGGTATAGTC     349
5         GATCCATCGTTTGGTGC     417
6         GATCCCAGCAAGATAAC     561
7         GATCCTCCGTCTTCACA     672
8         GATCACTTCTCTCATTA     702
9         GATCTACCAGAACTCGG     814
.         .                     .
.         .                     .
30,285    GATCGGACCGATCGACT     2,935
Total # of tags: >1,000,000
Two sets of signatures are generated from each
sample in different reading frames staggered
by two bases
A catalog of signatures in the
Arabidopsis genome
All potential signatures (GATC + 13 bp)
are identified on both strands of the
genomic sequence.
There is approximately one potential signature
every 293 bp on each strand of the genome
A signature is classified according to
its position relative to the 29,084 genes
& pseudogenes in the TIGR annotation
Signatures may not be unique. The
number of ‘hits’ in the genome is
recorded
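The cataloguing step can be sketched as a scan for every GATC site on both strands, keeping the 17-base window (GATC + 13 bp) and tallying how often each signature occurs. This is a minimal illustration, not the pipeline used for the talk; the function names are made up here, and minus-strand start positions are reported relative to the reverse-complemented sequence.

```python
from collections import Counter

SITE = "GATC"    # Sau3A recognition site, as on the slides
SIG_LEN = 17     # GATC + 13 bp

def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def potential_signatures(genome):
    """Yield (strand, start, signature) for each GATC + 13 bp window.

    For the "-" strand, start is an offset into the reverse complement.
    """
    genome = genome.upper()
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        start = seq.find(SITE)
        while start != -1:
            signature = seq[start:start + SIG_LEN]
            if len(signature) == SIG_LEN:     # skip windows cut off by the end
                yield strand, start, signature
            start = seq.find(SITE, start + 1)

def hit_counts(genome):
    """Number of genomic 'hits' for each distinct signature."""
    return Counter(sig for _, _, sig in potential_signatures(genome))

toy = "AAGATCAATCGGACTTGTCAA"
hits = hit_counts(toy + toy)   # the same signature now occurs twice
```

A signature with a hit count of 1 is unique in the genome; anything higher is ambiguous as to which site was expressed.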
“Hits”    At genome    % of total    Random
1         748204       87.407%       845057
2         88392        10.326%       6134
3         11019        1.287%        21
4         3512         0.410%        0
5         1452         0.170%        0
6         874          0.102%        0
7         470          0.055%        0
8         326          0.038%        0
9         237          0.028%        0
10        192          0.022%        0
11        158          0.018%        0
12-20     707          0.083%        0
21-30     247          0.029%        0
31-50     124          0.014%        0
> 50      86           0.010%        0
Total     851,212                    851,212
Classifying signatures
[Figure: gene model with example signature positions. Labels:]
• Typical signatures
• Duplicated: expression may be from other site in genome
• Potential alternative splicing or nested gene
• Anti-sense transcript or nested gene?
• Potential alternative termination
• Potential anti-sense transcript
• Potential un-annotated ORF
Triangles refer to colors used on our web page:
Class 1 - in an exon, same strand as ORF.
Class 2 - within 500 bp after stop codon, same strand as ORF.
Class 3 - anti-sense of ORF (like Class 1, but on opposite strand).
Class 4 - in genome but NOT class 1, 2, 3, 5 or 6.
Class 5 - entirely within intron, same strand.
Class 6 - entirely within intron, anti-sense.
Class 0 - signatures found in the expression libraries but not the genome.
Grey = potential signature NOT expressed
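The class definitions can be read as a small decision procedure. The sketch below classifies one genomic signature against one gene and makes several simplifying assumptions that are not from the slides: the `Gene` record is hypothetical, its exons are treated as coding (so the last exon ends at the stop codon), any overlap with an exon counts as exonic, coordinates are leftmost genomic positions, and secondary classes from overlapping genes are ignored.

```python
from dataclasses import dataclass

@dataclass
class Gene:
    strand: str   # "+" or "-"
    exons: list   # sorted (start, end) pairs, 0-based, end-exclusive

def classify(sig_start, sig_strand, gene, sig_len=17, downstream=500):
    """Return class 1-6 for a signature relative to a single gene."""
    sig_end = sig_start + sig_len
    same_strand = sig_strand == gene.strand
    gene_start, gene_end = gene.exons[0][0], gene.exons[-1][1]

    # Classes 1 and 3: signature overlaps an exon.
    if any(sig_start < end and start < sig_end for start, end in gene.exons):
        return 1 if same_strand else 3
    # Classes 5 and 6: inside the gene span but touching no exon (intron).
    if gene_start <= sig_start and sig_end <= gene_end:
        return 5 if same_strand else 6
    # Class 2: within `downstream` bp after the stop codon, same strand.
    if same_strand:
        if gene.strand == "+" and gene_end <= sig_start < gene_end + downstream:
            return 2
        if gene.strand == "-" and gene_start - downstream < sig_end <= gene_start:
            return 2
    # Class 4: everything else in the genome.
    return 4

gene = Gene("+", [(100, 200), (300, 400)])
classes = [classify(pos, "+", gene) for pos in (120, 220, 450, 1000)]
```

Class 0 is not produced here because it applies to signatures observed in a library that have no genomic position at all.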
Arabidopsis signatures
Based on TIGR annotation (release 3.0, July 2002)
Class                     # in genome    % of total
1 sense exonic            203,174        24.0
2 3’UTR, <500 bp          44,202         5.2
3 anti-sense exonic       197,065        23.3
4 inter-genic             288,109        34.0
5 intronic                60,817         7.2
6 anti-sense intronic     57,845         6.8
TOTAL                     851,212        100.5
355 genes lack potential Class 1 or 2 signatures (undetectable)
On average, there are 8.5 class 1 & 2 signatures per gene
8422 genomic signatures have secondary classes due to overlap or
near overlap of two genes in the TIGR annotation.
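As a check on the arithmetic in this table, the per-class shares and the per-gene average can be recomputed from the counts. The counts and the gene total (29,084) are taken from the slides; the variable names are illustrative, and which denominator was used for the percentages on the slide is not known.

```python
# Signature counts per class, from the table above.
counts = {
    "1 sense exonic": 203_174,
    "2 3'UTR, <500 bp": 44_202,
    "3 anti-sense exonic": 197_065,
    "4 inter-genic": 288_109,
    "5 intronic": 60_817,
    "6 anti-sense intronic": 57_845,
}
N_GENES = 29_084  # genes and pseudogenes in the TIGR annotation

total = sum(counts.values())
# Share of each class relative to the summed counts, in percent.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
# Average number of class 1 and class 2 signatures per annotated gene.
per_gene = (counts["1 sense exonic"] + counts["2 3'UTR, <500 bp"]) / N_GENES

for name, share in shares.items():
    print(f"{name:24s}{share:6.1f}")
print(f"class 1 & 2 per gene: {per_gene:.1f}")
```

The class counts sum to the stated total of 851,212, and the per-gene average comes out at about 8.5, matching the slide.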
Core Arabidopsis MPSS libraries
sequenced by Lynx for Blake Meyers, U. of Delaware
Library    Signatures sequenced    Distinct signatures
Root       3,645,414               48,102
Shoot      2,885,229               53,396
Flower     1,791,460               37,754
Callus     1,963,474               40,903
Silique    2,018,785               38,503
TOTAL      12,304,362              133,377
Genome-wide expression profiling Arabidopsis
[Figure: panels for Chr. I, Chr. II, Chr. III, Chr. IV and Chr. V]
Of the 29,084 gene models, 14,674 match unique, expressed signatures
http://www.dbi.udel.edu/mpss
Query by
• Sequence
• Arabidopsis gene identifier
• chromosomal position
• BAC clone ID
• MPSS signature
• Library comparison
Site includes
• Library and tissue information
• FAQs and help pages
Physical clustering of co-expression
• Caenorhabditis elegans
– Roy et al. (2002) Nature 418, 975
– Lercher et al. (2003) Genome Research 13, 238
• Drosophila melanogaster
– Boutanaev et al. (2002) Nature 420, 666
– Spellman and Rubin (2002) J Biology 1, 5
• Homo sapiens
– Caron et al. (2001) Science 291, 1289
– Lercher et al. (2002) Nature Genetics 31, 180
• Saccharomyces cerevisiae
– Cohen et al. (2000) Nature Genetics 26, 183
– Hurst et al. (2002) Trends in Genetics 18, 604
– Mannila et al. (2002) Bioinformatics 18, 482
• What are the proximate explanations?
– shared cis-regulatory elements
– chromatin packaging, etc.
• What are the ultimate explanations?
– Adaptive: greater transcriptional efficiency/accuracy?
– Maladaptive: mutational rain chipping away at insulators and
other mechanisms that override regional controllers of gene
expression?
Measuring expression distance
[Figure: three axes labelled library 1, library 2 and library 3]
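The slides do not state the distance formula. One plausible reading of the figure is that each gene is a point whose coordinates are its abundances in the libraries, and the expression distance between two genes is the distance between their points. The sketch below assumes a Euclidean distance on log2-transformed abundances with a pseudocount; both choices are assumptions for illustration.

```python
import math

def expression_distance(profile_a, profile_b, pseudocount=1.0):
    """Euclidean distance between log2-transformed abundance profiles.

    Each profile lists one gene's signature abundance in each library,
    in the same library order. The pseudocount avoids log2(0).
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same libraries")
    return math.sqrt(sum(
        (math.log2(a + pseudocount) - math.log2(b + pseudocount)) ** 2
        for a, b in zip(profile_a, profile_b)
    ))

# Two genes measured in three libraries; they differ only in library 1,
# by a factor of two after adding the pseudocount.
d = expression_distance([1, 1, 1], [3, 1, 1])
```

On a log2 scale, a distance of 1 in a single library corresponds to a two-fold difference in abundance.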
Clustering of tissue-specific expression
[Figure: Chromosome 1, colored by tissue: Flower (red), Silique (violet), Leaf (green), Root (blue), Callus (white)]
Statistical tests of coexpression clustering
• Measured median pairwise expression
distance (MPED) in non-overlapping windows
of 20 genes
– Summed unique class 1 and 2 signatures for each
gene
– Only one gene within each tandemly arrayed
family was counted
• Out of 100 shuffles of gene order
– Zero shuffles had as many windows with small
MPED (less than 1.5) as the unshuffled data
– Zero shuffles had as large a variance in MPED
among windows as the unshuffled data
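The shuffle test described above can be sketched as follows. This is a minimal illustration rather than the analysis code behind the slides: profiles are assumed to be already summed per gene and transformed, the distance is plain Euclidean (`math.dist`), and the function names and the fixed random seed are choices made here.

```python
import math
import random
import statistics
from itertools import combinations

def window_mpeds(profiles, window=20):
    """Median pairwise expression distance in each non-overlapping window."""
    mpeds = []
    for i in range(0, len(profiles) - window + 1, window):
        block = profiles[i:i + window]
        mpeds.append(statistics.median(
            math.dist(a, b) for a, b in combinations(block, 2)))
    return mpeds

def shuffle_test(profiles, window=20, threshold=1.5, n_shuffles=100, seed=0):
    """Compare observed gene order with random gene orders.

    Returns (n_small, n_var): how many shuffles had at least as many
    low-MPED windows, and at least as large a variance in MPED among
    windows, as the unshuffled data.
    """
    rng = random.Random(seed)
    observed = window_mpeds(profiles, window)
    obs_small = sum(m < threshold for m in observed)
    obs_var = statistics.pvariance(observed)
    shuffled = list(profiles)
    n_small = n_var = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)                     # randomize gene order
        null = window_mpeds(shuffled, window)
        n_small += int(sum(m < threshold for m in null) >= obs_small)
        n_var += int(statistics.pvariance(null) >= obs_var)
    return n_small, n_var

# Toy chromosome: two perfectly co-expressed blocks of 20 genes each.
toy = [(0.0, 0.0)] * 20 + [(3.0, 4.0)] * 20
observed_mpeds = window_mpeds(toy)
```

A result of zero for both counts, as reported on the slide, means no shuffled gene order looked as clustered as the real one.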
Coexpression in Arabidopsis
Selection and recombination
• In regions of low recombination
– deleterious mutations can hitch-hike to high
frequency along with favorable ones
– favorable mutations are kept at low frequency by
linkage to deleterious ones
• Therefore, the effectiveness of natural
selection is causally related to recombination
rate
• Are clusters more concentrated in regions of
– high recombination (i.e. are they adaptive), or
– low recombination (i.e. are they maladaptive)?
Measuring recombination rate
[Figure: Chromosome 1. Genetic distance (cM) and recombination rate (cM/Mb) against physical distance (Mb)]
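A local recombination rate in cM/Mb is the slope of genetic map position against physical position. The slides do not say how the curve was fitted or smoothed, so the sketch below simply takes the slope between adjacent markers; the marker values are made up for illustration.

```python
def recombination_rates(markers):
    """Slope of genetic vs. physical position between adjacent markers.

    markers: list of (physical_Mb, genetic_cM), sorted by physical position.
    Returns a list of (interval_midpoint_Mb, rate_cM_per_Mb).
    """
    rates = []
    for (p0, g0), (p1, g1) in zip(markers, markers[1:]):
        if p1 > p0:                      # skip markers at the same position
            rates.append(((p0 + p1) / 2, (g1 - g0) / (p1 - p0)))
    return rates

# Hypothetical markers: 4 cM over the first 2 Mb, then 1 cM over the next 2 Mb.
rates = recombination_rates([(0.0, 0.0), (2.0, 4.0), (4.0, 5.0)])
```

With real marker data, adjacent-interval slopes are noisy, so a smoothed or windowed fit would usually be preferred.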
Co-expression is greater in
low recombination regions
[Figure: expression distance against recombination rate (cM/Mb)]
Co-expression clusters
• MPSS data provide evidence for
clusters of co-expression among
non-related genes in Arabidopsis
• Co-expression is greater in regions of
low recombination
• Thus, co-expression clusters may be
maladaptive, at least on average
Divergence of duplicated genes
[Figure; axis label: Age of duplication]
Duplicated genes in Arabidopsis
Modes of gene duplication
• Tandem (unequal crossing-over)
• Dispersed (transposition)
• Segmental (polyploidy)
Divergence of duplicated genes
• All gene families of size 2 in Arabidopsis were
classified as ‘dispersed’, ‘segmental’ or
‘tandem’
• Expression distance was calculated for each pair
• The number of silent (i.e. synonymous)
substitutions per site was calculated for each pair
(as a proxy for age since duplication)
Divergence and mode of duplication
[Figure: expression distance against silent substitutions (per site) x 10, for dispersed, segmental and tandem duplicates]
Divergence of duplicated genes
• Almost all expression divergence occurs during (or
immediately following) duplication
• Initial expression divergence is more extreme for
tandem than dispersed duplicates
• Tandem and dispersed duplicates with the most
divergent expression profiles are quickly lost
• Segmental duplicates plateau at a lower level of
expression divergence than dispersed duplicates
• The average divergence in relative expression level
in each tissue is about 8-fold.
Lessons learned
• Clusters of co-expression in Arabidopsis may
be largely the result of a rain of weakly
deleterious mutations that homogenize the
expression profiles of neighboring genes
• Divergence in expression profile between
duplicated genes is dependent on the nature
of the mutation that gave rise to the
duplication
Thanks!
• UNC Chapel Hill
– Jianhua Hu
• University of Delaware
– Blake Meyers
• NSF Plant Genome Research Program
– DBI-01103267 (TJV)
– DBI-0110528 (BCM)