Proc. Natl. Acad. Sci. USA 105: 21034-21038 (2008).
Discovery and revision of
Arabidopsis genes
by proteogenomics
Natalie E. Castellana, Samuel H. Payne, Zhouxin
Shen, Mario Stanke,* Vineet Bafna, and Steven P.
Briggs
University of California San Diego,
*Institute for Microbiology and Genetics,
Göttingen, Germany
Limitations of gene annotation
• Based on evidence of transcripts
• Depends on gene-finding/protein-prediction
algorithms.
• How do we define genes?
• Models suffer from errors in reading frame and exon
definition.
• Rare transcripts? Noise?
• Arabidopsis is the best annotated plant genome and
other plant genomes are annotated relative to
Arabidopsis.
Types of alternative splicing
What did Castellana et al. do to detect gene
model errors?
• Isolated Arabidopsis proteins from different tissues.
• Analyzed tryptic peptides by Tandem Mass
Spectrometry.
• Determined sequences for 144,079 distinct peptides.
• Confirmed gene models for 40% (12,769) of
annotated genes (assuming gene total of 31,922).
• 18,024 novel peptides were found, suggesting 13% of
the proteome was missing or incorrect.
• They added or corrected 1,473 genes/proteins, leaving
an estimated 1 to 4% of protein-coding genes unidentified.
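The percentages quoted above can be re-derived from the counts on the slide. The short sketch below only repeats that arithmetic; the variable names are illustrative, and reading the 13% figure as novel peptides divided by all distinct peptides is an assumption about how it was obtained.

```python
# Counts quoted on the slide (Castellana et al. 2008).
distinct_peptides = 144_079
novel_peptides = 18_024
confirmed_genes = 12_769
assumed_gene_total = 31_922


def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# Share of annotated genes whose models were confirmed by peptides.
confirmed_pct = percent(confirmed_genes, assumed_gene_total)

# Share of distinct peptides that did not match the annotation
# (assumed here to be the basis of the "13%" estimate).
novel_pct = percent(novel_peptides, distinct_peptides)

print(f"confirmed gene models: {confirmed_pct:.1f}%")
print(f"novel peptides:        {novel_pct:.1f}%")
```

Both values round to the figures on the slide (40% and 13%).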
Proteins
• Protein extracts of four Arabidopsis organs (leaf,
root, flower, silique) and cell culture MM2d.
• Phosphoproteins were enriched from MM2d using
TiO2.
• Sodium orthovanadate (Na3VO4) was used as a
phosphatase inhibitor.
• Cysteines were reduced and alkylated.
• Digested with trypsin.
• Separated by high-resolution 3D-LC (RP1, SCX, RP2)
in 45 runs, producing 144,079 tryptic peptides.
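To illustrate what "digested with trypsin" means for the sequences later searched, here is a minimal in silico digest. It assumes the textbook cleavage rule (cut after K or R unless the next residue is P) and ignores missed cleavages; the function name and the example sequence are hypothetical and not taken from the paper.

```python
def tryptic_digest(protein, min_length=1):
    """Split a protein sequence into fully cleaved tryptic peptides.

    Assumed rule: cleave C-terminal to K or R, except when the
    following residue is P. Missed cleavages are not modeled.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        at_end = i == len(protein) - 1
        cleaves = residue in "KR" and not at_end and protein[i + 1] != "P"
        if cleaves or at_end:
            peptide = protein[start:i + 1]
            if len(peptide) >= min_length:
                peptides.append(peptide)
            start = i + 1
    return peptides


# Made-up sequence; the R at position 10 is followed by P, so it is not cut.
print(tryptic_digest("MKTAYIAKQRPLVEK"))
```

Real search pipelines usually also allow one or two missed cleavages, which this sketch leaves out for brevity.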
Mass Spectrometry (MS) From Wikipedia.
Ionized molecules or molecule fragments are measured
by their mass-to-charge ratios (m/z):
1) the components of the sample are ionized, for example
by an electron beam, which results in the formation of
charged particles (ions);
2) the ions are directed into electric and/or magnetic
fields;
3) the mass-to-charge ratio of the particles is computed
from their motion as they transit through the
electromagnetic fields;
4) the ions, which in step 3) were sorted according
to m/z, are detected.
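The mass-to-charge ratio that the instrument measures can be worked out for a peptide ion with a few lines of code. The sketch below assumes protonated ions of the form [M + zH]z+ (as is typical for electrospray) and standard monoisotopic residue masses; modifications such as cysteine alkylation are not modeled, and the function names are illustrative.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O added for the free N- and C-termini
PROTON = 1.007276   # mass of one proton


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def mass_to_charge(sequence, charge):
    """m/z of the [M + zH]z+ ion for the given charge state z."""
    return (peptide_mass(sequence) + charge * PROTON) / charge


# Made-up tryptic peptide, shown at charge states 1+ and 2+.
for z in (1, 2):
    print(z, round(mass_to_charge("TAYIAK", z), 4))
```

Higher charge states of the same peptide appear at lower m/z, which is why the charge has to be known or inferred before a mass can be assigned.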
Mass Spectrometers consist of three modules:
1) An ion source, which can convert gas phase sample molecules
into ions (or, in the case of electrospray ionization, move ions that
exist in solution into the gas phase);
2) a mass analyzer, which sorts the ions by their masses by applying
electromagnetic fields; and
3) a detector, which measures the value of an indicator quantity and
thus provides data for calculating the abundances of each ion present.
A quadrupole time-of-flight hybrid tandem mass
spectrometer.
Multiple stages of mass analysis
separation can be accomplished with
MS steps separated in space or time.
In tandem mass spectrometry in space,
the elements are physically separated.
These elements can be sectors,
transmission quadrupoles, or time-of-flight
analyzers.
ESI is electrospray ionization.
MALDI is matrix-assisted laser
desorption/ionization.
Work flow
Castellana N. E. et al. PNAS (2008) 105:21034-21038
©2008 by National Academy of Sciences
Acquisition of Spectra
• Peptides charged by electrospray ionization.
• LTQ linear ion trap tandem mass spectrometry.
• 21 million spectra were acquired. Data are archived
in Tranche (http://tranche.proteomecommons.org).
• Spectra were searched against three reference
databases: TAIR 7, a six frame translation of the
genome, and ab initio gene predictions using
AUGUSTUS and exon prediction.
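A six-frame translation turns a genomic sequence into the six protein strings (three reading frames on each strand) against which spectra can be searched without relying on any gene model. The sketch below shows the idea with the standard genetic code; it is not the authors' pipeline, the function names are illustrative, and the toy sequence is made up.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Map frame labels (+1..+3, -1..-3) to translated sequences."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frame_translation("ATGGCCTAA").items():
    print(label, protein)
```

Stop codons appear as "*"; a search database would typically split each frame at stops and keep only the stretches long enough to contain a peptide.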
Number of assigned spectra, distinct peptides, and
proteins in different samples and organs.
Baerenfaller et al. (2008) Science 320: 938-941.
Plant tissue            Spectra   Distinct peptides   Proteins   Avg. Mol. Mass (kD)
Differentiated organs   465,836   64,219              10,902     54.6
Roots                   71,516    27,546              6,125      55.0
Roots 10 days           38,476    20,301              5,159      55.7
Roots 23 days           33,040    16,984              4,466      54.3
Leaves                  80,186    20,417              4,853      57.5
Cotyledons              39,419    13,628              3,665      58.2
Juvenile leaves         40,767    14,437              3,892      57.8
Flowers                 147,650   33,192              7,040      57.4
Flower buds             54,588    19,467              5,104      58.5
Open flowers            57,861    20,205              5,215      59.0
Carpels                 35,201    13,393              3,946      56.7
Siliques                79,589    23,054              5,779      54.6
Seeds                   86,895    13,901              3,789      54.7
Cell culture            324,345   49,842              8,698      57.3
Dark                    149,051   34,551              6,547      59.7
Light                   143,583   32,656              6,474      59.8
Light; small            31,711    15,318              4,472      43.2
Total                   790,181   86,456              13,029     54.7
TAIR7                   -         -                   27,029     45.9
65% of all peptides were detected in only one organ; 1.3% were identified in all organs.
Some Peptide Bookkeeping
Total peptides                                                144,079
Peptides in TAIR 7 annotation                                 126,055
Peptides not in TAIR                                          18,024
Peptides not in TAIR but uniquely located in the genome       16,348
New intergenic “clusters”                                     1,765 (genes)
Former noncoding pseudogenes                                  561 genes (31%)
Never recognized as genes before due to inadequate support    331 genes (20%)
Uniquely identified by peptides                               198 genes
Fig. S1. Discovery Curve, showing the number of distinct peptides matching to
TAIR7 recovered as a function of the number of annotated spectra. The
discovery curve is separated to show the contribution of each individual
dataset.
Novel gene discovery
(A) A cluster of 13 uniquely located peptides that do not overlap a current gene model (Chr3). The prediction
track shows the single-exon gene model produced by AUGUSTUS.
(B) The predicted sequence shows strong homology to a thylakoid lumen family protein
(sp|P82658|TL19_ARATH). It also shows strong similarity to proteins in both grapevine (emb|CAO40861.1,
a hypothetical gene) and rice (Os08g0504500, a cDNA-derived gene).
Castellana N. E. et al. PNAS 2008;105:21034-21038
Intergenic Regions
64% of intergenic clusters overlap annotated
pseudogenes or transposons.
Annotated pseudogenes may be incorrectly truncated,
and have missing exons.
Transposons may contain protein-coding genes
unrelated to transposon activity (gene hitch-hiking).
A large number (7,442) of small ORFs have been
found as transcripts from intergenic regions*. 155 of
these have predicted peptides.
*Hanada et al. (2007) Genome Research 17:632-640.
Peptides overlapping a predicted transposable element gene
Five peptides overlap an annotated transposable element gene. The
inferred protein is 56% identical to a ubiquitin like protease.
Castellana N. E. et al. PNAS 2008;105:21034-21038
Gene refinement: new exons, boundary
change, exon skipping, modified translation start
and stop sites.
A majority are novel exons: 60% are within introns,
and 40% are in UTRs.
26 cases may actually be a single exon.
Exon extension and shortening are equally frequent.
AUGUSTUS using the peptide evidence predicts
altered transcripts in 695 genes.
In 130 cases, peptide variation indicates new
isoforms.
Refined Gene Model
Four novel peptides map to the 5′ UTR and the first exon of a protein kinase gene.
Castellana N. E. et al. PNAS 2008;105:21034-21038
New gene models from identified peptides
Baerenfaller et al. (2008) Science 320: 938-941.
Take home lessons
MS is a powerful adjunct to genomics and transcriptomics.
More precise definition of coding genes.
Proteomics is becoming more quantitative and less
expensive.
MS can provide absolute protein quantitation.
Likely to play an increasing role in “omic” research.
Proteomics people will want more respect.
References
• Katja Baerenfaller, Jonas Grossmann, Monica A. Grobei, Roger Hull, Matthias
Hirsch-Hoffmann, Shaul Yalovsky, Philip Zimmermann, Ueli Grossniklaus, Wilhelm
Gruissem, and Sacha Baginsky (2008). Genome-scale proteomics reveals Arabidopsis
thaliana gene models and proteome dynamics. Science 320: 938-941.
• Stephen Tanner, Zhouxin Shen, Julio Ng, Liliana Florea, Roderic Guigó, Steven
Briggs, and Vineet Bafna (2007). Improving gene annotation using peptide mass
spectrometry. Genome Res. 17: 231-239.
• Kousuke Hanada, Xu Zhang, Justin O. Borevitz, Wen-Hsiung Li,
and Shin-Han Shiu (2007). A large number of novel coding small open reading
frames in the intergenic regions of the Arabidopsis thaliana genome are
transcribed and/or under purifying selection. Genome Res. 17: 632-640.