Presentation Title Goes Right Here
Download
Report
Transcript Presentation Title Goes Right Here
Proteomic Characterization of
Alternative Splicing and Coding
Polymorphism
Nathan Edwards
Center for Bioinformatics and Computational
Biology
University of Maryland, College Park
Why don’t we see more novel
peptides?
Tandem mass spectrometry doesn’t
discriminate against novel peptides...
...but protein sequence databases do!
Searching traditional protein sequence
databases biases the results towards
well-understood protein isoforms!
What goes missing?
Known coding SNPs
Novel coding mutations
Alternative splicing isoforms
Alternative translation start-sites
Microexons
Alternative translation frames
Why should we care?
Alternative splicing is the norm!
• Only 20-25K human genes
• Each gene makes many proteins
Proteins have clinical implications
• Biomarker discovery
Evidence for SNPs and alternative splicing
stops with transcription
• Genomic assays, ESTs, mRNA sequence.
• Little hard evidence for translation start site
Novel Splice Isoform
Human Jurkat leukemia cell-line
• Lipid-raft extraction protocol, targeting T cells
• von Haller, et al. MCP 2003.
LIME1 gene:
• LCK interacting transmembrane adaptor 1
LCK gene:
• Leukocyte-specific protein tyrosine kinase
• Proto-oncogene
• Chromosomal aberration involving LCK in leukemias.
Multiple significant peptide identifications
Novel Splice Isoform
Novel Splice Isoform
Novel Mutation
HUPO Plasma Proteome Project
• Pooled samples from 10 male & 10 female
healthy Chinese subjects
• Plasma/EDTA sample protocol
• Li, et al. Proteomics 2005. (Lab 29)
TTR gene
• Transthyretin (pre-albumin)
• Defects in TTR are a cause of amyloidosis.
• Familial amyloidotic polyneuropathy
• late-onset, dominant inheritance
Novel Mutation
Ala2→Pro associated with familial amyloid polyneuropathy
Novel Mutation
Searching Expressed Sequence
Tags (ESTs)
Pros
No introns!
Primary splicing
evidence for
annotation pipelines
Evidence for dbSNP
Often derived from
clinical cancer
samples
Cons
No frame
Large (8Gb)
“Untrusted” by
annotation pipelines
Highly redundant
Nucleotide error
rate ~ 1%
Compressed EST Peptide
Sequence Database
For all ESTs mapped to a UniGene gene:
•
•
•
•
Six-frame translation
Eliminate ORFs < 30 amino-acids
Eliminate amino-acid 30-mers observed once
Compress to C2 FASTA database
• Complete, Correct for amino-acid 30-mers
Gene-centric peptide sequence database:
• Size:
223 Mb vs 8 Gb, 20774 FASTA entries
• Running time: 15 mins vs 22 hours
• E-values:
50-fold reduction
Download:
• http://www.umiacs.umd.edu/~nedwards
Back to the lab...
Current LC/MS/MS workflows identify a
few peptides per protein
• ...not sufficient for protein isoforms
Need to raise the sequence coverage to
(say) 80%
• ...protein separation prior to LC/MS/MS
analysis
Future informatics directions...
Combine results from multiple searches from multiple
engines
Fast, automated triage of “significant false-positive”
peptide identifications
Compressed EST peptide sequence database for other
species
• Mouse, Rat, Zebrafish, Chicken, Cow, A. thaliana, ??
Relational database and web-application infrastructure
• Interactive browser data-grid, flexible web-services export
• Java Applet MS/MS viewers, GFF for Genome Browser
Conclusions
Peptides identify more than just proteins
• Untapped source of disease biomarkers
• Functional vs silencing variants
Compressed peptide sequence databases
make routine EST searching feasible
Statistically significant peptide
identification is only the first step
Acknowledgements
Catherine Fenselau, Steve Swatkoski
• UMCP Biochemistry
Chau-Wen Tseng, Xue Wu
• UMCP Computer Science
Cheng Lee
• Calibrant Biosystems
PeptideAtlas, HUPO PPP, X!Tandem
Funding: NCI