Novel Peptide Identification using ESTs and Genomic Sequence
Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park
Sample Preparation for Peptide Identification
Enzymatic Digest and Fractionation
Mass Spectrometer
Sample
Ionizer
• MALDI
• Electro-Spray Ionization (ESI)
Mass Analyzer
• Time-Of-Flight (TOF)
• Quadrupole
• Ion-Trap
Detector
• Electron Multiplier (EM)
Single Stage MS
[Figure: MS spectrum, x-axis m/z]
Tandem Mass Spectrometry (MS/MS)
[Figure: MS spectrum with precursor selection, x-axis m/z]
Tandem Mass Spectrometry (MS/MS)
Precursor selection + collision induced dissociation (CID)
[Figure: MS spectrum and resulting MS/MS spectrum, x-axis m/z]
Peptide Identification
• For each (likely) peptide sequence
1. Compute fragment masses
2. Compare with spectrum
3. Retain those that match well
• Peptide sequences from protein sequence databases
• Swiss-Prot, IPI, NCBI’s nr, ...
• Automated, high-throughput peptide identification in complex mixtures
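The three identification steps above can be sketched in a few lines. This is a minimal illustration, not how any particular search engine scores spectra: it assumes singly charged b- and y-ions only, uses monoisotopic residue masses, and scores a candidate by the number of fragment m/z values that fall within a tolerance of some observed peak. The function names `fragment_mz`, `count_matches`, and `rank_candidates`, and the default tolerance of 0.5, are hypothetical choices made for this example.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_mz(peptide):
    """Step 1: singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)          # N-terminal prefix
        y_ions.append(sum(masses[i:]) + WATER + PROTON)  # C-terminal suffix
    return b_ions, y_ions


def count_matches(peptide, peaks, tolerance=0.5):
    """Step 2: count fragments lying within `tolerance` of an observed peak."""
    b_ions, y_ions = fragment_mz(peptide)
    return sum(
        any(abs(mz - peak) <= tolerance for peak in peaks)
        for mz in b_ions + y_ions
    )


def rank_candidates(candidates, peaks, tolerance=0.5):
    """Step 3: order candidate peptides by match count, best first."""
    scored = [(count_matches(p, peaks, tolerance), p) for p in candidates]
    return sorted(scored, reverse=True)
```

A caller would pass the observed peak m/z values of one MS/MS spectrum together with the candidate peptides drawn from a sequence database; real search engines use considerably more careful scoring than a shared-peak count.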
What goes missing?
• Known coding SNPs
• Novel coding mutations
• Alternative splicing isoforms
• Alternative translation start-sites
• Microexons
• Alternative translation frames
Why should we care?
• Alternative splicing is the norm!
• Only 20-25K human genes
• Each gene makes many proteins
• Proteins have clinical implications
• Biomarker discovery
• Evidence for SNPs and alternative splicing stops with transcription
• Genomic assays, ESTs, mRNA sequence.
• Little hard evidence for translation start sites
Novel Splice Isoform
Novel Frame
Novel Mutation
Ala2→Pro associated with familial amyloid polyneuropathy
Genomic Peptide Sequences
• Genomic DNA
• Exons & introns, 6 frames, large (3Gb → 6Gb)
• ESTs
• No introns, 6 frames, large (4Gb → 8Gb)
• Used by gene, protein, and alternative splicing annotation pipelines
• Highly redundant, nucleotide error rate ~ 1%
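The "6 frames" in the bullets above refer to translating a nucleotide sequence in three reading frames on each strand, which is why the search space doubles in size. A minimal sketch, assuming the standard genetic code and plain uppercase or lowercase DNA strings; the function names are hypothetical, and codons containing ambiguous bases are rendered as `X`:

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return the three forward frames followed by the three reverse frames."""
    dna = dna.upper()
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames
```

Each nucleotide sequence therefore contributes six candidate amino-acid strings, most of which are not real protein sequence.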
Compressed EST Database
• Six-frame translation of all ESTs
• Optionally, ESTs that map to a gene
• Eliminate ORFs < 30 amino-acids
• Amino-acid 30-mers
• Observed in at least two ESTs
• Represent AA 30-mers in C3 FASTA database
• Complete, Correct, Compact
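The filtering steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used for the database: an "ORF" is taken to be a stop-to-stop stretch of a translated frame, stop codons are written as `*`, and the input `est_frames` is a hypothetical mapping from an EST identifier to its already translated frames.

```python
from collections import defaultdict


def orfs(protein_frame, min_length=30):
    """Split a translated frame at stops ('*') and keep the long stretches."""
    return [orf for orf in protein_frame.split("*") if len(orf) >= min_length]


def shared_kmers(est_frames, k=30, min_orf=30, min_ests=2):
    """Return amino-acid k-mers observed in at least `min_ests` distinct ESTs."""
    ests_with_kmer = defaultdict(set)
    for est_id, frames in est_frames.items():
        for frame in frames:
            for orf in orfs(frame, min_length=min_orf):
                for i in range(len(orf) - k + 1):
                    ests_with_kmer[orf[i:i + k]].add(est_id)
    return {
        kmer for kmer, ests in ests_with_kmer.items() if len(ests) >= min_ests
    }
```

Counting distinct ESTs rather than raw occurrences means a k-mer repeated inside a single EST is not mistaken for independent support.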
SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
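As a rough sketch of the construction for the three example sequences: the code below assumes the usual sequencing-by-hybridization convention in which each k-mer is an edge from its (k-1)-mer prefix to its (k-1)-mer suffix. The slide does not state k, so k = 4 is an assumption, and `sbh_graph` is a hypothetical name.

```python
from collections import Counter, defaultdict


def sbh_graph(sequences, k):
    """Build an SBH-style graph: nodes are (k-1)-mers, edges are k-mers.

    Returns the occurrence count of each k-mer edge and the node adjacency.
    """
    edge_counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            edge_counts[seq[i:i + k]] += 1
    adjacency = defaultdict(set)
    for kmer in edge_counts:
        adjacency[kmer[:-1]].add(kmer[1:])
    return edge_counts, dict(adjacency)


edge_counts, adjacency = sbh_graph(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4)
```

Each input sequence corresponds to a walk through this graph, and shared k-mers are stored once with a count instead of once per sequence.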
Compressed SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
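One plausible way to compress such a graph is to merge every chain of k-mer edges that passes through nodes with exactly one incoming and one outgoing edge, so that each compressed edge carries a longer sequence. The sketch below assumes that definition and k = 4; neither is stated on the slide, and the names `kmers_of` and `compress_sbh_graph` are hypothetical.

```python
def kmers_of(sequences, k):
    """Collect the distinct k-mers of a set of sequences."""
    return {seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)}


def compress_sbh_graph(kmers):
    """Merge chains of k-mer edges into one sequence per compressed edge."""
    out_edges, in_degree = {}, {}
    for kmer in sorted(kmers):
        out_edges.setdefault(kmer[:-1], []).append(kmer)
        in_degree[kmer[1:]] = in_degree.get(kmer[1:], 0) + 1

    def is_internal(node):
        # Exactly one k-mer enters and exactly one k-mer leaves this node.
        return in_degree.get(node, 0) == 1 and len(out_edges.get(node, [])) == 1

    unused = set(kmers)

    def walk(first):
        # Follow the chain that starts with k-mer `first`.
        unused.discard(first)
        path, node = first, first[1:]
        while is_internal(node) and out_edges[node][0] in unused:
            nxt = out_edges[node][0]
            unused.discard(nxt)
            path += nxt[-1]
            node = nxt[1:]
        return path

    paths = []
    for node in sorted(out_edges):
        if not is_internal(node):
            paths.extend(walk(kmer) for kmer in out_edges[node])
    while unused:  # whatever is left lies on isolated cycles
        paths.append(walk(min(unused)))
    return paths


compressed = compress_sbh_graph(
    kmers_of(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4)
)
```

Every k-mer of the input appears in exactly one compressed edge, so the k-mer set is preserved while the number of edges shrinks.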
Sequence Databases & cSBH-graphs
• Original sequences correspond to paths
ACDEFGI, ACDEFACG, DEFGEFGI
Sequence Databases & cSBH-graphs
• All k-mers represented by an edge have the same count
[Figure: cSBH-graph with edge counts 1, 2, 2, 1, 2]
cSBH-graphs
• Quickly determine the k-mers that occur twice
[Figure: cSBH-graph with edge counts 2, 2, 1, 2]
Compressed SBH-graph
[Figure: cSBH-graph with edge counts 2, 2, 1, 2; sequence ACDEFGI]
Compressed EST Database
• Gene-centric compressed EST peptide sequence database
• 20,774 sequence entries
• ~8 Gb vs 223 Mb
• ~35-fold compression
• 22 hours becomes 15 minutes
• E-values improve by similar factor!
• Makes routine EST searching feasible
• Search ESTs instead of IPI?
Conclusions
• Peptides identify more than just proteins
• Compressed peptide sequence databases make routine EST searching feasible
• cSBH-graph + edge counts + C2/C3 enumeration algorithms
• Minimal FASTA representation of k-mer sets
Collaborators
• Chau-Wen Tseng, Xue Wu
• Computer Science
• Catherine Fenselau, Crystal Harvey
• Biochemistry
• Calibrant Biosystems
• Thanks to PeptideAtlas, X!Tandem