Gene Set Enrichment and Splicing Detection using Spectral Counting
Nathan Edwards
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center
Outline
• Systems Biology
• Gene Sets & Functional Enrichment
• Balls in Urns
• Proteomics
• MS/MS and Peptide ID
• Quantitation and Spectrum Counting
• Differential Protein Abundance
• Detecting Splicing and Isoforms
2
Systems Biology
High-Throughput Experiments
Mathematical Models
Knowledge Databases
3
Systems Biology
High-Throughput Experiments
molecular biology ↕ phenotype
• Sequencing
• Microarrays
• Proteomics
• Metabolomics
Mathematical Models
Knowledge Databases
4
Systems Biology
High-Throughput Experiments
Mathematical Models
Knowledge Databases
molecular biology ↕ biology
• UniProt
• OMIM
• KEGG
5
Systems Biology
High-Throughput Experiments
Mathematical Models
phenotype ↕ biology
• Software
• Statistics
• Algorithms
Knowledge Databases
6
Systems Biology
High-Throughput Experiments
molecular biology ↕ phenotype
• Sequencing
• Microarrays
• Proteomics
• Metabolomics
Mathematical Models
phenotype ↕ biology
• Software
• Statistics
• Algorithms
Knowledge Databases
molecular biology ↕ biology
• UniProt
• OMIM
• KEGG
7
Gene Expression Analysis
• Differential expression via:
• Structured experiments
• Transcript measurements
• Statistics
• But now what?
8
Gene Expression Analysis
Hengel et al. J Immunol. 2003.
• Structured experiment:
• CD4+/L-selectin- T-cells, vs
• CD4+/L-selectin+ T-cells
• Affymetrix Human Genome U95A Array
• Processing & Statistics
• MAS 4.0, t-Tests, FDR filtering, …
• 164 probe identifiers for upregulated genes.
9
Gene Expression Analysis
34529_AT
34623_AT
34529_AT
34249_AT
32469_AT
33027_AT
967_G_AT
34546_AT
33922_AT
33685_AT
37166_AT
38816_AT 679_AT 37105_AT
36378_AT 35648_AT 33979_AT
1372_AT 38646_S_AT 35896_AT
40317_AT 32413_AT 33530_AT
34720_AT 36317_AT 31987_AT
35439_AT 36421_AT 966_AT
31525_S_AT 38236_AT 34618_AT
31512_AT 40959_AT 38604_AT
40790_AT 35595_AT 33963_AT
35566_F_AT 33684_AT 36436_AT
34453_AT 1645_AT
39469_S_AT
10
Gene Expression Analysis
Probe ID     Gene
1112_g_at    neural cell adhesion molecule 1
1331_s_at    tumor necrosis factor receptor superfamily, member 25
1355_g_at    neurotrophic tyrosine kinase, receptor, type 2
1372_at      tumor necrosis factor, alpha-induced protein 6
1391_s_at    cytochrome P450, family 4, subfamily A, polypeptide 11
1403_s_at    chemokine (C-C motif) ligand 5
1419_g_at    nitric oxide synthase 2, inducible
1575_at      ATP-binding cassette, sub-family B (MDR/TAP), member 1
1645_at      KiSS-1 metastasis-suppressor
1786_at      c-mer proto-oncogene tyrosine kinase
1855_at      fibroblast growth factor 3 (murine mammary tumor virus integration site (v-int-2) oncogene homolog)
1890_at      growth differentiation factor 15
…            …
11
Gene Set Enrichment
• Candidate genes are “special” with respect
to the experiment structure (phenotype)
• Are they special with respect to general
biological knowledge?
• Are the candidate genes related?
• Can we filter out the noise?
• Can we expose associated genes?
• Which genes' changes are linked to the
experimental structure / phenotype?
12
Gene Sets
• Genes may be related in many ways:
• Same pathway, similar function, cellular location
• Cytoband, identified in previous study, etc.
• Define gene sets for relatedness
• GO Biological Process
• GO Molecular Function
• GO Cellular Component
• KEGG Pathway, BioCarta Pathway
• Biological knowledge databases
13
Gene Set Enrichment
14
Drawing Balls from Urns
1000 Balls, 900 Red, 100 Blue.
17
Drawing Balls from Urns
100 balls drawn at random: how many red? How many blue?
18
Drawing Balls from Urns
How surprising is 5, 10, 15, 20, … blue?
19
Drawing Balls from Urns
How surprising is 30, 50, 70, … blue?
20
Drawing Balls from Urns
6 of 155 upregulated genes have
"oxygen binding" GO annotation!
All human genes (one ball = 25 genes); blue is oxygen binding.
21
How surprised should we be?
• Classic problem in probability theory
• How well do the observed counts match the
expected counts?
• Several closely related statistical tests
are applied:
• Fisher's exact test
• Hypergeometric test
• Chi-squared (χ²) test
• p-value measures "surprise".
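The urn question can be made concrete with a short calculation. The sketch below is an illustration, not the speaker's code: it computes the hypergeometric tail probability, the chance of drawing at least a given number of blue balls, for the 1000-ball urn (100 blue, 100 drawn). The function name `hypergeom_tail` is an assumption introduced here.

```python
from math import comb

def hypergeom_tail(total, blue, drawn, observed):
    """P(at least `observed` blue balls) when `drawn` balls are taken
    without replacement from an urn of `total` balls, `blue` of them blue."""
    denom = comb(total, drawn)
    upper = min(blue, drawn)
    numer = sum(comb(blue, k) * comb(total - blue, drawn - k)
                for k in range(observed, upper + 1))
    return numer / denom

# Urn from the slides: 1000 balls, 100 blue, 100 drawn at random.
expected_blue = 100 * 100 / 1000   # 10 blue balls expected on average
for k in (5, 10, 15, 20, 30):
    # The tail probability shrinks as the observed blue count grows.
    print(k, hypergeom_tail(1000, 100, 100, k))
```

The same function applies to gene sets: the urn is the background gene list, blue balls are genes carrying the annotation, and the draw is the candidate gene list.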
22
Proteomics
• Proteins are the machines that drive
much of biology
• Genes are merely the recipe
• The direct characterization of
proteins en masse.
• What proteins are present?
• How much of each protein is present?
• Which proteins change in abundance?
23
Sample Preparation for Tandem Mass Spectrometry
Enzymatic Digest and Fractionation
24
Single Stage MS
MS
25
Tandem Mass Spectrometry
(MS/MS)
MS/MS
26
Peptide Fragmentation
b ions   Residue   y ions
88       S         1166
145      G         1080
292      F         1022
405      L         875
534      E         762
663      E         633
778      D         504
907      E         389
1020     L         260
1166     K         147
[Figure: MS/MS spectrum, % Intensity vs. m/z (0 to 1000), with labeled peaks b3 to b9 and y2 to y9]
27
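The b and y ion ladders for the peptide SGFLEEDELK can be recomputed from residue masses. The sketch below assumes singly charged ions and standard monoisotopic masses; the function names are introduced here for illustration. Its values agree with the slide's integer labels to within about 1 Da. The exception is the last b entry, where the slide lists 1166 (the mass of the intact protonated peptide) and the prefix sum gives about 1148.5.

```python
# Monoisotopic residue masses (Da) for the amino acids in SGFLEEDELK.
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02146, "F": 147.06841, "L": 113.08406,
    "E": 129.04259, "D": 115.02694, "K": 128.09496,
}
PROTON = 1.00728   # mass of a proton
WATER = 18.01056   # mass of H2O

def b_ions(peptide):
    """Singly charged b-ion m/z values: N-terminal prefix masses plus a proton."""
    total, out = 0.0, []
    for aa in peptide:
        total += RESIDUE_MASS[aa]
        out.append(total + PROTON)
    return out

def y_ions(peptide):
    """Singly charged y-ion m/z values: C-terminal suffix masses plus water and a proton."""
    total, out = 0.0, []
    for aa in reversed(peptide):
        total += RESIDUE_MASS[aa]
        out.append(total + WATER + PROTON)
    return out

peptide = "SGFLEEDELK"
b = b_ions(peptide)   # b1 .. b10
y = y_ions(peptide)   # y1 .. y10
n = len(peptide)
for i, aa in enumerate(peptide):
    # Each row pairs b(i+1) with the complementary y(n-i), as on the slide.
    print(f"b{i + 1:<2} {b[i]:8.2f}  {aa}  y{n - i:<2} {y[n - 1 - i]:8.2f}")
```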
LC-MS/MS
• Powerful combination of liquid
chromatography (LC) and
tandem mass spectrometry (MS/MS)
• Automatically collect 100k MS/MS
spectra in an afternoon
• Tens of thousands of peptide-spectrum
assignments
• Thousands of proteins identified
28
Spectral Counting
• Abundant proteins are more likely to
be identified:
• Selection (by the instrument) for
fragmentation is based on intensity
• More abundant ions are more likely to
fragment in an informative manner
• A protein's peptide identification count
(number of spectra) can be used as a crude
abundance measurement.
• Easy, cheap (relative) protein quantitation
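Spectral counting amounts to tallying identified spectra per protein. A minimal sketch follows, using made-up peptide-spectrum assignments; the scan ids, peptides other than SGFLEEDELK, protein names, and the `spectral_counts` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical peptide-spectrum assignments: (spectrum id, peptide, protein).
psms = [
    ("scan_0001", "SGFLEEDELK", "protein_A"),
    ("scan_0002", "SGFLEEDELK", "protein_A"),
    ("scan_0003", "LDEAGK", "protein_A"),
    ("scan_0004", "VTPEELR", "protein_B"),
]

def spectral_counts(assignments):
    """Count identified spectra per protein (repeat identifications count again)."""
    return Counter(protein for _spectrum, _peptide, protein in assignments)

counts = spectral_counts(psms)
print(counts.most_common())
```

Real data would also need a rule for peptides shared between proteins, which this sketch ignores.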
29
Differential Spectral Counts
• Spectral counts are too crude for
classical (microarray) statistics.
• Fold change, t-tests, …
• However, we expect "similar" spectral
counts when the protein abundance is
unchanged.
• Recast as drawing balls from urns.
30
HER2/Neu Mouse Model of
Breast Cancer
• Paulovich, et al. JPR, 2007
• Study of normal and tumor mammary
tissue by LC-MS/MS
• 1.4 million MS/MS spectra
• Peptide-spectrum assignments
• Normal samples (Nn): 161,286 (49.7%)
• Tumor samples (Nt): 163,068 (50.3%)
• 4270 proteins identified in total
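With the totals above, the urn model for one protein can be sketched as follows. This is an illustration under assumptions: all assigned spectra form the urn, tumor spectra are the "blue" balls, and a protein's spectra are the draw. The function name and the example counts are hypothetical, and the slides do not state that this exact one-sided test was the one used in the study.

```python
from math import comb

N_NORMAL = 161_286   # peptide-spectrum assignments from normal samples (from the slide)
N_TUMOR = 163_068    # peptide-spectrum assignments from tumor samples (from the slide)

def tumor_enrichment_p(tumor_count, normal_count,
                       n_tumor=N_TUMOR, n_normal=N_NORMAL):
    """One-sided p-value: chance of at least `tumor_count` tumor spectra among a
    protein's spectra if they were drawn at random from all assigned spectra."""
    protein_total = tumor_count + normal_count
    denom = comb(n_tumor + n_normal, protein_total)
    numer = sum(comb(n_tumor, k) * comb(n_normal, protein_total - k)
                for k in range(tumor_count, protein_total + 1))
    return numer / denom

# Hypothetical protein counts (not taken from the study).
print(tumor_enrichment_p(40, 10))   # skewed toward tumor: small p-value
print(tumor_enrichment_p(25, 25))   # roughly balanced: unremarkable p-value
```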
31
Drawing Balls from Urns
Protein                                  Spectral counts   p-value
Plastin-2 (Lcp1)                         827, 102          2.437E-123
Osteopontin (Spp1)                       334, 19           2.444E-62
Hypoxia up-regulated protein 1 (Hyou1)   200, 7            1.437E-40
[Figure: urns labeled All Tumor Spectra and All Normal Spectra]
32
Functional Enrichment
• 374 proteins with "significantly"
increased abundance in tumor tissue
• Use 4270 proteins as background!
• DAVID gene set enrichment:
• Protein translation
• RNA binding, splicing
33
Differential Spectral Counting
• Assumptions of the formal tests
(Fisher exact, χ²) are violated, so
• p-values can be misleading (too small)
• Use label permutation tests to compute
empirical p-values. SLOW!
• Collapse spectral counts to protein sets
(GO terms) directly:
• Potential to observe more subtle spectral
count differences
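A label permutation test can be sketched as below. The slides do not specify the test statistic, so this example assumes a simple one (absolute difference between summed tumor and summed normal counts) and uses hypothetical per-sample counts for one protein. It also assumes samples of comparable depth, with no normalization.

```python
import random

def abs_sum_difference(counts, labels):
    """Assumed test statistic: |total tumor count - total normal count|."""
    tumor = sum(c for c, lab in zip(counts, labels) if lab == "tumor")
    normal = sum(c for c, lab in zip(counts, labels) if lab == "normal")
    return abs(tumor - normal)

def permutation_p_value(counts, labels, n_perm=5000, seed=0):
    """Empirical p-value: fraction of label shufflings whose statistic is
    at least as extreme as the observed one (with a +1 correction)."""
    rng = random.Random(seed)
    observed = abs_sum_difference(counts, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs_sum_difference(counts, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical per-sample spectral counts for one protein.
counts = [50, 48, 52, 55, 5, 7, 4, 6]
labels = ["tumor"] * 4 + ["normal"] * 4
print(permutation_p_value(counts, labels))
```

The cost is one pass over the data per permutation per protein, which is why the approach is slow for thousands of proteins.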
34
Unannotated Splice
Isoform
35
Halobacterium sp. NRC-1
ORF: GdhA1
• K-score E-value vs PepArML @ 10% FDR
• Many peptides inconsistent with annotated
translation start site of NP_279651
[Figure: identified peptides along the ORF, axis from 0 to 440]
37
What if there is no
"smoking gun" peptide…
38
PKM2 in Peptide Atlas
[Figure: axes labeled experiments and peptides]
41
What if there is no
"smoking gun" peptide…
42
Nascent polypeptide-associated
complex subunit alpha
• Long form is "muscle-specific"
• Exon 3 is missing from short form
• Peptide identifications provide evidence
for long form only
• 9 peptides are specific to long form
• 6 peptides are found in both isoforms
• Urn with balls of 15 different colors
• p-value of observed spectral counts: 7.3E-8
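The slides do not spell out the test behind this p-value. One possible reading, sketched here under that assumption, treats every spectrum as a ball colored by the peptide it identifies and asks whether the per-peptide counts in one group of experiments look like a random draw from the pooled counts of two groups. The function names, the chi-square statistic, and the three-color example counts are all hypothetical; the same code handles 15 colors.

```python
import random
from collections import Counter

def chi_square(group_a, group_b):
    """Pearson chi-square statistic for two rows of per-peptide counts.
    Assumes both groups contain at least one spectrum."""
    n_a, n_b = sum(group_a), sum(group_b)
    stat = 0.0
    for a, b in zip(group_a, group_b):
        pooled = a + b
        if pooled == 0:
            continue   # peptide never observed: contributes nothing
        exp_a = pooled * n_a / (n_a + n_b)
        exp_b = pooled * n_b / (n_a + n_b)
        stat += (a - exp_a) ** 2 / exp_a + (b - exp_b) ** 2 / exp_b
    return stat

def urn_p_value(group_a, group_b, n_sim=2000, seed=0):
    """Empirical p-value from repeatedly redrawing group A from the pooled urn."""
    rng = random.Random(seed)
    observed = chi_square(group_a, group_b)
    # One ball per spectrum, colored by peptide index.
    urn = [color for color, (a, b) in enumerate(zip(group_a, group_b))
           for _ in range(a + b)]
    n_colors, n_a = len(group_a), sum(group_a)
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(urn)
        drawn, rest = Counter(urn[:n_a]), Counter(urn[n_a:])
        sim_a = [drawn[c] for c in range(n_colors)]
        sim_b = [rest[c] for c in range(n_colors)]
        if chi_square(sim_a, sim_b) >= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)

# Hypothetical counts: the third peptide is missing from group A.
print(urn_p_value([30, 30, 0], [10, 10, 40]))
```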
43
Nascent polypeptide-associated
complex subunit alpha
44
Pyruvate kinase isozymes M1/M2
• Exon "substitution" changes sequence in
the middle of the protein
• Peptide identifications provide evidence
for both isoforms
• 3 peptides are specific to isoform 1
• 5 peptides are specific to isoform 2
• Urn with balls of 63 colors for isoform 1
• p-value of observed spectral counts: 2.46E-05
45
Pyruvate kinase isozymes M1/M2
46
Summary
• Systems biology requires:
• Experiments, Databases, Models
• Informaticians and Disease Experts
• Functional Enrichment:
• Quickly navigate knowledge databases using
experiment derived genes
• Classical probability experiment: Balls & Urns
• How surprised should you be?
• Still require domain expert to pick out gems
47
Summary
• Proteomics:
• High-throughput protein comparison
• Proteome "sample" is identified
• Crude spectral count quantitation
• Differential protein abundance:
• Use Balls & Urns to find significant changes
• Apply functional enrichment tools
• Splicing detection:
• Perturbed peptide spectral counts provide
evidence for splicing.
• Evaluate using Balls & Urns
48