UofU Medical Informatics Bioinformatics Seminar Presentation 9
proteomics
myriad
Bioinformatics Industrial Applications in
High Throughput Proteomics
Alan F. James
Director of Software Development
Myriad Proteomics, Inc., Salt Lake City
What is Proteomics?
• Proteomics refers to the study of the protein
constituents and protein activities of a cell, a
tissue or an organism.
• Proteomics may be seen from several viewpoints:
– Protein Expression
– Protein Interaction (Interactome)
–…
Challenges in Proteomics
• Proteins are all different
- some degrade easily, some are sticky, many require
accessory factors
• Proteins are more complex than DNA
- there are several protein forms per gene
- proteins are post-translationally modified
• There isn’t really ONE proteome in humans
• Proteins change:
• with cell type
• during differentiation
• during development
• in response to stimuli
• with cell cycles
• So which Proteome do you study?
Methods of Analyzing Proteomes
– Expression, Abundance, Distribution
[Figure: normal cell compared with cancer cell]
– Structural Genomics
– Protein-Protein Interaction Analysis
• Yeast two-hybrid system
• Mass spectrometry of protein complexes
[Figure: interaction network with nodes PDIRP5, novel, novel, CASP3, MPO-XYZ, NCALD, OS-9]
Methods of Analyzing Proteomes by
Comprehensive Surveys of Protein-Protein
Interactions
Yeast two-hybrid (Y2H)
• Measures association between two proteins.
• Allows very high throughput.
Mass Spectrometry
Allows identification of the proteins in a complex of
many proteins (2-100) that carry out some cellular
function.
Y2H Background Information:
Gene Activity in Yeast
Yeast transcription factors are composed of a DNA Binding Domain and a
Transcriptional Activation Domain.
The DNA Binding Domain recruits the Activation Domain to the yeast gene,
which allows the yeast gene to be active.
[Figure: a Transcription Factor, made up of a DNA Binding Domain and an Activation Domain, bound at a Yeast Gene and activating it]
Principles of the Yeast Two-Hybrid System
1. The DNA Binding Domain is separated from the Transcriptional Activation Domain
of a transcription factor.
2. Libraries of human proteins are fused to both domains to create “hybrid” proteins.
3. The recruitment of the Activation Domain to the yeast gene is now mediated by
interactions of the human proteins.
[Figure: steps 1 to 3 in diagram form. The DNA Binding Domain is fused to Human Protein X and the Activation Domain to Human Protein Y; when X and Y interact, the Activation Domain is brought to the Yeast Gene and activates it]
Yeast Two-Hybrid Screens: Assay for Interactions
Scenario A: Human Proteins X and Y do not Interact
Bait: DNA Binding Domain fused to Human Protein X
Prey: Activation Domain fused to Human Protein Y
( No Reporter Gene Activity )
Readout: No growth of yeast colonies
Scenario B: Human Proteins X and Z do Interact
Bait: DNA Binding Domain fused to Human Protein X
Prey: Activation Domain fused to Human Protein Z
Readout: Yeast colonies grow
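The readout logic of the two scenarios can be sketched as a toy model. This is only an illustration: the interaction set is made up to match Scenarios A and B, and the function names are hypothetical.

```python
# Toy model of the yeast two-hybrid readout (illustrative only).
# The set of interacting pairs is invented to mirror Scenarios A and B.
INTERACTIONS = {frozenset({"X", "Z"})}


def reporter_active(bait: str, prey: str) -> bool:
    """The reporter gene is active only if the bait and prey proteins interact."""
    return frozenset({bait, prey}) in INTERACTIONS


def readout(bait: str, prey: str) -> str:
    """Translate reporter activity into the colony-growth readout."""
    if reporter_active(bait, prey):
        return "yeast colonies grow"
    return "no growth of yeast colonies"


print(readout("X", "Y"))
print(readout("X", "Z"))
```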
Directed vs. Random Approach
Directed: selecting specific proteins as baits for
Y2H analysis.
Random: using individual baits picked at random
from libraries of baits.
The random approach can be used to rapidly
generate large amounts of interaction data.
Random Two-Hybrid (R2H) Process Overview
1. Library Construction: Produce DNA Binding Domain (BD) and Activation Domain (AD) libraries from cDNA synthesized from mRNA libraries using random primers.
2. Pick BD-Colonies: Put yeast colonies containing BD-hybrid proteins into 96-well culture plates.
3. Mating w/AD-Library: Add yeast containing the AD-hybrid proteins to the 96-well plates with the yeast colonies picked in (2.); allow yeast mating to occur.
4. Selection Plating: Plate yeast matings onto dishes containing selective medium that allows yeast to grow only if the human hybrid proteins interact.
5. Incubation: Allow several days for yeast that contain interacting human proteins to grow.
6. Pick Growing Yeast: Pick yeast colonies containing interacting human proteins (“Positives”) and put them into 96-well culture plates.
7. Amplify Human DNA: Amplify the human DNA that encodes the interacting proteins by PCR.
8. DNA Sequencing: Sequence the amplified DNA and identify the interacting proteins.
Mass Spectrometry
Vital tasks in cells are often performed
by Multi-Protein Complexes (MPC)
Mass Spectrometry
[Figure: workflow. Cell Biology; Gene Cloning (vectors pENTR and pDEST1 to pDEST5); Protein “Baits”; Protein Purification using Handles (Affinity Tags); Protein “Preys”; Mass Spectrometry]
Mass Spectrometry
[Figure: workflow. Cell Biology; cDNA Cloning (vectors pENTR and pDEST1 to pDEST5); Protein “Baits”; Protein Purification using Handles; Protein “Preys”; Mass Spectrometry]
Pulldown Assay
[Figure: pulldown assay. Labels: Bait Protein with Purification Tag; Affinity Beads; Associated Proteins; Non-binding Proteins. Steps: Incubate with cell extract; Complex formation; Elute; Separate proteins; Identify by Mass Spectrometry]
Mass Spectrometry
[Figure: workflow. Cell Biology; cDNA Cloning (vectors pENTR and pDEST1 to pDEST5); Protein “Baits”; Protein Purification using Handles; Protein “Preys”; MPC; Mass Spectrometry]
Mass Spectrometry Procedure
1. Purified protein complex
2. Protein separation
3. Protein digestion
4. Mass Spec. analysis (Mass spectrum)
5. Database Searching (Peptide Mass Fingerprint Search)
6. Protein ID
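A peptide mass fingerprint search can be sketched as follows. This is a minimal illustration under stated assumptions, not the search engine used in the talk: digestion is modeled as trypsin cleavage with no missed cleavages, peaks are taken to be neutral monoisotopic peptide masses, no modifications are considered, and the score is a simple count of matched peaks. The function names and the tiny database are hypothetical.

```python
import re

# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_peptides(seq):
    """Cleave after K or R unless the next residue is P (needs Python 3.7+)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", seq) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def pmf_score(peaks, seq, tol=0.2):
    """Number of observed peaks within tol Da of any theoretical peptide mass."""
    masses = [peptide_mass(p) for p in tryptic_peptides(seq)]
    return sum(any(abs(pk - m) <= tol for m in masses) for pk in peaks)


def best_match(peaks, database, tol=0.2):
    """Name of the database protein with the most matched peaks."""
    return max(database, key=lambda name: pmf_score(peaks, database[name], tol))


# Made-up two-protein database and a peak list derived from the first entry.
database = {"P1": "GKAAR", "P2": "WWK"}
peaks = [peptide_mass("GK"), peptide_mass("AAR")]
print(best_match(peaks, database))
```

Production search engines use statistical scoring rather than a raw match count; the count is used here only to keep the sketch short.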
Summary of Protein-Protein Interaction
Analysis Methods
Random Yeast Two-Hybrid:
Yields sets of binary associations between protein fragments (that may represent protein-protein interactions).
Mass Spectrometry:
Yields sets of n-ary associations among proteins (that may represent protein complexes).
[Figure: example association graphs over proteins A through L]
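The difference between binary and n-ary associations shows up directly in how the data can be stored. The sketch below is one possible representation, not the one used at Myriad: binary associations are unordered pairs, and an n-ary association is a bait plus the set of co-purified proteins. The "spoke" and "matrix" expansions are conventions commonly used to turn a complex into pairs; they are not taken from the slides, and the protein names are placeholders.

```python
from itertools import combinations

# Binary associations (Y2H-style): unordered pairs of proteins.
binary = {frozenset({"A", "B"}), frozenset({"B", "C"})}

# One n-ary association (mass-spec-style): a bait and everything purified with it.
bait = "A"
members = {"A", "B", "C", "D"}


def spoke_pairs(bait, members):
    """Spoke expansion: pair the bait with every other member of the complex."""
    return {frozenset({bait, m}) for m in members if m != bait}


def matrix_pairs(members):
    """Matrix expansion: pair every member with every other member."""
    return {frozenset(pair) for pair in combinations(sorted(members), 2)}


spoke = spoke_pairs(bait, members)
matrix = matrix_pairs(members)
print(len(spoke), len(matrix))
# Pairs supported by both the binary data and the spoke expansion.
print(binary & spoke)
```

The spoke expansion assumes only bait-to-prey contacts, while the matrix expansion assumes every member touches every other, so the two bracket the number of pairwise interactions a complex might contain.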
The Goal: Biological Relevance
[Figure: pathway diagram annotated with protein-protein interactions. Legend: New Protein-Protein Interaction; Known Protein-Protein Interaction; Transduction Pathway; Known Pathway Member; Identified Interactor; Novel Transcript; Traditional “Druggable” Enzyme; Other Enzymes. Pathway labels: fibril formation, deposition; Amyloid Plaque, Neurofibrillary Tangle Formation; Apoptosis]
Underlying Pathway Adapted from http://www.kegg.com
Role of Bioinformatics in Proteomics
[Figure: layered diagram running from Data up through Information to Knowledge; labels listed from top to bottom]
Knowledge
Identification as potential drug target
Identification of participation in disease pathway
Manual/Experimental Data Analysis
Identification of participation in protein complexes
Identification of protein interaction networks
Information
Identification of binary and n-ary interactions
Automated Data Analysis
Identification of Loci/Domains/Proteins
BLAST/PMF Searches
Mass Peak List Determination
Automated Data Reduction
Base Calling
Data Warehousing
Data
Data Collection
LIMS
Data Collection, Analysis, and Interpretation
Biology
Computational Biology
Software Development
Bioinformatics Techniques Used in Proteomics
• Robot programming
• Software engineering
• Database modeling and design
• Data warehouses and Data Marts
• Database federation
• Grid Computing
• Information Visualization
• Graph analysis, graph layout and display
• Hidden Markov Models
• Bayesian networks
• Statistical models
• Signal Processing
• Algorithm development
• …
Objectives of Bioinformatics in Proteomics
1. Automate and manage high-throughput laboratory processes.
2. Retrieve, collect, and store
experimental interaction data.
3. Analyze, reduce, and extend
experimental interaction data.
4. Mine and visualize interaction analysis
results.
Automate and Manage Laboratory Processes
Laboratory Automation
• High-throughput proteomics is not possible without a high degree of laboratory automation.
• Instruments and robotics must interact directly and reliably with LIMS (Laboratory Information Management System).
Automate and Manage Laboratory Processes
Laboratory Information Management System (LIMS)
• High-throughput proteomics is not possible without a sophisticated LIMS.
• The LIMS provides the foundation for all automated data collection, reduction, and analysis.
• Multiple LIMS systems are required (e.g., Y2H, Sequencing, Gene Cloning, Protein Pull-down, Mass Spec., etc.).
• May collect very large amounts of data.
• Fast runtime performance of the LIMS is essential to deal with the high volume of transactions and possible near real-time interactions between the LIMS and robotics and instruments.
• High availability of the LIMS and supporting computer systems is required to support production laboratories and time-critical operations.
• May be one of the most (if not the most) labor-intensive (programming, database management, and system management) and expensive software systems in the enterprise.
Automate and Manage Laboratory Processes
Functions of the Laboratory Information Management System (LIMS)
• Track samples consistently through a protocol so that each sample:
  – Is identified.
  – Is linked to the appropriate results.
  – Is linked to the protocol used to process the sample.
  – Is linked to any related samples, reagents, etc.
  – Can be located physically.
• Manage and enforce the protocol used to process a sample.
• Capture laboratory quality control information and provide displays, reports, statistical analyses, etc. to allow management and quality control of the laboratory.
• Provide interfaces for laboratory personnel, robotics, and instruments to support high-throughput operations.
• Capture results directly from laboratory instruments.
• Provide experimental results in a format suitable for analytical programs.
• Provide the interface between analytical systems and instruments (such as Mass Spectrometers) that require real-time (or near real-time) analysis during operation.
• Manage laboratory personnel work lists, incident alerting, reporting and correction, etc.
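The sample-tracking and protocol-enforcement functions listed above can be sketched with a small record type. This is a hypothetical illustration, not the actual LIMS design: the protocol step names, the `Sample` fields, and the location format are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical protocol; the step names are placeholders.
PROTOCOL = ["pick_colony", "mate", "plate", "pcr", "sequence"]


@dataclass
class Sample:
    sample_id: str                     # the sample is identified
    location: str                      # where it can be found physically
    protocol: list = field(default_factory=lambda: list(PROTOCOL))
    completed: list = field(default_factory=list)
    results: dict = field(default_factory=dict)   # step -> result
    related: list = field(default_factory=list)   # related samples, reagents

    def record_step(self, step, result=None, new_location=None):
        """Record a step only if it is the next one the protocol expects."""
        done = len(self.completed)
        expected = self.protocol[done] if done < len(self.protocol) else None
        if step != expected:
            raise ValueError(
                f"{self.sample_id}: expected {expected!r}, got {step!r}"
            )
        self.completed.append(step)
        if result is not None:
            self.results[step] = result
        if new_location is not None:
            self.location = new_location


sample = Sample("S1", "RA0000055:A1")
sample.record_step("pick_colony")
sample.record_step("mate", new_location="RA0000058:B2")
print(sample.completed, sample.location)
```

Rejecting out-of-order steps at the point of entry is one simple way to enforce a protocol; a production system would also have to handle branching, requeues, and concurrent access.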
Automate and Manage Laboratory Processes
LIMS Architecture
[Figure: LIMS architecture. Components shown: Web-based Management Clients (Servlets, JSP, CGI Script); Web Application Server; Lab Workstations (Java Application); LIMS SERVER (Java Socket Application); Robots or Instruments; LIMS Database(s); LIMS Data Warehouse(s) (ODS); Analysis Databases. Connections are labeled XML and SQL Net.]
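The architecture figure labels several connections as XML. As a rough sketch of what such a message exchange might look like, assuming that instruments and workstations send XML to the LIMS server, the example below builds and parses a result message. The element names (`limsMessage`, `instrument`, `sample`, `status`) are invented for illustration and are not taken from the slides. Python is used here for brevity, although the slide describes Java components.

```python
import xml.etree.ElementTree as ET


def build_result_message(instrument_id, sample_id, status):
    """Serialize a hypothetical instrument-to-LIMS result message as XML."""
    msg = ET.Element("limsMessage", {"type": "result"})
    ET.SubElement(msg, "instrument").text = instrument_id
    ET.SubElement(msg, "sample").text = sample_id
    ET.SubElement(msg, "status").text = status
    return ET.tostring(msg, encoding="unicode")


def parse_result_message(xml_text):
    """Parse the message back into a plain dictionary of tag -> text."""
    msg = ET.fromstring(xml_text)
    return {child.tag: child.text for child in msg}


wire = build_result_message("robot-1", "S1", "ok")
print(wire)
print(parse_result_message(wire))
```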
Collect, Store, and Retrieve Experimental Data
Yeast two-hybrid Data
• Electropherograms for sequence forward and reverse reads
• Sequences and sequence quality scores from base-calling
• Robot/Instrument Operational Parameters
• Quality control data
  – Distributions of positive colonies within a search
  – Distributions of sequencing reaction success/failure within a plate.
Yeast two-hybrid Data Collection Challenges
• Transmission of electropherograms from remote sequencing facility and associated error handling.
• Relating/correlating data received from remote sequencing facility with LIMS data.
• Archival of electropherograms.
• Retrieval of archived electropherograms.
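One of the quality control items above, the distribution of sequencing success or failure within a plate, can be summarized per plate row. The sketch below is a minimal illustration that assumes each read is reported as a well label (such as `A1` or `P24`) and a success flag; that input format is an assumption, not something stated in the slides.

```python
from collections import defaultdict


def success_rate_by_row(reads):
    """Fraction of successful sequencing reactions for each plate row.

    reads: iterable of (well, succeeded) pairs, with wells like 'A1' or 'P24'.
    """
    totals = defaultdict(lambda: [0, 0])  # row letter -> [successes, attempts]
    for well, succeeded in reads:
        row = well[0]
        totals[row][0] += int(succeeded)
        totals[row][1] += 1
    return {row: ok / n for row, (ok, n) in sorted(totals.items())}


reads = [("A1", True), ("A2", False), ("B1", True)]
print(success_rate_by_row(reads))
```

A row or column with an unusually low success rate can point to a pipetting or instrument problem rather than to the individual samples.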
Collect, Store, and Retrieve Experimental Data
Mass Spectrometry Data
• Spectrograms
  – Multiple Instruments (MALDI-TOF, Electrospray/Ion Trap, etc.)
  – Multiple spectrogram types (MS, MS/MS)
  – Individual samples may be analyzed with multiple instruments, mass spectrogram types.
  – False Positive/Contamination Control Sample Spectrograms
• Mass Peak Lists derived from spectrograms
• Mass Spectrometry Instrument Operational Parameters
Mass Spectrometry Data Collection Challenges
• Individual experiments will generate many spectrograms.
• Interfacing with instrument to retrieve spectrograms and mass peak lists.
• Archival of spectrograms and mass peak lists
• Retrieval of archived spectrograms and mass peak lists
Collect, Store, and Retrieve Experimental Data
External Data Sources
• NCBI LocusLink, RefSeq, GenBank, …
• SwissProt, PFAM, …
• Gene Ontology, …
• KEGG, …
• PubMed, Manually curated papers, …
External Data Sources Challenges
• Wide variety of data formats.
• Integrating or federating disparate data sources with internal databases.
• Sometimes questionable quality of data.
• Data sources frequently change/evolve
  – Changes may invalidate previous analysis results.
  – May require analysis databases to support versioning of results.
Analyze, Reduce, and Extend Experimental Data
• The goal of data analysis is to extract or discover biological relevance from the raw data.
• Raw data must be “cleaned”, filtered, and transformed
  – Vector/adaptor identification & clipping
  – Sequence assembly
  – Consensus sequence identification
  – Peptide mass fingerprint (PMF) searching
  – False positive detection/filtering.
• Data representations must be modeled and developed.
  – How to represent interaction data?
    • Sequences? Electropherograms? Mass Peak Lists?
    • Interactions? Pathways? Sequence Annotations?
    • Many other biological concepts / processes / functions?
  – How to organize data structures to enable querying (analysis) involving
    • Many tables
    • >1 million rows in some tables
    • filtering, aggregation, and computation of data
• Analysis algorithms must be developed/adapted.
• Statistical models must be developed/validated.
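Two of the cleaning steps named above, vector clipping and low-quality filtering, can be sketched as follows. This is a simplified illustration under stated assumptions: the vector is removed only when the read begins with an exact copy of a known prefix, quality scores are per-base integers, and the thresholds (`min_q`, `min_len`) are arbitrary example values. Real pipelines use alignment-based vector screening.

```python
def clip_vector(seq, quals, vector_prefix):
    """Remove a known vector prefix if the read starts with it (exact match only)."""
    if seq.startswith(vector_prefix):
        n = len(vector_prefix)
        return seq[n:], quals[n:]
    return seq, quals


def trim_low_quality(seq, quals, min_q=20):
    """Trim bases from both ends until a base with quality >= min_q is reached."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


def passes_filter(seq, min_len=50):
    """Keep only reads that are still long enough after cleaning."""
    return len(seq) >= min_len


seq = "GGATCCACGTAC"
quals = [30] * 6 + [10, 30, 30, 30, 30, 5]
seq, quals = clip_vector(seq, quals, "GGATCC")
seq, quals = trim_low_quality(seq, quals)
print(seq, quals, passes_filter(seq, min_len=4))
```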
Example: consequences of naïve data modeling
Example: Y2H Data Analysis Process Flow
• Y2H Laboratory
• Send/Receive Lab Sequence
  – Track Sequence Submitted
  – Versioning
• Perform Basecalling
  – Sequence String
  – Quality Score
  – Quality Matrix
• Perform QC and Clean Lab Sequence
  – Failed Requeue
  – Vector Clipping
  – Repeat Masking
  – Low Quality Filter
• Annotate/Identify Lab Sequences
  – BLAST, Parameters, Version
  – Homologous Seqs, Splice Variants
  – Domain Search
• Construct Interaction Pair
  – Frequency of Interaction
  – Confidence Level
  – Collect False Positive, Self Activators
• Construct Interaction Map
  – Visualization
  – Query
  – Compare Difference
• Integrate External Evidence
  – Gene Expression
  – Pathway
  – Disease
• Perform Downstream Analysis
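The "Frequency of Interaction" item in the flow above can be sketched as a simple count of how often each bait and prey pair was observed among the positives. This is a minimal illustration: the input format (a list of `(bait, prey)` tuples) and the `min_count` threshold are assumptions, and how frequency feeds into a confidence level is not specified in the slides.

```python
from collections import Counter


def interaction_frequencies(positives):
    """Count how often each (bait, prey) pair was observed."""
    return Counter(positives)


def recurrent_pairs(positives, min_count=2):
    """Pairs seen at least min_count times, one possible input to a confidence level."""
    counts = interaction_frequencies(positives)
    return {pair for pair, n in counts.items() if n >= min_count}


positives = [("X", "Z"), ("X", "Z"), ("X", "W"), ("Q", "Z")]
print(interaction_frequencies(positives))
print(recurrent_pairs(positives))
```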
Dealing with False Positives
• False positives will always be generated.
  – Y2H
    • “Self-activating” baits.
    • “Promiscuous” preys.
  – Mass Spectrometry
    • Proteins that interact directly with affinity beads.
    • Proteins that interact directly with affinity tags.
    • Contaminants.
• False positives are very hard to detect and distinguish from real positives.
• False positives must be addressed both biologically and informatically:
  – Known false positives can be “subtracted” from Y2H AD/BD libraries before experiments.
  – Mass spectrometry control experiments with affinity beads, affinity tags, and background contaminants can be “subtracted” from results.
  – Known false positives can be “subtracted” during analysis.
  – Statistical tests can be developed to help identify possible false positives during analysis.
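The informatic side of this, subtracting known false positives during analysis and flagging candidates for review, can be sketched as below. This is a simplified illustration, not a validated method: interactions are assumed to be `(bait, prey)` tuples, and a prey is flagged as possibly promiscuous simply when it appears with more than `max_baits` distinct baits, a threshold chosen arbitrarily here in place of a proper statistical test.

```python
from collections import defaultdict


def subtract_known(pairs, known_false_positives):
    """Drop every interaction that involves a known false-positive protein."""
    return [
        (bait, prey)
        for bait, prey in pairs
        if bait not in known_false_positives and prey not in known_false_positives
    ]


def promiscuous_preys(pairs, max_baits=3):
    """Preys found with more than max_baits distinct baits (candidates for review)."""
    baits_per_prey = defaultdict(set)
    for bait, prey in pairs:
        baits_per_prey[prey].add(bait)
    return {prey for prey, baits in baits_per_prey.items() if len(baits) > max_baits}


pairs = [("B1", "P"), ("B2", "P"), ("B3", "P"), ("B4", "P"), ("B1", "Q"), ("S", "Q")]
cleaned = subtract_known(pairs, known_false_positives={"S"})
print(cleaned)
print(promiscuous_preys(cleaned))
```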
Mine and Visualize the Results of Analysis
• Proteomics-specific data mining tools are required to extract meaningful knowledge from massive amounts of data.
  – Flexible searching capabilities.
  – Flexible filters to reduce the amount of data.
  – Multiple views of the data.
  – Ad-hoc query tools for unanticipated data mining needs.
  – Data warehouses and/or data marts are required to support data mining without impacting performance-sensitive LIMS and analytic systems.
• Visualization tools are required to visually organize the data and reveal meaningful patterns.
  – Quality control visualizations.
  – Interaction network visualizations.
  – Interaction network visualizations with experimental data overlays.
  – Disease and metabolic pathway visualizations with interaction network overlays.
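Interaction network visualizations usually start from a sub-graph around a protein of interest rather than the whole network. The sketch below extracts such a neighborhood with a breadth-first search; it is only an illustration, with interactions assumed to be undirected `(protein, protein)` tuples and `max_hops` an example parameter.

```python
from collections import deque


def neighborhood(edges, start, max_hops=1):
    """Nodes within max_hops of start, plus the edges among those nodes."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    hops = {start: 0}          # node -> distance from start
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue           # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)

    nodes = set(hops)
    sub_edges = [(a, b) for a, b in edges if a in nodes and b in nodes]
    return nodes, sub_edges


edges = [("A", "B"), ("B", "C"), ("C", "D")]
print(neighborhood(edges, "A", max_hops=2))
```

The resulting node and edge lists can then be handed to whatever layout or drawing tool is in use.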
Quality Control Visualization (1)
[Figure: scatter plot; x-axis: SEARCHID (24148 to 39413); y-axis: 0 to 160]
Quality Control Visualization (2)
Plate-by-plate Sequencing Purity Monitor
[Figure: one panel per sequencing plate (RA0000055 through RA0000172 and RB0000041 through RB0000136); each panel shows plate rows A to P against well columns 1 to 24; x-axis: well]
Quality Control Visualization (3)
Y2H Interaction Map with Curated Promiscuous Protein Annotation
interacting baits highlighted with their pronet annotation
[Figure: interaction map with a prey axis; legend: Prey Annotated, Bait Annotated]
Interaction Network Sub-Graph Visualization
Y2H Interaction Network Sub-Graph
Visualization with Protein Pull-down Overlay
[Figure: interaction network sub-graph with nodes labeled by locus (for example loc1, loc21, loc35)]
Pathway with Interaction Network Annotation
[Figure: pathway diagram annotated with protein-protein interactions. Legend: New Protein-Protein Interaction; Known Protein-Protein Interaction; Transduction Pathway; Known Pathway Member; Identified Interactor; Novel Transcript; Traditional “Druggable” Enzyme; Other Enzymes. Pathway labels: fibril formation, deposition; Amyloid Plaque, Neurofibrillary Tangle Formation; Apoptosis]
Underlying Pathway Adapted from http://www.kegg.com
Acknowledgements