BCB 444/544
Lecture 38
Review: Microarrays
Proteomics
#38_Nov28
Thanks to
Doina Caragea, KSU
BCB 444/544 F07 ISU Dobbs #38 - Proteomics
11/28/07
Required Reading
(before lecture)
3 √Mon Nov 26 - Lecture 37
Clustering & Classification Algorithms
• Chp 18 Functional Genomics
2 Wed Nov 28 - Lecture 38
Proteomics & Protein Interactions
• Chp 19 Proteomics
Thurs Nov 29 - Lab 12
R Statistical Computing & Graphics (Garrett Dancik)
http://www.r-project.org/
1 Fri Nov 30 - Lecture 39 (Last Lecture!)
Systems Biology
(& a bit of Metabolomics & Synthetic Biology)
Assignments & Announcements
Mon Nov 26 - HW#6 Due (5 PM Mon Nov 26 or ASAP)
Mon Dec 3 - BCB 544 Project Reports Due (NO CLASS that day!!)
ALL BCB 444 & 544 students are REQUIRED to attend
ALL project presentations next week!!!
Tentative Schedule:
Wed Dec 5: #1: Xiong & Devin (~20’)
#2: Tonia (10-15’)
Fri Dec 7: #3: Kendra & Drew (~20’)
#4: Addie (10-15’)
Thurs Dec 6 - Optional Review Session for Final Exam
Mon Dec 10 - BCB 444/544 Final Exam (9:45 - 11:45AM)
Will include:
40 pts In Class: New material (since Exam 2)
20 pts In Class: Comprehensive
40 pts In Lab Practical (Comprehensive)
Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics:
http://www.bcb.iastate.edu/seminars/index.html
Nov 29 Thurs - Baker Center Seminar 2:10 Howe Hall Auditorium,
• Greg Voth Univ. of Utah
• Multiscale Challenge for Biomolecular Systems: A Systematic Approach
Nov 29 Thurs - BBMB Seminar 4:10 in 1414 MBB
• Sue Gibson Univ. of Minnesota
• How do soluble sugar levels help regulate plant development, carbon
partitioning and gene expression?
Nov 30 Fri - BCB Faculty Seminar 2:10 in 102 ScI
• Shashi Gadia ComS, ISU
• Harnessing the Potential of XML
Nov 30 Fri - GDCB Seminar 4:10 in 1414 MBB
• John Abrams Univ Texas Southwestern Medical Center
• Dying Like Flies: Programmed & Unprogrammed Cell Death
Chp 18 – Functional Genomics
SECTION V
GENOMICS & PROTEOMICS
Xiong: Chp 18 Functional Genomics
• Sequence-based Approaches
• Microarray-based Approaches
• Comparison of SAGE & DNA Microarrays
Gene Expression Analysis
Pattern Recognition in Microarray Analysis
• Clustering (unsupervised learning)
• Uses primary data to group measurements, with no
information from other sources
• Classification (supervised learning)
• Uses known groups of interest (from other sources) to learn
features associated with these groups in primary data and
create rules for associating data with groups of interest
Microarray Analysis - Questions
& Answers
• How do hierarchical clustering algorithms work?
• How do we measure the distance between two
clusters? (similarity criteria)
• Single link
• Complete link
• Average link
• What are “good clusters”?
• Big difference between INTRA-cluster distance and INTER-cluster distance,
i.e., INTRA-cluster distance is minimized while INTER-cluster distance is maximized
• What are pros & cons of:
• Hierarchical vs K-means clustering
• Clustering vs Classification
Clustering Metrics
• A key issue in clustering is to determine what
similarity / distance metric to use
• Often, the choice of metric has a bigger effect on the
results than the actual clustering algorithm used!
• When determining the metric, we should take into
account our assumptions about the data and the
goal of the clustering
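To see how much the metric matters, here is a minimal sketch comparing Euclidean distance with a Pearson-correlation distance (1 - r). The three expression profiles are made-up values for illustration only, and the function names are not from the lecture.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """1 - Pearson correlation: 0 for same-shaped profiles, 2 for opposite ones."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - cov / math.sqrt(var_x * var_y)

# Hypothetical expression profiles across four conditions
gene_a = [1.0, 2.0, 3.0, 4.0]      # rising
gene_b = [10.0, 20.0, 30.0, 40.0]  # rising, same shape, 10x the magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]      # falling, similar magnitude to gene_a

print("Euclidean a-b:", euclidean_distance(gene_a, gene_b))
print("Euclidean a-c:", euclidean_distance(gene_a, gene_c))
print("Pearson   a-b:", pearson_distance(gene_a, gene_b))
print("Pearson   a-c:", pearson_distance(gene_a, gene_c))
```

With Euclidean distance gene_a looks closer to gene_c (similar magnitude); with the correlation distance it looks closer to gene_b (same trend). Which answer is "right" depends on whether the goal is to group genes by expression level or by expression pattern.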
How to Determine Distances?
Intra-cluster distance
• Min/Max/Avg the distance between
- All pairs of points in the cluster OR
- Centroid and all points in the cluster
Inter-cluster distance
• Single link
• distance between two most similar members
• Complete link
• distance between two least similar members
• Average link
• Average distance of all pairs
• Centroid distance
• distance between the centroids of the two clusters
What is the centroid? The "average" of all points of a set X. The
centroid of a finite set of points can be computed as the arithmetic
mean of each coordinate of the points. (Wikipedia)
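The inter-cluster criteria above can be sketched as small functions. This is an illustrative sketch assuming Euclidean distance between points; the two example clusters are made-up coordinates, and the function names are chosen here, not taken from the lecture.

```python
import math

def centroid(cluster):
    """Arithmetic mean of each coordinate of the points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def pair_distances(cluster_a, cluster_b):
    """All point-to-point distances between two clusters."""
    return [math.dist(p, q) for p in cluster_a for q in cluster_b]

def single_link(cluster_a, cluster_b):
    """Distance between the two most similar (closest) members."""
    return min(pair_distances(cluster_a, cluster_b))

def complete_link(cluster_a, cluster_b):
    """Distance between the two least similar (farthest) members."""
    return max(pair_distances(cluster_a, cluster_b))

def average_link(cluster_a, cluster_b):
    """Average distance over all cross-cluster pairs."""
    distances = pair_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)

def centroid_distance(cluster_a, cluster_b):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(cluster_a), centroid(cluster_b))

# Hypothetical clusters on a line (y = 0) to keep the arithmetic easy to check
cluster_1 = [(0.0, 0.0), (1.0, 0.0)]
cluster_2 = [(3.0, 0.0), (5.0, 0.0)]

print(single_link(cluster_1, cluster_2))
print(complete_link(cluster_1, cluster_2))
print(average_link(cluster_1, cluster_2))
print(centroid_distance(cluster_1, cluster_2))
```

The same two clusters get different "distances" under each criterion, which is why the choice of linkage changes the shape of the resulting tree.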
INTRA- vs INTER-Cluster Distances
[Figure: example clusterings labeled “Good!” and “Bad!”]
Methods for Clustering
(Unsupervised Learning)
• Hierarchical Clustering
• K-Means
• Self Organizing Maps
• (in lab, won’t discuss in lecture)
• …many others….
Hierarchical Clustering*
*This method was illustrated in Lecture 36, Tables 6.1-MM6.4
• Probably most popular clustering algorithm for microarray analysis
• First presented in this context by Eisen et al. in 1998
• Nodes = genes or groups of genes
Agglomerative (bottom up)
0. Initially each item is a cluster
1. Compute distance matrix
2. Find two closest nodes (most similar clusters)
3. Merge them
4. Compute distances from merged node to all others
5. Repeat until all nodes merged into a single node
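The agglomerative steps above can be sketched in a few lines. This is a simplified illustration, not an efficient implementation: it assumes 1-dimensional values and single-link distance, and it rescans all cluster pairs on every pass instead of maintaining a distance matrix. The input values are made up.

```python
from itertools import combinations

def single_link(cluster_a, cluster_b):
    """Distance between the two most similar members (1-D values)."""
    return min(abs(a - b) for a in cluster_a for b in cluster_b)

def agglomerative(values, linkage=single_link):
    """Bottom-up clustering; returns the merges in the order they happen."""
    clusters = [[v] for v in values]              # step 0: each item is a cluster
    merges = []
    while len(clusters) > 1:                      # step 5: until one node remains
        # steps 1-2: find the pair of clusters with the smallest distance
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        dist = linkage(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]        # step 3: merge them
        merges.append((merged, dist))
        # step 4 is implicit: distances to the merged node are recomputed next pass
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

merges = agglomerative([1.0, 2.0, 6.0, 6.5, 10.0])
for members, dist in merges:
    print(sorted(members), "merged at distance", dist)
```

The merge distances are the heights in the tree. Cutting the tree between two successive merge distances fixes the clusters: here, a cut between the second and third merges leaves {1, 2}, {6, 6.5} and {10}.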
Copyright: Russ Altman
Hierarchical Clustering:
Strengths & Weaknesses
• Easy to understand & implement
• Can decide how big to make clusters by choosing
cut level of hierarchy
• Can be sensitive to bad data
• Can have problems interpreting tree
• Can have local minima
Bottom-up is most commonly used method
• Can also perform top-down, which requires
splitting a large group successively
K-Means Clustering (Model-based)
Computationally attractive!
1. Choose K random points (cluster centers or centroids) in the feature space
2. Compute distance from each data point to centroids
3. Assign each data point to closest centroid
4. Compute new cluster centroid as average of points assigned to cluster
5. Loop to (2), stop when cluster centroids do not move very much
[Figure: K = 2 with two features, f1 (x-coordinate) & f2 (y-coordinate), showing Initial Centroid A, Initial Centroid B, 2nd Centroid A, 2nd Centroid B]
K-Means Clustering Example, for k=2
For simplicity, assume k=2 & objects are 1-dimensional
(Numerical difference is used as distance)
Steps in K-means clustering:
0. Objects: 1, 2, 5, 6, 7
1. Randomly select 5 and 6 as centers (centroids)
2. Calculate distance from points to centroids &
assign points to clusters: {1,2,5} & {6,7}
3. Compute new cluster centroids:
(C1) = 8/3 ≈ 2.7
(C2) = 13/2= 6.5
4. Calculate distance from points to new centroids &
assign data points to new clusters: {1,2} & {5,6,7}
5. Compute new cluster centroids:
(C1) = 1.5
(C2) = 6.0
6. No change? Converged!
=> Final clusters = {1,2} & {5,6,7}
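The 1-dimensional worked example (objects 1, 2, 5, 6, 7 with initial centroids 5 and 6) can be reproduced with a short sketch. The function name and the exact stopping rule (stop when the centroids no longer change at all) are choices made here for illustration.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """K-means on 1-D values, using numerical difference as the distance."""
    clusters = []
    for _ in range(max_iter):
        # Assign each point to its closest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its assigned points
        new_centroids = [
            sum(c) / len(c) if c else centroids[k]
            for k, c in enumerate(clusters)
        ]
        if new_centroids == centroids:   # no change: converged
            break
        centroids = new_centroids
    return clusters, centroids

clusters, centroids = kmeans_1d([1, 2, 5, 6, 7], [5, 6])
print(clusters, centroids)
```

The first pass gives {1,2,5} & {6,7}, the second moves 5 into the other cluster, and the third pass changes nothing, matching the steps on the slide.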
K Means Clustering for k=2
A more realistic example
Pick seeds -> Assign clusters -> Compute centroids -> Re-assign clusters -> Compute centroids -> Re-assign clusters -> Converged!
From S. Mooney
K-Means Clustering:
Strengths & Weaknesses
• Fast, O(N) per iteration (for a fixed K and number of features)
• Hard to know which K to choose
• Try several and assess cluster
quality
• Hard to know where to seed the
clusters
• Results can change drastically with
different initial choices for
centroids - as shown in example:
Example Illustrating Sensitivity to Seeds
In the example figure (points A-F), starting with B and E as centroids
converges to {A,B,C} and {D,E,F}
Starting with D and F converges to {A,B,D,E} and {C,F}
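The coordinates of points A-F are not given in the transcript, so the sketch below uses an assumed layout (two rows of three points, with C and F set off to the right) chosen so that K-means behaves as described. It illustrates the seed sensitivity; it is not a reconstruction of the original figure.

```python
import math

# Hypothetical coordinates: A, B, C on the top row; D, E, F on the bottom row
POINTS = {
    "A": (0.0, 1.0), "B": (1.0, 1.0), "C": (3.0, 1.0),
    "D": (0.0, 0.0), "E": (1.0, 0.0), "F": (3.0, 0.0),
}

def kmeans(points, seed_names, max_iter=100):
    """K-means seeded with named points; returns clusters as sorted name lists."""
    centroids = [points[name] for name in seed_names]
    assignment = {}
    for _ in range(max_iter):
        # Assign every point to its nearest centroid
        new_assignment = {
            name: min(range(len(centroids)), key=lambda k: math.dist(xy, centroids[k]))
            for name, xy in points.items()
        }
        if new_assignment == assignment:   # assignments stable: converged
            break
        assignment = new_assignment
        # Move each centroid to the mean of its members
        for k in range(len(centroids)):
            members = [points[n] for n, c in assignment.items() if c == k]
            if members:
                centroids[k] = tuple(sum(v) / len(members) for v in zip(*members))
    return [sorted(n for n, c in assignment.items() if c == k)
            for k in range(len(centroids))]

print(kmeans(POINTS, ["B", "E"]))
print(kmeans(POINTS, ["D", "F"]))
```

Both runs converge, but to different partitions of the same six points, which is why K-means is usually run several times with different seeds.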
Choice of K? Helpful to have additional
information to aid evaluation of clusters
Hierarchical Clustering vs K-Means
                Hierarchical Clustering                  K-Means
Running Time    Slower                                   Faster
Assumptions     Requires distance metric                 Requires distance metric
Parameters      None                                     K (number of clusters)
Clusters        Subjective (only a tree is returned)     Exactly K clusters
Classification:
Supervised Learning Task
• Given: a set of microarray experiments, each done
with mRNA from a different patient (but from same
cell type from every patient)
Patient’s expression values for each gene constitute
the features, and patient’s disease constitutes the
class
• Do: Learn a model that accurately predicts class
based on features
• Outcome: Predict class value of a patient based on
expression levels of his/her genes
Methods for Classification
• K-nearest neighbors (KNN)
• Linear Models
• Logistic Regression
• Naive Bayes
• Decision Trees
• Support Vector Machines
K-Nearest Neighbor (KNN)
• Idea: Use k closest neighbors to label new data
points (e.g., for k = 4)
Basic KNN Algorithm
INPUT:
• Set of data with labels (training data)
• K
• Set of data needing labels
• Distance metric
1. For each unlabeled data point, compute distance to
all labeled data
2. Sort distances, determine closest K neighbors
(smallest distances)
3. Use majority voting to predict label of unlabeled
data point
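The basic KNN algorithm above can be sketched as follows. This assumes Euclidean distance as the metric; the training data (two gene expression values per patient, labeled "normal" or "tumor") are made-up values for illustration.

```python
import math
from collections import Counter

def knn_predict(training, query, k):
    """Label a query point by majority vote among its k nearest labeled points.

    training: list of (features, label) pairs
    query:    feature tuple of the unlabeled point
    """
    # Steps 1-2: sort labeled data by distance to the query, keep the k closest
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    # Step 3: majority vote
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training set: (expression of gene 1, expression of gene 2), class
training = [
    ((1.0, 1.0), "normal"), ((1.2, 0.8), "normal"), ((0.9, 1.1), "normal"),
    ((3.0, 3.2), "tumor"),  ((3.1, 2.9), "tumor"),  ((2.8, 3.0), "tumor"),
]

print(knn_predict(training, (2.9, 3.1), k=3))
print(knn_predict(training, (1.0, 0.9), k=3))
```

An odd K is used here so that a two-class vote cannot tie; with an even K or more classes, a tie-breaking rule is needed.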
Variations on KNN
• Can classify into multiple classes easily
• Weighted KNN - can weight votes of nearby
training samples based on their distance from
unknown sample
• Can set a threshold, p, for the # of votes needed
to win. (If no winner, then either NULL result or
set default winner)
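A minimal sketch of the weighted and thresholded variants. Two choices here are assumptions, not from the lecture: votes are weighted by inverse distance, and the threshold is expressed as the fraction of the total weighted vote the winner must reach (the slide states it as a number of votes). The 1-dimensional training data are made up.

```python
import math
from collections import defaultdict

def weighted_knn_predict(training, query, k, threshold=0.5, default=None):
    """Weighted KNN: closer neighbors get larger votes (1 / distance).

    Returns `default` (a NULL result unless a default winner is supplied)
    when the winning label's share of the weighted vote is below `threshold`.
    """
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for features, label in nearest:
        # Small constant avoids division by zero for an exact match
        votes[label] += 1.0 / (math.dist(features, query) + 1e-9)
    winner = max(votes, key=votes.get)
    if votes[winner] / sum(votes.values()) >= threshold:
        return winner
    return default

# Hypothetical 1-D data: two RED points near the query, three BLUE points farther away
training = [
    ((0.0,), "red"), ((1.0,), "red"),
    ((4.0,), "blue"), ((5.0,), "blue"), ((6.0,), "blue"),
]

print(weighted_knn_predict(training, (2.0,), k=5))
print(weighted_knn_predict(training, (2.0,), k=5, threshold=0.75))
```

With K = 5, an unweighted vote would pick "blue" (3 votes to 2), but the two red points are closer, so the weighted vote picks "red". Raising the threshold to 0.75 leaves no winner, so the NULL result is returned.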
Compare in Graphical Representation
Clustering
Classification
Apply external labels:
RED group & BLUE group
Tradeoffs for Clustering vs Classification
• Clustering is not biased by previous knowledge, but
therefore needs stronger signal to discover
clusters
• Classification uses previous knowledge, so can
detect weaker signal, but may be biased by
WRONG previous knowledge
Chp 19 – Proteomics
SECTION V
GENOMICS & PROTEOMICS
Xiong: Chp 19 Proteomics
• Technology of Protein Expression Analysis
• Post-translational Modification
• Protein Sorting
• Protein-Protein Interactions
ISU Proteomics Resources & Researchers
Facilities:
Proteomics Facility (Carver Co-lab)
http://www.plantgenomics.iastate.edu/proteomics/
Protein Facility (MBB)
http://www.protein.iastate.edu/
Experiments:
Plant: Rodermel, Wise, Voytas
Animal: Greenlee, perhaps others soon?
Computational Analysis:
Honavar, Wise, Dobbs
Proteomics: What do all those proteins do??
Biological processes for yeast proteins
Copyright © 2006
A. Malcolm Campbell
Proteome Analysis: “Traditionally”
using Two-dimensional (2D) gels
1st D: Isoelectric focusing (IEF) in pH gradient:
Proteins migrate to isoelectric points & stop moving
2nd D: SDS-PAGE (SDS detergent, polyacrylamide gel electrophoresis):
Proteins migrate according to molecular weight
Copyright © 2006
A. Malcolm Campbell
Proteins identified on 2D gels
(IEF/SDS-PAGE)
Direct protein microsequencing by Edman degradation
-- done at facilities (here at ISU)
-- typically need 5 picomoles
-- often get 10 to 20 amino acids of sequence
Protein mass analysis by MALDI-TOF
-- Matrix-Assisted Laser Desorption/Ionization
Time-Of-Flight Mass Spectrometry
-- done at facilities (here at ISU)
-- often detect post-translational modifications
(such as phosphorylated Ser, Thr, Tyr)
Page 250-1
Evaluation of 2D gels (IEF/SDS-PAGE)
Advantages:
Visualize hundreds to thousands of proteins
Improved identification of protein spots
Disadvantages:
Limited number of samples can be processed
Mostly abundant proteins visualized
Technically difficult
Jonathan Pevsner
Page 251
Tandem Mass Spectrometry (MS/MS)
to Identify Proteins
Figure 8.19 Tandem mass spectrometry for
protein identification
a) ESI creates ionized proteins, represented by
colored shapes with positive charges. Each shape
represents many copies of identical proteins.
b) Ionized proteins are separated based on their
mass to charge ratio (m/z) and sent one at a time
into the activation chamber. Separation and
selection take place in the first of the two MS
devices. The solid purple protein has been selected
for analysis; the other three are temporarily stored
for later analysis.
c) The group of m/z selected ionized proteins enters a
collision cell that is filled with inert argon gas. Gas
molecules collide with proteins, which causes them
to break into two peptide pieces (labeled b and y).
d) Ionized peptide pieces are sent into second MS
device, which again measures the m/z ratio. A
computer compares spectrum of peptide pieces
to a database of ideal spectra to identify the
original group of identical proteins.
Copyright © 2006
A. Malcolm Campbell
MS data: Protein identification through
peptide fragment identification & separation
Figure 8.20 When a group of identical proteins is broken into peptide pieces, more than one pair of b and y
peptides will be formed. a) One protein sequence and its calculated mass on top, with the b peptides/masses
(gray) and the y peptides/masses (purple) below. b) An experimentally determined mass/charge spectrum from
the peptide in panel a). Some peaks are higher than others, which means that some b/y peptide pieces were
more abundant than others. The spectrum is used to determine each peptide’s amino acid sequence and protein
identity.
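The b and y peptide masses described in the caption can be computed from a sequence with a short sketch. It assumes singly charged fragments, standard monoisotopic residue masses, and no post-translational modifications; the example peptide "PEPTIDE" is an arbitrary test string, not the sequence in the figure.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added to the intact peptide and to y fragments
PROTON = 1.00728   # charge carrier for a singly charged ion

def fragment_ions(peptide):
    """Return (b_ions, y_ions) m/z lists for singly charged fragments.

    Entry i of each list comes from the same cleavage site: the b ion holds
    the first i+1 residues (N-terminal side) and the y ion holds the rest.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for site in range(1, len(peptide)):
        b_ions.append(sum(masses[:site]) + PROTON)
        y_ions.append(sum(masses[site:]) + WATER + PROTON)
    return b_ions, y_ions

peptide = "PEPTIDE"
peptide_mass = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
b_ions, y_ions = fragment_ions(peptide)

print("peptide mass:", round(peptide_mass, 4))
for site, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"cleavage after residue {site}: b = {b:.4f}, y = {y:.4f}")
```

Database search programs compare a list of theoretical fragment masses like this one, computed for each candidate peptide, against the peaks in the measured spectrum.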
Copyright © 2006
A. Malcolm Campbell
Databases of 2D Gel Information
http://ca.expasy.org/ch2d/2d-index.html
Jonathan Pevsner