Transcript slides
PROTERAN:
ANIMATED TERRAIN
EVOLUTION FOR VISUAL
ANALYSIS OF PROTEIN
FOLDING TRAJECTORY
1
The need for Bioinformatics
Bioinformatics: Application of
computational techniques to the
management and analysis of biological
information.
Clustering techniques applied to the data are not
enough; a good visual representation is
also needed
2
Agenda
Microarrays
Review of existing clustering and visualization
techniques on gene expression data
The need for a customized visualization tool for use
by Dr. Laxmi Parida & Dr. Ruhong Zhou of the
computational biology group at the IBM Watson
Research Center for visual analysis of protein
characteristics
Introduce our new technique that makes use of an
animated terrain, implemented in the program called
PROTERAN
3
Function of Genes & Proteins
Through the proteins they encode, genes
orchestrate the mysteries of life
Protein functions vary widely from
mechanical support to transportation to
regulation.
4
Still a lot of work ahead
Traditional methods discovered gene functions on a
gene-by-gene basis, so throughput was low.
Many genes are believed to work together;
this cannot be observed when genes are studied
one by one.
5
Microarrays
Solve the throughput problem
Allow scientists to see genes on a genomic
level
6
Expression Matrix
          Experiment 1   Experiment 2   ...   Experiment M
Gene 1    C511/C311      C512/C312      ...   C51M/C31M
Gene 2    C521/C321      C522/C322      ...   C52M/C32M
...       ...            ...            ...   ...
Gene N    C5N1/C3N1      C5N2/C3N2      ...   C5NM/C3NM
7
Clustering & Visualization
Techniques Review
8
Clustering
Clustering: Act of grouping similar
objects together
Applied to gene expression in order to
find the function of unknown genes
Many different clustering techniques exist in
the literature. Representative techniques are
discussed next.
9
Determining similarity
between two genes
Choose a similarity (distance) measure to compare
genes
e.g. Euclidean distance
          Experiment 1   Experiment 2   ...   Experiment M
Gene 1    C511/C311      C512/C312      ...   C51M/C31M
Gene 2    C521/C321      C522/C322      ...   C52M/C32M
...       ...            ...            ...   ...
Gene N    C5N1/C3N1      C5N2/C3N2      ...   C5NM/C3NM
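As a minimal sketch of the distance measure named on this slide, the Euclidean distance between two rows of the expression matrix can be computed as follows. The function name and the sample values are illustrative assumptions, not data from the talk.

```python
import math

def euclidean_distance(gene_a, gene_b):
    """Euclidean distance between two expression profiles of equal length M."""
    if len(gene_a) != len(gene_b):
        raise ValueError("both profiles must cover the same experiments")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(gene_a, gene_b)))

# Hypothetical expression ratios for two genes across M = 3 experiments.
gene_1 = [1.0, 2.0, 4.0]
gene_2 = [1.0, 0.0, 4.0]
print(euclidean_distance(gene_1, gene_2))  # 2.0
```

A smaller distance means the two genes behave more alike across the experiments.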
10
Hierarchical Clustering
1. Create distance matrix of all genes in relation
to each other
2. Find the two closest genes
3. Merge these two genes and redo distance
matrix
4. Repeat steps 2-3 until only one cluster left
11
Dendrogram
Binary tree with a distinguished root,
which has all the data items at the leaves
Re-orders the expression matrix to place
similar genes beside each other
12
Example
Initial distance matrix:
        A    B    C    D
A       0    1    6    8
B            0    5    7
C                 0    2
D                      0

After merging A and B:
        (A,B)   C    D
(A,B)   0       5    7
C               0    2
D                    0

After merging C and D:
        (A,B)   (C,D)
(A,B)   0       5
(C,D)           0

Agglomerative Hierarchical Clustering
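The numbers in this example (A-B = 1, A-C = 6, A-D = 8, B-C = 5, B-D = 7, C-D = 2) are consistent with single-linkage merging, where the distance between two clusters is the smallest distance between their members. The sketch below assumes that linkage rule; the function name and data layout are illustrative. Instead of rebuilding the distance matrix after every merge, it recomputes cluster distances from the original pairwise values, which gives the same result for single linkage.

```python
def single_linkage(labels, dist):
    """Agglomerative clustering; dist maps frozenset({a, b}) to a distance."""
    clusters = [(label,) for label in labels]
    merges = []

    def cluster_distance(c1, c2):
        # Single linkage: closest pair of members across the two clusters.
        return min(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

pairs = {"AB": 1, "AC": 6, "AD": 8, "BC": 5, "BD": 7, "CD": 2}
dist = {frozenset(key): value for key, value in pairs.items()}
for left, right, d in single_linkage("ABCD", dist):
    print(left, right, d)
```

The merge order (A with B at 1, C with D at 2, then the two pairs at 5) is the order in which the dendrogram is built from the leaves up.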
13
Advantages
Familiar to biologists
Few parameters to specify
14
Disadvantages
Requires fast CPUs and large amounts of
memory
Does not identify important clusters
Only represents hierarchically organized data
Does not scale up
15
Disadvantages cont..
A dendrogram always offers 2^(n-1) equivalent
representations (where n = number of
elements), since each of its n-1 internal
nodes can be flipped
16
Self Organizing Maps (SOMs)
User picks number of clusters called nodes
Nodes randomly mapped to M-dimensional
space (M = # of experiments)
Node values are adjusted by random vectors
picked from original data
After node values settle, vectors are
clustered to the closest node
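The four steps on this slide can be sketched as below. This is a simplified one-dimensional SOM under stated assumptions: the nodes sit on a line, the learning rate and neighbourhood radius shrink linearly, and the neighbourhood is Gaussian. None of these choices, nor the function names, come from the talk.

```python
import math
import random

def closest_node(nodes, vector):
    """Index of the node with the smallest squared Euclidean distance."""
    return min(
        range(len(nodes)),
        key=lambda j: sum((n - x) ** 2 for n, x in zip(nodes[j], vector)),
    )

def train_som(data, n_nodes, n_iter=2000, seed=0):
    """Train a 1-D SOM; n_nodes is the user-chosen number of clusters."""
    rng = random.Random(seed)
    dim = len(data[0])  # M = number of experiments
    lows = [min(v[d] for v in data) for d in range(dim)]
    highs = [max(v[d] for v in data) for d in range(dim)]
    # Nodes start at random positions inside the range of the data.
    nodes = [[rng.uniform(lows[d], highs[d]) for d in range(dim)]
             for _ in range(n_nodes)]
    start_radius = max(n_nodes / 2.0, 1.0)
    for step in range(n_iter):
        remaining = 1.0 - step / n_iter
        rate = 0.5 * remaining                       # assumed schedule
        radius = max(start_radius * remaining, 0.5)  # assumed schedule
        vector = rng.choice(data)                    # random data vector
        winner = closest_node(nodes, vector)
        for j, node in enumerate(nodes):
            # Neighbours of the winning node are pulled less strongly.
            pull = rate * math.exp(-((j - winner) ** 2) / (2.0 * radius ** 2))
            for d in range(dim):
                node[d] += pull * (vector[d] - node[d])
    return nodes

def assign(data, nodes):
    """Cluster every vector to its closest node once training has settled."""
    return [closest_node(nodes, v) for v in data]

data = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
        [9.9, 10.0], [10.0, 9.8], [10.1, 10.0]]
print(assign(data, train_som(data, n_nodes=2)))
```

The number of nodes, iteration count, and both schedules are exactly the kind of parameters the later "Disadvantages" slide refers to.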
17
Visualization
1. Dendrogram
2. Error Bar Representation
18
Visualization
3. U-Matrix
19
Advantages
User has partial control over structure
Fuzzy Clusters
Variety of visual techniques applicable
20
Disadvantages
Requires knowing the number of clusters
beforehand
Many parameters to specify
21
Principal Component Analysis
(PCA)
Mathematical technique that can be used to
reduce the number of dimensions of data
Principal component analysis
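As an illustration of the dimension reduction described here, the sketch below finds the first principal component of a small data set and projects each row onto it. It uses power iteration on the covariance matrix to stay within the standard library. This is an assumed, simplified approach, not necessarily the method used by the tools reviewed in the talk, and the function names and data are hypothetical.

```python
import math

def first_principal_component(rows, n_iter=100):
    """Return (column means, unit vector of the direction of largest variance)."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dim)]
    centered = [[r[d] - means[d] for d in range(dim)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    # Power iteration; the all-ones start vector is an assumption and would
    # fail only if it were exactly orthogonal to the leading component.
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return means, v

def project(rows, means, component):
    """Reduce each row to a single coordinate along the component."""
    return [sum((r[d] - means[d]) * component[d] for d in range(len(component)))
            for r in rows]

rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
means, pc1 = first_principal_component(rows)
print(pc1)
print(project(rows, means, pc1))
```

Here the two-dimensional points lie on a line, so one coordinate per point keeps all of the variance.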
22
Visualization
23
Advantages
No parameters required
3D Visualization
24
Disadvantages
Little control over structure
Running time of O(N^3)
Not applicable when input is a distance
matrix
25
Biclustering
Clustering of both rows and columns
simultaneously
26
Available Software
Software Name | Description | Available at
F-Scan | Quantification and analysis of fluorescently probed microarrays; scatterplots; multiple image comparison. | http://abs.cit.nih.gov/fscan/
TIGR SpotFinder | Spot identification. | http://www.tigr.org/software/
Cluster | Hierarchical clustering, K-means clustering, Self-Organizing Map (SOM), PCA | http://rana.lbl.gov/EisenSoftware.htm
Genesis | A Java suite containing various tools such as filters, normalization, visualization tools, common clustering algorithms, SOM, k-means, PCA | http://genome.tugraz.at/Software/GenesisCenter.html
J-Express Pro 2.0 | Hierarchical clustering, K-means, Principal Component Analysis, Self-organizing maps, Profile similarity search, Normalization and filtering, Raw data import, Project organization | http://www.molmine.com/frameset/frm_jexpress.htm
TreeView | Cluster output visualization | http://rana.lbl.gov/EisenSoftware.htm
27
Protein Folding
28
Reaction Coordinates
Folding determines the function of protein
All-atom recreation of protein unrealistic
Reaction coordinates used to describe protein
structure
1. Fraction of Native Contacts
2. Radius of Gyration
3. RMSD from the native structure
4. Number of beta-strand Hydrogen Bonds
5. Number of alpha helix turns
6. Hydrophobic core radius of gyration
7. Principal Components
29
Protein States
While folding, a protein goes through certain
states
The raw data is similar to microarray data.
Dr. Parida and Dr. Zhou have developed their
own techniques and clustered β-Hairpin data.
30
Reaction Coordinates used on
the β-Hairpin
1. Number of native β-strand hydrogen bonds
2. Radius of gyration of the hydrophobic core residues
3. Radius of gyration of the entire protein
4. Fraction of native contacts
5. Principal component 1
6. Principal component 2
7. Root mean square deviation (RMSD) from the native structure
31
Raw Data
32
Patterned Cluster
2
3
0 0.1 4 0.23
23 26 27
RED = Number of columns in pattern, here 2.
(Also defined as the Pattern Type)
WHITE = Column Number, here 0 and 4
PURPLE = Column Value, here 0.1 and 0.23
YELLOW = Number of occurrences, here 3
GREEN = Occurrences, here 23, 26 and 27
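A minimal parsing sketch for one patterned cluster record, assuming the four-line reading of this slide's example: pattern type, number of occurrences, column/value pairs, then the occurrences themselves. The function name, the returned dictionary keys, and the handling of ranges such as 94826-94831 (which appear in the sample file) are assumptions, not the authors' file specification.

```python
def parse_patterned_cluster(text):
    """Parse one record laid out as the assumed four lines described above."""
    lines = [line.split() for line in text.strip().splitlines()]
    pattern_type = int(lines[0][0])
    n_occurrences = int(lines[1][0])
    tokens = lines[2]
    # Column number followed by column value, repeated pattern_type times.
    columns = {int(tokens[i]): float(tokens[i + 1])
               for i in range(0, len(tokens), 2)}
    occurrences = []
    for token in lines[3]:
        if "-" in token:  # assumed inclusive range, e.g. 94826-94831
            start, end = token.split("-")
            occurrences.extend(range(int(start), int(end) + 1))
        else:
            occurrences.append(int(token))
    if len(columns) != pattern_type:
        raise ValueError("pattern type does not match the number of columns")
    return {
        "pattern_type": pattern_type,
        "n_occurrences": n_occurrences,
        "columns": columns,
        "occurrences": occurrences,
    }

record = parse_patterned_cluster("2\n3\n0 0.1 4 0.23\n23 26 27")
print(record)
```

For the slide's example this yields pattern type 2, columns 0 and 4 with values 0.1 and 0.23, and occurrences at times 23, 26 and 27.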
33
Sample Patterned Cluster File
2
1006
2
1003
3
1036
3
1056
0 7.335
59728
0 7.335
59728
0 7.335
59728
0 7.335
59728
1 0.735
87235
94826-94831
95748-95752
95761-95763
…
120424-120426
94826-94831
95748-95752
95761-95763
…
95769
94826
94828-94831
…
95761-95763
94826
94828-94831
…
95761-95763
…
95748-95752
1 0.736
87235
4 -5.881
72071
4 -5.881
72071
6 3.292
87235
5 2.214
87235
:
5
1089
2 8.144 3 0.899 4 -3.855 5 -33.574 6 3.292
45533 59728 72071 87235 94826
34
The need for Visual Analysis of
Patterned Cluster Data
The β-Hairpin file is approximately 500 MB
Difficult to study the textual representation
and get a global view
Very difficult to see interaction of all
patterned clusters in relation to each other
Also very difficult to remember all
patterned clusters and their occurrence in
time
35
Visual Requirements
Global View
Navigation & Focus
Relative growth
Details of characteristics on demand
36
Need for Customized Tool
All of the existing visualization techniques
on microarrays had one or more drawbacks
None were able to provide a visual for
depicting relative growth of clusters.
37
Terrain Metaphor
Has been shown to be a useful technique in
searching a corpus of documents
Very recently the idea has been applied to
gene expression with high density clusters
representing mountains
38
Using a Landscape Metaphor to
solve our requirements
Each mountain represents a patterned
cluster
Mountain growth represents evolution of
patterned cluster
Clicking on mountains returns details of
patterned cluster
39
PROTERAN
40
Mapping of Patterned Cluster
Data into Terrain Geometry
41
Mapping of Patterned Cluster
data into Terrain Geometry
Pattern Type: Number of columns in a patterned
cluster
Column Combination: Unique number that
identifies a combination of columns
2
3
0 0.1 4 0.23
23 26 27
42
Column Combinations
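Number of column combinations for a pattern type:
$\frac{c!}{(c - t)! \, t!}$
c = number of characteristics
t = pattern type (number of columns in the pattern)

With c = 7:
Pattern Type   Number of Column Combinations
2              21
3              35
4              35
5              21
6              7
7              1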
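The counts in this table can be checked by enumerating the combinations directly. The sketch below assumes the seven characteristics are labelled 0 to 6 and that a column combination is written as the concatenated column numbers (for example 01 or 0123456), as on the layout slide. The function name is illustrative.

```python
from itertools import combinations
from math import comb

N_CHARACTERISTICS = 7  # c in the formula

def column_combinations(pattern_type, c=N_CHARACTERISTICS):
    """All column combinations of the given pattern type, as label strings."""
    return ["".join(str(col) for col in combo)
            for combo in combinations(range(c), pattern_type)]

for t in range(2, 8):
    codes = column_combinations(t)
    # comb(c, t) is c! / ((c - t)! * t!)
    assert len(codes) == comb(N_CHARACTERISTICS, t)
    print(t, len(codes), codes[:3])
```

Summed over pattern types 2 to 7 this gives 120 column combinations, each of which needs its own fixed position in the terrain.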
43
Layout
We first thought of using an automated layout technique.
However, one of Dr. Zhou’s requirements was that the
same pattern cluster should appear in the same position for
consistent interpretation.
Another was that larger pattern types (6 and 7 columns)
must be placed so that they are easy to distinguish.
Hence it was decided to use a manual layout design
described next.
44
Layout
01 02 03   01234 01235 01236   012 013 014 015 016
04 05 06   01245 01246 01256   023 024 025 026 034
12 13 14   01345 01346 01356   035 036 045 046 056
15 16 23   01456 02345 02346   123 124 125 126 134
24 25 26   02356 02456 03456   135 136 145 146 156
34 35 36   12345 12346 12356   234 235 236 245 246
45 46 56   12456 13456 23456   256 345 346 356 456
0123 0124 0125 0126 0134 0135 0136 0145 0146 0156
0123456   012345 012346 012356   0234 0235 0236 0245 0246
012456 013456 023456   0256 0345 0346 0356 0456
1234 1235 1236 1245 1246 1256 1345 1346 1356 1456
2345 2346 2356 2456 3456   123456
45
Top Patterned Clusters
Visualized
A final requirement from Dr. Parida and Dr. Zhou was that
only the 10 largest patterned clusters of each
column combination should be visualized
Example for combination 01:
10th highest occurrence of combination 01
9th highest occurrence of combination 01
2nd highest occurrence of combination 01
3rd highest occurrence of combination 01
8th highest occurrence of combination 01
Highest occurrence of combination 01
4th highest occurrence of combination 01
7th highest occurrence of combination 01
6th highest occurrence of combination 01
5th highest occurrence of combination 01
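Selecting what gets drawn can be sketched as a group-and-sort step. The record layout below (a "combination" label and an "n_occurrences" count per cluster) is a hypothetical in-memory form, and "largest" is taken to mean the highest number of occurrences, as the labels on this slide suggest.

```python
from collections import defaultdict

def top_clusters(clusters, limit=10):
    """Keep the `limit` clusters with the most occurrences per combination."""
    by_combination = defaultdict(list)
    for cluster in clusters:
        by_combination[cluster["combination"]].append(cluster)
    return {
        combination: sorted(group, key=lambda c: c["n_occurrences"],
                            reverse=True)[:limit]
        for combination, group in by_combination.items()
    }

# Hypothetical clusters: twelve for combination 01, one for combination 02.
clusters = [{"combination": "01", "n_occurrences": n} for n in range(1, 13)]
clusters.append({"combination": "02", "n_occurrences": 7})
top = top_clusters(clusters)
print([c["n_occurrences"] for c in top["01"]])
```

Each retained cluster would then be placed in its rank position inside the cell reserved for its column combination.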
46
PROTERAN LAYOUT
47
Animated Terrain Evolution
Time proceeds from 0 to the maximum
number of experiments
At each time unit, all patterned clusters are
checked
If a cluster has an occurrence at that time, its
mountain's height is increased
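The animation loop described on this slide can be sketched as a generator that produces one frame of mountain heights per time unit. The fixed height increment per occurrence, the function name, and the cluster names are assumptions; the slides do not state how much a mountain grows.

```python
def evolve_heights(clusters, max_time, growth=1.0):
    """Yield (time, heights) frames; clusters maps a name to occurrence times."""
    occurrence_sets = {name: set(times) for name, times in clusters.items()}
    heights = {name: 0.0 for name in clusters}
    for t in range(max_time + 1):  # time runs from 0 to the maximum
        for name, times in occurrence_sets.items():
            if t in times:  # the cluster occurs at this time step
                heights[name] += growth
        yield t, dict(heights)  # copy, so earlier frames stay unchanged

# Hypothetical clusters and occurrence times.
clusters = {"cluster_a": [23, 26, 27], "cluster_b": [26]}
for t, frame in evolve_heights(clusters, max_time=27):
    if t >= 23:
        print(t, frame)
```

Rendering each frame in turn shows which mountains grow first and which grow together.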
48
Mountains of PROTERAN
49
Results & Extensions
50
Results
Very encouraging feedback
The easy-to-use layout and interface allow
1. Identification of states
2. Retrieval of patterned cluster values
3. Observation of how patterned clusters relate to each other as they grow
over time
On first use, Dr. Zhou reported that he
was able to find that the hydrophobic core is
largely formed before the beta-strand hydrogen
bonds are formed.
51
Future of PROTERAN
Introduced at the Intelligent Systems for
Molecular Biology (ISMB) conference in Scotland,
where it was very well received
Robert-Cedergren Bioinformatics
Colloquium at University of Montreal (Sept
23-24th)
52
Extensions
Analyze with different types of protein data
More generic layout with more
characteristics
Application with different types of data
53
Summary
1. Review of existing techniques to cluster and
visualize gene expression data
2. Protein characteristics data is similar to that of
gene expression data
3. None of the existing techniques were applicable, thus
the need for a customized visualization
4. Terrain Metaphor to solve our requirements
implemented in the program PROTERAN
54
Questions
55