Transcript slides

PROTERAN:
ANIMATED TERRAIN
EVOLUTION FOR VISUAL
ANALYSIS OF PROTEIN
FOLDING TRAJECTORY
1
The need for Bioinformatics

Bioinformatics: Application of
computational techniques to the
management and analysis of biological
information.

Applying clustering techniques to the data is not
enough; a good visual representation is also
needed.
2
Agenda




Microarrays
Review of existing clustering and visualization
techniques on gene expression data
The need for a customized visualization tool for use
by Dr. Laxmi Parida & Dr. Ruhong Zhou of the
computational biology group at the IBM Watson
Research Center for visual analysis of protein
characteristics
Introduce our new technique that makes use of an
animated terrain, implemented in the program called
PROTERAN
3
Function of Genes & Proteins

Through the proteins they encode, genes
orchestrate the mysteries of life
 Protein functions vary widely from
mechanical support to transportation to
regulation.
4
Still a lot of work ahead

Traditionally, gene functions were discovered on a
gene-by-gene basis, so throughput was low.
 Many genes are believed to work together; this
cannot be observed when genes are studied one by one.
5
Microarrays

Solve the throughput problem
 Allow scientists to see genes on a genomic
level
6
Expression Matrix
          Experiment 1   Experiment 2   ...   Experiment M
Gene 1    C511/C311      C512/C312      ...   C51M/C31M
Gene 2    C521/C321      C522/C322      ...   C52M/C32M
...       ...            ...            ...   ...
Gene N    C5N1/C3N1      C5N2/C3N2      ...   C5NM/C3NM
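As a rough illustration of how such a matrix could be filled in, the sketch below divides two intensity tables element by element. It assumes that C5 and C3 denote two per-spot measurements (for example, the two dye channels of a microarray); the function name `expression_matrix` and the input layout are hypothetical and not taken from the slides.

```python
def expression_matrix(c5, c3):
    """Return the N x M matrix of C5/C3 ratios.

    c5 and c3 are lists of rows (one row per gene, one value per
    experiment). Both names are assumptions made for this sketch.
    """
    if len(c5) != len(c3):
        raise ValueError("c5 and c3 must have the same number of genes")
    matrix = []
    for row5, row3 in zip(c5, c3):
        if len(row5) != len(row3):
            raise ValueError("rows must have the same number of experiments")
        # Entry (i, j) is C5ij / C3ij, as in the table above.
        matrix.append([a / b for a, b in zip(row5, row3)])
    return matrix


ratios = expression_matrix([[4.0, 1.0], [9.0, 2.0]],
                           [[2.0, 2.0], [3.0, 4.0]])
print(ratios)
```

Each row of the result is one gene's expression profile across the M experiments, which is the vector that the clustering techniques reviewed next operate on.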
7
Clustering & Visualization
Techniques Review
8
Clustering



Clustering: Act of grouping similar
objects together
Applied to gene expression in order to
find the function of unknown genes
Many different clustering techniques exist in
the literature. Representative techniques are
discussed next.
9
Determining similarity
between two genes

Choose a distance measure to compare
genes
e.g. Euclidean distance
          Experiment 1   Experiment 2   ...   Experiment M
Gene 1    C511/C311      C512/C312      ...   C51M/C31M
Gene 2    C521/C321      C522/C322      ...   C52M/C32M
...       ...            ...            ...   ...
Gene N    C5N1/C3N1      C5N2/C3N2      ...   C5NM/C3NM
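A minimal sketch of the Euclidean distance between two genes, treating each row of the expression matrix as a vector of M values. The function name and the sample numbers are illustrative only.

```python
import math


def euclidean_distance(gene_a, gene_b):
    """Euclidean distance between two expression profiles of equal length."""
    if len(gene_a) != len(gene_b):
        raise ValueError("both genes need one value per experiment")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(gene_a, gene_b)))


# Two hypothetical genes measured in M = 2 experiments.
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

A smaller distance means more similar expression across the experiments; other distance measures can be swapped in without changing the clustering procedures that follow.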
10
Hierarchical Clustering
1. Create a distance matrix of all genes in relation
to each other
2. Find the two closest genes (or clusters)
3. Merge them and recompute the distance
matrix
4. Repeat steps 2-3 until only one cluster is left
11
Dendrogram


Binary tree with a distinguished root,
which has all the data items at the leaves
Re-orders the expression matrix to place
similar genes beside each other
12
Example

Initial distance matrix:
        A    B    C    D
A       0    1    6    8
B            0    5    7
C                 0    2
D                      0

After merging A and B:
        (A,B)  C    D
(A,B)   0      5    7
C              0    2
D                   0

After merging C and D:
        (A,B)  (C,D)
(A,B)   0      5
(C,D)          0

Agglomerative Hierarchical Clustering
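The four hierarchical clustering steps can be sketched on the A, B, C, D example. The slides do not name the rule used to recompute distances after a merge; the example's numbers are consistent with single linkage (the minimum distance between members), so that is what this sketch assumes. The function name and data layout are illustrative.

```python
from itertools import combinations


def single_linkage(labels, dist):
    """Agglomerative clustering with single linkage (an assumption).

    dist maps frozenset({a, b}) to the distance between items a and b.
    Returns the list of merges as (cluster_1, cluster_2, distance).
    """
    clusters = [(label,) for label in labels]
    merges = []

    def cluster_distance(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        # Step 2: find the two closest clusters.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        merges.append((c1, c2, cluster_distance(c1, c2)))
        # Step 3: merge them; distances are recomputed on demand.
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Distances taken from the example matrix.
example = {
    frozenset("AB"): 1, frozenset("AC"): 6, frozenset("AD"): 8,
    frozenset("BC"): 5, frozenset("BD"): 7, frozenset("CD"): 2,
}
for merge in single_linkage("ABCD", example):
    print(merge)
```

The merge order (A with B at distance 1, C with D at distance 2, then the two pairs at distance 5) is exactly what a dendrogram draws, with the merge distance as the height of each join.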
13
Advantages

Familiar to biologists
 Few parameters to specify
14
Disadvantages

Requires fast CPUs and large amounts of
memory
 Does not identify important clusters
 Only represents hierarchically organized data
 Does not scale up
15
Disadvantages cont..

Dendrogram always offers 2^(n-1)
representations (where n = number of
elements)
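A dendrogram over n leaves has n-1 internal nodes, and swapping the two children of any node leaves the tree itself unchanged, which gives 2^(n-1) equivalent leaf orderings. The small sketch below enumerates them for a four-leaf tree; the nested-tuple tree encoding is an assumption made for illustration.

```python
def leaf_orderings(tree):
    """Return every leaf order obtainable by swapping children.

    A tree is either a leaf label (str) or a pair (left, right).
    """
    if isinstance(tree, str):
        return [[tree]]
    left, right = tree
    orders = []
    for left_order in leaf_orderings(left):
        for right_order in leaf_orderings(right):
            orders.append(left_order + right_order)  # children as given
            orders.append(right_order + left_order)  # children swapped
    return orders


tree = (("A", "B"), ("C", "D"))
orders = leaf_orderings(tree)
print(len(orders))  # 8, i.e. 2 ** (4 - 1)
```

All of these orderings are valid row orders for the re-ordered expression matrix, so which genes end up adjacent partly depends on an arbitrary choice.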
16
Self Organizing Maps (SOMs)

User picks the number of clusters, called nodes
 Nodes are randomly mapped to M-dimensional
space (M = # of experiments)
 Node values are adjusted by random vectors
picked from the original data
 After node values settle, vectors are
clustered to the closest node
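The SOM steps above can be sketched in a few lines. This is a simplified illustration, not a reference implementation: it assumes the nodes sit on a one-dimensional chain, a linearly decaying learning rate, and a fixed neighbourhood of one node on each side. All function and parameter names are hypothetical.

```python
import random


def closest_node(nodes, vec):
    """Index of the node nearest to vec (squared Euclidean distance)."""
    return min(range(len(nodes)),
               key=lambda i: sum((n - x) ** 2 for n, x in zip(nodes[i], vec)))


def train_som(data, n_nodes, n_iter=500, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    low = [min(v[d] for v in data) for d in range(dim)]
    high = [max(v[d] for v in data) for d in range(dim)]
    # Nodes start at random positions inside the M-dimensional data range.
    nodes = [[rng.uniform(low[d], high[d]) for d in range(dim)]
             for _ in range(n_nodes)]
    for step in range(n_iter):
        rate = 0.5 * (1 - step / n_iter)  # learning rate decays over time
        vec = rng.choice(data)            # random vector from the data
        winner = closest_node(nodes, vec)
        for i, node in enumerate(nodes):
            if i == winner:
                pull = rate               # the winning node moves most
            elif abs(i - winner) == 1:
                pull = rate * 0.5         # chain neighbours move less
            else:
                continue
            for d in range(dim):
                node[d] += pull * (vec[d] - node[d])
    return nodes


def assign(data, nodes):
    """Cluster each vector to its closest node once training has settled."""
    return [closest_node(nodes, v) for v in data]


data = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
nodes = train_som(data, n_nodes=2)
print(assign(data, nodes))
```

The number of nodes, the iteration count, the learning rate, and the neighbourhood size all have to be chosen by the user.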
17
Visualization
1.
2.
Dendrogram
Error Bar Representation
18
Visualization
3.
U-Matrix
19
Advantages

User has partial control over structure
 Fuzzy Clusters
 Variety of visual techniques applicable
20
Disadvantages


Requires knowing the number of clusters
beforehand
Many parameters to specify
21
Principal Component Analysis
(PCA)

Mathematical technique that can be used to
reduce the number of dimensions of data
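To illustrate the dimension reduction, the sketch below finds the first principal component of a small data set by power iteration on the covariance matrix and projects each row onto it. Power iteration is one simple way to obtain the leading component and is an assumption of this sketch; the slides do not say how PCA was computed. It expects non-constant data.

```python
import math


def first_principal_component(rows, n_iter=100):
    """Return (column means, unit vector of the first principal component)."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dim)]
    centered = [[r[d] - means[d] for d in range(dim)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    vec = [1.0] * dim
    for _ in range(n_iter):
        # Multiply by the covariance matrix, then renormalize.
        nxt = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return means, vec


def project(rows, means, vec):
    """Reduce each row to a single coordinate along the component."""
    return [sum((r[d] - means[d]) * vec[d] for d in range(len(vec)))
            for r in rows]


rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
means, pc1 = first_principal_component(rows)
print(pc1)
print(project(rows, means, pc1))
```

Keeping the first two or three components in this way is what makes a 2D or 3D scatter plot of high-dimensional expression data possible.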
22
Visualization
23
Advantages

No parameters required
 3D Visualization
24
Disadvantages

Little control over structure
 Running time of O(N^3)
 Not applicable when input is a distance
matrix
25
Biclustering

Clustering of both rows and columns
simultaneously
26
Available Software
Software Name | Description | Available at
F-Scan | Quantification and analysis of fluorescently probed microarrays; scatterplots; multiple image comparison. | http://abs.cit.nih.gov/fscan/
TIGR SpotFinder | Spot identification. | http://www.tigr.org/software/
Cluster | Hierarchical clustering, K-means clustering, Self-Organizing Map (SOM), PCA | http://rana.lbl.gov/EisenSoftware.htm
Genesis | A Java suite containing various tools such as filters, normalization, visualization tools, common clustering algorithms, SOM, k-means, PCA | http://genome.tugraz.at/Software/GenesisCenter.html
J-Express Pro 2.0 | Hierarchical clustering, K-means, Principal Component Analysis, Self-organizing maps, Profile similarity search, Normalization and filtering, Raw data import, Project organization | http://www.molmine.com/frameset/frm_jexpress.htm
TreeView | Cluster output visualization | http://rana.lbl.gov/EisenSoftware.htm
27
Protein Folding
28
Reaction Coordinates



Folding determines the function of a protein
All-atom recreation of a protein is unrealistic
Reaction coordinates are used to describe protein
structure
1. Fraction of Native Contacts
2. Radius of Gyration
3. RMSD from the native structure
4. Number of beta-strand Hydrogen Bonds
5. Number of alpha helix turns
6. Hydrophobic core radius of gyration
7. Principal Components
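Two of the listed reaction coordinates have short textbook definitions that can be sketched directly. The version below assumes equal atom masses for the radius of gyration and assumes the two structures are already superimposed for the RMSD; real analyses usually handle both. Function names and coordinates are illustrative.

```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of the atoms from their centroid.

    Assumes equal masses (a simplification made for this sketch).
    """
    n = len(coords)
    center = [sum(p[k] for p in coords) / n for k in range(3)]
    total = sum(sum((p[k] - center[k]) ** 2 for k in range(3)) for p in coords)
    return math.sqrt(total / n)


def rmsd(coords, native):
    """RMSD between a structure and the native one, atom by atom.

    Assumes both structures are already aligned (no superposition step).
    """
    if len(coords) != len(native):
        raise ValueError("structures must have the same number of atoms")
    total = sum(sum((p[k] - q[k]) ** 2 for k in range(3))
                for p, q in zip(coords, native))
    return math.sqrt(total / len(coords))


print(radius_of_gyration([(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]))  # 1.0
print(rmsd([(0.0, 0.0, 0.0)], [(3.0, 4.0, 0.0)]))               # 5.0
```

Evaluating coordinates like these at every step of a folding trajectory yields one row of numbers per time step, which is why the raw data resembles a microarray expression matrix.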
29
Protein States

While folding, a protein goes through certain
states

The raw data is similar to microarray data.
 Dr. Parida and Dr. Zhou have developed their
own techniques and clustered β-Hairpin data.
30
Reaction Coordinates used on
the β-Hairpin
1. Number of native β-strand hydrogen bonds
2. Radius of gyration of the hydrophobic core
residues
3. Radius of gyration of entire protein
4. Fraction of native contacts
5. Principal component 1
6. Principal component 2
7. Root mean square deviation (RMSD) from
the native structure.
31
Raw Data
32
Patterned Cluster
2
3
0 0.1 4 0.23
23 26 27
RED = Number of columns in pattern (2).
(Also defined as the Pattern Type)
WHITE = Column Number (0 and 4)
PURPLE = Column Value (0.1 and 0.23)
YELLOW = Number of occurrences (3)
GREEN = Occurrences (23, 26, 27)
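A minimal sketch of reading one patterned cluster record. It assumes a four-line layout (pattern type, number of occurrences, column/value pairs, occurrences) and that an occurrence written as `a-b` stands for the inclusive range from a to b, as the ranges on the sample file slide suggest. The function name and the returned dictionary keys are hypothetical, and the real file format may differ.

```python
def parse_patterned_cluster(text):
    """Parse one record; the four-line layout is an assumption."""
    lines = [line.split() for line in text.strip().splitlines()]
    pattern_type = int(lines[0][0])
    n_occurrences = int(lines[1][0])
    # Column number / column value pairs, e.g. "0 0.1 4 0.23".
    pairs = lines[2]
    columns = {int(pairs[i]): float(pairs[i + 1])
               for i in range(0, len(pairs), 2)}
    occurrences = []
    for token in lines[3]:
        if "-" in token:
            start, end = token.split("-")
            occurrences.extend(range(int(start), int(end) + 1))
        else:
            occurrences.append(int(token))
    if len(columns) != pattern_type:
        raise ValueError("pattern type does not match the column count")
    if len(occurrences) != n_occurrences:
        raise ValueError("occurrence count does not match the list")
    return {"pattern_type": pattern_type,
            "columns": columns,
            "occurrences": occurrences}


record = parse_patterned_cluster("2\n3\n0 0.1 4 0.23\n23 26 27")
print(record)
```

For the record on this slide, the result has pattern type 2, column 0 with value 0.1, column 4 with value 0.23, and occurrences at times 23, 26 and 27.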
33
Sample Patterned Cluster File
2
1006
2
1003
3
1036
3
1056
0 7.335
59728
0 7.335
59728
0 7.335
59728
0 7.335
59728
1 0.735
87235
94826-94831
95748-95752
95761-95763
…
120424-120426
94826-94831
95748-95752
95761-95763
…
95769
94826
94828-94831
…
95761-95763
94826
94828-94831
…
95761-95763
…
95748-95752
1 0.736
87235
4 -5.881
72071
4 -5.881
72071
6 3.292
87235
5 2.214
87235
:
5
1089
2 8.144
45533
3 0.899
59728
4 -3.855
72071
5 -33.574
87235
6 3.292
94826
34
The need for Visual Analysis of
Patterned Cluster Data
β-Hairpin file is approximately 500 MB
 Difficult to study the textual representation
and get a global view
 Very difficult to see interaction of all
patterned clusters in relation to each other
 Also very difficult to remember all
patterned clusters and their occurrence in
time

35
Visual Requirements

Global View
 Navigation & Focus
 Relative growth
 Details of characteristics on demand
36
Need for Customized Tool

All of the existing visualization techniques
for microarray data had one or more drawbacks
 None could depict the relative growth of
clusters.
37
Terrain Metaphor

Has been shown to be a useful technique in
searching a corpus of documents
 Very recently the idea has been applied to
gene expression with high density clusters
representing mountains
38
Using a Landscape Metaphor to
solve our requirements

Each mountain represents a patterned
cluster
 Mountain growth represents evolution of
patterned cluster
 Clicking on mountains returns details of
patterned cluster
39
PROTERAN
40
Mapping of Patterned Cluster
Data into Terrain Geometry
41
Mapping of Patterned Cluster
data into Terrain Geometry

Pattern Type: Number of columns in a patterned
cluster
 Column Combination: Unique number that
identifies a combination of columns
2
3
0 0.1 4 0.23
23 26 27
42
Column Combinations
Number of column combinations = c! / ((c - t)! * t!)
c = number of characteristics
t = pattern type

Pattern Type | Number of Column Combinations
2 | 21
3 | 35
4 | 35
5 | 21
6 | 7
7 | 1
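The counts in the table follow from the formula with c = 7 characteristics (columns 0 to 6). The sketch below reproduces them and also generates the combination labels in the digit-string style used on the layout slides (for example `01` or `0123456`); the function name is illustrative.

```python
from itertools import combinations
from math import comb

N_CHARACTERISTICS = 7  # columns 0..6, matching the seven reaction coordinates


def column_combinations(pattern_type, c=N_CHARACTERISTICS):
    """All column combinations of a pattern type, as digit strings."""
    return ["".join(str(col) for col in combo)
            for combo in combinations(range(c), pattern_type)]


for t in range(2, N_CHARACTERISTICS + 1):
    labels = column_combinations(t)
    # comb(c, t) equals c! / ((c - t)! * t!)
    assert len(labels) == comb(N_CHARACTERISTICS, t)
    print(t, len(labels), labels[:3])
```

Summed over pattern types 2 to 7, this gives 120 column combinations, each of which needs its own fixed place in the terrain.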
43
Layout

We first considered an automated layout technique.
 However, one of Dr. Zhou’s requirements was that the
same patterned cluster should always appear in the same
position for consistent interpretation.
 Another was that the larger pattern types (6 and 7 columns)
must be placed so that they clearly stand out.
 Hence we decided on the manual layout design
described next.
44
Layout
Upper part (2-column | 5-column | 3-column combinations):
01 02 03 | 01234 01235 01236 | 012 013 014 015 016
04 05 06 | 01245 01246 01256 | 023 024 025 026 034
12 13 14 | 01345 01346 01356 | 035 036 045 046 056
15 16 23 | 01456 02345 02346 | 123 124 125 126 134
24 25 26 | 02356 02456 03456 | 135 136 145 146 156
34 35 36 | 12345 12346 12356 | 234 235 236 245 246
45 46 56 | 12456 13456 23456 | 256 345 346 356 456

Lower part (4-column combinations, with the 6- and 7-column combinations placed among them; listed in slide order):
0123 0124 0125 0126 0134 0135 0136 0145 0146 0156
0123456
012345 012346 012356
0234 0235 0236 0245 0246
012456 013456 023456
0256 0345 0346 0356 0456
1234 1235 1236 1245 1246 1256 1345 1346 1356 1456
2345 2346 2356 2456 3456
123456
45
Top Patterned Clusters
Visualized

A final requirement from Dr. Parida and Dr. Zhou was that
only the 10 largest patterned clusters of each
column combination should be visualized
[Figure: placement of the 10 patterned clusters with the highest occurrence counts for combination 01, labeled from Highest Occurrence to 10th Highest Occurrence]
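The top-10 selection can be sketched as a group-and-sort step. The sketch assumes that "largest" means the highest number of occurrences, as the figure labels suggest, and that each patterned cluster is a dictionary with `combination` and `occurrences` keys; that data structure is hypothetical.

```python
from collections import defaultdict


def top_clusters(clusters, limit=10):
    """Keep the `limit` clusters with the most occurrences per combination."""
    by_combination = defaultdict(list)
    for cluster in clusters:
        by_combination[cluster["combination"]].append(cluster)
    return {
        combo: sorted(group,
                      key=lambda c: len(c["occurrences"]),
                      reverse=True)[:limit]
        for combo, group in by_combination.items()
    }


# Twelve hypothetical clusters of combination 01 with 1..12 occurrences.
clusters = [{"combination": "01", "occurrences": list(range(size))}
            for size in range(1, 13)]
kept = top_clusters(clusters)
print([len(c["occurrences"]) for c in kept["01"]])
```

With at most 10 clusters per combination, every combination's cell in the layout holds a bounded number of mountains.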
46
PROTERAN LAYOUT
47
Animated Terrain Evolution

Time proceeds from 0 to the maximum
number of experiments
 At each time unit, all patterned clusters are
checked
 If a patterned cluster occurs at that time, its
mountain’s height is increased
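The animation rule above can be sketched as a loop over time that records one height per mountain per frame. The fixed height increment per occurrence and all names are assumptions; the slides do not specify how much a mountain grows.

```python
def terrain_heights(cluster_occurrences, max_time, step=1.0):
    """Return one {cluster: height} snapshot per time unit.

    cluster_occurrences maps a cluster name to the times at which it occurs.
    The constant `step` per occurrence is an assumption of this sketch.
    """
    occurrences = {name: set(times)
                   for name, times in cluster_occurrences.items()}
    heights = {name: 0.0 for name in occurrences}
    frames = []
    for t in range(max_time + 1):
        # Check every patterned cluster at this time unit.
        for name, times in occurrences.items():
            if t in times:
                heights[name] += step  # the mountain grows
        frames.append(dict(heights))
    return frames


frames = terrain_heights({"a": [0, 2], "b": [2]}, max_time=2)
for t, frame in enumerate(frames):
    print(t, frame)
```

Playing the frames in order shows which mountains rise first, which is how relative growth of patterned clusters becomes visible.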
48
Mountains of PROTERAN
49
Results & Extensions
50
Results


Very encouraging feedback
Easy-to-use layout; the interface allows:
1. Identification of states
2. Retrieval of patterned cluster values
3. Viewing how patterned clusters relate to each other as
they grow over time

After his initial use, Dr. Zhou said that he
“was able to find that the hydrophobic core is
largely formed before the beta-strand hydrogen
bonds are formed.”
51
Future of PROTERAN

Introduced at the Intelligent Systems For
Molecular Biology (ISMB) in Scotland –
Received very well
 Robert-Cedergren Bioinformatics
Colloquium at University of Montreal (Sept
23-24th)
52
Extensions

Analyze with different types of protein data
 More generic layout with more
characteristics
 Application with different types of data
53
Summary
1. Review of existing techniques to cluster and
visualize gene expression data
2. Protein characteristics data is similar to
gene expression data
3. None of the existing techniques met our requirements,
hence the need for a customized visualization
4. Terrain metaphor to solve our requirements,
implemented in the program PROTERAN
54
Questions
55