Transcript vizstruct

Visualization and Microarray
• Complement to numerical analysis
• Offers insightful information
• Detects the structure of dataset
• Early / late stage of data mining
• Challenges of Microarray Visualization
– High dimensionality
– Large data size
– Intuitive layout
– Low time complexity
University at Buffalo The State University of New York
An Example – Early Stage
General Approaches
• Global Visualizations
– Encode each dimension uniformly by the same visual cue
– Parallel coordinates
General Approaches, cont’d
• Optimal Visualizations
– Estimate the parameters and assess the fit of various
spatial distance models for proximity data
– Multidimensional scaling (MDS)
• Sammon’s mapping: topology preservation. Two samples that
are close to each other have to stay close when projected.
Sammon’s mapping
• Sammon’s mapping is a classical case of MDS
• MDS optimizes 2-D presentation to preserve
distances in original N-dimensional space
• Sammon’s mapping iteratively minimizes
$$E = \frac{1}{\sum_{i}\sum_{j>i} d^{*}_{ij}} \sum_{i}\sum_{j>i} \frac{(d^{*}_{ij} - d_{ij})^{2}}{d^{*}_{ij}}$$
$d^{*}_{ij}$ is the distance between points i and j in the N-dimensional space
$d_{ij}$ is the distance between points i and j in the visualization.
2D to 1D
A method for achieving this projection
1. D1, D2 and D3 (the interpoint distances in the higher
dimensional space) are calculated.
2. P1', P2' and P3' are generated randomly in the lower
dimensional space.
3. The mapping error, E, is calculated for all the
interpoint distances in the lower dimensional space.
4. The gradient showing the direction which minimizes
the error is calculated.
5. The points in the lower dimensional space are moved
according to the direction given by the gradient.
6. Steps 3 to 5 are repeated until E is below a given
limit.
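The six steps above can be sketched in Python with NumPy. This is a minimal illustration, not the presenters' implementation: the function names, the step size, the iteration cap and the step-halving safeguard (which rejects a move that would increase E) are all assumptions added here, and the data points are assumed to be distinct so that no original distance is zero.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance matrix between the rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def sammon_error(d_star, d):
    """Mapping error E, summed over all pairs i < j."""
    iu = np.triu_indices_from(d_star, k=1)
    return ((d_star[iu] - d[iu]) ** 2 / d_star[iu]).sum() / d_star[iu].sum()

def sammon(data, n_iter=300, limit=1e-4, step=1.0, seed=0):
    """Hypothetical helper: project `data` to 2-D by gradient descent on E."""
    rng = np.random.default_rng(seed)
    n = len(data)
    d_star = pairwise_distances(data)              # step 1: original distances
    c = d_star[np.triu_indices(n, k=1)].sum()      # normalising constant of E
    y = rng.normal(size=(n, 2))                    # step 2: random start
    eye = np.eye(n)                                # keeps diagonal divisions finite
    errors = [sammon_error(d_star, pairwise_distances(y))]   # step 3
    for _ in range(n_iter):
        if errors[-1] < limit:                     # step 6: stop below the limit
            break
        d = pairwise_distances(y)
        # step 4: dE/dy_i = -(2/c) * sum_j (d*_ij - d_ij) / (d*_ij * d_ij) * (y_i - y_j)
        ratio = (d_star - d) / ((d_star + eye) * (d + eye))
        diff = y[:, None, :] - y[None, :, :]
        grad = (-2.0 / c) * (ratio[:, :, None] * diff).sum(axis=1)
        candidate = y - step * grad                # step 5: move along the gradient
        err = sammon_error(d_star, pairwise_distances(candidate))
        if err < errors[-1]:                       # accept only moves that lower E
            y, step = candidate, step * 1.1
            errors.append(err)
        else:                                      # assumption: halve the step otherwise
            step *= 0.5
    return y, errors
```

Each iteration recomputes all pairwise distances, which is where the quadratic cost in the number of points comes from. The returned `errors` list records E after every accepted move.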
Sammon’s mapping, cont’d
• Some drawbacks
– Computationally intensive, time complexity O(n^2)
– How to determine the best initialization
– No user interaction is permitted
– Adding new data points requires rerunning the process to get a new minimized projection
– Information loss
General Approaches, cont’d
• Projective Visualizations
– Use projection functions to achieve a low
dimensional display
– Radial Visualizations
• RadViz
• Star Coordinates
• VizStruct
Comparison of Approaches
Approach | Advantages | Disadvantages
Global visualization | Display all dimensional information, no computation | Severe overlapping, large space to display
Optimal visualization | Achieve optimal result, sound theoretical basis | Lack user interaction, heavy computation
Projection visualization | Concise display, little computation | Lack rigorous proof, may not be optimal
Challenges of Microarray Visualization
• High dimensionality
• Large data size
• Intuitive layout
• Low time complexity
Density or Heat Plots
• Widely used with arrays
• Works well only for structured data
• Quantitative information is lost
• Gets easily cluttered
[Figure: heat plot. Axes: Genes, Sample. Panels: Before IFN, After IFN. Color scale: 0 to 1 (Increased).]
TreeView Visualization
Principal component analysis
PCA:
• Linear projection of data onto the major principal components defined by the eigenvectors of the covariance matrix.
• PCA is also used for reducing the dimensionality of the data.
• Criterion to be minimised: square of the distance between the original and projected data. This is fulfilled by the Karhunen-Loève transformation
$$x_P = P x$$
P is composed of the eigenvectors of the covariance matrix
$$C = \frac{1}{n-1} \sum_{i} (x_i - \mu)(x_i - \mu)^{t}$$
Example: Leukemia data sets by Golub et al.: Classification of ALL and AML
Multi-linear scaling
Sammon’s mapping:
• Non-linear multi-dimensional scaling methods such as Sammon's mapping aim to optimally conserve the distances of a higher-dimensional space in the 2- or 3-dimensional space.
• Mathematically: minimisation of the error function E by the steepest descent method:
$$E = \frac{1}{\sum_{i<j} D_{ij}} \sum_{i<j}^{N} \frac{(D_{ij} - d_{ij})^{2}}{D_{ij}}$$
Example: DLBCL prognosis – cured vs fatal cases
Our Visualization Approach
Gene Space
Fourier Harmonic Projection
Sample Space
Geometric Interpretation
N-dimensional space
Two-dimensional space
An Example of the Mapping
P=[a,a,…a] -> ?
First Fourier Harmonic Projection
N-dimensional space
Two-dimensional space
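The transcript names the projection but does not carry its formula. A minimal sketch, assuming the first Fourier harmonic is the k = 1 coefficient of the discrete Fourier transform, F1(x[n]) = sum over n of x[n] exp(-2πi n / N), with no normalisation; the sign convention and the function name `first_harmonic` are assumptions. The real and imaginary parts of the result give the two plot coordinates.

```python
import numpy as np

def first_harmonic(x):
    """Map N-dimensional points to the complex plane via the k = 1 DFT coefficient.

    x has shape (N,) for one point or (samples, N) for many.
    """
    x = np.asarray(x, dtype=float)
    n = x.shape[-1]
    basis = np.exp(-2j * np.pi * np.arange(n) / n)   # N unit vectors around the circle
    return x @ basis

# A constant vector [a, a, ..., a]: the N unit vectors cancel out.
constant_point = first_harmonic(np.full(6, 2.5))
```

Under this assumed definition, a constant vector [a, a, …, a] lands at the origin (up to rounding), because the N equally spaced unit vectors sum to zero.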
Analytical Properties
Scaling and Transpose Property
Transpose
Shift
Scaling
Original
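The properties named on this slide can be checked numerically. The sketch below again assumes the projection is the k = 1 DFT coefficient. It also assumes that "scaling" means multiplying a point by a constant, "shift" means adding a constant to every dimension, and "transpose" means reversing the order of the dimensions; these readings are assumptions, since the transcript gives only the labels.

```python
import numpy as np

def first_harmonic(x):
    """k = 1 DFT coefficient of a real vector (assumed form of the projection)."""
    n = len(x)
    return np.dot(x, np.exp(-2j * np.pi * np.arange(n) / n))

x = np.array([2.0, 7.0, 1.0, 8.0, 2.0, 8.0])
z = first_harmonic(x)

scaled = first_harmonic(3.0 * x)       # scaling: image is scaled by the same factor
shifted = first_harmonic(x + 5.0)      # shift: image is unchanged
reversed_ = first_harmonic(x[::-1])    # reversal: mirrored image, rotated by 2*pi/N
rotation = np.exp(2j * np.pi / len(x))
```

Under these assumptions, `scaled` equals `3 * z`, `shifted` equals `z`, and `reversed_` equals `rotation * conj(z)`, so reversed profiles appear as mirror images of each other up to a fixed rotation.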
Time Shifting Property
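A numerical check of the time shifting property, as a sketch: it assumes the projection is the k = 1 DFT coefficient and that the shift is circular (values pushed past the last time point wrap around to the first). Neither assumption is stated in the transcript.

```python
import numpy as np

def first_harmonic(x):
    """k = 1 DFT coefficient of a real vector (assumed form of the projection)."""
    n = len(x)
    return np.dot(x, np.exp(-2j * np.pi * np.arange(n) / n))

x = np.array([0.0, 1.0, 4.0, 2.0, 1.0, 0.5, 0.0, 0.0])
m = 3                                              # shift by three time points
z = first_harmonic(x)
z_shifted = first_harmonic(np.roll(x, m))          # y[n] = x[(n - m) mod N]
expected = np.exp(-2j * np.pi * m / len(x)) * z    # same radius, rotated angle
```

Under these assumptions a time-shifted profile keeps its distance from the origin and is rotated by an angle proportional to the shift.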
Visual Exploration Framework
• Explorative Visualization – Sample space
• Confirmative Visualization – Gene space
VizStruct Architecture
[Architecture diagram. Labels: Client, WebBrowser, Internet, Intranet, Web Server, Matlab, Applications, Libraries.]
VizStruct User Interface
VizStruct User Interface (3)
Cartesian Plot
Polar plot
VizStruct User Interface (2)
EM Mixture
Density contour
Sample Classification
Binary Classification
• Binary classification: two sample classes
• Evaluation: hold out and cross validation
Leukemia-A
– 72 samples with 7129 genes
– 38 (27+11) training, 34 (20+14) testing, hold-out evaluation
Multiple Sclerosis
– 44 samples, 4132 genes
– MS_IFN (28), MS_CON (30), cross validation evaluation
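The two evaluation schemes named above can be sketched as index splits. This is an illustration only: the transcript does not say how the splits were drawn or how many folds were used, so the random shuffling, the fold count k and the function names are all assumptions.

```python
import numpy as np

def hold_out_split(n_samples, n_train, seed=0):
    """Split sample indices once into a training part and a testing part."""
    order = np.random.default_rng(seed).permutation(n_samples)
    return order[:n_train], order[n_train:]

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train, test) index arrays; every sample is tested exactly once."""
    folds = np.array_split(np.random.default_rng(seed).permutation(n_samples), k)
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]

# Sizes taken from the slide: 72 samples split into 38 training and 34 testing.
train_idx, test_idx = hold_out_split(72, 38)
# Hypothetical fold count for a 44-sample data set.
splits = list(k_fold_splits(44, k=4))
```

Hold-out evaluates once on a fixed testing set, while cross validation rotates the testing role through all folds, which suits the smaller data sets.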
Multiple Classification
Breast Cancer
– 22 samples with 3226 genes
– 3 classes: BRCA1 (7), BRCA2 (8), Sporadic (7)
– Cross validation evaluation
SRBCT
– 88 samples with 2308 genes
– 4 classes: RMS, BL, NB, EWS
– 63 training and 25 testing
Classification Summary
Temporal Pattern (1)
Nortriptyline
10-OH Nortriptyline
Temporal Pattern (2)
Idealized temporal gene expression profiles
• Rat Kidney data set of Stuart et al. (2001) contains 873 genes at 7 time points during kidney development
• There are 5 patterns or gene groups classified by the authors
• Parallel coordinates show that the actual data comply with the profiles but with some noise
Parallel coordinates for each of the gene groups
Temporal Pattern (3)
• Genes are very similar except at the last time point
• Genes having a relatively steady increase in expression throughout development
• Genes are somewhat symmetric about the middle time point, i.e., they are transposes of each other
• Genes having very high relative levels of expression in early development
The first Fourier harmonic projection
VizStruct vs. Sammon’s Mapping
[Figure: two scatter plots of samples labeled 1 to 150. Panel titles: VizStruct, Sammon's Mapping. Axis labels: Real Part of F1(x[n]), Imaginary Part of F1(x[n]).]
• VizStruct is similar to Sammon’s mapping
VizStruct - Dimension Tour
• Interactively adjust dimension parameters
• Manually or automatically
• May cause false clusters to break
• Create dynamic visualization
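One way to picture a dimension tour is to give every dimension a weight and redraw the projection as one weight changes. The sketch below is an assumption-laden illustration: the transcript does not define the "dimension parameters", so treating them as per-dimension multipliers applied before a k = 1 DFT projection, and the function names, are hypothetical.

```python
import numpy as np

def weighted_first_harmonic(x, weights):
    """Project rows of x after scaling each dimension by its weight (assumed scheme)."""
    x = np.asarray(x, dtype=float)
    n = x.shape[-1]
    basis = np.exp(-2j * np.pi * np.arange(n) / n)
    return (x * np.asarray(weights, dtype=float)) @ basis

def dimension_tour(x, dim, values):
    """Return one projection (frame) per value taken by the weight of dimension `dim`."""
    frames = []
    for value in values:
        weights = np.ones(np.shape(x)[-1])
        weights[dim] = value                       # vary a single dimension parameter
        frames.append(weighted_first_harmonic(x, weights))
    return np.array(frames)

samples = np.array([[1.0, 2.0, 3.0, 4.0],
                    [4.0, 3.0, 2.0, 1.0],
                    [1.0, 1.0, 5.0, 1.0]])
frames = dimension_tour(samples, dim=2, values=np.linspace(0.0, 2.0, 5))
```

Playing the frames in sequence gives a dynamic view: points that stay together across frames are more likely to form a real cluster than points that overlap in only one projection.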
Visualized Results for a Time Series Data Set
Interrelated Dimensional Clustering
The approach is applied to classifying multiple-sclerosis patients and IFN-drug-treated patients.
– (A) Shows the distribution of the original 28 samples. Each point represents a sample, mapped from the sample's 4132-gene intensity vector.
– (B) Shows the distribution of the 28 samples on 2015 genes.
– (C) Shows the distribution of the 28 samples on 312 genes.
– (D) Shows the distribution of the same 28 samples after using our approach. We reduce 4132 genes to 96 genes.
References
• Li Zhang, Aidong Zhang, and Murali Ramanathan. VizStruct: Exploratory Visualization for Gene Expression Profiling. Bioinformatics, 20: 85-92, 2004.
• Li Zhang, Chun Tang, Yuqing Song, Aidong Zhang, and Murali Ramanathan. VizCluster and Its Application on Clustering Gene Expression Data. International Journal of Distributed and Parallel Databases, 13(1): 73-97, 2003.
• Li Zhang, Aidong Zhang, and Murali Ramanathan. Enhanced Visualization of Time Series through Higher Fourier Harmonics. In Proceedings of BIOKDD 2003, Washington DC, August 2003, pp 49-56.
• Li Zhang, Aidong Zhang, and Murali Ramanathan. Fourier Harmonic Approach for Visualizing Temporal Patterns of Gene Expression Data. In Proceedings of IEEE Computer Society Bioinformatics Conference (CSB 2003), Stanford, CA, August 2003, pp 131-141.
• Li Zhang, Aidong Zhang, and Murali Ramanathan. Visualized Classification of Multiple Sample Types. In Proceedings of BIOKDD 2002, Edmonton, Alberta, Canada, July 2002, pp 55-62.
• Li Zhang, Chun Tang, Yong Shi, Yuqing Song, Aidong Zhang, and Murali Ramanathan. VizCluster: An Interactive Visualization Approach to Cluster Analysis and Its Application on Microarray Data. In Proceedings of the Second SIAM International Conference on Data Mining (SDM02), Arlington, VA, April 2002, pp 2951.