Transcript slides
A Rank-by-Feature Framework for Interactive Exploration of Multidimensional Data
Jinwook Seo, Ben Shneiderman
University of Maryland
Hyun Young Song ([email protected])
Maryam Farboodi ([email protected])
Feb 09, 2006
1
HCE 3.0
HCE (Hierarchical Clustering Explorer)
Main Idea: GRID principles
Graphics, Ranking and Interaction for Discovery
Feature
Application
http://www.cs.umd.edu/hcil/hce/
User Manual
http://www.cs.umd.edu/hcil/hce/hce3manual/hce3_manual.html
Dataset
http://www.cs.umd.edu/hcil/hce/examples/application_examples.html
2
Axis-Parallel vs. Non-Axis-Parallel
Approach
Definition (3 dimensions X, Y & Z)
Axis-parallel: projection onto X & Y, X & Z, or Y & Z
Non-axis-parallel: can project onto a·X + b·Y & Z
Trade-off: simplicity (axis-parallel) vs. power (non-axis-parallel)
Users
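The distinction can be sketched in a few lines of Python; the data points and the weights a, b are illustrative values, not from the paper:

```python
# Hypothetical 3-D data points (X, Y, Z).
points = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]

# Axis-parallel projection: keep two of the original axes, e.g. X & Z.
xz = [(x, z) for x, y, z in points]

# Non-axis-parallel projection: a*X + b*Y paired with Z; the weights
# a and b are arbitrary illustrative values.
a, b = 0.6, 0.8
combo = [(a * x + b * y, z) for x, y, z in points]
```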
3
Related Work
Axis-parallel: machine learning, information visualization
Pattern recognition
Subsets of dimensions used to find specific patterns
Machine learning and data mining
Supervised/unsupervised classification
Subspace-based clustering analysis
Projections that naturally partition the data set
Information visualization
Permutation matrix
Parallel coordinates: dimension ordering
Conditional entropy
4
Related Work (cont'd)
Non axis-parallel: statisticians
Two-dimensional projection
SOM (Self Organizing Maps)
XGobi:
Grand tour, Projection pursuit
No ranking
HD-Eye
Interactive hierarchical clustering
OptiGrid (partitioning clustering algorithm)
5
Major Contributions
GRID (Graphics, Ranking and Interaction for
Discovery)
Study 1D, study 2D, then find features
Ranking guides insight, statistics confirm
Visualization Techniques
Overview
Coordination (multiple windows)
Dynamic query (item slider)
6
General Overview
Menu
Toolbar
Overviews, Color setting
Dendrogram (binary tree), scatterplot
7 tabs
Color mosaic, Table view, Histogram ordering,
Scatterplot ordering, Profile search, Gene
ontology, K-means
7
General Overview
8
Load/Transform Data
Natural Log
Standardization
Normalization
To the first column
Median
Linear scaling
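A minimal sketch of these transformations applied to one hypothetical row of values (the data and names are illustrative; HCE's exact normalization options may differ in detail):

```python
import math
from statistics import mean, pstdev, median

values = [2.0, 4.0, 8.0, 16.0]          # hypothetical expression row

log_values = [math.log(v) for v in values]            # natural log
m, s = mean(values), pstdev(values)
standardized = [(v - m) / s for v in values]          # standardization (z-scores)
to_first = [v / values[0] for v in values]            # normalize to the first column
to_median = [v / median(values) for v in values]      # normalize to the median
lo, hi = min(values), max(values)
scaled = [(v - lo) / (hi - lo) for v in values]       # linear scaling to [0, 1]
```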
9
Clustering Algorithm
1. Initially, each data item is a cluster by itself
2. Merge the pair of clusters with the highest similarity value
3. Update the similarity values
4. Repeat steps 2 & 3 n − 1 times to reach one cluster of size n
No predefined number of clusters
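The four steps can be sketched as a naive agglomerative loop (single linkage shown for brevity; the points and distance function are illustrative):

```python
def agglomerate(points, dist):
    # Step 1: each data item starts as a cluster by itself.
    clusters = [[p] for p in points]
    # Step 4: repeat n - 1 merges until one cluster of size n remains.
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-linkage distance between clusters i and j.
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        # Step 2: merge the closest (most similar) pair.
        _, i, j = best
        clusters[i] += clusters.pop(j)
        # Step 3: similarities are recomputed on the next pass.
    return clusters[0]

merged = agglomerate([1.0, 2.0, 10.0], lambda a, b: abs(a - b))
```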
10
Choosing Algorithm Parameters
11
Linkage Method
Average Linkage (Cn = Ci ∪ Cj):
DIST(Cn, Ck) = (|Ci| / (|Ci| + |Cj|)) · DIST(Ci, Ck) + (|Cj| / (|Ci| + |Cj|)) · DIST(Cj, Ck)
Average Group Linkage:
DIST(Cn, Ck) = DIST(Mean(Cn), Mean(Ck))
Complete Linkage:
DIST(Cn, Ck) = Max(DIST(Ci, Ck), DIST(Cj, Ck))
Single Linkage:
DIST(Cn, Ck) = Min(DIST(Ci, Ck), DIST(Cj, Ck))
Shneiderman's 1-by-1 Linkage:
Tries first to grow the cluster newly merged in the last iteration
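The four linkage rules reduce to small helper functions; the cluster sizes and the 1-D centroid example below are illustrative:

```python
from statistics import mean

def average_linkage(d_ik, d_jk, ni, nj):
    # Size-weighted average of DIST(Ci, Ck) and DIST(Cj, Ck).
    return (ni * d_ik + nj * d_jk) / (ni + nj)

def complete_linkage(d_ik, d_jk):
    # Distance via the farther of the two sub-clusters.
    return max(d_ik, d_jk)

def single_linkage(d_ik, d_jk):
    # Distance via the nearer of the two sub-clusters.
    return min(d_ik, d_jk)

def average_group_linkage(cn, ck):
    # Distance between cluster centroids (1-D points for simplicity).
    return abs(mean(cn) - mean(ck))
```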
12
Dendrogram View
13
7 Tabs
14
1D Histogram Interface
Interface description
Control panel, Score overview, Ordered list,
Histogram browser
15
1D Histogram Ordering
Ranking criteria
Normality of the distribution (0~∞):
|s| + |k − 3|  (s: skewness, k: kurtosis)
Uniformity of the distribution (0~∞):
entropy H = −Σ_{i=1..k} p_i · log2(p_i)
Number of potential outliers (0~n):
IQR = Q3 − Q1, d: item value
Suspected outlier: d < Q1 − 1.5·IQR or d > Q3 + 1.5·IQR
Extreme outlier: d < Q1 − 3·IQR or d > Q3 + 3·IQR
Number of unique values (0~n)
Size of the biggest gap (0~max. dimension range)
(mf: max frequency, t: tolerance; gap threshold t · mf)
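Three of these criteria can be sketched directly from the formulas above; the quartile computation here is a deliberately crude approximation:

```python
import math
from statistics import mean, pstdev

def normality_score(xs):
    # |s| + |k - 3|: zero for a perfectly normal distribution.
    m, s = mean(xs), pstdev(xs)
    skew = sum(((x - m) / s) ** 3 for x in xs) / len(xs)
    kurt = sum(((x - m) / s) ** 4 for x in xs) / len(xs)
    return abs(skew) + abs(kurt - 3)

def uniformity_score(probs):
    # Entropy H = -sum p_i * log2(p_i) over k histogram bins.
    return -sum(p * math.log2(p) for p in probs if p > 0)

def outlier_counts(xs):
    # Tukey fences: 1.5*IQR for suspected, 3*IQR for extreme outliers.
    ys = sorted(xs)
    q1, q3 = ys[len(ys) // 4], ys[3 * len(ys) // 4]   # crude quartiles
    iqr = q3 - q1
    suspected = sum(1 for d in xs if d < q1 - 1.5 * iqr or d > q3 + 1.5 * iqr)
    extreme = sum(1 for d in xs if d < q1 - 3 * iqr or d > q3 + 3 * iqr)
    return suspected, extreme
```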
16
2D Scatterplot Interface
Interface description
Control panel, Score overview, Ordered list,
Scatterplot browser
17
2D Scatterplot Ordering
Ranking criteria
Statistical relationship
Correlation coefficient (−1~1): Pearson's coefficient
r(S) = Σ_{i=1..n} (x_i − x̄)(y_i − ȳ) / sqrt( Σ_{i=1..n} (x_i − x̄)² · Σ_{i=1..n} (y_i − ȳ)² )
Least-square error for curvilinear regression (0~1)
Quadracity (−∞~∞)
Distribution characteristics
Number of potential outliers (0~n)
LOF-based: density-based outlier detection
Number of items in the area of interest (0~n)
Uniformity (0~∞):
entropy(S) = −Σ_{i=1..k} Σ_{j=1..k} p_ij · log2(p_ij)
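The correlation and 2-D uniformity criteria can be sketched as follows (the k × k cell probabilities for the entropy would come from a grid laid over the scatterplot; the inputs here are illustrative):

```python
import math

def pearson_r(xs, ys):
    # r(S) = sum((x - x_bar)(y - y_bar)) /
    #        sqrt(sum((x - x_bar)^2) * sum((y - y_bar)^2))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                    sum((y - my) ** 2 for y in ys))
    return num / den

def entropy_2d(cell_probs):
    # -sum_i sum_j p_ij * log2(p_ij) over the k x k grid cells.
    return -sum(p * math.log2(p) for row in cell_probs for p in row if p > 0)
```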
18
Demo
19
System Constraints
Computational Complexity
n data items in m-dimensional space: O(n·m²)
O(n): scoring complexity per dimension pair
O(m²): number of dimension combinations
Display Constraints
Appropriate number of dimensions for the score overview component: 0~130
Lack of sliders to adjust displacement
20
Evaluation of HCE 3.0
Linear color mapping (3-color or 1-color)
Consistent layout of the components
Focus+context
F: dendrogram – C: rank-by-feature
F: ordered list – C: histogram, scatterplot
Item slider
Dynamic query
Multi-window view
Dynamic update of data selection across windows
21
Future Work
HCE 3.0 (HCE 3.5):
1D, 2D axis-parallel projection
Numerical data format
Limited number of applicable datasets (US cities, cereal, netscan, …)
1D: 5 ranking criteria; 2D: 6 ranking criteria
HCE 4.0??:
3D projection
Numerical + categorical, binary, nominal data
More meaningful datasets to demonstrate the power of each ranking criterion
Incorporate more criteria into the rank-by-feature framework
User study
Various statistical tools and data mining algorithms
22
Thank you!
Questions?
23