Transcript slides

A Rank-by-Feature Framework for Interactive Exploration of Multidimensional Data
Jinwook Seo, Ben Shneiderman
University of Maryland
Presented by Hyun Young Song ([email protected]) and Maryam Farboodi ([email protected])
Feb 09, 2006
HCE 3.0
• HCE (Hierarchical Clustering Explorer)
• Main idea: GRID principles
  • Graphics, Ranking and Interaction for Discovery
• Feature
• Application
  • http://www.cs.umd.edu/hcil/hce/
• User manual
  • http://www.cs.umd.edu/hcil/hce/hce3manual/hce3_manual.html
• Dataset
  • http://www.cs.umd.edu/hcil/hce/examples/application_examples.html
Axis-Parallel vs. Non-Axis-Parallel Approach
• Definition
  • 3 dimensions X, Y & Z
  • Axis-parallel: projection onto either X & Y, X & Z, or Y & Z
  • Non-axis-parallel: can project onto a·X + b·Y & Z (sketched in code below)
• Simplicity vs. power
• Users
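A minimal sketch of the distinction, assuming NumPy and synthetic 3-dimensional data (the array, coefficients a and b, and variable names are illustrative, not from the slides): an axis-parallel view simply selects two of the original dimensions, while a non-axis-parallel view uses a linear combination of dimensions as one display axis.

```python
import numpy as np

# Toy data: 100 points in 3 dimensions X, Y, Z (columns 0, 1, 2).
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3))

# Axis-parallel projection: pick two original dimensions, e.g. the (X, Z)
# scatterplot, one of the views the rank-by-feature framework ranks.
xz_view = data[:, [0, 2]]

# Non-axis-parallel projection: one display axis is a linear combination
# a*X + b*Y (as in projection pursuit / the grand tour), the other is Z.
a, b = 0.6, 0.8
combined_axis = a * data[:, 0] + b * data[:, 1]
nonparallel_view = np.column_stack([combined_axis, data[:, 2]])

print(xz_view.shape, nonparallel_view.shape)  # (100, 2) (100, 2)
```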
Related Work
• Axis-parallel: machine learning, information visualization
• Pattern recognition
  • Subset of dimensions to find specific patterns
• Machine learning and data mining
  • Supervised / unsupervised classification
  • Subspace-based clustering analysis
  • Projections naturally partitioning the data set
• Information visualization
  • Permutation matrix
  • Parallel coordinates: dimension ordering
  • Conditional entropy
Related Work (cont'd)
• Non-axis-parallel: statisticians
• Two-dimensional projection
• SOM (Self-Organizing Maps)
• XGobi
  • Grand tour, projection pursuit
  • No ranking
• HD-Eye
  • Interactive hierarchical clustering
  • OptiGrid (partitioning clustering algorithm)
Major Contributions
• GRID (Graphics, Ranking and Interaction for Discovery)
  • Study 1D, study 2D, then find features
  • Ranking guides insight, statistics confirm
• Visualization techniques
  • Overview
  • Coordination (multiple windows)
  • Dynamic query (item slider)
General Overview
• Menu
• Toolbar
• Overviews, color setting
  • Dendrogram (binary tree), scatterplot
• 7 tabs
  • Color mosaic, Table view, Histogram ordering, Scatterplot ordering, Profile search, Gene ontology, K-means
General Overview
(screenshot)
Load / Transform Data
• Natural log
• Standardization
• Normalization
  • To the first column
  • Median
• Linear scaling
(transforms sketched in code below)
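A minimal sketch of these transforms, assuming NumPy. The slide does not say whether HCE applies them per row or per column, so treating "first column" and "median" normalization as row-wise (common for expression data) is an assumption, and the function and option names are illustrative, not HCE's.

```python
import numpy as np

def transform(data, method):
    """Data transforms like those offered when loading data into HCE 3.0 (illustrative names)."""
    x = np.asarray(data, dtype=float)
    if method == "natural_log":     # natural log of every value
        return np.log(x)
    if method == "standardize":     # zero mean, unit variance per column
        return (x - x.mean(axis=0)) / x.std(axis=0)
    if method == "norm_first":      # each row divided by its first column
        return x / x[:, [0]]
    if method == "norm_median":     # each row divided by its median
        return x / np.median(x, axis=1, keepdims=True)
    if method == "linear_scale":    # each column scaled linearly to [0, 1]
        lo, hi = x.min(axis=0), x.max(axis=0)
        return (x - lo) / (hi - lo)
    raise ValueError(f"unknown transform: {method}")
```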
Clustering Algorithm
1. Initially, each data item is a cluster by itself
2. Merge the pair of clusters with the highest similarity value
3. Update the similarity values
4. Repeat steps 2 & 3 n − 1 times to reach one cluster of size n
• No predefined number of clusters (loop sketched in code below)
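A minimal sketch of that bottom-up loop, assuming a caller-supplied similarity function over clusters given as index lists; it is a naive O(n³) illustration of the slide's four steps, not HCE's implementation. The similarity could, for instance, be the negative of one of the linkage distances on the next slide.

```python
from itertools import combinations

def agglomerative(n_items, similarity):
    """Naive agglomerative clustering following the slide's steps 1-4.

    `similarity(a, b)` scores two clusters given as lists of item indices.
    Returns the sequence of merged clusters (the dendrogram's internal nodes).
    """
    clusters = [[i] for i in range(n_items)]      # 1. each item is its own cluster
    merges = []
    while len(clusters) > 1:                      # 4. repeat n - 1 times
        # 2. find the pair of clusters with the highest similarity
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda pair: similarity(clusters[pair[0]], clusters[pair[1]]))
        merged = clusters[a] + clusters[b]
        merges.append(merged)
        # 3. "update similarity values": similarities are simply recomputed
        #    against the merged cluster on the next iteration
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges
```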
Choosing Algorithm Parameters
Linkage Method
(distances between the newly merged cluster Cn = Ci ∪ Cj and another cluster Ck; sketched in code below)
• Average linkage
  Dist(Cn, Ck) = (|Ci| / (|Ci| + |Cj|)) · Dist(Ci, Ck) + (|Cj| / (|Ci| + |Cj|)) · Dist(Cj, Ck)
• Average group linkage
  Dist(Cn, Ck) = Dist(Mean(Cn), Mean(Ck))
• Complete linkage
  Dist(Cn, Ck) = Max(Dist(Ci, Ck), Dist(Cj, Ck))
• Single linkage
  Dist(Cn, Ck) = Min(Dist(Ci, Ck), Dist(Cj, Ck))
• Shneiderman's 1-by-1 linkage
  • Tries to grow the newly merged cluster of the last iteration first
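A minimal sketch of these linkage rules, assuming Euclidean distance and NumPy. The update-style functions take the already-known distances Dist(Ci, Ck) and Dist(Cj, Ck) plus the cluster sizes, while average group linkage works on the raw points; the 1-by-1 linkage is a merge-ordering heuristic rather than a distance formula, so it is not shown.

```python
import numpy as np

# Distance from the merged cluster Cn = Ci ∪ Cj to another cluster Ck,
# given d_ik = Dist(Ci, Ck), d_jk = Dist(Cj, Ck) and the cluster sizes.
def average_linkage(d_ik, d_jk, size_i, size_j):
    return (size_i * d_ik + size_j * d_jk) / (size_i + size_j)

def complete_linkage(d_ik, d_jk):
    return max(d_ik, d_jk)

def single_linkage(d_ik, d_jk):
    return min(d_ik, d_jk)

# Average group linkage compares the cluster centroids (means) directly.
def average_group_linkage(cn_points, ck_points):
    cn_mean = np.mean(cn_points, axis=0)
    ck_mean = np.mean(ck_points, axis=0)
    return float(np.linalg.norm(cn_mean - ck_mean))
```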
Dendrogram View
(screenshot)
7 Tabs
1D Histogram Interface
• Interface description
  • Control panel, Score overview, Ordered list, Histogram browser
1D Histogram Ordering
• Ranking criteria (scores sketched in code below)
  • Normality of the distribution (0~∞)
    • |s| + |k − 3|, where s: skewness, k: kurtosis
  • Uniformity of the distribution (0~∞)
    • entropy H = −Σ_{i=1..k} p_i · log2(p_i)
  • Number of potential outliers (0~n)
    • IQR = Q3 − Q1, d: item value
    • Suspected outlier: d < Q1 − 1.5·IQR or d > Q3 + 1.5·IQR
    • Extreme outlier: d < Q1 − 3·IQR or d > Q3 + 3·IQR
  • Number of unique values (0~n)
  • Size of the biggest gap (0~max. dim. range)
    • mf: max frequency, t: tolerance; frequency threshold t·mf
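A minimal sketch of these 1D scores, assuming NumPy and SciPy. HCE's exact estimators are not given on the slide, so the histogram bin count here is arbitrary, and the biggest-gap score uses the simple maximum spacing between sorted values rather than the slide's frequency tolerance t·mf.

```python
import numpy as np
from scipy import stats

def rank_1d(column, bins=20):
    """Per-dimension scores in the spirit of HCE's 1D ranking criteria (a sketch)."""
    x = np.asarray(column, dtype=float)

    # Normality: |skewness| + |kurtosis - 3|  (Pearson kurtosis; normal = 3)
    normality = abs(stats.skew(x)) + abs(stats.kurtosis(x, fisher=False) - 3)

    # Uniformity: entropy of a fixed-width histogram, -sum p_i * log2(p_i)
    counts, _ = np.histogram(x, bins=bins)
    p = counts / counts.sum()
    uniformity = -np.sum(p[p > 0] * np.log2(p[p > 0]))

    # Potential outliers: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    outliers = int(np.sum((x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)))

    # Number of unique values, and size of the biggest gap between sorted values
    unique_values = len(np.unique(x))
    biggest_gap = float(np.max(np.diff(np.sort(x)))) if len(x) > 1 else 0.0

    return normality, uniformity, outliers, unique_values, biggest_gap
```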
2D Scatterplot Interface
• Interface description
  • Control panel, Score overview, Ordered list, Scatterplot browser
2D Scatterplot Ordering
• Ranking criteria (partly sketched in code below)
  • Statistical relationship
    • Correlation coefficient (−1~1): Pearson's coefficient
      r(S) = Σ_{i=1..n} (x_i − x̄)(y_i − ȳ) / ( sqrt(Σ_{i=1..n} (x_i − x̄)²) · sqrt(Σ_{i=1..n} (y_i − ȳ)²) )
    • Least-square error for curvilinear regression (0~1)
    • Quadracity (−∞~∞)
  • Distribution characteristics
    • Number of potential outliers (0~n)
      • LOF-based: density-based outlier detection
    • Number of items in area of interest (0~n)
    • Uniformity (0~∞): entropy(S) = −Σ_{i=1..k} Σ_{j=1..k} p_ij · log2(p_ij)
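A minimal sketch of two of these 2D scores (Pearson correlation and 2D-histogram entropy), assuming NumPy; the curvilinear-regression error, quadracity, LOF-based outlier count, and area-of-interest score are omitted, and the bin count is an arbitrary choice rather than HCE's.

```python
import numpy as np

def rank_2d(x, y, bins=10):
    """Scores for one scatterplot (one pair of dimensions), sketching two criteria."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)

    # Pearson correlation coefficient, r in [-1, 1]
    r = np.corrcoef(x, y)[0, 1]

    # Uniformity: entropy of a k x k 2D histogram, -sum p_ij * log2(p_ij)
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p = counts.ravel() / counts.sum()
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))

    return r, entropy
```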
Demo
System Constraints
• Computational complexity
  • n data items in m-dimensional space: O(n·m²)
    • O(n): scoring cost per dimension pair
    • O(m²): number of dimension pairs, i.e. m(m−1)/2 scatterplots to score
• Display constraints
  • Appropriate number of dimensions for the score overview component: 0~130
  • Lack of sliders to adjust displacement
Evaluation of HCE 3.0
• Linear color mapping (3-color or 1-color)
• Consistent layout of the components
• Focus + context
  • F: dendrogram – C: rank-by-feature
  • F: ordered list – C: histogram, scatterplot
• Item slider
  • Dynamic query
• Multi-window view
  • Dynamic update of data selections across windows
Future Work
• HCE 3.0 (HCE 3.5)
  • 1D, 2D axis-parallel projections
  • Numerical data format
  • Limited number of applicable datasets (US cities, cereal, netscan …)
  • 1D: 5 ranking criteria; 2D: 6 ranking criteria
• HCE 4.0 ??
  • 3D projection
  • Numerical + categorical, binary, nominal data
  • More meaningful datasets to demonstrate the power of each ranking criterion
  • Incorporate more criteria into the rank-by-feature framework
  • User study
  • Various statistical tools and data mining algorithms
Thank you!
Questions?