Statistical Data Mining - 3
Edward J. Wegman
A Short Course for Interface ‘01
Visual Data Mining
Outline of Lecture

Visual Complexity
Description of Basic Techniques
  Parallel Coordinates
  Grand Tour
  Saturation Brushing
Illustrations of Basic Techniques
  Rapid Data Editing, Density Estimation (Pollen Data)
  Inverse Regression, Tree Structured Decision Rules (Bank Data)
  Classification & Clustering (SALAD Data & Artificial Nose)
  Structural Inference (PRIM 7 Data)
  Data Mining (BLS Cereal Scanner Data)
  Cluster Trees (Oronsay Sand Particle Size Data)
Visual Complexity
Scenarios
Typical high resolution workstations, 1280 x 1024 = 1.31 x 10^6 pixels
Realistic using Wegman, immersion, 4:5 aspect ratio, 2333 x 1866 = 4.35 x 10^6 pixels
Very optimistic using 1 minute arc, immersion, 4:5 aspect ratio, 8400 x 6720 = 5.64 x 10^7 pixels
Wildly optimistic using Maar(2), immersion, 4:5 aspect ratio, 17,284 x 13,828 = 2.39 x 10^8 pixels
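The pixel totals in these scenarios are simply width times height. A minimal sketch that recomputes them; the dictionary labels and the helper name `pixel_count` are illustrative and not part of the lecture:

```python
# Display sizes (width, height) taken from the four scenarios above.
scenarios = {
    "typical workstation": (1280, 1024),
    "realistic (Wegman)": (2333, 1866),
    "very optimistic (1 minute arc)": (8400, 6720),
    "wildly optimistic (Maar(2))": (17284, 13828),
}


def pixel_count(width, height):
    """Total number of addressable pixels on a width x height display."""
    return width * height


for name, (w, h) in scenarios.items():
    # Scientific notation makes the order of magnitude easy to compare.
    print(f"{name}: {w} x {h} = {pixel_count(w, h):.2e} pixels")
```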
Visual Complexity
Visualization for Data Mining can realistically hope
to deal with somewhere on the order of 10^6 to 10^7
observations. This coincides with the approximate
limits for interactive computing of O(n^2) algorithms
and for data transfer. This also roughly corresponds
to the number of foveal cones in the eye.
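To give a feel for why quadratic algorithms become limiting in this range, the sketch below counts the work an all-pairs computation would need at the two sample sizes mentioned. The function name `pairwise_operations` is illustrative, and no particular machine speed is assumed:

```python
def pairwise_operations(n):
    """Number of unordered pairs among n observations: n(n-1)/2."""
    return n * (n - 1) // 2


for exponent in (6, 7):
    n = 10 ** exponent
    # n^2 is the nominal cost; the pair count is roughly half of that.
    print(f"n = 10^{exponent}: n^2 = {n ** 2:.1e}, "
          f"pairs = {pairwise_operations(n):.1e}")
```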
Methodologies for Visual Data Mining

Parallel Coordinates
  Effective Method for High Dimensional Data
  High Dimensions = Multiple Attributes

Grand Tour
  Generalized Rotation in High Dimensions
  In Depth Study of High Dimensional Data

Saturation Brushing
  Effective Method for Large Data Sets
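Saturation brushing, the last method listed, can be sketched as additive color accumulation: each observation contributes a small amount of color to its pixel, so heavily overplotted regions saturate while sparse regions stay faint. This is a minimal sketch under assumptions, not the lecture's implementation; the function name, the `increment` value, and the unit-square coordinates are all hypothetical:

```python
from collections import defaultdict


def saturation_brush(points, width, height, increment=0.05):
    """Accumulate a small color increment per point into a pixel grid.

    points: iterable of (x, y) pairs with coordinates in [0, 1].
    Returns a dict mapping (col, row) to a saturation value in (0, 1].
    """
    grid = defaultdict(float)
    for x, y in points:
        # Map the unit square onto pixel indices, keeping x = 1.0 in range.
        col = min(int(x * width), width - 1)
        row = min(int(y * height), height - 1)
        # Add a little color; clamp so heavy overplotting saturates at 1.0.
        grid[(col, row)] = min(1.0, grid[(col, row)] + increment)
    return dict(grid)


# Thirty coincident points saturate one pixel; a lone point stays faint.
image = saturation_brush([(0.55, 0.55)] * 30 + [(0.15, 0.95)], width=10, height=10)
print(image)
```

The small increment is the design choice that matters: with millions of points, an opaque brush would paint every occupied pixel the same color and hide the density structure.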
Visual Data Mining Techniques
Multidimensional Data Visualization
Scatterplot matrix
Parallel coordinate plots
3-D stereoscopic scatterplots
Grand tour on all plot devices
Density plots
Linked views
Saturation brushing
Pruning and cropping
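Of the techniques listed, the parallel coordinate plot has the simplest geometry: each d-dimensional observation becomes a polyline with one vertex on each of d parallel axes. The sketch below computes only those vertices (no drawing). The min-max scaling and the name `parallel_coordinates` are assumptions made for illustration:

```python
def parallel_coordinates(rows):
    """Map each d-dimensional row to a polyline of d (axis, height) vertices.

    Each attribute is min-max scaled to [0, 1] so all axes share one range.
    """
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    polylines = []
    for row in rows:
        line = []
        for axis, (value, lo, hi) in enumerate(zip(row, lows, highs)):
            span = hi - lo
            # A constant attribute has no spread; place it mid-axis.
            height = (value - lo) / span if span else 0.5
            line.append((axis, height))
        polylines.append(line)
    return polylines


rows = [(1.0, 10.0, 5.0), (2.0, 30.0, 5.0), (3.0, 20.0, 5.0)]
for line in parallel_coordinates(rows):
    print(line)
```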
Crystal Vision
Data Editing and Density Estimation
Pollen Data
3848 points
5 dimensions
Inverse Regression and Tree Structured
Decision Rules with Financial Data

Bank Demographic Data in 8 Dimensions
with 12,000+ points
Classification and Clustering Using
SALAD Data

Chemical Agent Detection Data in 13
Dimensions with 10,000+ points
Artificial Dog Nose
19 dimensional time series in 2 spectral bands
60 time steps for 300 chemical species
Artificial Dog Nose - Figures
Time series in two spectral bands for same chemical species
Phase loop
Orthogonal components
After grand tour, orthogonal variables x2*, x9*, x15*, x16*, x18* separate the two spectral bands
Four chemical species, target highlighted in red
Target species separated by x1*, x3*, x5*, x6*, x11*, x15*
PRIM-7
7 dimensional high energy physics data
500 data points
pi-meson proton interaction
Structural Inference Using
PRIM 7 Data
Scanner Data for
Breakfast Cereals
5.5 gigabytes of scanner data in relational database
Price, sales volume, promotion, store, chain, PSU, UPC
Work done at BLS
Phase 1 – Basic Data Analysis – Single Month
Phase 2 – Price Relative Effects – 1 Year
Phase 3 – Churning Effects – 5 Years
Scanner Data for Breakfast Cereals - Figures
Promotion has huge impact on sales volume
Stores not randomized
Aggressive promotion pays
Phase 2
Outliers belong to same chain
Promotion both years
Range of items with no promotion
One chain ceased promotions
Phase 3
Churning comes from both new items and new stores
Churning Effects: Red: PR=0, Blue: PR>0, Green: PR=infinity
New items tend to have higher prices
Many discontinued items have high expenditures
Effect of item churning
Removing Store Birth-Death Effects
Outlier due to price coding error
Effects of Cereal Types
Quantity Effects
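The churning color code (red for PR=0, blue for PR>0, green for PR=infinity) can be sketched as follows. The slides do not define PR; this sketch assumes it is the price relative, the ratio of current-period to base-period price, and further assumes that a missing price counts as zero, so a discontinued item gets PR=0 and a new item gets PR=infinity. The function names are illustrative:

```python
import math


def price_relative(base_price, current_price):
    """Ratio of current-period price to base-period price.

    Assumption: a missing price (None) is treated as zero, so a discontinued
    item gets PR = 0 and a newly introduced item gets PR = infinity.
    """
    base = base_price or 0.0
    current = current_price or 0.0
    if base == 0.0:
        # No base price: infinite if the item now exists, undefined otherwise.
        return math.inf if current > 0.0 else math.nan
    return current / base


def churn_color(pr):
    """Color code from the churning slide: red PR=0, blue PR>0, green PR=infinity."""
    if math.isnan(pr):
        return None
    if pr == 0.0:
        return "red"
    if math.isinf(pr):
        return "green"
    return "blue"


for base, current in [(2.0, None), (2.0, 3.0), (None, 3.0)]:
    pr = price_relative(base, current)
    print(base, current, pr, churn_color(pr))
```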
Sands of Time Data
300 Samples of Sand Data from Oronsay Island in the Scottish Hebrides
Sands of Time - Objective
“The mesolithic shell middens on the island
of Oronsay are one of the most important
archeological sites in Britain. It is of
considerable interest to determine their
position with respect to the mesolithic
coastline. If the sand below the midden were
beach sand and the sand from the upper
layers dune sand, this would indicate a
seaward shift of the beach-dune interface.”
Flenley and Olbricht, 1993
Sands of Time - Objective
Cluster samples of modern sand into “beach-like” or
“dune-like” sand.
Classify archeological sand samples as to whether they
are beach sand or dune sand.
Sands of Time –
Parametric Analysis
Historical strategy is to fit parametric distributions and
compare modern and archeological sands based on
parameters.
Weibull, 1933; lognormal (breakage models), log-hyperbolic, log-skew-Laplace, 1937, Barndorff-Nielsen, 1977.
Models have 2 to 4 parameters; theory developed, practice problematic.
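As an illustration of the parametric strategy, the sketch below fits the two-parameter lognormal model by maximum likelihood (mean and standard deviation of the log sizes) to two samples and prints the parameters for comparison. The particle sizes are simulated and entirely hypothetical; they are not the Oronsay measurements, and the lecture does not say which fitting procedure was used:

```python
import math
import random
import statistics


def fit_lognormal(sizes):
    """Maximum-likelihood lognormal fit: mean and std. dev. of log sizes."""
    logs = [math.log(s) for s in sizes]
    # pstdev divides by n, which is the maximum-likelihood estimate.
    return statistics.fmean(logs), statistics.pstdev(logs)


# Hypothetical particle sizes, for illustration only (not the Oronsay data).
rng = random.Random(0)
modern = [rng.lognormvariate(-1.0, 0.3) for _ in range(2000)]
archeological = [rng.lognormvariate(-0.8, 0.5) for _ in range(2000)]

for label, sample in (("modern", modern), ("archeological", archeological)):
    mu, sigma = fit_lognormal(sample)
    print(f"{label}: mu = {mu:.2f}, sigma = {sigma:.2f}")
```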
Sands of Time - Graphical Analysis
Multidimensional Parallel Coordinate Display
Combined with Grand Tour.
BRUSH-TOUR strategy
Clusters recognized by gaps in any horizontal axis.
Brush existing clusters with colors.
Execute grand tour until new clusters appear, brush again.
Continue until clusters are exhausted.
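The BRUSH-TOUR loop above is interactive and visual, but its logic can be imitated in code. In this rough sketch, random orthogonal rotations stand in for the continuous grand tour, a gap-width threshold stands in for the analyst's eye, and integer labels stand in for brush colors. The names `brush_tour` and `min_gap`, the threshold value, and the test data are all hypothetical:

```python
import random


def random_rotation(dim, rng):
    """Random orthonormal basis via Gram-Schmidt on Gaussian vectors."""
    basis = []
    while len(basis) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for b in basis:
            dot = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - dot * bi for vi, bi in zip(v, b)]
        norm = sum(vi * vi for vi in v) ** 0.5
        if norm > 1e-9:
            basis.append([vi / norm for vi in v])
    return basis


def largest_gap(values):
    """Return (gap width, split point) for the widest gap between sorted values."""
    ordered = sorted(values)
    gap, left = max((b - a, a) for a, b in zip(ordered, ordered[1:]))
    return gap, left + gap / 2.0


def brush_tour(points, steps=50, min_gap=1.0, seed=0):
    """Label points by repeatedly rotating and splitting clusters at axis gaps."""
    rng = random.Random(seed)
    dim = len(points[0])
    labels = [0] * len(points)
    for _ in range(steps):
        # One tour step: view the data in a new rotated coordinate system.
        basis = random_rotation(dim, rng)
        rotated = [[sum(p[k] * b[k] for k in range(dim)) for b in basis]
                   for p in points]
        for axis in range(dim):
            for label in set(labels):
                members = [i for i, l in enumerate(labels) if l == label]
                if len(members) < 2:
                    continue
                gap, split = largest_gap([rotated[i][axis] for i in members])
                if gap >= min_gap:
                    # Brush the points beyond the gap with a new color.
                    new_label = max(labels) + 1
                    for i in members:
                        if rotated[i][axis] > split:
                            labels[i] = new_label
    return labels


# Three tight, well-separated synthetic clusters in 3 dimensions.
rng = random.Random(1)
centers = [(0.0, 0.0, 0.0), (5.0, 5.0, 5.0), (5.0, -5.0, 0.0)]
points = [[c + rng.uniform(-0.1, 0.1) for c in center]
          for center in centers for _ in range(20)]
labels = brush_tour(points)
print(sorted(set(labels)))
```

A fixed threshold only works when clusters are well separated, as in this synthetic example. The visual procedure leaves that judgment to the analyst at each step.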
Mining the Sands of Time
Sands of Time - Conclusions
Sands from the CC site and the CNG site have
considerably different particle size distributions and
cannot be effectively aggregated.
Data at small and at large particle dimensions is too
quantized to be used effectively.
The visual based BRUSH-TOUR strategy is extremely
effective at clustering.
Sands of Time - Conclusions Continued
Midden sands are neither modern beach sands nor
modern dune sands.
Midden sands are more similar to modern dune sands.
This result does not support the seaward-shift-of-the-beach-dune-interface hypothesis, but suggests the
middens were always in the dunes.