ICS 278: Data Mining
Exploratory Data Analysis
and Visualization
Data Mining Lectures
Lecture on EDA and Visualization
Padhraic Smyth, UC Irvine
Lecture 4
• Project proposals due next Thursday
• Today’s lecture
– Exploratory Data Analysis and Visualization
• Summary statistics
• 1 and 2 dimensional data visualization
• Higher dimensional visualization
• Examples of complex visualization
– Reading: Chapter 3 in the text
– Note: slides at end of last lecture on covariance, Mahalanobis distance, etc., will be revisited in next lecture
Exploratory Data Analysis (EDA)
• get a general sense of the data
• interactive and visual
– (cleverly/creatively) exploit human visual power to see patterns
• 1 to 5 dimensions (e.g. spatial, color, time, sound)
– e.g. plot raw data/statistics, reduce dimensions as needed
• data-driven (model-free)
• especially useful in early stages of data mining
– detect outliers (e.g. assess data quality)
– test assumptions (e.g. normal distributions or skewed?)
– identify useful raw data & transforms (e.g. log(x))
• http://www.itl.nist.gov/div898/handbook/eda/eda.htm
• Bottom line: it is always well worth looking at your data!
Summary Statistics
• not visual
• sample statistics of data X
– mean: $\mu = \sum_i X_i / n$   { $\mu$ minimizes $\sum_i (X_i - \mu)^2$ }
– mode: most common value in X
– median: X = sort(X), median = $X_{n/2}$ (half below, half above)
– quartiles of sorted X: Q1 value = $X_{0.25n}$, Q3 value = $X_{0.75n}$
• interquartile range: value(Q3) - value(Q1)
• range: max(X) - min(X) = $X_n - X_1$
– variance: $\sigma^2 = \sum_i (X_i - \mu)^2 / n$
– skewness: $\sum_i (X_i - \mu)^3 / [ (\sum_i (X_i - \mu)^2)^{3/2} ]$
• zero if symmetric; right-skewed more common (e.g. you v. Bill Gates)
– number of distinct values for a variable (see unique.m in MATLAB)
– Note: all of these are estimates based on the sample at hand – they
may be different from the “true” values (e.g., median age in US).
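The statistics listed above can be sketched in a few lines. This is a minimal illustration in Python (standard library only) rather than the MATLAB used elsewhere in these slides; the function name `summary` and the `incomes` values are made up for the example, and the quartiles use one of several common conventions, so other tools may give slightly different Q1 and Q3 values.

```python
import statistics

def summary(x):
    """Sample statistics from the slide, for a list of numbers x."""
    n = len(x)
    xs = sorted(x)
    mu = sum(xs) / n
    ss = sum((v - mu) ** 2 for v in xs)           # sum of squared deviations
    q1, q2, q3 = statistics.quantiles(xs, n=4)    # one of several quartile conventions
    return {
        "mean": mu,
        "median": statistics.median(xs),
        "mode": statistics.mode(xs),
        "iqr": q3 - q1,
        "range": xs[-1] - xs[0],
        "variance": ss / n,                       # divides by n, as on the slide
        # slide's formula; the usual standardized skewness is sqrt(n) times this
        "skewness": sum((v - mu) ** 3 for v in xs) / ss ** 1.5,
        "distinct": len(set(xs)),
    }

incomes = [20, 25, 30, 30, 35, 40, 1000]   # made-up values with one extreme point
stats = summary(incomes)
print(stats)
```

With one very large value in the sample, the mean is pulled far above the median and the skewness is positive, which is the "you v. Bill Gates" effect noted above.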
Exploratory Data Analysis
Tools for Displaying Single Variables
Histogram
• Most common form: split data range into equal-sized bins. Then for each bin, count the number of points from the data set that fall into the bin.
– Vertical axis: Frequency (i.e., counts for each bin)
– Horizontal axis: Response variable
• The histogram graphically shows the following:
1. center (i.e., the location) of the data;
2. spread (i.e., the scale) of the data;
3. skewness of the data;
4. presence of outliers; and
5. presence of multiple modes in the data.
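The binning step described above can be sketched as follows. This is an illustrative Python version, not the plotting routine used in the lecture; `histogram_counts` and the `data` values are hypothetical, and the sketch assumes the data are not all equal (so the bin width is positive).

```python
def histogram_counts(x, k):
    """Count how many values of x fall into each of k equal-width bins."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / k            # assumes hi > lo
    counts = [0] * k
    for v in x:
        # the maximum value is placed in the last bin rather than a (k+1)th one
        b = min(int((v - lo) / width), k - 1)
        counts[b] += 1
    edges = [lo + i * width for i in range(k + 1)]
    return edges, counts

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]   # made-up values
edges, counts = histogram_counts(data, 4)
for left, c in zip(edges, counts):
    print(f"{left:5.1f} | {'*' * c}")    # crude text histogram, one row per bin
```

Even in text form the rows show the center, the spread, and the isolated value in the last bin.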
Issues with Histograms
• For small data sets, histograms can be misleading. Small changes in the data or to the bucket boundaries can result in very different histograms.
• Interactive bin-width example (online applet)
– http://www.stat.sc.edu/~west/javahtml/Histogram.html
• For large data sets, histograms can be quite effective at illustrating general properties of the distribution.
• Can smooth histogram using a variety of techniques
– E.g., kernel density estimation (pages 59-61 in text)
• Histograms effectively only work with 1 variable at a time
– Difficult to extend to 2 dimensions, not possible for >2
– So histograms tell us nothing about the relationships among variables
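As a rough illustration of the smoothing idea mentioned above, here is a minimal kernel density estimate with Gaussian kernels. The choice of a Gaussian kernel, the name `gaussian_kde`, the bandwidth values, and the data are all assumptions made for this sketch; the text may present the method differently.

```python
import math

def gaussian_kde(data, h):
    """Return a function estimating the density with Gaussian kernels of width h."""
    n = len(data)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))
    def density(x):
        # average of one Gaussian bump centered on each data point
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
    return density

data = [1.0, 1.2, 1.9, 2.1, 2.3, 5.0]   # made-up values
smooth = gaussian_kde(data, h=1.0)       # wide kernel: smooth estimate
rough = gaussian_kde(data, h=0.1)        # narrow kernel: spiky estimate
print(smooth(2.0), rough(2.0))
```

The bandwidth h plays the same role as the bin width in a histogram: too small and the estimate follows individual points, too large and real structure is blurred.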
Histogram Example
Classical bell-shaped, symmetric histogram, with most of the frequency counts bunched in the middle and the counts dying off in the tails. From a physical science/engineering point of view, the Normal/Gaussian distribution often occurs in nature (due in part to the central limit theorem).
ZipCode Data: Population
[Figure: histograms of the ZIP code population data with K = 50 and K = 500 bins (horizontal axis x 10^4), and a K = 50 histogram of populations below 5000 only]
ZipCode Data: Population
• MATLAB code:
X = zipcode_data(:,2);      % second column from zipcode array
histogram(X, 50)            % histogram of X with 50 bins
histogram(X, 500)           % 500 bins
index = X < 5000;           % identify X values lower than 5000
histogram(X(index), 50)     % now plot just these X values
Histogram Detecting Outlier (Missing Data)
blood pressure = 0 ?
Right Skewness Example: Credit Card Usage
Similarly right-skewed are power law distributions ($P_i \sim 1/i^a$, where $a \ge 1$),
e.g. for a = 1 we have “Zipf’s law” for word frequencies in text
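To see how right-skewed a power law is, the sketch below normalizes the weights 1/i^a over a finite number of ranks. The function name `power_law` and the choice of 1000 ranks are assumptions made for illustration only.

```python
def power_law(n, a=1.0):
    """Probabilities P_i proportional to 1 / i**a for ranks i = 1..n."""
    weights = [1.0 / i ** a for i in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

p = power_law(1000, a=1.0)        # a = 1: Zipf's law
top10_share = sum(p[:10])         # mass held by the 10 highest-ranked items
print(round(top10_share, 3))
```

Under these assumptions the top 1% of ranks carries well over a third of the total probability, so a histogram of such data has a long right tail.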
Box (and Whisker) Plots: Pima Indians Data
[Figure: box plots for the healthy and diabetic groups]
• box contains middle 50% of data; box height = Q3-Q1
• line inside box = Q2 (median)
• whiskers extend up to 1.5 x (Q3-Q1) (or shorter, if no data that far above Q3)
• plots all data outside whiskers
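The quantities drawn in a box plot can be computed directly. The following is a sketch under assumptions: the quartile convention (`method="inclusive"`), the function name `box_plot_stats`, and the sample values are choices made for this example, and plotting packages may use slightly different quartile rules.

```python
import statistics

def box_plot_stats(x):
    """Quartiles, whisker ends, and points outside the whiskers."""
    xs = sorted(x)
    q1, q2, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    reach = 1.5 * (q3 - q1)                      # maximum whisker length
    inside = [v for v in xs if q1 - reach <= v <= q3 + reach]
    return {
        "q1": q1, "median": q2, "q3": q3,
        "whisker_low": inside[0],    # whiskers stop at the last data point in reach
        "whisker_high": inside[-1],
        "outside": [v for v in xs if v < q1 - reach or v > q3 + reach],
    }

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]   # made-up values
box = box_plot_stats(values)
print(box)
```

Here the upper whisker stops at 9, shorter than the full 1.5 x (Q3-Q1) reach, and the value 30 is reported as a point outside the whiskers.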
Time Series Example 1
annual fees introduced in UK (many users cut back to 1 credit card)
Time Series Example 2
[Figure annotations]
• steady growth trend
• summer peaks
• summer bifurcations in air travel (favor early/late)
• New Year bumps
Time-Series Example 3
Scotland experiment: “milk in kid diet leads to better health”?
20,000 kids: 5k raw, 5k pasteurized, 10k control (no supplement)
[Figure: mean weight vs mean age for 10k control group]
Would expect smooth weight growth plot. Visually reveals unexpected pattern (steps), not apparent from raw data table.
Possible explanations:
• Grow less early in year than later?
• No steps in height plots; so why does height increase uniformly but weight in spurts?
• Kids weighed in clothes: summer garb lighter than winter?
Non-Stationarity
• Stationarity:
– (loose definition) A probability distribution p(x | t) is stationary with respect to t if p(x | t) = p(x) for all t, where x is the set of variables of interest, and t is some other varying quantity (e.g., usually t = time, but could represent spatial information, group information, etc.)
• Examples:
– p(customer demographics today) = p(customer demographics next month)?
– p(weights in Scotland) = p(weights in US)?
– p(income of customers in Bank 1) = p(income of customers in Bank 2)?
• Non-stationarity is common in real data sets
– Solutions?
– Model non-stationarity (e.g., increasing trend over time) and extrapolate
– Build model only on most recent/most similar data
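One crude way to probe whether p(x | t) = p(x) is to compare a summary statistic across values of t. The sketch below compares window means of a time series; it is only an informal check under assumptions (the function name `window_means`, the window width, and the synthetic series are invented here), not a formal test of stationarity.

```python
import statistics

def window_means(x, width):
    """Mean of each consecutive, non-overlapping window of the series x."""
    return [statistics.fmean(x[i:i + width])
            for i in range(0, len(x) - width + 1, width)]

# made-up monthly values with an upward trend plus a repeating wiggle
series = [10 + 0.5 * t + (1 if t % 2 else -1) for t in range(24)]
means = window_means(series, width=6)
drift = means[-1] - means[0]      # large drift suggests the mean depends on t
print(means, drift)
```

If the window means differ by much more than the variation within a window, a model fitted on the early data should not be expected to describe the later data.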
Exploratory Data Analysis
Tools for Displaying Pairs of Variables
2D Scatter Plots
• standard tool to display relation between 2 variables
– e.g. y-axis = response, x-axis = suspected indicator
• useful to answer:
– x,y related?
• no
• linearly
• nonlinearly
– variance(y) depend on x?
– outliers present?
• MATLAB:
– plot(X(1,:),X(2,:),'.');
[Figure: credit card repayment: low-low, high-high]
Scatter Plot: No apparent relationship
Scatter Plot: Linear relationship
Scatter Plot: Quadratic relationship
Scatter plot: Homoscedastic
Variation of Y Does Not Depend on X
Scatter plot: Heteroscedastic
variation in Y differs depending on the value of X
e.g., Y = annual tax paid, X = income
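A numerical companion to the two scatter plots is to compare the spread of Y within bins of X. The sketch below does this on synthetic data; the function name `spread_by_bin`, the number of bins, and both data-generating rules are assumptions made purely for illustration.

```python
import random
import statistics

def spread_by_bin(x, y, k):
    """Standard deviation of y within each of k equal-width bins of x."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / k
    groups = [[] for _ in range(k)]
    for xi, yi in zip(x, y):
        groups[min(int((xi - lo) / width), k - 1)].append(yi)
    return [statistics.pstdev(g) if len(g) > 1 else None for g in groups]

random.seed(0)   # fixed seed so the made-up data is reproducible
x = [random.uniform(1, 10) for _ in range(2000)]
y_homo = [2 * xi + random.gauss(0, 1.0) for xi in x]          # noise level fixed
y_hetero = [2 * xi + random.gauss(0, 0.5 * xi) for xi in x]   # noise grows with x

# the linear trend adds the same amount of spread to every bin,
# so differences between bins reflect the noise level
homo = spread_by_bin(x, y_homo, 3)
hetero = spread_by_bin(x, y_hetero, 3)
print(homo, hetero)
```

In the homoscedastic case the per-bin spreads are roughly equal; in the heteroscedastic case the spread in the highest bin of X is clearly larger than in the lowest.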
[Figure: scatter plot from US Zip code data (each point = 1 Zip code); vertical axis: MEDIAN HOUSEHOLD INCOME (units = dollars, x 10^5); horizontal axis: MEDIAN PERCAPITA INCOME (x 10^4)]
Problems with Scatter Plots of Large Data
appears: later apps older; reality: downward slope (more apps, more variance)
96,000 bank loan applicants
scatter plot degrades into black smudge ...
Contour Plots Can Help
recall: same 96,000 bank loan apps as before
shows that the apparent dependence of variance(y) on x is indeed due to horizontal skew in density
[Figure annotations: unimodal, skewed]
Problems with Scatter Plots of Large Data
# weeks credit card buys gas vs groceries
(10,000 customers)
actual correlation (0.48) higher than appears (overprinting)
also demands explanation
A simple data set
Data
X 10.00 8.00 13.00 9.00 11.00 14.00 6.00 4.00 12.00 7.00 5.00
Y 8.04 6.95 7.58 8.81 8.33 9.96 7.24 4.26 10.84 4.82 5.68
Anscombe, Francis (1973), Graphs in Statistical Analysis,
The American Statistician, pp. 195-199.
Summary Statistics
N = 11
Mean of X = 9.0
Mean of Y = 7.5
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.237
Correlation = 0.816
3 more data sets
   X2      Y2      X3      Y3      X4      Y4
10.00    9.14   10.00    7.46    8.00    6.58
 8.00    8.14    8.00    6.77    8.00    5.76
13.00    8.74   13.00   12.74    8.00    7.71
 9.00    8.77    9.00    7.11    8.00    8.84
11.00    9.26   11.00    7.81    8.00    8.47
14.00    8.10   14.00    8.84    8.00    7.04
 6.00    6.13    6.00    6.08    8.00    5.25
 4.00    3.10    4.00    5.39   19.00   12.50
12.00    9.13   12.00    8.15    8.00    5.56
 7.00    7.26    7.00    6.42    8.00    7.91
 5.00    4.74    5.00    5.73    8.00    6.89
Summary Statistics
Summary Statistics of Data Set 2
N = 11
Mean of X = 9.0
Mean of Y = 7.5
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.237
Correlation = 0.816
Summary Statistics of Data Set 3
N = 11
Mean of X = 9.0
Mean of Y = 7.5
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.237
Correlation = 0.816
Summary Statistics of Data Set 4
N = 11
Mean of X = 9.0
Mean of Y = 7.5
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.237
Correlation = 0.816
Visualization really helps!
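The near-identical summary statistics can be checked directly from the four data sets listed on the earlier slides. The sketch below uses ordinary least squares and the Pearson correlation; the helper name `fit_summary` is invented for this example, and the data values are copied from the tables above.

```python
import math

def fit_summary(x, y):
    """Means, least-squares line, and correlation for paired data."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx, "mean_y": my,
        "slope": slope, "intercept": my - slope * mx,
        "correlation": sxy / math.sqrt(sxx * syy),
    }

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]      # X for data sets 1 to 3
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]            # X for data set 4
ys = {
    1: [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    2: [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    3: [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    4: [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}
for k, y in ys.items():
    s = fit_summary(x4 if k == 4 else x123, y)
    print(k, {name: round(v, 2) for name, v in s.items()})
```

The four printed rows agree to about two decimal places, even though scatter plots of the four data sets look completely different.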
Exploratory Data Analysis
Tools for Displaying More than 2 Variables
Multivariate Visualization
• Multivariate -> multiple variables
• 2 variables: scatter plots, etc
• 3 variables:
– 3-dimensional plots
– Look impressive, but often not used
– Can be cognitively challenging to interpret
– Alternatives: overlay color-coding (e.g., categorical data) on 2d scatter plot
• 4 variables:
– 3d with color or time
– Can be effective in certain situations, but tricky
• Higher dimensions
– Generally difficult
– Scatter plots, icon plots, parallel coordinates: all have weaknesses
– Alternative: “map” data to lower dimensions, e.g., PCA or multidimensional scaling
– Main problem: high-dimensional structure may not be apparent in low-dimensional views
Scatter Plot Matrix
For interactive visualization the concept of “linked plots” is generally useful
Trellis Plot
[Figure: trellis plot; panel labels: Older, Younger, Male, Female]
Using Icons to Encode Information, e.g., Star Plots
• Each star represents a single observation. Star plots are used to examine the relative values for a single data point
• The star plot consists of a sequence of equi-angular spokes, called radii, with each spoke representing one of the variables.
• Useful for small data sets with up to 10 or so variables
• Limitations?
– Small data sets, small dimensions
– Ordering of variables may affect perception
Variables (spokes) in the example:
1 Price
2 Mileage (MPG)
3 1978 Repair Record (1 = Worst, 5 = Best)
4 1977 Repair Record (1 = Worst, 5 = Best)
5 Headroom
6 Rear Seat Room
7 Trunk Space
8 Weight
9 Length
Chernoff’s Faces
• each face is described by ten facial characteristic parameters: head eccentricity, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, eye spacing, eye size, mouth length and degree of mouth opening
• Chernoff faces applet
http://people.cs.uchicago.edu/~wiseman/chernoff/
• more icon plots
http://www.statsoft.com/textbook/glosi.html
Parallel Coordinates
(epileptic seizure data from text)
[Figure annotations]
• 1 (of n) cases (this case is a “brushed” one, with a darker line, to stand out from the n-1 other cases)
• dimensions (possibly all p of them!), often (re)ordered to better distinguish among interesting subsets of n total cases
• interactive “brushing” is useful for seeing such distinctions
More elaborate parallel coordinates example (from E. Wegman, 1999).
12,000 bank customers with 8 variables
Additional “dependent” variable is profit (green for positive, red for negative)
Interactive “Grand Tour” Techniques
• “Grand Tour” idea
– Cycle continuously through multiple projections of the data
– Cycles through all possible projections (depending on time constraints)
– Projections can be 1, 2, or 3d typically (often 2d)
– Can link with scatter plot matrices (see following example)
– Asimov (1985)
• e.g. XGOBI visualization package (available on the Web)
– http://public.research.att.com/~stat/xgobi/
• Example on following 2 slides
– 7-dimensional physics data, color-coded by group, shown with
(a) Standard scatter matrix
(b) static snapshot of grand tour
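The projection step behind the grand tour can be sketched as below. This is a simplification and an assumption of this example: it draws independent random 2d projections, whereas an actual grand tour moves smoothly from one projection to the next. The names `random_2d_frame` and `project`, and the random 7-dimensional points, are invented for illustration and are not part of XGOBI.

```python
import math
import random

def random_2d_frame(p, rng):
    """Two orthonormal direction vectors in p dimensions (Gram-Schmidt)."""
    u = [rng.gauss(0, 1) for _ in range(p)]
    v = [rng.gauss(0, 1) for _ in range(p)]
    nu = math.sqrt(sum(a * a for a in u))
    u = [a / nu for a in u]
    dot = sum(a * b for a, b in zip(u, v))
    v = [b - dot * a for a, b in zip(u, v)]     # remove the component along u
    nv = math.sqrt(sum(b * b for b in v))
    v = [b / nv for b in v]
    return u, v

def project(points, frame):
    """Map each p-dimensional point to 2d coordinates in the given frame."""
    u, v = frame
    return [(sum(a * c for a, c in zip(pt, u)),
             sum(a * c for a, c in zip(pt, v))) for pt in points]

rng = random.Random(0)                      # fixed seed for reproducibility
points = [[rng.random() for _ in range(7)] for _ in range(5)]   # made-up 7d data
views = [project(points, random_2d_frame(7, rng)) for _ in range(3)]
print(views[0])
```

Each element of `views` is one 2d scatter plot of the same 7-dimensional points; an interactive tool would display a long sequence of such views in quick succession.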
Example of displaying 4d categorical data, e.g., as used in OLAP/databases
Other aspects (not discussed)
• Cognitive and human-factors aspects of visualization
– In creating visualizations of data it is important to be aware of how the human brain perceives visual information
– E.g., “Rules and principles of scientific data visualization”
• http://www.siggraph.org/education/materials/HyperVis/percept/visrules.htm
• Artistic aspects of visualization
– Classic books by Edward Tufte: http://www.edwardtufte.com/tufte/
• Visualization of other data
– 2d, 3d, 4d “volume” data (fluid flow, brain images, etc)
– Network/graph data
• Issues: graph layout/drawing, issues of graph size
– Many others…, e.g.,
• http://www.cybergeography.org/
• CHI conference, etc
Visualization of weather states for Kenya
Daily data from 20 year history clustered into 3 different weather “states”
Mean image for each state
- wind direction (arrows)
- wind intensity (size of arrows)
- rainfall (size of circles)
- pressure (contours)
S. Kirshner, A. Robertson, P. Smyth, 2004.
Summary of 189k newsgroups and 257M postings
Green = increase in postings in 2004 over 2003, red = decrease
Uses “treemap” technique. Details at http://jcmc.indiana.edu/vol10/issue4/turner.html
Time-line Visualization of Research Fronts
(Morris et al, JASIST, 2003)
Interactive MultiTile Visualization
(Falko Kuester’s HIPerWall system, Calit2, UCI)
Summary
• EDA and Visualization
– Can be very useful for
• data checking
• getting a general sense of individual or pairs of variables
– But…
• do not necessarily reveal structure in high dimensions
• Reading: Chapter 3
• Next lecture:
– projecting/mapping data from high dimensions to low dimensions