8392_S2a - Lyle School of Engineering


CSE 8392 SPRING 1999
DATA MINING: CORE TOPICS
Overview / Statistical Foundation
Professor Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
Dallas, Texas 75275
(214) 768-3087
fax: (214) 768-3085
email: [email protected]
www: http://www.seas.smu.edu/~mhd
January 1999
CSE 8392 Spring 1999
35
DATA MINING APPLICATION:
MARKETING
• Building, Using, and Managing the Data
Warehouse, Edited by Ramon Barquin and Herb
Edelstein, Prentice Hall PTR, 1997, Ch 8.
• Fig 8-1, p139
• Ex p 138
• Modeling goal
– Pick all (and only) buyers (precision/recall)
– Lift: ratio of the response rate among those the model targets to the overall response rate
– Cumulative Lift - Fig 8-6, p 146 (deciles sorted from most to least likely to buy, as predicted by the model)
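The decile-based cumulative lift idea above can be sketched in a few lines of Python (a toy illustration; the function and variable names are ours, not from Barquin/Edelstein):

```python
# Hypothetical sketch of cumulative lift by decile.
# scores: model-predicted purchase likelihoods; bought: actual outcomes (1 = buyer).
def cumulative_lift(scores, bought, n_deciles=10):
    """Sort customers by predicted score (most likely first), then for each
    cumulative decile compare its response rate to the overall response rate."""
    ranked = [b for _, b in sorted(zip(scores, bought), key=lambda t: -t[0])]
    n = len(ranked)
    overall_rate = sum(ranked) / n
    lifts = []
    for d in range(1, n_deciles + 1):
        cutoff = round(n * d / n_deciles)       # size of the cumulative segment
        rate = sum(ranked[:cutoff]) / cutoff    # response rate within the segment
        lifts.append(rate / overall_rate)
    return lifts
```

With a good model the early deciles show lift well above 1; over all ten deciles the lift is exactly 1 by construction.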
DEVELOPING A MODEL
• DM requires fitting a model (pattern) to the data
• Solving Data Mining Problems Through Pattern
Recognition, by Ruby L. Kennedy, Yuchun Lee,
Benjamin Van Roy, Christopher D. Reed, and
Richard D. Lippmann, Prentice Hall PTR, 1995-1997, Section 1.7 (pp. 1-10 to 1-20).
• Fig 1-7, p 1-19
• May preprocess data to reduce overhead.
However, must be careful to avoid introducing a
bias.
PREDICTIVE DATA MINING
• Predictive Data Mining, by Sholom M. Weiss and
Nitin Indurkhya, Morgan Kaufmann, 1998.
• Table 1.1 p8
• Data Reduction
DM Human Participation
• Determine how to transform/reduce data
• Identify important features to model
• Correctly interpret results
FIXED MODELS
• A fixed formula describes how the output is derived from the input. The application is fully understood - no need to look at the data.
• Ex: Fixed threshold for loans
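A minimal sketch of a fixed model in Python, using a hypothetical loan-threshold rule (the cutoff value is an assumption for illustration, not from the slide):

```python
# Fixed model: the rule is specified up front, so no data is examined.
INCOME_THRESHOLD = 50_000  # assumed fixed cutoff, not learned from data

def approve_loan(annual_income):
    """Fixed formula: the output follows directly from the input by a preset rule."""
    return annual_income >= INCOME_THRESHOLD
```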
PARAMETRIC MODELS
• Good idea of how the model should be described, but not an exact one; some data is needed.
• “Explicit mathematical equations characterize the structure of the relationship between inputs and outputs, but a few parameters are unspecified.” (p 1-14, Kennedy)
• Training Sets - Pick parameters by looking at
data.
• Ex: Linear Regression: $\hat{\theta}_j = \sum_{i=1}^{n} c_{ij}\, y_i$
• Fig 8-2, p140 (Barquin)
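As a concrete sketch of picking parameters by looking at a training set, simple one-variable least squares fits the two unspecified parameters (slope and intercept); the fitted slope comes out as a weighted sum of the observed y values, which is exactly the linear-combination form on this slide (illustrative code, not from Kennedy):

```python
# Parametric model sketch: the equation form (a line) is known in advance;
# only the slope and intercept are picked from the training data.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx                     # a linear combination of the y_i
    intercept = mean_y - slope * mean_x
    return slope, intercept
```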
NONPARAMETRIC MODELS
• Rely on data examination to understand the model.
(A.k.a. Data-Driven Models)
• Large amounts of data.
• Premise - patterns observed in the current data will hold in the future.
• May preprocess data based on knowledge which
you have.
• Methods rather than models
• Ex: Nearest Neighbor, Neural Nets, Decision Trees
• Fig 8-3, p 142 (Barquin) (reflects the training set more accurately than the linear model)
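A 1-nearest-neighbor classifier is about the smallest possible data-driven model: no equation is fitted, and the prediction comes directly from examining the stored data (a minimal sketch; the names are ours):

```python
# Nonparametric model sketch: the "model" is the training data itself.
def nearest_neighbor_predict(train, query):
    """train: list of (feature_vector, label) pairs; query: feature vector.
    Returns the label of the closest training example (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(train, key=lambda t: dist2(t[0], query))
    return label
```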
ERROR?
• Bias - “Difference in error between the best
solution and the proposed solution” (p 47, Weiss)
• Variance - “Expected difference in error between a
solution found for a single sample and the average
solution obtained over many random samples” (p
48, Weiss)
• Causes
– Data Warehouse/Reduction/Transformation
– Survey Data/Sampling
– Medical Studies with volunteers
– Erroneous assumptions about the data
– Problem Simplification also introduces errors
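Weiss's variance notion, the expected difference between a single-sample solution and the average solution over many random samples, can be illustrated by repeatedly sampling and measuring how a sample-mean estimator scatters (a toy sketch; the population and sample sizes are made up):

```python
import random

# Illustrative sketch of estimator variance: compare solutions found on
# single random samples against the average solution over many samples.
def estimator_variance(population, sample_size, trials, seed=0):
    rng = random.Random(seed)
    estimates = [
        sum(rng.sample(population, sample_size)) / sample_size
        for _ in range(trials)
    ]
    avg = sum(estimates) / trials
    # mean squared deviation of single-sample solutions from the average solution
    return sum((e - avg) ** 2 for e in estimates) / trials
```

Larger samples give more stable solutions, so the measured variance shrinks as the sample size grows.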
STATISTICAL PERSPECTIVES ON DATA
MINING (Elder and Pregibon)
• Development of Statistical Methods
– Common theme: increases in memory and processor
capabilities influence statistical methods
• 1960s (Robust/Resistant)
– Estimators sensitive to contamination
– Develop new estimators
– Identify and study causes of errors
– Reflects realism - data does not obey mathematics
– Robustness removes limitations of narrow models
– No work on how to use these new/improved estimators
• Early 1970s
– EDA: Exploratory Data Analysis
– Insights and modeling are data driven
– Look at the data first??? (p 85, Fayyad)
– Data = Fit + Residual (Fayyad formula 4.1.1, p 85)
• Structure and noise; Iterative
– Graphical methods assist in visualization
– Data transformation: Reexpression/Splitting (p86)
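The Data = Fit + Residual decomposition can be sketched with a deliberately simple resistant fit, the median (our illustrative choice; EDA uses many richer fits):

```python
# EDA sketch: split data into structure (fit) plus noise (residual).
def fit_plus_residual(data):
    """Use the median as a resistant one-number fit; residuals carry the rest."""
    s = sorted(data)
    n = len(s)
    fit = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    residuals = [x - fit for x in data]
    return fit, residuals  # by construction, fit + residual recovers each datum
```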
• Late 1970s
– Generalized linear models (nonlinear, not normal dist)
– EM algorithm
• Solves estimation problem with incomplete data
• Treat as incomplete for “computational purposes”
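A toy sketch of the EM idea of treating data as incomplete "for computational purposes": estimate a mean when some values are missing, alternating between filling in expectations (E-step) and re-estimating the parameter (M-step). The example is ours, not from the slide:

```python
# Toy EM sketch for an incomplete-data estimation problem.
def em_mean(data, iters=20):
    """data: numbers, with None marking missing observations."""
    observed = [x for x in data if x is not None]
    mu = sum(observed) / len(observed)  # initial guess from observed values
    for _ in range(iters):
        # E-step: replace each missing value by its expectation (current mu)
        completed = [x if x is not None else mu for x in data]
        # M-step: re-estimate mu from the completed data
        mu = sum(completed) / len(completed)
    return mu
```

For this simple model the iteration's fixed point is the mean of the observed values; real EM applications (e.g., mixture models) use the same E/M alternation with less trivial expectations.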
• Early 1980s
– Resampling methods
• Replace the n observations with estimates (pseudovalues $p_i$):
$p_i = n\,\hat{\theta}_{all} - (n-k)\,\hat{\theta}_{(i)}$
$\bar{p} = \sum_i p_i / n$
• Jackknife estimate:
– Bias reduction tool
– Estimate of error in estimate
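The pseudovalue recipe above, taken with k = 1 (delete-one), can be sketched directly; averaging the pseudovalues gives the jackknife estimate and acts as the bias-reduction tool mentioned (illustrative code):

```python
# Delete-one jackknife: pseudovalues p_i = n*theta_all - (n-1)*theta_(i).
def jackknife(data, estimator):
    n = len(data)
    theta_all = estimator(data)  # estimate from all n observations
    pseudovalues = [
        n * theta_all - (n - 1) * estimator(data[:i] + data[i + 1:])
        for i in range(n)  # theta_(i): estimate with observation i left out
    ]
    return sum(pseudovalues) / n  # jackknife estimate p-bar
```

Applied to the biased plug-in variance (divide by n), the delete-one jackknife recovers the unbiased sample variance (divide by n-1) exactly, illustrating the bias-reduction role.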
• Late 1980s
– Globally linear → locally linear
– Scatterpoint smoothing
• Early 1990s
– Projection pursuit and squashing
– Focus shifts from model estimation to model selection
Statistical Perspective
• Interpretability
• Characterization of uncertainty
• Borrowing strength
– Artificial stability
• Examine model for
– residuals,
– diagnostics,
– parameter covariances
• Regularization
– Ockham’s razor
– Simpler model yields best generalization
• Common Statistical Methods
– Linear models
– Nonparametric methods
– Projection pursuit
– Neural networks
– Polynomial networks
– Decision trees
• Classification
• Estimation
– Splines
• Summary
– Statistical Approach Strengths
• Uncertainty controlled
• Explicit assumptions
• Stability
• Interpretability
• Quantification of Variance
– Statistical methods are essential knowledge discovery
tools