Introduction to KDD for Tony's MI Course

Download Report

Transcript Introduction to KDD for Tony's MI Course

1
COMP 3503
Data Preparation
and Meta Data
with
Daniel L. Silver
2
The KDD Process
Interpretation
and Evaluation
Data Mining
Knowledge
Selection and
Pre-processing
Data
Consolidation
Patterns &
Models
Warehouse
Consolidated
Data
Data Sources
p(x)=0.02
Prepared Data
3
Selection and Pre-processing
Core Problems & Approaches
 Problems:
•
•
•
identification of relevant data
representation of data
search for valid pattern or model
Probability
of sale
Age
 Approaches:
Income
• top-down verification by expert
OLAP
• interactive visualization of data/models
•
* bottom-up induction from data *
Data
Mining
4
Selection and Pre-processing
As much effort is expended preparing data as
applying a data mining tool
 Iterative approach: prepare
develop
data
model
 Data Mining phase will benefit from any
insight that leads to improved set of
attributes
 Representation can facilitate or frustrate the
Search for the most accurate model (hypothesis)

Spreadsheet, OLAP and visualization tools are very helpful
5
Selection and Pre-processing
Data and Variable Characteristics
 Three basic variable data types:
•
•
•
Nominal (catagorical) qualitative values marital
status = single, married, divorced, widowed
Ordinal (ranked) values have rank order
grade = A,B,C
Interval values have order plus a metric scale for
comparisons and arithmetic operations
temperature = 2, 10, 20 or 10.5, 15.2, 19.3
date = 12Aug99, 13Feb02
6
Selection and Pre-processing
Data and Variable Characteristics

Variables can be either discrete or continuous
•

In addition data can be of various formats:
•

Only interval numeric values are continuous
Text, numeric, logical, binary, date, money, …
Data mining software will vary in its ability
to accept these types and formats
7
Selection and Pre-processing
Data Selection and Sampling

Select response (dependent) variable
•
•
determine prior probability of categories
deal with volume bias issues
Select predictor (independent) attributes
 Generate a set of examples

•
•

choose sampling method (random, stratified)
consider sample complexity: How many
examples do I need to develop a reliable model?
Handle outliers (obvious exceptions)
• Remove the row or replace with imputed value
8
Selection and Pre-processing
Data Reduction
 The curse of dimensionality
number of attributes / number of values
 Reduce number of attributes
•
•
•

remove redundant and correlating attributes
combine attributes (logically, arithmetically,
statistically (Principal Components Analysis)
Reduce attribute value ranges
•
•
group symbolic discrete values
quantize continuous numeric values
9
Selection and Pre-processing
Preliminary Statistical Analysis

Coefficient of correlation , r, measures the
linear dependence of two variables X and Y,
-1 < r < +1; r 2 shows magnitude of r.
pos r
neg r
?
Y

X
X
X
Select attributes which correlate strongly with
the response variable and pertain to problem
10
Selection and Pre-processing
Preliminary Statistical Analysis
Remove or combine attributes that correlate
with each other or try to de-correlate through
transformation
 Factor Analysis - ANOVA can be used to compare

relative contribution of each attribute to outcomes

Principal Component Analysis - generates
variates - linear combinations of original attributes
Tools such as Minitab, SAS, SPSS can be used
11
Selection and Pre-processing
 Transform data
• de-correlate and normalize values
• map time-series data to static representation
 Encode data
• representation must be appropriate for the Data
Mining tool which will be used
• continue to reduce attribute dimensionality where
possible without loss of information
Use spreadsheet functions or transformation
and encoding software within DM tool
12
Selection and Pre-processing
Transformation and Encoding
Discrete variable values
 If necessary transform to discrete numeric
values
 Example, encode the value 4 as follows:
•
•
•

Nominal: one-of-N code (0 1 0 0 0) - five inputs
Ordinal: thermometer code ( 1 1 1 1 0) - five inputs
Interval: real value (0.4)* - one input
Consider relationship between values
•
(single, married, divorce) vs. (youth, adult, senior)
13
Selection and Pre-processing
Transformation and Encoding
Continuous numeric values
 De-correlate via normalization of values:
Min-max:
x’ = [(newmax – newmin) (x – min) / (max – min)]
+ newmin
• Euclidean: x’ = x / sqrt(sum of all x^2)
• Percentage: x’ = x/(sum of all x)
• Variance based: x’ = (x - (mean of all x))/variance
 Scale values using a linear transform if data is
uniformly distributed or use non-linear (log, power)
if skewed distribution
•
14
Selection and Pre-processing
Transformation and Encoding
Other encodings for continuous numeric values
Example: 1.6 meters could be encoded as:
•
Single real-valued number (0.16)*
o
•
Bits of a binary number (010000)
o
•
BAD! Modeling system must now learning binary encoding
One-of-N quantized intervals (0 1 0 0 0)
o
•
OK! But what if data is skewed
NOT GREAT! Presents discontinuties
Distributed (fuzzy) overlapping intervals
( 0.3 0.8 0.1 0.0 0.0)
o
BEST! Deals well with skewing but no discontinuities
15
Selection and Pre-processing
Extracting Features from a Single
Variable

From dates:
• Day, week, month, quarter, holiday, weekend day

From time:
• Hour, minute, morning, afternoon, evening

From address:
• Postal Code components mean something

Telephone number:
• NPA-NNX-9999
16
Selection and Pre-processing
Time Series Data
Of great interest to business, science, medicine
 Time series data has high dimensional

•

T1, T2, … , Tn
Approaches to summary/characterization
•
•
•
Current value = Ti
Moving average = MAi = ( Ti + Ti-1 + Ti-2) / 3
Trends = Ti - MAi or = MAi - MAi-k
17
Selection and Pre-processing
Textual Data

A difficult data type
•

Can have very high dimensions
•

freeform, open-ended, syntax -vs- semantics
thousands of potential values
Approaches to summary/characterization
•
•
•
define a fixed set of N word classes based on
frequency analysis
map word combinations to one of the N classes
automate via specialty software
Data Warehousing and
Preparation
Access to Recent Information






www.datawarehousing.com
DWI - Data Warehouse Institute www.dw-institute.com
Wikipedia http://en.wikipedia.org/wiki/Data_warehouse
DW Information Centre http://www.dwinfocenter.org
A DW Tutorial: http://www.planet-sourcecode.com/vb/scripts/ShowCode.asp?lngWId=5&txtCodeId=378
Text Books:
Data Warehousing texts by W.H. Inmon, Claudia Imhoff, Ralph Kimball
D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
18
19
THE END
[email protected]