DataMining04


Transcript DataMining04

Database Management Systems:
Data Mining
Data Compression
Jerry Post
Copyright © 2003
The Need for Data Compression
 Noisy data
   Need to combine data to smooth it (bins)
 Too many values in a dimension
   Need to combine data to get a smaller number of sets
 Hierarchies
   Roll up data into natural hierarchies
   Create additional hierarchies
 Data compression
   Large text, images, time series (wavelet compression)
Bins
 Numerical data
 Create bins
 Equal-depth partitions
 Equal-width partitions
 Bin mean/smoothing
Data: 3, 9, 17, 27, 32, 32, 37, 48, 55
 Choose number of bins?
Equal-depth partitions (same number of values per bin):
  Bin 1: 3, 9, 17
  Bin 2: 27, 32, 32
  Bin 3: 37, 48, 55
Equal-width partitions:
  Bin 1: 3 – 20.3 → 3, 9, 17
  Bin 2: 20.3 – 37.6 → 27, 32, 32, 37
  Bin 3: 37.6 – 55 → 48, 55
Smoothing by bin means (equal-depth bins):
  Bin 1: 9.6, 9.6, 9.6
  Bin 2: 30.3, 30.3, 30.3
  Bin 3: 46.6, 46.6, 46.6
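A minimal Python sketch of these three binning steps on the sample data (the function names are illustrative, not from the slides):

# Binning sketch for the sample data on this slide (illustrative names).
data = [3, 9, 17, 27, 32, 32, 37, 48, 55]

def equal_depth_bins(values, k):
    """Split sorted values into k bins with (roughly) the same count."""
    values = sorted(values)
    size = len(values) // k
    return [values[i * size:(i + 1) * size] if i < k - 1 else values[i * size:]
            for i in range(k)]

def equal_width_bins(values, k):
    """Split values into k bins of equal range (width = (max - min) / k)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        idx = min(int((v - lo) / width), k - 1)   # clamp the maximum into the last bin
        bins[idx].append(v)
    return bins

def smooth_by_means(bins):
    """Replace each value by the mean of its bin (bin-mean smoothing)."""
    return [[round(sum(b) / len(b), 1)] * len(b) for b in bins]

print(equal_depth_bins(data, 3))   # [[3, 9, 17], [27, 32, 32], [37, 48, 55]]
print(equal_width_bins(data, 3))   # [[3, 9, 17], [27, 32, 32, 37], [48, 55]]
print(smooth_by_means(equal_depth_bins(data, 3)))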
What is Cluster Analysis?
 The process of placing items into groups, where
items within a group are similar to each other, and
dissimilar to items in other groups.
 Similar to Classification Analysis, but in classification,
the group characteristics are known in advance (e.g.,
borrowers who successfully repaid loans).
Clustering
1. Find groups statistically
2. Eliminate outliers
Standardization
 If there are multiple attributes with continuous variables, you
should standardize them.
 Z-Scores are a common approach, but using a standard
squared-deviation places too much weight on outliers.
s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)

m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)

z_{if} = \frac{x_{if} - m_f}{s_f}
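A short Python sketch of this standardization, assuming one attribute's values in a list; it uses the mean absolute deviation rather than the squared deviation, as described above:

def standardize(values):
    """Z-scores using the mean absolute deviation (less weight on outliers)."""
    n = len(values)
    m_f = sum(values) / n                            # mean
    s_f = sum(abs(x - m_f) for x in values) / n      # mean absolute deviation
    return [(x - m_f) / s_f for x in values]

print(standardize([3, 9, 17, 27, 32, 32, 37, 48, 55]))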
Distance Measure
 A simple distance measure is the L1 norm:
d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}|
 A more common measure is the Euclidean or L2 norm:
d(i,j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{in} - x_{jn})^2}
[Figure: two points plotted on axes x1 and x2, illustrating the distance between them]
General Distance Measure
 In general form:
d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{in} - x_{jn}|^q\right)^{1/q}

\|x\|_{\infty} = \max_{k = 1 \ldots n} |x_{ik} - x_{jk}|
Hierarchies
 Dates
   Year – Quarter – Month – Day
   Year – Quarter – Week – Day
 Business hierarchies
   Division/Product
   Function: Marketing, HRM, Production, …
   Region
 Region hierarchies
   World – Continent – Nation – State – City
Clustering: Principal Components
 Find the primary factors that define the data (vectors Y1, Y2).
 Statistical packages can compute them quickly.
 Map the raw data (x1, x2 points) onto the two vectors.
 Use the vectors instead of the raw data.
[Figure: principal component vectors Y1 and Y2 drawn through the data points, plotted on axes X1 and X2]
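A hedged sketch of computing the principal component vectors with NumPy (eigenvectors of the covariance matrix); the toy data and variable names are illustrative:

import numpy as np

# Toy 2-D data (x1, x2 pairs); in practice this would be the raw table.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

Xc = X - X.mean(axis=0)                    # center the data
cov = np.cov(Xc, rowvar=False)             # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvectors = principal directions
order = np.argsort(eigvals)[::-1]          # sort by explained variance
Y = eigvecs[:, order]                      # Y[:, 0] plays the role of Y1, Y[:, 1] of Y2

scores = Xc @ Y                            # map the raw points onto the vectors
print(Y)        # the principal component directions
print(scores)   # the data expressed in the new coordinates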
Factor Analysis: Latent Variables
 Some data is not directly observable but can be measured through indicator variables. A classic example is human intelligence, which can be evaluated through a variety of tests.
[Diagram: the latent variable Intelligence measured through the indicators IQ Test, SAT, and ACT]
Exploratory Factor Analysis
 Survey (marketing)
 Many items (questions)
 Are they related?
 Can the data be described by a small number of factors?
[Diagram: survey items Q1–Q6 grouped under two factors, Factor 1 and Factor 2]
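A possible sketch using scikit-learn's FactorAnalysis to check whether six items can be described by two factors; the survey answers here are randomly generated filler, purely for illustration:

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# Fake survey: 200 respondents, 6 questions (Q1..Q6), driven by 2 hidden factors.
factors = rng.normal(size=(200, 2))
loadings = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.0],
                     [0.0, 0.9], [0.1, 0.8], [0.0, 0.7]])
answers = factors @ loadings.T + 0.3 * rng.normal(size=(200, 6))

fa = FactorAnalysis(n_components=2).fit(answers)
print(fa.components_)          # estimated loadings: which questions go with which factor
print(fa.transform(answers))   # respondents' scores on the two factors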
Regression to Reduce Values
 Estimate the regression coefficients.
 If the fit is good, use the coefficients to pick categories (see the sketch below).
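One reading of this slide, as a hedged NumPy sketch: fit a simple regression and, if the fit is good, replace the raw values with categories based on the predicted values (the data and cut points are invented):

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8, 12.2, 14.1, 15.9])

slope, intercept = np.polyfit(x, y, 1)      # estimate regression coefficients
predicted = slope * x + intercept
r2 = 1 - np.sum((y - predicted) ** 2) / np.sum((y - y.mean()) ** 2)

if r2 > 0.9:                                # good fit: use the line, not the raw values
    categories = np.digitize(predicted, bins=[5.0, 10.0, 15.0])
    print(slope, intercept, r2)
    print(categories)                       # each point reduced to a small category code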
Wavelet Compression
 Find patterns (wavelets) in the data
 Standard compression methods (GIF)
 Reduces data to a small number of wavelet patterns
 Particularly useful for complex data
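A hedged sketch using the PyWavelets package (pywt, assumed installed; the slides do not name a library): decompose a noisy series, keep only the largest coefficients, and reconstruct:

import numpy as np
import pywt  # PyWavelets, assumed available

t = np.linspace(0, 1, 256)
signal = np.sin(8 * np.pi * t) + 0.3 * np.random.default_rng(1).normal(size=t.size)

coeffs = pywt.wavedec(signal, "db4", level=4)        # wavelet decomposition
flat, slices = pywt.coeffs_to_array(coeffs)
threshold = np.quantile(np.abs(flat), 0.90)          # keep only the largest 10 percent
flat[np.abs(flat) < threshold] = 0.0
compressed = pywt.array_to_coeffs(flat, slices, output_format="wavedec")

reconstructed = pywt.waverec(compressed, "db4")
print(np.mean((signal - reconstructed[:signal.size]) ** 2))  # reconstruction error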
Clustering Technologies
 K-means
   User specifies the number of clusters (k)
   System minimizes the intracluster distance (usually L2: the sum of squared errors); see the sketch after this list.
 K-medoids (selects a representative center point)
 Hierarchical (BIRCH)
 Non-spherical (CURE)
 Density-based (DBSCAN, DENCLUE [best of group])
 Grid-based (STING, CLIQUE)
Data Mining, Jiawei Han and Micheline Kamber, 2001; chapter 8
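A compact k-means sketch in plain NumPy, purely illustrative: the user picks k, and the loop alternates assignment and centroid updates to reduce the within-cluster sum of squared errors:

import numpy as np

def k_means(X, k, iterations=100, seed=0):
    """Plain k-means: minimize the within-cluster sum of squared (L2) errors."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
    for _ in range(iterations):
        # Assign each point to its nearest center.
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.5, size=(30, 2)) for loc in ([0, 0], [5, 5], [0, 5])])
centers, labels = k_means(X, k=3)
print(centers)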
Classification
 Classify an outcome based on attributes. Similar to
prediction and attribute valuation, but generally splits
attributes into groups.
 Example: Decision Tree
 Will customers buy a new item?
Age?
  <=25  → Student?
            Yes → Yes
            No  → No
  26…50 → Yes
  >50   → Income?
            Low  → No
            High → Yes
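A hedged sketch of the example with scikit-learn's DecisionTreeClassifier; the tiny training table is invented to mirror the tree above, so the fitted splits may not match the drawing exactly:

from sklearn.tree import DecisionTreeClassifier, export_text

# Invented training rows: [age, student (0/1), income_high (0=low, 1=high)]
X = [[22, 1, 0], [23, 0, 0], [35, 0, 1], [40, 1, 0],
     [60, 0, 0], [65, 0, 1], [24, 1, 1], [55, 1, 1]]
y = ["yes", "no", "yes", "yes", "no", "yes", "yes", "yes"]  # will the customer buy?

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["age", "student", "income_high"]))
print(tree.predict([[30, 0, 0]]))   # classify a new customer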
Decision Tree
 Modeler specifies
   The dependent variable (outcome)
   The attribute variables
   Sample/training data
 The system selects the nodes and the split criteria that best fit the model, possibly using an information-gain criterion (see the sketch below).
 Results are useful in a data mining/OLAP cube browser. The tree nodes become important sides (dimensions) of the cube, and the classification levels specify hierarchies.
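The information-gain criterion mentioned above, sketched in Python with invented labels (entropy of the parent minus the size-weighted entropy of the split groups):

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent_labels, split_groups):
    """Entropy of the parent minus the size-weighted entropy of the split groups."""
    total = len(parent_labels)
    weighted = sum(len(g) / total * entropy(g) for g in split_groups)
    return entropy(parent_labels) - weighted

outcomes = ["yes", "yes", "no", "no", "yes", "no"]
# Candidate splits, e.g. on age: group 1 = younger customers, group 2 = older.
print(information_gain(outcomes, [["yes", "yes", "yes"], ["no", "no", "no"]]))  # 1.0
print(information_gain(outcomes, [["yes", "no", "yes"], ["no", "yes", "no"]]))  # much lower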
Bayesian Classifier
 Goal: predict the probability that an
item belongs to a class.
 Begin with the naïve assumption: with
m classes, the a priori probability
P(Ci) = 1/m
 Bayes’ Theorem: The probability that
sample X belongs in Class i is
P(Ci|X).
 Assume the attributes are
independent, so you can multiply the
probabilities. Estimate the attribute
probabilities from the sample. For
categorical data, sik is the number of
training samples of Class Ci having
the value xk for Ak and si is the
number of training samples in Ci. For
continuous variables, use a Gaussian
(normal) distribution.
P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}

P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)

P(x_k \mid C_i) = \frac{s_{ik}}{s_i}
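A counting-based sketch of the categorical case: P(xk|Ci) is estimated as sik/si from training rows, and the probabilities are multiplied under the independence assumption (the data and names are invented):

from collections import Counter, defaultdict

# Invented categorical training data: (attributes, class).
training = [
    ({"age": "young", "income": "low"},  "no"),
    ({"age": "young", "income": "high"}, "yes"),
    ({"age": "mid",   "income": "high"}, "yes"),
    ({"age": "old",   "income": "low"},  "no"),
    ({"age": "old",   "income": "high"}, "yes"),
]

class_counts = Counter(c for _, c in training)                       # s_i
value_counts = defaultdict(Counter)                                  # s_ik
for attrs, c in training:
    for attr, value in attrs.items():
        value_counts[c][(attr, value)] += 1

def classify(sample):
    """Pick the class maximizing P(Ci) times the product of P(xk|Ci)."""
    best_class, best_score = None, -1.0
    for c, s_i in class_counts.items():
        score = s_i / len(training)                                  # P(Ci)
        for attr, value in sample.items():
            score *= value_counts[c][(attr, value)] / s_i            # P(xk|Ci)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify({"age": "young", "income": "high"}))  # likely "yes"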
Neural Networks
[Figure: sensory input cells feed a hidden layer, which feeds output cells; the edges carry input weights (for example 3, -2, 7, 4); only some of the connections are shown, and the input pattern can be incomplete (missing inputs)]
Neural Networks: Back Propagation
 Training sample with known results
 Define the network
   Number of layers (you must have one hidden layer)
   Number of cells in each layer
 Normalize the data between 0.0 and 1.0
 Initialize the weights (small random numbers)
 Propagate forward and compute the error
 Update the weights and biases working backwards
[Diagram: inputs X_0, X_1, ..., X_n feed unit j through weights w_{0j}, w_{1j}, ..., w_{nj}, with bias \theta_j]

I_j = \sum_i w_{ij} O_i + \theta_j \quad \text{(weighted sum plus bias)}

O_j = \frac{1}{1 + e^{-I_j}} \quad \text{(logistic output)}
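A minimal sketch of one unit's forward computation as written above (the inputs, weights, and bias are arbitrary):

import math

def unit_output(inputs, weights, bias):
    """O_j = 1 / (1 + e^(-I_j)) where I_j = sum_i w_ij * O_i + theta_j."""
    i_j = sum(w * o for w, o in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-i_j))

inputs = [0.2, 0.7, 0.1]          # normalized inputs in [0, 1]
weights = [0.4, -0.6, 0.1]        # small initial weights
print(unit_output(inputs, weights, bias=0.05))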
NN: Error Calculations
 Output layer: Err_j = O_j (1 - O_j)(T_j - O_j)
   T_j is the true output; O_j(1 - O_j) is the derivative of the logistic function.
 Hidden layer: Err_j = O_j (1 - O_j) \sum_k Err_k\, w_{jk}
 Updates, where (l) is the learning rate:
   w_{ij} \leftarrow w_{ij} + (l)\, Err_j\, O_i
   \theta_j \leftarrow \theta_j + (l)\, Err_j
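A sketch of these update rules for one training pass on a tiny 2-2-1 network of logistic units, assuming the forward computation above; l is the learning rate, and all values are invented:

import math
import random

def logistic(x):
    """O = 1 / (1 + e^(-I))"""
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
# Tiny 2-input, 2-hidden, 1-output network with small random weights and biases.
w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(2)]
b_hidden = [random.uniform(-0.5, 0.5) for _ in range(2)]
w_out = [random.uniform(-0.5, 0.5) for _ in range(2)]
b_out = random.uniform(-0.5, 0.5)

x, target, l = [0.3, 0.8], 1.0, 0.5           # one normalized training sample

# Forward pass.
hidden = [logistic(sum(w * xi for w, xi in zip(w_hidden[j], x)) + b_hidden[j])
          for j in range(2)]
output = logistic(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

# Errors: the output unit uses (T - O); hidden units sum downstream errors times weights.
err_out = output * (1 - output) * (target - output)
err_hidden = [hidden[j] * (1 - hidden[j]) * err_out * w_out[j] for j in range(2)]

# Update weights and biases working backwards (w_ij += l * Err_j * O_i).
for j in range(2):
    w_out[j] += l * err_out * hidden[j]
    b_hidden[j] += l * err_hidden[j]
    for i in range(2):
        w_hidden[j][i] += l * err_hidden[j] * x[i]
b_out += l * err_out

print(round(output, 3), round(err_out, 4), [round(e, 4) for e in err_hidden])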