Fast Kernel-Density-Based Classification and Clustering Using P-Trees
Anne Denton
Major Advisor: William Perrizo
Outline
Introduction
P-Trees
Concepts
Implementation
Kernel Methods
Paper 1: Rule-Based Classification
Paper 2: Kernel-Based Semi-Naïve Bayes Classifier
Paper 3: Hierarchical Clustering
Outlook
Introduction
Data Mining
Information from data
Considers storage issues
P-Tree Approach
Bit-column-based storage (see the sketch after this list)
Compression
Hardware optimization
Simple index construction
Flexibility in storage organization
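As a rough illustration of the bit-column storage idea, here is a minimal Java sketch (not the thesis code; the 8-bit attribute width and all names, such as BitColumns and decompose, are illustrative assumptions). Each bit position of an attribute becomes its own bit vector, which is the starting point for building P-trees:

```java
// Sketch of vertical (bit-column) decomposition: each bit position of an
// 8-bit attribute column becomes its own bit vector.
// Assumes non-negative 8-bit attribute values; names are illustrative only.
public class BitColumns {
    public static boolean[][] decompose(int[] column, int bits) {
        boolean[][] bitCols = new boolean[bits][column.length];
        for (int row = 0; row < column.length; row++) {
            for (int b = 0; b < bits; b++) {
                // bit (bits-1-b) of the value goes to bit column b (high bit first)
                bitCols[b][row] = ((column[row] >> (bits - 1 - b)) & 1) == 1;
            }
        }
        return bitCols;
    }

    public static void main(String[] args) {
        int[] attribute = {173, 42, 255, 0};       // one attribute column
        boolean[][] cols = decompose(attribute, 8); // eight bit columns
        for (boolean[] col : cols) {
            StringBuilder s = new StringBuilder();
            for (boolean bit : col) s.append(bit ? '1' : '0');
            System.out.println(s);                  // one bit column per line
        }
    }
}
```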
P-Tree Concepts
Ordering (details)
New: Generalized Peano order sorting
Compression
Impact of Peano Order Sorting
[Bar chart: Impact of Sorting on Execution Speed. Time in seconds (0 to 120) for unsorted, simple sorting, and generalized Peano sorting on the crop, gene-function, mushroom, spam, and adult data sets.]
[Bar chart: Impact of Sorting on Storage Requirements. P-tree nodes (0 to 60,000) for unsorted, simple sorting, and generalized Peano sorting on the same data sets.]
P-Tree Implementation
Implementation in Java
Was ported to C / C++ (Amal Perera, Masum Serazi)
Fastest compressing P-tree implementation so far
Array indices as pointers (details; see the sketch below)
Grandchild purity
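A minimal sketch of the array-index idea, assuming a flat node store with pure-0 / pure-1 sentinel values; the class PTreeStore, its field names, and the fixed array size are illustrative assumptions, not the actual C/C++ implementation:

```java
// Sketch: P-tree nodes held in flat arrays, with child links as array
// indices rather than object pointers. Pure-0 / pure-1 subtrees collapse
// into sentinel values, which is where the compression comes from.
public class PTreeStore {
    static final int PURE0 = -1;       // sentinel: subtree is all 0s
    static final int PURE1 = -2;       // sentinel: subtree is all 1s

    int[] leftChild  = new int[1024];  // index of left child, or sentinel
    int[] rightChild = new int[1024];  // index of right child, or sentinel
    int nextFree = 1;                  // node 0 is the root

    // Count of 1-bits below node n, which covers 'width' leaf positions.
    long rootCount(int n, long width) {
        if (n == PURE0) return 0;           // pure-0: no bits set
        if (n == PURE1) return width;       // pure-1: all bits set
        return rootCount(leftChild[n], width / 2)
             + rootCount(rightChild[n], width / 2);
    }

    public static void main(String[] args) {
        PTreeStore t = new PTreeStore();
        t.leftChild[0] = PURE1;             // left half all 1s
        t.rightChild[0] = PURE0;            // right half all 0s
        System.out.println(t.rootCount(0, 8));  // prints 4
    }
}
```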
Kernel-Density-Based Classification
Probability of an attribute vector x
Conditional on class label value ci
[condition] is 1 if the condition is true, 0 otherwise
Depending on N training points xt
Kernel function K(x,xt) can be, e.g.,
Gaussian function or step function
$$P(x \mid C = c_i) = \frac{1}{N} \sum_{t=1}^{N} K(x, x_t)\,[C_t = c_i]$$
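A naive (non-P-tree) Java sketch of the estimate above, with a Gaussian kernel; the bandwidth value and all names (KernelClassDensity, density) are illustrative assumptions. The thesis evaluates the underlying counts with P-trees instead of this explicit loop:

```java
// Sketch of the class-conditional kernel density estimate
// P(x | C = c_i) = (1/N) * sum_t K(x, x_t) [C_t = c_i],
// evaluated naively over the training set.
public class KernelClassDensity {
    static double gauss(double[] x, double[] xt, double sigma) {
        double d2 = 0;
        for (int k = 0; k < x.length; k++) {
            double diff = x[k] - xt[k];
            d2 += diff * diff;
        }
        return Math.exp(-d2 / (2 * sigma * sigma));
    }

    static double density(double[] x, double[][] train, int[] label,
                          int ci, double sigma) {
        double sum = 0;
        for (int t = 0; t < train.length; t++)
            if (label[t] == ci)              // indicator [C_t = c_i]
                sum += gauss(x, train[t], sigma);
        return sum / train.length;           // factor 1/N
    }

    public static void main(String[] args) {
        double[][] train = {{0, 0}, {1, 1}, {4, 4}};
        int[] label = {0, 0, 1};
        System.out.println(density(new double[]{0.5, 0.5}, train, label, 0, 1.0));
    }
}
```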
Higher Order Basic Bit (HOBbit) Distance
P-trees make count evaluation efficient for HOBbit intervals (see the sketch below)
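A sketch assuming the usual HOBbit definition from the P-tree literature: the distance between two non-negative integers is the number of right shifts needed until they agree. The interval helper shows why HOBbit neighborhoods are bit-aligned intervals, which is what makes P-tree count evaluation efficient; the class and method names are illustrative:

```java
// Sketch: HOBbit distance and the aligned interval it induces.
// All values within HOBbit distance m of x share the high-order bits
// of x, i.e. they form an aligned interval of width 2^m.
public class Hobbit {
    static int hobbit(int a, int b) {
        int m = 0;
        while (a != b) { a >>= 1; b >>= 1; m++; }
        return m;
    }

    // Aligned interval of all values within HOBbit distance m of x.
    static int[] interval(int x, int m) {
        int lo = (x >> m) << m;                  // clear the low m bits
        return new int[]{lo, lo + (1 << m) - 1};
    }

    public static void main(String[] args) {
        System.out.println(hobbit(12, 8));       // 1100 vs 1000: prints 3
        int[] iv = interval(12, 3);
        System.out.println(iv[0] + ".." + iv[1]); // prints 8..15
    }
}
```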
Paper 1: Rule-Based Classification
Goal: High accuracy on large data sets
including standard ones (UCI ML Repository)
Neighborhood evaluated through
Equality of categorical attributes
HOBbit interval for continuous attributes
Curse of dimensionality
Volume empty with high likelihood
Information gain to select attributes
Attributes considered binary, based on test sample
(“Lazy decision trees”, Friedman ’96 [4])
Continuous data: Interval around test sample
Exact information gain (details; see the sketch below)
Pursuing multiple paths
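A sketch of the information-gain computation for one binary split. This is the standard entropy-based formulation, not necessarily the exact thesis variant; the names (InfoGain, inIv, outIv) are hypothetical, and the class counts stand in for counts that P-tree AND operations would deliver:

```java
// Sketch: information gain of a binary split given class counts.
// inIv / outIv: counts of each class label inside / outside the
// interval around the test sample.
public class InfoGain {
    static double entropy(long[] counts) {
        long n = 0;
        for (long c : counts) n += c;
        double h = 0;
        for (long c : counts)
            if (c > 0) {
                double p = (double) c / n;
                h -= p * (Math.log(p) / Math.log(2));
            }
        return h;
    }

    static double gain(long[] inIv, long[] outIv) {
        long nIn = 0, nOut = 0;
        for (long c : inIv) nIn += c;
        for (long c : outIv) nOut += c;
        long n = nIn + nOut;
        long[] total = new long[inIv.length];
        for (int i = 0; i < total.length; i++) total[i] = inIv[i] + outIv[i];
        return entropy(total)
             - ((double) nIn / n) * entropy(inIv)
             - ((double) nOut / n) * entropy(outIv);
    }

    public static void main(String[] args) {
        System.out.println(gain(new long[]{30, 5}, new long[]{10, 55}));
    }
}
```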
Results: Accuracy
Improvement through multiple paths (20):

    Error rate (%)    C4.5     20 paths
    adult             15.54    14.93
    kr-vs-kp          0.8      0.84
    mushroom          0        0

[Bar chart: 20 Paths vs. 1 Path. Relative accuracy (-2 to 6) on the adult, spam, sick-euthyroid, kr-vs-kp, gene-function, and crop data sets.]
Comparable to C4.5 after much less development time
5 data sets from UCI Machine Learning Repository (details)
2 additional data sets: crop and gene-function
Results: Speed
Used on largest UCI data sets
Scaling of execution time as a function of
training set size
[Chart: time per test sample in milliseconds (0 to 80) vs. number of training points (0 to 30,000); measured execution time with a linear interpolation.]
Paper 2: Kernel-Based Semi-Naïve Bayes Classifier
Goal: Handling many attributes
Naïve Bayes (x^(k): value of the kth attribute):

$$P(x \mid C = c_i) = \prod_{k=1}^{M} P(x^{(k)} \mid C = c_i)$$
Semi-naïve Bayes
Correlated attributes are joined
Has been done for categorical data
Kononenko ’91 [5], Pazzani ’96 [6]
Previously: Continuous data discretized
Kernel-Based Naïve Bayes
Alternatives for continuous data
Discretization
Distribution function: Gaussian with mean and standard deviation from the data
No alternative for the semi-naïve approach
Kernel density estimate (Hastie [7])
[Plot: distribution function vs. kernel density estimate (density 0 to 0.1) over the data points.]
Correlations
Correlation between attributes a and b (N: number of training points):

$$\mathrm{Corr}(a, b) = \frac{\sum_{t=1}^{N} \prod_{k \in \{a,b\}} K^{(k)}(x^{(k)}, x_t^{(k)})}{\prod_{k \in \{a,b\}} \sum_{t=1}^{N} K^{(k)}(x^{(k)}, x_t^{(k)})}$$

K^(k): kernel function for continuous data
d_EH: exponential HOBbit distance

$$K_{\mathrm{Gauss}}^{(k)}(x^{(k)}, x_t^{(k)}) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\!\left( -\frac{d_{EH}(x^{(k)}, x_t^{(k)})^2}{2 \sigma_k^2} \right)$$
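A sketch of the correlation measure above with one-dimensional Gaussian kernels. Plain |x - x_t| stands in for the exponential HOBbit distance d_EH used in the thesis, the bandwidths are illustrative, and the names (KernelCorrelation, corr) are hypothetical:

```java
// Sketch: kernel-based correlation of attributes a and b at query point
// (qa, qb): sum of kernel products (joint) over product of kernel sums
// (marginals). Ordinary distance stands in here for d_EH.
public class KernelCorrelation {
    static double gauss(double x, double xt, double sigma) {
        double d = x - xt;                   // thesis: d_EH(x, x_t) instead
        return Math.exp(-d * d / (2 * sigma * sigma))
             / (Math.sqrt(2 * Math.PI) * sigma);
    }

    static double corr(double[] xa, double[] xb, double qa, double qb,
                       double sa, double sb) {
        double joint = 0, margA = 0, margB = 0;
        for (int t = 0; t < xa.length; t++) {
            double ka = gauss(qa, xa[t], sa);
            double kb = gauss(qb, xb[t], sb);
            joint += ka * kb;                // numerator: sum of products
            margA += ka;                     // denominator: per-attribute sums
            margB += kb;
        }
        // A normalization may be applied before comparing against the
        // threshold t of the results slides; that detail is not shown here.
        return joint / (margA * margB);
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4}, b = {1.1, 2.0, 2.9, 4.2};  // correlated
        System.out.println(corr(a, b, 2.5, 2.5, 1.0, 1.0));
    }
}
```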
Results
P-tree Naïve Bayes vs. Semi-Naïve Bayes: difference only for continuous data
[Bar chart: decrease in error rate (-4 to 12) for P-tree Naïve Bayes and Semi-Naïve Bayes on the spam, crop, adult, sick-euthyroid, and waveform data sets.]
3 parameter combinations (t: threshold)
Blue: t = 1, 3 iterations; Red: t = 0.3, incl. anti-correlations; White: t = 0.05
[Bar chart: decrease in error rate (-10 to 30) for the three parameter combinations on the kr-vs-kp, waveform, splice, sick-euthyroid, mushroom, gene-function, adult, crop, and spam data sets.]
Paper 3: Hierarchical Clustering [10]
Goal: Understand the relationship between standard algorithms
Combine the “best” aspects of three major ones
Partition-based
Relationship to k-medoids [8] demonstrated
Same cluster boundary definition
Density-based
(kernel-based, DENCLUE [9])
Similar cluster center definition
Hierarchical
Follows naturally from above definitions
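A sketch of the shared boundary definition: every point belongs to its nearest cluster representative, as in k-medoids. Choosing the representatives as kernel-density maxima (DENCLUE-style) is only indicated in a comment, since the algorithm's details are not spelled out here; the class and method names are hypothetical:

```java
// Sketch: nearest-representative assignment, the cluster-boundary
// definition shared with k-medoids. In the thesis algorithm the
// representatives would be kernel-density maxima (DENCLUE-style).
public class NearestCenter {
    static int assign(double[] point, double[][] centers) {
        int best = 0;
        double bestD = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double d = 0;
            for (int k = 0; k < point.length; k++) {
                double diff = point[k] - centers[c][k];
                d += diff * diff;            // squared Euclidean distance
            }
            if (d < bestD) { bestD = d; best = c; }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centers = {{0, 0}, {5, 5}};
        System.out.println(assign(new double[]{1, 2}, centers));  // prints 0
    }
}
```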
Results: Speed Comparison with K-Means
[Log-log chart: execution time (1 to 10,000) vs. number of data points (1,000 to 10,000,000) for k-means, a leaf node in the hierarchy, and an internal node.]
Results: Clustering Effectiveness
[3-D histogram over attribute 1 and attribute 2, comparing k-means cluster centers with those of our algorithm.]
Summary
P-tree representation for non-spatial data
Fast implementation
Paper 1: Rule-Based Algorithm
Test-sample-centered intervals, multiple paths
Competitive on “standard” (UCI) data
Paper 2: Kernel-Based Semi-Naïve Bayes
New algorithm to handle large numbers of attributes
Attribute joining shown to be beneficial
Paper 3: Hierarchical Clustering [10]
Competitive for speed and effectiveness
Hierarchical structure
Outlook
Software engineering aspects
Column-oriented design
Relationship with P-tree API
“Non-standard” data
Data with graph structure
Hierarchical data, concept slices [11]
Visualization
Visualization of data on a graph
Software Engineering
Business problems: row-based
Match between database tables and objects
Scientific / engineering problems: column-based
Collective properties of interest
Standard OO unsuitable; instead:
Fortran
Array-based languages (ZPL)
Solution?
Design pattern?
Library?
P-tree API
“Non-standard” Data
Types of data
Biological (KDD-Cup ’02: our team received an honorable mention!)
Sensor Networks
Types of problems
Small probability of minority class label
A-ROC Evaluation
Multi-valued attributes
Bit-vector representation ideal for P-trees
Graphs
Rich supply of new problems / techniques (work with
Chris Besemann)
Hierarchical categorical attributes [11]
Visualization
Idea:
Use graph visualization tool
E.g. http://www.touchgraph.com/
Visualize node data through glyphs
Visualize edge data