Transcript 04_dti_ad

Fast Kernel-Density-Based
Classification and Clustering
Using P-Trees
Anne Denton
Major Advisor: William Perrizo
Outline
 Introduction
 P-Trees
 Concepts
 Implementation
 Kernel Methods
 Paper 1: Rule-Based Classification
 Paper 2: Kernel-Based Semi-Naïve Bayes Classifier
 Paper 3: Hierarchical Clustering
 Outlook
Introduction
 Data Mining
 Information from data
 Considers storage issues
 P-Tree Approach
 Bit-column-based storage
 Compression
 Hardware optimization
 Simple index construction
 Flexibility in storage organization
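The bit-column idea can be pictured with a short sketch: each attribute column of a relation is split into one bit vector per bit position, and all later counting works on these vertical bit columns. This is only an illustration of the storage layout (class and method names here are made up), not the actual P-tree code.

```java
import java.util.BitSet;

// Illustrative sketch: decompose one 3-bit attribute column into bit columns,
// the vertical layout that P-trees are built on (names are hypothetical).
public class BitColumns {
    // bitCols[b] holds bit b (0 = least significant) of every row's value
    static BitSet[] decompose(int[] column, int bits) {
        BitSet[] bitCols = new BitSet[bits];
        for (int b = 0; b < bits; b++) bitCols[b] = new BitSet(column.length);
        for (int row = 0; row < column.length; row++)
            for (int b = 0; b < bits; b++)
                if (((column[row] >> b) & 1) == 1) bitCols[b].set(row);
        return bitCols;
    }

    public static void main(String[] args) {
        int[] attribute = {5, 3, 7, 0, 6};          // one column of a relation
        BitSet[] cols = decompose(attribute, 3);    // three bit columns
        for (int b = 2; b >= 0; b--)
            System.out.println("bit " + b + ": " + cols[b]);
    }
}
```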
P-Tree Concepts
 Ordering (details)
 New: Generalized Peano order sorting
 Compression
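Generalized Peano order sorting is, roughly, a Z-order (bit-interleaved) ordering of the rows across all attributes, which places similar rows next to each other and so improves compression of the bit columns. The following is a rough sketch under that assumption; the exact generalization used in the thesis may differ in detail.

```java
import java.util.Arrays;
import java.util.Comparator;

// Rough sketch of Peano/Z-order sorting: rows are ordered by the
// bit-interleaved code of their attributes (8-bit attributes assumed here).
public class PeanoSort {
    static long mortonCode(int[] row, int bitsPerAttr) {
        long code = 0;
        for (int b = bitsPerAttr - 1; b >= 0; b--)      // most significant bits first
            for (int a = 0; a < row.length; a++)
                code = (code << 1) | ((row[a] >> b) & 1);
        return code;
    }

    public static void main(String[] args) {
        int[][] rows = {{3, 200}, {5, 17}, {4, 18}, {250, 1}};
        Arrays.sort(rows, Comparator.comparingLong((int[] r) -> mortonCode(r, 8)));
        for (int[] r : rows) System.out.println(Arrays.toString(r));
    }
}
```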
Impact of Peano Order Sorting
[Chart: "Impact of Sorting on Storage Requirements": P-tree nodes for the crop, gene-function, mushroom, spam, and adult data sets, comparing unsorted, simple sorting, and generalized Peano sorting]
[Chart: "Impact of Sorting on Execution Speed": time in seconds for the same data sets and sorting variants]
P-Tree Implementation
 Implementation in Java
 Was ported to C / C++ (Amal Perera, Masum Serazi)
 Fastest compressing P-tree implementation so far
 Array indices as pointers (details)
 Grandchild purity
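One way to picture "array indices as pointers" is a count tree over a single bit column stored in flat arrays, where a child is referenced by its index and pure segments get no children. A toy sketch only, not the actual Java or C/C++ implementation.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy count tree over one bit column, stored in flat arrays with child
// *indices* instead of object pointers. A node whose segment is pure
// (all 0s or all 1s) is a leaf.
public class CountTree {
    List<Integer> count = new ArrayList<>();   // 1-bit count of the node's segment
    List<Integer> left  = new ArrayList<>();   // index of left child, -1 if leaf
    List<Integer> right = new ArrayList<>();   // index of right child, -1 if leaf

    int build(BitSet bits, int from, int to) { // segment [from, to)
        int node = count.size();
        int ones = bits.get(from, to).cardinality();
        count.add(ones); left.add(-1); right.add(-1);
        boolean pure = (ones == 0 || ones == to - from);
        if (!pure && to - from > 1) {
            int mid = (from + to) / 2;
            left.set(node, build(bits, from, mid));
            right.set(node, build(bits, mid, to));
        }
        return node;
    }

    public static void main(String[] args) {
        BitSet b = new BitSet(8);
        b.set(0, 4);                            // bit column 1111 0000
        CountTree t = new CountTree();
        int root = t.build(b, 0, 8);
        System.out.println("root count = " + t.count.get(root));  // 4
    }
}
```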
Kernel-Density-Based
Classification
 Probability of an attribute vector x
 Conditional on class label value c_i
 Depending on N training points x_t
 Kernel function K(x, x_t) can be, e.g., a Gaussian function or a step function
 [·] is 1 if the enclosed condition is true, 0 otherwise

P(x | C = c_i) = \frac{1}{N} \sum_{t=1}^{N} K(x, x_t) \, [C_t = c_i]
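Below is a direct reading of the formula above, with a Gaussian kernel and an arbitrarily chosen bandwidth (both assumptions; the thesis evaluates these counts through P-trees rather than by looping over training points).

```java
// Sketch of the kernel-density class probability from the slide:
// P(x | C = c_i) ≈ (1/N) * sum over training points t of K(x, x_t) [C_t = c_i].
// Gaussian kernel over Euclidean distance with a fixed, hand-picked bandwidth.
public class KernelClassifier {
    static double gaussKernel(double[] x, double[] xt, double sigma) {
        double s = 0;
        for (int k = 0; k < x.length; k++) {
            double d = x[k] - xt[k];
            s += d * d;
        }
        return Math.exp(-s / (2 * sigma * sigma));
    }

    // train: N training vectors, labels: their class labels
    static double density(double[] x, double[][] train, int[] labels, int ci, double sigma) {
        double sum = 0;
        for (int t = 0; t < train.length; t++)
            if (labels[t] == ci)                      // indicator [C_t = c_i]
                sum += gaussKernel(x, train[t], sigma);
        return sum / train.length;                    // factor 1/N
    }
}
```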
Higher Order Basic Bit Distance (HOBbit)
 P-trees make count evaluation efficient for the
following intervals
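The interval definitions themselves are on a detail slide; as a rough illustration, assuming HOBbit closeness means agreement on high-order bits, the interval at level b around a test value x is the block of 2^b integers that share all bits of x above position b.

```java
// Rough sketch of HOBbit-style intervals (assumed definition: two values are
// close when they agree on their high-order bits).
public class HobbitInterval {
    // Position of the most significant differing bit, -1 if the values are equal.
    static int msDifferingBit(int a, int b) {
        int diff = a ^ b;
        return diff == 0 ? -1 : 31 - Integer.numberOfLeadingZeros(diff);
    }

    // Interval of 2^level values sharing all bits of x above position 'level'.
    static int intervalLow(int x, int level)  { return x & ~((1 << level) - 1); }
    static int intervalHigh(int x, int level) { return intervalLow(x, level) + (1 << level) - 1; }

    public static void main(String[] args) {
        int x = 0b1011_0110;                                          // 182
        System.out.println(msDifferingBit(x, 0b1011_0011));           // 2
        System.out.println(intervalLow(x, 3) + ".." + intervalHigh(x, 3)); // 176..183
    }
}
```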
Paper 1:
Rule-Based Classification
 Goal: High accuracy on large data sets
including standard ones (UCI ML Repository)
 Neighborhood evaluated through
 Equality of categorical attributes
 HOBbit interval for continuous attributes
 Curse of dimensionality
 Volume empty with high likelihood
 Information gain to select attributes
 Attributes considered binary, based on test sample
(“Lazy decision trees”, Friedman ’96 [4])
 Continuous data: Interval around test sample
 Exact information gain (details)
 Pursuing multiple paths
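As a generic illustration of the attribute-selection step, information gain for one binary split (inside vs. outside the test-sample-centered interval) can be computed from class counts as below; the thesis uses an exact information gain variant that is not reproduced here.

```java
// Standard information gain for a binary split, computed from class counts.
// inSplit[c] / outSplit[c]: number of training points of class c inside /
// outside the test-sample-centered interval.
public class InfoGain {
    static double entropy(int[] counts) {
        int n = 0;
        for (int c : counts) n += c;
        double h = 0;
        for (int c : counts)
            if (c > 0) {
                double p = (double) c / n;
                h -= p * Math.log(p) / Math.log(2);
            }
        return h;
    }

    static double gain(int[] inSplit, int[] outSplit) {
        int nIn = 0, nOut = 0;
        int[] total = new int[inSplit.length];
        for (int c = 0; c < inSplit.length; c++) {
            nIn += inSplit[c]; nOut += outSplit[c];
            total[c] = inSplit[c] + outSplit[c];
        }
        double n = nIn + nOut;
        return entropy(total) - (nIn / n) * entropy(inSplit) - (nOut / n) * entropy(outSplit);
    }
}
```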
Results: Accuracy
 Comparable to C4.5 after much less development time
 5 data sets from UCI Machine Learning Repository (details)
 2 additional data sets
 Crop
 Gene-function
 Improvement through multiple paths (20)

Comparison with C4.5 (C4.5 vs. 20 paths):
 adult: 15.54 vs. 14.93
 kr-vs-kp: 0.8 vs. 0.84
 mushroom: 0 vs. 0

[Chart: "20 Paths vs. 1 Path": relative accuracy for adult, spam, sick-euthyroid, kr-vs-kp, gene-function, and crop]
Results: Speed
 Used on largest UCI data sets
 Scaling of execution time as a function of
training set size
[Chart: time per test sample in milliseconds vs. number of training points (0 to 30,000), showing measured execution time and a linear interpolation]
Paper 2: Kernel-Based
Semi-Naïve Bayes Classifier
 Goal: Handling many attributes
 Naïve Bayes

P(x | C = c_i) = \prod_{k=1}^{M} P(x^{(k)} | C = c_i)

 x^(k) is the value of the kth attribute
 Semi-naïve Bayes
 Correlated attributes are joined
 Has been done for categorical data
 Kononenko ’91 [5], Pazzani ’96 [6]
 Previously: Continuous data discretized
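A minimal sketch of the Naïve Bayes factorization above with kernel-estimated factors, assuming a 1-D Gaussian kernel and per-attribute bandwidths (both choices are placeholders).

```java
// P(x | C = c_i) ≈ product over attributes k of P(x^(k) | C = c_i),
// each factor estimated by a 1-D kernel density over the class-c_i training rows.
public class NaiveBayesProduct {
    static double kernel1d(double x, double xt, double sigma) {
        double d = (x - xt) / sigma;
        return Math.exp(-0.5 * d * d) / (Math.sqrt(2 * Math.PI) * sigma);
    }

    // train: rows of class c_i only; sigma[k]: bandwidth of attribute k
    static double classConditional(double[] x, double[][] train, double[] sigma) {
        double product = 1.0;
        for (int k = 0; k < x.length; k++) {
            double sum = 0;
            for (double[] xt : train) sum += kernel1d(x[k], xt[k], sigma[k]);
            product *= sum / train.length;   // estimate of P(x^(k) | C = c_i)
        }
        return product;
    }
}
```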
Kernel-Based Naïve Bayes
 Alternatives for continuous data
 Discretization
 Distribution function
 Gaussian with mean and standard deviation from the data
 No alternative for the semi-naïve approach
 Kernel density estimate (Hastie [7])
[Figure: distribution function (Gaussian fit) vs. kernel density estimate for a set of data points]
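The figure contrasts the two estimates for one continuous attribute; the sketch below computes both, assuming a Gaussian kernel with a hand-picked bandwidth.

```java
// Two ways of estimating P(x) for one continuous attribute, as contrasted in
// the figure: a single Gaussian fitted by mean and standard deviation, and a
// kernel density estimate (Gaussian kernel, hand-picked bandwidth h).
public class DensityEstimates {
    static double gaussianFit(double x, double[] data) {
        double mean = 0, var = 0;
        for (double v : data) mean += v;
        mean /= data.length;
        for (double v : data) var += (v - mean) * (v - mean);
        var /= data.length;
        return Math.exp(-(x - mean) * (x - mean) / (2 * var)) / Math.sqrt(2 * Math.PI * var);
    }

    static double kernelEstimate(double x, double[] data, double h) {
        double sum = 0;
        for (double v : data)
            sum += Math.exp(-(x - v) * (x - v) / (2 * h * h)) / (Math.sqrt(2 * Math.PI) * h);
        return sum / data.length;
    }
}
```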
Correlations
 Correlation between attributes a and b
 N: number of training points t
 Kernel function for continuous data
 d_EH: exponential HOBbit distance

Corr(a, b) = N \, \frac{\sum_{t=1}^{N} \prod_{k \in \{a,b\}} K^{(k)}(x^{(k)}, x_t^{(k)})}{\prod_{k \in \{a,b\}} \sum_{t=1}^{N} K^{(k)}(x^{(k)}, x_t^{(k)})}

K_{Gauss}^{(k)}(x^{(k)}, x_t^{(k)}) = \frac{1}{\sqrt{2\pi}\, \sigma_k} \exp\left( -\frac{d_{EH}(x^{(k)}, x_t^{(k)})^2}{2 \sigma_k^2} \right)
Results
 Difference only for continuous data
[Chart: decrease in error rate for P-tree Naïve Bayes and Semi-Naïve Bayes on the spam, crop, adult, sick-euthyroid, and waveform data sets]
 3 parameter combinations (t: threshold)
 Blue: t = 1, 3 iterations
 Red: t = 0.3, including anti-correlations
 White: t = 0.05
[Chart: decrease in error rate on the kr-vs-kp, waveform, splice, sick-euthyroid, mushroom, gene-function, adult, crop, and spam data sets]
Paper 3:
Hierarchical Clustering [10]
 Goal:
 Understand relationship between standard
algorithms
 Combine the “best” aspects of three major ones
 Partition-based
 Relationship to k-medoids [8] demonstrated
 Same cluster boundary definition
 Density-based
(kernel-based, DENCLUE [9])
 Similar cluster center definition
 Hierarchical
 Follows naturally from above definitions
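Very loosely, the combination can be pictured as follows: cluster centers are points of high kernel density (the density-based view), and membership is nearest-center assignment (the k-medoids/k-means boundary). The toy sketch below illustrates only these two definitions, not the thesis algorithm or its P-tree implementation.

```java
// Toy illustration: density-based center quality combined with a
// k-medoids/k-means style boundary (nearest-center assignment).
public class DensityCenters {
    // Kernel density at x over the data set (Gaussian kernel, bandwidth h);
    // candidate centers would be points where this value is locally maximal.
    static double density(double[] x, double[][] data, double h) {
        double sum = 0;
        for (double[] p : data) {
            double d2 = 0;
            for (int k = 0; k < x.length; k++) d2 += (x[k] - p[k]) * (x[k] - p[k]);
            sum += Math.exp(-d2 / (2 * h * h));
        }
        return sum;
    }

    // Cluster boundary: each point belongs to the nearest of the given centers.
    static int nearestCenter(double[] x, double[][] centers) {
        int best = 0;
        double bestD = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double d2 = 0;
            for (int k = 0; k < x.length; k++) d2 += (x[k] - centers[c][k]) * (x[k] - centers[c][k]);
            if (d2 < bestD) { bestD = d2; best = c; }
        }
        return best;
    }
}
```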
Results: Speed Comparison with K-Means
[Chart: log-log comparison of run time vs. data set size (1,000 to 10,000,000 points) for K-Means, leaf nodes in the hierarchy, and internal nodes]
Results: Clustering Effectiveness
[Figure: histogram over Attribute 1 and Attribute 2 with the cluster centers found by K-means and by our algorithm]
Summary
 P-tree representation for non-spatial data
 Fast implementation
 Paper 1: Rule-Based Algorithm
 Test-sample-centered intervals, multiple paths
 Competitive on “standard” (UCI) data
 Paper 2: Kernel-Based Semi-Naïve Bayes
 New algorithm to handle large numbers of attributes
 Attribute joining shown to be beneficial
 Paper 3: Hierarchical Clustering [10]
 Competitive for speed and effectiveness
 Hierarchical structure
Outlook
 Software engineering aspects
 Column-oriented design
 Relationship with P-tree API
 “Non-standard” data
 Data with graph structure
 Hierarchical data, concept slices [11]
 Visualization
 Visualization of data on a graph
Software Engineering
 Business problems: row-based
 Match between database tables and objects
 Scientific / engineering problems: column-based
 Collective properties of interest
 Standard OO unsuitable, instead
 Fortran
 Array-based languages (ZPL)
 Solution?
 Design pattern?
 Library?
P-tree API
“Non-standard” Data
 Types of data
 Biological (KDD-cup '02: our team received an honorable mention!)
 Sensor Networks
 Types of problems
 Small probability of minority class label
 A-ROC Evaluation
 Multi-valued attributes
 Bit-vector representation ideal for P-trees
 Graphs
 Rich supply of new problems / techniques (work with Chris Besemann)
 Hierarchical categorical attributes [11]
Visualization
Idea:
 Use graph visualization tool
 E.g. http://www.touchgraph.com/
 Visualize node data through glyphs
 Visualize edge data