A K-Means Based Bayesian Classifier Programmed Within a DBMS

Download Report

Transcript A K-Means Based Bayesian Classifier Programmed Within a DBMS

A K-Means Based Bayesian
Classifier Inside a DBMS Using
SQL & UDFs
Ph.D Showcase, Dept. of Computer Science
Sasi Kumar Pitchaimalai
Ph.D Candidate
Database Systems Group, Department of Computer Science
University of Houston
Advisor: Dr. Carlos Ordonez
1
Motivation
• Naïve Bayes Classifier(NB)
– One of the most popular and important
classifiers in Machine Learning
– Robust, Powerful, Fast to Compute And
Easy to Understand
• Programming Inside A DBMS
– SQL can easily handle complex
computations
– UDFs can use arrays and processed in
memory
2
Data Mining Inside A DBMS
•Avoids Exporting the data outside the DBMS
•Major overhead
•Data Security
•Scales Linearly with large data sets
•Exploit parallelism provided by a DBMS
•Use optimized queries with simple database
operations
•Objective: Push computations involving
large data sets inside the DBMS
Bayesian Classifier Based On K-Means
(BKM)
• A Generalization Of Naïve Bayes(NB)
• The Algorithm
– Initialization: Randomly initialize k clusters per class
from the data set.
– E-Step: Compute Euclidean distance, find nearest
cluster and then compute sufficient statistics.
– M-Step: Re-compute cluster centers and radii. Check
Convergence.
• The E-Step and M-Step are repeated until model
converges i.e clusters do not move
4
BKM: Finding the clusters per class
Database Optimizations
• Five different query optimization techniques for
distance computation were introduced.
• User Defined Functions (UDFs) – Computing
distance and nearest cluster in a single UDF.
• Using CASE statement instead of aggregations.
• Sufficient Statistics of the clusters were
computed in a single table scan.
6
Comparing Accuracy – NB Vs BKM Vs
DT
•Global Accuracy: BKM better
than NB and worse than
DT(Decision Tree) in most
cases
Data Set
Algorithm
Global
Class-0
Class-1
pima
NB
76%
80%
68%
BKM
76%
87%
53%
DT
68%
76%
53%
NB
70%
87%
45%
BKM
73%
91%
43%
DT
80%
85%
72%
NB
50%
51%
30%
BKM
59%
59%
60%
DT
89%
96%
0%
NB
93%
91%
95%
BKM
93%
84%
97%
DT
95%
94%
96%
Spam
•Class Breakdown Accuracy:
BKM better than NB except 2
cases proving class
decomposition is a positive
step towards increasing NB
accuracy. DT performs poorly
here and really worse in case of
the bscale.
Bscale
Wbcancer
7
BKM Scalability- Varying n,d,k
Times per Iteration. Defaults: d=4,k=4,n=100k
8
Comparing DBMS with
MapReduce
MapReduce: A distributed non-transactional
high performance data intensive processing
framework.
Incremental Mining
• An UDF performing incremental data mining
exploiting data parallelism
• Minimizing the number of scans(1-3) on the data
set
• Provides an approximation of the model before
we scan through the complete data set
• Requires thread safe sharing of the model
without affecting performance
Papers
• Carlos Ordonez, Sasi K. Pitchaimalai: One-pass data
mining algorithms in a DBMS with UDFs. SIGMOD
Conference 2011: 1217-1220
• Sasi K. Pitchaimalai, Carlos Ordonez, Carlos Garcia
Alvarado : Comparing SQL and MapReduce to compute
Naïve Bayes in a Single Table Scan, CloudDB, CIKM 2010
• Carlos Ordonez, Sasi K. Pitchaimalai: Fast UDFs to
compute sufficient statistics on large data sets exploiting
caching and sampling, DKE 2010
• Carlos Ordonez, Sasi K. Pitchaimalai - Bayesian
Classifiers Programmed in SQL, TKDE 2008
• Sasi K. Pitchaimalai, Carlos Ordonez, Carlos Garcia
Alvarado – Efficient Distance Computation Using SQL
Queries and UDFs, ICDM 2008