Final Project presentation (20 min)
Data Mining with Oracle
using Clustering and
Classification Algorithms
Presented by Nhamo Mdzingwa
Supervisor: John Ebden
Overview of Presentation
Objective of Research
Background
Methodology
Approach
Implementation
Results
Conclusions
Questions
Problem statement 1
Objective of Research
Evaluate two types of algorithms available
in Oracle10g for data mining (ODM)
To determine which algorithm builds the
most effective model and under what
circumstances
And which model produces the most
accurate results when applied to new data
Problem statement 2
Objective of Research
Gather information from the mined dataset
Find predictors of HIV/AIDS prevention behavior
To do this, distinguish the clusters found
Or use other mining algorithms to achieve this goal
Introduction
Background
Data mining is a powerful and relatively new
technology.
It is driven by the revolutionary progress in
digital data acquisition and storage, which
has resulted in the creation of huge
databases
Definition
Background
It is the process of extracting knowledge
from large amounts of data,
or simply knowledge discovery in
databases
It is the discovery of interesting patterns in data
Data mining tool
Methodology
Oracle10g database release 1 was installed and
configured
Oracle data miner 10g (ODM) was also installed
and configured for use with the database
Algorithms in ODM
Methodology
Classification
  Adaptive Bayes Network
  Naive Bayes
  Model Seeker
Association rules
  Apriori
Clustering
  k-Means
  O-Cluster
Clustering Algorithms
Methodology
Clustering algorithms support identifying
naturally occurring groupings within the data
population.
K-Means settings
  Minimum Error Tolerance and Maximum Iterations
  Maximum number of Clusters (k)
O-Cluster settings
  Sensitivity
  Maximum number of Clusters (k)
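To illustrate how the K-Means settings above interact, here is a minimal sketch in plain Python. It is not ODM's implementation: the function name `k_means`, the parameter names `max_iterations` and `min_error_tolerance`, the first-k-points seeding, and the toy points are all assumptions made for this example.

```python
import math


def k_means(points, k, max_iterations=20, min_error_tolerance=0.01):
    """Toy k-means: stops at max_iterations or when the error stops improving."""
    # Assumption: seed the centroids with the first k points. Real tools
    # usually choose starting points more carefully.
    centroids = list(points[:k])
    previous_error = math.inf
    assignments = []
    for _ in range(max_iterations):
        # Assign every point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of its members (unchanged if empty).
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
        # Error: total squared distance from each point to its centroid.
        error = sum(
            math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignments)
        )
        if abs(previous_error - error) <= min_error_tolerance:
            break  # improvement is below the tolerance, so stop early
        previous_error = error
    return centroids, assignments


# Toy data: two obvious groups, interleaved.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.9, 8.1)]
centroids, assignments = k_means(points, k=2)
```

A smaller tolerance or a larger iteration limit lets the loop run longer before stopping; k fixes how many clusters are returned.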
Dataset used
Methodology
Obtained from the Centre for AIDS
Development, Research and Evaluation
Institute for Social and Economic
Research, Rhodes University
Based on a questionnaire survey
HIV/AIDS related
Tsha Tsha - HIV/AIDS awareness program
Dataset used
Methodology
2 datasets loaded into database tables
TSHA_TSHA_BUILD1
500 records
Used to build and test models
TSHA_TSHA_APPLY1
399 records
Used to validate models
Methodology
Determining model accuracy
Confidence is a measure of the homogeneity of the cluster; that is, how
close together the cluster members are
Support is a measure of the relative size of a cluster (the total need
not be 1.00); the higher the value, the larger the cluster
Methodology
Building and Testing the Models
20 models built in total
The building was done in 2 phases
1) Distinct number of clusters
2) Equal number of clusters
Algorithm settings:
chosen by trial and error
Methodology settings
1st phase model building
Methodology
1st phase model Accuracy
Methodology
2nd phase model building
To overcome the problem (bias),
I decided to set k, the maximum number of
clusters, to a fixed value.
I set k to 7 for every cluster model built in
this phase
Methodology
2nd phase model results
Methodology
Applying the best models
The most accurate models,
BUILD3_OC_TSHATSHA2 from O-Cluster and
BUILD5_KM_TSHATSHA2 from K-Means,
were applied to the new data in
TSHA_TSHA_APPLY1
Methodology
Determining Cluster Quality
Adopt and implement the evaluation
technique of [Roiger et al, 2003], which
involves employing supervised learning to
evaluate unsupervised learning.
Decided to use classification (ABN):
ODM has classification algorithms
The ABN algorithm has been identified as the most
accurate in previous research
Methodology
Technique
Supervised Learning for
Unsupervised Model Evaluation
Designate each formed cluster as a class and
assign each class an arbitrary name.
Choose a random sample of instances from
each class for supervised learning.
Build a supervised model from the chosen
instances. Employ the remaining instances to
test the correctness of the model.
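The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions: a nearest-class-mean classifier stands in for ABN (which is not reproduced here), and the function name `evaluate_clustering`, the `train_fraction` parameter, and the toy instances are hypothetical.

```python
import math
import random
from collections import defaultdict


def evaluate_clustering(instances, cluster_ids, train_fraction=0.5, seed=0):
    """Return the test accuracy of a classifier trained on cluster labels."""
    rng = random.Random(seed)

    # Step 1: designate each cluster as a class.
    by_class = defaultdict(list)
    for features, cluster_id in zip(instances, cluster_ids):
        by_class[cluster_id].append(features)

    # Step 2: choose a random sample from each class for training;
    # the remaining instances are held back for testing.
    train, test = {}, []
    for label, members in by_class.items():
        shuffled = members[:]
        rng.shuffle(shuffled)
        cut = max(1, int(len(shuffled) * train_fraction))
        train[label] = shuffled[:cut]
        test += [(features, label) for features in shuffled[cut:]]
    if not test:
        raise ValueError("no instances left for testing")

    # Step 3: build the supervised model. Stand-in for ABN: one mean
    # vector per class; predict the class whose mean is nearest.
    means = {
        label: tuple(sum(v) / len(rows) for v in zip(*rows))
        for label, rows in train.items()
    }

    def predict(features):
        return min(means, key=lambda label: math.dist(features, means[label]))

    # Test the model on the remaining instances.
    correct = sum(predict(features) == label for features, label in test)
    return correct / len(test)


# Toy data: two well-separated clusters named "A" and "B".
instances = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
             (9.8, 10.0), (10.1, 9.9), (10.0, 10.2), (9.9, 9.7)]
cluster_ids = ["A", "A", "A", "A", "B", "B", "B", "B"]
accuracy = evaluate_clustering(instances, cluster_ids)
```

A high test accuracy suggests the clusters are well separated enough for a supervised model to learn them; a low one suggests the clusters overlap.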
Methodology
Technique
Build classification model using ABN
Apply ABN model to remaining instances
Methodology
Comparison of ClusterIDs
CLASSIFICATION TABLE    vs    CLUSTER TABLE
OC_APPLY_ABN            vs    APPLY_OC3_TSHATSHA (remaining instances, O-Cluster model results)
KM_APPLY_ABN            vs    APPLY_KM5_TSHATSHA (remaining instances, K-Means model results)
Methodology
Comparison of ClusterIDs
DATA SOURCE              ClusterIDs in BOTH TABLES    PERCENTAGE of ClusterIDs in both models
For O-Cluster results    42 out of 107                39%
For K-Means results      18 out of 107                17%
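The percentages above follow from the counts: 42/107 is about 39% and 18/107 is about 17%. A minimal Python sketch of such a comparison follows, assuming that "ClusterIDs in both tables" means records assigned the same ClusterID by the classification table and the cluster table. The function names, record IDs, and ClusterIDs in the small example are hypothetical; only the counts 42, 18, and 107 come from the slide.

```python
def cluster_id_agreement(classified, clustered):
    """Count records whose ClusterID is the same in both tables."""
    shared = classified.keys() & clustered.keys()
    matches = sum(
        1 for record_id in shared if classified[record_id] == clustered[record_id]
    )
    return matches, len(shared)


def as_percentage(matches, total):
    """Agreement as a whole-number percentage."""
    return round(100 * matches / total)


# Hypothetical record IDs mapped to ClusterIDs, for illustration only.
classified = {101: 3, 102: 5, 103: 3, 104: 7}
clustered = {101: 3, 102: 4, 103: 3, 104: 2}
matches, total = cluster_id_agreement(classified, clustered)

# Counts reported on the slide.
o_cluster_pct = as_percentage(42, 107)
k_means_pct = as_percentage(18, 107)
```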
defining predictors
Determining HIV Predictors
HIV/AIDS predictors of prevention behavior are
attributes within our dataset that influence an
individual to:
(A) use a condom when he/she decides to be
sexually active
(B) abstain from sexual intercourse
for a year or more
(C) have fewer sexual
partners.
Methodology
Determining HIV Predictors
2 techniques were used to find these predictors:
Distinguishing the clusters found by the
O-Cluster model
Employing association rules (Apriori)
Applied to 2 datasets:
The clusters found by the O-Cluster model
The dataset the O-Cluster model was applied to
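For readers unfamiliar with association rules, the two measures that Apriori-style mining relies on can be sketched in a few lines of Python. This is a generic illustration, not ODM's Apriori: the function name `rule_metrics`, the attribute strings, and the four toy records are assumptions and are not taken from the Tsha Tsha dataset. Rule confidence here is a different measure from the cluster confidence defined earlier.

```python
def rule_metrics(records, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    with_antecedent = [r for r in records if antecedent <= r]
    with_both = [r for r in with_antecedent if consequent <= r]
    # Support: fraction of all records containing both sides of the rule.
    support = len(with_both) / len(records)
    # Confidence: of the records matching the antecedent, the fraction
    # that also match the consequent.
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Toy records, invented for illustration only.
records = [
    {"hiv_test=yes", "condom_use=yes"},
    {"hiv_test=yes", "condom_use=yes"},
    {"hiv_test=yes", "condom_use=no"},
    {"hiv_test=no", "condom_use=no"},
]
support, confidence = rule_metrics(records, {"hiv_test=yes"}, {"condom_use=yes"})
```

An attribute would be read as a candidate predictor when rules with it on the left-hand side have both reasonable support and high confidence.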
predictors found
Determining HIV Predictors
From distinguishing the clusters found, the
attributes HIV test and Know Aids were
identified as predictors of condom use and
abstinence
From the association rules, the attributes
HIV test and Talk openly were
identified as predictors of condom use.
The predictors
Determining HIV Predictors
HIV test – if one has had an HIV test
Know Aids – if one knows about AIDS
Talk openly – if one talks openly about
HIV/AIDS
Regarding the evaluation
Conclusions
The O-Cluster algorithm produced the most
effective model:
accuracy of 95.5%
When applied to new data: 39%
Most effective model by K-Means:
accuracy of 86.9%
When applied to new data: 17%
Regarding ODM Algorithms
Conclusions
Classification
1. Adaptive Bayes Network
2. Naive Bayes
3. Model Seeker
Clustering
1. k-Means
2. O-Cluster
Association rules
1. Apriori
observations
Conclusions
Model accuracy gives some indication of how
a model will perform on new data
It is therefore recommended that one
finds the most accurate model in order to obtain
accurate results