Final Project presentation (20 min)


Data Mining with Oracle
using Clustering and
Classification Algorithms
Presented by Nhamo Mdzingwa
Supervisor: John Ebden
Overview of Presentation
Objective of Research
Background
Methodology
 Approach
 Implementation
Results
Conclusions
Questions
Problem statement 1
Objective of Research
Evaluate two types of algorithms available in Oracle10g for data mining (ODM):
 To determine which algorithm builds the most effective model, and under what circumstances
 To determine which model produces the most accurate results when applied to new data

Problem statement 2
Objective of Research

Gather information from the mined dataset
 Find predictors of HIV/AIDS prevention behavior
 To do this, distinguish the clusters
 Or use other mining algorithms to achieve the goal

Introduction
Background
Data mining is a powerful new technology.
 It is driven by the revolutionary progress in digital data acquisition and storage, which has resulted in the creation of huge databases

Definition
Background
It is the process of extracting knowledge from large amounts of data,
 or simply knowledge discovery in databases
 It is the finding of interesting patterns in data

Data mining tool
Methodology


Oracle10g database release 1 was installed and configured
Oracle Data Miner 10g (ODM) was also installed and configured for use with the database
Algorithms in ODM
Methodology
Classification
 Adaptive Bayes Network
 Naive Bayes
 Model Seeker
Association rules
 Apriori
Clustering
 k-Means
 O-Cluster
Clustering Algorithms
Methodology



Clustering algorithms identify naturally occurring groupings within the data population.
K-Means
 Minimum Error Tolerance and Maximum Iterations
 Maximum number of Clusters (k)
O-Cluster
 Sensitivity
 Maximum number of Clusters (k)
Dataset used
Methodology
Obtained from the Centre for AIDS Development, Research and Evaluation, Institute for Social and Economic Research, Rhodes University
 Based on a questionnaire survey
 HIV/AIDS related
 Tsha Tsha - HIV/AIDS awareness program
Dataset used
Methodology
2 datasets were put into database tables
 TSHA_TSHA_BUILD1: 500 records, used to build and test models
 TSHA_TSHA_APPLY1: 399 records, used to validate models

Methodology
Determining model accuracy
 Confidence is a measure of the homogeneity of the cluster; that is, how close together the cluster members are
 Support is a measure of the relative size of a cluster (the total need not be 1.00); the higher the value, the larger the cluster
Methodology
Building and Testing the Models
20 models were built in total
 The building was done in 2 phases:
1) Distinct number of clusters
2) Equal number of clusters
Algorithm settings were chosen by trial and error
Methodology settings
1st phase model building
Methodology
1st phase model Accuracy
Methodology
2nd phase model building
To overcome the problem (bias):
 I decided to set k, the maximum number of clusters, to a fixed value.
 I set k to 7 for all clusters built in this phase
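To illustrate what the K-Means settings control, here is a minimal Python sketch of textbook k-means with a fixed number of clusters k, a maximum iteration count and a minimum error tolerance. It is not ODM's implementation; the function name, the default values and the toy points are assumptions made for this example.

```python
import math
import random


def k_means(points, k, max_iterations=20, min_error_tolerance=0.01, seed=0):
    """Textbook k-means sketch (hypothetical helper, not the ODM algorithm).

    Stops when the largest centroid movement is at most min_error_tolerance
    or when max_iterations is reached, whichever comes first.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k starting centroids from the data
    assignments = [0] * len(points)
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        shift = max(math.dist(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= min_error_tolerance:
            break
    return centroids, assignments


# Toy usage with made-up 2-D points (the project itself used k = 7).
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_centroids, toy_assignments = k_means(toy_points, k=2)
```

Fixing k for every build, as in this phase, means that only the remaining settings vary between models.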

Methodology
2nd phase model results
Methodology
Applying the best models

The most accurate models,
 BUILD3_OC_TSHATSHA2 from O-Cluster
 BUILD5_KM_TSHATSHA2 from K-Means
were applied to the new data in TSHA_TSHA_APPLY1
Methodology
Determining Cluster Quality
Adopt and implement the evaluation technique of [Roiger et al, 2003]
 It involves employing supervised learning to evaluate unsupervised learning.
 Decided to use classification (ABN):
  ODM has classification algorithms
  The ABN algorithm was identified as the most accurate in previous research
MethodologyTechnique
Supervised Learning for
Unsupervised Model Evaluation



Designate each formed cluster as a class and
assign each class an arbitrary name.
Choose a random sample of instances from
each class for supervised learning.
Build a supervised model from the chosen
instances. Employ the remaining instances to
test the correctness of the model.
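The steps above can be sketched in Python as follows. This is only an illustration: a nearest-centroid classifier stands in for ABN, and the function name, the 50% sample size and the toy records are assumptions, not part of the project.

```python
import math
import random
from collections import defaultdict


def evaluate_clusters(records, cluster_ids, train_fraction=0.5, seed=0):
    """Score a clustering by how well a supervised model can relearn it.

    Hypothetical helper: nearest-centroid is a simple stand-in for ABN.
    Returns the fraction of held-out instances classified into their cluster.
    """
    rng = random.Random(seed)

    # Step 1: designate each cluster as a class with an arbitrary name.
    by_class = defaultdict(list)
    for record, cid in zip(records, cluster_ids):
        by_class[f"class_{cid}"].append(record)

    # Step 2: choose a random sample of instances from each class.
    train, test = [], []
    for label, members in by_class.items():
        shuffled = members[:]
        rng.shuffle(shuffled)
        cut = max(1, int(len(shuffled) * train_fraction))
        train += [(r, label) for r in shuffled[:cut]]
        test += [(r, label) for r in shuffled[cut:]]

    # Step 3a: build a supervised model from the chosen instances.
    grouped = defaultdict(list)
    for r, label in train:
        grouped[label].append(r)
    centroids = {
        label: tuple(sum(dim) / len(rs) for dim in zip(*rs))
        for label, rs in grouped.items()
    }

    def predict(r):
        return min(centroids, key=lambda label: math.dist(r, centroids[label]))

    # Step 3b: use the remaining instances to test the model.
    if not test:
        return float("nan")
    correct = sum(1 for r, label in test if predict(r) == label)
    return correct / len(test)


# Toy usage with made-up records and cluster IDs.
toy_records = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
               (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
toy_cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1]
toy_accuracy = evaluate_clusters(toy_records, toy_cluster_ids)
```

A high score suggests the clusters are well separated enough for a classifier to reproduce them; a low score suggests the cluster boundaries are hard to relearn.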
MethodologyTechnique
Apply ABN model to
remaining instances
Build Classification model
Using ABN
Methodology
Comparison of ClusterIDs
CLASSIFICATION TABLE    vs  CLUSTER TABLE
OC_APPLY_ABN            vs  APPLY_OC3_TSHATSHA (remaining instances, O-Cluster model results)
KM_APPLY_ABN            vs  APPLY_KM5_TSHATSHA (remaining instances, K-Means model results)
Methodology
Comparison of ClusterIDs
DATA SOURCE              ClusterIDs in BOTH TABLES    PERCENTAGE of ClusterIDs in both models
For O-Cluster results    42 out of 107                39%
For K-Means results      18 out of 107                17%
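The percentages are the share of ClusterIDs that appear in both tables, rounded to the nearest whole number. A small Python sketch of that arithmetic, with hypothetical function names:

```python
def percentage(matches, total):
    """Share of matching ClusterIDs, rounded to a whole percent."""
    return round(100 * matches / total)


def cluster_id_agreement(cluster_ids, classified_ids):
    """Count positions where the cluster table and the classification table agree."""
    if len(cluster_ids) != len(classified_ids):
        raise ValueError("both tables must cover the same instances")
    matches = sum(1 for a, b in zip(cluster_ids, classified_ids) if a == b)
    return matches, len(cluster_ids)


print(percentage(42, 107))  # O-Cluster: 39
print(percentage(18, 107))  # K-Means: 17
```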
defining predictors
Determining HIV Predictors

HIV/AIDS predictors of prevention behavior are attributes within our dataset that influence an individual to:
 (A) use a condom when he/she decides to be sexually active
 (B) abstain from sexual intercourse for at least a year
 (C) have fewer sexual partners.
Methodology
Determining HIV Predictors

2 techniques were used to find these predictors:
 Distinguishing the clusters found by the O-Cluster model
 Employing association rules (Apriori)
Applied to 2 datasets:
 The clusters found by the O-Cluster model
 The dataset the O-Cluster model was applied to.

predictors found
Determining HIV Predictors
On distinguishing the clusters found, the attributes HIV test and Know Aids were identified as predictors of condom use and abstinence
 From the associations, the attributes HIV test and Talk openly were identified as predictors of condom use.
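Association rules of the kind Apriori produces are usually ranked by support and confidence. The sketch below shows how those two measures are computed for a single rule. The records are invented purely to exercise the function and are not taken from the Tsha Tsha dataset; the function name and item names are hypothetical.

```python
def rule_metrics(records, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent.

    records is a list of sets of items; support is the share of all records
    containing both sides, confidence is the share of antecedent records
    that also contain the consequent.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for r in records if antecedent <= r)
    n_both = sum(1 for r in records if (antecedent | consequent) <= r)
    support = n_both / len(records)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Made-up records, for illustration only.
toy_records = [
    {"hiv_test", "condom_use"},
    {"hiv_test", "condom_use", "talk_openly"},
    {"hiv_test"},
    {"talk_openly"},
]
support, confidence = rule_metrics(toy_records, {"hiv_test"}, {"condom_use"})
```

On these toy records the rule holds in 2 of 4 records (support 0.5) and in 2 of the 3 records that contain the antecedent (confidence about 0.67).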

The predictors
Determining HIV Predictors
HIV test – whether one has had an HIV test
 Know Aids – whether one knows about AIDS
 Talk openly – whether one talks openly about HIV/AIDS

Regarding the evaluation
Conclusions

The O-Cluster algorithm produced the most effective model:
 Accuracy of 95.5%
 When applied to new data: 39%
Most effective model by K-Means:
 Accuracy of 86.9%
 When applied to new data: 17%
Regarding ODM Algorithms
Conclusions
Classification
1. Adaptive Bayes Network
2. Naive Bayes
3. Model Seeker
Clustering
1. k-Means
2. O-Cluster
Association rules
1. Apriori
observations
Conclusions
Model accuracy gives some indication of how a model will perform on new data
 It is therefore recommended to find the most accurate model in order to get accurate results
