Project Proposal presentation (10 min)


Data Mining with Oracle using Classification and Clustering Algorithms
Proposed and Presented by
Nhamo Mdzingwa
Supervisor: John Ebden
Presentation Outline
• Problem Statement
• Objective
• Background
• Expected Results
• Possible Extensions
• Plan of Action
• Timeline
• Literature Survey
• Questions
Problem Statement
• The commercial world is reacting fast to the growth and potential of the data mining (DM) area: a wide range of tools is being marketed as DM suites.
• Examples of these are:
  - Oracle DM
  - DB2’s Intelligent Miner
  - Informix’s Data Mine
  - SQL Data miner
  - GhostMiner
  - Clementine 9.0 (SPSS)
  - SAS
  - Gornish systems, etc.
Problem
• It is vital to know which algorithms a DM suite uses and which algorithm to apply to a particular data set.
• It is equally important to know how well each algorithm performs in terms of accuracy, efficiency and effectiveness within a particular DM suite, e.g. Oracle DM.
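To make "accuracy" and "efficiency" concrete, here is a minimal sketch in plain Python, not Oracle DM code. It assumes accuracy means the fraction of correct predictions on data with known labels and efficiency means wall-clock run time. The helper names (`accuracy`, `timed`, `majority_class`) and the toy labels are hypothetical and chosen only for illustration.

```python
import time


def accuracy(predicted, actual):
    """Fraction of predictions that match the known labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def timed(func, *args):
    """Run func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def majority_class(train_labels, n_test):
    """Hypothetical baseline: always predict the most common training label."""
    most_common = max(set(train_labels), key=train_labels.count)
    return [most_common] * n_test


# Made-up labels, for illustration only.
train_labels = ["yes", "no", "yes", "yes"]
test_labels = ["yes", "no", "yes"]

predictions, seconds = timed(majority_class, train_labels, len(test_labels))
score = accuracy(predictions, test_labels)
print(f"accuracy={score:.2f}, time={seconds:.6f}s")
```

Any algorithm under study could be passed to `timed` in place of the baseline, so that both measures come from the same run.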
Objective
• Investigate two types of algorithm available in Oracle Data Mining (ODM): classification and clustering.
• Apply the two types of algorithm to actual data.
• Analyse and evaluate the results in terms of performance.
What is Data Mining? (Background)
• Simply put, DM is knowledge discovery.
• DM is the process of automatically discovering hidden patterns and relationships within enormous amounts of data.
• It is a powerful new technology that allows businesses to make proactive, knowledge-driven decisions, as it tries to predict the future.
• The data (which represents knowledge) is normally stored in databases and data warehouses (typical size in terabytes).
• Automatic discovery is implemented by the algorithms that DM suites provide.
E.g. Oracle offers:
  - Adaptive Bayes Network, supporting decision trees (classification)
  - Naive Bayes (classification)
  - Model Seeker (classification)
  - k-Means (clustering)
  - O-Cluster (clustering)
  - Predictor Variance (attribute importance)
  - Apriori (association rules)
• Algorithms are grouped into supervised and unsupervised learning strategies.
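As an illustration of the supervised case, the sketch below trains a small Naive Bayes classifier on rows that each carry a known output label. It is a textbook-style version in plain Python with add-one (Laplace) smoothing, which is an assumption of this sketch. It is not Oracle's implementation, and the function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(attribute_index, label)][value] -> count
    value_counts = defaultdict(Counter)
    # Distinct values seen per attribute, needed for add-one smoothing.
    distinct = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            distinct[i].add(value)
    return class_counts, value_counts, distinct


def predict(model, row):
    """Return the label with the highest prior times likelihood."""
    class_counts, value_counts, distinct = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total  # prior P(label)
        for i, value in enumerate(row):
            # P(value | label), smoothed so unseen values do not zero the product
            seen = value_counts[(i, label)][value]
            score *= (seen + 1) / (count + len(distinct[i]))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data: input attributes (outlook, windy), output attribute = label.
rows = [("sunny", "no"), ("sunny", "no"), ("sunny", "yes"),
        ("rainy", "yes"), ("rainy", "yes"), ("rainy", "no")]
labels = ["play", "play", "play", "stay", "stay", "stay"]

model = train_naive_bayes(rows, labels)
print(predict(model, ("sunny", "yes")))
print(predict(model, ("rainy", "no")))
```

The labels are what make this supervised: training needs the output attribute for every row, and prediction fills it in for new rows.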
DM strategies
• Unsupervised learning: input attributes but no output attributes
  - Clustering: k-Means, O-Cluster
• Supervised learning: input attributes and one or more output attributes
  - Classification: Naive Bayes, Model Seeker, Adaptive Bayes
  - Estimation
  - Prediction
  - Predictor Variance
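For the unsupervised case, a minimal k-means sketch in plain Python shows clustering on input attributes only, with no output attribute. It is a generic version of the algorithm, not Oracle's implementation. The function name, the choice to start from the first k points, the fixed iteration count and the toy points are all assumptions made for illustration.

```python
import math


def k_means(points, k, iterations=10):
    """Plain k-means on tuples of numbers; starts from the first k points."""
    centroids = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centroids


# Made-up 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

assignment, centroids = k_means(points, k=2)
print(assignment)
print(centroids)
```

No labels are supplied anywhere: the groups come only from distances between the input attributes, which is what separates clustering from classification.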
The data mining process involves a series of steps to define a business problem, gather and prepare the data, build and evaluate mining models, and apply the models and disseminate the new information.
Expected Results
• Aim to say conclusively which algorithm is the most effective and suitable for data mining on a given dataset, since datasets differ.
Possible Extensions to the Project:
• Testing the same algorithms with tools offered by other vendors, e.g. testing with the DM suite in SQL and checking whether the results are similar.
• If they are not, investigating why the results differ could be another extension.
Plan of Action
• Carry out a literature search, mainly to obtain background knowledge and an understanding of the field.
• Get to know the Oracle DM suite:
  - Do the DM tutorials provided by Oracle.
  - The server Ora1 is the machine I’ll be working with.
  - It already has JDeveloper, the Oracle 10g database and Oracle 9i DM installed.
Timeline
• Done: Continuation from literature and tutorials
• 2nd term, 15 to 30 April: Investigate Clustering & Classification algorithms (theory)
• 2nd term, end of May: Find suitable computerised case studies of the use of the above algorithms, with or without Oracle
• 2nd term, end of May: Search databases for testing (possibilities: AIDS data & faculty data)
• Second semester: Apply algorithms to the data found, then critically analyse & assess results
• September vacation and 3rd term: Write up paper
• Due 7/11: Final project write-up
Literature Survey
• Richard J. Roiger and Michael W. Geatz, Data Mining: A Tutorial-Based Primer. Boston, Massachusetts, Addison Wesley, 2003.
  This book will provide the background and practical knowledge required for the project research, and also presents different data mining methodologies that may be useful.
• David Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining. Cambridge, Massachusetts, MIT Press, 2001.
• Jesus Mena, Data Mining Your Website. Digital Press, 1999.
• Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques. San Francisco, California, Morgan Kaufmann, 2001.
• Robert P. Trueblood and John N. Lovett, Jr., Data Mining and Statistical Analysis Using SQL. USA, Apress.
• http://www.lc.leidenuniv.nl/awcourse/oracle/datamine.920/a95961/preface.htm
• http://www.oracle.com/technology/products/oracle9i/htdocs/o9idm_faq.html
• http://fas.sfu.ca/cs/research/groups/DB/sections/publication/kdd/kdd.html
Questions?
Thank you