2nd Presentation

An Investigation of Commercial Data Mining
Presented by Emily Davis
Supervisor: John Ebden

Outline
• Data Mining Classification
• Data Mining Algorithms
• Choice of Technique
• Data Mining Process
• Evaluation of Results
• Oracle Data Mining
• Progress

Data Mining Classification
• Directed data mining builds a model that describes one particular variable in terms of the rest of the data.
  • Includes: classification, estimation and prediction.
• Undirected data mining builds a model to establish the relationships amongst all the variables.
  • Includes: affinity grouping or association discovery, clustering, and description or visualization.

Data Mining Algorithms
• Clustering: Groups instances of data into classes and allows for the discovery of structures in the data.
• Neural networks: Segment the state space of the data with gradients or sloping lines.
• Estimation: Determines the value of an unknown numerical output attribute.
• Prediction: Determines future outcomes of data (similar to estimation).
• Classification: Assigns new instances of data to categorical classes.
• Association discovery: Discovers associations between data fields (includes market basket analysis).
• Decision trees: Use data-splitting rules to split the data, then apply further splitting rules to the resulting subsets.
• Association rules: Use rule induction to generate patterns relating business goals to other data fields. The patterns are generated as trees with splits on data fields.

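Association discovery in the market basket sense is commonly quantified with support and confidence. These two measures are standard in the literature but are not defined on the slides, so the following is only a minimal sketch: the class name, helper methods and the four sample baskets are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: names and sample data are assumptions, not part of the presentation.
class MarketBasketSketch {

    /** Fraction of baskets that contain every item in itemSet. */
    static double support(List<Set<String>> baskets, Set<String> itemSet) {
        long hits = baskets.stream().filter(b -> b.containsAll(itemSet)).count();
        return (double) hits / baskets.size();
    }

    /** Confidence of the rule antecedent -> consequent: support(both) / support(antecedent). */
    static double confidence(List<Set<String>> baskets,
                             Set<String> antecedent, Set<String> consequent) {
        Set<String> both = new HashSet<>(antecedent);
        both.addAll(consequent);
        // Yields NaN when the antecedent never occurs (0.0 / 0.0).
        return support(baskets, both) / support(baskets, antecedent);
    }

    /** Convenience helper to build one basket from item names. */
    static Set<String> basket(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    public static void main(String[] args) {
        // Hypothetical transactions, one set of items per basket.
        List<Set<String>> baskets = Arrays.asList(
                basket("bread", "butter", "milk"),
                basket("bread", "butter"),
                basket("bread", "jam"),
                basket("milk", "jam"));

        System.out.println("support(bread, butter) = "
                + support(baskets, basket("bread", "butter")));
        System.out.println("confidence(bread -> butter) = "
                + confidence(baskets, basket("bread"), basket("butter")));
    }
}
```

In this toy data, bread and butter appear together in two of four baskets, and butter appears in two of the three baskets that contain bread.
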
Choosing a Technique
• Supervised learning:
  • set of input and output data
  • clear explanation of results
• Association rules:
  • input and output data have interesting interactions
• Decision trees:
  • known which attributes best define the data
  • faster
• Clustering and neural networks:
  • all attributes are of equal importance
  • perform well on noisy data (neural networks)
• When increased accuracy is required, create multiple models using the same data mining technique until the optimal model is created.

Data Mining Process
• Too much focus on the automatic techniques.
• Not enough focus on the exploration and analysis of the problem and the data.
• Common to all the presented processes:
  • Thorough data preparation and exploration
  • Interpretation and validation of the resulting models

Evaluating the Output
• Evaluation of supervised learning models involves determining the level of predictive accuracy.
• Evaluated using test data sets.
• Compare error rates of models created from the same training data to determine accuracy.

Model A          Model Accept   Model Reject
Actual Accept    600            25
Actual Reject    75             300

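The Model A confusion matrix can be turned into an accuracy and an error rate with a few lines of arithmetic. The sketch below assumes the matrix is read with actual classes as rows and model decisions as columns; the class and method names are hypothetical.

```java
// Illustrative sketch only: class and method names are assumptions.
class ConfusionMatrixSketch {

    /** Accuracy = correctly classified instances / all instances. */
    static double accuracy(int acceptAsAccept, int acceptAsReject,
                           int rejectAsAccept, int rejectAsReject) {
        int total = acceptAsAccept + acceptAsReject + rejectAsAccept + rejectAsReject;
        return (double) (acceptAsAccept + rejectAsReject) / total;
    }

    /** Error rate = misclassified instances / all instances. */
    static double errorRate(int acceptAsAccept, int acceptAsReject,
                            int rejectAsAccept, int rejectAsReject) {
        int total = acceptAsAccept + acceptAsReject + rejectAsAccept + rejectAsReject;
        return (double) (acceptAsReject + rejectAsAccept) / total;
    }

    public static void main(String[] args) {
        // Model A: actual accept -> 600 accepted, 25 rejected;
        //          actual reject -> 75 accepted, 300 rejected.
        System.out.println("accuracy   = " + accuracy(600, 25, 75, 300));
        System.out.println("error rate = " + errorRate(600, 25, 75, 300));
    }
}
```

For Model A this gives (600 + 300) / 1000 = 90% accuracy and (25 + 75) / 1000 = 10% error. A second model built from the same training data would be compared on the same two figures.
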



• When evaluating numerical output, the percentage of correct predictions does not apply; use error measures instead.
• Mean absolute error = average absolute difference between computed and desired outcome.
• Mean squared error = average squared difference between computed and desired outcome.

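Both error measures for numerical output are simple averages over the test instances. The following sketch shows one way to compute them; the class name, method names and the four sample values are hypothetical.

```java
// Illustrative sketch only: names and sample values are assumptions.
class ErrorMeasuresSketch {

    /** Average of |computed - desired| over all instances. */
    static double meanAbsoluteError(double[] computed, double[] desired) {
        checkSameLength(computed, desired);
        double sum = 0.0;
        for (int i = 0; i < computed.length; i++) {
            sum += Math.abs(computed[i] - desired[i]);
        }
        return sum / computed.length;
    }

    /** Average of (computed - desired)^2 over all instances. */
    static double meanSquaredError(double[] computed, double[] desired) {
        checkSameLength(computed, desired);
        double sum = 0.0;
        for (int i = 0; i < computed.length; i++) {
            double diff = computed[i] - desired[i];
            sum += diff * diff;
        }
        return sum / computed.length;
    }

    private static void checkSameLength(double[] a, double[] b) {
        if (a.length == 0 || a.length != b.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
    }

    public static void main(String[] args) {
        double[] computed = {2.5, 0.0, 2.0, 8.0}; // model output (hypothetical)
        double[] desired  = {3.0, -0.5, 2.0, 7.0}; // known correct values (hypothetical)
        System.out.println("MAE = " + meanAbsoluteError(computed, desired));
        System.out.println("MSE = " + meanSquaredError(computed, desired));
    }
}
```

Squaring the differences makes the mean squared error penalise a few large misses more heavily than the mean absolute error does, which is the usual reason for reporting both.
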
Cumulative Gains Chart
Evaluating unsupervised learning models using supervised learning
• Perform clustering.
• Each cluster is thought of as a class and assigned a name.
• Random samples are chosen from instances of each class.
• A supervised model is then built with the class names as output.
• The random samples are the training set.
• The remaining instances are used to test the accuracy of the clustering model.

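The procedure above can be sketched in code under some stated assumptions. Clustering is taken as already done, so each instance arrives with a cluster number that serves as its class name. The slides do not say which supervised technique to use; a nearest-centroid classifier is swapped in here purely because it is short. All names and the sample points are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative sketch only: the nearest-centroid classifier is an assumed stand-in
// for whatever supervised technique is actually chosen.
class ClusterEvaluationSketch {

    /** Builds one centroid (mean vector) per class from the training instances. */
    static double[][] train(double[][] points, int[] labels, List<Integer> trainIdx, int classes) {
        int dims = points[0].length;
        double[][] centroids = new double[classes][dims];
        int[] counts = new int[classes];
        for (int i : trainIdx) {
            counts[labels[i]]++;
            for (int d = 0; d < dims; d++) {
                centroids[labels[i]][d] += points[i][d];
            }
        }
        for (int c = 0; c < classes; c++) {
            for (int d = 0; d < dims; d++) {
                // A class with no training instances gets NaN and is never predicted.
                centroids[c][d] = counts[c] > 0 ? centroids[c][d] / counts[c] : Double.NaN;
            }
        }
        return centroids;
    }

    /** Predicts the class whose centroid is closest (squared Euclidean distance). */
    static int predict(double[][] centroids, double[] p) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < p.length; d++) {
                double diff = p[d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /** Samples a training set from each class, trains, and scores the remaining instances. */
    static double testAccuracy(double[][] points, int[] labels, int classes,
                               double trainFraction, long seed) {
        Random rng = new Random(seed);
        List<Integer> trainIdx = new ArrayList<>();
        List<Integer> testIdx = new ArrayList<>();
        for (int c = 0; c < classes; c++) {
            List<Integer> members = new ArrayList<>();
            for (int i = 0; i < labels.length; i++) {
                if (labels[i] == c) members.add(i);
            }
            Collections.shuffle(members, rng); // random sample within this class
            int nTrain = Math.min(members.size(),
                    Math.max(1, (int) Math.round(trainFraction * members.size())));
            trainIdx.addAll(members.subList(0, nTrain));
            testIdx.addAll(members.subList(nTrain, members.size()));
        }
        double[][] centroids = train(points, labels, trainIdx, classes);
        int correct = 0;
        for (int i : testIdx) {
            if (predict(centroids, points[i]) == labels[i]) correct++;
        }
        return (double) correct / testIdx.size();
    }

    public static void main(String[] args) {
        // Hypothetical 2-D instances with cluster numbers assigned by an earlier clustering step.
        double[][] points = {{0, 0}, {0, 1}, {1, 0}, {1, 1}, {10, 10}, {10, 11}, {11, 10}, {11, 11}};
        int[] clusterOf = {0, 0, 0, 0, 1, 1, 1, 1};
        System.out.println("test accuracy = " + testAccuracy(points, clusterOf, 2, 0.5, 42L));
    }
}
```

A high test accuracy suggests that the clusters are well separated enough for a supervised model to recover them from a sample; a low one suggests the clustering may not describe meaningful structure.
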
Measures of interestingness
These include whether the pattern:
• is easily understood
• is valid with a degree of certainty
• is potentially useful
• is novel
• confirms a hypothesis of some kind
• represents knowledge.

Oracle
• Adaptive Bayes Network supporting decision trees (classification)
• Naive Bayes (classification)
• Model Seeker (classification)
• k-Means (clustering)
• O-Cluster (clustering)
• Predictor Variance (attribute importance)
• Apriori (association rules)

ODM
// ODM API imports (package names assumed from the Oracle9i Data Mining Java API;
// check them against the installed release)
import oracle.dmt.odm.DataMiningServer;
import oracle.dmt.odm.data.*;

public class Sample_NaiveBayesBuild_short {
    public static void main(String[] args) {
        System.out.println("Start: " + new java.util.Date());
        DataMiningServer dms = null;
        oracle.dmt.odm.Connection dmsConnection = null;
        try {
            // Create an instance of the data mining server and get a connection.
            // The mining server URL, user name and password need to be specified.
            dms = new DataMiningServer("ora1.ict.ru.ac.za", "system", "emily");
            dmsConnection = dms.login();
            // Create a PhysicalDataSpecification object.
            // First create a LocationAccessData using the table name and schema name.
            LocationAccessData lad = new LocationAccessData("CENSUS_2D_BUILD_UNBINNED", "odm_mtr");
            // Create a NonTransactionalDataSpecification since the data set is nontransactional.
            PhysicalDataSpecification m_PhysicalDataSpecification = new NonTransactionalDataSpecification(lad);
            // (The excerpt shown on the slide ends here.)
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Data Mining for Java (DM4J)

Progress
• Literature survey
• Oracle installed on Ora in COE
• Exploring the Oracle suite, including JDeveloper
• Member of MetaLink (Oracle's online support service)

Addressing the Problem:
• Run the different algorithms available in the data mining suite on sample data using ODM and DM4J.
• Document and evaluate results using the techniques discussed.