Evaluation of MineSet 3.0

Download Report

Transcript Evaluation of MineSet 3.0

Evaluation of MineSet 3.0
By
Rajesh Rathinasabapathi
S Peer Mohamed Raja
Guided By
Dr. Li Yang
MineSet
•
•
•
•
Introduction
Problem in existing analytical tools
MineSet Client/Server Architecture
MineSet Enterprise Manager
INTRODUCTION
• Product of Silicon Graphics Inc.
• Supported by
Windows NT 4.0 (Server & Client)
Windows 95 & 98 (Client)
Memory varies with the size of data
64MB RAM
1024  768 Resolution with 65K colors
IRIX 6.4 and above (for server parellelization)
MineSet
• Helps to pinpoint and understand the
complex patterns, relationships, and
anomalies that are implicitly present in your
data.
Problems in Existing Tools
• You must specify directly any relationships
between data elements.
• Example
Query for all the sales by region.
• Presupposes you have an idea that sales
vary by region.
• Relationships may be uncovered that you
did not know existed.
MineSet
MineSet Client/Server
Architecture
• Client and Server can be on a same system
or on a different system
• Server responsibilities
 Accessing Data files
 Data Transformations (Data Mover)
 Running Mining operations (Classification,
Association, etc…)
 Generating visualization files
MineSet Client/Server
Architecture
• Client’s Responsibility
Providing GUI
• Integration with other systems
– Support for ODBC complaint database
for example SQL Server, DB2, Oracle ,Sybase
– Open Architecture allows you to coexist with
other tools. For example SAS
– inegrate with web using hotlinks
– Custom Algorithms
MineSet Enterprise Manager
•
•
•
•
•
MineSet Tool Manager
MineSet 3D Visualizer
MineSet Cluster Visualizer
MineSet Record Visualizer
MineSet Statistics Visualizer
MineSet Tool Manager
• Data Access and Data Transformation.
• Data Destinations.
MineSet Tool Manager
Basic Transformations
•
•
•
•
•
•
•
Adding New Columns
Removing existing Columns
Aggregation
Filtering
Sampling
Binning
Apply Classifier
Adding New Columns
Addition of new columns is possible to the existing
dataset. Columns added can be derived from existing
column by using expressions.
Removing Existing Columns
Removing columns that are not persistent, are
redundant, or contain obvious, uninteresting
predictors.
Aggregation
Grouping records together and finding the sum,
maximum, minimum, or average.
Filtering Visualization
To view strongest rules or the most profitable
customer segments
Sampling
Sampling the data to get a random subset of the data
Binning
Breaking up of continuous range of data into discrete
segments
Apply Classifier
Data Mining Tools
•Association.
•Classification.
•Cluster.
•Regression.
•Column Importance.
Column Importance
Column importance helps one to discover
which are the most important columns in
predicting different values for a label
column one chooses. This unlike
clustering lets one to decide which label
one will use to determine the importance
of columns.
Column Importance
Options when finding
column importance
•One can specify Num of
columns to find.
•Either to use weights or
not.
• Specify the weight.
•No of additional
importance columns.
•Specify purity of the
columns present on right
or left.
Association Rules
Options
•Confidence (1-100)
•Support (1-100)
•Use weights or not.
•Unlimited items per
rule / the no of items
per rule.
•Height –Bars.
•Height – Disks.
•Color – Bars.
•Color – Disks.
•Label – Bars.
Association Rules
Association Rules
Interpreting association rules in Scatter
Visualizer:
•The LHS represents items in this axis.
•The RHS represents items in this axis.
•Bar height corresponds to support.
•Bar colors represent lift.
•By pointing on the object on the bar one can
get the specifications of the bar.
Clustering
•Single K-means Default method
•Iterative K-means
Clustering
•Single K-means Default method
In single K means clustering one
specifies the number of clusters
•Iterative K-means
In iterative K – means one
specifies the minimum, maximum no
of clusters
Clustering
Options present in creating clusters
are :
•The distance measure (Euclidean /
Manhattan).
•The number of iterations.
•The Random seeds.
•The Random seeds.
Clustering
Clustering
 The orders in which attributes are
displayed represent the importance of the
attributes.
 The population shows the default settings.
 Every column represents the different
cluster.
 On clicking each column at the top its
attribute importance is shown.
 Each box represents the max, min, median
and deviation of the values in them.
Classifier
Classification is the task of assigning
a discrete label value to an unlabeled
record
Different modes :
•Classifier and Error
•Classifier Only
•Estimate Error
•Learning Curve
Classifier
Classifier Mode
Classifier mode uses all the available data to build the
classifier. It is useful when you are not concerned with error
estimation.
Classifier and Error
It uses the Holdout Error Estimation. Instead of using
all the data to build the model, you can hold out the part of
the data as a training set to induce the classifier. The
classifier and error mode automatically partitions the data
set into independent training and test subsets.
Holdout ratio/ Random seed.
Classifier
Error Estimate
It uses the Cross Validation Error Estimation. Crossvalidation is used for building the final classifier or for small
datasets. Cross-validation is a method for getting a more
precise estimate of error.
Learning Curve
The Learning Curve shows the error of the classifier
generated by an inducer in proportion to the number of
records used to create the classifier.
Classifier
Classifier
The classification process can be
induced by the following methods
•Decision Tree
•Option Tree
•Evidence
•Decision Table