The Data Mining Course


Dr. Russell Anderson
Dr. Musa Jafar
West Texas A&M University

Data mining: the process of discovering useful information in large data repositories (Tan, P-N., Steinbach, M., and Kumar, V., Introduction to Data Mining, Addison-Wesley, 2006).

Discovered information should be:
  Valid
  Previously unknown
  Actionable

Seven objectives from Lenox and Cuff (2002), based on the ACM 2001 Ironman Report:
  Prepare and warehouse data
  Process data based on a set of DM algorithms
  Analyze results
  Make predictions
  Select the proper algorithm
  Make application
  Be motivated to continue graduate studies in DM
We have added:
  Get to know the data using statistical analysis tools
  Use visualization tools for analysis and review

1. Get to know the data.
2. Select an appropriate data mining algorithm based on the data and the mining objective.
3. Construct a model using the selected algorithm.
4. Analyze the results.
5. Make application.

How is it structured?
  Single table/flat file
  Multi-table with relationships
Number of observations
Number of dimensions (attributes)
Compute summary statistics using a tool such as MS-Excel
Visually evaluate characteristics of the data
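
As a rough illustration of the "get to know the data" checks above (observation count, dimension count, and per-attribute summary statistics), here is a minimal Python sketch. The course itself names MS-Excel for this step; the `summarize` function and the toy records below are hypothetical and only show the kind of numbers one would look at.

```python
import statistics

def summarize(rows):
    """Summarize a list of dict records (one dict per observation)."""
    first = rows[0] if rows else {}
    summary = {"observations": len(rows), "dimensions": len(first), "attributes": {}}
    for name in first:
        values = [row[name] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric attribute: report range, mean, and sample standard deviation.
            summary["attributes"][name] = {
                "type": "numeric",
                "min": min(values),
                "max": max(values),
                "mean": statistics.mean(values),
                "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
            }
        else:
            # Discrete attribute: report the number of distinct values.
            summary["attributes"][name] = {
                "type": "discrete",
                "cardinality": len(set(values)),
            }
    return summary

# Hypothetical toy records, not taken from the course data sets.
records = [
    {"cap_diameter": 5.0, "odor": "none"},
    {"cap_diameter": 7.0, "odor": "foul"},
    {"cap_diameter": 9.0, "odor": "none"},
]
report = summarize(records)
```

Here `report["observations"]` is the number of rows and `report["attributes"]` holds one entry per column, which matches the questions listed above.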

Tools developed:
  Correlation Matrix
  Scatter Plot
  Parallel Coordinate Plot
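
The authors' correlation matrix tool is not shown in the slides. As a hedged sketch of what such a tool computes, the following Python code builds a Pearson correlation matrix for a few numeric attributes. The function names and the toy columns are assumptions made for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / (spread_x * spread_y)

def correlation_matrix(columns):
    """Map each pair of attribute names to their Pearson correlation."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical toy columns, not from the course data sets.
columns = {
    "height": [1.0, 2.0, 3.0, 4.0],
    "weight": [2.0, 4.0, 6.0, 8.0],
    "age": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(columns)
```

Values near +1 or -1 in `matrix` flag attribute pairs worth inspecting in a scatter plot.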

Distributions of data
  Data ranges of numeric attributes
  Cardinality of discrete attributes
  Shape of distribution
    Skewed
    Multimodal
Location of outliers
Identification of possible relationships between attributes
Identification of subpopulations within the data
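
The slides approach skew and outliers visually. A numeric counterpart can be sketched as follows, assuming population skewness as the shape measure and the common 1.5 x IQR rule for outliers. Both choices are assumptions for this example, not methods named in the presentation.

```python
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical sample with one extreme value on the right.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 40]
outliers = iqr_outliers(sample)
shape = skewness(sample)
```

A positive `shape` indicates a right-skewed attribute, and `outliers` lists the observations that would stand apart in a plot.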

Microsoft Business Intelligence Tools
  Association Analysis – aka market basket analysis
  Classification
    Decision Trees
    Artificial Neural Network
    Bayesian Analysis
  Regression
  Cluster Analysis
Custom Tools with Embedded Visual Presentation
  Artificial neural network for both classification and regression
  Self-Organizing Map (SOM) for cluster analysis





Purpose of each methodology
Steps of underlying algorithm
Data types supported
Issues in construction and application
  Parameter settings
  Results interpretation



Does the model fit the training data too well?
Need to separate the available data into training and validation subsets.
A visual view of training progress is valuable.
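
One way to make the training/validation separation concrete is the sketch below. It assumes a simple random holdout split and per-epoch error lists recorded during training. The function names, the 30% fraction, and the error values are hypothetical.

```python
import random

def train_validation_split(records, validation_fraction=0.3, seed=0):
    """Shuffle the records and hold out a fraction for validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]

def best_epoch(validation_errors):
    """Index of the epoch with the lowest validation error."""
    return min(range(len(validation_errors)), key=validation_errors.__getitem__)

train, validation = train_validation_split(range(10))

# Hypothetical error curves: training error keeps falling, while
# validation error starts rising after epoch 2 (a sign of overfitting).
training_errors = [0.50, 0.35, 0.25, 0.18, 0.12]
validation_errors = [0.52, 0.40, 0.35, 0.37, 0.41]
stop_at = best_epoch(validation_errors)
```

Plotting both error lists against the epoch number gives the kind of visual view of training progress mentioned above: the point where the two curves diverge is where the model begins to fit the training data too well.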

Mushroom edibility classifiers

Classifier A
                    Predicted
Actual         Edible    Poisonous
Edible          38%         8%
Poisonous        0%        54%

Classifier B
                    Predicted
Actual         Edible    Poisonous
Edible          44%         2%
Poisonous        1%        53%
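
The two mushroom classifiers can be compared numerically with a short sketch. It assumes the rows of each table are actual classes and the columns are predicted classes, with each cell given as a percentage of all mushrooms. That orientation is an assumption: the flattened slide text does not make it certain. Overall accuracy does not depend on it, but the poisonous-predicted-edible share does.

```python
# Cells are percentages of all mushrooms, keyed as (actual, predicted).
# Assumption: table rows are actual classes and columns are predicted classes.
classifier_a = {
    ("edible", "edible"): 38, ("edible", "poisonous"): 8,
    ("poisonous", "edible"): 0, ("poisonous", "poisonous"): 54,
}
classifier_b = {
    ("edible", "edible"): 44, ("edible", "poisonous"): 2,
    ("poisonous", "edible"): 1, ("poisonous", "poisonous"): 53,
}

def accuracy(matrix):
    """Share of mushrooms whose predicted class equals the actual class."""
    correct = sum(v for (actual, predicted), v in matrix.items() if actual == predicted)
    return correct / sum(matrix.values())

def poisonous_predicted_edible(matrix):
    """Share of all mushrooms that are poisonous but predicted edible."""
    return matrix[("poisonous", "edible")] / sum(matrix.values())

accuracy_a, accuracy_b = accuracy(classifier_a), accuracy(classifier_b)
risk_a = poisonous_predicted_edible(classifier_a)
risk_b = poisonous_predicted_edible(classifier_b)
```

Under that assumption, Classifier B has the higher overall accuracy, while Classifier A never labels a poisonous mushroom as edible, so the better choice depends on the cost of each kind of error.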




Black box: models built using sophisticated methodologies (ANNs, for example) perform very well, but gaining an understanding of the model itself is difficult.
  Contribution of individual input attributes
  Nature of contribution (shape of curve)
  Interaction between input attributes
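
The slides do not say how the authors' tools expose these three properties. One generic technique for probing a black-box model is a one-at-a-time sweep: vary a single input over a range while holding the others at baseline values, then look at the shape of the resulting curve. The sketch below assumes the model is any callable that takes a dict of inputs. `toy_model` is a hypothetical stand-in, not a trained network.

```python
def sweep_attribute(model, baseline, attribute, values):
    """Vary one input attribute while holding the others at their baseline."""
    curve = []
    for value in values:
        inputs = dict(baseline)  # copy so the baseline is left unchanged
        inputs[attribute] = value
        curve.append((value, model(inputs)))
    return curve

def toy_model(inputs):
    """Hypothetical stand-in for a trained black-box model."""
    x1, x2 = inputs["x1"], inputs["x2"]
    return x1 ** 2 + 0.5 * x2 + x1 * x2  # includes an x1 * x2 interaction term

baseline = {"x1": 1.0, "x2": 2.0}
curve_x1 = sweep_attribute(toy_model, baseline, "x1", [0.0, 1.0, 2.0])
curve_x1_other = sweep_attribute(toy_model, {"x1": 1.0, "x2": 0.0}, "x1", [0.0, 1.0, 2.0])
```

If the curve for `x1` changes shape when the baseline for `x2` changes, as `curve_x1` and `curve_x1_other` do here, the two attributes interact.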

For a detailed presentation of the mechanics of the software deployed, attend our workshop tomorrow morning.
Saturday: 8-10 AM
Kachina A



Microsoft SQL Server Business Intelligence Studio
Visualization Tools