The Data Mining Course
Dr. Russell Anderson
Dr. Musa Jafar
West Texas A&M University
Data mining: the process of discovering useful information
in large data repositories (Tan, P.-N., Steinbach, M., and
Kumar, V., Introduction to Data Mining, Addison-Wesley, 2006).
Discovered information should be:
Valid
Previously unknown
Actionable
Seven objectives from Lenox and Cuff (2002), based
on the ACM 2001 Ironman Report:
Prepare and warehouse data
Process data based on a set of DM algorithms
Analyze results
Make predictions
Select proper algorithm
Make application
Motivated to continue graduate studies in DM
We have added
Get to know data using statistical analysis tools
Use visualization tools for analysis and review
1. Get to know the data.
2. Select an appropriate data mining algorithm
   based on the data and the mining objective.
3. Construct a model using the selected
   algorithm.
4. Analyze the results.
5. Make application.
How is it structured?
Single table/flat-file.
Multi-table – relationships
Number of observations
Number of dimensions (attributes)
Compute summary statistics using a tool such as
MS Excel
Visually evaluate characteristics of the data
Tools developed:
Correlation Matrix
Scatter Plot
Parallel Coordinate Plot
Distributions of data
Data ranges of numeric attributes
Cardinality of discrete attributes
Shape of distribution
Skewed
Multi-modal
Location of outliers
Identification of possible relationships between
attributes
Identification of subpopulations within the
data
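The "get to know the data" checks listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the course itself used MS Excel and custom visualization tools, and the attribute names and values below (cap_diameter, stem_height, odor) are made up for the example.

```python
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists.

    Raises ZeroDivisionError if either list is constant.
    """
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


def profile(rows):
    """Summarize a flat table given as a list of dicts (one per observation)."""
    columns = list(rows[0])
    numeric, discrete = {}, {}
    for col in columns:
        values = [row[col] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            numeric[col] = values
        else:
            discrete[col] = values
    return {
        "observations": len(rows),
        "dimensions": len(columns),
        # Data ranges and means of numeric attributes.
        "ranges": {c: (min(v), max(v)) for c, v in numeric.items()},
        "means": {c: statistics.fmean(v) for c, v in numeric.items()},
        # Cardinality (number of distinct values) of discrete attributes.
        "cardinality": {c: len(set(v)) for c, v in discrete.items()},
        # Upper triangle of the correlation matrix.
        "correlations": {
            (a, b): pearson(numeric[a], numeric[b])
            for i, a in enumerate(numeric)
            for b in list(numeric)[i + 1:]
        },
    }


# Hypothetical toy data, invented for this sketch.
rows = [
    {"cap_diameter": 5.0, "stem_height": 7.0, "odor": "none"},
    {"cap_diameter": 6.0, "stem_height": 9.0, "odor": "almond"},
    {"cap_diameter": 7.0, "stem_height": 11.0, "odor": "none"},
    {"cap_diameter": 8.0, "stem_height": 13.0, "odor": "foul"},
]
summary = profile(rows)
print(summary)
```

Shape of distribution and outlier location are easier to judge visually, which is why the plots listed above complement numeric summaries like these.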
Microsoft Business Intelligence Tools
Association Analysis – aka market basket analysis
Classification
Decision Trees
Artificial Neural Network
Bayesian Analysis
Regression
Cluster Analysis
Custom Tools with Embedded Visual Presentation
Artificial neural network for both classification and
regression
Self-Organizing Map (SOM) for cluster analysis
Purpose of each methodology
Steps of underlying algorithm
Data types supported
Issues in construction and application
Parameter settings
Results interpretation
Does the model fit the training data too well?
Need to separate the available data into training and
validation subsets.
Visual view of training progress valuable.
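A minimal sketch of the training/validation idea, assuming a simple random hold-out split. The function names, the 30% validation fraction, and the per-epoch error values are all hypothetical; the course tools may handle this differently.

```python
import random


def split_train_validation(rows, validation_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


def best_epoch(validation_errors):
    """Index of the epoch with the lowest validation error."""
    return min(range(len(validation_errors)), key=validation_errors.__getitem__)


train, validation = split_train_validation(range(10))

# Hypothetical error histories: training error keeps falling, while
# validation error starts rising once the model begins to overfit.
train_err = [0.40, 0.25, 0.15, 0.08, 0.03]
valid_err = [0.42, 0.30, 0.24, 0.27, 0.33]
stop_at = best_epoch(valid_err)
print(len(train), len(validation), stop_at)
```

Plotting both error histories against the epoch number gives the visual view of training progress: the point where the two curves diverge is where the model starts fitting the training data too well.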
Mushroom edibility classifiers
(rows: actual class; columns: predicted class)

Classifier A
                  Predicted
Actual        Edible    Poisonous
Edible          38%         8%
Poisonous        0%        54%

Classifier B
                  Predicted
Actual        Edible    Poisonous
Edible          44%         2%
Poisonous        1%        53%
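The two mushroom classifiers can be compared with a little arithmetic. This sketch assumes the tables are read with the actual class as rows and the predicted class as columns, and it treats "poisonous predicted as edible" as the costly error, which is one plausible reading of the example rather than something the slides state.

```python
# Percentages from the slides, keyed by (actual, predicted).
classifiers = {
    "A": {
        ("edible", "edible"): 38, ("edible", "poisonous"): 8,
        ("poisonous", "edible"): 0, ("poisonous", "poisonous"): 54,
    },
    "B": {
        ("edible", "edible"): 44, ("edible", "poisonous"): 2,
        ("poisonous", "edible"): 1, ("poisonous", "poisonous"): 53,
    },
}


def accuracy(matrix):
    """Percentage of observations whose predicted class equals the actual class."""
    return sum(pct for (actual, predicted), pct in matrix.items()
               if actual == predicted)


def dangerous_errors(matrix):
    """Percentage of observations that are poisonous but predicted edible."""
    return matrix[("poisonous", "edible")]


for name, matrix in classifiers.items():
    print(name, accuracy(matrix), dangerous_errors(matrix))
```

Under this reading, Classifier B is more accurate overall (97% versus 92%), but Classifier A never labels a poisonous mushroom as edible, so accuracy alone does not settle which model to apply.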
Black Box - models built using sophisticated
methodologies (ANNs, for example) perform
very well, but gaining an understanding of the
model itself is difficult.
Contribution of individual input attributes
Nature of contribution (shape of curve)
Interaction between input attributes
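One simple way to probe a black-box model for these properties is one-at-a-time sensitivity analysis: vary a single input over a range while holding the others at a baseline, and record the output. The sketch below is an assumption about how such a probe could look, not the method used in the course tools; toy_model is a hypothetical stand-in for a trained network.

```python
def response_curve(model, baseline, attribute, values):
    """Model output as one attribute varies and the others stay at baseline."""
    curve = []
    for value in values:
        inputs = dict(baseline)
        inputs[attribute] = value
        curve.append((value, model(inputs)))
    return curve


def contribution(curve):
    """Spread of the output over the curve: a rough size of the contribution."""
    outputs = [output for _, output in curve]
    return max(outputs) - min(outputs)


def toy_model(inputs):
    # Hypothetical stand-in for a trained model: curved in x1, linear in x2.
    return inputs["x1"] ** 2 + 0.5 * inputs["x2"]


baseline = {"x1": 0.0, "x2": 1.0}
curve_x1 = response_curve(toy_model, baseline, "x1", [-1.0, 0.0, 1.0])
curve_x2 = response_curve(toy_model, baseline, "x2", [0.0, 1.0, 2.0])
print(curve_x1, contribution(curve_x1))
print(curve_x2, contribution(curve_x2))
```

Plotting each curve shows the nature of the contribution (here, a bowl shape for x1 and a straight line for x2). Detecting interaction between attributes would need more than this, for example repeating the curve for x1 at several baseline values of x2 and checking whether its shape changes.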
For a detailed presentation of the mechanics of
the software deployed, attend our workshop
tomorrow morning.
Saturday: 8-10 AM
Kachina A
Microsoft SQL Server Business Intelligence
Studio
Visualization Tools