XLMiner PPT - Daniel Yerelian

Download Report

Transcript XLMiner PPT - Daniel Yerelian

1
Introduction to Data Mining with
XLMiner
Business Intelligence
2
Where Data Mining fits in
3
Getting started with Data Mining
• Preparing for Data Mining
• Types of DM methods
– Descriptive
– Predictive
– Prescriptive
Data Mining Methods in XLMiner
•
Data Preparation and
Exploration
–
–
–
•
Prediction
–
–
–
–
•
Linear Regression
K-nearest Neighbors
Regression Trees
Neural Nets
Time Series Forecasting
–
–
•
Data Preparation
Data Visualization
Dimension Reduction
Smoothing Methods
Regression Based
Classification
–
–
–
–
–
–
Naïve Bayes
K-nearest Neigbors
Classification Trees
Logistic Regression
Neural Nets
Discriminant Analysis
•
•
•
Affinity Analysis
Association Rules
Segmentation
–
Cluster Analysis
4
5
First Part of DM: Data
• Data Selection
– Operational Data
– External sources
• Preprocessing
–
–
–
–
–
Remove duplicates
Remove common errors or noisy data
Domain Consistency
Coding etc.
ETL and Data Mining Tools can help
6
Details on Data Preprocessing
•
Enrichment
– Adding additional data
– Sometimes only to a subset of
the data
– Need to organize the data to
enable adding the additional
values
•
Coding
– Modify the data to make it work
more effectively
– Without losing the meaning
– Take out some detail while
retaining relative value
– Categorical data to dummy
variables
•
Categorize
– Address to region
– Birth date to age
•
Scale appropriately
– Divide by 1000 for $ values
where appropriate
•
Binary Attributes
– Yes-No to 0-1
•
Time Series
– Convert date to month numbers
starting from a fixed point (1900)
•
Dimension Reduction
Dealing with Data: Excel and
XLMiner
•
Transformation
– Flat file (CSV, Delimited, etc)
– Various Functions in Excel (Text, Date, etc.)
•
Missing Data identification
–
Excel functions
•
•
•
Absurd Data
–
–
–
–
•
Date of Birth as 01-01-01
Phone number as 123-456-7890, etc.
Needs review and sometime domain knowledge
Pivot Tables and conditional formatting
Outliers Identification
–
–
–
•
Count and Countif
Countblank
XLMiner Boxplots
Excel Descriptive Statistics
Histograms
Normalization
–
–
Needed when you have mixed data and the actual scale does not matter (you do
not have to have the income reported to the exact dollars, it is as good when
reported in units of $1000)
Essentially creation of z-score [ (X-mean)/sigma]
7
8
Changing the Categorical Variables
• Create Dummy Variables
– Depending on the categories, you may have to create a lot of
dummy variables
– Rule:
• Total number of dummy variables should be equal to the nominal
categories.
• Usually the last dummy is not needed as it is determined by the
absence of other categorical values
• Example:
–
–
–
–
–
–
Color = Red, Blue and Green.
We will need three dummies, XR, XB, and XG
For Red the values will be 1, 0, 0
For Blue the values will be 0, 1, 0
And for Green the values will be 0, 0, 1
As you can see, we could have done without the dummy for the green as the
value of 0, 0 for the first two variables will automatically indicate green
(assuming only three colors), so XR and XB would have been enough.
• For ordinal categorical variables, use one dummy and give it
various values based on the order of the categories.
– Example: Grades can be F, D, C, B and A
– Possible values of XGrade will be 0, 1, 2, 3, and 4.
9
Examples of Creating Dummy
• Experience values: 1, 2, 3 or 4
– Dummies are:
• 000 = 1, 100 = 2, 010 = 3, 001 = 4
• Education Values are: Bachelors, Masters and PhD
– Dummies are:
10
XLMiner Data Preparation
• Dummy Variables
– XLMiner-> Data Utilities->Transform Categorical Data
– Try it with Universal Bank Data with Education and Family
• Outliers
–
–
–
–
Histogram (needs Binning)
Box Plot (under charting Data)
Can also be done in Excel
Rule is to find the outright outliers and then cutting off the
values which are asymmetric on either side
– Box Plot and Histogram tell you the skewness of the distribution
11
Data Mining Data Groups
• Given a collection of records (training set )
– A model is created for a particular prediction or classification
• Goal: previously unseen records should be predicted or
classified as accurately as possible.
– A validation/test set is used to determine the accuracy of the
model. Usually, the given data set is divided into training and
test sets, with training set used to build the model and test set
used to validate it.
• Finally, a previously unseen data group is used for
final model performance (called Test set in XLMiner)
XLMiner: Opening and Partitioning
Datasets
•
•
12
The demo version of the program only allows 600 rows. We will use
the educational version to go beyond that.
Partitioning of the data
– Training
• Used for building the model
– Validation
• Used for validating the quality of the model
– Test
• Used for testing the model, specially for the algorithms that may use both
training and validation data recursively to build the model
– Oversampling
• Employed for spare record sets.
13
XL Miner Outputs
• Module dependent
– Typically has a summary report
– Has the option of full reports
– All Excel worksheets, so can be used for further
calculations
– Includes Lift Charts and Decile Lift Charts