Transcript Slides Ch 2

Chapter 2
Overview of Data Mining
• Data Mining
– Predictive analysis
• Tasks of Classification & Prediction
• Core of Business Intelligence
• Database Methods
– Do not involve statistical modeling
Core Ideas in Data Mining
• Analytical Methods Used in Predictive Analytics
– Classification
• Used with categorical response variables
• E.g. Will purchase be made / not made?
– Prediction
• Predict (estimate) value of continuous response variable
• The term “prediction” is sometimes used with categorical responses as well
– Association Rules
• Affinity analysis – “what goes with what”
• Seeks correlations among data
Core Ideas in Data Mining
• Data Reduction
– Reduce variables
– Group together similar variables
• Data Exploration
– View data as evidence
– Get “a feel” for the data
• Data Visualization
– Graphical representation of data
– Locate trends, correlations, etc.
Supervised Learning
• “Supervised learning” algorithms are those used in classification
and prediction.
– Data is available in which the value of the outcome of interest is known.
• “Training data” are the data from which the classification or
prediction algorithm “learns,” or is “trained,” about the
relationship between predictor variables and the outcome
• This process results in a “model”
– Classification Model
– Predictive Model
Supervised Learning
• Model is then run with another sample of data
– “validation data”
– the outcome is known but we wish to see how well the model performs
– If many different models are being tried out, a third sample of known
outcomes, the “test data,” is used with the final, selected model to predict
how well it will do.
• The model can then be used to classify or predict the outcome
of interest in new cases where the outcome is unknown.
Supervised Learning
• Linear regression analysis is an example of
supervised learning
– The Y variable is the (known) outcome variable
– The X variable is some predictor variable.
– A regression line is drawn to minimize the sum of squared deviations
between the actual Y values and the values predicted by this line.
– The regression line can now be used to predict Y values for new values
of X for which we do not know the Y value.
Unsupervised Learning
• No outcome variable to predict or classify
• No “learning” from cases
• Unsupervised learning methods
– Association Rules
– Data Reduction Methods
– Clustering Techniques
The Steps in Data Mining
• 1. Develop an understanding of the purpose of the data
mining project
– Is it a one-shot effort to answer a question or questions, or
– An application (an ongoing procedure)?
• 2. Obtain the dataset to be used in the analysis.
– Random sampling from a large database to capture records to be used
in an analysis
– Pulling together data from different databases.
• Internal (e.g. Past purchases made by customers)
• External (credit ratings).
– Usually the analysis to be done requires only thousands or tens of
thousands of records.
The Steps in Data Mining
• 3. Explore, clean, and preprocess the data
– Verifying that the data are in reasonable condition.
– How should missing data be handled?
– Are the values in a reasonable range, given what you would expect for
each variable?
– Are there obvious “outliers”?
– Data are reviewed graphically –
• For example, a matrix of scatter plots showing the relationship of each
variable with each other variable.
– Ensure consistency in the definitions of fields, units of measurement,
time periods, etc.
The Steps in Data Mining
• 4. Reduce the data
– If supervised learning is involved, separate the data into training,
validation and test datasets.
– Eliminate unneeded variables
– Transform variables
• e.g. turn “money spent” into “spent > $100” vs. “spent ≤ $100”
– Create new variables
• e.g. a variable that records whether at least one of several products was purchased
– Make sure you know what each variable means, and whether it is
sensible to include it in the model.
• 5. Determine the data mining task
– Classification, prediction, clustering, etc.
• 6. Choose the data mining techniques to be used
– Regression, neural nets, hierarchical clustering, etc.
The Steps in Data Mining
• 7. Use algorithms to perform the task.
– Iterative process: trying multiple variants, and often using multiple variants of
the same algorithm (choosing different variables or settings within the
algorithm).
– When appropriate, feedback from the algorithm's performance on validation
data is used to refine the settings.
• 8. Interpret the results of the algorithms.
– Choose the best algorithm to deploy,
– Use final choice on the test data to get an idea how well it will perform.
• 9. Deploy the model.
– Integrate the model into operational systems
– Run it on real records to produce decisions or actions.
– For example, the model might be applied to a purchased list of possible
customers, and the action might be “include in the mailing if the predicted
amount of purchase is > $10."
Preliminary Steps
• Organization of datasets
– Records in rows
– Variables in columns
• In supervised learning one of these will be the outcome variable
• Usually the first or last column
• Sampling from a database
– Use samples to create, validate, and test the model
• Oversampling rare events
– If the response value of interest is seldom found in the data, sample it
at a higher rate so that enough such records are included
– Adjust algorithm as necessary
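One simple way to build a sample in which the rare outcome is over-represented is to keep every rare record and draw an equal number of the common ones. The sketch below assumes that approach; the record layout, the `responded` field, and the 50/50 mix are hypothetical choices, not something the slides specify.

```python
import random

def oversample_rare(records, is_rare, seed=1):
    """Keep all rare records plus an equal-sized random draw of common ones."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    rare = [r for r in records if is_rare(r)]
    common = [r for r in records if not is_rare(r)]
    k = min(len(rare), len(common))
    return rare + rng.sample(common, k)

# Hypothetical data: 1,000 customers, of whom 20 responded (2%).
customers = [{"id": i, "responded": i % 50 == 0} for i in range(1000)]
sample = oversample_rare(customers, lambda r: r["responded"])
share = sum(r["responded"] for r in sample) / len(sample)
print(len(sample), share)
```

Because the sample no longer reflects the true response rate, results from a model built on it need to be adjusted back before they are read as real-world rates.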
Preliminary Steps
(Pre-processing and Cleaning the Data)
• Types of variables
– Continuous – assumes any real numerical value
(generally within a specified range)
– Categorical – assumes one of a limited number of values
Text (e.g. Payments ∈ {current, not current, bankrupt})
Numerical (e.g. Age ∈ {0 … 120})
Nominal (payments)
Ordinal (age)
Preliminary Steps
(Pre-processing and Cleaning the Data)
• Handling categorical variables
– If categorical is ordered then it can be used as continuous variable
(e.g. age, level of credit, etc.)
– Use of “dummy” variables when range of values not large
• e.g. Variable occupation ∈ {student, unemployed, employed, retired}
• Create binary (yes/no) dummy variables
Student – yes/no
Unemployed – yes/no
Employed – yes/no
Retired – yes/no
• Variable selection
– The more predictor variables, the more records are needed to build the model
– Reduce number of variables whenever appropriate
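The dummy-variable idea for the occupation example above can be sketched as follows. The function name `to_dummies`, the `occupation_` prefix, and the sample records are illustrative assumptions; only the four categories come from the slide.

```python
# Categories taken from the slide's occupation example.
CATEGORIES = ["student", "unemployed", "employed", "retired"]

def to_dummies(value, categories=CATEGORIES):
    """Return one yes/no (1/0) indicator per category for a single value."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return {f"occupation_{c}": int(value == c) for c in categories}

records = ["student", "retired", "employed"]  # hypothetical records
encoded = [to_dummies(r) for r in records]
print(encoded[0])
```

Exactly one indicator is 1 for each record, so any one of the four can be inferred from the other three; this is why some modeling methods use only three of them.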
Preliminary Steps
(Pre-processing and Cleaning the Data)
• Overfitting
– Building a model - describe relationships among variables in order to
predict future outcome (dependent) values on the basis of future
predictor (independent) values.
– Avoid “explaining” variation in the data that was nothing more than
chance variation. Avoid mislabeling “noise” in the data as if it were a
“signal.”
– Caution - if the dataset is not much larger than the number of
predictor variables, then it is very likely that a spurious relationship
like this will creep into the model
Preliminary Steps
(Pre-processing and Cleaning the Data)
• How many variables & how much data
• A good rule of thumb is to have ten records for every predictor
• For classification procedures
– At least 6 × m × p records,
– Where m = number of outcome classes, and p = number of variables
• Compactness or parsimony is a desirable feature in a model.
• A matrix of x-y plots can be useful in variable selection.
• Can see at a glance x-y plots for all variable combinations.
– A straight line would be an indication that one variable is exactly correlated
with another.
– We would want to include only one of them in our model.
• Weed out irrelevant and redundant variables from our model
• Consult domain expert whenever possible
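As a quick arithmetic check of the two rules of thumb above, the sketch below computes the minimum record counts. The function names and the example values (m = 2 classes, p = 8 variables) are hypothetical.

```python
def min_records_rule_of_ten(p):
    """Ten records for every predictor variable."""
    return 10 * p

def min_records_classification(m, p):
    """At least 6 x m x p records (m outcome classes, p variables)."""
    return 6 * m * p

# Hypothetical case: a two-class outcome with eight variables.
print(min_records_rule_of_ten(8))        # 80
print(min_records_classification(2, 8))  # 96
```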
Preliminary Steps
(Pre-processing and Cleaning the Data)
• Outliers
– Values that lie far away from the bulk of the data are called outliers
– No statistical rule can tell us whether such an outlier is the result of an
error
– These are judgments best made by someone with “domain” knowledge
– If the number of records with outliers is very small, they might be
treated as missing data.
Preliminary Steps
(Pre-processing and Cleaning the Data)
• Missing values
– If the number of records with missing values is small, those records
might be omitted
– The more variables, the more records will be dropped
• Solution - use average value computed from records with valid data for
variable with missing data
• Reduces variability in data set
– Human judgment can be used to determine best way to handle
missing data
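The averaging approach above, and its side effect of reducing variability, can be seen in a small sketch. Missing values are represented here as `None`, and the income figures are made up for illustration.

```python
from statistics import mean, pstdev

def impute_mean(values):
    """Replace None entries with the mean of the non-missing entries."""
    valid = [v for v in values if v is not None]
    if not valid:
        raise ValueError("no valid values to average")
    fill = mean(valid)
    return [fill if v is None else v for v in values]

income = [40.0, None, 55.0, 61.0, None, 44.0]  # hypothetical, in $000s
valid_only = [v for v in income if v is not None]
filled = impute_mean(income)

print(filled)
# Spread before and after: the filled-in values sit exactly at the mean,
# so the standard deviation of the completed column is smaller.
print(pstdev(valid_only), pstdev(filled))
```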
Preliminary Steps
(Pre-processing and Cleaning the Data)
• Normalizing (standardizing) the data
– To normalize the data, we subtract the mean from each value, and divide
by the standard deviation of the resulting deviations from the mean
• Expressing each value as “number of standard deviations away from the
mean” – the z-score
• Needed if variables are in different units, e.g. hours, thousands of dollars, etc.
– Clustering algorithms measure how far apart records are across the
variables, so the variables need a common scale for distance.
– Data mining software, including XLMiner, typically has an option that
normalizes the data in those algorithms where it may be required
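A minimal sketch of the z-score calculation described above. The slides do not say whether the sample or population standard deviation is meant; the sample version (`statistics.stdev`) is assumed here, and the two small columns are made-up data.

```python
from statistics import mean, stdev

def z_scores(values):
    """Subtract the mean, then divide by the (sample) standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # assumption: n - 1 denominator
    if sigma == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(v - mu) / sigma for v in values]

hours = [2.0, 4.0, 6.0, 8.0]          # hypothetical, in hours
dollars = [30.0, 45.0, 60.0, 75.0]    # hypothetical, in $000s

z_hours = z_scores(hours)
z_dollars = z_scores(dollars)
print(z_hours)
print(z_dollars)
```

After normalizing, both columns are on the same unit-free scale, so neither dominates a distance calculation simply because of its units.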
Preliminary Steps
• Use and creation of partition
– Training partition
• The largest partition
• Contains the data used to build the various models
• Same training partition is generally used to develop multiple models.
– Validation partition
• Used to assess the performance of each model,
• Used to compare models and pick the best one.
• In classification and regression trees algorithms the validation partition
may be used automatically to tune and improve the model.
– Test partition
• Sometimes called the “holdout” or “evaluation” partition
• Used to assess the performance of a chosen model with new data.
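The three partitions can be created by shuffling the records once and slicing. The 50/30/20 split and the fixed seed below are assumptions for illustration; the slides only say that the training partition is the largest.

```python
import random

def partition(records, train_frac=0.5, valid_frac=0.3, seed=1):
    """Randomly split records into training, validation and test lists."""
    if not 0 < train_frac + valid_frac <= 1:
        raise ValueError("fractions must sum to a value in (0, 1]")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # whatever remains
    return train, valid, test

train, valid, test = partition(range(100))  # record ids 0..99 stand in for rows
print(len(train), len(valid), len(test))    # 50 30 20
```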
The Three Data Partitions and Their
Role in the Data Mining Process
Simple Regression Example
Simple Regression Model
• Make a prediction about the starting salary of a current college student
• Data set of starting salaries of recent college graduates
Data Set
Compute Average Salary
How certain are we of this prediction?
There is variability in the data.
Simple Regression Model
• Use total variation as an index of uncertainty about our prediction
Compute Total Variation
• The smaller the amount of total variation the more accurate
(certain) will be our prediction.
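Total variation here is taken to be the sum of squared deviations of each salary from the average salary, which is the usual definition in regression. The salaries below are made up for illustration; they are not the data set from the slides.

```python
from statistics import mean

def total_variation(values):
    """Sum of squared deviations from the average."""
    avg = mean(values)
    return sum((v - avg) ** 2 for v in values)

salaries = [18_000.0, 20_000.0, 22_000.0, 24_000.0]  # hypothetical
print(mean(salaries))             # the "guess the average" prediction
print(total_variation(salaries))  # uncertainty index for that prediction
```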
Simple Regression Model
• How to “explain” the variability? Perhaps it depends on
the student’s GPA
Salary GPA
Simple Regression Model
• Find a linear relationship between GPA and starting salary
• As GPA increases/decreases starting salary increases/decreases
Simple Regression Model
• Least Squares Method to find regression model
– Choose a and b in regression model (equation) so that it minimizes the sum
of the squared deviations – actual Y value minus predicted Y value (Y-hat)
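For a single predictor, the a and b that minimize the sum of squared deviations have a standard closed form: b is the sum of cross-deviations divided by the sum of squared X deviations, and a is the mean of Y minus b times the mean of X. The sketch below applies it to made-up numbers (not the salary data) and also computes the residuals.

```python
def least_squares(xs, ys):
    """Return (a, b) minimizing sum((y - (a + b*x))**2) for one predictor."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two or more paired observations")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all X values are identical")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

xs = [1.0, 2.0, 3.0, 4.0]  # made-up predictor values
ys = [2.0, 4.0, 5.0, 8.0]  # made-up outcome values
a, b = least_squares(xs, ys)

# Residual = actual Y minus predicted Y (Y-hat).
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
unexplained = sum(u ** 2 for u in residuals)
print(a, b, sum(residuals), unexplained)
```

With an intercept in the model, the residuals sum to zero (up to floating-point rounding), and the sum of their squares is the variation the line leaves unexplained.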
Simple Regression Model
• How good is the model?
a = 4,779 and b = 5,370
A computer program computed these values
u-hat is a “residual” value
The sum of all u-hats is zero
The sum of all u-hats squared is the total variance not explained by the model
“unexplained variance” is 7,425,926
Simple Regression Model
Total Variation = 23,000,000
Simple Regression Model
Total Unexplained Variation = 7,425,726
Simple Regression Model
• Relative Goodness of Fit
– Summarize the improvement in prediction using regression model
• Compute R² – coefficient of determination
Regression Model (equation) a better predictor than guessing the average salary
The GPA is a more accurate predictor of starting salary than guessing the average
R² is the “performance measure” for the model.
Predicted Starting Salary = 4,779 + 5,370 * GPA
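The coefficient of determination is one minus the ratio of unexplained variation to total variation. The sketch below plugs in the totals quoted in the slides and applies the fitted equation to a hypothetical GPA of 3.0. The slides quote the unexplained figure as 7,425,926 in one place and 7,425,726 in another; the first is assumed here, and both round to the same result.

```python
TOTAL_VARIATION = 23_000_000
UNEXPLAINED_VARIATION = 7_425_926  # assumption: the first of the two quoted figures

def r_squared(total, unexplained):
    """Share of total variation explained by the model."""
    return 1 - unexplained / total

def predicted_salary(gpa, a=4_779, b=5_370):
    """Predicted Starting Salary = a + b * GPA, with a and b from the slides."""
    return a + b * gpa

r2 = r_squared(TOTAL_VARIATION, UNEXPLAINED_VARIATION)
print(round(r2, 3))           # 0.677
print(predicted_salary(3.0))  # 20889.0 (GPA of 3.0 is a hypothetical input)
```

An R² of roughly 0.68 means the GPA-based line removes about two thirds of the variation that remains when simply guessing the average salary.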
Building a Model - An Example with
Linear Regression
• Problem 2.11 Page 33