Data Mining in Forecasting

Chapter 9
DATA MINING
PAULA JENSEN
SDSM&T
ENGM 745
McGraw-Hill/Irwin
Copyright © 2009 by The McGraw-Hill Companies, Inc. All rights reserved.
DATA MINING
• DATA-DATA
• Extracting useful information from large databases
• Tools of data mining
• Looking at where to find the data
TOOLS OF DATA MINING
• Prediction
• Classification
• Clustering
• Association

PREDICTION

Predict the value of a numeric variable
• Customer’s expenditure
• Will they purchase?
• What are their interests?
• Do their interests predict a purchase?

CLASSIFICATION
Classes of objects or actions
• Reliability of customer
• Income
• Location

CLUSTERING
Analysis tools group objects that can be viewed as a class
• Where is the cutoff for income or size?
• How do I group the information?

ASSOCIATION
Patterns based on likes
• Netflix
• Facebook
• Google

CLASSIFICATION
• k-nearest neighbor
• Naïve Bayes
• Classification/regression trees
• Logistic regression
DATA MINING TERMINOLOGY
K-NEAREST NEIGHBOR
• Use a subset of the total data, called the training data
• Select the closest neighbors using Euclidean distance (shown on the previous slide); other distance metrics are available for defining neighbors
• Validation data is a separate set of data
• Test statistics are more important on the validation data than on the training data
• A split of 60% training data and 40% validation data is an acceptable mix
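
The points above can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the course: the records (income in $000s, age, and a 0/1 loan flag) are made up, and the function names `euclidean`, `knn_classify`, and `split_60_40` are assumptions chosen for this example.

```python
import math
import random
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, point, k=3):
    """Majority class among the k training records closest to `point`."""
    nearest = sorted(train, key=lambda rec: euclidean(rec[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def split_60_40(records, seed=0):
    """Shuffle a copy, then use 60% for training and 40% for validation."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 6 // 10
    return shuffled[:cut], shuffled[cut:]

# Made-up records: ((income in $000s, age), took_loan)
records = [
    ((30, 25), 0), ((35, 30), 0), ((40, 28), 0), ((32, 40), 0), ((38, 35), 0),
    ((90, 45), 1), ((95, 50), 1), ((85, 38), 1), ((100, 42), 1), ((88, 55), 1),
]

train, valid = split_60_40(records)
# Accuracy is measured on the validation records, which the model never saw.
correct = sum(knn_classify(train, x, k=3) == y for x, y in valid)
print(f"validation accuracy: {correct}/{len(valid)}")
```

An odd k such as 3 avoids tied votes when there are only two classes.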

K-NEAREST NEIGHBOR ANALYSIS
• Multidimensional
• The program computes a distance associated with each attribute
• Continuous variables are measured on different scales
• Categorical attributes use a weighting mechanism
• Example: will a customer respond to marketing and take a loan?
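
Because attributes are measured on different scales, a distance computed on raw values would be dominated by the attribute with the largest numbers. The sketch below shows one common remedy, min-max scaling, plus a simple mismatch weight for categorical attributes. The slides do not specify the weighting mechanism, so `cat_weight` and the mismatch rule are assumptions for illustration, as are the customer values.

```python
import math

def min_max_scale(column):
    """Rescale a list of numbers to the 0..1 range."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def mixed_distance(a, b, cat_weight=1.0):
    """Distance between two (scaled_numeric_tuple, categorical_tuple) records.

    Numeric attributes contribute squared differences; each categorical
    mismatch contributes cat_weight (an assumed, illustrative weighting).
    """
    num_a, cat_a = a
    num_b, cat_b = b
    total = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    total += sum(cat_weight for x, y in zip(cat_a, cat_b) if x != y)
    return math.sqrt(total)

incomes = [30_000, 60_000, 90_000]   # dollars (made-up)
ages = [25, 45, 65]                  # years (made-up)
scaled_income = min_max_scale(incomes)
scaled_age = min_max_scale(ages)

customer_a = ((scaled_income[0], scaled_age[0]), ("renter",))
customer_b = ((scaled_income[1], scaled_age[1]), ("owner",))
print(mixed_distance(customer_a, customer_b))
```

Without scaling, a $30,000 income gap would swamp a 20-year age gap; after scaling, both count equally.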

K = 3 means the 3 nearest neighbors are used to classify each record
Type 1 would take a loan; Type 0 would not take a loan
TERMS
• Lift: measures the change in concentration of a particular class when the model is used to select a group from the general population. The example shows significant lift.
• Decile-wise lift chart: if we pick the top 10% of records as ranked by our model, our selection would include approximately 7 times as many correct classifications as a selection made without the model.
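
The lift of the top decile can be computed directly from model scores and actual outcomes. The sketch below is an illustration with made-up data; `top_decile_lift` is a hypothetical helper name, and it assumes at least one record belongs to class 1.

```python
def top_decile_lift(scores, actuals):
    """Lift of the top 10% of records ranked by model score.

    scores  -- model scores, higher means more likely to be class 1
    actuals -- true class labels (1 or 0), in the same order as scores
    """
    ranked = sorted(zip(scores, actuals), key=lambda p: p[0], reverse=True)
    n_top = max(1, len(ranked) // 10)
    top_rate = sum(a for _, a in ranked[:n_top]) / n_top   # class-1 share in top decile
    base_rate = sum(actuals) / len(actuals)                # class-1 share overall
    return top_rate / base_rate

# Made-up data: 20 records, 4 responders; the model ranks two of them highest.
scores = [0.95, 0.90, 0.40, 0.35] + [0.30 - 0.01 * i for i in range(16)]
actuals = [1, 1, 1, 1] + [0] * 16
print(top_decile_lift(scores, actuals))
```

In this toy data the top decile (2 records) is 100% responders against 20% overall, so the lift is 5.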

CLASSIFICATION TREES

Advantages
• Decision rules are easy
• Easy to understand

Disadvantages
• Tend to overfit the data
• Correlated attributes will cause multicollinearity
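
To show why tree rules are easy to read, a fitted classification tree can be written out as nested if-statements. The tree below is hypothetical: the attributes (`income`, `has_cd_account`) and the split value of 100 are invented for illustration and do not come from the slides' example.

```python
def classify_loan_prospect(income, has_cd_account):
    """Hypothetical two-level tree; split values are illustrative only.

    income         -- annual income in $000s
    has_cd_account -- True if the customer holds a CD account
    Returns 1 (would take a loan) or 0 (would not).
    """
    if income < 100:
        return 0          # low income: predicted not to take a loan
    if has_cd_account:
        return 1          # high income with a CD account: predicted to take a loan
    return 0              # high income, no CD account

print(classify_loan_prospect(income=120, has_cd_account=True))
print(classify_loan_prospect(income=60, has_cd_account=True))
```

A tree grown deep enough to give every training record its own rule would fit the training data perfectly and generalize poorly, which is the overfitting risk listed above.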

NAÏVE BAYES
Statistical classification
• Bayes' theorem: predicts the probability of a prior event given that a certain subsequent event has taken place
• Called naïve because each attribute is assumed to be independent of the others
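
The independence assumption means the classifier only needs simple counts: the share of each class, and the share of each attribute value within each class. The sketch below illustrates this for categorical attributes. The six records are made up, the function names are assumptions, and no smoothing is applied, so an attribute value never seen with a class drives that class's score to zero.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-attribute value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(attribute_index, class)][value] -> count
    value_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts

def predict_proba(model, row):
    """P(class | row), multiplying per-attribute probabilities as if independent."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total                                   # prior P(class)
        for i, value in enumerate(row):
            score *= value_counts[(i, label)][value] / n    # P(value | class)
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

# Made-up records: (income level, has online account) -> took loan (1) or not (0)
rows = [("high", "yes"), ("high", "no"), ("low", "yes"),
        ("low", "no"), ("low", "no"), ("high", "no")]
labels = [1, 1, 1, 0, 0, 0]

model = train_naive_bayes(rows, labels)
print(predict_proba(model, ("high", "no")))
```

For the record ("high", "no"), the class 1 score is 1/2 × 2/3 × 1/3 = 1/9 and the class 0 score is 1/2 × 1/3 × 3/3 = 1/6, giving probabilities of 0.4 and 0.6.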

BAYESIAN THEOREM

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

P(A) is the prior probability of A
P(A|B) is the conditional probability of A given B
P(B|A) is the conditional probability of B given A
P(B) is the prior probability of B
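
A short numeric example makes the formula concrete. The probabilities below are made up for illustration: A stands for "customer takes a loan" and B for "customer has high income".

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Made-up numbers
p_a = 0.10            # prior: 10% of customers take a loan
p_b_given_a = 0.60    # 60% of loan takers have high income
p_b = 0.20            # 20% of all customers have high income

print(bayes_posterior(p_b_given_a, p_a, p_b))
```

Knowing that a customer has high income raises the probability of taking a loan from the prior of 0.10 to about 0.30.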
APPLYING BAYES’ THEOREM
REGRESSION
Logistic regression or logit analysis
• The difference between logistic regression and ordinary regression is that the dependent variable in logistic regression is categorical, not continuous
• The dependent variable is dichotomous: either yes or no
• The predicted value of the dependent variable is limited to values between 0 and 1
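
The prediction stays between 0 and 1 because the linear combination of the predictors is passed through the logistic function. The sketch below shows that mapping for a single predictor. The coefficients (`intercept=-6.0`, `slope=0.05`) are hypothetical values chosen for illustration, not estimates from any data.

```python
import math

def logistic(z):
    """Map any real number z to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)           # this branch avoids overflow for very negative z
    return e / (1.0 + e)

def predict_probability(x, intercept, slope):
    """Predicted probability of 'yes' for a single predictor x."""
    return logistic(intercept + slope * x)

# Hypothetical coefficients; x is income in $000s
for income in (40, 120, 200):
    print(income, predict_probability(income, intercept=-6.0, slope=0.05))
```

A common convention is to classify a record as "yes" when the predicted probability is at least 0.5, although the cutoff can be adjusted.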

WHERE DO I FIND THE DATA???

Current Customer Activity

Collect in your database
• Family names
• Sales software
• Forms from your website (Wufoo.com)
• Track inquiries

Current Facebook Activity
• BUY IT!
• Mailing lists
• How to use it???
