Data Mining in Market Research

• What is data mining?
– Methods for finding interesting structure in large databases
• E.g. patterns, prediction rules, unusual cases
– Focus on efficient, scalable algorithms
• Contrasts with emphasis on correct inference in statistics
– Related to data warehousing, machine learning
• Why is data mining important?
– Well marketed; now a large industry; pays well
– Handles large databases directly
– Can make data analysis more accessible to end users
• Semi-automation of analysis
• Results can be easier to interpret than e.g. regression models
• Strong focus on decisions and their implementation
CRISP-DM Process Model
(Figure: the CRISP-DM cycle of business understanding, data understanding, data preparation, modelling, evaluation and deployment.)
Data Mining Software
• Many providers of data mining software
– SAS Enterprise Miner, SPSS Clementine, Statistica Data Miner,
MS SQL Server, Polyanalyst, KnowledgeSTUDIO, …
– See http://www.kdnuggets.com/software/suites.html for a list
– Good algorithms important, but also need good facilities for
handling data and meta-data
• We’ll use:
– WEKA (Waikato Environment for Knowledge Analysis)
• Free (GPLed) Java package with GUI
• Online at www.cs.waikato.ac.nz/ml/weka
• Witten and Frank, 2000. Data Mining: Practical Machine Learning
Tools and Techniques with Java Implementations.
– R packages
• E.g. rpart, class, tree, nnet, cclust, deal, GeneSOM, knnTree,
mlbench, randomForest, subselect
Data Mining Terms
• Different names for familiar statistical concepts,
from database and AI communities
– Observation = case, record, instance
– Variable = field, attribute
– Analysis of dependence vs interdependence =
Supervised vs unsupervised learning
– Relationship = association, concept
– Dependent variable = response, output
– Independent variable = predictor, input
Common Data Mining Techniques
• Predictive modeling
– Classification
• Derive classification rules
• Decision trees
– Numeric prediction
• Regression trees, model trees
• Association rules
• Meta-learning methods
– Cross-validation, bagging, boosting
• Other data mining methods include:
– artificial neural networks, genetic algorithms, density estimation,
clustering, abstraction, discretisation, visualisation, detecting
changes in data or models
Classification
• Methods for predicting a discrete response
– One kind of supervised learning
– Note: in biological and other sciences, classification
has long had a different meaning, referring to cluster
analysis
• Applications include:
– Identifying good prospects for specific marketing or
sales efforts
• Cross-selling, up-selling – when to offer products
• Customers likely to be especially profitable
• Customers likely to defect
– Identifying poor credit risks
– Diagnosing customer problems
Weather/Game-Playing Data
• Small dataset
– 14 instances
– 5 attributes
• Outlook - nominal
• Temperature - numeric
• Humidity - numeric
• Wind - nominal
• Play
– Whether or not a certain game would be played
– This is what we want to understand and predict
ARFF file for the weather data.
% ARFF file for the weather data with some numeric features
%
@relation weather
@attribute outlook { sunny, overcast, rainy }
@attribute temperature numeric
@attribute humidity numeric
@attribute windy { true, false }
@attribute play? { yes, no }
@data
%
% 14 instances
%
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
rainy, 70, 96, false, yes
rainy, 68, 80, false, yes
rainy, 65, 70, true, no
overcast, 64, 65, true, yes
sunny, 72, 95, false, no
sunny, 69, 70, false, yes
rainy, 75, 80, false, yes
sunny, 75, 70, true, yes
overcast, 72, 90, true, yes
overcast, 81, 75, false, yes
rainy, 71, 91, true, no
German Credit Risk Dataset
• 1000 instances (people), 21 attributes
– “class” attribute describes people as good or bad
credit risks
– Other attributes include financial information and
demographics
• E.g. checking_status, duration, credit_history, purpose,
credit_amount, savings_status, employment, Age, housing,
job, num_dependents, own_telephone, foreign_worker
• Want to predict credit risk
• Data available at UCI machine learning data
repository
• http://www.ics.uci.edu/~mlearn/MLRepository.html
– and on 747 web page
• http://www.stat.auckland.ac.nz/~reilly/credit-g.arff
Classification Algorithms
• Many methods available in WEKA
– 0R, 1R, NaiveBayes, DecisionTable, ID3, PRISM,
Instance-based learner (IB1, IBk), C4.5 (J48), PART,
Support vector machine (SMO)
• Usually train on part of the data, test on the rest
• Simple method – Zero-rule, or 0R
– Predict the most common category
• Class ZeroR in WEKA
– Too simple for practical use, but a useful baseline for
evaluating performance of more complex methods
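A minimal R sketch of the 0R baseline, using the play column of the weather data above:

play <- c("no","no","yes","yes","yes","no","yes","no","yes","yes","yes","yes","yes","no")
zero_r <- names(which.max(table(play)))    # modal category: "yes"
mean(play == zero_r)                       # baseline accuracy: 9/14, about 0.64

Any classifier evaluated later should at least beat this 9/14 baseline accuracy.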
1-Rule (1R) Algorithm
• Based on single predictor
– Predict mode within each value of that predictor
• Look at error rate for each predictor on training
dataset, and choose best predictor
• Called OneR in WEKA
• Must group numerical predictor values for this
method
– Common method is to split at each change in the
response
– Collapse buckets until each contains at least 6
instances
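A small R sketch of the 1R idea for one nominal predictor (outlook) on the weather data; the full algorithm repeats this for every attribute and keeps the one with the lowest training error:

outlook <- c("sunny","sunny","overcast","rainy","rainy","rainy","overcast",
             "sunny","sunny","rainy","sunny","overcast","overcast","rainy")
play    <- c("no","no","yes","yes","yes","no","yes","no","yes","yes","yes","yes","yes","no")

# Rule: predict the modal class within each value of the predictor
rule <- sapply(split(play, outlook), function(y) names(which.max(table(y))))
rule                            # overcast -> yes, rainy -> yes, sunny -> no
mean(play != rule[outlook])     # training error rate for this rule: 4/14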
1R Algorithm (continued)
• Biased towards predictors with more categories
– These can result in over-fitting to the training data
• But found to perform surprisingly well
– Study on 16 widely used datasets
• Holte (1993), Machine Learning 11, 63-91
– Often error rate only a few percentage points higher
than more sophisticated methods (e.g. decision trees)
– Produced rules that were much simpler and more
easily understood
Naïve Bayes Method
• Calculates probabilities of each response value,
assuming independence of attribute effects
PX A, B, C  
PA X PB X PC X P X 
P( A, B, C )
• Response value with highest probability is
predicted
• Numeric attributes are assumed to follow a
normal distribution within each response value
– Contribution to probability calculated from normal
density function
– Instead can use kernel density estimate, or simply
discretise the numerical attributes
Naïve Bayes Calculations
• Observed counts and probabilities for each attribute value within each value of play
– Temperature and humidity have been discretised
• Consider a new day
– Outlook=sunny, temperature=cool, humidity=high, windy=true
– Probability(play=yes) ∝ 2/9 x 3/9 x 3/9 x 3/9 x 9/14 = 0.0053
– Probability(play=no) ∝ 3/5 x 1/5 x 4/5 x 3/5 x 5/14 = 0.0206
– Probability(play=no) = 0.0206/(0.0053+0.0206) = 79.5%
• “no” four times more likely than “yes”
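The arithmetic above can be checked directly; a minimal R sketch with the conditional probabilities read off the discretised weather counts:

# New day: outlook=sunny, temperature=cool, humidity=high, windy=true
p_yes <- (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # about 0.0053
p_no  <- (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # about 0.0206
p_no / (p_yes + p_no)                             # P(play=no), about 0.795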
Naïve Bayes Method
• If any of the component probabilities are zero,
the whole probability is zero
– Effectively a veto on that response value
– Add one to each cell’s count to get around this
problem
• Corresponds to weak positive prior information
• Naïve Bayes effectively assumes that attributes
are equally important
– Several highly correlated attributes could drown out
an important variable that would add new information
• However this method often works well in practice
Decision Trees
• Classification rules can be expressed in a tree
structure
– Move from the top of the tree, down through various
nodes, to the leaves
– At each node, a decision is made using a simple test
based on attribute values
– The leaf you reach holds the appropriate predicted value
• Decision trees are appealing and easily used
– However they can be verbose
– Depending on the tests being used, they may obscure
rather than reveal the true pattern
• More info online at http://recursive-partitioning.com/
If x=1 and y=1 then class = a
If z=1 and w=1 then class = a
Otherwise class = b
Decision tree with a replicated subtree (figure: the z/w subtree must appear more than once in the tree)
Problems with Univariate Splits
Constructing Decision Trees
• Develop tree recursively
– Start with all data in one root node
– Need to choose attribute that defines first split
• For now, we assume univariate splits are used
– For accurate predictions, want leaf nodes to be as
pure as possible
– Choose the attribute that maximises the average
purity of the daughter nodes
• The measure of purity used is the entropy of the node
• This is the amount of information needed to specify the value
of an instance in that node, measured in bits
entropy(p1, p2, …, pn) = − Σ_{i=1}^{n} pi log2(pi)
Tree stumps for the weather data
(Figure: four one-split stumps, (a)-(d), one for each of outlook, temperature, humidity and windy.)
Weather Example
• First node from outlook split is for “sunny”, with
entropy – 2/5 * log2(2/5) – 3/5 * log2(3/5) = 0.971
• Average entropy of nodes from outlook split is
5/14 x 0.971 + 4/14 x 0 + 5/14 x 0.971= 0.693
• Entropy of root node is 0.940 bits
• Gain of 0.247 bits
• Other splits yield:
– Gain(temperature)=0.029 bits
– Gain(humidity)=0.152 bits
– Gain(windy)=0.048 bits
• So “outlook” is the best attribute to split on
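These entropy and gain figures can be reproduced with a few lines of R (a sketch, with the class counts typed in from the weather data):

# Entropy in bits from a vector of class counts
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log2(p))
}

entropy(c(9, 5))                                          # root node: 0.940 bits
nodes <- list(sunny = c(2, 3), overcast = c(4, 0), rainy = c(3, 2))
sizes <- sapply(nodes, sum)
avg <- sum(sizes / sum(sizes) * sapply(nodes, entropy))   # 0.693 bits
entropy(c(9, 5)) - avg                                    # gain(outlook) = 0.247 bits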
Expanded tree stumps for weather data
(Figure: the outlook = sunny branch expanded by each of the remaining attributes, (a)-(c).)
Decision tree for the weather data
Decision Tree Algorithms
• The algorithm described in the preceding slides
is known as ID3
– Due to Quinlan (1986)
• Tends to choose attributes with many values
– Using information gain ratio helps solve this problem
• Several more improvements have been made to
handle numeric attributes (via univariate splits),
missing values and noisy data (via pruning)
– Resulting algorithm known as C4.5
• Described by Quinlan (1993)
– Widely used (as is the commercial version C5.0)
– WEKA has a version called J4.8
Classification Trees
• Described (along with regression trees) in:
– L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, 1984.
Classification and Regression Trees.
• More sophisticated method than ID3
– However Quinlan’s (1993) C4.5 method caught up with CART in
most areas
• CART also incorporates methods for pruning, missing
values and numeric attributes
– Multivariate splits are possible, as well as univariate
• Split on linear combination Σcjxj > d
– CART typically uses Gini measure of node purity to determine
best splits
• This is of the form Σp(1-p)
– But information/entropy measure also available
Regression Trees
• Trees can also be used to predict numeric
attributes
– Predict using average value of the response in the
appropriate node
• Implemented in CART and C4.5 frameworks
– Can use a model at each node instead
• Implemented in Weka’s M5’ algorithm
• Harder to interpret than regression trees
• Classification and regression trees are
implemented in R’s rpart package
– See Ch 10 in Venables and Ripley, MASS 3rd Ed.
Problems with Trees
• Can be unnecessarily verbose
• Structure often unstable
– “Greedy” hierarchical algorithm
• Small variations can change chosen splits at high level nodes, which then
changes subtree below
• Conclusions about attribute importance can be unreliable
• Direct methods tend to overfit training dataset
– This problem can be reduced by pruning the tree
• Another approach that often works well is to fit the tree, remove all
training cases that are not correctly predicted, and refit the tree on
the reduced dataset
– Typically gives a smaller tree
– This usually works almost as well on the training data
– But generalises better, e.g. works better on test data
• Bagging the tree algorithm also gives more stable results
– Will discuss bagging later
Classification Tree Example
• Use Weka’s J4.8 algorithm on German
credit data (with default options)
– 1000 instances, 21 attributes
• Produces a pruned tree with 140 nodes,
103 leaves
=== Run information ===

Scheme:       weka.classifiers.j48.J48 -C 0.25 -M 2
Relation:     german_credit
Instances:    1000
Attributes:   21

Number of Leaves  : 103
Size of the tree  : 140

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances         739        73.9 %
Incorrectly Classified Instances       261        26.1 %
Kappa statistic                          0.3153
Mean absolute error                      0.3241
Root mean squared error                  0.4604
Relative absolute error                 77.134  %
Root relative squared error            100.4589 %
Total Number of Instances             1000

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
 0.883     0.597      0.775      0.883     0.826      good
 0.403     0.117      0.596      0.403     0.481      bad

=== Confusion Matrix ===

   a   b   <-- classified as
 618  82 |  a = good
 179 121 |  b = bad
Cross-Validation
• Due to over-fitting, cannot estimate prediction
error directly on the training dataset
• Cross-validation is a simple and widely used
method for estimating prediction error
• Simple approach
– Set aside a test dataset
– Train learner on the remainder (the training dataset)
– Estimate prediction error by using the resulting
prediction model on the test dataset
• This is only feasible where there is enough data
to set aside a test dataset and still have enough
to reliably train the learning algorithm
k-fold Cross-Validation
• For smaller datasets, use k-fold cross-validation
– Split dataset into k roughly equal parts
[ Tr | Tr | Test | Tr | Tr | Tr | Tr | Tr | Tr ]
– For each part, train on the other k-1 parts and use this part as
the test dataset
– Do this for each of the k parts, and average the resulting
prediction errors
• This method measures the prediction error when training
the learner on a fraction (k-1)/k of the data
• If k is small, this will overestimate the prediction error
– k=10 is usually enough
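A minimal R sketch of k-fold cross-validation for a classification tree (illustrated on R's built-in iris data rather than the course datasets):

library(rpart)
set.seed(1)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(iris)))   # assign each case to a fold

errs <- sapply(1:k, function(i) {
  train <- iris[folds != i, ]
  test  <- iris[folds == i, ]
  fit   <- rpart(Species ~ ., data = train, method = "class")
  mean(predict(fit, test, type = "class") != test$Species)  # error on held-out fold
})
mean(errs)   # cross-validated estimate of the prediction error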
Regression Tree Example
• library(rpart)
• data(car.test.frame)
• z.auto <- rpart(Mileage ~ Weight, car.test.frame)
• post(z.auto, file="")
• summary(z.auto)
Fitted tree (as drawn by post):
  24.58  n=60
    Weight >= 2568:  22.47  n=45
      Weight >= 3088:  20.41  n=22
      Weight <  3088:  24.43  n=23
        Weight >= 2748:  23.8   n=15
        Weight <  2748:  25.63  n=8
    Weight <  2568:  30.93  n=15
Call:
rpart(formula = Mileage ~ Weight, data = car.test.frame)
  n= 60

          CP nsplit rel error    xerror       xstd
1 0.59534912      0 1.0000000 1.0322233 0.17981796
2 0.13452819      1 0.4046509 0.6081645 0.11371656
3 0.01282843      2 0.2701227 0.4557341 0.09178782
4 0.01000000      3 0.2572943 0.4659556 0.09134201

Node number 1: 60 observations,    complexity param=0.5953491
  mean=24.58333, MSE=22.57639
  left son=2 (45 obs) right son=3 (15 obs)
  Primary splits:
      Weight < 2567.5 to the right, improve=0.5953491, (0 missing)

Node number 2: 45 observations,    complexity param=0.1345282
  mean=22.46667, MSE=8.026667
  left son=4 (22 obs) right son=5 (23 obs)
  Primary splits:
      Weight < 3087.5 to the right, improve=0.5045118, (0 missing)

Node number 3: 15 observations
  mean=30.93333, MSE=12.46222

Node number 4: 22 observations
  mean=20.40909, MSE=2.78719

Node number 5: 23 observations,    complexity param=0.01282843
  mean=24.43478, MSE=5.115312
  left son=10 (15 obs) right son=11 (8 obs)
  Primary splits:
      Weight < 2747.5 to the right, improve=0.1476996, (0 missing)

Node number 10: 15 observations
  mean=23.8, MSE=4.026667

Node number 11: 8 observations
  mean=25.625, MSE=4.984375
Regression Tree Example
(continued)
• plotcp(z.auto)
• z2.auto <- prune(z.auto,cp=0.1)
• post(z2.auto, file="", cex=1)
Complexity Parameter Plot
(Figure: plotcp output, X-val Relative Error plotted against cp (Inf, 0.28, 0.042, 0.011), with the corresponding tree sizes 1 to 4 on the top axis.)
(Figure: the unpruned regression tree from the previous slide, repeated alongside the cp plot.)
Pruned Regression Tree
Endpoint = Mileage
  24.58  n=60
    Weight >= 2568:  22.47  n=45
      Weight >= 3088:  20.41  n=22
      Weight <  3088:  24.43  n=23
    Weight <  2568:  30.93  n=15
Classification Methods
• Project the attribute space into decision regions
– Decision trees: piecewise constant approximation
– Logistic regression: linear log-odds approximation
– Discriminant analysis and neural nets: linear & non-linear separators
• Density estimation coupled with a decision rule
– E.g. Naïve Bayes
• Define a metric space and decide based on proximity
– One type of instance-based learning
– K-nearest neighbour methods (see the sketch after this list)
• IBk algorithm in Weka
– Would like to drop noisy and unnecessary points
• Simple algorithm based on success rate confidence intervals available in Weka
– Compares naïve prediction with predictions using that instance
– Must choose suitable acceptance and rejection confidence levels
• Many of these approaches can produce probability distributions as well
as predictions
– Depending on the application, this information may be useful
• Such as when results reported to expert (e.g. loan officer) as input to their
decision
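A minimal sketch of the k-nearest-neighbour approach using the class package listed earlier, illustrated with a simple train/test split on iris:

library(class)
set.seed(1)
idx   <- sample(nrow(iris), 100)                 # 100 training cases, 50 test cases
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]
cl    <- iris$Species[idx]

pred <- knn(train, test, cl, k = 5)              # majority vote of 5 nearest neighbours
mean(pred == iris$Species[-idx])                 # test-set accuracy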
Numeric Prediction Methods
• Linear regression
• Splines, including smoothing splines and
multivariate adaptive regression splines (MARS)
• Generalised additive models (GAM)
• Locally weighted regression (lowess, loess)
• Regression and Model Trees
– CART, C4.5, M5’
• Artificial neural networks (ANNs)
Artificial Neural Networks (ANNs)
• An ANN is a network of many simple processors (or units), that are
connected by communication channels that carry numeric data
• ANNs are very flexible, encompassing nonlinear regression models,
discriminant models, and data reduction models
– They do require some expertise to set up
– An appropriate architecture needs to be selected and tuned for each
application
• They can be useful tools for learning from examples to find patterns
in data and predict outputs
– However on their own, they tend to overfit the training data
– Meta-learning tools are needed to choose the best fit
• Various network architectures in common use
– Multilayer perceptron (MLP)
– Radial basis functions (RBF)
– Self-organising maps (SOM)
• ANNs have been applied to data editing and imputation, but not
widely
Meta-Learning Methods - Bagging
• General methods for improving the performance of most learning
algorithms
• Bootstrap aggregation, bagging for short
– Select B bootstrap samples from the data
• Selected with replacement, same # of instances
– Can use parametric or non-parametric bootstrap
– Fit the model/learner on each bootstrap sample
– The bagged estimate is the average prediction from all these B models
• E.g. for a tree learner, the bagged estimate is the average prediction
from the resulting B trees
• Note that this is not a tree
– In general, bagging a model or learner does not produce a model or
learner of the same form
• Bagging reduces the variance of unstable procedures like
regression trees, and can greatly improve prediction accuracy
– However it does not always work for poor 0-1 predictors
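A minimal R sketch of bagging a regression tree, reusing the car.test.frame data from the regression tree example:

library(rpart)
data(car.test.frame)
set.seed(1)
B <- 50
trees <- lapply(1:B, function(b) {
  boot <- car.test.frame[sample(nrow(car.test.frame), replace = TRUE), ]  # bootstrap sample
  rpart(Mileage ~ Weight, data = boot)
})

# Bagged estimate = average of the B tree predictions (note: not itself a tree)
preds  <- sapply(trees, predict, newdata = car.test.frame)
bagged <- rowMeans(preds)
head(bagged)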
Meta-Learning Methods - Boosting
• Boosting is a powerful technique for improving
accuracy
• The “AdaBoost.M1” method (for classifiers):
– Give each instance an initial weight of 1/n
– For m=1 to M:
• Fit model using the current weights, & store resulting model m
• If prediction error rate “err” is zero or >= 0.5, terminate loop.
• Otherwise calculate αm=log((1-err)/err)
– This is the log odds of success
• Then adjust weights for incorrectly classified cases by
multiplying them by exp(αm), and repeat
– Predict using a weighted majority vote: ΣαmGm(x),
where Gm(x) is the prediction from model m
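A sketch of these AdaBoost.M1 steps in R, using one-split rpart trees (decision stumps) as the base classifier and the kyphosis data shipped with rpart as a stand-in binary problem; this illustrates the weight updates above, not Weka's implementation:

library(rpart)
dat <- kyphosis
y   <- dat$Kyphosis
n   <- nrow(dat); M <- 20
w   <- rep(1 / n, n)                       # initial weight 1/n per instance
models <- list(); alpha <- numeric(0)

for (m in 1:M) {
  fit  <- rpart(Kyphosis ~ Age + Number + Start, data = dat, weights = w,
                method = "class", control = rpart.control(maxdepth = 1))
  pred <- predict(fit, dat, type = "class")
  err  <- sum(w * (pred != y)) / sum(w)    # weighted training error
  if (err == 0 || err >= 0.5) break        # terminate as described above
  alpha[m] <- log((1 - err) / err)         # log odds of success
  w <- w * exp(alpha[m] * (pred != y))     # up-weight the misclassified cases
  models[[m]] <- fit
}

# Weighted majority vote over the fitted stumps
score <- numeric(n)
for (m in seq_along(models))
  score <- score + alpha[m] * (predict(models[[m]], dat, type = "class") == "present")
boosted <- ifelse(score > sum(alpha) / 2, "present", "absent")
mean(boosted == y)                         # training accuracy of the boosted classifier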
Meta-Learning Methods - Boosting
• For example, for the German credit dataset:
– using 100 iterations of AdaBoost.M1 with the
DecisionStump algorithm,
– 10-fold cross-validation gives an error rate of
24.9% (compared to 26.1% for J4.8)
Association Rules
• Data on n purchase baskets in form (id, item1, item2, …, itemk)
– For example, purchases from a supermarket
• Association rules are statements of the form:
– “When people buy tea, they also often buy coffee.”
• May be useful for product placement decisions or cross-selling
recommendations
• We say there is an association rule i1 ->i2 if
– i1 and i2 occur together in at least s% of the n baskets (the support)
– And at least c% of the baskets containing item i1 also contain i2 (the
confidence)
• The confidence criterion ensures that “often” is a large enough
proportion of the antecedent cases to be interesting
• The support criterion should be large enough that the resulting rules
have practical importance
– Also helps to ensure reliability of the conclusions
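These definitions are easy to compute directly; a toy sketch in base R on a made-up 0/1 basket matrix (in practice a package such as arules implements the Apriori algorithm discussed on the next slide):

# Rule "tea -> coffee" on eight toy baskets
baskets <- data.frame(
  tea    = c(1, 1, 0, 1, 0, 0, 1, 0),
  coffee = c(1, 1, 1, 0, 0, 1, 1, 0),
  milk   = c(0, 1, 1, 1, 0, 0, 1, 1)
)

support    <- mean(baskets$tea == 1 & baskets$coffee == 1)   # both items: 3/8
confidence <- support / mean(baskets$tea == 1)               # P(coffee | tea): 3/4
lift       <- confidence / mean(baskets$coffee == 1)         # confidence vs baseline coffee rate
c(support = support, confidence = confidence, lift = lift)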
Association rules
• The support/confidence approach is widely used
– Efficiently implemented in the Apriori algorithm
• First identify item sets with sufficient support
• Then turn each item set into sets of rules with sufficient confidence
• This method was originally developed in the database
community, so there has been a focus on efficient
methods for large databases
– “Large” means up to around 100 million instances, and about ten
thousand binary attributes
• However this approach can find a vast number of rules,
and it can be difficult to make sense of these
• One useful extension is to identify only the rules with
high enough lift (or odds ratio)
Classification vs Association Rules
• Classification rules predict the value of a
pre-specified attribute, e.g.
• If outlook=sunny and humidity=high then play =no
• Association rules predict the value of an
arbitrary attribute (or combination of
attributes)
• E.g. If temperature=cool then humidity=normal
• If humidity=normal and play=no then windy=true
• If temperature=high and humidity=high then
play=no
Clustering – EM Algorithm
• Assume that the data is from a mixture of normal
distributions
– I.e. one normal component for each cluster
• For simplicity, consider one attribute x and two
components or clusters
– Model has five parameters: (p, μ1, σ1, μ2, σ2) = θ
• Log-likelihood:
l(θ | X) = Σ_{i=1}^{n} log[ p φ1(xi) + (1 − p) φ2(xi) ],
where φj(xi) is the normal density for the jth component.
• This is hard to maximise directly
– Use the expectation-maximisation (EM) algorithm instead
Clustering – EM Algorithm
• Think of data as being augmented by a latent 0/1
variable di indicating membership of cluster 1
• If the values of this variable were known, the log-likelihood would be:
l(θ | X, D) = Σ_{i=1}^{n} log[ di φ1(xi) + (1 − di) φ2(xi) ]
• Starting with initial values for the parameters, calculate
the expected value of di
• Then substitute this into the above log-likelihood and
maximise to obtain new parameter values
– This will have increased the log-likelihood
• Repeat until the log-likelihood converges
Clustering – EM Algorithm
• Resulting estimates may only be a local
maximum
– Run several times with different starting points
to find global maximum (hopefully)
• With parameter estimates, can calculate segment membership probabilities for each case:
P(Di = 1 | θ, xi) = p φ1(xi) / [ p φ1(xi) + (1 − p) φ2(xi) ]
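A minimal R sketch of this EM loop for one attribute and two normal components, on simulated data (run here for a fixed number of iterations; in practice iterate until the log-likelihood converges):

set.seed(1)
x <- c(rnorm(100, mean = 0, sd = 1), rnorm(50, mean = 4, sd = 1))

# Initial values for theta = (p, mu1, sigma1, mu2, sigma2)
p <- 0.5; mu1 <- min(x); mu2 <- max(x); s1 <- s2 <- sd(x)

for (iter in 1:100) {
  # E-step: expected cluster-1 membership (the probability on this slide)
  d1 <- p * dnorm(x, mu1, s1)
  d2 <- (1 - p) * dnorm(x, mu2, s2)
  e  <- d1 / (d1 + d2)

  # M-step: weighted estimates of the five parameters
  p   <- mean(e)
  mu1 <- sum(e * x) / sum(e)
  mu2 <- sum((1 - e) * x) / sum(1 - e)
  s1  <- sqrt(sum(e * (x - mu1)^2) / sum(e))
  s2  <- sqrt(sum((1 - e) * (x - mu2)^2) / sum(1 - e))
}

c(p = p, mu1 = mu1, sigma1 = s1, mu2 = mu2, sigma2 = s2)
sum(log(p * dnorm(x, mu1, s1) + (1 - p) * dnorm(x, mu2, s2)))   # log-likelihood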
Clustering – EM Algorithm
• Extending to more latent classes is easy
– Information criteria such as AIC and BIC are often used to decide
how many are appropriate
• Extending to multiple attributes is easy if we assume they
are independent, at least conditioning on segment
membership
– It is possible to introduce associations, but this can rapidly
increase the number of parameters required
• Nominal attributes can be accommodated by allowing
different discrete distributions in each latent class, and
assuming conditional independence between attributes
• Can extend this approach to handle joint clustering and
prediction models, as mentioned in the MVA lectures
Clustering - Scalability Issues
• k-means algorithm is also widely used
• However this and the EM-algorithm are slow on large databases
• So is hierarchical clustering - requires O(n^2) time
• Iterative clustering methods require full DB scan at each iteration
• Scalable clustering algorithms are an area of active research
• A few recent algorithms:
– Distance-based/k-Means
• Multi-Resolution kd-Tree for K-Means [PM99]
• CLIQUE [AGGR98]
• Scalable K-Means [BFR98a]
• CLARANS [NH94]
– Probabilistic/EM
• Multi-Resolution kd-Tree for EM [Moore99]
• Scalable EM [BRF98b]
• CF Kernel Density Estimation [ZRL99]
Ethics of Data Mining
• Data mining and data warehousing raise ethical and legal
issues
• Combining information via data warehousing could violate
Privacy Act
– Must tell people how their information will be used when the data
is obtained
• Data mining raises ethical issues mainly during application
of results
– E.g. using ethnicity as a factor in loan approval decisions
– E.g. screening job applications based on age or sex (where not
directly relevant)
– E.g. declining insurance coverage based on neighbourhood if this
is related to race (“red-lining” is illegal in much of the US)
• Whether something is ethical depends on the application
– E.g. probably ethical to use ethnicity to diagnose and choose
treatments for a medical problem, but not to decline medical
insurance