Chapter 11 Statistical Method

Chapter 11
Statistical Techniques
Chapter Objectives
• Understand when linear regression is an appropriate
data mining technique.
• Know how to perform linear regression with Microsoft
Excel’s LINEST function.
• Know that logistic regression can be used to build
supervised learner models for datasets having a binary
outcome.
• Understand how Bayes classifier is able to build
supervised models for datasets having categorical data,
numeric data, or a combination of both data types.
Data Warehouse and Data Mining
• Know how agglomerative clustering is applied to partition
data instances into disjoint clusters.
• Understand that conceptual clustering is an unsupervised
data mining technique that builds a concept hierarchy to
partition data instances.
• Know that the EM algorithm uses a statistical parameter
adjustment technique to cluster data instances.
• Understand the basic features that differentiate statistical
and machine learning data mining methods.
Linear Regression Analysis
Logistic Regression
Bayes Classifier
Clustering Algorithms
Heuristics or Statistics?
Here is one way to categorize inductive problem-solving methods:
• Query and visualization techniques
• Machine learning techniques
• Statistical techniques
Query and visualization techniques generally fall into
one of three groups:
• Query tools
• OLAP tools
• Visualization tools
Chapter Summary
Data mining techniques come in many shapes and
forms. A favorite statistical technique for estimation and
prediction problems is linear regression. Linear regression
attempts to model the variation in a dependent variable as
a linear combination of one or more independent
variables. Linear regression is an appropriate data mining
strategy when the relationship between the dependent and
independent variables is nearly linear. Microsoft Excel’s
LINEST function provides an easy mechanism for
performing multiple linear regression.
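As a rough illustration of what a multiple linear regression computes, the sketch below fits the coefficients by solving the least-squares normal equations in plain Python. The function names and the small dataset are hypothetical, chosen only for this example; it is not the procedure Excel uses internally.

```python
def fit_linear_regression(rows, targets):
    """Least-squares fit of y = b0 + b1*x1 + ... + bn*xn via the normal equations."""
    # Prepend a constant 1.0 to each row so the first coefficient is the intercept.
    X = [[1.0] + list(r) for r in rows]
    n = len(X[0])
    # Build the augmented normal-equation system [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(coeffs, row):
    """Evaluate the fitted linear combination for one instance."""
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], row))


# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 with no noise.
rows = [(1, 1), (2, 1), (3, 2), (4, 3), (5, 5)]
targets = [1 + 2 * a + 3 * b for a, b in rows]
coeffs = fit_linear_regression(rows, targets)
print([round(c, 6) for c in coeffs])
```

Because the toy data is exactly linear, the fit recovers the generating coefficients up to rounding. LINEST reports the same kind of coefficients, but lists them in reverse variable order with the intercept last.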
Linear regression is a poor choice when the
outcome is binary. The problem lies in the fact that the
value restriction placed on the dependent variable is
not observed by the regression equation. That is,
because linear regression produces a straight-line
function, values of the dependent variable are
unbounded in both the positive and negative
directions. For the two-outcome case, logistic
regression is a better choice. Logistic regression is a
nonlinear regression technique that associates a
conditional probability value with each data instance.
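The contrast described above can be sketched in a few lines. The example assumes the standard logistic form p(y = 1 | x) = 1 / (1 + e^(-z)), where z is the output of a linear equation; the weights and inputs are hypothetical values picked only to show the effect.

```python
import math


def linear_output(weights, bias, x):
    """Straight-line (unbounded) output of a linear equation."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


def logistic(z):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Natural log of the odds p / (1 - p); the inverse of logistic()."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients, not taken from the chapter.
weights, bias = [0.8, -0.5], -1.0
for x in [(-10.0, 4.0), (0.0, 0.0), (10.0, -4.0)]:
    z = linear_output(weights, bias, x)
    p = logistic(z)
    print(f"x={x}  linear={z:+.2f}  p(y=1|x)={p:.4f}")
```

The linear output ranges well outside [0, 1], while the logistic transform of the same value always stays strictly between 0 and 1, so it can be read as a probability of class membership.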
Bayes classifier offers a simple yet powerful
supervised classification technique. The model
assumes all input attributes to be of equal importance
and independent of one another. Even though these
assumptions are likely to be false, Bayes classifier
still works quite well in practice. Bayes classifier can
be applied to datasets containing both categorical and
numeric data. Also, unlike many statistical classifiers,
Bayes classifier can be applied to datasets containing
a wealth of missing items.
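A minimal sketch of the idea follows, restricted to categorical attributes. It multiplies the a priori class probability by one conditional probability per attribute (the independence assumption) and simply skips missing items, marked here with None. The add-one smoothing, the function names, and the toy data are assumptions made for this example, not details from the chapter.

```python
from collections import Counter, defaultdict


def train(instances, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (class, attribute index) -> value counts
    domains = defaultdict(set)            # attribute index -> distinct observed values
    for inst, label in zip(instances, labels):
        for i, value in enumerate(inst):
            if value is None:             # missing item: contributes nothing
                continue
            value_counts[(label, i)][value] += 1
            domains[i].add(value)
    return class_counts, value_counts, domains


def posterior(model, instance):
    """Return P(class | instance) for every class, via Bayes theorem."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total               # a priori probability P(class)
        for i, value in enumerate(instance):
            if value is None:             # skip attributes missing at query time
                continue
            counts = value_counts[(c, i)]
            k = len(domains[i] | {value})
            # P(value | class) with add-one smoothing (an assumption of this sketch)
            score *= (counts[value] + 1) / (sum(counts.values()) + k)
        scores[c] = score
    evidence = sum(scores.values())       # P(evidence), the normalizing term
    return {c: s / evidence for c, s in scores.items()}


# Hypothetical training data: (has_coupon, gender) -> purchase decision.
instances = [("yes", "male"), ("yes", "female"), ("no", "female"),
             ("no", None), ("yes", "male"), ("no", "female")]
labels = ["buy", "buy", "skip", "skip", "buy", "skip"]
model = train(instances, labels)
result = posterior(model, ("yes", None))
print(result)
```

Numeric attributes would need a density estimate (commonly a normal distribution per class) in place of the frequency counts; that part is omitted here.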
Agglomerative clustering is a favorite unsupervised
clustering technique. Agglomerative clustering begins
by assuming each data instance represents its own
cluster. Each iteration of the algorithm merges the
most similar pair of clusters. Several options for
computing instance and cluster similarity scores and
cluster merging procedures exist. Also, when the data
to be clustered is real-valued, defining a measure of
instance similarity can be a challenge. One common
approach is to use simple Euclidean distance. A
widespread application of agglomerative clustering is
its use as a prelude to other clustering techniques.
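The merge loop described above can be sketched as follows. This version assumes Euclidean distance between instances and average-link distance between clusters, which is only one of the several options mentioned; the points are hypothetical. It records every intermediate clustering so that a final choice can be made afterwards.

```python
import math


def average_link(c1, c2):
    """Mean Euclidean distance over all cross-cluster pairs of points."""
    return sum(math.dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))


def agglomerate(points):
    """Return every clustering produced, from n singletons down to one cluster."""
    clusters = [[p] for p in points]          # each instance starts as its own cluster
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the most similar (closest) pair of clusters.
        i, j = min(
            ((a, b) for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda ab: average_link(clusters[ab[0]], clusters[ab[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        history.append([list(c) for c in clusters])
    return history


# Hypothetical two-dimensional instances.
points = [(1.0, 1.0), (1.5, 1.2), (5.0, 5.0), (5.2, 4.8), (9.0, 1.0)]
history = agglomerate(points)
for level in history:
    print(len(level), "clusters:", level)
```

Each pass recomputes all pairwise cluster distances, which is fine for a small illustration but too slow for large datasets.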
Conceptual clustering is an unsupervised technique
that incorporates incremental learning to form a hierarchy
of concepts. The concept hierarchy takes the form of a
tree structure where the root node represents the highest
level of concept generalization. Conceptual clustering
systems are particularly appealing because the trees they
form have been shown to consistently determine
psychologically preferred levels in human classification
hierarchies. Also, conceptual clustering systems lend
themselves well to explaining their behavior. A major
problem with conceptual clustering systems is that
instance ordering can have a marked impact on the results
of the clustering. A nonrepresentative ordering of data
instances can lead to a less than optimal clustering.
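Conceptual clustering systems need some way to score a candidate partition when deciding where a new instance belongs. One evaluation function listed in the Key Terms is category utility. The sketch below computes a commonly cited form of it for categorical data; the exact formula used in the chapter may differ, and the function name and instances are hypothetical.

```python
from collections import Counter


def category_utility(clusters):
    """clusters: list of clusters, each a list of equal-length tuples of categorical values."""
    everything = [inst for c in clusters for inst in c]
    n, n_attrs = len(everything), len(everything[0])

    def sum_sq_probs(instances):
        # Sum over attributes and values of P(attribute = value)^2 within `instances`.
        total = 0.0
        for i in range(n_attrs):
            counts = Counter(inst[i] for inst in instances)
            total += sum((k / len(instances)) ** 2 for k in counts.values())
        return total

    baseline = sum_sq_probs(everything)       # expected correct guesses with no clusters
    gain = sum(len(c) / n * (sum_sq_probs(c) - baseline) for c in clusters)
    return gain / len(clusters)


# Hypothetical instances described by (color, shape).
a, b = ("red", "round"), ("red", "round")
c, d = ("blue", "square"), ("blue", "square")
good = category_utility([[a, b], [c, d]])     # clusters match the attribute values
mixed = category_utility([[a, c], [b, d]])    # clusters ignore the attribute values
print(good, mixed)
```

A partition whose clusters make attribute values more predictable scores higher than one that does not improve on guessing from the whole dataset.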
The EM (expectation-maximization) algorithm is a
statistical technique that makes use of the finite Gaussian
mixtures model. The mixtures model assigns each
individual data instance a probability that it would have a
certain set of attribute values given it was a member of a
specified cluster. The model assumes all attributes to be
independent random variables. The EM algorithm is similar
to the K-Means procedure in that a set of parameters is
recomputed until a desired convergence value is achieved.
A lack of explanation about what has been discovered is a
problem with EM as it is with many clustering systems.
Applying a supervised model to analyze the results of an
unsupervised clustering is one technique to help explain the
results of an EM clustering.
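The parameter re-estimation loop can be sketched for the simplest case: one numeric attribute and two Gaussian clusters. The starting means, initial variances, tolerance, and data below are all assumptions chosen for illustration.

```python
import math


def gaussian(x, mean, var):
    """Normal probability density at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(data, means, iterations=50, tol=1e-6):
    """Fit a two-component, one-dimensional Gaussian mixture with EM."""
    means = list(means)
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: probability that each instance belongs to each cluster.
        resp = []
        for x in data:
            dens = [w * gaussian(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: recompute the parameters from the weighted instances.
        new_means = []
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mean = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, data)) / nk
            weights[k] = nk / len(data)
            variances[k] = max(var, 1e-6)     # guard against zero variance
            new_means.append(mean)
        shift = max(abs(a - b) for a, b in zip(means, new_means))
        means = new_means
        if shift < tol:                       # convergence test on the means
            break
    return means, variances, weights


# Hypothetical data drawn around 1.0 and 5.0.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights = em_two_gaussians(data, means=(0.0, 6.0))
print(means, variances, weights)
```

Unlike K-Means, each instance contributes to every cluster in proportion to its membership probability rather than being assigned to a single cluster.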
Key Terms
A priori probability. The probability a hypothesis is
true lacking evidence to support or reject the
hypothesis.
Agglomerative clustering. An unsupervised
technique where each data instance initially
represents its own cluster. Successive iterations of
the algorithm merge pairs of highly similar clusters
until all instances become members of a single
cluster. In the last step, a decision is made about
which clustering is the best final result.
Basic-level nodes. The nodes in a concept hierarchy
that represent concepts easily identified by humans.
Bayes classifier. A supervised learning approach that
classifies new instances by using Bayes theorem.
Bayes theorem. The probability of a hypothesis given
some evidence is equal to the probability of the
evidence given the hypothesis, times the probability
of the hypothesis, divided by the probability of the
evidence.
Bayesian Information Criterion (BIC). The BIC
gives the posterior odds for one data mining model
against another model assuming neither model is
favored initially.
Category utility. An unsupervised evaluation function that
measures the gain in the “expected number” of correct
attribute-value predictions for a specific object if it
were placed within a given category or cluster.
Coefficient of determination. For a regression analysis,
the squared correlation between actual and estimated values for
the dependent variable.
Concept hierarchy. A tree structure where each node of
the tree represents a concept at some level of
abstraction. Nodes toward the top of the tree are the
most general. Leaf nodes represent individual data
instances.
Conceptual clustering. An incremental unsupervised
clustering method that creates a concept hierarchy
from a set of input instances.
Conditional probability. The conditional probability
of evidence E given hypothesis H, denoted by P(E | H),
is the probability that E is true given that H is true.
Incremental learning. A form of learning that is
supported in an unsupervised environment where
instances are presented sequentially. As each new
instance is seen, the learning model is modified to
reflect the addition of the new instance.
Linear regression. A statistical technique that
models the variation in a numeric dependent
variable as a linear combination of one or
several independent variables.
Logistic regression. A nonlinear regression
technique for problems having a binary
outcome. A created regression equation limits
the values of the output attribute to values
between 0 and 1. This allows output values to
represent a probability of class membership.
Logit. The natural logarithm of the odds
p(y = 1 | x)/[1 - p(y = 1 | x)], where p(y = 1 | x) is
the conditional probability that the outcome y is 1
given the feature vector x.
Mixture. A set of n probability distributions where
each distribution represents a cluster.
Model tree. A decision tree where each leaf node
contains a linear regression equation.
Regression. The process of developing an expression
that predicts a numeric output value.
Regression tree. A decision tree where leaf nodes
contain averaged numeric values.
Simple linear regression. A regression equation
with a single independent variable.
Slope-intercept form. A linear equation of the
form y = ax + b where a is the slope of the line
and b is the y-intercept.