Introduction to KDD for Tony's MI Course
1
COMP3503
Intro to Inductive Modeling
with Daniel L. Silver
2
Agenda
Deductive and Inductive Modeling
Learning Theory and Generalization
Common Statistical Methods
3
The KDD Process
[Figure: the KDD process pipeline. Data Sources → Data Consolidation → Data Warehouse (Consolidated Data) → Selection and Preprocessing → Prepared Data → Data Mining → Patterns & Models (e.g. p(x) = 0.02) → Interpretation and Evaluation → Knowledge]
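Read top to bottom, the figure describes a pipeline. A minimal sketch of those stages in Python (the function names and the toy records are illustrative assumptions, not part of the slides):

```python
# A minimal sketch of the KDD pipeline stages named above.
# All function names and the toy data are illustrative assumptions.

def consolidate(sources):
    """Data consolidation: merge raw sources into one table (the 'warehouse')."""
    return [row for source in sources for row in source]

def select_and_preprocess(rows):
    """Selection and preprocessing: keep complete rows only."""
    return [row for row in rows if None not in row]

def mine(rows):
    """Data mining: induce a trivial 'model' -- here, the rate of positives."""
    positives = sum(1 for (_, label) in rows if label == 1)
    return positives / len(rows)   # a pattern of the kind the figure labels p(x)

def evaluate(model):
    """Interpretation and evaluation: turn the pattern into knowledge."""
    print(f"Estimated probability of a positive case: p(x) = {model:.2f}")

sources = [[(25, 0), (40, 1)], [(33, 0), (51, None)]]
evaluate(mine(select_and_preprocess(consolidate(sources))))
```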
4
Deductive and Inductive Modeling
5
Induction versus Deduction
[Figure: Deduction is top-down verification, moving from a Model or General Rule down to Examples A, B and C; Induction is bottom-up construction, moving from the examples up to the model]
6
Deductive Modeling
Top-down (toward the data) verification of a hypothesis
The hypothesis is generated within the mind of the data miner
Exploratory tools such as OLAP and data visualization software are used
Models tend to be used for description
7
Inductive Modeling
Bottom-up (from the data) development of a hypothesis
The hypothesis is generated by the technology directly from the data
Statistical and machine learning tools such as regression, decision trees and artificial neural networks are used
Models can be used for prediction
8
Inductive Modeling
Objective: Develop a general model or hypothesis from specific examples
Function approximation (curve fitting)
[Figure: a curve f(x) fitted through sample points plotted against x]
Classification (concept learning, pattern recognition)
[Figure: classes A and B separated in the (x1, x2) plane]
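A hedged sketch of the two settings (the data, the polynomial degree and the hand-picked boundary are all illustrative assumptions):

```python
import numpy as np

# Function approximation (curve fitting): fit f(x) from sample points.
# The data and the choice of a degree-2 polynomial are illustrative assumptions.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 4.2, 8.8, 16.1])          # roughly y = x**2
f = np.poly1d(np.polyfit(x, y, deg=2))
print(f(5.0))                                      # predict at an unseen x

# Classification: assign points in the (x1, x2) plane to class A or B.
# Here a hand-picked linear boundary x1 + x2 = 5 stands in for a learned one.
def classify(x1, x2):
    return "A" if x1 + x2 < 5.0 else "B"
print(classify(1.0, 2.0), classify(4.0, 3.0))      # -> A B
```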
9
Learning Theory and Generalization
10
Inductive Modeling = Learning
Basic Framework for Inductive Learning
[Figure: the Environment supplies Training Examples (x, f(x)) to an Inductive Learning System, which produces an Induced Model or Hypothesis, h(x) ≈ f(x)?; Testing Examples are then passed through the model to give the Output Classification (x, h(x))]
A problem of representation and search for the best hypothesis, h(x).
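As one concrete reading of this framework (the training data and the 1-nearest-neighbour learner are illustrative stand-ins, not the slides' method):

```python
# Minimal sketch of the framework: train on (x, f(x)) pairs, then
# classify test examples with the induced hypothesis h(x).
# The data and the 1-nearest-neighbour learner are illustrative assumptions.

training = [(1.0, "A"), (2.0, "A"), (8.0, "B"), (9.0, "B")]   # (x, f(x))

def induce(examples):
    """'Search' the hypothesis space; here h simply copies the nearest neighbour."""
    def h(x):
        nearest_x, nearest_label = min(examples, key=lambda ex: abs(ex[0] - x))
        return nearest_label
    return h

h = induce(training)
testing = [1.5, 8.5]
print([(x, h(x)) for x in testing])    # output classification (x, h(x))
```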
11
Inductive Modeling = Data Mining
Ideally, a hypothesis (model) is:
• Complete – covers all potential examples
• Consistent – no conflicts
• Accurate – able to generalize to previously unseen examples
• Valid – presents a truth
• Transparent – human-readable knowledge
12
Inductive Modeling
Generalization
The objective of learning is to achieve good generalization to new cases; otherwise, just use a look-up table.
Generalization can be defined as a mathematical interpolation or regression over a set of training points:
[Figure: a smooth curve f(x) fitted over training points plotted against x]
13
Inductive Modeling
Generalization
Generalization accuracy can be guaranteed for a specified confidence level, given a sufficient number of examples
Models can be validated for accuracy by using a previously unseen test set of examples
14
Learning Theory
Probably Approximately Correct (PAC) theory of learning (Leslie Valiant, 1984)
Poses questions such as:
• How many examples are needed for good generalization?
• How long will it take to create a good model?
Answers depend on:
• Complexity of the actual function
• The desired level of accuracy of the model (e.g. 75%)
• The desired confidence in finding a model with this accuracy (19 times out of 20 = 95%)
15
Learning Theory
[Figure: the space of all possible examples, containing + and − instances; the true concept c and the hypothesis h overlap, and the shaded regions are where c and h disagree]
The true error of a hypothesis h is the probability that h will misclassify an instance drawn at random from X:
error(h) = P[c(x) ≠ h(x)]
16
Learning Theory
Three notions of error:
Training Error
• How often the training set is misclassified
Test Error
• How often an independent test set is misclassified
True Error
• How often the entire population of possible examples would be misclassified
• Must be estimated from the Test Error
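A minimal sketch of the first two notions for a fixed hypothesis (the threshold rule and both datasets are illustrative assumptions; the true error would require the entire population):

```python
# Estimate error rates for a fixed hypothesis h on labelled (x, y) data.
# The threshold rule and both datasets are illustrative assumptions.

def h(x):
    return 1 if x > 5.0 else 0          # a fixed hypothesis

def error_rate(examples):
    """Fraction of examples that h misclassifies."""
    wrong = sum(1 for x, y in examples if h(x) != y)
    return wrong / len(examples)

train = [(1.0, 0), (4.0, 0), (6.0, 1), (9.0, 1)]
test  = [(2.0, 0), (5.5, 0), (7.0, 1), (8.0, 1)]

print("training error:", error_rate(train))   # 0.0
print("test error:    ", error_rate(test))    # 0.25 -- our estimate of true error
```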
17
Linear and Non-Linear Problems
Linear Problems
• Linear functions
• Linearly separable classifications
[Figure: a straight-line fit f(x) vs x; classes A and B divided by a straight line in the (x1, x2) plane]
Non-linear Problems
• Non-linear functions
• Not linearly separable classifications
[Figure: a curved fit f(x) vs x; classes A and B interleaved in the (x1, x2) plane so that no single straight line separates them]
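The standard illustration of a classification that is not linearly separable is XOR. A brute-force check, assuming the usual textbook AND/XOR toy datasets (not from the slides):

```python
# AND is linearly separable; XOR is not.
# Check every straight-line boundary of the form w1*x1 + w2*x2 > b
# over a small grid of candidate weights (a brute-force illustration).

import itertools

def separable(examples):
    grid = [x / 4.0 for x in range(-8, 9)]
    for w1, w2, b in itertools.product(grid, repeat=3):
        if all((w1 * x1 + w2 * x2 > b) == bool(y) for (x1, x2), y in examples):
            return True
    return False

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print("AND separable:", separable(AND))   # True
print("XOR separable:", separable(XOR))   # False
```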
18
Inductive Bias
Every inductive modeling system has an Inductive Bias
Consider a simple set of training examples like the following:
[Figure: a handful of training points f(x) plotted against x]
Go to generalize.xls
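In the spirit of the generalize.xls exercise (the numbers here are my own, not the spreadsheet's), fitting the same points under two different biases shows that the bias, not the data alone, determines the predictions:

```python
import numpy as np

# Two inductive biases fitted to the same five training points:
# "the world is a straight line" vs "the world is a degree-4 polynomial".
# The data points are an illustrative assumption.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

linear = np.poly1d(np.polyfit(x, y, deg=1))
wiggly = np.poly1d(np.polyfit(x, y, deg=4))   # interpolates every point exactly

for x_new in (6.0, 10.0):
    print(x_new, round(linear(x_new), 2), round(wiggly(x_new), 2))
# Both hypotheses fit the training data, but they generalize very differently.
```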
19
Inductive Bias
Can you think of any biases that you commonly use when you are learning something new?
Is there one best inductive bias?
20
Inductive Modeling Methods
Automated Exploration/Discovery
• e.g. discovering new market segments
• distance and probabilistic clustering algorithms
[Figure: clusters of points in the (x1, x2) plane]
Prediction/Classification
• e.g. forecasting gross sales given current factors
• statistics (regression, K-nearest neighbour)
• artificial neural networks, genetic algorithms
[Figure: a fitted curve f(x) against x]
Explanation/Description
• e.g. characterizing customers by demographics
• inductive decision trees/rules
• rough sets, Bayesian belief nets
(e.g. if age > 35 and income < $35k then ...; see the sketch below)
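The slide's example rule, written out as runnable code (only the thresholds age > 35 and income < $35k come from the slide; the function name, field layout and customer records are illustrative):

```python
# The slide's example rule as executable code. The function name and the
# customer records are illustrative assumptions; only the thresholds
# (age > 35, income < $35k) come from the slide.

def likely_buyer(age, income):
    if age > 35 and income < 35_000:
        return True                      # the "then ..." branch of the rule
    return False

customers = [(42, 28_000), (29, 60_000), (50, 90_000)]
for age, income in customers:
    print((age, income), "->", likely_buyer(age, income))
```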
21
Common Statistical Methods
22
Linear Regression
Y = b0 + b1 X1 + b2 X2 + ...
The coefficients b0, b1, ... determine a line (or, in higher dimensions, a hyperplane) that fits the data
Closed-form solution via least squares (computes the smallest sum of squared distances between the examples and the predicted values of Y)
Inductive bias: the solution can be modeled by a straight line or hyperplane
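A minimal sketch of the closed-form solution via the normal equations (the data is an illustrative assumption; numpy.linalg.lstsq would do the same job more robustly):

```python
import numpy as np

# Closed-form least squares: solve (X^T X) b = X^T Y for the coefficients.
# The data below is an illustrative assumption (Y is roughly 1 + 2*X1 + 3*X2).
X1 = np.array([1.0, 2.0, 3.0, 4.0])
X2 = np.array([0.0, 1.0, 0.0, 1.0])
Y  = np.array([3.1, 7.9, 7.0, 12.1])

X = np.column_stack([np.ones_like(X1), X1, X2])   # column of 1s gives b0
b = np.linalg.solve(X.T @ X, X.T @ Y)             # normal equations
print(b)                                           # approx [b0, b1, b2]
```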
23
Linear Regression
Y = b0 + b1 X1 + b2 X2 + ...
A great way to start since it assumes you are modeling a simple function
… Why?
24
Logistic Regression
Y = 1/(1 + e^(-Z)), where Z = b0 + b1 X1 + b2 X2 + ...
[Figure: the S-shaped logistic curve, with Y rising from 0 to 1 as Z increases]
Output is in [0, 1] and represents a probability
The coefficients b0, b1, ... determine an S-shaped non-linear curve that best fits the data
The coefficients are estimated using an iterative maximum-likelihood method
Inductive bias: the solution can be modeled by this S-shaped non-linear surface
25
Logistic Regression
Y = 1/(1 + e^(-Z)), where Z = b0 + b1 X1 + b2 X2 + ...
[Figure: the logistic curve rising from 0 to 1]
Can be used for classification problems
The output can be used as the probability of being of the class (or positive)
Alternatively, any value above a cut-off (typically 0.5) is classified as a positive example
… A logistic regression Javascript page
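A minimal sketch of the iterative fit and the 0.5 cut-off, using plain gradient ascent on the log-likelihood as a stand-in for the full maximum-likelihood procedure (the toy data, learning rate and iteration count are assumptions):

```python
import math

# Fit Y = 1/(1 + e^(-(b0 + b1*X1))) by gradient ascent on the log-likelihood.
# The single-feature data, learning rate and iteration count are assumptions.
data = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 1), (5.0, 1), (6.0, 1)]

b0, b1 = 0.0, 0.0
for _ in range(5000):
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        b0 += 0.1 * (y - p)            # gradient of the log-likelihood
        b1 += 0.1 * (y - p) * x

def prob(x):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

for x in (2.0, 3.5, 5.0):
    p = prob(x)
    print(x, round(p, 3), "positive" if p > 0.5 else "negative")   # 0.5 cut-off
```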
26
THE END
[email protected]
27
Learning Theory
[Figure: the example space X of pairs (x, c(x)), containing + and − instances; the true concept c and the hypothesis h overlap, and the shaded region is where c and h disagree]
x = input attributes
c = true class function (e.g. “likes product”)
h = hypothesis (model)
The true error of a hypothesis h is the probability that h will misclassify an instance drawn at random from X:
err(h) = P[c(x) ≠ h(x)]
28
Generalization
PAC - A Probabilistic Guarantee
|H| = number of possible hypotheses in the modeling system
ε = desired true error, where 0 < ε < 1
δ = parameter giving the desired confidence (1 − δ), where 0 < δ < 1
Then the number of training examples m required to select (with confidence 1 − δ) a hypothesis h with err(h) < ε is given by
m ≥ (1/ε)(ln |H| + ln(1/δ))
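A worked instance of the bound, with illustrative numbers (ε = 0.1, δ = 0.05 and |H| = 2^10 are assumptions, not from the slides):

```python
import math

# m >= (1/epsilon) * (ln|H| + ln(1/delta))
# Illustrative numbers: 10-bit hypothesis space, 90% accuracy, 95% confidence.
H = 2 ** 10          # |H| = number of possible hypotheses
epsilon = 0.1        # desired true error bound
delta = 0.05         # 1 - delta = 0.95 confidence

m = (1 / epsilon) * (math.log(H) + math.log(1 / delta))
print(math.ceil(m))  # minimum number of training examples, about 100
```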