#### Transcript: Introduction to KDD for Tony's MI Course

```
1
COMP3503
Intro to Inductive Modeling
with
Daniel L. Silver
2
Agenda
• Deductive and Inductive Modeling
• Learning Theory and Generalization
• Common Statistical Methods
3
The KDD Process
[Diagram: Data Sources → Data Consolidation → Data Warehouse (Consolidated Data) → Selection and Preprocessing → Prepared Data → Data Mining → Patterns & Models (e.g. p(x) = 0.02) → Interpretation and Evaluation → Knowledge]
4
Deductive and Inductive Modeling
5
Induction versus Deduction
[Diagram: Deduction = top-down verification, from a Model or General Rule down to Examples A, B, and C; Induction = bottom-up construction, from Examples A, B, and C up to a model]
6
Deductive Modeling
• Top-down (toward the data) verification of a hypothesis
• The hypothesis is generated within the mind of the data miner
• Exploratory tools such as OLAP and data visualization software are used
• Models tend to be used for description
7
Inductive Modeling
• Bottom-up (from the data) development of a hypothesis
• The hypothesis is generated by the technology directly from the data
• Statistical and machine learning tools such as regression, decision trees and artificial neural networks are used
• Models can be used for prediction
8
Inductive Modeling
Objective: Develop a general model or hypothesis from specific examples
• Function approximation (curve fitting)
  [Figure: curve f(x) fitted through points in the x, f(x) plane]
• Classification (concept learning, pattern recognition)
  [Figure: classes A and B in the x1, x2 plane]
9
Learning Theory and Generalization
10
Inductive Modeling = Learning
Basic Framework for Inductive Learning
[Diagram: the Environment supplies Training Examples (x, f(x)) and Testing Examples to an Inductive Learning System, which produces an Induced Model or Hypothesis h(x) ≈ f(x) and an Output Classification (x, h(x))]
A problem of representation and search for the best hypothesis, h(x).
11
Inductive Modeling = Data Mining
Ideally, a hypothesis (model) is:
• Complete – covers all potential examples
• Consistent – no conflicts
• Accurate – able to generalize to previously unseen examples
• Valid – presents a truth
12
Inductive Modeling
Generalization
• The objective of learning is to achieve good generalization to new cases; otherwise, just use a look-up table.
• Generalization can be defined as a mathematical interpolation or regression over a set of training points:
  [Figure: smooth curve f(x) fitted through training points in the x, f(x) plane]
13
Inductive Modeling
Generalization
• Generalization accuracy can be guaranteed for a specified confidence level, given a sufficient number of examples
• Models can be validated for accuracy by using a previously unseen test set of examples
14
Learning Theory
Probably Approximately Correct (PAC) theory of learning (Leslie Valiant, 1984)
• Poses questions such as:
  - How many examples are needed for good generalization?
  - How long will it take to create a good model?
• The answers depend on:
  - Complexity of the actual function
  - The desired level of accuracy of the model (e.g. 75%)
  - The desired confidence in finding a model with this accuracy (e.g. 19 times out of 20 = 95%)
15
Learning Theory
[Diagram: the space of all possible examples, with regions for the true class function c and the hypothesis h over + and - instances; the shaded areas are where c and h disagree]
The true error of a hypothesis h is the probability that h will misclassify an instance drawn at random from X:
error(h) = P[c(x) ≠ h(x)]
16
Learning Theory
Three notions of error:
• Training Error
  - How often the training set is misclassified
• Test Error
  - How often an independent test set is misclassified
• True Error
  - How often the entire population of possible examples would be misclassified
  - Must be estimated from the Test Error
17
Linear and Non-Linear Problems
• Linear Problems
  - Linear functions
  - Linearly separable classifications
  [Figures: a straight line f(x) through points; classes A and B separated by a line in the x1, x2 plane]
• Non-Linear Problems
  - Non-linear functions
  - Not linearly separable classifications
  [Figures: a curved f(x); classes A and B not separable by a single line in the x1, x2 plane]
18
Inductive Bias
• Every inductive modeling system has an Inductive Bias
• Consider a simple set of training examples like the following:
  [Figure: scattered training points in the x, f(x) plane]
Go to generalize.xls
19
Inductive Bias
• Can you think of any biases that you commonly use when you are learning something new?
• Is there one best inductive bias?
20
Inductive Modeling Methods
• Automated Exploration/Discovery
  - e.g. discovering new market segments
  - distance and probabilistic clustering algorithms
  [Figure: clusters A and B in the x1, x2 plane]
• Prediction/Classification
  - e.g. forecasting gross sales given current factors
  - statistics (regression, K-nearest neighbour)
  - artificial neural networks, genetic algorithms
  [Figure: fitted curve f(x)]
• Explanation/Description
  - e.g. characterizing customers by demographics
  - inductive decision trees/rules
  - rough sets, Bayesian belief nets
  [Example rule: if age > 35 and income < $35k then ...]
21
Common Statistical Methods
22
Linear Regression
Y = b0 + b1 X1 + b2 X2 + ...
• The coefficients b0, b1, ... determine a line (or hyperplane in higher dimensions) that fits the data
• Closed-form solution via least squares (computes the smallest sum of squared distances between the examples and the predicted values of Y)
• Inductive bias: the solution can be modeled by a straight line or hyperplane
23
Linear Regression
Y = b0 + b1 X1 + b2 X2 + ...
• A great way to start, since it assumes you are modeling a simple function ... Why?
24
Logistic Regression
Y = 1/(1 + e^-Z), where Z = b0 + b1 X1 + b2 X2 + ...
[Figure: S-shaped curve of Y versus Z, rising from 0 to 1]
• Output is in [0,1] and represents a probability
• The coefficients b0, b1, ... determine an S-shaped non-linear curve that best fits the data
• The coefficients are estimated using an iterative maximum-likelihood method
• Inductive bias: the solution can be modeled by this S-shaped non-linear surface
25
Logistic Regression
Y = 1/(1 + e^-Z), where Z = b0 + b1 X1 + b2 X2 + ...
[Figure: S-shaped curve of Y versus Z, rising from 0 to 1]
• Can be used for classification problems
• The output can be used as the probability of being of the class (or positive)
• Alternatively, any value above a cut-off (typically 0.5) is classified as being a positive example
... A logistic regression Javascript page
26
THE END
[email protected]
27
Learning Theory
Example Space X of instances (x, c(x))
x = input attributes
c = true class function (e.g. "likes product")
h = hypothesis (model)
[Diagram: regions for c and h over + and - instances; the shaded areas are where c and h disagree]
The true error of a hypothesis h is the probability that h will misclassify an instance drawn at random from X:
err(h) = P[c(x) ≠ h(x)]
28
Generalization
PAC – A Probabilistic Guarantee
|H| = the number of possible hypotheses in the modeling system
ε = desired true error, where 0 < ε < 1
δ = desired failure probability, giving confidence (1 - δ), where 0 < δ < 1
The number of training examples required to select (with confidence 1 - δ) a hypothesis h with err(h) < ε is given by

m ≥ (1/ε) (ln|H| + ln(1/δ))
```
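The three notions of error on the Learning Theory slides can be made concrete with a short sketch (all labels and predictions below are made up for illustration): training error is measured on the examples the model saw, test error on a held-out set, and the true error, which cannot be measured directly, is estimated from the test error.

```python
def misclassification_rate(labels, predictions):
    """Fraction of examples the model gets wrong."""
    wrong = sum(1 for y, p in zip(labels, predictions) if y != p)
    return wrong / len(labels)

# Made-up labels and model outputs for a training split and a held-out test split
train_y    = [1, 0, 1, 1, 0, 0, 1, 0]
train_pred = [1, 0, 1, 1, 0, 0, 1, 0]   # the model fits its own training set perfectly
test_y     = [1, 0, 0, 1, 1, 0, 1, 0]
test_pred  = [1, 0, 1, 1, 0, 0, 1, 0]   # 2 of 8 unseen examples are misclassified

train_err = misclassification_rate(train_y, train_pred)   # 0.0
test_err  = misclassification_rate(test_y, test_pred)     # 0.25, the estimate of true error
print(train_err, test_err)
```

A training error of zero with a noticeably higher test error is exactly the gap the slides warn about: only the held-out set says anything about generalization.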
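The closed-form least-squares fit described on the Linear Regression slides can be sketched as follows (a minimal illustration with made-up data; `numpy.linalg.lstsq` stands in for solving the normal equations by hand):

```python
import numpy as np

# Made-up training data, roughly Y = 2 + 3*X1 with a little noise
X1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = np.array([2.1, 4.9, 8.0, 11.1, 13.9])

# Design matrix with a column of ones so the intercept b0 is learned too
A = np.column_stack([np.ones_like(X1), X1])

# Least squares: minimizes the sum of squared distances between
# the examples and the predicted values of Y
coeffs, residuals, rank, _ = np.linalg.lstsq(A, Y, rcond=None)
b0, b1 = coeffs
print(round(b0, 2), round(b1, 2))  # 2.04 2.98
```

The fitted line recovers something close to the generating coefficients, which is the inductive bias at work: the method can only ever return a straight line (or hyperplane).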
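The S-shaped curve and the 0.5 cut-off from the Logistic Regression slides can be sketched in a few lines (the coefficients b0 and b1 here are made-up values for illustration, not fitted by maximum likelihood):

```python
import math

def logistic(z):
    """The S-shaped curve from the slides: maps any Z to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Made-up coefficients; in practice these are estimated by an
# iterative maximum-likelihood method, as the slides note
b0, b1 = -3.0, 1.5

def predict_proba(x1):
    return logistic(b0 + b1 * x1)

def classify(x1, cutoff=0.5):
    # Any probability above the cut-off is classified as a positive example
    return 1 if predict_proba(x1) > cutoff else 0

print(classify(1.0))  # Z = -1.5, probability about 0.18, so class 0
print(classify(3.0))  # Z =  1.5, probability about 0.82, so class 1
```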
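The PAC bound on the final slide, m ≥ (1/ε)(ln|H| + ln(1/δ)), can be evaluated directly. Plugging in the slide's own figures (75% accuracy gives ε = 0.25; 95% confidence gives δ = 0.05) together with an assumed, purely illustrative hypothesis-space size of |H| = 2^16:

```python
import math

def pac_sample_bound(H, eps, delta):
    """m >= (1/eps) * (ln|H| + ln(1/delta)), the PAC guarantee from the last slide."""
    return math.ceil((1.0 / eps) * (math.log(H) + math.log(1.0 / delta)))

# eps = 0.25 (75% accuracy), delta = 0.05 (95% confidence);
# |H| = 2**16 is an assumed hypothesis-space size, not from the slides
m = pac_sample_bound(H=2**16, eps=0.25, delta=0.05)
print(m)  # 57
```

Note how cheaply confidence is bought: δ enters only through ln(1/δ), while tightening the error ε scales the whole bound by 1/ε.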