#### Transcript: Introduction to KDD for Tony's MI Course

Slide 1: COMP3503 Intro to Inductive Modeling, with Daniel L. Silver

Slide 2: Agenda
- Deductive and Inductive Modeling
- Learning Theory and Generalization
- Common Statistical Methods

Slide 3: The KDD Process
- Data Consolidation: Data Sources → Data Warehouse (Consolidated Data)
- Selection and Preprocessing → Prepared Data
- Data Mining → Patterns & Models (e.g. p(x) = 0.02)
- Interpretation and Evaluation → Knowledge

Slide 4: Deductive and Inductive Modeling (section title)

Slide 5: Induction versus Deduction
- Deduction: top-down verification, from a model or general rule down to examples A, B, C
- Induction: bottom-up construction, from examples up to a model or general rule

Slide 6: Deductive Modeling
- Top-down (toward the data) verification of a hypothesis
- The hypothesis is generated within the mind of the data miner
- Exploratory tools such as OLAP and data visualization software are used
- Models tend to be used for description

Slide 7: Inductive Modeling
- Bottom-up (from the data) development of a hypothesis
- The hypothesis is generated by the technology directly from the data
- Statistical and machine learning tools such as regression, decision trees, and artificial neural networks are used
- Models can be used for prediction

Slide 8: Inductive Modeling
- Objective: develop a general model or hypothesis from specific examples
- Function approximation (curve fitting): fit f(x) over inputs x
- Classification (concept learning, pattern recognition): separate classes A and B in attribute space (x1, x2)

Slide 9: Learning Theory and Generalization (section title)

Slide 10: Inductive Modeling = Learning
- Basic framework for inductive learning: the environment supplies training examples (x, f(x)) to an inductive learning system, which produces an induced model or hypothesis h(x) ≈ f(x); testing examples are used to check the output classifications (x, h(x))
- A problem of representation and search for the best hypothesis, h(x).
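The framework on slide 10 (training examples in, induced hypothesis h(x) ≈ f(x) out, checked on testing examples) can be sketched in a few lines. This is a minimal illustration, assuming a made-up target function f and a polynomial hypothesis space; the slides themselves leave both unspecified.

```python
import numpy as np

# Hypothetical target function f(x) -- the "environment" supplying
# examples (an assumption for illustration; the slides leave f abstract).
def f(x):
    return 0.5 * x**2 - x + 2.0

rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, 20)      # training examples (x, f(x))
y_train = f(x_train)

# Inductive learning system: search a hypothesis space (here, degree-2
# polynomials) for the h(x) that best fits the training examples.
coeffs = np.polyfit(x_train, y_train, deg=2)
h = np.poly1d(coeffs)

# Testing examples: h(x) should approximate f(x) on previously unseen inputs.
x_test = np.array([-2.5, 0.0, 1.7])
print(np.max(np.abs(h(x_test) - f(x_test))))  # near 0: h generalizes here
```

Because the hypothesis space here happens to contain the true function, the induced h matches f almost exactly; the later slides on inductive bias cover what happens when it does not.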
Slide 11: Inductive Modeling = Data Mining
Ideally, a hypothesis (model) is:
- Complete – covers all potential examples
- Consistent – contains no conflicts
- Accurate – able to generalize to previously unseen examples
- Valid – presents a truth
- Transparent – human-readable knowledge

Slide 12: Inductive Modeling – Generalization
- The objective of learning is to achieve good generalization to new cases; otherwise, just use a look-up table.
- Generalization can be defined as a mathematical interpolation or regression over a set of training points.

Slide 13: Inductive Modeling – Generalization
- Generalization accuracy can be guaranteed for a specified confidence level, given a sufficient number of examples
- Models can be validated for accuracy by using a previously unseen test set of examples

Slide 14: Learning Theory
- Probably Approximately Correct (PAC) theory of learning (Leslie Valiant, 1984)
- Poses questions such as:
  - How many examples are needed for good generalization?
  - How long will it take to create a good model?
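Slide 13's point about validating on a previously unseen test set can be made concrete with a hold-out split. A minimal sketch, assuming synthetic noisy linear data (the slides name no dataset): fit only on the training portion, then use the reserved test portion to estimate the true error.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 1.0 + rng.normal(0, 0.5, 100)   # assumed noisy linear data

# Previously unseen test set: reserve 30 of the 100 examples; the model
# never sees them during fitting.
x_train, y_train = x[:70], y[:70]
x_test,  y_test  = x[70:], y[70:]

b1, b0 = np.polyfit(x_train, y_train, deg=1)  # fit on training data only
mse = lambda xs, ys: np.mean((b0 + b1 * xs - ys) ** 2)

print(f"training error: {mse(x_train, y_train):.3f}")
print(f"test error:     {mse(x_test, y_test):.3f}")  # estimate of true error
```

The test error is the quantity slide 16 says must stand in for the true error, since the entire population of possible examples is never available.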
Answers depend on:
- Complexity of the actual function
- The desired level of accuracy of the model (e.g. 75%)
- The desired confidence in finding a model with this accuracy (e.g. 19 times out of 20 = 95%)

Slide 15: Learning Theory
- Diagram: the space of all possible examples, with target concept c and hypothesis h marked over positive (+) and negative (−) regions; the regions where c and h disagree are the errors
- The true error of a hypothesis h is the probability that h will misclassify an instance drawn at random from X: error(h) = P[c(x) ≠ h(x)]

Slide 16: Learning Theory – Three notions of error:
- Training Error: how often the training set is misclassified
- Test Error: how often an independent test set is misclassified
- True Error: how often the entire population of possible examples would be misclassified; must be estimated from the Test Error

Slide 17: Linear and Non-Linear Problems
- Linear problems: linear functions; linearly separable classifications
- Non-linear problems: non-linear functions; classifications that are not linearly separable

Slide 18: Inductive Bias
- Every inductive modeling system has an Inductive Bias
- Consider a simple set of training examples like the following (plot of f(x) versus x)
- Go to generalize.xls

Slide 19: Inductive Bias
- Can you think of any biases that you commonly use when you are learning something new?
- Is there one best inductive bias?

Slide 20: Inductive Modeling Methods
- Automated Exploration/Discovery: e.g. discovering new market segments; distance and probabilistic clustering algorithms
- Prediction/Classification: e.g. forecasting gross sales given current factors; statistics (regression, K-nearest neighbour); artificial neural networks, genetic algorithms
- Explanation/Description: e.g. characterizing customers by demographics; inductive decision trees/rules; rough sets, Bayesian belief nets ("if age > 35 and income < $35k then ...")

Slide 21: Common Statistical Methods (section title)

Slide 22: Linear Regression
- Y = b0 + b1 X1 + b2 X2 + ...
- The coefficients b0, b1, ... determine a line (or hyperplane for higher dimensions)
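The closed-form least-squares solution for the linear regression slide can be written out directly via the normal equations. A sketch on synthetic data (chosen only so the recovered coefficients are easy to check); b = (AᵀA)⁻¹AᵀY is the vector minimizing the sum of squared distances between the examples and the predicted values of Y.

```python
import numpy as np

# Synthetic data lying exactly on Y = 2.0 + 0.5*X1 - 1.5*X2 (assumed
# values, so the closed-form recovery is exact and easy to verify).
rng = np.random.default_rng(2)
X1, X2 = rng.normal(size=50), rng.normal(size=50)
Y = 2.0 + 0.5 * X1 - 1.5 * X2

A = np.column_stack([np.ones(50), X1, X2])  # design matrix with intercept

# Normal equations: solve (A^T A) b = A^T Y for the least-squares b.
b = np.linalg.solve(A.T @ A, A.T @ Y)

print(b)  # recovers [b0, b1, b2] = [2.0, 0.5, -1.5]
```

In practice `np.linalg.lstsq` is the numerically safer route, but the normal-equations form makes the "closed form via least squares" claim on the slide explicit.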
- that fit the data
- Closed-form solution via least squares (computes the smallest sum of squared distances between the examples and the predicted values of Y)
- Inductive bias: the solution can be modeled by a straight line or hyperplane

Slide 23: Linear Regression
- Y = b0 + b1 X1 + b2 X2 + ...
- A great way to start, since it assumes you are modeling a simple function ... Why?

Slide 24: Logistic Regression
- Y = 1/(1 + e^(−Z)), where Z = b0 + b1 X1 + b2 X2 + ...
- Output is in [0, 1] and represents a probability
- The coefficients b0, b1, ... determine an S-shaped non-linear curve that best fits the data
- The coefficients are estimated using an iterative maximum-likelihood method
- Inductive bias: the solution can be modeled by this S-shaped non-linear surface

Slide 25: Logistic Regression
- Y = 1/(1 + e^(−Z)), where Z = b0 + b1 X1 + b2 X2 + ...
- Can be used for classification problems
- The output can be used as the probability of being of the class (i.e. positive)
- Alternatively, any value above a cut-off (typically 0.5) is classified as a positive example
- Demo: a logistic regression JavaScript page

Slide 26: THE END – [email protected]

Slide 27: Learning Theory
- Example space X of (x, c(x)): x = input attributes; c = true class function (e.g. "likes product"); h = hypothesis (model)
- Diagram: positive (+) and negative (−) regions of c and h, highlighting where c and h disagree
- The true error of a hypothesis h is the probability that h will misclassify an instance drawn at random from X: err(h) = P[c(x) ≠ h(x)]

Slide 28: Generalization – PAC: A Probabilistic Guarantee
- |H| = number of possible hypotheses in the modeling system
- ε = desired true error, where 0 < ε < 1
- δ = failure probability, giving desired confidence 1 − δ, where 0 < δ < 1
- The number of training examples required to select (with confidence 1 − δ) a hypothesis h with err(h) < ε is given by m ≥ (1/ε)(ln |H| + ln (1/δ))
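The logistic regression slides (24–25) can be sketched end to end: the sigmoid squashes Z into (0, 1), the coefficients are fit by an iterative maximum-likelihood method (plain gradient ascent here, one of several such methods), and a 0.5 cut-off turns probabilities into class labels. The tiny 1-D dataset is assumed purely for illustration.

```python
import numpy as np

def sigmoid(z):
    # Y = 1 / (1 + e^(-Z)): maps any Z into (0, 1), read as P(positive | x)
    return 1.0 / (1.0 + np.exp(-z))

# Tiny assumed 1-D dataset: class 1 tends to occur at larger x,
# with some overlap so the fit is not degenerate.
x = np.array([0.5, 1.0, 1.5, 3.0, 3.5, 4.0])
y = np.array([0,   0,   1,   0,   1,   1  ])
A = np.column_stack([np.ones_like(x), x])  # rows [1, x], so Z = b0 + b1*x

# Iterative maximum likelihood: gradient ascent on the log-likelihood,
# whose gradient for logistic regression is A^T (y - sigmoid(A b)).
b = np.zeros(2)
for _ in range(5000):
    b += 0.05 * A.T @ (y - sigmoid(A @ b))

p = sigmoid(A @ b)               # probabilities in [0, 1]
labels = (p > 0.5).astype(int)   # 0.5 cut-off for classification
print(p, labels)
```

Because b1 comes out positive, the fitted probabilities rise monotonically with x, giving the S-shaped curve from the slide; the two overlapping points near the middle are the ones a linear decision cut can get wrong.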
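Slide 28's bound m ≥ (1/ε)(ln|H| + ln(1/δ)) is easy to evaluate. A small calculator, plugging in the running numbers from slide 15 (75% accuracy read as ε = 0.25, confidence 19 times out of 20 so δ = 0.05) with an assumed hypothesis-space size of 2²⁰, since the slides give no |H|:

```python
import math

def pac_sample_size(h_size, eps, delta):
    """Training examples needed so that, with confidence 1 - delta, a
    consistent hypothesis from a space of h_size hypotheses has true
    error < eps: m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((1.0 / eps) * (math.log(h_size) + math.log(1.0 / delta)))

# eps = 0.25 (75% accuracy), delta = 0.05 (95% confidence),
# |H| = 2**20 is an assumed hypothesis-space size for illustration.
print(pac_sample_size(2**20, eps=0.25, delta=0.05))  # → 68
```

Note how weakly the requirement grows: |H| enters only through ln|H|, so even a million-hypothesis space needs just 68 examples at this accuracy and confidence, while tightening ε scales m linearly.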