Inductive learning
Logistics
• PS1 back
• Project writeups
• Project
• PS
© Daniel S. Weld
Weekly Course Outline
1. Overview, agents, problem spaces
2. Search
3. Knowledge Representation
4.
5. Learning
6. Planning (and CSPs)
7. Uncertainty
8. Planning under uncertainty
9. Reinforcement learning
10. Topics
11. Project Presentations & Contest
Machine Learning Overview
• Inductive Learning
  Defn, need for bias, …
• One method: Decision Tree Induction
  Hill climbing thru space of DTs
  Entropy & information gain
  Missing attributes
  Multivariate attributes
• Overfitting
• Ensembles
• Naïve Bayes Classifier
• Co-learning
Inductive Learning
• Inductive learning or “Prediction”:
  Given examples of a function (X, F(X)),
  predict the function F(X) for new examples X
• Classification: F(X) is discrete
• Regression: F(X) is continuous
• Probability estimation: F(X) = Probability(X)
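A minimal sketch of the (X, F(X)) framing, using a hypothetical 1-nearest-neighbor predictor as the hypothesis; the data and the 1-NN rule are illustrative choices, not anything from the slides.

```python
# Inductive learning as prediction: given example (X, F(X)) pairs, predict F for a new X.
def predict_1nn(examples, x_new):
    """examples: list of (x, f_x) pairs with numeric x; returns the f_x of the closest x."""
    closest_x, closest_f = min(examples, key=lambda pair: abs(pair[0] - x_new))
    return closest_f

# Classification: F(X) is discrete.
labeled = [(1.0, "no"), (2.0, "no"), (7.0, "yes"), (9.0, "yes")]
print(predict_1nn(labeled, 6.5))   # -> "yes"

# Regression: F(X) is continuous; the same rule returns a numeric value.
numeric = [(1.0, 0.9), (2.0, 1.8), (7.0, 7.2), (9.0, 8.8)]
print(predict_1nn(numeric, 6.5))   # -> 7.2
```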
Inductive Bias
• What is space of hypotheses?
Restriction bias
• Is there an ordering over hypotheses?
Preference bias
• No bias ⇒ no generalization
Widely-used ML Approaches
• Decision trees
• Rule induction
• Bayesian learning
• Neural networks
• Genetic algorithms
• Instance-based learning
• Support Vector Machines
• Etc.
More important: choice of features
Evaluating Attributes
[Figure: candidate splits of the training set S on the attributes Humid, Outlook, Wind, and Temp]
• Gain(S, Humid) = 0.151
• Gain(S, Outlook) = 0.170
• Gain(S, Wind) = -0.029 (the actual value differs from this, but ignore that detail)
• Gain(S, Temp) = 0.029
Decision Tree Algorithm
BuildTree(TrainingData)
  Split(TrainingData)

Split(D)
  If (all points in D are of the same class)
    Then Return a leaf with that class
  For each attribute A
    Evaluate splits on attribute A
  Use the best split to partition D into D1, D2
  Split(D1)
  Split(D2)
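A runnable sketch of the Split procedure above, with entropy and information gain used to choose the attribute to split on. The tiny dataset and attribute names are illustrative stand-ins, not the slides' full PlayTennis data.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    n = len(labels)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def split(rows, labels, attrs):
    """Recursive Split: return a majority-class leaf when pure (or out of attributes),
    otherwise partition on the attribute with the highest information gain."""
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    branches = {}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = split([rows[i] for i in idx], [labels[i] for i in idx],
                                [a for a in attrs if a != best])
    return {"split on": best, "branches": branches}

# Illustrative data, not the slides' dataset.
rows = [{"Humid": "h", "Wind": "weak"}, {"Humid": "h", "Wind": "s"},
        {"Humid": "n", "Wind": "weak"}, {"Humid": "n", "Wind": "s"}]
labels = ["n", "n", "yes", "yes"]
print(information_gain(rows, labels, "Humid"))   # 1.0: Humid separates the classes perfectly
print(split(rows, labels, ["Humid", "Wind"]))    # {'split on': 'Humid', 'branches': {...}}
```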
Movie Recommendation
• Features?
  (Example movies: Rambo, Matrix, Rambo 2)
Issues
• Features: Content vs. Social
• Missing Data
• Non-Boolean Attributes
• Scaling up
Missing Data 1
• Assign the attribute its most common value at the node

  Day   Temp  Humid  Wind  Tennis?
  d1    h     h      weak  n
  d2    h     h      s     n
  d8    m     h      weak  n
  d9    c     ?      weak  yes
  d11   m     n      s     yes

• Or … its most common value, given the classification
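A small sketch of the two imputation rules on this slide: fill d9's missing Humid value with the most common value at the node, or with the most common value among examples of the same class. The row/dictionary representation is just an illustrative choice.

```python
from collections import Counter

def most_common_value(rows, attr, target_class=None, class_attr="Tennis?"):
    """Most common observed value of attr, optionally restricted to rows of one class."""
    values = [r[attr] for r in rows
              if r[attr] is not None and (target_class is None or r[class_attr] == target_class)]
    return Counter(values).most_common(1)[0][0]

rows = [
    {"Day": "d1",  "Temp": "h", "Humid": "h",  "Wind": "weak", "Tennis?": "n"},
    {"Day": "d2",  "Temp": "h", "Humid": "h",  "Wind": "s",    "Tennis?": "n"},
    {"Day": "d8",  "Temp": "m", "Humid": "h",  "Wind": "weak", "Tennis?": "n"},
    {"Day": "d9",  "Temp": "c", "Humid": None, "Wind": "weak", "Tennis?": "yes"},   # Humid missing
    {"Day": "d11", "Temp": "m", "Humid": "n",  "Wind": "s",    "Tennis?": "yes"},
]

print(most_common_value(rows, "Humid"))                       # 'h': most common value at the node
print(most_common_value(rows, "Humid", target_class="yes"))   # 'n': most common value given Tennis? = yes
```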
Fractional Values

  Day   Temp  Humid  Wind  Tennis?
  d1    h     h      weak  n
  d2    h     h      s     n
  d8    m     h      weak  n
  d9    c     ?      weak  yes
  d11   m     n      s     yes

• Distribute d9 across the Humid branches as 75% h and 25% n, giving fractional counts
  [0.75+, 3-] for Humid = h and [1.25+, 0-] for Humid = n (reproduced in the sketch below)
• Use the fractional counts in gain calculations
• Further subdivide if other attributes are also missing
• Use the same approach to classify a test example with a missing attribute:
  the classification is the most probable classification, summing over the leaves
  where the example got divided
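A sketch that reproduces the fractional counts above: the example with the missing Humid value is distributed across the branches in proportion to the observed Humid values (75% h, 25% n).

```python
from collections import Counter, defaultdict

rows = [
    {"Day": "d1",  "Humid": "h",  "Tennis?": "n"},
    {"Day": "d2",  "Humid": "h",  "Tennis?": "n"},
    {"Day": "d8",  "Humid": "h",  "Tennis?": "n"},
    {"Day": "d9",  "Humid": None, "Tennis?": "yes"},   # missing Humid
    {"Day": "d11", "Humid": "n",  "Tennis?": "yes"},
]

# Observed distribution of Humid where it is present: 75% 'h', 25% 'n'.
observed = Counter(r["Humid"] for r in rows if r["Humid"] is not None)
total = sum(observed.values())
fractions = {v: c / total for v, c in observed.items()}

# Fractional (+, -) counts per branch; a missing example contributes its fraction to every branch.
counts = defaultdict(lambda: [0.0, 0.0])               # value -> [positive weight, negative weight]
for r in rows:
    weights = {r["Humid"]: 1.0} if r["Humid"] is not None else fractions
    for value, w in weights.items():
        counts[value][0 if r["Tennis?"] == "yes" else 1] += w

print(dict(counts))   # {'h': [0.75, 3.0], 'n': [1.25, 0.0]}
```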
Non-Boolean Features
• Features with multiple discrete values
Construct a multi-way split
Test for one value vs. all of the others?
Group values into two disjoint subsets?
• Real-valued Features
Discretize?
Consider a threshold split using observed values?
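A sketch of the threshold-split idea for a real-valued feature: try thresholds midway between adjacent observed values and score each candidate by information gain. The data and numbers are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try thresholds midway between adjacent observed values; return (threshold, gain)."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, -1.0)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if gain > best[1]:
            best = (t, gain)
    return best

# Illustrative: a real-valued temperature with a play / don't-play label.
temps  = [64, 69, 72, 75, 80, 85]
labels = ["yes", "yes", "yes", "no", "no", "no"]
print(best_threshold(temps, labels))   # (73.5, 1.0): a perfect split at 73.5
```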
Attributes with many values
• So many values that the attribute
  Divides the examples into tiny sets
  Each set is likely uniform ⇒ high info gain
  But a poor predictor…
• Need to penalize these attributes
One approach: Gain ratio
  GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
  SplitInformation(S, A) = −Σi |Si|/|S| log2(|Si|/|S|), the entropy of S wrt the values of A
  (Contrast with the entropy of S wrt the target value)
This penalizes attributes with many uniformly distributed values:
  e.g. if A splits S uniformly into n sets, SplitInformation = log2(n), which is 1 for a Boolean attribute
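A sketch of the penalty in action, using the standard definitions above: an ID-like attribute with a unique value per example ('Day' below, an illustrative stand-in) gets maximal gain but also a large SplitInformation, so a plain Boolean attribute that splits the data just as cleanly wins on gain ratio.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_information(rows, attr):
    """Entropy of S with respect to the values of A (not the target class)."""
    n = len(rows)
    sizes = Counter(r[attr] for r in rows).values()
    return -sum((s / n) * math.log2(s / n) for s in sizes)

def gain(rows, labels, attr):
    n = len(labels)
    rem = 0.0
    for value in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        rem += len(subset) / n * entropy(subset)
    return entropy(labels) - rem

def gain_ratio(rows, labels, attr):
    return gain(rows, labels, attr) / split_information(rows, attr)

# 'Day' has a unique value per example (high gain, heavily penalized); 'Humid' is Boolean.
rows = [{"Day": "d1", "Humid": "h"}, {"Day": "d2", "Humid": "h"},
        {"Day": "d3", "Humid": "n"}, {"Day": "d4", "Humid": "n"}]
labels = ["n", "n", "yes", "yes"]
print(gain(rows, labels, "Day"), split_information(rows, "Day"), gain_ratio(rows, labels, "Day"))
# 1.0  2.0 (= log2(4))  0.5
print(gain(rows, labels, "Humid"), split_information(rows, "Humid"), gain_ratio(rows, labels, "Humid"))
# 1.0  1.0 (= log2(2))  1.0
```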
Overfitting
[Plot: accuracy (y-axis, 0.6–0.9) vs. number of nodes in the decision tree (x-axis), with one curve for accuracy on training data and one for accuracy on test data]
Overfitting 2
[Figure from W. W. Cohen]
Overfitting…
• A decision tree DT is overfit when there exists another tree DT′ such that
  DT has smaller error than DT′ on the training examples, but
  DT has bigger error than DT′ on the test examples
• Causes of overfitting
  Noisy data, or
  A training set that is too small
Effect of Reduced-Error Pruning
Converting a Tree to Rules
Ensembles of Classifiers
• Assume
  Errors are independent
  Majority vote decides
• Probability that the majority is wrong
  = area under the binomial distribution
[Plot: probability (y-axis, ticks at 0.1 and 0.2) vs. number of classifiers in error (x-axis)]
• If the individual error rate is 0.3
• The area under the curve for 11 or more wrong is 0.026
• An order of magnitude improvement!
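A quick check of that binomial-tail number, assuming an ensemble of 21 classifiers (the 21 is an assumption; the slide gives only the 0.3 individual error rate and the "11 wrong" majority threshold):

```python
from math import comb

def prob_majority_wrong(n_classifiers, p_error):
    """Upper tail of Binomial(n, p): probability that a majority of independent classifiers err."""
    k_majority = n_classifiers // 2 + 1
    return sum(comb(n_classifiers, k) * p_error**k * (1 - p_error)**(n_classifiers - k)
               for k in range(k_majority, n_classifiers + 1))

print(round(prob_majority_wrong(21, 0.3), 3))   # 0.026 -- about an order of magnitude below 0.3
```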
Constructing Ensembles
Bagging
• Generate k sets of training examples
• For each set
Draw m examples randomly (with replacement)
From the original set of m examples
• Each training set corresponds to
63.2% of original
(+ duplicates)
• Now train classifier on each set
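A sketch of the bagging sampling step: each of the k training sets is a bootstrap sample of m examples drawn with replacement, so on average it contains about 63.2% (≈ 1 − 1/e) of the distinct originals. The base learner is left abstract; train_decision_tree below is a hypothetical placeholder.

```python
import random

def bootstrap_sets(examples, k, rng=random.Random(0)):
    """Generate k bagged training sets, each of size m = len(examples), drawn with replacement."""
    m = len(examples)
    return [[rng.choice(examples) for _ in range(m)] for _ in range(k)]

examples = list(range(1000))            # stand-ins for labeled training examples
bags = bootstrap_sets(examples, k=10)
avg_distinct = sum(len(set(bag)) for bag in bags) / len(bags)
print(avg_distinct / len(examples))     # roughly 0.632 of the original examples appear in each bag

# Each bag would then be handed to the same base learner, e.g.:
# classifiers = [train_decision_tree(bag) for bag in bags]    # train_decision_tree: hypothetical
```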
Ensemble Construction II
Cross-validated committees
• Partition examples into k disjoint equiv classes
• Now create k training sets
Each set is union of all equiv classes except one
So each set has (k-1)/k of the original training data
• Now train a classifier on each set
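A sketch of constructing the k training sets for cross-validated committees: partition the examples into k disjoint folds and give each committee member the union of all folds but one.

```python
def cv_committee_sets(examples, k):
    """Partition examples into k disjoint folds; training set i is everything except fold i."""
    folds = [examples[i::k] for i in range(k)]        # k disjoint equivalence classes
    return [[x for j, fold in enumerate(folds) if j != i for x in fold] for i in range(k)]

examples = list(range(12))
for train_set in cv_committee_sets(examples, k=4):
    print(len(train_set), train_set)                  # each set holds (k-1)/k = 9 of the 12 examples
```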
Ensemble Creation III
Boosting
• Maintain prob distribution over set of training ex
• Create k sets of training data iteratively:
• On iteration i
Draw m examples randomly (like bagging)
But use probability distribution to bias selection
Train classifier number i on this training set
Test partial ensemble (of i classifiers) on all training exs
Modify distribution: increase P of each error ex
• Create harder and harder learning problems...
• “Bagging with optimized choice of examples”
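A sketch of the boosting loop described above. The specific update (doubling the probability of each example the partial ensemble misclassifies, then renormalizing) is an illustrative choice, not the slides' update rule; a real booster such as AdaBoost derives the reweighting factor from the ensemble's error.

```python
import random

def boost(examples, labels, train, predict, k, rng=random.Random(0)):
    """Generic boosting loop: bias the sampling toward examples the partial ensemble gets wrong."""
    m = len(examples)
    p = [1.0 / m] * m                                  # probability distribution over training examples
    ensemble = []
    for _ in range(k):
        idx = rng.choices(range(m), weights=p, k=m)    # draw m examples, biased by p (like bagging)
        ensemble.append(train([examples[i] for i in idx], [labels[i] for i in idx]))
        for i in range(m):                             # test the partial ensemble on all training examples
            votes = [predict(h, examples[i]) for h in ensemble]
            if max(set(votes), key=votes.count) != labels[i]:
                p[i] *= 2.0                            # increase P of each misclassified example
        total = sum(p)
        p = [w / total for w in p]                     # renormalize
    return ensemble

# Toy illustration: a 'learner' that just memorizes the majority label of its sample.
train = lambda xs, ys: max(set(ys), key=ys.count)
predict = lambda h, x: h
print(boost(list(range(10)), ["a"] * 7 + ["b"] * 3, train, predict, k=5))
# later hypotheses tend toward 'b' as the misclassified 'b' examples gain weight
```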
Bayesian Methods
• Learning based on probability theory.
Bayes theorem plays a critical role in
probabilistic learning and classification.
• Uses prior probability of each category
given no information about an item.
• Categorization produces a posterior prob
distribution over possible categories
given a description of an item.
Axioms of Probability Theory
• All probabilities are between 0 and 1:
  0 ≤ P(A) ≤ 1
• A true proposition has probability 1, a false proposition has probability 0:
  P(true) = 1, P(false) = 0
• The probability of a disjunction is:
  P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
[Venn diagram: A, B, and their overlap A ∧ B]
Conditional Probability
• P(A | B) is the probability of A given B
  Assumes that B is all and only the information known
• Defined by:
  P(A | B) = P(A ∧ B) / P(B)
[Venn diagram: A, B, and their overlap A ∧ B]
Independence
• A and B are independent iff:
  P(A | B) = P(A)
  P(B | A) = P(B)
  (These two constraints are logically equivalent)
• Therefore, if A and B are independent:
  P(A | B) = P(A ∧ B) / P(B) = P(A)
  P(A ∧ B) = P(A) P(B)
Bayes Theorem
Simple proof:
1. P(H | E) = P(H ∧ E) / P(E)      (def. of conditional probability)
2. P(E | H) = P(H ∧ E) / P(H)      (def. of conditional probability)
3. P(H ∧ E) = P(E | H) P(H)        (multiply both sides of 2 by P(H))
QED:  P(H | E) = P(E | H) P(H) / P(E)      (substitute 3 into 1)
Bayesian Categorization
Let the set of categories be {c1, c2, … cn}
Let E be a description of an instance.
Choose the category of E by determining, for each ci:
  P(ci | E) = P(ci) P(E | ci) / P(E)
P(E) can be determined since the categories are complete and disjoint:
  Σi=1..n P(ci | E) = Σi=1..n P(ci) P(E | ci) / P(E) = 1
  P(E) = Σi=1..n P(ci) P(E | ci)
Bayesian Categorization (cont.)
Need to know:
  Priors: P(ci)
  Conditionals: P(E | ci)
P(ci) are easily estimated from data:
  If ni of the examples in D are in ci, then P(ci) = ni / |D|
Assume an instance is a conjunction of binary features:
  E = e1 ∧ e2 ∧ … ∧ em
Problem: can’t calculate all the P(E | ci)
  Too many possible instances (exponential in m)
Naïve Bayesian Categorization
Assume the features are conditionally independent given the category (ci),
i.e. the probability of one feature is independent of the others given ci:
  P(E | ci) = P(e1 ∧ e2 ∧ … ∧ em | ci) = ∏j=1..m P(ej | ci)
Now we only need to know P(ej | ci) for each feature and category.
Naïve Bayes Example
• C = {allergy, cold, well}
• e1 = sneeze; e2 = cough; e3 = fever
• E = {sneeze, cough, ¬fever}   (the patient sneezes and coughs but has no fever)

  Prob            Well   Cold   Allergy
  P(ci)           0.9    0.05   0.05
  P(sneeze | ci)  0.1    0.9    0.9
  P(cough | ci)   0.1    0.8    0.7
  P(fever | ci)   0.01   0.7    0.4
Naïve Bayes Example (cont.)

  Probability       Well   Cold   Allergy
  P(ci)             0.9    0.05   0.05
  P(sneeze | ci)    0.1    0.9    0.9
  P(cough | ci)     0.1    0.8    0.7
  P(fever | ci)     0.01   0.7    0.4

E = {sneeze, cough, ¬fever}
P(well | E)    = (0.9)(0.1)(0.1)(0.99) / P(E) = 0.0089 / P(E)
P(cold | E)    = (0.05)(0.9)(0.8)(0.3) / P(E) = 0.0108 / P(E)
P(allergy | E) = (0.05)(0.9)(0.7)(0.6) / P(E) = 0.0189 / P(E)
Most probable category: allergy
P(E) = 0.0089 + 0.0108 + 0.0189 = 0.0386
P(well | E) ≈ 0.23
P(cold | E) ≈ 0.28
P(allergy | E) ≈ 0.49
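A quick check of the arithmetic above (the fever term uses P(¬fever | ci), since the evidence is sneeze, cough, and no fever):

```python
priors = {"well": 0.9, "cold": 0.05, "allergy": 0.05}
cond = {   # P(feature present | class)
    "well":    {"sneeze": 0.1, "cough": 0.1, "fever": 0.01},
    "cold":    {"sneeze": 0.9, "cough": 0.8, "fever": 0.7},
    "allergy": {"sneeze": 0.9, "cough": 0.7, "fever": 0.4},
}
evidence = {"sneeze": True, "cough": True, "fever": False}

unnorm = {}
for c in priors:
    p = priors[c]
    for feature, present in evidence.items():
        p *= cond[c][feature] if present else 1 - cond[c][feature]
    unnorm[c] = p                          # P(ci) * prod_j P(ej | ci), before dividing by P(E)

p_e = sum(unnorm.values())                 # P(E) ≈ 0.0386
for c, p in unnorm.items():
    print(c, round(p, 4), round(p / p_e, 2))
# well    0.0089  0.23
# cold    0.0108  0.28
# allergy 0.0189  0.49
```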
Estimating Probabilities
• Normally, probabilities are estimated from observed frequencies in the training data.
• If D contains ni examples in category ci, and nij of these ni examples contain feature ej, then:
  P(ej | ci) = nij / ni
• However, estimating such probabilities from small training sets is error-prone.
• If, due only to chance, a rare feature ek is always false in the training data, then ∀ci: P(ek | ci) = 0.
• If ek then occurs in a test example E, the result is that ∀ci: P(E | ci) = 0 and ∀ci: P(ci | E) = 0.
Smoothing
• To account for small samples, probability estimates are adjusted or smoothed.
• Laplace smoothing using an m-estimate assumes that each feature has a prior probability p
  that is assumed to have been previously observed in a “virtual” sample of size m:
  P(ej | ci) = (nij + m·p) / (ni + m)
• For binary features, p is assumed to be 0.5.
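A small numeric illustration, assuming a binary feature (so p = 0.5) that happens never to appear among ni = 10 training examples of class ci:

```python
def m_estimate(n_ij, n_i, m, p):
    """Smoothed P(e_j | c_i) = (n_ij + m*p) / (n_i + m)."""
    return (n_ij + m * p) / (n_i + m)

print(m_estimate(n_ij=0, n_i=10, m=0, p=0.5))   # 0.0     -- raw frequency: zeroes out the whole product
print(m_estimate(n_ij=0, n_i=10, m=2, p=0.5))   # ≈ 0.083 -- a virtual sample of size 2 keeps it nonzero
```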
Naïve Bayes for Text
• Modeled as generating a bag of words
for a document in a given category
by repeatedly sampling with replacement
from a vocabulary V = {w1, w2,…wm} based on
the probabilities P(wj | ci).
• Smooth probability estimates with
Laplace m-estimates assuming a uniform
distribution over all words (p = 1/|V |)
and m = |V |
Equivalent to a virtual sample of seeing each word
in each category exactly once.
Text Naïve Bayes Algorithm (Train)
Let V be the vocabulary of all words in the documents in D
For each category ci ∈ C
  Let Di be the subset of documents in D in category ci
  P(ci) = |Di| / |D|
  Let Ti be the concatenation of all the documents in Di
  Let ni be the total number of word occurrences in Ti
  For each word wj ∈ V
    Let nij be the number of occurrences of wj in Ti
    Let P(wj | ci) = (nij + 1) / (ni + |V|)
Text Naïve Bayes Algorithm (Test)
Given a test document X
Let n be the number of word occurrences in X
Return the category:
  argmax over ci ∈ C of  P(ci) ∏i=1..n P(ai | ci)
where ai is the word occurring in the ith position of X
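A compact, runnable sketch of the train and test procedures above, using the Laplace estimate P(wj | ci) = (nij + 1) / (ni + |V|) and summing log probabilities instead of multiplying (anticipating the underflow point made on a later slide). The toy documents and category names are illustrative.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (word_list, category). Returns priors, per-class word counts/totals, vocabulary."""
    vocab = {w for words, _ in docs for w in words}
    priors, counts, totals = {}, defaultdict(Counter), Counter()
    n_docs = Counter(c for _, c in docs)
    for words, c in docs:
        counts[c].update(words)
        totals[c] += len(words)
    for c in n_docs:
        priors[c] = n_docs[c] / len(docs)
    return priors, counts, totals, vocab

def classify_nb(words, priors, counts, totals, vocab):
    """argmax over ci of  log P(ci) + sum_i log P(ai | ci), with Laplace smoothing."""
    def score(c):
        s = math.log(priors[c])
        for w in words:
            if w in vocab:                               # ignore words never seen in training
                s += math.log((counts[c][w] + 1) / (totals[c] + len(vocab)))
        return s
    return max(priors, key=score)

docs = [("the match was a great game to watch".split(), "sports"),
        ("the election results were announced today".split(), "politics"),
        ("our team won the final game".split(), "sports")]
model = train_nb(docs)
print(classify_nb("a great game".split(), *model))       # 'sports'
```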
Naïve Bayes Time Complexity
• Training time: O(|D| Ld + |C||V|), where Ld is the average length of a document in D.
  Assumes V and all Di, ni, and nij are pre-computed in O(|D| Ld) time during one pass through all of the data.
  Generally just O(|D| Ld), since usually |C||V| < |D| Ld.
• Test time: O(|C| Lt), where Lt is the average length of a test document.
• Very efficient overall: linearly proportional to the time needed just to read in all the data.
• Similar to Rocchio time complexity.
Underflow Prevention
• Multiplying lots of probabilities,
which are between 0 and 1 by definition,
can result in floating-point underflow.
• Since log(xy) = log(x) + log(y),
it is better to sum logs of probabilities
rather than multiplying probabilities.
• Class with highest final un-normalized log
probability score is still the most probable.
Naïve Bayes Posterior Probabilities
• Classification results of naïve Bayes (the
class with maximum posterior probability)
are usually fairly accurate.
• However, due to the inadequacy of the
conditional independence assumption, the
actual posterior-probability numerical
estimates are not.
Output probabilities are generally very close to 0
or 1.
Co-Training Motivation
• Learning methods need labeled data
Lots of <x, f(x)> pairs
Hard to get… (who wants to label data?)
• But unlabeled data is usually plentiful…
Could we use this instead?
Co-training
• Suppose each instance has two parts: x = [x1, x2]
  with x1, x2 conditionally independent given f(x)
• Suppose each half can be used to classify the instance:
  ∃ f1, f2 such that f1(x1) = f2(x2) = f(x)
• Suppose f1, f2 are learnable:
  f1 ∈ H1, f2 ∈ H2, with learning algorithms A1, A2
[Diagram: unlabeled instances [x1, x2] are labeled by A1 as <[x1, x2], f1(x1)>; these labeled instances are fed to A2, which outputs a hypothesis ~f2]
Observations
• Effectively, no limit to training data!
Just apply A1 to unlabeled data
• If x1 is cond. independent of x2 given f(x)
Then the error in the labels produced by A1 will
look like random noise to A2
• Thus no limit to quality of A2’s output
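A sketch of one co-training step matching the diagram above: A1, trained on view x1 of the labeled data, labels the unlabeled instances, and those machine-labeled pairs are added to the training data for A2 on view x2. The toy bag-of-words "learner" and the example data are illustrative placeholders (echoing the web-page example on the next slide), not the naïve Bayes classifiers used in the actual experiment.

```python
from collections import Counter, defaultdict

def train(pairs):
    """Toy learner: count token/label co-occurrences in a bag-of-words view."""
    model = defaultdict(Counter)
    for view, y in pairs:
        for tok in view:
            model[tok][y] += 1
    return model

def predict(model, view):
    """Toy prediction: majority label among the view's known tokens."""
    votes = Counter()
    for tok in view:
        votes.update(model[tok])
    return votes.most_common(1)[0][0] if votes else None

def co_train_step(labeled, unlabeled):
    """One pass of the diagram: A1 (view x1) labels unlabeled data, which then trains A2 (view x2)."""
    h1 = train([(x1, y) for (x1, _), y in labeled])                        # A1 learns f1 from view x1
    machine_labeled = [((x1, x2), predict(h1, x1)) for x1, x2 in unlabeled]
    h2 = train([(x2, y) for (_, x2), y in labeled + machine_labeled])      # A2 learns f2 from view x2
    return h1, h2

# View x1 = words on the page; view x2 = words in anchors of links pointing at the page.
labeled = [((["syllabus", "homework"], ["cse", "autumn"]), "class"),
           ((["resume", "publications"], ["homepage"]), "other")]
unlabeled = [(["syllabus", "exam"], ["fall", "course"])]

h1, h2 = co_train_step(labeled, unlabeled)
print(predict(h2, ["course"]))   # 'class' -- a word A2 only saw via the instance that A1 labeled
```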
It really works!
• Learning to classify web pages as class pages
x1 = bag of words on a page
x2 = bag of words from all anchors pointing to a page
• Naïve Bayes classifiers
12 labeled pages
1039 unlabeled