GAM, CART and MARS - Department of Computing Science

Additive Models and Trees
Lecture Notes for CMPUT 466/551
Nilanjan Ray
Principal Source: Department of Statistics, CMU
Topics to cover
• GAM: Generalized Additive Models
• CART: Classification and Regression Trees
• MARS: Multivariate Adaptive Regression Splines
Generalized Additive Models
What is GAM?
$E(Y \mid X_1, X_2, \ldots, X_p) = \alpha + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p)$
The functions $f_j$ are in general smooth functions fit by smoothers such as splines, kernel methods, linear functions, and so on.
Each function can be different, e.g. $f_1$ can be linear, $f_2$ a natural spline, etc.
Compare GAM with Linear Basis Expansions (Ch. 5 of [HTF])
Similarities? Dissimilarities?
Any similarity (in principle) with Naïve Bayes model?
Smoothing Functions in GAM
• Non-parametric functions (linear smoothers)
– Smoothing splines (basis expansion)
– Simple k-nearest neighbors (raw moving average)
– Locally weighted averages using kernel weighting
– Local linear regression, local polynomial regression
• Linear functions
• Functions of more than one variable (interaction terms)
Learning GAM: Backfitting
Backfitting algorithm
1. Initialize: $\hat{\alpha} = \frac{1}{N}\sum_{i=1}^{N} y_i$, $\hat{f}_j \equiv 0$ for all $j$
2. Cycle over $j = 1, 2, \ldots, p, \ldots, 1, 2, \ldots, p, \ldots$ ($m$ cycles):
Fit $\hat{f}_j$ to the data $\{x_{ij}\}_{i=1}^{N}$, $\{y_i - \hat{\alpha} - \sum_{k \neq j} \hat{f}_k(x_{ik})\}_{i=1}^{N}$
$\hat{f}_j \leftarrow \hat{f}_j - \frac{1}{N}\sum_{i=1}^{N} \hat{f}_j(x_{ij})$
Until the functions $\hat{f}_j$ change less than a prespecified threshold
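As an illustration (not from the notes), here is a minimal NumPy sketch of the backfitting loop, using a simple k-nearest-neighbor moving average as the smoother for every $f_j$; the function and variable names are assumptions made for this sketch.

```python
import numpy as np

def knn_smooth(x, r, k=15):
    """Simple k-NN moving-average smoother: for each x[i], average the
    residuals r over the k nearest points in x (one possible choice of f_j)."""
    idx = np.argsort(np.abs(x[:, None] - x[None, :]), axis=1)[:, :k]
    return r[idx].mean(axis=1)

def backfit(X, y, n_cycles=20, tol=1e-4):
    """Backfitting for an additive model E[y|X] = alpha + sum_j f_j(X_j).
    Returns alpha and the fitted function values f[i, j] = f_j(x_ij)."""
    N, p = X.shape
    alpha = y.mean()                      # step 1: alpha = mean of y
    f = np.zeros((N, p))                  # step 1: f_j = 0 for all j
    for _ in range(n_cycles):             # step 2: cycle over the predictors
        f_old = f.copy()
        for j in range(p):
            # partial residuals: y minus alpha and all the other components
            r = y - alpha - (f.sum(axis=1) - f[:, j])
            f[:, j] = knn_smooth(X[:, j], r)
            f[:, j] -= f[:, j].mean()     # center f_j (the intercept absorbs the means)
        if np.max(np.abs(f - f_old)) < tol:   # stop when the f_j barely change
            break
    return alpha, f

# Usage: an additive target y = 2 + sin(X_0) + X_1^2 + noise
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = 2 + np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(300)
alpha, f = backfit(X, y)
print(alpha)  # equals mean(y), since each fitted f_j is centered
```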
Backfitting: Points to Ponder
$Y = \alpha + \sum_{j=1}^{p} f_j(X_j) + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)$
$E\!\left(Y - \alpha - \sum_{k \neq j} f_k(X_k) \,\middle|\, X_j\right) = f_j(X_j)$
Computational Advantage?
Convergence?
How to choose fitting functions?
Example: Generalized Logistic Regression
Model:
$\log\dfrac{\Pr(Y = 1 \mid X_1, \ldots, X_p)}{\Pr(Y = 0 \mid X_1, \ldots, X_p)} = \alpha + f_1(X_1) + \cdots + f_p(X_p)$
Additive Logistic Regression: Backfitting
Fitting logistic regression (p. 99 of [HTF]):
1. Initialize $\hat{\beta}_j = 0$ for all $j$.
2. Set $\eta_i = \sum_j \hat{\beta}_j x_{ij}$ and $\hat{p}_i = \frac{1}{1 + e^{-\eta_i}}$. Iterate:
a. $w_i = \hat{p}_i (1 - \hat{p}_i)$
b. $z_i = \eta_i + w_i^{-1}(y_i - \hat{p}_i)$
c. Use weighted least squares to fit a linear model to $z_i$ with weights $w_i$, giving new estimates $\hat{\beta}_j$ for all $j$.
3. Continue step 2 until the $\hat{\beta}_j$ converge.

Fitting additive logistic regression (p. 262 of [HTF]):
1. Initialize $\hat{\alpha} = \log\frac{\bar{y}}{1 - \bar{y}}$, where $\bar{y} = \mathrm{avg}(y_i)$, and $\hat{f}_j \equiv 0$ for all $j$.
2. Set $\eta_i = \hat{\alpha} + \sum_j \hat{f}_j(x_{ij})$ and $\hat{p}_i = \frac{1}{1 + e^{-\eta_i}}$. Iterate:
a. $w_i = \hat{p}_i (1 - \hat{p}_i)$
b. $z_i = \eta_i + w_i^{-1}(y_i - \hat{p}_i)$
c. Use the weighted backfitting algorithm to fit an additive model to $z_i$ with weights $w_i$, giving new estimates $\hat{f}_j$ for all $j$.
3. Continue step 2 until the $\hat{f}_j$ converge.
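For concreteness, here is a minimal NumPy sketch of the left-hand recipe (iteratively reweighted least squares for linear logistic regression); the additive version would replace the weighted least-squares step with a weighted backfitting call. The names and the synthetic data are illustrative, not from the notes.

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """IRLS for linear logistic regression: repeatedly fit a weighted
    least-squares linear model to the working response z."""
    N, p = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])   # prepend an intercept column
    beta = np.zeros(p + 1)                 # step 1: beta_j = 0
    for _ in range(n_iter):
        eta = Xb @ beta                    # step 2: linear predictor eta_i
        prob = 1.0 / (1.0 + np.exp(-eta))
        w = prob * (1.0 - prob)            # a. weights w_i = p_i (1 - p_i)
        z = eta + (y - prob) / w           # b. working response z_i
        # c. weighted least squares: solve (X'WX) beta = X'Wz
        WX = Xb * w[:, None]
        beta_new = np.linalg.solve(Xb.T @ WX, WX.T @ z)
        if np.max(np.abs(beta_new - beta)) < tol:   # step 3: convergence check
            beta = beta_new
            break
        beta = beta_new
    return beta

# Usage on synthetic data with true coefficients (0.5, 2, -1)
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 2))
y = (rng.random(500) < 1 / (1 + np.exp(-(0.5 + 2 * X[:, 0] - X[:, 1])))).astype(float)
print(irls_logistic(X, y))  # approximately [0.5, 2, -1]
```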
SPAM Detection via Additive Logistic Regression
• Input variables (predictors):
– 48 quantitative variables: the percentage of words in the email that match a given word, e.g. business, address, internet, etc.
– 6 quantitative variables: the percentage of characters in the email that match a given character, e.g. ';', '(', etc.
– The average length of uninterrupted sequences of capital letters
– The length of the longest uninterrupted sequence of capital letters
– The sum of the lengths of uninterrupted sequences of capital letters
• Output variable: SPAM (1) or Email (0)
• The $f_j$'s are taken as cubic smoothing splines
SPAM Detection: Results
True Class     Predicted: Email (0)   Predicted: SPAM (1)
Email (0)      58.5%                  2.5%
SPAM (1)       2.7%                   36.2%

Sensitivity: probability of predicting spam given the true state is spam = $\frac{36.2}{36.2 + 2.7} \approx 0.93$
Specificity: probability of predicting email given the true state is email = $\frac{58.5}{58.5 + 2.5} \approx 0.96$
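A quick arithmetic check of these two rates from the table above (variable names are just for illustration):

```python
# Confusion-table entries as percentages of the whole test set
email_email, email_spam = 58.5, 2.5   # true email: predicted email / spam
spam_email, spam_spam = 2.7, 36.2     # true spam: predicted email / spam

sensitivity = spam_spam / (spam_spam + spam_email)       # P(predict spam | spam)
specificity = email_email / (email_email + email_spam)   # P(predict email | email)
print(round(sensitivity, 2), round(specificity, 2))      # 0.93 0.96
```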
GAM: Summary
• Useful flexible extensions of linear models
• Backfitting algorithm is simple and
modular
• Interpretability of the predictors (input variables) is not obscured
• Not suitable for very large data mining
applications (why?)
CART
• Overview
– Underlying principle: divide and conquer
– Partition the feature space into a set of rectangles
• For simplicity, use recursive binary partition
– Fit a simple model (e.g. constant) for each
rectangle
– Classification and Regression Trees (CART)
• Regression Trees
• Classification Trees
– Popular in medical applications
CART
• An example (in the regression case):
Basic Issues in Tree-based Methods
• How to grow a tree?
• How large should we grow the tree?
Regression Trees
• Partition the space into M regions: $R_1, R_2, \ldots, R_M$
$f(x) = \sum_{m=1}^{M} c_m I(x \in R_m)$
where $\hat{c}_m = \mathrm{average}(y_i \mid x_i \in R_m)$
Note that this is still an additive model
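As a sketch of this additive form (the box-style region representation below is an assumption made for illustration, not the notes' notation):

```python
import numpy as np

# Each region R_m is a hypothetical axis-aligned box, (lower, upper) per feature,
# with a constant prediction c_m = average of the training y_i falling in R_m.
regions = [
    {"lower": np.array([-np.inf, -np.inf]), "upper": np.array([0.5, np.inf]),  "c": 1.2},
    {"lower": np.array([0.5, -np.inf]),     "upper": np.array([np.inf, 2.0]),  "c": 3.4},
    {"lower": np.array([0.5, 2.0]),         "upper": np.array([np.inf, np.inf]), "c": 0.7},
]

def tree_predict(x, regions):
    """f(x) = sum_m c_m * I(x in R_m); the regions partition the space,
    so exactly one indicator is 1 for any x."""
    return sum(r["c"] * np.all((x >= r["lower"]) & (x < r["upper"])) for r in regions)

print(tree_predict(np.array([0.2, 5.0]), regions))  # falls in the first box -> 1.2
```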
Regression Trees – Grow the Tree
• The best partition: to minimize the sum of squared error:
$\sum_{i=1}^{N} (y_i - f(x_i))^2$
• Finding the global minimum is computationally infeasible
• Greedy algorithm: at each level choose variable j and
value s as:
$\arg\min_{j,s}\left[\min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2\right]$
• The greedy algorithm makes the tree unstable
– Errors made at the upper levels are propagated to the lower levels
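A minimal NumPy sketch of this greedy split search; the inner minimizations over $c_1$ and $c_2$ are solved by taking the region means. Function names and the synthetic data are illustrative.

```python
import numpy as np

def best_split(X, y):
    """Greedy search for the (feature j, threshold s) split that minimizes
    the total squared error of the two resulting half-planes."""
    N, p = X.shape
    best = (None, None, np.inf)
    for j in range(p):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            # optimal c_1, c_2 are the region means, so the inner min is just the SSE
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, s, sse)
    return best

# Usage: the best split should separate x_0 <= 0 from x_0 > 0
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] <= 0, 1.0, 5.0) + 0.1 * rng.standard_normal(200)
print(best_split(X, y))  # feature 0, threshold near 0
```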
Regression Tree – How Large Should We Grow the Tree?
• Trade-off between bias and variance
– Very large tree: overfit (low bias, high
variance)
– Small tree (low variance, high bias): might not
capture the structure
• Strategies:
– 1: split only when we can decrease the error
(usually short-sighted)
– 2: Cost-complexity pruning (preferred)
Regression Tree - Pruning
• Cost-complexity pruning:
– Pruning: collapsing some internal nodes
– Cost complexity:
$C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|$
The first term is the cost (the sum of squared errors over the terminal nodes); the second term is a penalty on the complexity/size of the tree.
– Choose the best $\alpha$: weakest-link pruning (p. 270, [HTF])
• Each time, collapse the internal node that adds the smallest increase in error
• Choose the best tree from the resulting tree sequence by cross-validation
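As an illustration outside the notes, scikit-learn exposes weakest-link (cost-complexity) pruning through its ccp_alpha parameter; a sketch of choosing $\alpha$ by cross-validation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(400, 2))
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(400)

# The pruning path gives the sequence of effective alphas (weakest-link pruning)
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Pick the alpha whose pruned tree has the best cross-validated score
scores = [
    cross_val_score(DecisionTreeRegressor(random_state=0, ccp_alpha=a), X, y, cv=5).mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
print(best_alpha)
```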
Classification Trees
• Classify the observations in node $m$ to the majority class in the node: $k(m) = \arg\max_k \hat{p}_{mk}$
– $\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)$ is the proportion of observations of class $k$ in node $m$
• Define the impurity of a node:
– Misclassification error: $1 - \hat{p}_{m\,k(m)}$
– Entropy: $-\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$
– Gini index: $\sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})$
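A small NumPy sketch of the three impurity measures, computed from the class proportions in a single node (names are illustrative):

```python
import numpy as np

def node_impurities(y_node, n_classes):
    """Return (misclassification error, entropy, Gini index) for the labels
    y_node falling in one node, given K = n_classes."""
    counts = np.bincount(y_node, minlength=n_classes)
    p = counts / counts.sum()                       # p_mk: proportion of each class
    misclass = 1.0 - p.max()                        # 1 - p_{m,k(m)}
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))  # -sum p log p (0 log 0 := 0)
    gini = np.sum(p * (1.0 - p))                    # sum p (1 - p)
    return misclass, entropy, gini

# Usage: a node with 80% of class 0 and 20% of class 1
print(node_impurities(np.array([0] * 8 + [1] * 2), n_classes=2))
# -> (0.2, 0.500..., 0.32)
```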
Classification Trees
(Figure: node impurity measures versus class proportion for a 2-class problem.)
• Entropy and the Gini index are more sensitive to changes in the node probabilities than misclassification error
• To grow the tree: use entropy or the Gini index
• To prune the tree: use the misclassification rate (or any other measure)
Tree-based Methods: Discussions
• Categorical Predictors
– Problem: consider splits of a subtree $t$ into $t_L$ and $t_R$ based on a categorical predictor $x$ with $q$ possible unordered values: there are $2^{q-1} - 1$ possible partitions!
– Remedy: treat the categorical predictor as ordered by, say, the proportion of class 1
Tree-based Methods: Discussions
• Linear Combination Splits
– Split the node based on $\sum_j a_j X_j \le s$
– Improves predictive power
– Hurts interpretability
• Instability of Trees
– Inherited from the hierarchical nature
– Bagging (section 8.7 of [HTF]) can reduce the
variance
Bootstrap Trees
Construct B trees from B bootstrap samples: the bootstrap trees
Bagging The Bootstrap Trees
$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$
where $\hat{f}^{*b}(x)$ is computed from the $b$th bootstrap sample, in this case a tree.
Bagging reduces the variance of the original tree by aggregation.
Bagged Tree Performance
(Figure: performance of the bagged trees, aggregating by majority vote versus averaging.)
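A minimal sketch of bagging classification trees (scikit-learn's DecisionTreeClassifier is used purely for illustration), showing both aggregation schemes, majority vote and averaged class probabilities:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bag_trees(X, y, B=50, seed=0):
    """Fit B trees, each on a bootstrap sample of the training data."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))   # sample N rows with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict_majority(trees, X):
    """Aggregate by majority vote over the B tree predictions (labels 0/1)."""
    votes = np.stack([t.predict(X) for t in trees])   # shape (B, n)
    return (votes.mean(axis=0) > 0.5).astype(int)

def predict_average(trees, X):
    """Aggregate by averaging the predicted class probabilities."""
    probs = np.mean([t.predict_proba(X)[:, 1] for t in trees], axis=0)
    return (probs > 0.5).astype(int)

# Usage on a noisy 2-class problem
rng = np.random.default_rng(4)
X = rng.standard_normal((300, 5))
y = (X[:, 0] + X[:, 1] + 0.5 * rng.standard_normal(300) > 0).astype(int)
trees = bag_trees(X, y)
print((predict_majority(trees, X) == y).mean(), (predict_average(trees, X) == y).mean())
```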
MARS
• In multi-dimensional splines the number of basis functions grows exponentially: the curse of dimensionality
• A partial remedy is a greedy forward search
algorithm
– Create a simple basis-construction dictionary
– Construct basis functions on-the-fly
– Choose the best-fit basis function at each step
Basis functions
• 1-dim linear spline ($t$ represents the knot): the reflected pair $(x - t)_+$ and $(t - x)_+$
• Basis collection $\mathcal{C} = \{(X_j - t)_+,\ (t - X_j)_+ : t \in \{x_{1j}, \ldots, x_{Nj}\},\ j = 1, \ldots, p\}$, so $|\mathcal{C}| = 2 \cdot N \cdot p$
The MARS procedure (1st stage)
1. Initialize the basis set M with a constant function
2. Form candidate terms (cross-products of functions in M with functions in the set C)
3. Add the best-fit basis pair (the one that decreases residual error the most) to M
4. Repeat from step 2 (until, e.g., |M| >= threshold)
(Diagram: products of the current basis set M (old) with the collection C form the enlarged set M (new).)
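A compact sketch of this forward pass under the definitions above; it omits the restriction, discussed later, that each input appears at most once per product, and all names are illustrative.

```python
import numpy as np

def hinge(x, t, sign):
    """Reflected-pair pieces: (x - t)_+ if sign=+1, (t - x)_+ if sign=-1."""
    return np.maximum(sign * (x - t), 0.0)

def mars_forward(X, y, max_terms=9):
    """Greedy MARS forward pass (sketch): B holds the columns of the current
    basis set M, starting from the constant function."""
    N, p = X.shape
    B = [np.ones(N)]                                   # step 1: constant basis
    while len(B) < max_terms:                          # step 4: repeat
        best = None                                    # (rss, new columns)
        for m in B:                                    # step 2: candidates M x C
            for j in range(p):
                for t in np.unique(X[:, j]):
                    cols = [m * hinge(X[:, j], t, +1), m * hinge(X[:, j], t, -1)]
                    A = np.column_stack(B + cols)
                    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
                    rss = resid @ resid
                    if best is None or rss < best[0]:
                        best = (rss, cols)
        B += best[1]                                   # step 3: add the best-fit pair
    coef = np.linalg.lstsq(np.column_stack(B), y, rcond=None)[0]
    return B, coef

# Usage: an additive target recovered with a handful of hinge terms
rng = np.random.default_rng(5)
X = rng.uniform(0, 1, size=(100, 2))
y = 2 * np.maximum(X[:, 0] - 0.5, 0) - X[:, 1] + 0.05 * rng.standard_normal(100)
B, coef = mars_forward(X, y, max_terms=5)
print(len(B), coef)
```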
The MARS procedure (2nd stage)
The final model M typically overfits the data
=> Need to reduce the model size (number of terms)
Backward deletion procedure:
1. Remove the term whose deletion causes the smallest increase in residual error
2. Compute the GCV of the reduced model (next slide)
3. Repeat from step 1
Choose the model size with minimum GCV.
Generalized Cross Validation (GCV)
• $\mathrm{GCV}(\lambda) = \dfrac{\sum_{i=1}^{N} (y_i - \hat{f}_\lambda(x_i))^2}{(1 - M(\lambda)/N)^2}$
• $M(\lambda) = r + cK$ measures the effective number of parameters:
– r: number of linearly independent basis functions
– K: number of knots selected
– c = 3
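A small sketch of the GCV computation under these definitions; only the accounting $M(\lambda) = r + cK$ is taken from the slide, the rest is illustrative.

```python
import numpy as np

def gcv(y, y_hat, n_basis, n_knots, c=3):
    """Generalized cross-validation score for a MARS model of a given size."""
    N = len(y)
    rss = np.sum((y - y_hat) ** 2)     # residual sum of squares
    M = n_basis + c * n_knots          # effective number of parameters M(lambda) = r + cK
    return rss / (1.0 - M / N) ** 2

# Usage: GCV trades residual error against the effective number of parameters
rng = np.random.default_rng(6)
y = rng.standard_normal(200)
print(gcv(y, y_hat=0.9 * y, n_basis=5, n_knots=2),     # smaller model, larger residuals
      gcv(y, y_hat=0.95 * y, n_basis=21, n_knots=10))  # larger model, smaller residuals
```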
Discussion
• Piecewise linear reflected basis
– Allows operation on a local region
– Fitting the N reflected basis pairs takes O(N) operations instead of O(N^2)
• The left part is zero and, as the knot moves between successive sorted values $x_{[i]}$ and $x_{[i+1]}$, the right part differs only by a constant, which is what allows the O(N) fit
Discussion (continued)
• Hierarchical model (reduce search computation)
– High-order term exists => some lower-order
“footprints” exist
• Restriction: each input appears at most once in a product,
e.g. $(X_j - t_1)_+ \cdot (X_j - t_2)_+$ is not considered
• Set upper limit on order of interaction
– Upper limit of 1 => additive model
• MARS for classification
– Use multi-response Y (N*K indicator matrix)
– Masking problem may occur
– Better solution: “optimal scoring” (Chapter 12.5 of
[HTF])
MARS & CART relationship
IF
• the piecewise linear basis functions are replaced by step functions, and
• only the newly formed product terms are kept in M (the leaf nodes of a binary tree)
THEN
MARS forward procedure = CART tree-growing procedure