No slide title - ntpu.edu.tw


Data Mining, Data Perturbation and Degrees of
Freedom of Projection Regression
T.C. Lin
* to appear in JRSS Series C
The Canadian lynx data give the annual record of Canadian lynx trapped in the Mackenzie River district of north-west Canada for the period 1821-1934 (n = 114).
•Some facts about this data set:
•It is a fairly noisy set of field data.
•It has nonlinear and non-Gaussian characteristics (see Tong, 1990).
•It has been used as a benchmark to gauge the performance of various time series methods
(cf. Campbell and Walker (1977), Tong (1980, 1990), Lim (1987)).
Nowadays the same data set is used routinely in the formulation, selection, estimation, diagnostics and prediction of statistical models.
Parametric:
•Linear: AR, MA, ARMA
•Nonlinear: SETAR, ARCH, ...
Nonparametric*:
•Most general: Xt = f(Xt-i1, ..., Xt-ip) + et
•Additive: Xt = f1(Xt-i1) + ... + fp(Xt-ip) + et (ADD)
•PPR: Y = β0 + Σ_{k=1}^K βk fk(a'k X) + e
*The selection and estimation are usually based on smoothing, backfitting, BRUTO, ACE, Projector, etc. (Hastie, 1990).
Questions:
•Is it possible to compare the performances of such models? (Can nonparametric methods produce models with better predictive power than parametric methods?)
•What is the degrees of freedom (df) of a model? (How can one assess and adjust for the impact of data mining?)
While no universal answer can be expected, a data perturbation procedure (Wong and Ye, 1996) is used here to assess and adjust for the impact of df on each fitted model.
Why Data Mining?
The theory of statistical inference is based on the assumption that the model for the data is given a priori, i.e.,
x1, ..., xn ~ p(x|Θ),
where p(x|Θ) is a known distribution with unknown parameters Θ.
But in practice this assumption is not tenable, since the model is seldom formulated using subject-matter knowledge or data-free procedures. Consequently, overfitting or data mining occurs frequently in the modern data analysis environment.
How to count the df?
•Parametric:
df = # of parameters in the model.
•Nonparametric:
df is at the heart of the problem of assessing the impact of data mining.
•Example 1. Ŷ = WY, df = tr(W); see Hastie and Tibshirani (1990).
•Example 2. df = tr(H), introduced later.
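A quick numerical sketch of Example 1 (the code and setup are mine, not from the slides): an OLS fit is a linear smoother Ŷ = WY with W = X(X'X)^{-1}X', so tr(W) recovers the parametric count.

```python
import numpy as np

# Example 1 as a sketch: for a linear smoother Y_hat = W Y, df = tr(W).
# OLS is such a smoother with W = X (X'X)^{-1} X', and tr(W) equals
# the number of fitted parameters (here: intercept + 3 covariates = 4).
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
W = X @ np.linalg.inv(X.T @ X) @ X.T  # hat (smoother) matrix

df = np.trace(W)
print(round(df, 6))  # 4.0
```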
Idea of data perturbation:
Intuitively and ideally, any estimated model should be validated using a new data set.
Data perturbation can be viewed as a method of generating new data that are close to the observed response Y (a generalization of Breiman's little bootstrap method).
Write Ŷ = f̂(Y) = (f̂1(Y), ..., f̂n(Y))' and δ = (δ1, ..., δn)'.
We would like to have
f̂(Y + δ) ≈ f̂(Y) + Hδ, i.e., f̂i(Y + δ) ≈ f̂i(Y) + hii δi,
where H = [hij] = [∂f̂i/∂Yj], i, j = 1, ..., n.
=> Effective degrees of freedom (EDF) = tr(H) = Σ_{i=1}^n hii.
Table 1. MSE and SD of five models fitted to the lynx data

Model    AR(2)    SETAR    ADD(1,2)  ADD(1,2,9)  PPR
MSE      0.0459   0.0358   0.0455    0.0381      0.0194
MSEadj   0.0443   0.0365   0.0377    0.100       0.347
SD       0.247    0.295    0.136

About SD: fit the same class of models to the first 100 observations, keeping the last 14 for out-of-sample predictions. SD = the standard deviation of the multi-step-ahead prediction errors.
Models for the lynx data.
◆Model 1: Moran's AR(2) model
Xt = 1.05 + 1.41 Xt-1 - 0.77 Xt-2 + et,
et ~ WN(0, 0.0459).
◆Model 2: Tong's SETAR(2;7,2) model
Xt = 0.546 + 1.032 Xt-1 - 0.173 Xt-2 + 0.171 Xt-3 - 0.431 Xt-4 + 0.332 Xt-5 - 0.284 Xt-6 + 0.210 Xt-7 + et(1), if Xt-2 ≤ 3.116;
Xt = 2.632 + 1.492 Xt-1 - 1.324 Xt-2 + et(2), if Xt-2 > 3.116;
var(et(1)) = 0.0259, var(et(2)) = 0.0505 (pooled var = 0.0358).
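As a sketch (the function and code are mine; the coefficients are Tong's from above, which apply to the log10-transformed counts), Model 2 gives a one-step-ahead predictor:

```python
import numpy as np

def setar_one_step(x):
    """One-step-ahead forecast from Tong's SETAR(2;7,2) model.

    x: sequence of at least 7 past values (log10 lynx counts), with
    x[-1] = X_{t-1}, x[-2] = X_{t-2}, ..., x[-7] = X_{t-7}.
    Returns the conditional mean of X_t (the noise term is omitted).
    """
    x = np.asarray(x, dtype=float)
    if x[-2] <= 3.116:   # lower regime: AR(7)
        coef = np.array([1.032, -0.173, 0.171, -0.431, 0.332, -0.284, 0.210])
        return 0.546 + coef @ x[::-1][:7]   # coef[0] pairs with X_{t-1}
    else:                # upper regime: AR(2)
        return 2.632 + 1.492 * x[-1] - 1.324 * x[-2]

# Upper regime example: X_{t-2} = 3.2 > 3.116
print(round(setar_one_step([0, 0, 0, 0, 0, 3.2, 3.5]), 4))  # 3.6172
```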
BRUTO Algorithm (see Hastie and Tibshirani, 1990)
BRUTO is a forward model selection procedure using a modified GCV, defined by

GCV_b = Σ_{i=1}^n { yi - Σ_{j=1}^p f̂_{j,λj}(xij) }² / ( n { 1 - [1 + Σ_{j=1}^p (tr(S_{j,λj}) - 1)] / n }² ),

to choose the significant variables and their smoothing parameters λj.
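A direct transcription of the GCV_b formula above (the function name and the toy numbers are mine, not the lynx fit):

```python
import numpy as np

def gcv_bruto(y, fitted, smoother_traces):
    """Modified GCV from the BRUTO slide.

    y, fitted: response and additive-fit values (length n).
    smoother_traces: tr(S_j) for each of the p fitted smoothers.
    Total df charged = 1 (constant) + sum of (tr(S_j) - 1).
    """
    y = np.asarray(y, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    n = len(y)
    rss = np.sum((y - fitted) ** 2)
    df = 1 + sum(t - 1 for t in smoother_traces)
    return rss / (n * (1 - df / n) ** 2)

# Toy check: n = 10, RSS = 2, two smoothers with tr(S_j) = 3 each,
# so df = 1 + 2 + 2 = 5 and GCV_b = 2 / (10 * (1 - 5/10)^2) = 0.8
y = np.zeros(10)
fitted = np.full(10, np.sqrt(0.2))  # RSS = 10 * 0.2 = 2
print(round(gcv_bruto(y, fitted, [3, 3]), 4))  # 0.8
```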
K = 5 (number of potential variables):
•Model 3: ADD(1,2)
Xt = 1.07 + 1.37 Xt-1 + s(Xt-2, 3) + et,
where
•et ~ WN(0, 0.0455),
•s(x, d) stands for a cubic smoothing spline with df = d fitted to a time series x.
K = 10:
•Model 4: ADD(1,2,9)
Xt = 0.61 + 1.15 Xt-1 + s(Xt-2, 3) + s(Xt-9, 3) + et,
where et ~ WN(0, 0.0381).
What is PPR (projection pursuit regression)?
Let
•Xt-1 = (Xt-i1, Xt-i2, ..., Xt-ip)' be a random vector and
•a1, a2, ... denote p-dimensional unit "direction" vectors.
PPR models are additive in linear combinations of past values,
Xt = Σ_{k=1}^K f*k(a'k Xt-1) + et
   = β0 + Σ_{k=1}^K βk fk(a'k Xt-1) + et.
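A minimal one-term (K = 1) illustration of the projection idea, assuming p = 2, a grid of unit directions and a cubic polynomial standing in for the ridge function fk (the actual PPR algorithm alternates smoother fits with updates of the directions; everything here is my own sketch):

```python
import numpy as np

# One-term projection regression sketch: scan unit directions a,
# fit a polynomial ridge function of y on a'X, keep the best direction.
rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 2))
a_true = np.array([0.8, -0.6])          # true unit direction
y = (X @ a_true) ** 2                   # y depends on X only through a_true'X

best = None
for theta in np.linspace(0, np.pi, 181):            # directions up to sign
    a = np.array([np.cos(theta), np.sin(theta)])
    z = X @ a
    resid = y - np.polyval(np.polyfit(z, y, 3), z)  # cubic ridge fit
    rss = resid @ resid
    if best is None or rss < best[0]:
        best = (rss, a)

rss, a_hat = best
# the chosen direction should align with a_true (up to sign) with tiny RSS
print(abs(a_hat @ a_true) > 0.999, rss < 0.2)
```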
•Model 5: PPR
Because of Model 4, and for simplicity, we take (Xt-1, Xt-2, Xt-9)' as the covariate vector. Using the PPR algorithm, we have

k   βk       ak (projections)
1   0.4677   (0.8619, -0.5021, 0.0703)
2   0.1241   (0.6515, -0.4106, -0.6378)
3   0.1380   (0.0665, -0.9888, 0.1335)
4   0.1251   (-0.6473, 0.6631, 0.3756)
5   0.1384   (-0.5305, 0.6342, 0.5623)

and MSE = 0.0194.
Data Perturbation Procedure:
•For an integer m > 1 (the Monte Carlo sample size), generate δ1, ..., δm as i.i.d. N(0, t²In), where t > 0 and In is the n×n identity matrix.
•Use the "perturbed" data Y + δj to re-compute (refit)
f̂(Y + δj), j = 1, 2, ..., m.
•For i = 1, 2, ..., n, the slope of the LS line fitted to (f̂i(Y + δj), δij), j = 1, ..., m, gives an estimate of hii.
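A minimal sketch of the procedure on a smoother whose H is known (OLS, so tr(H) should come out near the number of parameters); all names and settings are my own assumptions, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, t = 30, 2000, 0.1

X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # 3 parameters
Y = X @ rng.normal(size=3) + rng.normal(size=n)

def fhat(y):
    # the "fitting procedure": OLS, a linear smoother with tr(H) = 3
    return X @ np.linalg.lstsq(X, y, rcond=None)[0]

base = fhat(Y)
delta = rng.normal(scale=t, size=(m, n))          # delta_1, ..., delta_m ~ N(0, t^2 I_n)
refits = np.array([fhat(Y + d) for d in delta])   # refit on each perturbed data set

# slope of the LS line of f_hat_i(Y + delta_j) on delta_ij estimates h_ii
# (centering at the unperturbed fit, so the intercept drops out)
num = ((refits - base) * delta).sum(axis=0)
den = (delta ** 2).sum(axis=0)
edf = (num / den).sum()                           # EDF = sum of h_ii, approx tr(H) = 3
print(round(edf, 2))
```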
Conclusion:
The comparison and our findings bring home the danger of not accounting for the impact of data mining. Evidently, extensive data mining yields models and estimates that are too faithful to the data, with small (in-sample) prediction errors and yet considerable loss of out-of-sample prediction accuracy.