Quasi-random resampling
O. Teytaud *, S. Gelly *,
S. Lallich **, E. Prudhomme **
*Equipe I&A-TAO,
LRI, Université Paris-Sud,
Inria, UMR-Cnrs 8623
**Equipe ERIC,
Université Lyon 2
Email : [email protected], [email protected],
[email protected], [email protected]
What is the problem ?
Many tasks in AI are based on random resamplings :
● cross-validation
● bagging
● bootstrap
● ...
Resampling is time-consuming
● cross-validation for choosing hyper-parameters
● bagging on huge datasets
==> we want to obtain with n resamplings the same result as with N >> n resamplings
A typical example
You want to learn a relation x --> y on a huge dataset.
The dataset is too large for your favorite learner.
A traditional solution is subagging: average 100 learnings performed on random subsamples (1/20) of your dataset.
We propose: use QR-sampling to average only 40 learnings.
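For concreteness, a minimal sketch of plain (pseudo-random) subagging; the synthetic dataset, the learner and the subsample size below are illustrative choices, not from the talk. QR-sampling, described later, would replace the random choice of the subsamples.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for a "huge" dataset.
n, d = 20000, 5
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

n_bags, m = 100, n // 20        # 100 learnings, each on a 1/20 subsample
models = []
for _ in range(n_bags):
    idx = rng.choice(n, size=m, replace=False)   # random subsample, no replacement
    models.append(DecisionTreeRegressor(max_depth=6).fit(X[idx], y[idx]))

# The subagged prediction is the average of the individual predictions.
X_test = rng.normal(size=(1000, d))
y_pred = np.mean([mdl.predict(X_test) for mdl in models], axis=0)
```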
Organization of the talk
(1) why resampling is Monte-Carlo integration
(2) quasi-random numbers
(3) quasi-random numbers in strange spaces
(4) applying quasi-random numbers in resampling
(5) when does it work and when doesn't it work ?
Why resampling is Monte-Carlo integration
What is Monte-Carlo integration:
E f(x) ≈ sum_i f(x(i)) / n
What is cross-validation:
Error-rate = E f(x) ≈ sum_i f(x(i)) / n
where f(x) = error rate with the partitioning x
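A small illustration of this identity (the dataset, learner and number of splits are arbitrary choices): each random partitioning is one sample x(i), and the estimated error rate is the Monte-Carlo average of f over those partitionings.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

def f(seed):
    """Test error obtained with one random partitioning of the data."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return 1.0 - model.score(X_te, y_te)

n = 50   # number of resamplings = number of Monte-Carlo samples
error_rate = np.mean([f(seed) for seed in range(n)])   # E f(x) ≈ sum f(x(i)) / n
print(error_rate)
```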
An introduction to QR-numbers
(1) why resampling is Monte-Carlo integration
(2) quasi-random numbers
(3) quasi-random numbers in strange spaces
(4) applying quasi-random numbers in resampling
(5) when does it work and when doesn't it work ?
QR-numbers
(2) quasi-random numbers (less randomized numbers)
We have seen that resampling is Monte-Carlo integration; now we will see how Monte-Carlo integration has been strongly improved.
Quasi-random numbers ?
Random samples in [0,1]^d can be not-so-well distributed
--> error in Monte-Carlo integration O(1/sqrt(n)) with n the number of points
Pseudo-random samples ≈ random samples (we try to be very close to pure random)
Quasi-random samples: error O(1/n) within logarithmic factors
--> we don't try to be as close as possible to random
--> number of samples much smaller for a given precision
Quasi-random = low discrepancy ?
Discrepancy = max | Area – Frequency |
A better discrepancy ?
Discrepancy_2 = mean( | Area – Frequency |^2 )
Existing bounds on low-discrepancy Monte-Carlo
Random --> Discrepancy ~ sqrt( 1/n )
Quasi-random --> Discrepancy ~ log(n)^d / n
Koksma & Hlawka:
error in Monte-Carlo integration < Discrepancy x V
V = total variation (Hardy & Krause)
(many generalizations in Hickernell, A Generalized Discrepancy and Quadrature Error Bound, 1997)
Which set do you trust ?
Which quasi-random numbers ?
« Halton sequence with a simple scrambling scheme »
● fast (as fast as pseudo-random numbers);
● easy to implement;
● available freely if you don't want to implement it.
(we will not detail how this sequence is built here)
(also: Sobol sequence)
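A minimal sketch with an off-the-shelf implementation (this assumes SciPy >= 1.7, whose scipy.stats.qmc module provides scrambled Halton and Sobol generators and a discrepancy estimate; the dimension and sample size are arbitrary):

```python
import numpy as np
from scipy.stats import qmc

d, n = 5, 1024
qr_points = qmc.Halton(d=d, scramble=True, seed=0).random(n)   # quasi-random points
pr_points = np.random.default_rng(0).random((n, d))            # pseudo-random points

# Lower discrepancy = the points fill [0,1]^d more evenly.
print("scrambled Halton discrepancy:", qmc.discrepancy(qr_points))
print("pseudo-random    discrepancy:", qmc.discrepancy(pr_points))
```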
What else than Monte-Carlo integration ?
Thanks to various forms of quasi-random:
● Numerical integration [thousands of papers; Niederreiter 92]
● Learning [Cervellera et al, IEEE TNN 2004; Mary, PhD 2005]
● Optimization [Teytaud et al, EA'2005]
● Modelization of random processes [Growe-Kruska et al, BPTP'03; Levy's method]
● Path planning [Tuffin]
... and how to do it in strange spaces ?
(1) why resampling is Monte-Carlo integration
(2) quasi-random numbers
(3) quasi-random numbers in strange spaces
(4) applying quasi-random numbers in resampling
(5) when does it work and when doesn't it work ?
Have fun with QR in strange spaces
(3) quasi-random numbers in strange spaces
We have seen that resampling is Monte-Carlo integration, and how Monte-Carlo is replaced by Quasi-Random Monte-Carlo.
But resampling is random in a non-standard space.
We will see how to do Quasi-Random Monte-Carlo in non-standard spaces.
Quasi-random numbers in strange spaces
We have seen hypercubes... but we need something else !
Sample of points ---> QR sample of points
Sample of samples ---> QR sample of samples
Quasi-random points in strange spaces
Fortunately, some QR-points exist also in various spaces.
Why not in something isotropic ?
How to do it on the sphere ? Or for Gaussian distributions ?
For the Gaussian: easy !
Generate x in [0,1]^d by quasi-random
Build y such that P( N < y(i) ) = x(i)
It works because the distribution = product of the distributions of the y(i)
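A minimal sketch of this construction, assuming SciPy for the scrambled Halton points and the normal quantile function (dimension and sample size are arbitrary):

```python
import numpy as np
from scipy.stats import norm, qmc

d, n = 3, 256
x = qmc.Halton(d=d, scramble=True, seed=0).random(n)   # QR points in [0,1]^d
x = np.clip(x, 1e-12, 1 - 1e-12)                       # guard against exact 0 or 1
y = norm.ppf(x)                                        # y(i) solves P(N < y(i)) = x(i)

print(y.mean(axis=0), y.std(axis=0))                   # roughly 0 and 1 per coordinate
```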
What in the general case ?
- generate x in [0,1]^d
- define y(i) such that P( t < y(i) | y(1), y(2), ..., y(i-1) ) = x(i)
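For intuition, a toy instance of this conditional inverse-CDF construction; the target below (y1 uniform on [0,1], then y2 uniform on [0, y1]) is only an illustration, not one of the distributions used in the paper:

```python
import numpy as np
from scipy.stats import qmc

x = qmc.Halton(d=2, scramble=True, seed=0).random(512)  # QR points in [0,1]^2

y1 = x[:, 0]            # inverse CDF of Uniform(0,1): the identity
y2 = x[:, 1] * y1       # inverse of the conditional CDF t -> t / y1 on [0, y1]

samples = np.column_stack([y1, y2])   # QR sample of the dependent pair (y1, y2)
```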
However, this is what we will do:
we do not have anything better than this general method for the strange distributions in which we are interested.
● At least we can prove the O(1/n) property (see the paper)
● Perhaps there is something much better
● Perhaps there is something much simpler
The QR-numbers in resampling
(4) applying quasi-random numbers in resampling
We have seen that resampling is Monte-Carlo integration, and that we are able to generate quasi-random points for any distribution in continuous domains;
==> it should work
==> let's see in detail how to move the problem to the continuous domain
QR-numbers in resampling
A very particular distribution for QR-points: bootstrap samples. How to move the problem to continuous spaces ?
y(i) = x(r(i)) where r(i) = uniformly distributed at random in [[1,n]] ==> this is discrete
We know: rectangular uniform distribution ---> any continuous distribution
We need: a continuous distribution ---> our discrete distribution
y(i) = x(r(i)) where r(i) = uniformly distributed at random in [[1,n]] --> many solutions exist
What are bootstrap samples ?
Our technique works for various forms of resampling:
- subsamples without replacement (random-CV, subagging)
- subsamples with replacement (bagging, bootstrap)
- random partitioning (k-CV).
W.l.o.g., we present here the sampling of n elements from a sample of size n with replacement (= bootstrap resampling).
(useful in e.g. bagging, bias/variance estimation...)
A naive solution
y(i) = x(r(i))
r(1),...,r(n) = ceil( n x qr ) where qr ∈ [0,1]^n
QR in dimension n, with n the number of examples.
Example: qr = (0.1, 0.9, 0.84, 0.9, 0.7) ==> counts (1, 0, 0, 1, 3)
==> all permutations of ( 0.1, 0.9, 0.84, 0.9, 0.7 ) lead to the same result !
...which does not work.
In practice it does not work better than random.
Two very distinct QR-points can lead to very similar resamples (a permutation of a point leads to the same sample).
We have to remove this symmetry.
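A short sketch of the naive mapping, reproducing the count vector from the slide and showing that a permuted QR point yields the very same resample:

```python
import numpy as np

def naive_counts(qr):
    """Count how many times each of the n examples is drawn for one QR point qr."""
    n = len(qr)
    r = np.ceil(n * np.asarray(qr)).astype(int)   # r(i) = ceil(n * qr(i)) in 1..n
    return np.bincount(r - 1, minlength=n)

qr = [0.1, 0.9, 0.84, 0.9, 0.7]
print(naive_counts(qr))            # [1 0 0 1 3], the counts from the slide
print(naive_counts(sorted(qr)))    # a permuted QR point gives the very same resample
```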
A less naive solution
z(i) = number of times x(i) appears in the bootstrap sample
z(1) = binomial
z(2) | z(1) = binomial
z(3) | z(1), z(2) = binomial
...
z(n-1) | z(1), z(2), ..., z(n-2) = binomial
z(n) | z(1), z(2), ..., z(n-1) = constant
==> yes, it works !
==> moreover, it works for many forms of resampling, and not only bootstrap !
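A sketch of this decomposition, where each conditional binomial is inverted at the quantile given by one coordinate of the QR point (scipy.stats.binom and a scrambled Halton generator are used; the sample size and the number of resamples are arbitrary):

```python
import numpy as np
from scipy.stats import binom, qmc

def qr_bootstrap_counts(qr_point):
    """Map one QR point in [0,1]^n to the count vector z of a bootstrap sample."""
    q = np.clip(qr_point, 1e-12, 1 - 1e-12)   # keep quantiles strictly inside (0,1)
    n = len(q)
    z = np.zeros(n, dtype=int)
    remaining = n                              # draws not yet attributed
    for i in range(n - 1):
        if remaining == 0:
            break
        # z(i) | z(1..i-1) ~ Binomial(remaining, 1/(n-i)), inverted at quantile q(i)
        z[i] = int(binom.ppf(q[i], remaining, 1.0 / (n - i)))
        remaining -= z[i]
    z[n - 1] = remaining                       # the last count is deterministic
    return z

n = 20
halton = qmc.Halton(d=n, scramble=True, seed=0)
for qr_point in halton.random(5):              # 5 QR bootstrap resamples
    z = qr_bootstrap_counts(qr_point)
    resample = np.repeat(np.arange(n), z)      # indices of the bootstrap sample
    print(z.sum(), z)                          # the counts always sum to n
```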
With dimension-reduction it's better
Put the x(i)'s in k clusters
z(i) = number of times an element of cluster i appears in the bootstrap sample
z(1) = binomial
z(2) | z(1) = binomial
z(3) | z(1), z(2) = binomial
...
z(k-1) | z(1), z(2), ..., z(k-2) = binomial
z(k) | z(1), z(2), ..., z(k-1) = constant
(then, randomly draw the elements in each cluster)
Let's summarize
Put the x(i)'s in k clusters
z(i) = number of times an element of cluster i appears in the bootstrap sample
z(1) = binomial
z(2) | z(1) = binomial
...
z(k) | z(1), z(2), ..., z(k-1) = constant
We quasi-randomize this z(1),...,z(k).
Then, we randomly draw the elements in each cluster.
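A sketch of the full recipe; the clustering method (k-means), the number of clusters and the synthetic data are illustrative choices, and the per-cluster counts reuse the binomial decomposition shown above:

```python
import numpy as np
from scipy.stats import binom, qmc
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                  # dataset to be resampled
n, k = len(X), 8                               # k clusters, k << n

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
members = [np.flatnonzero(labels == c) for c in range(k)]
weights = np.array([len(m) for m in members]) / n   # P(one draw falls in cluster c)

def qr_cluster_counts(q):
    """z(c) = number of bootstrap draws falling in cluster c, quasi-randomized."""
    q = np.clip(q, 1e-12, 1 - 1e-12)
    z = np.zeros(k, dtype=int)
    remaining, mass = n, 1.0
    for c in range(k - 1):
        if remaining == 0:
            break
        p = min(1.0, weights[c] / mass)        # conditional cluster probability
        z[c] = int(binom.ppf(q[c], remaining, p))
        remaining -= z[c]
        mass -= weights[c]
    z[k - 1] = remaining
    return z

halton = qmc.Halton(d=k, scramble=True, seed=0)    # QR only in dimension k, not n
for q in halton.random(3):                         # 3 QR bootstrap resamples
    z = qr_cluster_counts(q)
    idx = np.concatenate([rng.choice(m, size=zc, replace=True)
                          for m, zc in zip(members, z) if zc > 0])
    print(len(idx))                                # each resample has n elements
```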
Let's conclude
(1) why resampling is Monte-Carlo integration
(2) quasi-random numbers
(3) quasi-random numbers in strange spaces
(4) applying quasi-random numbers in resampling
(5) when does it work and when doesn't it work ?
Experiments
In our (artificial) experiments:
● QR-randomCV is better than randomCV
● QR-bagging is better than bagging
● QR-subagging is better than subagging
● QR-Bsfd is better than Bsfd (a bootstrap)
But QR-kCV is not better than kCV:
kCV already has some derandomization, since each point appears the same number of times in learning.
A typical example
You want to learn a relation x --> y on a huge ordered dataset.
The dataset is too large for your favorite learner.
A traditional solution is subagging: average 100 learnings performed on random subsets (1/20) of your dataset.
We propose: use QR-sampling to average only 40 learnings.
Or do you have a better solution for choosing 40 subsets of 1/20 ?
Conclusions
Therefore:
● perhaps simpler derandomizations are enough ?
● perhaps in cases like CV, in which « symmetrizing » (picking each example the same number of times) is easy, this is useless ?
For bagging, subagging, bootstrap, simplifying the approach is not so simple
==> now, we use QR-bagging, QR-subagging and QR-bootstrap instead of bagging, subagging and bootstrap
Further work
Real-world experiments (in progress, for DP-applications)
Other dimension reductions (this one involves clustering)
Simplified derandomization methods (jittering, antithetic variables, ...)
Random clustering for dimension reduction ? (yes, we have not tested it, sorry...)
Low Discrepancy ?
Random --> Discrepancy ~ sqrt( 1/n )
Quasi-random --> Discrepancy ~ log(n)^d / n
Koksma & Hlawka:
error in Monte-Carlo integration < Discrepancy x V
V = total variation (Hardy & Krause)
(many generalizations in Hickernell, A Generalized Discrepancy and Quadrature Error Bound, 1997)
Dimension 1
● What would you do ?
● --> Van der Corput
● n=1, n=2, n=3...
● written in base 2: n=1, n=10, n=11, n=100, n=101, n=110...
● digits mirrored around the radix point: x=.1, x=.01, x=.11, x=.001, x=.101, ...
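A minimal implementation of this construction (the printed values match the binary expansions on the slide):

```python
def van_der_corput(n, base=2):
    """n-th Van der Corput number: write n in the base, mirror the digits."""
    x, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        x += digit / denom
    return x

print([van_der_corput(n) for n in range(1, 7)])
# [0.5, 0.25, 0.75, 0.125, 0.625, 0.375]  i.e.  .1, .01, .11, .001, .101, .011 in base 2
```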
Dimension 1 more general
● p=2, but also p=3, 4, ...
● but p=13 is not very nice :
Dimension n
● x --> (x,x) ?
Dimension n
● x --> (x,x') ?
Dimension n : Halton
● x --> (x,x') with prime numbers
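A minimal sketch of the Halton construction: one Van der Corput coordinate per prime base (the radical-inverse helper is repeated so the snippet stands alone):

```python
def van_der_corput(n, base):
    """Radical inverse of n in the given base (as in the previous snippet)."""
    x, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        x += digit / denom
    return x

def halton_point(n, bases=(2, 3)):
    """n-th Halton point: one Van der Corput coordinate per (prime) base."""
    return tuple(van_der_corput(n, b) for b in bases)

print([halton_point(n) for n in range(1, 4)])
# [(0.5, 0.333...), (0.25, 0.666...), (0.75, 0.111...)]
```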
Dimension n+1 : Hammersley
● x --> (n/N, x, x') but --> closed sequence
Dimension n : the trouble
● There are not so many small prime numbers
Dimension n : scrambling
(when random comes back)
● Pi(p) : [1,p-1] --> [1,p-1] (randomly chosen)
● Pi(p) applied to the coordinate built with prime p (there exist much more complicated scrambling schemes)
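A sketch of one such simple scrambling scheme: a random permutation Pi(p) of the non-zero digits in base p, applied digit-wise before the mirroring (real scrambling schemes can be much more elaborate, as the slide notes):

```python
import random

def scrambled_van_der_corput(n, base, perm):
    """Van der Corput number in `base`, each digit remapped through `perm`."""
    x, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        x += perm[digit] / denom
    return x

def random_digit_permutation(base, rng):
    """Random permutation Pi(p) of 1..base-1; the digit 0 is kept fixed."""
    nonzero = list(range(1, base))
    rng.shuffle(nonzero)
    return [0] + nonzero

rng = random.Random(0)
bases = (2, 3, 5)
perms = {p: random_digit_permutation(p, rng) for p in bases}
points = [tuple(scrambled_van_der_corput(n, p, perms[p]) for p in bases)
          for n in range(1, 9)]
print(points[:3])
```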