Learning intersections and thresholds of halfspaces

Learning intersections and thresholds of halfspaces
Adam Klivans (MIT/Harvard)
Ryan O’Donnell (MIT)
Rocco Servedio (Harvard)
Learning
We consider the PAC model of [Valiant-84],
in which learning a “concept class” C of
boolean functions means:
- a function f in C is selected, and also a probability distribution D over {+1,−1}^n
- the learning algorithm gets access to random examples <x, f(x)>, where the x's are drawn from D
- goal: efficiently output a hypothesis h such that, w.h.p., Pr_{x←D}[f(x) ≠ h(x)] < ε.
Learning example
Example: C is the class of all conjunctions of
variables.
Perhaps the concept selected is:
x1 AND x2 AND x4.
One might see examples:
< (+ + − + − +), + >
< (− + − + − −), − >
< (+ + + − − +), − >
What is a learning algorithm for this class?
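The standard answer (not spelled out on the slide) is the elimination algorithm: start with the conjunction of all variables and delete every variable that some positive example sets to −1. A minimal Python sketch for the monotone case shown above (names are my own):

    # Sketch (mine) of the classical elimination algorithm for learning conjunctions of
    # variables (the monotone case shown on the slide).  Examples are (x, label) pairs
    # with x a tuple over {+1,-1}, where +1 means "true", and label in {+1,-1}.
    def learn_conjunction(examples, n):
        alive = set(range(n))                       # start with the AND of all variables
        for x, label in examples:
            if label == +1:
                # a positive example eliminates every variable set to -1 in it
                alive -= {i for i in range(n) if x[i] == -1}
        return lambda x: +1 if all(x[i] == +1 for i in alive) else -1

    # The slide's three examples (target x1 AND x2 AND x4):
    examples = [((+1, +1, -1, +1, -1, +1), +1),
                ((-1, +1, -1, +1, -1, -1), -1),
                ((+1, +1, +1, -1, -1, +1), -1)]
    h = learn_conjunction(examples, 6)
    print([h(x) == label for x, label in examples])   # consistent with all three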
Halfspaces
Let h be a hyperplane in R^n:
h = {x : ∑_{i=1}^{n} w_i x_i = θ}.
h naturally induces a boolean function:
f : {+1,−1}^n → {+1,−1},
f(x) = sgn(∑_{i=1}^{n} w_i x_i − θ).
We call such a function a boolean halfspace,
or a weighted majority. The majority
function itself is an example (w_i ≡ 1, θ = 0).
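Concretely (my illustration; the weights below are arbitrary):

    # A boolean halfspace ("weighted majority"): f(x) = sgn(sum_i w_i*x_i - theta).
    # Ties are broken towards +1 here; that convention is mine.
    def halfspace(w, theta):
        return lambda x: +1 if sum(wi * xi for wi, xi in zip(w, x)) - theta >= 0 else -1

    majority5 = halfspace([1, 1, 1, 1, 1], 0)   # MAJORITY: w_i = 1, theta = 0
    print(majority5((+1, +1, -1, -1, +1)))      # +1, since three of the five bits are +1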
Learning halfspaces
Learning halfspaces is a very old problem;
dates back to models for the brain from
the ’50s: [Agmon-54, Rosenblatt-58,
Block-62].
The concept class of halfspaces has long
been known to be PAC learnable in
polynomial time via Linear Programming
[BEHW-89].
Indeed, this works over any distribution on
R^n, including those supported on {+1,−1}^n.
Learning halfspaces
Basic idea: given a bunch of examples, find a
halfspace which classifies them correctly.
By some learning theory technology
(“Occam’s Razor”), this is a good algorithm.
Consider the coefficients of a hypothesis
halfspace to be unknowns, a_1, …, a_n, θ.
Each example induces some linear
constraints: e.g., < (+ + − + − −), + >
induces a_1 + a_2 − a_3 + a_4 − a_5 − a_6 > θ. Solve the LP.
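A minimal sketch of this LP in Python (my encoding, using SciPy; not from the talk). I replace the strict inequality with a margin of 1, which is harmless for finitely many strictly separable examples since a consistent halfspace can be rescaled:

    # Sketch of halfspace learning by linear programming (my encoding, using SciPy).
    # Unknowns: weights a_1..a_n and threshold theta.  Each example <x, y> contributes
    # the constraint y*(a.x - theta) >= 1.
    import numpy as np
    from scipy.optimize import linprog

    def fit_halfspace(examples, n):
        A_ub, b_ub = [], []
        for x, y in examples:
            row = np.append(np.array(x, dtype=float), -1.0)   # coefficients of (a, theta)
            A_ub.append(-y * row)                              # encodes y*(a.x - theta) >= 1
            b_ub.append(-1.0)
        res = linprog(c=np.zeros(n + 1), A_ub=A_ub, b_ub=b_ub,
                      bounds=[(None, None)] * (n + 1))          # pure feasibility LP
        if not res.success:
            return None                                         # no consistent halfspace found
        a, theta = res.x[:n], res.x[n]
        return lambda z: +1 if np.dot(a, z) - theta > 0 else -1

    # Toy usage: labels given by the halfspace x1 + x2 - x3 > 0.
    examples = [((+1, +1, +1), +1), ((+1, -1, +1), -1), ((-1, +1, -1), +1), ((-1, -1, +1), -1)]
    h = fit_halfspace(examples, 3)
    print([h(x) == y for x, y in examples])                     # expect [True, True, True, True]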
Learning intersections of halfspaces
The next logical extension of this, and a very
important one, is learning intersections of
halfspaces.
Intersections of halfspaces form a very rich
concept class: all convex bodies, CNF
formulas…
Learning them is also an important problem
for computer vision, study of perceptrons.
But very little is known.
Prior work
- [Baum91]: poly time algorithm for intersection of
two halfspaces through the origin under symmetric
distributions (those satisfying D(x) = D(−x)).
- [Blum-Kannan, Vempala-97]: learn an intersection of
O(1) halfspaces in poly time over near-uniform
distributions on the Euclidean sphere
- not relevant for boolean halfspaces
- [Kwek-Pitt-98]: gave a polynomial-time algorithm, but
it requires membership queries
- also not relevant for boolean halfspaces
Our results
Theorem 1: The concept class of
arbitrary functions of k boolean
halfspaces over {+1,−1}^n
is learnable under the uniform distribution to
accuracy 1−ε in time:
n^{O(k²/ε²)}.
This is polynomial time if k = O(1) and ε = Ω(1).
(Prior to this, no algorithm could learn even an intersection
of 2 arbitrary boolean halfspaces under the uniform
distribution in subexponential time.)
Our results
Theorem 2: The concept class of
intersections of k boolean
halfspaces with weight bound W
is learnable under any probability distribution
to accuracy 1−ε in time:
n^{O(k log k log W)}/ε.
So if the weights are polynomially bounded,
one can learn an intersection of log many
halfspaces in quasipolynomial time.
More results

Function                                  | Halfspaces | Distrib. | Time
any fcn. of k                             | weight W   | any      | n^{O(k² log k log W)}/ε
weight-k threshold (e.g., inters. of k)   | weight W   | any      | n^{O(k log k log W)}/ε
intersection of k                         | weight W   | any      | n^{O(√W log k)}/ε
read-once intersection of k               | arbitrary  | uniform  | n^{O((log(k)/ε)²)}
read-once majority of k                   | arbitrary  | uniform  | n^{Õ((log(k)/ε)⁴)}
Sketch of techniques
For arbitrary distribution results:
show that functions of low weight
halfspaces have low degree
polynomial threshold representations.
For uniform distribution results:
show that functions of halfspaces have low
noise sensitivity.
Both conclusions imply learning results
generically.
Talk outline
Plan for the rest of the talk:
1. Prove n^{O(k log k log W)} bound for learning
intersections of k weight-W halfspaces
under arbitrary distributions.
(Sketch other arbit. dist. results.)
2. Prove n^{O(k²/ε²)} bound for learning arbitrary
functions of k halfspaces under the
uniform distribution.
(Sketch other unif. dist. results.)
Polynomial threshold functions
A (multilinear) polynomial p : R^n → R is a
PTF for f if it sign-represents f :
f(x) = sgn(p(x)) for all x ∈ {+1,−1}^n.
- every boolean halfspace is a degree-1 PTF for
itself
- every boolean function has a degree-n PTF
By linear programming [KS01]: if every
function in a class C has a PTF of degree
d, then C is learnable in time n^{O(d)}/ε.
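A minimal sketch of this reduction (my code; helper names are mine): a degree-d PTF over {+1,−1}^n is just a halfspace over the monomial features of degree at most d, so the same LP idea as on the previous slide applies in the lifted space.

    # Sketch (mine) of the [KS01]-style reduction: a degree-d PTF over {+1,-1}^n is just a
    # halfspace over the (at most n^d) monomial features, so the same LP idea applies.
    from itertools import combinations
    import numpy as np
    from scipy.optimize import linprog

    def monomials(x, d):
        # All multilinear monomials of degree 1..d evaluated at x (the constant term
        # is played by the threshold variable of the LP).
        n = len(x)
        return [np.prod([x[i] for i in S])
                for deg in range(1, d + 1) for S in combinations(range(n), deg)]

    def fit_ptf(examples, d):
        # Assumes some degree-d PTF consistent with the examples exists.
        lifted = [(monomials(x, d), y) for x, y in examples]
        m = len(lifted[0][0])
        A_ub = [-y * np.append(np.array(phi, dtype=float), -1.0) for phi, y in lifted]
        res = linprog(c=np.zeros(m + 1), A_ub=A_ub, b_ub=[-1.0] * len(lifted),
                      bounds=[(None, None)] * (m + 1))
        a, theta = res.x[:m], res.x[m]
        return lambda x: +1 if np.dot(a, monomials(x, d)) - theta > 0 else -1

    # Two-bit parity (the product x1*x2 in the +/-1 encoding) has no degree-1 PTF,
    # but it is trivially a degree-2 PTF.
    examples = [((+1, +1), +1), ((+1, -1), -1), ((-1, +1), -1), ((-1, -1), +1)]
    f = fit_ptf(examples, d=2)
    print([f(x) == y for x, y in examples])    # expect [True, True, True, True]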
PTFs for intersections of halfspaces
Suppose f and g are the linear forms
f(x) = ∑ w_i x_i − θ,   g(x) = ∑ w_i′ x_i − θ′.
We would like a PTF for sgn(f) ∧ sgn(g).
Failed attempt 1:
- try f(x)·g(x): it is > 0 if f(x) > 0 and g(x) > 0 (good),
  but it is also > 0 if f(x) < 0 and g(x) < 0 (bad).
Failed attempt 2:
- try f(x)+g(x): it is > 0 if f(x) > 0 and g(x) > 0 (good),
  and it is < 0 if f(x) < 0 and g(x) < 0 (good),
  but it is ?? if f(x) > 0 and g(x) < 0 (bad).
PTFs for intersections of halfspaces
The solution: apply a (polynomial?) function
to f and g to make them look more like
their sign.
Assume ∑|wi| < W. Then for all x  {+1,−1}n,
f(x), g(x)  [-W,-1] ∪ [1,W].
Beigel et al. [BRS95] showed how to
construct a univariate rational function
which is an essentially optimal
approximator of the sgn function on
[-W,-1] ∪ [1,W].
BRS’s sgn-approximator
p(x) = (x−1)(x−2)²(x−4)²(x−8)²(x−16)²(x−32)²
Q(x) = (p(−x) − p(x)) / (p(−x) + p(x))
Q is a rational function of
degree O(log k log W) such that:
Q(x) ∈ [1, 1+1/k] for x ∈ [1,W],
Q(x) ∈ [−1−1/k, −1] for x ∈ [−W,−1].
PTFs for intersections of halfspaces
Now given weight-W halfspaces h1, …, hk,
sgn(Q(h1(x)) + … + Q(hk(x)) − (k−½))
sign-represents their intersection: if every hi(x) ≥ 1 then every Q(hi(x)) ≥ 1
and the sum is at least k > k−½, while if some hi(x) ≤ −1 then that term is
≤ −1 and each of the rest is ≤ 1+1/k, so the sum is at most k−1−1/k < k−½.
Once brought to a common denominator, this rational function has
degree O(k log k log W). It is easy to get a polynomial: sgn(p/q) = sgn(p·q).
So we have a PTF for the intersection of k
weight-W halfspaces of degree
O(k log k log W). Hence a learning
algorithm running in time n^{O(k log k log W)}.
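To make the construction concrete, here is a small brute-force check (my code) that the rational expression above sign-represents the intersection of two toy integer-weight halfspaces. It reuses the W = 32 approximator Q from the previous slide, which is overkill for weights this small but shows the mechanism; the degree bound comes from clearing denominators, as stated above.

    # Brute-force check (mine) that sgn(Q(h1) + Q(h2) - (k - 1/2)) sign-represents the
    # intersection of two toy integer-weight halfspaces on {+1,-1}^5.
    from itertools import product

    def p(t):
        return (t - 1) * (t - 2)**2 * (t - 4)**2 * (t - 8)**2 * (t - 16)**2 * (t - 32)**2

    def Q(t):
        return (p(-t) - p(t)) / (p(-t) + p(t))

    h1 = lambda x: x[0] + x[1] + x[2]     # value is always in {-3,-1,+1,+3}, never 0
    h2 = lambda x: x[2] + x[3] + x[4]
    k = 2

    ok = True
    for x in product([+1, -1], repeat=5):
        intersection = +1 if (h1(x) > 0 and h2(x) > 0) else -1
        ptf_sign = +1 if Q(h1(x)) + Q(h2(x)) - (k - 0.5) > 0 else -1
        ok = ok and (intersection == ptf_sign)
    print(ok)                             # expect True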
Talk outline
Plan for the talk:
1. Prove n^{O(k log k log W)} bound for learning
intersections of k weight-W halfspaces
under arbitrary distributions.

2. Prove n^{O(k²/ε²)} bound for learning
functions of k halfspaces under the
uniform distribution.

Noise sensitivity
Let f : {+1,−1}^n → {+1,−1} be a boolean
function. Pick x ∈ {+1,−1}^n uniformly at
random, and let y be an ε-corruption of x:
flip each bit of x independently with
probability ε.
defn: The noise sensitivity of f is:
NSε(f) = Pr[f(x) ≠ f(y)].
Noise sensitivity examples
• Let f be a projection to one bit,
f(x1, …, xn) = x1.
Then NSε(f) = ε.
• Suppose f depends on only k bits.
Then NSε(f) ≤ k ε.
• PARITY is the most noise-sensitive
function:
NSε(PARITY_n) = ½ − ½(1−2ε)^n.
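These values are easy to confirm empirically; a Monte Carlo sketch (my code, with arbitrary parameters):

    # Monte Carlo sanity check (mine) of the noise-sensitivity examples above.
    import numpy as np
    rng = np.random.default_rng(0)

    def noise_sensitivity(f, n, eps, trials=200_000):
        x = rng.choice([+1, -1], size=(trials, n))
        y = np.where(rng.random((trials, n)) < eps, -x, x)   # flip each bit w.p. eps
        return np.mean(f(x) != f(y))

    n, eps = 15, 0.1
    dictator = lambda z: z[:, 0]                    # NS_eps = eps
    parity   = lambda z: np.prod(z, axis=1)         # NS_eps = 1/2 - 1/2*(1-2*eps)^n
    print(noise_sensitivity(dictator, n, eps))                          # ~0.1
    print(noise_sensitivity(parity, n, eps), 0.5 - 0.5*(1 - 2*eps)**n)  # both ~0.48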
Noise sensitivity – study and apps.
• [Benjamini-Kalai-Schramm-98] – percolation,
low-level circuit complexity
• [Kahn-Kalai-Linial-88] – random walks on the
hypercube
• [Håstad-97] – probabilistically checkable proofs
• [Bshouty-Jackson-Tamon-99] – learning theory
under noise
• [O-02] – Yao’s XOR Lemma, average case
hardness of NP
• [Bourgain-02, Kindler-Safra-02, FKRSS-02] –
study of juntas, Fourier analysis of boolean fcns.
Low noise sens.  fast learning
We show that if the noise sensitivity of all f in
C is uniformly bounded:
NSε(f) ≤ α(ε),
then C is learnable under the uniform
distribution in time:
n^{O(1)/α^{−1}(ε/3)}.
Intuition: if f is not too noise sensitive,
nearby points are highly correlated, so a
net of examples works.
Proof of NS-learning connection
Actually, the intuition is wrong. Here is the
proper proof sketch:
Low noise sensitivity  Fourier spectrum
concentrated at low levels; this uses the
formula: NSε(f) = ½−½ Σ(1−2ε)|S| f(S)
ˆ 2 and a
Markovish inequality.
Low level Fourier concentration  efficient
uniform distribution learning; this is by the
“Low degree” Fourier sampling learning
algorithm of [Linial-Mansour-Nisan-93].
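A minimal sketch of that Low-Degree algorithm (my code; sample sizes are arbitrary): estimate each Fourier coefficient f̂(S) for |S| ≤ d as the empirical mean of f(x)·χ_S(x) over uniform examples, then predict with the sign of the truncated Fourier expansion.

    # Minimal sketch (mine) of the "Low Degree" algorithm of [Linial-Mansour-Nisan-93].
    from itertools import combinations
    import numpy as np

    def low_degree_learn(examples, n, d):
        # examples: list of (x, f(x)) with x drawn uniformly from {+1,-1}^n.
        X = np.array([x for x, _ in examples], dtype=float)
        y = np.array([fx for _, fx in examples], dtype=float)
        sets = [S for deg in range(d + 1) for S in combinations(range(n), deg)]
        coeffs = {S: np.mean(y * np.prod(X[:, list(S)], axis=1)) for S in sets}
        def h(x):
            val = sum(c * np.prod([x[i] for i in S]) for S, c in coeffs.items())
            return +1 if val > 0 else -1
        return h

    # Usage sketch: learn MAJORITY on 7 bits from uniform examples, truncating at degree 3.
    rng = np.random.default_rng(0)
    maj = lambda x: +1 if sum(x) > 0 else -1
    train = [(tuple(xi), maj(xi)) for xi in rng.choice([+1, -1], size=(5000, 7))]
    h = low_degree_learn(train, n=7, d=3)
    test = [tuple(xi) for xi in rng.choice([+1, -1], size=(2000, 7))]
    print(np.mean([h(x) == maj(x) for x in test]))   # expect accuracy close to 1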
Noise sensitivity of halfspaces
Function                                 | NSε                | proof
one boolean halfspace                    | O(√ε)              | Y. Peres, '98
any function of k halfspaces             | O(k√ε)             | union bound
read-once intersection of k halfspaces   | O(√ε log k)        | difficult probabilistic analysis
read-once majority of k halfspaces       | Õ((ε log k)^{1/4}) | difficult probabilistic analysis
Consequences
Let C be the class of functions of k boolean
halfspaces. Take α(ε) = O(k√ε), so all
f ∈ C have NSε(f) ≤ α(ε).
α^{−1}(ε/3) = O(ε²/k²).
Hence we get Theorem 1:
a uniform-distribution
learning algorithm running
in time n^{O(k²/ε²)}.
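Spelled out (my arithmetic, writing c for the constant hidden in α): α(ε') = c·k·√ε' ≤ ε/3 exactly when ε' ≤ ε²/(9c²k²), so α^{−1}(ε/3) = Θ(ε²/k²), and the generic bound n^{O(1)/α^{−1}(ε/3)} becomes n^{O(k²/ε²)}.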
Noise sensitivity of a halfspace
We now sketch Peres’s beautiful proof that
the noise sensitivity of a single halfspace is
O(√ε).
Suppose the halfspace is f = sgn(∑ w_i x_i − θ).
Without (much) loss of generality, one can
assume θ = 0. Recall that the x_i's are selected
randomly from {+1,−1} and the sum is
formed; then each x_i is flipped independently with
probability ε. We want to show that the probability
that the two sums land on opposite sides of 0 – call
this event a "flop", with probability P – is O(√ε).
Noise sensitivity of a halfspace
With high probability, the number of flipped
bits is about k := εn. Let’s assume we
always flip exactly k random bits, and that k
divides n. (Both assumptions are easily
removed.)
We now model the problem thus: pick signs
x_i at random; randomly permute the
weights; divide the weights into n/k blocks
of size k. Form the n/k block sums
X1 = ∑_{i=1…k} w_i x_i,   X2 = ∑_{i=k+1…2k} w_i x_i,   etc.
Noise sensitivity of a halfspace
Write S = X1 + … + Xn/k for the initial sum.
Because of the permutation, we may
assume that the random signs in the first
block are the “flips”. Put S' = S − X1, so the
sum before flipping is S'+X1, and the sum
after flipping is S'−X1. We are trying to
bound the probability P that these two sums
have opposite signs (a flop). Note that this
happens iff |S'| < |X1|.
Noise sensitivity of a halfspace
sgn(X1) and S' are independent, so:
Pr[sgn(X1) ≠ sgn(S')] = ½.
sgn(X1) and |X1| are independent, so:
• Pr[sgn(X1) ≠ sgn(S') | |S'| > |X1|] = ½
• Pr[sgn(X1) ≠ sgn(S) | |S'| > |X1|] = ½   (when |S'| > |X1| there is no flop and sgn(S) = sgn(S'))
• Pr[sgn(X1) ≠ sgn(S) & no flop] = ½(1−P)
• Pr[sgn(X1) ≠ sgn(S)] = ½(1−P)   (a flop forces sgn(S) = sgn(X1), so flops contribute nothing)
⟹ P = 2 E[½ − I[sgn(X1) ≠ sgn(S)]].
Noise sensitivity of a halfspace
Of course, there was nothing special about
block X1 as opposed to any other block. So
in fact,
P = 2 E[½ − I[sgn(Xi) ≠ sgn(S)]]   for all i = 1…n/k.
Write τ=sgn(S), σi=sgn(Xi), and average:
P = 2 E[½ – (k/n) ∑i I[τ ≠ σi]].
Noise sensitivity of a halfspace
P = 2 E[½ – (k/n) ∑i I[τ ≠ σi]]
The quantity inside the expectation is some
random variable, a number which is either
½ – (k/n) ∑i I[1 ≠ σi] or ½ – (k/n) ∑i I[−1 ≠ σi].
If I tell you a number is either a or b, then
assuredly it’s at most |a| + |b|. Applying
this to the expectation, pointwise:
P ≤ 2 E[|½ – (k/n) ∑i I[σi=1]|
+ |½ – (k/n) ∑i I[σi=−1]|].
Noise sensitivity of a halfspace
P ≤ 2 E[ |½ − ε ∑_{i=1…1/ε} I[σi=1]| + |½ − ε ∑_{i=1…1/ε} I[σi=−1]| ]
But the σi’s are simply independent, uniformly
random signs. Hence both quantities in the
expectation are merely the expected
absolute deviation from the mean in 1/ε
samples of an unbiased 0/1 random
variable – i.e., O(√ε).
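An empirical illustration (my code; the weights and parameters below are arbitrary) that the noise sensitivity of a halfspace indeed scales like √ε:

    # Empirical illustration (mine): NS_eps of a halfspace grows like sqrt(eps).
    import numpy as np
    rng = np.random.default_rng(0)

    n, trials = 201, 50_000
    w = rng.standard_normal(n)                      # a "typical" halfspace through the origin
    f = lambda X: np.sign(X @ w)

    for eps in [0.001, 0.004, 0.016, 0.064]:
        x = rng.choice([+1.0, -1.0], size=(trials, n))
        y = np.where(rng.random((trials, n)) < eps, -x, x)   # eps-corruption of x
        ns = np.mean(f(x) != f(y))
        print(eps, ns, ns / np.sqrt(eps))           # the last column stays roughly constant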
Extensions
This concludes the proof that a single
halfspace has noise sensitivity O(√ε), from
which the uniform distribution learning
algorithm for functions of k halfspaces
follows.
To get the extended learning algorithms, one
must work harder at analyzing noise
sensitivity. Key result: if a halfspace h is
biased – say, the probability of +1 is p < ½ –
then:
NSε(h) ≤ min{ 2p, C·p·√(ε·log(1/p)) }.
Talk outline
Plan for the talk:
1. Prove n^{O(k log k log W)} bound for learning
intersections of k weight-W halfspaces
under arbitrary distributions.

2. Prove n^{O(k²/ε²)} bound for learning
functions of k halfspaces under the
uniform distribution.

Open technical challenges
• Give an upper bound on the degree
necessary for a PTF which represents the
AND of two arbitrary halfspaces.
(For a new lower bound, see my talk
tomorrow!)
• Give a better analysis of the noise
sensitivity of the intersection of k
halfspaces on n bits. Is it O(√(ε log k))?
The huge open problem
It still remains open how to learn an
intersection of two arbitrary boolean
halfspaces under an arbitrary distribution
in subexponential time!