Transcript of PPT Slides

• By Gil Kalai
• Institute of Mathematics and Center for
Rationality, Hebrew University, Jerusalem,
Israel
• Presented by: Yair Cymbalista
1
• In this lecture we would like to study the extent to
which concepts of choice used in economic theory
imply “learnable” behavior.
• Learning theory, which was introduced and developed by Valiant, Angluin, and others in the early 1980s, deals with the question of how well a family of objects (functions, in this lecture) is learnable from examples. Valiant's learning concept is statistical, i.e., the examples are chosen randomly.
2
• In order to analyze learnability we will use a basic model of statistical learning theory introduced by Valiant, called the model of PAC-learnability (PAC stands for “probably approximately correct”). It is based on choosing examples randomly according to some probability distribution.
3
• Let $F \subseteq \{\, f \mid f : U \to Y \,\}$. Let $0 < \epsilon < 1$ and let $\mu$ be a probability distribution on $U$. We say that $F$ is learnable from $t$ examples with probability of at least $1-\epsilon$ with respect to the probability distribution $\mu$ if the following assertion holds:
4
• For every $f \in F$: if $u_1, u_2, \ldots, u_t$ and $u$ are chosen at random and independently according to the probability distribution $\mu$, and if $f' \in F$ satisfies $f'(u_i) = f(u_i)$ for $i = 1, 2, \ldots, t$, then $f'(u) = f(u)$ with probability of at least $1-\epsilon$.
• We say that $F$ is learnable from $t$ examples with probability of at least $1-\epsilon$ if this is the case with respect to every probability distribution $\mu$ on $U$.
5
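To make the definition concrete, here is a minimal Python simulation sketch (not from the lecture; the function class, the uniform distribution, and all names are our own illustrative choices). It repeatedly draws $t$ examples, picks any $f' \in F$ consistent with them, and measures how often $f'$ agrees with the target on a fresh point.

```python
import random

# Toy instance of the PAC definition: U = {0,...,9}, Y = {0,1},
# F = threshold functions f_a(u) = 1 iff u >= a (illustrative choice).
U = list(range(10))
THRESHOLDS = list(range(11))          # each a in 0..10 defines one f_a in F

def f(a, u):
    return 1 if u >= a else 0

def trial(t, rng):
    """One experiment: does a hypothesis consistent with t random
    examples agree with the target on a fresh random point?"""
    target = rng.choice(THRESHOLDS)                  # the unknown f in F
    examples = [rng.choice(U) for _ in range(t)]     # u_1,...,u_t ~ mu (uniform)
    consistent = [a for a in THRESHOLDS
                  if all(f(a, u) == f(target, u) for u in examples)]
    guess = rng.choice(consistent)                   # any consistent f' in F
    u = rng.choice(U)                                # the fresh point u ~ mu
    return f(guess, u) == f(target, u)

rng = random.Random(0)
for t in (1, 5, 20):
    hits = sum(trial(t, rng) for _ in range(10_000))
    print(f"t={t:2d}: empirical P[f'(u) = f(u)] ~ {hits / 10_000:.3f}")
```

As $t$ grows, the empirical agreement probability approaches 1, matching the quantifier structure of the definition.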
• Given a set $X$ of $N$ alternatives, a choice function $c$ is a mapping which assigns to every nonempty subset $S \subseteq X$ an element $c(S) \in S$.
• A “rationalizable” choice function is consistent with maximizing behavior, i.e., there is a linear ordering on the alternatives in $X$ and $c(S)$ is the maximum among the elements of $S$ with respect to this ordering (see the sketch after this slide).
6
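As a minimal illustration of rationalizable choice (our own sketch, with a linear order represented as a best-to-worst Python list):

```python
def rational_choice(order):
    """Return the choice function rationalized by a linear order.

    `order` lists the alternatives from best to worst; c(S) is the
    element of S appearing earliest in `order`."""
    rank = {x: i for i, x in enumerate(order)}
    return lambda S: min(S, key=lambda x: rank[x])

c = rational_choice(['b', 'a', 'c'])    # the ordering b > a > c
print(c({'a', 'c'}))                    # -> 'a'
print(c({'a', 'b', 'c'}))               # -> 'b'
```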
• Rationalizable choice functions are characterized by the Independence of Irrelevant Alternatives condition (IIA): the element chosen from a set is also chosen from every subset that contains it (a brute-force check is sketched after this slide).
• A class of choice functions is symmetric if it is
invariant under permutations of the alternatives.
7
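Here is a brute-force IIA checker (our sketch; exponential in $|X|$, intended only for tiny examples):

```python
from itertools import combinations

def satisfies_iia(X, c):
    """Check IIA by brute force: the element chosen from S must also be
    chosen from every subset T of S that contains it."""
    elems = list(X)
    subsets = [frozenset(s) for r in range(1, len(elems) + 1)
               for s in combinations(elems, r)]
    return all(c(T) == c(S)
               for S in subsets for T in subsets
               if T <= S and c(S) in T)

rank = {'b': 0, 'a': 1, 'c': 2}                     # the order b > a > c
c = lambda S: min(S, key=lambda x: rank[x])
print(satisfies_iia({'a', 'b', 'c'}, c))            # -> True
```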
• In this lecture we will concentrate on the proofs of the next two theorems.
• Theorem 1.1. A rationalizable choice function
can be statistically learned with high probability
from a number of examples which is linear in the
number of alternatives.
• Theorem 1.2. Every symmetric class of choice
functions requires at least a number of examples
which is linear in the number of alternatives for
learnability in the PAC-model.
8
Let $F \subseteq \{\, f \mid f : U \to Y \,\}$.
The P-dimension of $F$, denoted by $\dim_P F$, is the maximal number $s$ of elements $u_1, u_2, \ldots, u_s$ of $U$ and values $y_1, y_2, \ldots, y_s$ of $Y$ such that for every subset $B \subseteq \{1, 2, \ldots, s\}$ there is a function $f_B \in F$ so that $f_B(u_i) = y_i$ if $i \in B$ and $f_B(u_i) \neq y_i$ if $i \notin B$.
9
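The P-dimension of a small finite class can be computed by brute force directly from this definition (our sketch; exponential, toy sizes only):

```python
from itertools import combinations, product

def p_dimension(U, Y, F):
    """Largest s such that some points u_1..u_s and values y_1..y_s are
    shattered: every agreement pattern in {0,1}^s is realized by some
    f in F.  Functions are dicts u -> y."""
    best = 0
    for s in range(1, len(U) + 1):
        for pts in combinations(U, s):
            for vals in product(Y, repeat=s):
                patterns = {tuple(f[u] == y for u, y in zip(pts, vals))
                            for f in F}
                if len(patterns) == 2 ** s:      # shattered
                    best = s
                    break
            else:
                continue
            break
    return best

U, Y = [0, 1, 2], [0, 1]
F = [dict(zip(U, v)) for v in product(Y, repeat=len(U))]   # all 8 functions
print(p_dimension(U, Y, F))                                # -> 3
```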
• Theorem 2.1. For a fixed value of $0 < \epsilon < 1$, the number of examples $t$ needed to learn a class of functions with probability of at least $1-\epsilon$ is bounded above and below by a linear function of the P-dimension.
10
• Proposition 2.2. Let $F \subseteq \{\, f \mid f : U \to Y \,\}$. Then $\dim_P(F) \leq \log_2 |F|$.
• Proof: This is immediate, since for the P-dimension of $F$ to be $s$ we need at least $2^s$ distinct functions $f_B$, one for each subset $B \subseteq \{1, \ldots, s\}$.
11
Let $F \subseteq \{\, f \mid f : U \to Y \,\}$ and let $U' \subseteq U$. Let $F' = \{\, f|_{U'} \mid f \in F \,\}$.
• $\dim_P(F') \leq \dim_P(F)$.
• If $F$ can be learned from $t$ examples with probability of at least $1-\epsilon$, then so can $F'$ (since we can choose $\mu$ to be supported only on $U'$).
12
• Let $C$ be the class of rationalizable choice functions defined on nonempty subsets of a set $X$ of alternatives, where $|X| = N$.
• $|C| = N!$ (each of the $N!$ linear orderings of $X$ induces a distinct rationalizable choice function).
• Proposition 2.2 $\Rightarrow$ $\dim_P C \leq \log_2 N! \approx N \log_2 N$.
• Theorem 2.1 $\Rightarrow$ the number of examples required to learn a rationalizable choice function in the PAC-model is $O(N \log N)$.
13
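For reference, the estimate $\log_2 N! \approx N \log_2 N$ used above is elementary:
$$\log_2 N! = \sum_{k=1}^{N} \log_2 k \leq N \log_2 N,$$
while Stirling's formula gives $\log_2 N! = N \log_2 N - N \log_2 e + O(\log N)$, so $\log_2 N! = (1 - o(1))\, N \log_2 N$.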
• Theorem 3.1. $\dim_P C = N - 1$.
• Proof: We will first show that $\dim_P C \geq N - 1$. Consider the elements $u_i = \{a_1, a_{i+1}\}$, $y_i = a_1$, for $1 \leq i \leq N - 1$.
For each $B \subseteq \{u_1, u_2, \ldots, u_{N-1}\}$, an order relation which satisfies $a_{i+1} \prec a_1$ if $u_i \in B$ and $a_{i+1} \succ a_1$ if $u_i \notin B$ gives an appropriate function $f_B$ (a verification sketch follows this slide).
14
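The lower-bound construction can be checked mechanically for a small $N$ (our verification sketch; alternatives are $0, \ldots, N-1$ with $a_1 = 0$):

```python
from itertools import combinations

# For every B, order the alternatives so that i > 0 exactly when u_i is
# outside B; then c({0, i}) = 0 (= y_i) holds if and only if u_i is in B.
N = 5
ok = True
for r in range(N):
    for B in combinations(range(1, N), r):
        B = set(B)
        order = [i for i in range(1, N) if i not in B] + [0] + sorted(B)
        rank = {x: pos for pos, x in enumerate(order)}   # smaller pos = better
        choice = lambda S: min(S, key=lambda x: rank[x]) # the function f_B
        ok &= all((choice({0, i}) == 0) == (i in B) for i in range(1, N))
print("all 2^(N-1) patterns realized:", ok)              # -> True
```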
Proof of Theorem 3.1 (cont.)
• We will now show that $\dim_P C \leq N - 1$. We need to show that for every $N$ sets $A_1, A_2, \ldots, A_N$ and elements $a_i \in A_i$, $i = 1, 2, \ldots, N$, there is a subset $S \subseteq \{1, 2, \ldots, N\}$ such that there is no linear order on the ground set $X$ for which $a_i$ is the maximal element of $A_i$ if and only if $i \in S$.
Let $s_k = |A_k|$; clearly we can assume $s_k > 1$ for every $k$.
Let $X = \{x_1, x_2, \ldots, x_N\}$.
15
Proof of Theorem 3.1 (cont.)
For every $j = 1, 2, \ldots, N$ consider the following vector $v_j = (v_j^1, v_j^2, \ldots, v_j^N) \in \mathbb{R}^N$, where:
• $v_j^k = 0$ if $x_k \notin A_j$
• $v_j^k = s_j - 1$ if $x_k = a_j$
• $v_j^k = -1$ if $x_k \in A_j$ and $x_k \neq a_j$
16
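A direct transcription of these vectors (our sketch, with an arbitrary small example for the sets $A_j$):

```python
N = 4
X = list(range(N))
A = [{0, 1}, {1, 2}, {2, 3}, {0, 3}]    # illustrative sets A_1..A_N
a = [0, 1, 2, 3]                         # designated elements a_j in A_j

def v(j):
    s_j = len(A[j])
    return [0 if k not in A[j]           # x_k outside A_j
            else s_j - 1 if k == a[j]    # the designated element a_j
            else -1                      # other elements of A_j
            for k in X]

vs = [v(j) for j in range(N)]
print(vs)
print("coordinate sums:", [sum(row) for row in vs])   # all 0
```

Each $v_j$ has one entry equal to $s_j - 1$ and $s_j - 1$ entries equal to $-1$, so its coordinates sum to 0; this is exactly why all $N$ vectors fit in the $(N-1)$-dimensional subspace $V$ of the next slide.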
Proof of Theorem 3.1 (cont.)
• Note that all vectors $v_j$ belong to the $(N-1)$-dimensional subspace $V \subseteq \mathbb{R}^N$ of vectors whose sum of coordinates is 0. Therefore the vectors $v_1, v_2, \ldots, v_N$ are linearly dependent. Suppose now that $\sum_{j=1}^{N} r_j v_j = 0$ and that not all $r_j$'s equal zero.
• Let $S = \{\, j \mid r_j > 0 \,\}$. We will now show that there is no $c \in C$ such that $c(A_k) = a_k$ when $k \in S$ and $c(A_k) \neq a_k$ when $k \notin S$.
17
Proof of Theorem 3.1 (cont.)
• Assume to the contrary that there is such a function $c$. Let $B = \bigcup \{\, A_j \mid r_j \neq 0 \,\}$ and let $x_m = c(B)$. Denote by $y$ the $m$-th coordinate of the linear combination $\sum_{j=1}^{N} r_j v_j$. We will show that $y$ is positive.
If $r_j = 0$ or if $x_m \notin A_j$, then $r_j v_j^m = 0$.
18
Proof of Theorem 3.1 (cont.)
• Assume now that $r_j \neq 0$ and $x_m \in A_j$. Since $A_j \subseteq B$ and $x_m = c(B) \in A_j$, IIA gives $c(A_j) = x_m$. There are two cases to consider:
• $r_j > 0 \Rightarrow j \in S \Rightarrow a_j = c(A_j) = x_m \Rightarrow r_j v_j^m = r_j (s_j - 1) > 0$.
• $r_j < 0 \Rightarrow j \notin S \Rightarrow a_j \neq c(A_j) = x_m \Rightarrow r_j v_j^m = r_j \cdot (-1) > 0$.
Therefore $y = \sum_{j=1}^{N} r_j v_j^m > 0$, contradicting $\sum_{j=1}^{N} r_j v_j = 0$. Contradiction!
19
• Theorem 1.1. A rationalizable choice function
can be statistically learned with high probability
from a number of examples which is linear in the
number of alternatives.
• Proof: Theorem 1.1 follows from Theorem 3.1 and Theorem 2.1.
20
• Let $F \subseteq \{\, f \mid f : U \to Y \,\}$. Let $0 < \epsilon, \delta < 1$ and let $\mu$ be a probability distribution on $U$. Let $g$ be an arbitrary function $g : U \to Y$. Define the distance from $g$ to $F$, $\mathrm{dist}(g, F)$, to be the minimum over $f \in F$ of the probability, with respect to $\mu$, that $f(x) \neq g(x)$.
• Given $t$ random elements $u_1, u_2, \ldots, u_t$ (drawn independently according to $\mu$), define the empirical distance of $g$ to $F$, $\mathrm{dist}_{emp}(g, F)$, as
$$\mathrm{dist}_{emp}(g, F) = \min_{f \in F} \frac{|\{\, i : f(u_i) \neq g(u_i) \,\}|}{t}.$$
21
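The empirical distance is a one-liner for a finite class (our sketch; functions are plain Python callables):

```python
# dist_emp(g, F): smallest fraction of sample points on which some f in F
# disagrees with g, exactly as in the definition above.
def dist_emp(g, F, samples):
    return min(sum(f(u) != g(u) for u in samples) / len(samples) for f in F)

F = [lambda u, a=a: int(u >= a) for a in range(4)]   # a small threshold class
g = lambda u: int(u in (1, 2))                       # g need not belong to F
print(dist_emp(g, F, samples=[0, 1, 2, 3]))          # -> 0.25
```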
• Theorem 4.1. There exists $K(\epsilon, \delta)$ such that for every probability distribution $\mu$ on $U$ and every function $g : U \to Y$, the number of independent random examples $t$ needed so that $|\mathrm{dist}(g, F) - \mathrm{dist}_{emp}(g, F)| < \epsilon$ with probability of at least $1 - \delta$ is at most $K(\epsilon, \delta) \cdot \dim_P(F)$.
22
• Corollary 4.2. For every probability distribution $\mu$ on $U$ and every function $g : U \to Y$: if $g$ agrees with a function in $F$ on $t$ independent random examples and $t \geq K(\epsilon, \delta) \cdot \dim_P(F)$, then $\mathrm{dist}(g, F) < \epsilon$ with probability of at least $1 - \delta$.
23
• The class of rationalizable choice functions is symmetric under relabeling of the alternatives. Mathematically speaking, every permutation $\pi$ on $X$ induces a symmetry among all choice functions, given by $\pi(c)(S) = \pi^{-1}(c(\pi(S)))$.
• A class of choice functions will be called
symmetric if it is closed under all permutations of
the ground set of alternatives X.
24
• A choice function defined on pairs of elements is an asymmetric preference relation. Every choice function describes an asymmetric preference relation when restricted to pairs of elements.
• Every choice function defined on pairs of elements of $X$ describes a tournament whose vertices are the elements of $X$, such that for two elements $x, y \in X$, $c(\{x, y\}) = x$ if and only if the tournament has an edge oriented from $x$ to $y$ (a small sketch follows this slide).
25
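The tournament induced by a choice function on pairs, as a list of directed edges (our sketch):

```python
from itertools import combinations

def tournament_edges(X, c):
    """Edge (x, y) means c({x, y}) = x, i.e. 'x beats y'."""
    return [(x, y) if c({x, y}) == x else (y, x)
            for x, y in combinations(X, 2)]

rank = {'a': 0, 'b': 1, 'c': 2}                 # the order a > b > c
c = lambda S: min(S, key=lambda x: rank[x])
print(tournament_edges(['a', 'b', 'c'], c))
# -> [('a', 'b'), ('a', 'c'), ('b', 'c')]  (a transitive tournament)
```

A rationalizable choice function yields a transitive tournament, while an arbitrary choice function on pairs can realize any tournament.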
• Theorem 5.1. (1) The P-dimension of every symmetric class $C$ of preference relations (considered as choice functions on pairs) on $N$ alternatives is at least $N/2$.
(2) When $N \geq 8$, the P-dimension is at least $N - 1$.
(3) When $N \geq 68$, if the P-dimension is precisely $N - 1$, then the class is the class of order relations.
26
Proof of Theorem 5.1
(1) Let $X = \{x_1, x_2, \ldots, x_N\}$ and $m = \lfloor N/2 \rfloor$. Let $A_i = \{x_{2i-1}, x_{2i}\}$, $1 \leq i \leq m$. Let $c \in C$ and assume without loss of generality that $c(A_i) = x_{2i-1}$, $1 \leq i \leq m$.
Let $R \subseteq \{1, 2, \ldots, m\}$. We define the permutation $\pi_R$ as follows:
If $k \in R$ then $\pi_R(x_{2k-1}) = x_{2k-1}$ and $\pi_R(x_{2k}) = x_{2k}$.
If $k \notin R$ then $\pi_R(x_{2k-1}) = x_{2k}$ and $\pi_R(x_{2k}) = x_{2k-1}$.
(If $N$ is odd, define $\pi_R(x_N) = x_N$.)
27
Proof of Theorem 5.1 (cont.)
If $k \in R$ then: $\pi_R(c)(A_k) = \pi_R^{-1}(c(\pi_R(A_k))) = \pi_R^{-1}(c(A_k)) = \pi_R^{-1}(x_{2k-1}) = x_{2k-1} = c(A_k)$.
If $k \notin R$ then: $\pi_R(c)(A_k) = \pi_R^{-1}(c(\pi_R(A_k))) = \pi_R^{-1}(c(A_k)) = \pi_R^{-1}(x_{2k-1}) = x_{2k} \neq c(A_k)$.
Therefore $\pi_R(c)(A_k) = x_{2k-1} = c(A_k) \iff k \in R$, and hence $\pi_R(c)$ is satisfactory.
28
Proof of Theorem 5.1(cont.)
(2) To prove part (2) we will use the next conjecture
made by Rosenfeld and proved by Havet and
Thomson: When N 8, for every path P on N
vertices with an arbitrary orientation of the edges,
every tournament on N vertices contains a copy of
P.
29
Proof of Theorem 5.1(cont.)
Let c be a choice function in the class and consider
the tournament T described by c.
Let Ak  {xk , xk 1}, 1  k  N  1. Every choice
function c' on A1 , A2 ,..., AN 1 describes a
directed path P. Suppose that a copy of P can be
found in our tournament and that the vertices of
this copy (in the order they appear on the path) are
xi1 , xi2 ,..., xiN . Define a permutation  by  ( x j )  xi j .
The choice function (c) will agree with c' on
A1 , A2 ,..., AN 1.
30
• Theorem 5.1 implies the following corollary:
Corollary 5.2. The P-dimension of every symmetric class of choice functions on $N$ alternatives, $N \geq 8$, is at least $N - 1$.
31
• Consider a tournament with $N$ players such that for two players $i$ and $j$ there is a probability $p_{ij}$ that $i$ beats $j$ in a match between the two. Among a set $A \subseteq \{1, \ldots, N\}$ of players, let $c(A)$ be the player most likely to win a tournament involving the players in $A$.
• Consider the class $W$ of choice functions that arise in this model as the $p_{ij}$ range over $[0, 1]$.
32
• Theorem 6.1. The class of choice functions $W$ can be learned from $O(N^3)$ examples in the PAC-model.
33
• Consider $m$ polynomials $Q_i(x_1, \ldots, x_r)$, $1 \leq i \leq m$, in $r$ variables $x_1, x_2, \ldots, x_r$. A point $c \in \mathbb{R}^r$ has the sign pattern $(s_1, s_2, \ldots, s_m) \in \{-1, 0, 1\}^m$, where $s_j = \mathrm{sign}\, Q_j(c)$, namely:
$s_j = -1$ if $Q_j(c) < 0$,
$s_j = 0$ if $Q_j(c) = 0$,
$s_j = 1$ if $Q_j(c) > 0$.
34
• Theorem 6.2 (Warren). If the degree of every $Q_j$ is at most $D$ and if $2m > r$, then the number of sign patterns given by the polynomials $Q_1, Q_2, \ldots, Q_m$ is at most $(8eDm/r)^r$.
35
• Given a set $A$ of players, the probability that the $k$-th player will be the winner in a tournament between the players of $A$ is described by a polynomial $Q(A, k)$ in the variables $p_{ij}$, as follows:
Let $M = (m_{ij})$ be an $s \times s$ matrix (where $s = |A|$) representing the outcomes of all matches between the players of $A$ in a possible tournament, such that $m_{ii} = 0$ for $1 \leq i \leq s$, $m_{ij} = 1$ if player $i$ won the match against player $j$, and $m_{ij} = 0$ otherwise.
36
Proof of Theorem 6.1 (cont.)
• The probability $p_M$ that such a matrix $M$ represents the results of the matches in a tournament is:
$$p_M = \prod_{i, j \in A,\; m_{ij} = 1} p_{ij}.$$
• Define $Q(A, k) = \sum \{\, p_M : k \text{ won in the tournament represented by } M \,\}$.
• $c(A)$ is the player $k \in A$ for which $Q(A, k)$ is maximal.
37
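For small player sets, $Q(A, k)$ can be computed by enumerating all outcome matrices $M$ (our sketch). The slides leave implicit how $M$ determines the winner; purely for illustration we assume a round-robin in which the winner is the player with the most wins, ties going to the lowest-numbered player.

```python
from itertools import combinations, product

def Q(A, k, p):
    """Sum of p_M over all outcome matrices M in which player k wins.
    p[(i, j)] (for i < j) is the probability that i beats j.
    ASSUMED winner rule: most round-robin wins, ties to the lowest index."""
    A = sorted(A)
    pairs = list(combinations(A, 2))
    total = 0.0
    for outcome in product([0, 1], repeat=len(pairs)):   # all matrices M
        p_M, wins = 1.0, {i: 0 for i in A}
        for (i, j), i_won in zip(pairs, outcome):
            p_M *= p[(i, j)] if i_won else 1 - p[(i, j)]
            wins[i if i_won else j] += 1
        if max(A, key=lambda i: (wins[i], -i)) == k:
            total += p_M
    return total

p = {(0, 1): 0.9, (0, 2): 0.5, (1, 2): 0.5}   # p[(i, j)] = P(i beats j)
print([round(Q({0, 1, 2}, k, p), 3) for k in (0, 1, 2)])   # values sum to 1
```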
Proof of Theorem 6.1(cont.)
• Q(A,k) is a polynomial of degree N(N-1)/2 in
N(N-1)/2 variables pij i, jA.
• Now consider Q(A,k, j) = Q(A,k) - Q(A, j) for all
subsets A N and k, jA k  j. We have all
N
2
together less than 2  N polynomials in
N(N-1)/2 variables pij . The degree of these
polynomials is at most N(N-1)/2.
38
Proof of Theorem 6.1 (cont.)
• Now $c(A) = k \iff Q(A, k, j) > 0$ for every $j \in A$, $j \neq k$. Therefore the choice function given by a vector of probabilities $p_{ij}$ is determined by the sign pattern of all polynomials $Q(A, k, j)$.
• We can now invoke Warren's theorem with $r = D = \binom{N}{2}$ and $m \leq N^2 \cdot 2^N$. According to Warren's theorem, the number of different sign patterns of the polynomials is at most $(8e \cdot 2^N \cdot N^2)^{N(N-1)/2}$.
39
Proof of Theorem 6.1 (cont.)
• $\log_2 (8e \cdot 2^N \cdot N^2)^{N(N-1)/2} = \frac{N(N-1)}{2}\,(N + 2\log_2 N + \log_2 8e) = O(N^3)$.
• Therefore it follows from Theorem 2.1 and Proposition 2.2 that the number of examples needed to learn $W$ in the PAC-model is $O(N^3)$.
40
• Our main result determined the P-dimension of the
class of rationalizable choice functions and
showed that the number of examples needed to
learn a rationalizable choice function is linear in
the number of alternatives.
• We also described a mathematical method for
analyzing the statistical learnability of complicated
choice models.
41