IEG 3050 - Information Theory Society


Refinement of
Two Fundamental Tools in
Information Theory
Raymond W. Yeung
Institute of Network Coding
The Chinese University of Hong Kong
Joint work with Siu Wai Ho and Sergio Verdú
Discontinuity of Shannon's Information Measures

• Shannon's information measures: H(X), H(X|Y), I(X;Y) and I(X;Y|Z).
• They are described as continuous functions [Shannon 1948] [Csiszár & Körner 1981] [Cover & Thomas 1991] [McEliece 2002] [Yeung 2002].
• In fact, all of Shannon's information measures are discontinuous everywhere when the random variables take values from countably infinite alphabets [Ho & Yeung 2005].
• e.g., X can be any positive integer.
Discontinuity of Entropy
• Let $P_X = \{1, 0, 0, \ldots\}$ and
$$P_{X_n} = \left\{1 - \frac{1}{\log n},\ \frac{1}{n \log n},\ \frac{1}{n \log n},\ \ldots,\ 0, 0, \ldots\right\}.$$
• As $n \to \infty$, we have
$$\sum_i \left| P_X(i) - P_{X_n}(i) \right| = \frac{2}{\log n} \to 0.$$
• However,
$$\lim_{n \to \infty} H(X_n) = \infty.$$
Discontinuity of Entropy
• Theorem 1: For any $c > 0$ and any $X$ taking values from a countably infinite alphabet with $H(X) < \infty$, $\exists P_{X_n}$ s.t.
$$V(P_X, P_{X_n}) = \sum_i \left| P_X(i) - P_{X_n}(i) \right| \to 0$$
but
$$H(X_n) \to H(X) + c.$$

[Figure: $H(X_n)$ plotted against $n$, converging to $H(X) + c$, which lies above $H(X)$.]
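The phenomenon behind Theorem 1 is easy to see numerically: move a small probability mass onto many extra symbols. Below is a minimal Python sketch (the function name and the specific perturbation are illustrative assumptions, not the sequence constructed in the proof): the variational distance to a point mass stays at $2\delta$ while the entropy can exceed any target $c$.

```python
import numpy as np

def entropy_after_perturbation(delta, m):
    """Entropy (in bits) of the pmf that keeps mass 1 - delta on symbol 0
    and spreads mass delta uniformly over 2**m additional symbols."""
    head = -(1.0 - delta) * np.log2(1.0 - delta)
    tail = delta * (m - np.log2(delta))   # = -delta * log2(delta / 2**m)
    return head + tail

# P_X is a point mass, so H(X) = 0, yet pmfs arbitrarily close to it in
# variational distance (V = 2*delta) can have arbitrarily large entropy.
for delta, m in [(0.1, 10), (0.01, 1_000), (0.001, 100_000)]:
    print(f"V = {2 * delta:<6}  H(X_n) = {entropy_after_perturbation(delta, m):10.2f} bits")
```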
Discontinuity of Entropy
• Theorem 2: For any $c > 0$ and any $X$ taking values from a countably infinite alphabet with $H(X) < \infty$, $\exists P_{X_n}$ s.t.
$$D(P_X \| P_{X_n}) = \sum_i P_X(i) \log \frac{P_X(i)}{P_{X_n}(i)} \to 0$$
but
$$H(X_n) \to H(X) + c.$$

[Figure: $H(X_n)$ plotted against $n$, converging to $H(X) + c$, which lies above $H(X)$.]
Pinsker's Inequality
$$D(p \| q) \ge \frac{1}{2 \ln 2} V^2(p, q)$$
• By Pinsker's inequality, convergence w.r.t. $D(\cdot \| \cdot)$ implies convergence w.r.t. $V(\cdot, \cdot)$.
• Therefore, Theorem 2 implies Theorem 1.
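A quick numerical sanity check of Pinsker's inequality (a sketch; the random distributions are arbitrary illustrations and the helper names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_bits(p, q):
    """D(p || q) in bits; assumes q(i) > 0 wherever p(i) > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def variational(p, q):
    """V(p, q) = sum_i |p(i) - q(i)|."""
    return float(np.abs(p - q).sum())

for _ in range(5):
    p = rng.dirichlet(np.ones(8))
    q = rng.dirichlet(np.ones(8))
    D, V = kl_bits(p, q), variational(p, q)
    # Pinsker: D(p||q) >= V(p,q)^2 / (2 ln 2)
    assert D >= V**2 / (2 * np.log(2)) - 1e-12
    print(f"D = {D:.4f} bits >= V^2/(2 ln 2) = {V**2 / (2 * np.log(2)):.4f}")
```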
Discontinuity of Entropy

[Figure: an illustration of the discontinuity of entropy.]
Discontinuity of Shannon's Information Measures

• Theorem 3: For any $X$, $Y$ and $Z$ taking values from countably infinite alphabets with $I(X;Y|Z) < \infty$, $\exists P_{X_n Y_n Z_n}$ s.t.
$$\lim_{n \to \infty} D(P_{XYZ} \| P_{X_n Y_n Z_n}) = 0$$
but
$$\lim_{n \to \infty} I(X_n; Y_n | Z_n) = \infty.$$
Discontinuity of Shannon's Information Measures

[Diagram: applications (channel coding theorem, lossless/lossy source coding theorems, etc.) rest on Fano's inequality and typicality, which in turn rest on Shannon's information measures.]
To Find the Capacity of a Communication Channel

[Diagram: Alice communicates with Bob over a channel. Typicality gives the achievability part (capacity ≥ C1); Fano's inequality gives the converse (capacity ≤ C2).]
On Countably Infinite Alphabet
[Diagram: the same structure as before, now over countably infinite alphabets. Applications (channel coding theorem, lossless/lossy source coding theorems, etc.) rest on Fano's inequality and typicality, which rest on Shannon's information measures, and these are now discontinuous!]
Typicality
• Weak typicality was first introduced by Shannon [1948] to establish the source coding theorem.
• Strong typicality was first used by Wolfowitz [1964] and then by Berger [1978]. It was further developed into the method of types by Csiszár and Körner [1981].
• Strong typicality possesses stronger properties compared with weak typicality.
• However, it can be used only for random variables with finite alphabets.
Notations
• Consider an i.i.d. source $\{X_k, k \ge 1\}$, where each $X_k$ takes values from a countable alphabet $\mathcal{X}$.
• Let $P_X = P_{X_k}$ for all $k$.
• Assume $H(P_X) < \infty$.
• Let $\mathbf{X} = (X_1, X_2, \ldots, X_n)$.
• For a sequence $\mathbf{x} = (x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$:
  - $N(x; \mathbf{x})$ is the number of occurrences of $x$ in $\mathbf{x}$,
  - $q(x; \mathbf{x}) = n^{-1} N(x; \mathbf{x})$, and
  - $Q_X = \{q(x; \mathbf{x})\}$ is the empirical distribution of $\mathbf{x}$.
• e.g., $\mathbf{x} = (1, 3, 2, 1, 1)$: $N(1; \mathbf{x}) = 3$, $N(2; \mathbf{x}) = N(3; \mathbf{x}) = 1$, and $Q_X = \{3/5, 1/5, 1/5\}$.
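In code, the counts $N(x; \mathbf{x})$ and the empirical distribution $Q_X$ can be computed directly. A small Python sketch matching the notation above (the helper name is illustrative):

```python
from collections import Counter

def empirical_distribution(seq):
    """Return Q_X = {x: q(x; seq)} where q(x; seq) = N(x; seq) / n."""
    n = len(seq)
    counts = Counter(seq)               # N(x; seq) for each symbol x
    return {x: c / n for x, c in counts.items()}

# Example from the slide: x = (1, 3, 2, 1, 1)
x = (1, 3, 2, 1, 1)
print(empirical_distribution(x))        # {1: 0.6, 3: 0.2, 2: 0.2}
```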
Weak Typicality
• Definition 1 (Weak typicality): For any $\epsilon > 0$, the weakly typical set $W^n_{[X]\epsilon}$ with respect to $P_X$ is the set of sequences $\mathbf{x} = (x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$ such that
$$\left| D(Q_X \| P_X) + H(Q_X) - H(P_X) \right| \le \epsilon.$$
• Note that
$$H(Q_X) = -\sum_x Q_X(x) \log Q_X(x),$$
while the empirical entropy is
$$-\sum_x Q_X(x) \log P_X(x) = -\frac{1}{n} \log p(\mathbf{x}) = D(Q_X \| P_X) + H(Q_X),$$
so the condition says that the empirical entropy of $\mathbf{x}$ is within $\epsilon$ of $H(P_X)$.
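A sketch of the weak typicality test in Python, using the identity above (so only the empirical entropy $-\frac{1}{n}\log p(\mathbf{x})$ of the sequence is needed; function names are illustrative):

```python
import math
from collections import Counter

def empirical_entropy_wrt(seq, P):
    """-(1/n) log2 p(seq) = -sum_x q(x;seq) log2 P(x) for an i.i.d. model P."""
    n = len(seq)
    return -sum(c / n * math.log2(P[x]) for x, c in Counter(seq).items())

def entropy_bits(P):
    return -sum(p * math.log2(p) for p in P.values() if p > 0)

def is_weakly_typical(seq, P, eps):
    """x is in W^n_[X]eps  iff  |D(Q||P) + H(Q) - H(P)| <= eps."""
    return abs(empirical_entropy_wrt(seq, P) - entropy_bits(P)) <= eps

P = {0: 0.5, 1: 0.25, 2: 0.25}          # H(P) = 1.5 bits
print(is_weakly_typical((0, 0, 1, 2), P, eps=0.1))   # True: -(1/4) log2 p(x) = 1.5
print(is_weakly_typical((2, 2, 2, 2), P, eps=0.1))   # False: 2.0 bits vs 1.5
```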
Asymptotic Equipartition Property
• Theorem 4 (Weak AEP): For any $\epsilon > 0$:
• 1) If $\mathbf{x} \in W^n_{[X]\epsilon}$, then
$$2^{-n(H(X)+\epsilon)} \le p(\mathbf{x}) \le 2^{-n(H(X)-\epsilon)}.$$
• 2) For sufficiently large $n$,
$$\Pr\left\{ \mathbf{X} \in W^n_{[X]\epsilon} \right\} > 1 - \epsilon.$$
• 3) For sufficiently large $n$,
$$(1 - \epsilon)\, 2^{n(H(X)-\epsilon)} \le \left| W^n_{[X]\epsilon} \right| \le 2^{n(H(X)+\epsilon)}.$$
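A Monte Carlo illustration of Property 2 (a self-contained sketch with an arbitrary three-symbol source): the estimated probability of the weakly typical set approaches 1 as n grows.

```python
import math
import random

random.seed(1)
P = {0: 0.5, 1: 0.25, 2: 0.25}
symbols, probs = zip(*P.items())
H = -sum(p * math.log2(p) for p in probs)            # H(P_X) = 1.5 bits

def prob_weakly_typical(n, eps=0.1, trials=2000):
    """Monte Carlo estimate of Pr{X in W^n_[X]eps} for the i.i.d. source P."""
    hits = 0
    for _ in range(trials):
        x = random.choices(symbols, weights=probs, k=n)
        emp = -sum(math.log2(P[s]) for s in x) / n    # -(1/n) log2 p(x)
        hits += abs(emp - H) <= eps
    return hits / trials

for n in (10, 100, 1000):
    print(n, prob_weakly_typical(n))   # approaches 1 as n grows
```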
Illustration of AEP
[Figure: within $\mathcal{X}^n$, the set of all $n$-sequences, the typical set carries probability ≈ 1 and its elements are ≈ uniformly distributed.]
Strong Typicality
• Strong typicality has been defined in slightly different forms in the literature.
• Definition 2 (Strong typicality): For $|\mathcal{X}| < \infty$ and any $\delta > 0$, the strongly typical set $T^n_{[X]\delta}$ with respect to $P_X$ is the set of sequences $\mathbf{x} = (x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$ such that
$$V(P_X, Q_X) = \sum_x \left| P_X(x) - q(x; \mathbf{x}) \right| \le \delta,$$
i.e., the variational distance between the empirical distribution of the sequence $\mathbf{x}$ and $P_X$ is small.
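A corresponding sketch of the strong typicality test, which simply thresholds the variational distance between the empirical distribution and $P_X$ (helper names are illustrative):

```python
from collections import Counter

def is_strongly_typical(seq, P, delta):
    """x is in T^n_[X]delta  iff  sum_x |P(x) - q(x; seq)| <= delta."""
    n = len(seq)
    q = Counter(seq)
    support = set(P) | set(q)
    V = sum(abs(P.get(x, 0.0) - q.get(x, 0) / n) for x in support)
    return V <= delta

P = {0: 0.5, 1: 0.25, 2: 0.25}
print(is_strongly_typical((0, 0, 1, 2), P, delta=0.1))   # True: Q matches P exactly
print(is_strongly_typical((0, 0, 1, 1), P, delta=0.1))   # False: V = 0.5
```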
Asymptotic Equipartition Property
• Theorem 5 (Strong AEP): For a finite alphabet $\mathcal{X}$ and any $\delta > 0$:
• 1) If $\mathbf{x} \in T^n_{[X]\delta}$, then
$$2^{-n(H(X)+\gamma)} \le p(\mathbf{x}) \le 2^{-n(H(X)-\gamma)}.$$
• 2) For sufficiently large $n$,
$$\Pr\left\{ \mathbf{X} \in T^n_{[X]\delta} \right\} > 1 - \delta.$$
• 3) For sufficiently large $n$,
$$(1 - \delta)\, 2^{n(H(X)-\gamma)} \le \left| T^n_{[X]\delta} \right| \le 2^{n(H(X)+\gamma)},$$
where $\gamma \to 0$ as $\delta \to 0$.
Breakdown of Strong AEP
• If strong typicality is extended (in the natural way) to countably infinite alphabets, the strong AEP no longer holds.
• Specifically, Property 2 still holds, but Properties 1 and 3 do not.
Typicality
$\mathcal{X}$: finite alphabet

Weak Typicality:
$$\left| D(Q_X \| P_X) + H(Q_X) - H(P_X) \right| \le \epsilon$$
Strong Typicality:
$$V(P_X, Q_X) \le \delta$$
Unified Typicality
$\mathcal{X}$: countably infinite alphabet

Weak Typicality:
$$\left| D(Q_X \| P_X) + H(Q_X) - H(P_X) \right| \le \epsilon$$
Strong Typicality:
$$V(P_X, Q_X) \le \delta$$
$\exists\, \mathbf{x}$ s.t. $D(Q_X \| P_X)$ is small but $\left| H(Q_X) - H(P_X) \right|$ is large.
Unified Typicality
$\mathcal{X}$: countably infinite alphabet

Weak Typicality:
$$\left| D(Q_X \| P_X) + H(Q_X) - H(P_X) \right| \le \epsilon$$
Strong Typicality:
$$V(P_X, Q_X) \le \delta$$
Unified Typicality:
$$D(Q_X \| P_X) + \left| H(Q_X) - H(P_X) \right| \le \eta$$
Unified Typicality
• Definition 3 (Unified typicality): For any $\eta > 0$, the unified typical set $U^n_{[X]\eta}$ with respect to $P_X$ is the set of sequences $\mathbf{x} = (x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$ such that
$$D(Q_X \| P_X) + \left| H(Q_X) - H(P_X) \right| \le \eta.$$
• Weak Typicality: $\left| D(Q_X \| P_X) + H(Q_X) - H(P_X) \right| \le \epsilon$
• Strong Typicality: $V(P_X, Q_X) \le \delta$
• Each notion of typicality corresponds to a "distance measure".
• Entropy is continuous w.r.t. the distance measure induced by unified typicality.
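A sketch of the unified typicality test of Definition 3 (illustrative helper names; $D(Q \| P)$ is taken as infinite when $Q$ puts mass outside the support of $P$):

```python
import math
from collections import Counter

def entropy_bits(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def is_unified_typical(seq, P, eta):
    """x is in U^n_[X]eta  iff  D(Q || P) + |H(Q) - H(P)| <= eta."""
    n = len(seq)
    Q = {x: c / n for x, c in Counter(seq).items()}
    if any(x not in P or P[x] == 0 for x in Q):
        return False                       # D(Q || P) = infinity
    D = sum(q * math.log2(q / P[x]) for x, q in Q.items())
    return D + abs(entropy_bits(Q) - entropy_bits(P)) <= eta

P = {0: 0.5, 1: 0.25, 2: 0.25}
print(is_unified_typical((0, 0, 1, 2), P, eta=0.1))      # True
print(is_unified_typical((0, 1, 1, 2), P, eta=0.1))      # False: D = 0.25 > 0.1
```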
Asymptotic Equipartition Property
• Theorem 6 (Unified AEP): For any $\eta > 0$:
• 1) If $\mathbf{x} \in U^n_{[X]\eta}$, then
$$2^{-n(H(X)+\eta)} \le p(\mathbf{x}) \le 2^{-n(H(X)-\eta)}.$$
• 2) For sufficiently large $n$,
$$\Pr\left\{ \mathbf{X} \in U^n_{[X]\eta} \right\} > 1 - \eta.$$
• 3) For sufficiently large $n$,
$$(1 - \eta)\, 2^{n(H(X)-\mu)} \le \left| U^n_{[X]\eta} \right| \le 2^{n(H(X)+\mu)}.$$
Unified Typicality
• Theorem 7: For any $\mathbf{x} \in \mathcal{X}^n$, if $\mathbf{x} \in U^n_{[X]\eta}$, then $\mathbf{x} \in W^n_{[X]\epsilon}$ and $\mathbf{x} \in T^n_{[X]\delta}$, where $\epsilon = \eta$ and $\delta = \sqrt{2 \eta \ln 2}$.
Unified Joint Typicality

• Consider a bivariate information source $\{(X_k, Y_k), k \ge 1\}$ where the pairs $(X_k, Y_k)$ are i.i.d. with generic distribution $P_{XY}$.
• We use $(X, Y)$ to denote the pair of generic random variables.
• Let $(\mathbf{X}, \mathbf{Y}) = ((X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n))$.
• For a pair of sequences $(\mathbf{x}, \mathbf{y})$, the empirical distribution is $Q_{XY} = \{q(x, y; \mathbf{x}, \mathbf{y})\}$, where $q(x, y; \mathbf{x}, \mathbf{y}) = n^{-1} N(x, y; \mathbf{x}, \mathbf{y})$.
Unified Joint Typicality

• Definition 4 (Unified joint typicality): For any $\eta > 0$, the unified jointly typical set $U^n_{[XY]\eta}$ with respect to $P_{XY}$ is the set of sequences $(\mathbf{x}, \mathbf{y}) \in \mathcal{X}^n \times \mathcal{Y}^n$ such that
$$D(Q_{XY} \| P_{XY}) + \left| H(Q_{XY}) - H(P_{XY}) \right| + \left| H(Q_X) - H(P_X) \right| + \left| H(Q_Y) - H(P_Y) \right| \le \eta.$$
• This definition cannot be simplified.
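A sketch of the unified joint typicality test of Definition 4, computing the joint and marginal empirical distributions from a pair of sequences (helper names are illustrative):

```python
import math
from collections import Counter

def _H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def _D(Q, P):
    if any(x not in P or P[x] == 0 for x in Q):
        return math.inf
    return sum(q * math.log2(q / P[x]) for x, q in Q.items())

def is_jointly_unified_typical(xs, ys, Pxy, eta):
    """(x, y) is in U^n_[XY]eta per Definition 4."""
    n = len(xs)
    Qxy = {pair: c / n for pair, c in Counter(zip(xs, ys)).items()}
    Qx = {x: c / n for x, c in Counter(xs).items()}
    Qy = {y: c / n for y, c in Counter(ys).items()}
    Px, Py = {}, {}
    for (a, b), p in Pxy.items():          # marginals of P_XY
        Px[a] = Px.get(a, 0.0) + p
        Py[b] = Py.get(b, 0.0) + p
    total = (_D(Qxy, Pxy)
             + abs(_H(Qxy) - _H(Pxy))
             + abs(_H(Qx) - _H(Px))
             + abs(_H(Qy) - _H(Py)))
    return total <= eta

Pxy = {(0, 0): 0.5, (1, 1): 0.5}          # X = Y, uniform bit
print(is_jointly_unified_typical((0, 1), (0, 1), Pxy, eta=0.1))   # True
print(is_jointly_unified_typical((0, 1), (1, 0), Pxy, eta=0.1))   # False (D = inf)
```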
Conditional AEP
• Definition 5: For any $\mathbf{x} \in U^n_{[X]\eta}$, the conditional typical set of $Y$ is defined as
$$U^n_{[Y|X]\eta}(\mathbf{x}) = \left\{ \mathbf{y} \in U^n_{[Y]\eta} : (\mathbf{x}, \mathbf{y}) \in U^n_{[XY]\eta} \right\}.$$
• Theorem 8: For $\mathbf{x} \in U^n_{[X]\eta}$, if
$$\left| U^n_{[Y|X]\eta}(\mathbf{x}) \right| \ge 1,$$
then
$$2^{n(H(Y|X)-\psi)} \le \left| U^n_{[Y|X]\eta}(\mathbf{x}) \right| \le 2^{n(H(Y|X)+\psi)},$$
where $\psi \to 0$ as $\eta \to 0$ and then $n \to \infty$.
Illustration of Conditional AEP
[Figure: there are about $2^{nH(X)}$ typical sequences $\mathbf{x} \in S^n_{[X]}$ and about $2^{nH(Y)}$ typical sequences $\mathbf{y} \in S^n_{[Y]}$, of which about $2^{nH(X,Y)}$ pairs $(\mathbf{x}, \mathbf{y}) \in T^n_{[XY]}$ are jointly typical.]
Applications
• Rate-distortion theory
  - A version of the rate-distortion theorem was proved by strong typicality [Cover & Thomas 1991] [Yeung 2008].
  - It can be easily generalized to countably infinite alphabets.
• Multi-source network coding
  - The achievable information rate region of the multi-source network coding problem was proved by strong typicality [Yeung 2008].
  - It can be easily generalized to countably infinite alphabets.
Fano’s Inequality
• Fano's inequality: For discrete random variables $X$ and $Y$ taking values on the same alphabet $\mathcal{X} = \{1, 2, \ldots\}$, let
$$\epsilon = P[X \ne Y] = 1 - \sum_{w \in \mathcal{X}} P_{XY}(w, w).$$
• Then
$$H(X|Y) \le \epsilon \log(|\mathcal{X}| - 1) + h(\epsilon),$$
where
$$h(x) = x \log \frac{1}{x} + (1 - x) \log \frac{1}{1 - x}$$
for $0 < x < 1$ and $h(0) = h(1) = 0$.
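The right-hand side of Fano's inequality as a small Python function (log base 2, entropies in bits; function names are illustrative):

```python
import math

def binary_entropy(x):
    """h(x) in bits, with h(0) = h(1) = 0."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def fano_bound(eps, alphabet_size):
    """Upper bound on H(X|Y): eps * log2(|X| - 1) + h(eps)."""
    return eps * math.log2(alphabet_size - 1) + binary_entropy(eps)

# e.g. |X| = 4 and error probability 0.1:
print(fano_bound(0.1, 4))          # 0.1*log2(3) + h(0.1) ≈ 0.63 bits
```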
Motivation 1
$$H(X|Y) \le \epsilon \log(|\mathcal{X}| - 1) + h(\epsilon)$$
• This upper bound on $H(X|Y)$ is not tight.
• For fixed $\epsilon$ and $|\mathcal{X}|$, one can always find $X$ such that
$$H(X|Y) \le H(X) < \epsilon \log(|\mathcal{X}| - 1) + h(\epsilon).$$
• Then we can ask: for fixed $P_X$ and $\epsilon$, what is
$$\max_{P_{Y|X} :\, P[X \ne Y] \le \epsilon} H(X|Y)\,?$$
Motivation 2
• If $X$ takes values from a countably infinite alphabet, Fano's inequality no longer gives an upper bound on $H(X|Y)$.
• It is possible that $H(X|Y) \nrightarrow 0$ as $\epsilon \to 0$, which can be explained by the discontinuity of entropy.
• E.g., let
$$P_{X_n} = \left\{1 - \frac{1}{\log n},\ \frac{1}{n \log n},\ \ldots,\ \frac{1}{n \log n}\right\} \quad \text{and} \quad P_{Y_n} = \{1, 0, 0, \ldots\}.$$
Then $H(X_n|Y_n) = H(X_n) \to \infty$ but $\epsilon_n = \frac{1}{\log n} \to 0$.
• Under what conditions does $\epsilon \to 0$ imply $H(X|Y) \to 0$ for countably infinite alphabets?
Tight Upper Bound on H(X|Y)
• Theorem 9: Suppose $\epsilon = P[X \ne Y] \le 1 - P_X(1)$. Then
$$H(X|Y) \le \epsilon\, H(Q(P_X, \epsilon)) + h(\epsilon),$$
where the right side is the tight bound depending on $\epsilon$ and $P_X$. (This is the simplest of the 3 cases.)
• Here $Q(P_X, \epsilon) = \{\epsilon^{-1} q_1, \epsilon^{-1} q_2, \epsilon^{-1} q_3, \ldots\}$. [Figure: the masses $q_1, q_2, q_3, q_4, \ldots$, which sum to $\epsilon$.]
• Let $\phi_X(\epsilon) = \epsilon\, H(Q(P_X, \epsilon)) + h(\epsilon)$.
Generalizing Fano’s Inequality
• Fano's inequality [Fano 1952] gives an upper bound on the conditional entropy $H(X|Y)$ in terms of the error probability $\epsilon = \Pr\{X \ne Y\}$.
• E.g., $P_X = [0.4, 0.4, 0.1, 0.1]$.

[Plot: $H(X|Y)$ versus $\epsilon$, comparing the [Fano 1952] bound with the tighter bound of [Ho & Verdú 2008].]
Generalizing Fano's Inequality

• E.g., $X$ is a Poisson random variable with mean equal to 10.
• Fano's inequality no longer gives an upper bound on $H(X|Y)$.

[Plot: $H(X|Y)$ versus $\epsilon$; the bound of [Ho & Verdú 2008] still applies.]
Joint Source-Channel Coding
[Diagram: $(S_1, S_2, \ldots, S_k)$ → Encoder → $(X_1, X_2, \ldots, X_n)$ → Channel → $(Y_1, Y_2, \ldots, Y_n)$ → Decoder → $(\hat{S}_1, \hat{S}_2, \ldots, \hat{S}_k)$: a $k$-to-$n$ joint source-channel code.]
Error Probabilities
• The average symbol error probability is defined as
$$\lambda_k = \frac{1}{k} \sum_{i=1}^{k} P[S_i \ne \hat{S}_i].$$
• The block error probability is defined as
$$\epsilon_k = P[(S_1, S_2, \ldots, S_k) \ne (\hat{S}_1, \hat{S}_2, \ldots, \hat{S}_k)].$$
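A sketch of the two error criteria evaluated on sample blocks; averaging these quantities over many independent source blocks estimates $\lambda_k$ and $\epsilon_k$ (function names are illustrative):

```python
def symbol_errors(s, s_hat):
    """Fraction of symbol positions where s and s_hat disagree."""
    return sum(a != b for a, b in zip(s, s_hat)) / len(s)

def block_error(s, s_hat):
    """1 if the block as a whole is wrong, 0 otherwise."""
    return int(tuple(s) != tuple(s_hat))

s, s_hat = (1, 0, 1, 1, 0), (1, 0, 0, 1, 0)
print(symbol_errors(s, s_hat))   # 0.2  (one of five symbols is wrong)
print(block_error(s, s_hat))     # 1    (the whole block is wrong)
```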
Symbol Error Rate
• Theorem 10: For any discrete memoryless source and general channel, the rate of a $k$-to-$n$ joint source-channel code with symbol error probability $\lambda_k$ satisfies
$$\sup_{X^n} \frac{1}{n} I(X^n; Y^n) \ge \frac{k}{n} \left[ H(S) - \phi_{S^*}(\lambda_k) \right],$$
where $S^*$ is constructed from $\{S_1, S_2, \ldots, S_k\}$ according to
$$P_{S^*}(1) = \min_j P_{S_j}(1),$$
$$P_{S^*}(a) = \min_j \sum_{i=1}^{a} P_{S_j}(i) - \sum_{i=1}^{a-1} P_{S^*}(i), \quad a \ge 2.$$
Block Error Rate
• Theorem 11: For any general discrete source and general channel, the block error probability $\epsilon_k$ of a $k$-to-$n$ joint source-channel code is lower bounded by
$$\epsilon_k \ge \phi^{-1}_{S^k}\!\left( H(S^k) - \sup_{X^n} I(X^n; Y^n) \right).$$
Information Theoretic Security
1
n n
lim
n
I
(
X
;Y )  0
 Weak secrecy n 
has been considered in [Csiszár & Körner 78,
Broadcast channel] and some seminal papers.
 [Wyner 75, Wiretap channel I] only stated that “a
large value of the equivocation implies a large value
of Pew”, where the equivocation refers to n 1H ( X n | Y k )
and Pew means  n .
 It is important to clarify what exactly weak secrecy
implies.
P.43
Weak Secrecy
• E.g., $P_X = (0.4, 0.4, 0.1, 0.1)$.

[Plot: $H(X|Y)$ versus $\epsilon = P[X \ne Y]$, showing $H(X)$ together with the [Fano 1952] and [Ho & Verdú 2008] bounds.]
Weak Secrecy
• Theorem 12: For any discrete stationary memoryless source (i.i.d. source) with distribution $P_X$, if
$$\lim_{n \to \infty} n^{-1} I(X^n; Y^n) = 0,$$
then
$$\lim_{n \to \infty} \lambda_n = \lambda_{\max} \quad \text{and} \quad \lim_{n \to \infty} \epsilon_n = 1.$$
• Remark:
  - Weak secrecy together with the stationary source assumption is insufficient to show the maximum error probability.
  - The proof is based on the tight upper bound on $H(X|Y)$ in terms of the error probability.
Summary
[Diagram: applications (channel coding theorem, lossless/lossy source coding theorems) rest on Fano's inequality and typicality (weak typicality and strong typicality), which in turn rest on Shannon's information measures.]
On Countably Infinite Alphabet
[Diagram: over countably infinite alphabets, the applications (channel coding theorem, lossless/lossy source coding theorem) can rely only on weak typicality, and Shannon's information measures at the base are discontinuous!]
Unified Typicality
[Diagram: applications (channel coding theorem, multi-source network coding (MSNC)/lossy source coding theorems) rest on unified typicality, which rests on Shannon's information measures.]
Generalized Fano’s Inequality
[Diagram: applications (results on JSCC and IT security; MSNC/lossy source coding theorems) rest on the generalized Fano's inequality and unified typicality, which rest on Shannon's information measures.]
Perhaps...
A lot of fundamental research in information theory is still waiting for us to investigate.
References
• S.-W. Ho and R. W. Yeung, "On the Discontinuity of the Shannon Information Measures," IEEE Trans. Inform. Theory, vol. 55, no. 12, pp. 5362–5374, Dec. 2009.
• S.-W. Ho and R. W. Yeung, "On Information Divergence Measures and a Unified Typicality," IEEE Trans. Inform. Theory, vol. 56, no. 12, pp. 5893–5905, Dec. 2010.
• S.-W. Ho and S. Verdú, "On the Interplay between Conditional Entropy and Error Probability," IEEE Trans. Inform. Theory, vol. 56, no. 12, pp. 5930–5942, Dec. 2010.
• S.-W. Ho, "On the Interplay between Shannon's Information Measures and Reliability Criteria," in Proc. 2009 IEEE Int. Symposium on Inform. Theory (ISIT 2009), Seoul, Korea, June 28–July 3, 2009.
• S.-W. Ho, "Bounds on the Rates of Joint Source-Channel Codes for General Sources and Channels," in Proc. 2009 IEEE Inform. Theory Workshop (ITW 2009), Taormina, Italy, Oct. 11–16, 2009.
Q&A
Why Countably Infinite Alphabet?
• An important mathematical theory can provide insights that cannot be obtained by other means.
• Many problems involve random variables taking values from countably infinite alphabets.
• The finite alphabet is a special case.
• Benefits: tighter bounds, faster convergence rates, etc.
• In source coding, the alphabet size can be very large, infinite, or unknown.
Discontinuity of Entropy
• Entropy is a measure of uncertainty.
• We can become more and more sure that a particular event will happen as time goes by, while at the same time the uncertainty of the whole picture keeps increasing.
• If one finds the above statement counter-intuitive, the idea that entropy is continuous may be rooted in one's mind.
• The limiting probability distribution may not fully characterize the asymptotic behavior of a Markov chain.
Discontinuity of Entropy
Suppose a child hides in a shopping mall whose floor plan is shown on the next slide.
In each case, the chance that he hides in a particular room is directly proportional to the size of that room.
We are only interested in which room the child is in, not his exact position inside the room.
Which case do you expect to be the easiest for locating the child?
Case A: 1 blue room + 2 green rooms
Case B: 1 blue room + 16 green rooms
Case C: 1 blue room + 256 green rooms
Case D: 1 blue room + 4096 green rooms

The chance in the blue room:  Case A: 0.5,  Case B: 0.622,  Case C: 0.698,  Case D: 0.742
The chance in a green room:   Case A: 0.25, Case B: 0.0236, Case C: 0.00118, Case D: 0.000063
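A quick computation of the Shannon entropy for Cases A–D from the blue-room probabilities above (each green room is equally likely, so its probability is (1 − P(blue))/k; this sketch accompanies the discussion that follows):

```python
import math

cases = {            # (probability of the blue room, number of green rooms)
    "A": (0.5, 2),
    "B": (0.622, 16),
    "C": (0.698, 256),
    "D": (0.742, 4096),
}

for name, (p_blue, k) in cases.items():
    p_green = (1 - p_blue) / k
    H = -p_blue * math.log2(p_blue) - k * p_green * math.log2(p_green)
    # The entropy grows from Case A to Case D even though P(blue) grows.
    print(f"Case {name}: P(blue) = {p_blue:.3f}, H = {H:.2f} bits")
```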
Discontinuity of Entropy
From Case A to Case D, the difficulty of locating the child is increasing. By the Shannon entropy, the uncertainty is increasing, although the probability of the child being in the blue room is also increasing.
We can continue this construction and make the chance of the blue room approach 1!
The critical assumption is that the number of rooms can be unbounded. So we have seen that "there is a very sure event" and "large uncertainty about the whole picture" can exist at the same time.
Imagine there is a city where everyone has a normal life every day with probability 0.99. With probability 0.01, however, any kind of accident beyond our imagination can happen.
Would you feel a big uncertainty about your life if you were living in that city?
1
n
k
lim n I ( X ; Y )  0
n 
 Weak secrecy is insufficient to show the maximum
error probability.
 Example 1: Let W, V and Xi be binary random variables.
 Suppose W and V are independent and uniform.
 Let
W
V 0

Xi  
independen t and uniform V  1
~
max  1  max x PX ( x)  0.5
~
max

n 
 lim 1  max xn P
Xn
( xn)

1 1 3

 lim 1    
2 2 4
n  
P.58
Example 1
• Let $Y_1 = X_1$, $Y_2 = X_4$, $Y_3 = X_9$, $Y_4 = X_{16}$, ..., i.e., $Y_i = X_{i^2}$.
• Then
$$0 \le \lim_{n \to \infty} n^{-1} I(X^n; Y^k) \le \lim_{n \to \infty} n^{-1} \sqrt{n} = 0.$$
• Choose
$$\hat{\mathbf{x}} = \begin{cases} (0, 0, \ldots, 0) & \text{if } Y_i = 0 \ \forall i, \\ (1, 1, \ldots, 1) & \text{if } Y_i = 1 \ \forall i. \end{cases}$$
• Then
$$\lim_{n \to \infty} \epsilon_n = P[V = 1] = \frac{1}{2} < \frac{3}{4} = \tilde{\epsilon}_{\max}$$
and
$$\lim_{n \to \infty} \lambda_n = P[V = 1] \cdot \frac{1}{2} = \frac{1}{4} < \frac{1}{2} = \tilde{\lambda}_{\max}.$$
Joint Unified Typicality
Can
$$D(Q_{XY} \| P_{XY}) + \left| H(Q_{XY}) - H(P_{XY}) \right| + \left| H(Q_X) - H(P_X) \right| + \left| H(Q_Y) - H(P_Y) \right| \le \eta$$
be changed to
$$D(Q_{XY} \| P_{XY}) + \left| H(Q_{XY}) - H(P_{XY}) \right| + \left| H(Q_X) - H(P_X) \right| \le \eta\,?$$
Ans: No. [Counterexample: a pair of joint distributions $Q = \{q(x, y)\}$ and $P = \{p(x, y)\}$ with $D(Q \| P) \ll 1$.]
Joint Unified Typicality
Can
$$D(Q_{XY} \| P_{XY}) + \left| H(Q_{XY}) - H(P_{XY}) \right| + \left| H(Q_X) - H(P_X) \right| + \left| H(Q_Y) - H(P_Y) \right| \le \eta$$
be changed to
$$D(Q_{XY} \| P_{XY}) + \left| H(Q_X) - H(P_X) \right| + \left| H(Q_Y) - H(P_Y) \right| \le \eta\,?$$
Ans: No. [Counterexample: a pair of joint distributions $Q = \{q(x, y)\}$ and $P = \{p(x, y)\}$ with $D(Q \| P) \ll 1$.]
Asymptotic Equipartition Property
• Theorem 5 (Consistency): For any $(\mathbf{x}, \mathbf{y}) \in \mathcal{X}^n \times \mathcal{Y}^n$, if $(\mathbf{x}, \mathbf{y}) \in U^n_{[XY]\eta}$, then $\mathbf{x} \in U^n_{[X]\eta}$ and $\mathbf{y} \in U^n_{[Y]\eta}$.
• Theorem 6 (Unified JAEP): For any $\eta > 0$:
• 1) If $(\mathbf{x}, \mathbf{y}) \in U^n_{[XY]\eta}$, then
$$2^{-n(H(X,Y)+\eta)} \le p(\mathbf{x}, \mathbf{y}) \le 2^{-n(H(X,Y)-\eta)}.$$
• 2) For sufficiently large $n$,
$$\Pr\left\{ (\mathbf{X}, \mathbf{Y}) \in U^n_{[XY]\eta} \right\} > 1 - \eta.$$
• 3) For sufficiently large $n$,
$$(1 - \eta)\, 2^{n(H(X,Y)-\mu)} \le \left| U^n_{[XY]\eta} \right| \le 2^{n(H(X,Y)+\mu)}.$$