Information Theory and Security


Information Theory and Security
Lecture Motivation

Up to this point we have seen:
– Classical Crypto
– Symmetric Crypto
– Asymmetric Crypto

These systems have focused on issues of confidentiality:
Ensuring that an adversary cannot infer the original plaintext
message, or cannot learn any information about the original
plaintext from the ciphertext.

In today’s lecture we will put a more formal framework around
the notion of what information is, and use this to provide a
definition of security from an information-theoretic point of
view.
Lecture Outline

Probability Review: Conditional Probability and Bayes

Entropy:
– Desired properties and definition
– Chain Rule and conditioning

Coding and Information Theory
– Huffman codes
– General source coding results

Secrecy and Information Theory
– Probabilistic definitions of a cryptosystem
– Perfect Secrecy
The Basic Idea

Suppose we roll a 6-sided die.
– Let A be the event that the number of dots is odd.
– Let B be the event that the number of dots is at least 3.

A = {1, 3, 5}

B = {3, 4, 5, 6}

If I tell you the roll belongs to both A and B, then you know there
are only two possibilities: {3, 5}

In this sense A ∩ B tells you more than just A or just B.
That is, there is less uncertainty in A ∩ B than in A or B.


Information is closely linked with this idea of uncertainty:
Information increases when uncertainty decreases.
Probability Review, pg. 1

A random variable (event) is an experiment whose outcomes
are mapped to real numbers.

For our discussion we will deal with discrete-valued random
variables.

Probability: We denote pX(x) = Pr(X = x).
For a subset A,

p(A) = Σx∈A pX(x)
Joint Probability: Sometimes we want to consider more than
one event at the same time, in which case we lump them
together into a joint random variable, e.g. Z = (X,Y).
pX,Y(x,y) = Pr(X = x, Y = y)

Independence: We say that two events are independent if
p X ,Y X, Y   x, y   p X x p Y y 
Probability Review, pg. 2

Conditional Probability: We will often ask questions about
the probability of events Y given that we have observed X=x.
In particular, we define the conditional probability of Y=y
given X=x by
pY(y|x) = pX,Y(x,y) / pX(x)

Independence: We immediately get pY(y|x) = pY(y)

Bayes’s Theorem: If pX(x)>0 and pY(y)>0 then
pX(x|y) = pX(x) pY(y|x) / pY(y)
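
As an aside (not from the lecture), a small Python sketch can make these definitions concrete. It revisits the dice events A and B from the start of the lecture and checks Bayes’s theorem on them numerically.

```python
from fractions import Fraction

# Fair six-sided die: each outcome has probability 1/6
outcomes = {x: Fraction(1, 6) for x in range(1, 7)}

A = {1, 3, 5}        # odd number of dots
B = {3, 4, 5, 6}     # at least 3 dots

def prob(event):
    """p(event) = sum of p(x) over x in the event."""
    return sum(outcomes[x] for x in event)

# Conditional probabilities: Pr(A | B) = Pr(A and B) / Pr(B)
p_A_given_B = prob(A & B) / prob(B)
p_B_given_A = prob(A & B) / prob(A)

# Bayes: Pr(A | B) = Pr(A) * Pr(B | A) / Pr(B)
bayes = prob(A) * p_B_given_A / prob(B)

print(p_A_given_B, bayes)   # both are 1/2
```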
Entropy and Uncertainty

We are concerned with how much uncertainty a random event
has, but how do we define or measure uncertainty?

We want our measure to have the following properties:
1. To each set of nonnegative numbers p = (p1, p2, …, pn) with
p1 + p2 + … + pn = 1, we define the uncertainty by H(p).
2. H(p) should be a continuous function: a slight change in p
should not drastically change H(p).
3. H(1/n, …, 1/n) ≤ H(1/(n+1), …, 1/(n+1)) for all n > 0. Uncertainty increases
when there are more outcomes.
4. If 0 < q < 1, then
H(p1, …, q·pj, (1−q)·pj, …, pn) = H(p1, …, pn) + pj·H(q, 1−q)
Entropy, pg. 2

We define the entropy of a random variable by
HX     px  log 2 p( x )
x


Example: Consider a fair coin toss. There are two outcomes,
with probability ½ each. The entropy is
−(1/2) log2(1/2) − (1/2) log2(1/2) = 1 bit
Example: Consider a non-fair coin toss X with probability p of
getting heads and 1-p of getting tails. The entropy is
HX  p log 2 p  1  plog 2 1  p
The entropy is maximum when p= ½.
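
As an illustration (not part of the slides), here is a minimal Python sketch of the entropy formula above; the inputs simply reproduce the coin examples just given, plus the two-coin example on the next slide.

```python
import math

def entropy(probs):
    """H = -sum p * log2 p, ignoring zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))         # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))         # biased coin: about 0.469 bits (less than 1)
print(entropy([0.25, 0.5, 0.25]))  # number of heads in two coin flips: 1.5 bits
```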
Entropy, pg. 3

Entropy may be thought of as the number of yes-no questions
needed to accurately determine the outcome of a random
event.

Example: Flip two coins, and let X be the number of heads.
The possibilities are {0,1,2} and the probabilities are {1/4,
1/2, 1/4}. The Entropy is

−(1/4) log2(1/4) − (1/2) log2(1/2) − (1/4) log2(1/4) = 3/2 bits

So how can we relate this to questions?

First, ask “Is there exactly one head?” You will get the right
answer half the time…

Next, ask “Are there two heads?”
Half the time you need one question, half the time you need two,
so on average you need 3/2 questions, matching the entropy.

Entropy, pg. 4

Suppose we have two random variables X and Y. The joint
entropy H(X,Y) is given by
HX, Y     p XY x, y  log 2 p XY ( x, y)
x y

Conditional Entropy: In security, we ask questions of whether
an observation reduces the uncertainty in something else. In
particular, we want a notion of conditional entropy. Given that
we observe event X, how much uncertainty is left in Y?
HY | X    p X ( x )H(Y | X  x )
x


  p X ( x )  p Y ( y | x ) log 2 p Y ( y | x ) 
x
 y

  p XY ( x, y) log 2 p Y ( y | x )
x
y
Entropy, pg. 5

Chain Rule: The Chain Rule allows us to relate joint entropy to
conditional entropy via H(X,Y) = H(Y|X)+H(X).
HX, Y    p XY ( x, y) log 2 p XY ( x, y)
x
y
  p XY ( x, y) log 2 p X ( x )p Y ( y | x )
x
y
  p X ( x ) log 2 p X ( x )  H(Y | X)
x
 H(X)  H(Y | X)
(Remaining details will be provided on the white board)

Meaning: Uncertainty in (X,Y) is the uncertainty of X plus
whatever uncertainty remains in Y given we observe X.
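
To make the chain rule concrete, here is a small numerical check (not from the lecture). The joint distribution used below is made up purely for illustration.

```python
import math
from collections import defaultdict

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A small hypothetical joint distribution p_{X,Y}(x, y)
p_xy = {('a', 0): 0.25, ('a', 1): 0.25, ('b', 0): 0.40, ('b', 1): 0.10}

# Marginal p_X(x)
p_x = defaultdict(float)
for (x, y), p in p_xy.items():
    p_x[x] += p

# H(X,Y), H(X), and H(Y|X) = -sum p(x,y) log2 p(y|x), with p(y|x) = p(x,y)/p(x)
H_xy = H(p_xy.values())
H_x = H(p_x.values())
H_y_given_x = -sum(p * math.log2(p / p_x[x]) for (x, y), p in p_xy.items() if p > 0)

print(round(H_xy, 4), round(H_x, 4), round(H_y_given_x, 4))
# about 1.861, 1.0, 0.861: H(X,Y) = H(X) + H(Y|X)
```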
Entropy, pg. 6

Main Theorem:
1. Entropy is non-negative: H(X) ≥ 0
2. H(X) ≤ log2 |X|, where |X| denotes the number of elements
in the sample space of X.
3. H(X,Y) ≤ H(X) + H(Y)
4. (Conditioning reduces entropy) H(Y|X) ≤ H(Y),
with equality if and only if X and Y are independent.
Entropy and Source Coding Theory

There is a close relationship between entropy and representing
information.

Entropy captures the notion of how many “Yes-No” questions
are needed to accurately identify a piece of information… that
is, how many bits are needed!

One of the main focus areas in the field of information theory is
on the issue of source-coding:
– How to efficiently represent (“compress”) information in as few bits as
possible.

We will talk about one such technique, Huffman Coding.

Huffman coding is for a simple scenario, where the source is a
stationary stochastic process with independence between
successive source symbols.
Huffman Coding, pg. 1

Suppose we have an alphabet with four letters A, B, C, D with
frequencies:
A: 0.5
B: 0.3
C: 0.1
D: 0.1
We could represent this with A=00, B=01, C=10, D=11. This
would mean we use an average of 2 bits per letter.

On the other hand, we could use the following representation:
A=1, B=01, C=001, D=000. Then the average number of bits
per letter becomes
(0.5)*1+(0.3)*2+(0.1)*3+(0.1)*3 = 1.7
Hence, this representation, on average, is more efficient.
Huffman Coding, pg. 2


Huffman Coding is an algorithm that produces a representation for
a source.

The Algorithm:
– List all outputs and their probabilities
– Assign a 1 and a 0 to the two smallest, and combine them to form an
output with probability equal to the sum
– Sort the list according to probabilities and repeat the process
– The binary strings are then obtained by reading backwards through
the procedure

[Figure: Huffman tree for A = 0.5, B = 0.3, C = 0.1, D = 0.1. C and D
combine into 0.2; 0.2 and B combine into 0.5; 0.5 and A combine into 1.0.]

Symbol Representations
A: 1
B: 01
C: 001
D: 000
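
The following Python sketch (illustration only, not the lecture's code) follows the procedure above using a heap. Applied to the A, B, C, D source it reproduces the codeword lengths 1, 2, 3, 3; the exact 0/1 labels are arbitrary and may differ from the slide.

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Build a Huffman code from a dict {symbol: probability or count}."""
    tiebreak = count()  # keeps heap entries comparable when weights tie
    # Each heap entry: (weight, tiebreaker, {symbol: partial codeword})
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w0, _, code0 = heapq.heappop(heap)   # smallest remaining weight
        w1, _, code1 = heapq.heappop(heap)   # next smallest
        merged = {s: "0" + c for s, c in code0.items()}
        merged.update({s: "1" + c for s, c in code1.items()})
        heapq.heappush(heap, (w0 + w1, next(tiebreak), merged))
    return heap[0][2]

probs = {"A": 0.5, "B": 0.3, "C": 0.1, "D": 0.1}
code = huffman_code(probs)
avg_len = sum(p * len(code[s]) for s, p in probs.items())
print(code, avg_len)   # average length 1.7 bits; codeword lengths 1, 2, 3, 3
```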
Huffman Coding, pg. 3

In the previous example, we used probabilities. We may directly
use event counts.

Example: Consider 8 symbols, and suppose we have counted
how many times they have occurred in an output sample.
S1: 28, S2: 25, S3: 20, S4: 16, S5: 15, S6: 8, S7: 7, S8: 5

We may derive the Huffman Tree (Exercise will be done on
whiteboard)

The corresponding length vector is (2,2,3,3,3,4,5,5)

The average codelength is 2.83. If we had used a fully balanced
tree representation (i.e. the straightforward fixed-length representation)
we would have had an average codelength of 3.
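
Continuing the sketch from the previous slide (illustration only), feeding in these counts reproduces the length vector and the 2.83 average:

```python
counts = {"S1": 28, "S2": 25, "S3": 20, "S4": 16,
          "S5": 15, "S6": 8, "S7": 7, "S8": 5}
code = huffman_code(counts)          # huffman_code defined in the earlier sketch
lengths = [len(code[s]) for s in sorted(counts)]
avg = sum(counts[s] * len(code[s]) for s in counts) / sum(counts.values())
print(lengths, round(avg, 2))        # [2, 2, 3, 3, 3, 4, 5, 5] and 2.83
```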
Huffman Coding, pg. 4

We would like to quantify the average number of bits needed in
terms of entropy.

Theorem: Let L be the average number of bits per output for
Huffman encoding of a random variable X, then
HX   L  HX   1,
L   pxlx
x
Here, lx =length of codeword assigned to symbol x.

Example: Let’s look back at the 4 symbol example
HX  .5 log 2 (0.5)  .3 log 2 (0.3)  .1log 2 (0.1)  .1log 2 (0.1)  1.685
Our average codelength was 1.7 bits.
Huffman Coding, pg. 5

An interesting and useful question is: What if I use the wrong
distribution when calculating the code? How badly will my
code perform?

Suppose the true distribution is px, and you used another
distribution to find the lengths lx. Define the auxiliary
distribution qx = 2^(−lx).

Theorem: If we code the source X with lx instead of the correct
Huffman code, then the resulting average codelength will
satisfy:
HX   Dp q    p x l x  HX   Dp q   1
x
where the Kullback-Leibler Divergence D(p||q) is
Dp q    p x log 2
x
px
qx
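
As a small illustration (not from the lecture), the sketch below evaluates this bound for the four-symbol source when the code lengths are chosen for a uniform distribution rather than the true one. The choice of the “wrong” lengths (2, 2, 2, 2) is just an assumption for the example.

```python
import math

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) * log2(p(x) / q(x))."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

def H(p):
    return -sum(px * math.log2(px) for px in p if px > 0)

# True distribution of the 4-symbol source, with lengths designed for a
# (wrong) assumed uniform distribution, i.e. 2 bits per symbol.
p = [0.5, 0.3, 0.1, 0.1]
lengths = [2, 2, 2, 2]
q = [2 ** (-l) for l in lengths]          # auxiliary distribution q_x = 2^(-l_x)

avg_len = sum(px * l for px, l in zip(p, lengths))
lower = H(p) + kl_divergence(p, q)
print(round(lower, 3), avg_len, round(lower + 1, 3))
# H+D <= average length < H+D+1 (here the lower bound is met with equality)
```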
Another way to look at cryptography, pg. 1

So far in class, we have looked at the security problem from an
algorithm point-of-view (DES, RC4, RSA,…).

But why build these algorithms? How can we say we are doing
a good job?

Enter information theory and its relationship to ciphers…

Suppose we have a cipher with possible plaintexts P, ciphertexts
C, and keys K.
– Suppose that a plaintext P is chosen according to a probability law.
– Suppose the key K is chosen independent of P
– The resulting ciphertexts have various probabilities depending on
the probabilities for P and K.
Another way to look at cryptography, pg. 2

Now, enter Eve… She sees the ciphertext C and several security
questions arise:
– Does she learn anything about P from seeing C?
– Does she learn anything about the key K from seeing C?

Thus, our questions are associated with H(P|C) and H(K|C).

Ideally, we would like for the uncertainty to not decrease, i.e.
H(P | C) = H(P)
H(K | C) = H(K)
Another way to look at cryptography, pg. 3

Example: Suppose we have three plaintexts {a,b,c} with
probabilities {0.5, 0.3, 0.2}. Suppose we have two keys k1 and
k2 with probabilities 0.5 and 0.5. Suppose there are three
ciphertexts U,V,W.
Ek1(a)=U
Ek2(a)=U
Ek1(b)=V
Ek2(b)=W
Ek1(c)=W
Ek2(c)=V
We may calculate probabilities of the ciphertexts
pC(U) = pK(k1) pP(a) + pK(k2) pP(a) = 0.50
Similarly we get pC(V)=0.25 and pC(W)=0.25
Another way to look at cryptography, pg. 4
If Eve observes the ciphertext U, then she knows the
plaintext was “a”.
We may calculate the conditional probabilities:
pP(b|V) = pP,C(b,V) / pC(V) = pP(b) pK(k1) / pC(V) = (0.3)(0.5) / 0.25 = 0.6
Similarly we get pP(c|V)=0.4 and pP(a|V)=0. Also pP(a|W)=0 ,
pP(b|W)=0.6 , pP(c|W)=0.4.
What does this tell us? Remember, the original plaintext
probabilities were 0.5, 0.3, and 0.2. So, if we see a ciphertext,
then we may revise the probabilities… Something is “learned”
Another way to look at cryptography, pg. 5
We use entropy to quantify the amount of information that is
learned about the plaintext given the ciphertext is observed.
H(P)  0.5 log 2 (0.5)  0.3 log 2 (0.3)  0.2 log 2 (0.2)
The conditional entropy of P given C is
H ( P | C)  

p
C
x{a , b ,c} c{ U , V , W}
(c)p P ( x | c) log 2 p P ( x | y)  0.485
Thus an entire bit of information is revealed just by observing the
ciphertext!
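
The numbers above can be reproduced with a short Python sketch (illustration only, not the lecture's code), using the encryption table and probabilities from the example.

```python
import math
from collections import defaultdict

p_P = {"a": 0.5, "b": 0.3, "c": 0.2}
p_K = {"k1": 0.5, "k2": 0.5}
E = {("k1", "a"): "U", ("k2", "a"): "U",
     ("k1", "b"): "V", ("k2", "b"): "W",
     ("k1", "c"): "W", ("k2", "c"): "V"}

# Joint distribution of (plaintext, ciphertext), with K chosen independently of P
p_PC = defaultdict(float)
for (k, x), c in E.items():
    p_PC[(x, c)] += p_K[k] * p_P[x]

# Marginal distribution of the ciphertext
p_C = defaultdict(float)
for (x, c), p in p_PC.items():
    p_C[c] += p

H_P = -sum(p * math.log2(p) for p in p_P.values())
H_P_given_C = -sum(p * math.log2(p / p_C[c]) for (x, c), p in p_PC.items() if p > 0)
print(round(H_P, 3), round(H_P_given_C, 3))   # about 1.485 and 0.485
```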
Perfect Secrecy and Entropy

The previous example gives us the motivation for the
information-theoretic definition of security (or “secrecy”)

Definition: A cryptosystem has perfect secrecy if H(P|C)=H(P).

Theorem: The one-time pad has perfect secrecy.

Proof: See the book for the details. The basic idea is to show that each
ciphertext occurs with equal likelihood. We then use
manipulations like:
H(P, K, C) = H(P, K) = H(P) + H(K)
H(P, K, C) = H(P, C) = H(C) + H(P | C)
Why?
Setting these two equal and using H(K) = H(C) gives the result.
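
As an informal illustration of the theorem (not the book's proof), the sketch below builds a one-time pad over 2-bit blocks with a uniform, independent key and checks numerically that H(P|C) = H(P); the skewed plaintext distribution is made up for the example.

```python
import math
from collections import defaultdict

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# One-time pad over 2-bit blocks: C = P XOR K, K uniform and independent of P
p_P = {0b00: 0.5, 0b01: 0.25, 0b10: 0.15, 0b11: 0.10}   # hypothetical plaintext distribution
p_K = {k: 0.25 for k in range(4)}                        # uniform key

p_PC = defaultdict(float)
for x, px in p_P.items():
    for k, pk in p_K.items():
        p_PC[(x, x ^ k)] += px * pk

p_C = defaultdict(float)
for (x, c), p in p_PC.items():
    p_C[c] += p

H_P = H(p_P.values())
H_P_given_C = -sum(p * math.log2(p / p_C[c]) for (x, c), p in p_PC.items())
print(round(H_P, 4), round(H_P_given_C, 4))
# both about 1.7427 bits: observing C leaves the uncertainty in P unchanged
```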