Tirgul 8 - Probability
School of Computer Science and Engineering,
The Hebrew University of Jerusalem.
Date: Sunday, 1.5.11
Format:
◦ Half an hour during the lesson
◦ Level: easy to intermediate
◦ A choice of 5 out of 6 multiple-choice questions
◦ One (short) proof (no choice)
Material: everything up to and including AVL trees (i.e., up to and including week 8)
◦ All the relevant solutions will be uploaded to the site (right after the Passover break)
Weight: 20%, magen (counts only in your favor)
It will be: fine
Randomness is a good model for uncertainty.
Uncertainty has a negative connotation, but as we
will see later on, we can use it to our advantage.
Probability is the formalization of randomness,
and allows us to treat it mathematically.
In computer science, we use probability to:
◦ Define parts of problems which contain ‘uncertainty’
◦ Analyze probabilistic running time
◦ Achieve average-case performance on worst-case input
We also computationally generate randomness!
But first – what is probability?
Consider a toss of a die. Though we know what
the possible outcomes of the toss can be, we
don’t know which it will be.
Nonetheless, we are not completely clueless. For
instance, we know that if we toss a die 100
times, there is a ‘good chance’ the die will show
‘1’ at some point.
Since tossing a die is a process which is random
in nature, we will use it as a test case for
formalizing randomness as probability.
◦ Note: we will only discuss discrete probability for now.
Consider a toss of a die. We have mentioned
that we know the toss has 6 possible (basic)
outcomes:
◦ After the toss, the die will show 1.
◦ After the toss, the die will show 2.
◦ …
◦ After the toss, the die will show 6.
As a first step in formalization, we shall
create a set of these outcomes:
Ω = {1, 2, 3, 4, 5, 6}
Ω is called the Sample Space.
Nonetheless, there are other non-direct
outcomes, such as:
◦ After the toss the die showed an even number
◦ After the toss the die showed a number larger than 3
◦ After the toss the die showed either 1 or 2
Notice that these are actually subsets of Ω:
{2, 4, 6}, {4, 5, 6}, {1, 2}
Each subset is called an event.
Therefore, we will consider all possible subsets:
2^Ω = {A : A ⊆ Ω}
This set is called the sigma-algebra.
◦ The events {1}, {2}, … {6} are called elementary events.
Although we do not know the outcome of the toss,
we know that all numbers have the same ‘chance’ of
showing up.
Let us attempt to quantify this ‘chance’ notion as
probability (we will mark it Pr for short).
Since we are interested in the chance of
things happening:
◦ Define the probability that something will happen as 1.
◦ Define the probability that nothing will happen as 0.
◦ In our case, let’s divide the probability equally between all
6 basic outcomes. That is:
∀ i=1…6 Pr(roll=i) = ⅙
But what about events like parity - {2, 4, 6}?
It makes sense that:
Pr(roll=even) = ½
since:
Pr(roll=even) = Pr(roll = 2 or 4 or 6)
and {2, 4, 6} are exactly half of the outcomes.
This gives us the intuition to require:
Pr(roll = 2 or 4 or 6) =
Pr(roll=2)+Pr(roll=4)+Pr(roll=6)
Formally, we will define a probability function:
P: 2^Ω → ℝ
Such that:
1. ∀ event A ∈ 2^Ω, P(A) ≥ 0
2. P(Ω) = 1
3. ∀ disjoint events A, B ∈ 2^Ω, P(A ⊎ B) = P(A) + P(B)
In particular, for discrete sample spaces with a
uniform probability (like our die), we have:
∀ A ∈ 2^Ω, P(A) = |A| / |Ω|
Tossing 3 fair coins:
◦ What is the sample space?
Ω = {H, T}³
◦ What is the sigma-algebra?
2^Ω
◦ What is the probability function?
∀ ω ∈ Ω, P(ω) = ½⋅½⋅½ = 1/8
◦ Q: Why is it enough to define the probability only
on the elementary events?
◦ A: Because all events are disjoint unions of
elementary events.
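To make this concrete, here is a minimal Python sketch (ours, not from the tirgul; all names are our own) that enumerates Ω = {H, T}³ and computes P(A) = |A|/|Ω|:

from itertools import product
from fractions import Fraction

# Sample space for 3 fair coin tosses: all 8 sequences in {H, T}^3.
omega = list(product("HT", repeat=3))

def prob(event):
    """Uniform probability: P(A) = |A| / |Omega|."""
    return Fraction(len(event), len(omega))

# An elementary event has probability 1/8.
print(prob([("H", "H", "H")]))             # 1/8

# The event "at least two heads" is a subset of Omega.
at_least_two_heads = [w for w in omega if w.count("H") >= 2]
print(prob(at_least_two_heads))            # 1/2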
We toss a fair coin. If it lands on heads, we toss a
fair die. If it lands on tails, we draw a ball from a
bag that has 3 red balls and 2 yellow balls.
◦ What is the sample space?
Ω = ({H} × {1, 2, …, 6}) ∪ ({T} × {R, Y})
◦ What is the sigma-algebra?
2^Ω
◦ What is the probability function?
∀ i=1…6, P(H, i) = ½⋅⅙ = 1/12
P(T, R) = ½⋅3/5 = 3/10
P(T, Y) = ½⋅2/5 = 2/10
◦ Do all the probabilities sum up to 1? (Check: 6⋅1/12 + 3/10 + 2/10 = ½ + ½ = 1.)
A drunkard tries to walk on a straight line of
5 meters. With each step, he advances ½
meter, but deviates 10 cm. left or right from
the line with the same chance.
◦ What is the sample space?
Ω = {L, R} or {L, R}¹⁰ (the walk takes 5 / ½ = 10 steps)? We get to choose!
Notice that there is no uncertainty about the forward motion!
◦ What is the sigma-algebra?
2^Ω
◦ What is the probability function?
P(L) = P(R) = ½, or ∀ ω ∈ {L, R}¹⁰, P(ω) = 2⁻¹⁰ ?
(*) try it yourselves! http://www.maniacworld.com/walk_home_drunk.htm
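As our own illustration (not part of the original slides), here is a short Python sketch that samples one outcome from {L, R}¹⁰ and computes the lateral deviation in cm:

import random

STEPS, DEVIATION_CM = 10, 10

def drunken_walk():
    """Sample one outcome from {L, R}^10; each step is fair."""
    return [random.choice("LR") for _ in range(STEPS)]

def lateral_position_cm(walk):
    # Each 'R' moves +10 cm, each 'L' moves -10 cm off the line.
    return sum(DEVIATION_CM if s == "R" else -DEVIATION_CM for s in walk)

walk = drunken_walk()
print(walk, "->", lateral_position_cm(walk), "cm off the line")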
The sample space Ω is the set of all elementary
events.
The sigma algebra, which is (usually) all the
subsets of Ω, contains all of the events.
The probability P(⋅) is a special function that
tells us what are the chances of events
happening.
The sample space, sigma-algebra and
probability function define a probability
space.
Let’s go back to tossing a die. Say we play a
game where for each toss, we get as points
the outcome of the toss, and we want to get
as many points as possible.
What if instead of numbers, the faces of the
die had six drawings (the original slide used
picture glyphs; call them d₁, …, d₆):
{d₁, d₂, d₃, d₄, d₅, d₆}
◦ What is our sample space now?
◦ Has the probability function changed?
◦ How can we compute our score in the game?
Let’s give the formulation another try:
◦ Our sample space has as elementary events all the faces of
the die, regardless of what is drawn on them. Our sample
space would be:
Ω = {faces}
◦ For each face, we will assign a value. Formally, we will create a
value function X:
X: Ω → ℝ
This value function is called a random variable.
◦ Our original probability P will induce a new ‘probability’ on X:
PX : 2^X(Ω) → ℝ
◦ This is not really a probability function, since its domain is
not (subsets of) the sample space. We will call PX the
distribution of the random variable X.
(*) Note that we can actually treat X(Ω) as a new sample space, which will
make PX a true probability function. Distributions are actually a special
case of probability functions, which are from subsets of ℝ into ℝ.
For example, say we define:
◦ X(d₁) = X(d₂) = X(d₃) = 1
◦ X(d₄) = 2.5
◦ X(d₅) = X(d₆) = 10
Now, we get the distribution:
◦ PX(X=1) = |{d₁, d₂, d₃}|/|Ω| = ½
◦ PX(X=2.5) = |{d₄}|/|Ω| = ⅙
◦ PX(X=10) = |{d₅, d₆}|/|Ω| = ⅓
◦ ∀ x ∉ {1, 2.5, 10}, PX(X=x) = 0
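As an illustration (our own sketch; the face names d1…d6 are the placeholders from above), this is how the probability P on Ω induces the distribution PX of X:

from fractions import Fraction
from collections import defaultdict

# Placeholder names for the six drawings on the die faces.
faces = ["d1", "d2", "d3", "d4", "d5", "d6"]
P = {f: Fraction(1, 6) for f in faces}          # uniform P on Omega

X = {"d1": 1, "d2": 1, "d3": 1, "d4": 2.5, "d5": 10, "d6": 10}

# Push P forward through X: P_X(X = x) = sum of P(w) over w with X(w) = x.
P_X = defaultdict(Fraction)
for face in faces:
    P_X[X[face]] += P[face]

print(dict(P_X))   # P_X[1] == 1/2, P_X[2.5] == 1/6, P_X[10] == 1/3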
There are several interesting families of
distributions that are good to know.
For instance, consider our original die-toss
example (where Ω = {1,…,6}). If we define:
∀ ω ∈ Ω, X(ω) = ω
then, since all ω’s have the same probability, each value
of X has the same probability (in this case, ⅙).
We call this a uniform distribution U(⋅), since the
probability is partitioned uniformly between all
events.
There are many other families of distributions. We
might see some later on in the course.
A fair coin is tossed. You get 1 for heads and
0 for tails.
◦ What is the sample space? The prob. function?
Ω = {H, T}, P(H) = P(T) = ½
◦ How would you define the random variable?
X(H) = 1, X(T) = 0
◦ What is the distribution? How would it look like?
PX(X=1) = PX(X=0) = ½
[Bar chart of PX: two bars of height ½ at X = 0 and X = 1]
2 dice are tossed. We want their sum.
◦ What is the sample space? The prob. function?
Ω = {1,…, 6}², P(i,j) = 1/36
◦ How would you define the random variable?
X(i,j) = i+j
◦ What is the distribution? How would it look like?
[Bar chart of PX: a triangular shape over the sums 2,…,12, rising from PX(X=2) = 1/36 to a peak of PX(X=7) = 1/6 and back down to PX(X=12) = 1/36]
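A short sketch of ours that computes this distribution by counting pairs:

from fractions import Fraction
from collections import Counter

# Count, for each possible sum, how many of the 36 equally likely pairs give it.
counts = Counter(i + j for i in range(1, 7) for j in range(1, 7))
P_X = {s: Fraction(c, 36) for s, c in sorted(counts.items())}
print(P_X)   # P_X[7] == 1/6, P_X[2] == P_X[12] == 1/36

# The expectation of the sum (computed again later in the tirgul) is 7.
print(sum(s * p for s, p in P_X.items()))   # 7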
3 fair coins are tossed. We get 1 for all 3
coins being heads, and 0 otherwise.
◦ What is the sample space? The probability function?
Ω = {H, T}³,
P(HHH) = P(HHT) =…= P(TTT)= 1/8
◦ How would you define the random variable?
X(HHH) = 1, and X(ω) = 0 for every other ω ∈ Ω
◦ What is the distribution? How would it look like?
◦ PX(X=1) = 1/8,
PX(X=0) = 7/8
[Bar chart of PX: PX(X=0) = 7/8, PX(X=1) = 1/8]
Q: What Boolean function is represented here?
A: AND
You are a serious lottery addict. We want to
know how many losing cards you should fill
before finally winning.
◦ What is the sample space? The probability function?
Ω = {winning card, losing card}, P(winning) = 10^-7 = p
Ω‘ = ⋃_{n=1…∞} {winning card, losing card}^n
◦ How would you define the random variable?
X = number of tries until success
◦ What is the distribution? How would
it look like? Use common sense!
PX(X=k) = (1-p)^(k-1)⋅p
This is called a Geometric distribution
[Bar chart of PX with p = 0.4: bars decaying geometrically over k = 1, 2, 3, …]
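A sketch of ours for the geometric pmf and for sampling the number of tries until success:

import random

def geometric_pmf(k, p):
    """P_X(X = k) = (1-p)^(k-1) * p, for k = 1, 2, 3, ..."""
    return (1 - p) ** (k - 1) * p

def tries_until_success(p):
    """Draw cards until the first win; return how many draws it took."""
    k = 1
    while random.random() >= p:    # each draw wins with probability p
        k += 1
    return k

p = 0.4
print([round(geometric_pmf(k, p), 3) for k in range(1, 6)])  # 0.4, 0.24, ...
print(tries_until_success(p))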
A drunkard tries to walk on an (infinite)
straight line. With each step, he advances ½
meter, but deviates 10 cm. left or right from
the line with the same chance.
◦ What is the sample space? The prob. function?
Ω = {L, R}, P(L) = P(R) = ½
◦ How would you define the random variable?
Y(R) = 1, Y(L) = -1
Define Xn = position on the right-left axis = ∑_{i=1…n} Yᵢ
◦ What is the distribution?
PX(Xn=k) = PX(∑_{i=1…n} Yᵢ = k) = (1/2ⁿ)⋅C(n, (n+k)/2)
The 1/2ⁿ accounts for the equally likely orderings of L and R;
the binomial coefficient C(n, (n+k)/2) chooses which steps are +1,
so that there are exactly k more +1 steps than -1 steps.
(PX(Xn=k) = 0 when n+k is odd.)
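A sanity check of ours, comparing the formula with a simulation for n = 10 (even values of k, since odd k has probability 0):

import random
from math import comb
from collections import Counter

n, trials = 10, 100_000

def walk_position():
    # Sum of n independent +1/-1 steps, each with probability 1/2.
    return sum(random.choice((1, -1)) for _ in range(n))

empirical = Counter(walk_position() for _ in range(trials))
for k in (0, 2, 4):
    exact = comb(n, (n + k) // 2) / 2 ** n
    print(k, round(empirical[k] / trials, 3), round(exact, 3))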
For each web page, we want to know how
many web pages have links to it.
◦ What is the sample space? The probability function?
Ω = {web pages},
P ~ U(|web pages|) (a uniform distribution over all pages)
◦ How would you define the random variable?
X(web page) = number of incoming links
◦ What is the distribution? How would it look like? Use
common sense!
There are a zillion incoming links to Google, a few
sites like Yahoo! and cnn.com with a lot of links, a few
more sites with fewer links, and so on, until there are a
zillion sites with one or no incoming links, like private
homepages.
One possible explanation for this is ‘rich get richer’ –
the more popular a web page, the more probable it is
that other web pages will link to it.
Research shows that if new links are distributed
proportionally to the already existing incoming-links
distribution, then the distribution is preserved.
◦ What is the distribution? How would it look like?
A known distribution that preserves this quality is the
power-law distribution, namely:
PX(X=x) ∝ x^(-α)
This is a good example of a micro-scale random
process (a web site adding new links) that creates a
macro-scale distribution (the distribution of the
number of incoming links of all web pages).
[Bar chart of PX with α = 2: a heavy-tailed curve decaying over x = 1, 2, …, 11]
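A small sketch of ours of a power-law pmf truncated to x = 1…N and normalized (the true power law is over all of ℕ; truncation is our simplification):

def power_law_pmf(alpha, n_max):
    """P_X(X = x) proportional to x^(-alpha), normalized over x = 1..n_max."""
    weights = [x ** -alpha for x in range(1, n_max + 1)]
    z = sum(weights)                      # normalization constant
    return [w / z for w in weights]

pmf = power_law_pmf(alpha=2, n_max=11)
print([round(p, 3) for p in pmf])         # heavy head at x = 1, long tail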
Let’s go back to the die-tossing game.
Say we toss the die 100 times and get points
accordingly. What is the average points per
toss?
This is easy! Just sum:
(1/100)⋅∑_tosses X(outcome of toss) = (1 + 1 + ⋯ + 2 + 2 + ⋯ + 6 + 6 + 6)/100
But will this be the average of every 100
tosses?
◦ No, since there is always uncertainty in dice
◦ Almost, since… well… let’s formalize first.
Let’s rearrange the summing of the average:
(1/100)⋅∑_tosses X(outcome)
= (1/100)⋅∑_{x∈outcomes} ∑_{tosses showing x} x
= (1/100)⋅∑_{x∈outcomes} x⋅#(x)
= ∑_{i=1…6} i⋅(#(i)/100)
where #(i) denotes the number of tosses that showed i.
If we think in probabilistic terms, each term
#(i)/100 ‘should’ be close to PX(X=i).
We shall define the expectation of the random
variable X to be:
E[X] = ∑_{x∈Im(X)} x⋅PX(X=x)
In our case:
E[X] = ∑_{i=1…6} i⋅PX(X=i) = 1⋅⅙ + ⋯ + 6⋅⅙ = 3.5
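A quick sketch of ours comparing the empirical average of 100 tosses to the expectation 3.5:

import random

tosses = [random.randint(1, 6) for _ in range(100)]
average = sum(tosses) / len(tosses)

# The expectation, computed from the distribution rather than from data.
expectation = sum(i * (1 / 6) for i in range(1, 7))

print(average, expectation)   # the average is usually close to 3.5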
Intuitively, expectation can be thought of as
the average of an infinite number of samples.
It can also be thought of as ‘what to expect’
from undrawn samples.
Statistical theory has theorems on the relation
between the (empirical, i.e. finite) average
and the (possibly infinite) expectation.
The notion of expectation is very handy. We
will use it plenty.
We flip a coin with probability p for heads. We
get 1 for heads and 0 for tails. What is the
expectation?
◦ 𝔼[X] = 1⋅PX(X=1)+0⋅PX(X=0) = 1⋅p+0⋅(1-p) = p
We toss 2 dice. Define X to be the sum, then:
◦ 𝔼[X] = 2⋅PX(X=2)+3⋅PX(X=3)+…+12⋅PX(X=12) =
= 2⋅1/36+3⋅1/18+4⋅1/12+…+12⋅1/36 = 7
We toss a fair coin. If it lands on heads, we toss a
fair die. If it lands on tails, we toss 2 dice. Define
X to be the sum of the dice we eventually roll.
What is the expectation?
◦ There are several ways to model this. We will show one.
◦ We have already calculated the expectation of both one
die (3.5) and two dice (7). Since there is probability 0.5
for each:
E[X] = ½E[one die] + ½E[two dice] = 5.25
Q: Is it a coincidence that the expectation of 2 dice is double
that of one die?
A: No. This is true because of the ‘linearity of expectation’,
which we might learn later on.
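To double-check the 5.25, here is a sketch of ours that enumerates one possible model of the sample space and computes E[X] directly from the definition:

from fractions import Fraction
from collections import defaultdict

# Elementary events: (H, d) for one die, (T, d1, d2) for two dice.
P = defaultdict(Fraction)
for d in range(1, 7):
    P[("H", d)] = Fraction(1, 2) * Fraction(1, 6)
for d1 in range(1, 7):
    for d2 in range(1, 7):
        P[("T", d1, d2)] = Fraction(1, 2) * Fraction(1, 36)

def X(w):
    # X = sum of the dice actually rolled (everything after the coin result).
    return sum(w[1:])

expectation = sum(X(w) * p for w, p in P.items())
print(expectation)   # 21/4 = 5.25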
We toss n coins, each having P(heads)=p.
For each coin we get 1 for heads and 0 for tails;
define X as the total number of heads.
What is the probability of X=k for any k?
PX(X=k) = C(n,k)⋅p^k⋅(1-p)^(n-k)
(C(n,k) is the binomial coefficient; this is the Binomial distribution.)
Taking n→∞ and p→0 while keeping λ = pn fixed, this converges to:
PX(X=k) = (λ^k/k!)⋅e^(-λ)
This is called a Poisson distribution.
What is the expectation of X?
E[X] = ∑_{k=0…∞} k⋅PX(k) = ∑_{k=0…∞} k⋅e^(-λ)⋅λ^k/k!
= e^(-λ)⋅∑_{k=1…∞} λ^k/(k-1)!
= λ⋅e^(-λ)⋅∑_{k=0…∞} λ^k/k!
= λ⋅e^(-λ)⋅e^λ = λ
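A numerical sketch of ours of the limit: with λ = pn held fixed, the binomial pmf approaches the Poisson pmf as n grows:

from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return lam ** k / factorial(k) * exp(-lam)

lam = 2.0
for n in (10, 100, 1000):
    p = lam / n                      # keep lambda = p*n fixed
    print(n, round(binomial_pmf(3, n, p), 4), round(poisson_pmf(3, lam), 4))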
Statistics and probability are closely related,
but:
◦ Statistics can be measured, and can help us analyze
things that have happened.
Examples of statistics: average, median, minimum
◦ Probability is theoretical and cannot be measured,
but can help us ‘guess’ (or predict) better by
describing the population from which the sample
came.
This is necessary since populations can’t always be
measured (for instance, infinite populations)
We can use statistics to fit a probabilistic model to
the measured reality.
For example, consider the statistic average.
◦ (since it is a simple calculation over measurements, average is
indeed a statistic).
An average describes some property of a measured
sample (=reality), but does not tell us anything
(directly) about the whole world.
Statistical theory shows us that the average of a large
sample is very close to the expectation of the
population from which the sample was taken.
◦ This is called ‘the law of large numbers’
The expectation describes a property of the
distribution, which is a theoretical object. Knowing
the expectation of the distribution can help us
‘guess’ better.
Example:
◦ Say we take a sample of 100 people and measure
their height. We now have 100 numbers.
◦ Next, we calculate the average of these 100
numbers. Assume the average was 1.65 m.
◦ By the law of large numbers, we know that the
expectation of the population of heights (i.e. all
heights of all the people in the world) is close to
1.65 m.
◦ Given a new person, our best estimate of his height
would now be 1.65 m.
Same example, other way around:
◦ Say we want to estimate height of people.
◦ First, we model the probability space:
Each person is an elementary event
The probability of randomly drawing each person is uniform
We define h: {people} → ℝ⁺ to be the height of a person. This
induces a probability space over h with a distribution over
heights.
◦ Next, we assume that heights follow a normal
distribution.
◦ In order to estimate the expectation of the distribution, we
sample 100 people, measure their heights, and calculate
the average. Assume the average was 1.65 m.
◦ The law of large numbers tells us this average is close to
the expectation of the population.
◦ Given a new person, our best estimate of his height
would now be 1.65 m.
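To close the loop, a toy sketch of ours of the law of large numbers; the normality assumption and all numbers are made up for illustration:

import random

# Hypothetical population model: heights ~ Normal(1.65 m, 0.10 m).
MEAN_M, STD_M = 1.65, 0.10

def sample_average(n):
    """Average height of a random sample of n people."""
    return sum(random.gauss(MEAN_M, STD_M) for _ in range(n)) / n

for n in (10, 100, 10_000):
    print(n, round(sample_average(n), 3))   # tends toward 1.65 as n grows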