probability distribution
Probability and Distributions
Deterministic vs. Random Processes
• In deterministic processes, the outcome can be
predicted exactly in advance
• E.g., Force = mass x acceleration. If we are given values
for mass and acceleration, we know the value of force
exactly
• In random processes, the outcome is not known
exactly, but we can still describe the probability
distribution of possible outcomes
• E.g., 10 coin tosses: we don’t know exactly how many
heads we will get, but we can calculate the probability of
getting a certain number of heads
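As an illustration of the coin-toss example, here is a minimal sketch that assumes a fair coin and independent tosses and uses the binomial formula. The function name `prob_heads` is a hypothetical helper for this example.

```python
from math import comb


def prob_heads(k: int, n: int = 10, p: float = 0.5) -> float:
    """Probability of exactly k heads in n independent tosses.

    Assumes each toss has the same head probability p (binomial formula).
    """
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability of each possible number of heads in 10 tosses.
distribution = {k: prob_heads(k) for k in range(11)}

for k, prob in distribution.items():
    print(f"P({k} heads) = {prob:.4f}")
```

The individual outcome (how many heads) stays unknown, but the probabilities of all possible outcomes are fully described and add up to 1.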
Events
• An event is an outcome or a set of outcomes of a
random process
Example: Tossing a coin three times
Event A = getting exactly two heads = {HTH, HHT, THH}
Example: Picking real number X between 1 and 20
Event A = chosen number is at most 8.23 = {X ≤ 8.23}
Example: Rolling a fair die
Event A = result is an even number = {2, 4, 6}
• Notation: P(A) = Probability of event A
Probability Rule 1:
0 ≤ P(A) ≤ 1 for any event A
Sample Space
• The sample space S of a random process is the set
of all possible outcomes
Example: one coin toss
S = {H,T}
Example: three coin tosses
S = {HHH, HTH, HHT, TTT, HTT, THT, TTH, THH}
Example: roll a six-sided die
S = {1, 2, 3, 4, 5, 6}
Example: Pick a real number X between 1 and 20
S = all real numbers between 1 and 20
• Probability Rule 2: The probability of the
whole sample space is 1
P(S) = 1
Equally Likely Outcomes Rule
• If all possible outcomes from a random process have
the same probability, then
• P(A) = (# of outcomes in A)/(# of outcomes in S)
• Example: One Die Rolled
P(even number) = |{2,4,6}| / |{1,2,3,4,5,6}| = 3/6 = 1/2
• Note: the equally likely outcomes rule only works if the number
of outcomes is finite, so that they can be counted
• E.g., picking any real number between 0 and 1 has infinitely many
possible outcomes. Impossible to count them all!
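The equally likely outcomes rule can be checked by listing a finite sample space and counting. The sketch below does this for the three-coin-toss and die examples; the helper name `probability` is an assumption made for illustration.

```python
from fractions import Fraction
from itertools import product


def probability(event, sample_space):
    """Equally likely outcomes rule: (# outcomes in A) / (# outcomes in S)."""
    return Fraction(len(event), len(sample_space))


# Sample space for three coin tosses: HHH, HHT, ..., TTT (8 outcomes).
three_tosses = ["".join(t) for t in product("HT", repeat=3)]
two_heads = [o for o in three_tosses if o.count("H") == 2]
p_two_heads = probability(two_heads, three_tosses)

# Sample space for one roll of a six-sided die.
die = [1, 2, 3, 4, 5, 6]
even = [x for x in die if x % 2 == 0]
p_even = probability(even, die)

print(p_two_heads, p_even)
```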
Combinations of Events
• The complement A^c of an event A is the event that A does
not occur
• Probability Rule 3:
P(A^c) = 1 - P(A)
• The union of two events A and B is the event that either A
or B or both occurs
• The intersection of two events A and B is the event that
both A and B occur
[Figure: Venn diagrams showing event A, the complement of A, the union of A and B, and the intersection of A and B]
Disjoint Events
• Two events are called disjoint if they can not happen
at the same time
• Events A and B being disjoint means that the intersection of
A and B is empty
• Example: coin is tossed twice
• S = {HH,TH,HT,TT}
• Events A={HH} and B={TT} are disjoint
• Events A={HH,HT} and B = {HH} are not disjoint
• Probability Rule 4: If A and B are disjoint events
then
P(A or B) = P(A) + P(B)
Independent events
• Events A and B are independent if knowing that A occurs
does not affect the probability that B occurs
• Example: tossing two coins
Event A = first coin is a head
Event B = second coin is a head
Independent
• Disjoint events (each with nonzero probability) cannot be independent!
• If A and B cannot occur together (disjoint), then knowing that A
occurs does change the probability that B occurs: it drops to zero
• Probability Rule 5 (multiplication rule for independent events):
If A and B are independent
P(A and B) = P(A) x P(B)
Example: P(2 H in two tosses) = 0.5 x 0.5 = 0.25
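Rules 3 to 5 can be verified by enumeration on the two-coin-toss sample space. This is a small sketch assuming equally likely outcomes; the function `P` and the event names are chosen here for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space for two coin tosses: all four ordered pairs are equally likely.
S = set(product("HT", repeat=2))


def P(event):
    """Probability of an event (a set of outcomes) under equally likely outcomes."""
    return Fraction(len(event & S), len(S))


A = {o for o in S if o[0] == "H"}  # first coin is a head
B = {o for o in S if o[1] == "H"}  # second coin is a head
C = {("H", "H")}
D = {("T", "T")}

rule3_holds = P(S - A) == 1 - P(A)          # complement rule
rule4_holds = P(C | D) == P(C) + P(D)       # C and D are disjoint
rule5_holds = P(A & B) == P(A) * P(B)       # A and B are independent
# Disjoint events with nonzero probability are not independent:
disjoint_not_independent = P(C & D) != P(C) * P(D)

print(rule3_holds, rule4_holds, rule5_holds, disjoint_not_independent)
```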
Distributions
• The magnitude of an event will vary
over a range of values with time. This
variation can be described by some
type of distribution function.
– Frequency
– Cumulative
Frequency Distribution
• A frequency distribution is an arrangement of the
values that one or more variables take in a sample.
Each entry in the table contains the frequency or
count of the occurrences of values within a particular
group or interval.
Cumulative Distribution Function (CDF)
• The CDF gives the probability that the variable X takes on a
value less than or equal to a given number x: F(x) = P(X ≤ x). This
may also be known as the "area so far" function.
[Figure: CDF of a normal distribution. The median flow is at the 0.5 value on the CDF.]
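To make the frequency and cumulative views concrete, here is a minimal sketch on a small made-up sample. The `flows` values are hypothetical, and `empirical_cdf` is an illustrative helper, not a function from any particular package.

```python
from collections import Counter

# Hypothetical sample of 10 flow observations (made-up values).
flows = [12, 15, 15, 18, 20, 20, 20, 25, 30, 45]

# Frequency distribution: count of occurrences of each value.
frequency = Counter(flows)


def empirical_cdf(x, data):
    """Fraction of observations less than or equal to x."""
    return sum(1 for v in data if v <= x) / len(data)


for value in sorted(frequency):
    print(value, frequency[value], empirical_cdf(value, flows))
```

In this sample the cumulative fraction first reaches 0.5 at the value 20, which is also the sample median.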
Probability Distribution
• A probability is a numerical value that measures the
uncertainty that a particular event will occur. The
probability of an event ordinarily represents the
proportion of times under identical circumstances that
the outcome can be expected to occur.
• A probability distribution of a random variable X
provides a probability for each possible value. Those
probabilities must sum to 1, and they are denoted by:
P[X = x] where x represents any one of the possible
values that the random variable may assume.
Types of Distributions
• Discrete (binary, nominal, ordinal):
– Bernoulli
– Binomial
– Poisson
– Geometric
• Continuous distributions (interval, ratio):
– Uniform
– Normal (Gaussian)
– Gamma
– Chi Square
– Student t
Statistics of a Distribution
• Central Value
– Mean
– Median
– Mode
• Variability
– Min, Max and Range
– Variance
– Standard Deviation
– Coefficient of Variation (CV) - a measure of dispersion of a
probability distribution (Standard Deviation / Mean)
• Shape
- Skewness - a measure of symmetry
- Kurtosis - a measure of whether the data are peaked or flat
relative to a normal distribution.
Basic Statistics
• Mean - $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (Excel function: AVERAGE)
n = number of observations
x_i = observation i
• Variance - $S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{X})^2$ (Excel function: VAR)
• Standard Deviation - $S = \sqrt{S^2}$ (Excel function: STDEV)
• Coefficient of Variation - $CV = S / \bar{X}$
• Skew Coefficient - $g = \frac{n}{(n-1)(n-2)} \cdot \frac{\sum_{i=1}^{n} (x_i - \bar{X})^3}{S^3}$ (Excel function: SKEW)
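The basic statistics can be computed directly from their definitions. The sketch below uses Python's standard `statistics` module for the mean, sample variance (n - 1 denominator) and standard deviation, and a hand-written `skew_coefficient` following the n / ((n - 1)(n - 2)) form. The data values are made up for illustration.

```python
import statistics


def skew_coefficient(xs):
    """Skew coefficient g = n / ((n-1)(n-2)) * sum((x - mean)^3) / S^3."""
    n = len(xs)
    mean = sum(xs) / n
    s = statistics.stdev(xs)  # sample standard deviation (n - 1 denominator)
    return n / ((n - 1) * (n - 2)) * sum((x - mean) ** 3 for x in xs) / s**3


data = [2.0, 4.0, 4.0, 5.0, 10.0]  # hypothetical observations

mean = statistics.mean(data)          # sum of x_i divided by n
variance = statistics.variance(data)  # sample variance S^2
std_dev = statistics.stdev(data)      # S = sqrt(S^2)
cv = std_dev / mean                   # coefficient of variation
skew = skew_coefficient(data)

print(mean, variance, std_dev, cv, round(skew, 4))
```

The positive skew coefficient reflects the single large value (10) pulling the right tail out.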
Other Metrics
• Central Tendency
– Mean
– Median
• Point in the distribution where half of the values in the distribution
lie below the point, and half lie above the point
– Mode
• Value of x at which the distribution is at its maximum
Continuous Uniform Distribution
[Figure: two charts titled "Uniform", each plotting Value (0 to 1.2) against Event]
All events within a range have an equal chance of occurrence.
Probability density function
Used in stochastic modeling
[Figure panels: Frequency and Cumulative]
Normal Distribution
• Symmetrical – equal number of events on
either side of the mean value.
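• Mean, median and mode values are equal.
• f(x) = $\frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2 / (2\sigma^2)}$, where μ is the mean and σ is the standard deviation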
Gamma Distribution
• A skewed distribution, not symmetric.
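• Mean, median and mode are not equal.
• f(x, k, Θ) = $\frac{x^{k-1} e^{-x/\Theta}}{\Gamma(k)\,\Theta^{k}}$ for x > 0, where k is the shape parameter and Θ is the scale parameter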
Inference
• Most spatial analysis is based on comparing
sample events to theoretical distributions.
• With a normal distribution
– +/- 1 standard deviation = 0.68 of the events
– +/- 2 standard deviations = 0.955 of the events
– +/- 3 standard deviations = 0.997 of the events
• P(x > +3SD) = 0.0015
• Z statistic – normal deviate transformation
– Z = (X – Expected Mean of X) / Expected SD of X
– Z = (10 – 5) / 1.5 = +3.33
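The coverage figures and the Z example can be reproduced with the standard normal CDF, written here in terms of `math.erf`. This is a sketch; `normal_cdf`, `within` and `z_score` are illustrative helper names.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, P(Z <= z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def within(k: float) -> float:
    """Share of events within +/- k standard deviations of the mean."""
    return normal_cdf(k) - normal_cdf(-k)


def z_score(x: float, mean: float, sd: float) -> float:
    """Normal deviate transformation: Z = (X - mean) / SD."""
    return (x - mean) / sd


for k in (1, 2, 3):
    print(k, round(within(k), 4))

upper_tail_3sd = 1.0 - normal_cdf(3.0)
z = z_score(10, 5, 1.5)
print(round(upper_tail_3sd, 5), round(z, 2))
```

The exact upper tail beyond +3 SD is about 0.00135; the 0.0015 quoted above follows from the rounded 0.997 coverage.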
Nearest Neighbor Analysis
Nearest neighbor analysis examines the distances between
each point and the closest point to it, and then
compares these to expected values for a random
sample of points from a CSR (complete spatial
randomness) pattern. CSR is generated by means of
two assumptions: 1) that all places are equally likely to
be the recipient of a case (event) and 2) all cases are
located independently of one another.
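The mean nearest neighbor distance = $\bar{d} = \frac{1}{N}\sum_{i=1}^{N} d_i$
where N is the number of points and d_i is the nearest
neighbor distance for point i.
The expected value of the nearest neighbor distance in a
random pattern = $E(d_i) = 0.5\sqrt{A/N} + (0.0514 + 0.041/\sqrt{N})\,B/N$
where A is the area and B is the length of the perimeter of the
study area.
The variance = $\mathrm{Var}(d) = 0.070\,A/N^2 + 0.037\,B\sqrt{A/N^5}$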
Nearest Neighbor Distance
R = Mean Observed NND / Expected Mean NND for Random Points
[Figure: example point patterns for R < 1, R = 1 and R > 1]
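And the Z statistic = $Z = \frac{\bar{d} - E(d_i)}{\sqrt{\mathrm{Var}(d)}}$, where $\bar{d}$ is the observed mean nearest neighbor distance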
This approach assumes:
The study area is a regular rectangle or square. The equations
for the expected mean and variance cannot be used for
irregularly shaped study areas. Area (A) is calculated as (Xmax –
Xmin) * (Ymax – Ymin), where these represent the study
area boundaries.
R statistic = Observed Mean d / Expected d
R = 1 random, R → 0 clustered, R → 2+ uniform
2 x 0.5 study area: A = 1, B = 5, E(di) = 0.05277, Var(d) = 8.85 x 10^-6
1 x 1 study area: A = 1, B = 4, E(di) = 0.05222, Var(d) = 8.48 x 10^-6
2 x 2 study area: E(di) = 0.10444
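The worked values above can be reproduced with the boundary-corrected formulas commonly used for nearest neighbor analysis in a rectangular study area, provided N = 100 points is assumed. Neither N nor the coefficients are stated in this transcript, so both are assumptions in this sketch, as are the function names.

```python
import math


def expected_nnd(area: float, perimeter: float, n: int) -> float:
    """Expected mean nearest neighbor distance under CSR (assumed coefficients)."""
    return 0.5 * math.sqrt(area / n) + (0.0514 + 0.041 / math.sqrt(n)) * perimeter / n


def variance_nnd(area: float, perimeter: float, n: int) -> float:
    """Variance of the mean nearest neighbor distance under CSR (assumed coefficients)."""
    return 0.070 * area / n**2 + 0.037 * perimeter * math.sqrt(area / n**5)


def r_statistic(observed_mean: float, area: float, perimeter: float, n: int) -> float:
    """R = observed mean NND / expected mean NND for random points."""
    return observed_mean / expected_nnd(area, perimeter, n)


def z_statistic(observed_mean: float, area: float, perimeter: float, n: int) -> float:
    """Z = (observed mean NND - expected mean NND) / sqrt(variance)."""
    return (observed_mean - expected_nnd(area, perimeter, n)) / math.sqrt(
        variance_nnd(area, perimeter, n)
    )


N = 100  # assumption: not stated in the source, chosen because it matches the values
for label, area, perimeter in [("2 x 0.5", 1, 5), ("1 x 1", 1, 4), ("2 x 2", 4, 8)]:
    print(label, expected_nnd(area, perimeter, N), variance_nnd(area, perimeter, N))
```

The comparison shows how the same number of points in the same area gives a different expected distance when the perimeter changes, which is why the shape of the study area matters.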
Wilderness Campsites
Real world study areas are complex and violate the
assumptions of most equations for expected values.
Solution
* Simulate randomization using Monte Carlo
Methods.
Compare simulated distribution to observed.
* If possible use the “true” area and perimeter to
compute the expected value.
* Software that does not ask for area/perimeter or a
shapefile of the study area will assume a
rectangle based on the minimum and
maximum coordinates.
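A Monte Carlo comparison can be sketched as follows: generate many random (CSR) patterns with the same number of points, compute the mean nearest neighbor distance for each, and see where the observed value falls in the simulated distribution. This sketch samples from a rectangle for simplicity; for a real study area the random points would need to be drawn inside the true boundary. All names and the example pattern are hypothetical.

```python
import math
import random


def mean_nnd(points):
    """Mean distance from each point to its closest other point."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        total += min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
    return total / len(points)


def random_points(n, xmin, xmax, ymin, ymax, rng):
    """n points placed uniformly and independently in a rectangle (CSR)."""
    return [(rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)) for _ in range(n)]


def monte_carlo_p_value(observed, n_sims, bounds, rng):
    """Share of simulated patterns whose mean NND is as small as the observed one."""
    obs = mean_nnd(observed)
    sims = [mean_nnd(random_points(len(observed), *bounds, rng)) for _ in range(n_sims)]
    as_small = sum(1 for s in sims if s <= obs)
    return obs, (as_small + 1) / (n_sims + 1)


rng = random.Random(42)
# Hypothetical observed pattern: 30 points packed near the center of a unit square.
clustered = [
    (0.5 + rng.uniform(-0.02, 0.02), 0.5 + rng.uniform(-0.02, 0.02)) for _ in range(30)
]
obs, p = monte_carlo_p_value(clustered, 199, (0.0, 1.0, 0.0, 1.0), rng)
print(obs, p)
```

A small p-value here indicates that the observed points are closer together than almost all simulated random patterns, which points to clustering.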