Transcript Handout 2

OPIM 5103
Descriptive Statistics
Random Sampling
Intro to Probability and Discrete Distributions
Jan Stallaert
Professor of OPIM
Measures of Central Tendency
• Average (mean): $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ (sample), $\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$ (population)
• Median
• Mode
• Geometric mean: $\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$
Mean (Arithmetic Mean)
• Mean (arithmetic mean) of data values
– Sample mean (sample size n):
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$
– Population mean (population size N):
$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$
Mean (Arithmetic Mean)
• The most common measure of central tendency
• Affected by extreme values (outliers)
[Figure: two dot plots. First, on an axis from 0 to 10: Mean = 5. Second, on an axis extended to 14 to show an extreme value: Mean = 6.]
Excel function: =average(range)
Median
• Robust measure of central tendency
• Not affected by extreme values
[Figure: two dot plots, on axes from 0 to 10 and from 0 to 14: Median = 5 in both.]
• In an ordered array, the median is the “middle” number
Excel function: =median(range)
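The contrast between the mean and the median can be checked with a few lines of Python. This is a minimal sketch: the data points are hypothetical, chosen only so that the results match the values on the slides (Mean = 5 and 6, Median = 5 in both cases), since the original plots did not survive.

```python
from statistics import mean, median

# Hypothetical data chosen so the results match the slide values
# (Mean = 5 and Mean = 6, Median = 5 in both cases).
without_outlier = [1, 3, 5, 7, 9]
with_outlier = [1, 3, 5, 7, 14]  # largest value replaced by an extreme value

for label, data in [("without outlier", without_outlier),
                    ("with outlier", with_outlier)]:
    print(label, "mean =", mean(data), "median =", median(data))
```

The extreme value pulls the mean up while the median stays where it was.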
Measures of Variation
• Range
• Interquartile Range
• Variance
– Population Variance
– Sample Variance
• Standard Deviation
– Population Standard Deviation
– Sample Standard Deviation
• Coefficient of Variation
Example
[Figure: four histograms, each plotting Frequency against Bins, with a secondary percentage axis from .00% to 120.00%.]
Range
• Measure of variation
• Difference between the largest and the smallest observations:
$\text{Range} = X_{\text{Largest}} - X_{\text{Smallest}}$
• Ignores the way in which data are distributed
[Figure: two dot plots on an axis from 7 to 12; in both, Range = 12 - 7 = 5.]
Quartiles
• Split Ordered Data into 4 Quarters
25% | Q1 | 25% | Q2 | 25% | Q3 | 25%
• Q2 = Median, a measure of central tendency
Excel function: =quartile(range, number)
=0: minimum value
=1: Q1
…
=4: maximum value
Interquartile Range
• Measure of spread/dispersion
• Also known as midspread
– Spread in the middle 50%
• Difference between the third and first quartiles: IQR = Q3 - Q1
• Not affected by extreme values
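The range, quartiles, and interquartile range can be computed together. The sketch below uses hypothetical data (the slides keep only an axis from 7 to 12). The choice of `method="inclusive"` is an assumption intended to mirror the interpolation used by Excel's QUARTILE function; other quartile conventions give slightly different values.

```python
from statistics import quantiles

# Hypothetical data; the slides give only the axis (7 to 12), not the points.
data = [7, 8, 9, 10, 11, 12]

data_range = max(data) - min(data)

# method="inclusive" interpolates between data points; this is assumed to
# match Excel's QUARTILE function and is worth checking against Excel.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print("Range =", data_range)
print("Q1 =", q1, "Q2 =", q2, "Q3 =", q3)
print("IQR =", iqr)
```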
Variance
• Important measure of variation
• Shows variation about the mean
– Sample variance:
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
• “Average of squared deviations from the mean”
• “Standard deviation” = square root of variance
Excel functions
• Variance
=VAR(range)
• Standard Deviation
=STDEV(range)
Comparing Standard Deviations
[Figure: three dot plots on an axis from 11 to 21.]
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = .9258
Data C: Mean = 15.5, s = 4.57
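The sample variance and standard deviation can be reproduced in Python. The three data sets below are assumptions: the original dot plots are lost, so the values were chosen to reproduce the stated mean (15.5) and standard deviations. They are not necessarily the points shown on the slide.

```python
from statistics import mean, stdev, variance

# Assumed data sets: only the axis (11 to 21) survives from the slide, so these
# values are chosen to reproduce the stated means and standard deviations.
datasets = {
    "A": [11, 12, 13, 16, 16, 17, 18, 21],
    "B": [14, 15, 15, 15, 16, 16, 16, 17],
    "C": [11, 11, 11, 12, 19, 20, 20, 20],
}

for name, values in datasets.items():
    # variance() and stdev() divide by n - 1, like Excel's VAR and STDEV.
    print(name, "mean =", mean(values),
          "S^2 =", round(variance(values), 3),
          "S =", round(stdev(values), 4))
```

All three sets share the same mean, yet values clustered near the mean (B) give a small s, while values pushed toward the ends of the axis (C) give a large one.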
Coefficient of Variation
• Measures relative variation
• Always in percentage (%)
• Shows variation relative to mean
• Is used to compare two or more sets of data
measured in different units
• $CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\%$
Comparing Coefficient
of Variation
• Stock A:
– Average price last year = $50
– Standard deviation = $5
• Stock B:
– Average price last year = $100
– Standard deviation = $5
• Coefficient of variation:
– Stock A: $CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \left(\frac{\$5}{\$50}\right) \cdot 100\% = 10\%$
– Stock B: $CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \left(\frac{\$5}{\$100}\right) \cdot 100\% = 5\%$
Exploratory Data Analysis
• Box-and-whisker plot
– Graphical display of data using 5-number summary
X_smallest = 4, Q1 = 6, Median (Q2) = 8, Q3 = 10, X_largest = 12
Coefficient of Correlation
• Measures the strength of the linear relationship
between two quantitative variables
$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$
Features of
Correlation Coefficient
• Unit free
• Ranges between –1 and 1
• The closer to –1, the stronger the negative linear
relationship
• The closer to 1, the stronger the positive linear
relationship
• The closer to 0, the weaker any linear relationship
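The coefficient of correlation can be computed directly from its definition (cross-deviations divided by the square root of the product of the squared deviations). The sketch below uses hypothetical data to show the two extremes and one intermediate case.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample coefficient of correlation r for paired observations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return cross / sqrt(ss_x * ss_y)

# Hypothetical data for illustration.
x = [1, 2, 3, 4, 5]
r_up = correlation(x, [2, 4, 6, 8, 10])     # perfectly increasing line
r_down = correlation(x, [10, 8, 6, 4, 2])   # perfectly decreasing line
r_mixed = correlation(x, [2, 1, 4, 3, 5])   # positive but imperfect
print(r_up, r_down, r_mixed)
```

Because the deviations are divided by their own scale, r is unit free: rescaling X or Y by a positive constant leaves it unchanged.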
Scatter Plots of Data with Various
Correlation Coefficients
[Figure: five scatter plots of Y against X, with r = -1, r = -.6, r = 0, r = .6, and r = 1.]
Producing Data
• Sampling methods
• Survey Errors
Probability Sampling
• Subjects of the sample are chosen based on
known probabilities
Probability Samples
• Simple Random
• Systematic
• Stratified
• Cluster
Simple Random Samples
• Every individual or item from the frame has an
equal chance of being selected
• Selection may be with replacement or without
replacement
• Samples are obtained from a table of random numbers or a computer random number generator
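With a computer random number generator, both kinds of simple random sample take one call each. A minimal sketch, assuming a hypothetical frame of 64 numbered individuals and a fixed seed so the run is repeatable:

```python
import random

# Hypothetical frame of 64 numbered individuals.
frame = list(range(1, 65))
rng = random.Random(42)  # fixed seed so the sketch is repeatable

without_replacement = rng.sample(frame, k=8)   # no individual can repeat
with_replacement = rng.choices(frame, k=8)     # an individual may repeat

print(without_replacement)
print(with_replacement)
```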
Random Samples
Systematic Samples
• Decide on sample size: n
• Divide frame of N individuals into groups of k
individuals: k=N/n
• Randomly select one individual from the 1st
group
• Select every k-th individual thereafter
Example: N = 64, n = 8, k = 8
[Figure: the frame divided into groups of 8, with the first group marked.]
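The systematic procedure can be sketched in a few lines, using the slide's sizes (N = 64, n = 8, so k = 8). The function name and the numbered frame are illustrative assumptions, and the sketch assumes N is a multiple of n.

```python
import random

def systematic_sample(frame, n, rng=random):
    """Pick one individual at random from the first k, then every k-th after."""
    k = len(frame) // n                 # group size k = N / n
    start = rng.randrange(k)            # random position within the first group
    return [frame[start + i * k] for i in range(n)]

frame = list(range(1, 65))              # N = 64 numbered individuals
sample = systematic_sample(frame, n=8, rng=random.Random(0))
print(sample)
```

Only the starting point is random; every later pick is determined by it.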
Stratified Samples
• Population divided into two or more groups
according to some common characteristic
• Simple random sample selected from each
group
• The two or more samples are combined into
one
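A stratified sample follows the three bullets literally: group, sample each group, combine. The sketch below is illustrative only; the population, the "undergrad"/"grad" strata, and the equal allocation of 5 per stratum are all assumptions.

```python
import random
from collections import defaultdict

def stratified_sample(population, stratum_of, per_stratum, rng=random):
    """Draw a simple random sample from each stratum and combine them."""
    strata = defaultdict(list)
    for item in population:
        strata[stratum_of(item)].append(item)
    combined = []
    for members in strata.values():
        combined.extend(rng.sample(members, per_stratum))
    return combined

# Hypothetical population: (id, group) pairs, 40 undergrads and 20 grads.
population = [(i, "undergrad" if i % 3 else "grad") for i in range(1, 61)]
sample = stratified_sample(population, lambda p: p[1], per_stratum=5,
                           rng=random.Random(1))
print(sample)
```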
Advantages and Disadvantages
• Simple random sample and systematic sample
– Simple to use
– May not be a good representation of the population’s
underlying characteristics
• Stratified sample
– Ensures representation of individuals across the
entire population
• Cluster sample
– More cost effective
– Less efficient (need larger sample to acquire the
same level of precision)
Key Definitions
• A population (universe) is the collection of things under
consideration
• A sample is a portion of the frame selected for
analysis
• A parameter is a summary measure computed to
describe a characteristic of the population
• A statistic is a summary measure computed to
describe a characteristic of the sample
Population and Sample
• Population: use parameters to summarize features
• Sample: use statistics to summarize features
• Inference on the population from the sample
Reasons for Drawing a Sample
• Less time consuming than a census
• Less costly to administer than a census
• Less cumbersome and more practical to
administer than a census of the targeted
population
Evaluating Survey Worthiness
• What is the purpose of the survey?
• Is the survey based on a probability sample?
• Coverage error – appropriate frame
• Nonresponse error – follow up
• Measurement error – good questions elicit good responses
• Sampling error – always exists when sample ≠ population
Types of Survey Errors
• Coverage error: excluded from frame.
• Non response error: follow up on non responses.
• Sampling error: chance differences from sample to sample.
• Measurement error: bad question!
Measurement Errors
• Question Phrasing
– Avoid negations
• Telescoping Effect
• “Halo” Effect
• Overzealous/Underzealous
Probability
Probability
• Probability is the numerical measure of the likelihood that an event will occur
• Value is between 0 (impossible) and 1 (certain)
• Sum of the probabilities of all mutually exclusive and collectively exhaustive events is 1
Computing Probabilities
• The probability of an event E:
$P(E) = \frac{\text{number of event outcomes}}{\text{total number of possible outcomes in the sample space}} = \frac{X}{T}$
e.g. P(one die shows 6 and the other shows 4) = 2/36
(There are 2 ways to get one 6 and the other 4)
• Each of the outcomes in the sample space is
equally likely to occur
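The 2/36 figure can be confirmed by listing the sample space for two dice and counting the outcomes in the event. A short sketch:

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two dice.
sample_space = list(product(range(1, 7), repeat=2))

# Event: one die shows 6 and the other shows 4.
event = [roll for roll in sample_space if sorted(roll) == [4, 6]]

p = Fraction(len(event), len(sample_space))
print(event, p)
```

`Fraction` reduces 2/36 to 1/18, which is the same probability.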
Empirical Probability
Example: Find the probability that a randomly selected
person will be struck by lightning this year.
The sample space consists of two simple events: the person is
struck by lightning or is not. Because these simple events are not
equally likely, we can use the relative frequency approximation
(Rule 1) or subjectively estimate the probability (Rule 3). Using
Rule 1, we can research past events to determine that in a recent
year 377 people were struck by lightning in the US, which has a
population of about 274,037,295. Therefore,
P(struck by lightning in a year)
≈ 377 / 274,037,295
≈ 1/727,000
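The relative frequency arithmetic is easy to check; the sketch below only repeats the division with the figures quoted above.

```python
struck = 377
population = 274_037_295

p = struck / population          # relative frequency estimate
one_in = 1 / p                   # "1 in how many" form of the same number
print(f"P = {p:.3e}, about 1 in {round(one_in, -3):,.0f}")
```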
Computing Joint Probability
• The probability of a joint event, A and B:
$P(A \text{ and } B) = P(A \cap B) = \frac{\text{number of outcomes from both A and B}}{\text{total number of possible outcomes in sample space}}$
E.g. $P(\text{Red Card and Ace}) = \frac{2 \text{ Red Aces}}{52 \text{ Total Number of Cards}} = \frac{1}{26}$
Computing Compound Probability
• Probability of a compound event, A or B:
$P(A \text{ or } B) = P(A \cup B) = \frac{\text{number of outcomes from either A or B or both}}{\text{total number of outcomes in sample space}}$
E.g. $P(\text{Red Card or Ace}) = \frac{4 \text{ Aces} + 26 \text{ Red Cards} - 2 \text{ Red Aces}}{52 \text{ total number of cards}} = \frac{28}{52} = \frac{7}{13}$
Compound Probability
(Addition Rule)
P(A or B ) = P(A) + P(B) - P(A and B)
[Venn diagram: circles P(A) and P(B) overlapping in the region P(A and B).]
For Mutually Exclusive Events: P(A or B) = P(A) + P(B)
Computing Conditional Probability
• The probability of event A given that event B has
occurred:
$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$
E.g. $P(\text{Red Card given that it is an Ace}) = \frac{2 \text{ Red Aces}}{4 \text{ Aces}} = \frac{1}{2}$
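The card examples for joint, compound, and conditional probability can all be checked by enumerating a 52-card deck. This is a sketch; the helper `prob` and the rank and suit labels are illustrative names.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))          # 52 equally likely cards

def prob(event):
    """Classical probability: favourable outcomes over total outcomes."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def is_red(card):
    return card[1] in ("hearts", "diamonds")

def is_ace(card):
    return card[0] == "A"

p_red, p_ace = prob(is_red), prob(is_ace)
p_both = prob(lambda c: is_red(c) and is_ace(c))     # joint
p_either = prob(lambda c: is_red(c) or is_ace(c))    # compound, by counting
p_red_given_ace = p_both / p_ace                     # conditional

print(p_both, p_either, p_red + p_ace - p_both, p_red_given_ace)
```

Counting the "red or ace" cards directly gives the same answer as the addition rule P(A) + P(B) - P(A and B).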
Conditional Probability
          American   Int’l   Total
Men       0.25       0.15    0.40
Women     0.45       0.15    0.60
Total     0.70       0.30    1.00
Q: What is the probability that a randomly selected
student is American, knowing that the student
is female?
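One way to work the question is to apply the conditional probability definition to the table: divide the joint probability (Women and American) by the marginal probability (Women). A sketch, with the table stored as a dictionary:

```python
# Joint probabilities from the table: (gender, origin) -> probability.
joint = {
    ("Men", "American"): 0.25, ("Men", "Int'l"): 0.15,
    ("Women", "American"): 0.45, ("Women", "Int'l"): 0.15,
}

# Marginal probability of selecting a woman: sum across the Women row.
p_women = sum(p for (gender, _), p in joint.items() if gender == "Women")

# P(American | Women) = P(American and Women) / P(Women)
p_american_given_women = joint[("Women", "American")] / p_women
print(round(p_american_given_women, 2))
```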
Conditional Probability and Joint
Probability
• Conditional probability:
$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$
• Multiplication rule for joint probability:
$P(A \text{ and } B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$
Conditional Probability and Statistical
Independence
• Events A and B are independent if
$P(A \mid B) = P(A)$
or $P(B \mid A) = P(B)$
or $P(A \text{ and } B) = P(A)\,P(B)$
• Events A and B are independent when the
probability of one event, A, is not affected by
another event, B
Example
• A company has two suppliers A and B. Rush
orders are placed to both. If no raw material
arrives in 4 days, the process shuts down.
– A can deliver within 4 days with 55% probability.
– B can deliver within 4 days with 35% probability.
1. What is the probability that A and B deliver
within 4 days?
2. What is the probability the process shuts
down?
3. What is the probability at least one delivers in 4
days?
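A possible way to work the three questions is sketched below. It rests on an assumption the slide does not state: that the two suppliers deliver independently of each other, so the multiplication rule for independent events applies.

```python
p_a = 0.55   # A delivers within 4 days
p_b = 0.35   # B delivers within 4 days

# Assumption (not stated on the slide): the two deliveries are independent.
p_both = p_a * p_b                      # 1. both deliver
p_shutdown = (1 - p_a) * (1 - p_b)      # 2. neither delivers
p_at_least_one = 1 - p_shutdown         # 3. complement of a shutdown

print(round(p_both, 4), round(p_shutdown, 4), round(p_at_least_one, 4))
```

Questions 2 and 3 are complements, so their probabilities add up to 1.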
Stock Trader’s Almanac
• 1998 stock trader’s almanac has 48 years of
data (1950-1997)
• Stocks up in January: 31 times
• Stocks up in year: 36 times
• Stocks up in January AND year: 29 times
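The slide lists only counts. One natural use of them (an assumption about the intended exercise) is to compare P(year up) with P(year up | January up) and to check whether the two events look independent.

```python
from fractions import Fraction

years = 48
up_january, up_year, up_both = 31, 36, 29

p_year = Fraction(up_year, years)                    # P(year up)
p_year_given_jan = Fraction(up_both, up_january)     # P(year up | January up)
p_product = Fraction(up_january, years) * p_year     # P(Jan up) * P(year up)
p_joint = Fraction(up_both, years)                   # P(Jan up and year up)

print(float(p_year), float(p_year_given_jan))
print("independent?", p_joint == p_product)
```

In these 48 years the conditional relative frequency (29/31) is well above the unconditional one (36/48), so the two events do not behave as independent in this data.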
Binomial Probability Distribution
• ‘n’ identical trials
– e.g.: 15 tosses of a coin; ten light bulbs taken from a
warehouse
• Two mutually exclusive outcomes on each trial
– e.g.: Head or tail in each toss of a coin; defective or
not defective light bulb
• Trials are independent
– The outcome of one trial does not affect the outcome
of the other
• Constant probability for each trial
– e.g.: Probability of getting a tail is the same each time
we toss the coin
Excel’s Binomial Function
=BINOMDIST(no. of successes, no. of trials, prob. of success, cumulative?)
Example
=BINOMDIST(2,8,0.5,FALSE) (=0.11)
“Probability of tossing (exactly) two heads within 8 trials”
=BINOMDIST(2,8,0.5,TRUE) (=0.14)
“Probability of tossing two or fewer heads within 8 trials”
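The two BINOMDIST results can be reproduced from the standard binomial formula, P(X = x) = C(n, x) p^x (1 - p)^(n - x). The function names below are illustrative, not an Excel or library API.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for a binomial with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binom_cdf(x, n, p):
    """P(X <= x): sum of the pmf from 0 to x."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

exactly_two = binom_pmf(2, 8, 0.5)    # like BINOMDIST(2,8,0.5,FALSE)
two_or_fewer = binom_cdf(2, 8, 0.5)   # like BINOMDIST(2,8,0.5,TRUE)
print(round(exactly_two, 4), round(two_or_fewer, 4))
```

The exact values are 28/256 and 37/256, which round to the 0.11 and 0.14 shown above.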
Binomial Setting
Examples
• Number of times newspaper arrives on time (i.e.,
before 7:30 AM) in a week/month
• Number of times I roll “5” on a die in 20 rolls
• Number of times I toss heads within 20 trials
• Students pick random number between 1 and
10. Number of students who picked “7”
• Number of people who will vote “Republican” in
a group of 20
• Number of left-handed people in a group of 40
Service Center Staffing
Assumptions
- 50 computers sold
- Prob. customer calls for service = 0.02
- Want < 5% that there is no engineer

Number of Service Calls   Probability   Cumul. Prob.
0                         0.36417       0.36417
1                         0.371602      0.735771
2                         0.185801      0.921572
3                         0.06067       0.982242
4                         0.014548      0.99679
5                         0.002732      0.999522
6                         0.000418      0.99994
7                         5.36E-05      0.999994
8                         5.88E-06      0.999999
9                         5.6E-07       1
10                        4.69E-08      1
11                        3.48E-09      1
12                        2.31E-10      1
13                        1.38E-11      1
14                        7.42E-13      1
15                        3.64E-14      1
16                        1.62E-15      1
17                        6.63E-17      1
18                        2.48E-18      1
19                        8.52E-20      1
20                        2.7E-21       1
21                        7.86E-23      1
22                        2.11E-24      1
23                        5.25E-26      1
24                        1.21E-27      1
25                        2.56E-29      1
26                        5.02E-31      1
27                        9.11E-33      1

[Figure: chart of Probability and Cumul. Prob. (0 to 1) against Number of Service Calls (0 to 50).]
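The staffing question can be sketched as a search over the binomial cumulative probabilities. The reading used here is an assumption: each service call occupies one engineer, so k engineers suffice whenever at most k of the 50 customers call, and the goal is the smallest k with P(calls > k) < 5%.

```python
from math import comb

n, p = 50, 0.02          # computers sold, P(a customer calls for service)
target = 0.05            # acceptable chance of running out of engineers

def binom_pmf(x, n, p):
    """P(X = x) for a binomial with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Assumption: each service call ties up one engineer, so k engineers are
# enough whenever the number of calls is at most k.
cumulative = 0.0
for engineers in range(n + 1):
    cumulative += binom_pmf(engineers, n, p)
    if 1 - cumulative < target:
        break

print(engineers, round(cumulative, 6))
```

Under that reading the loop stops at the first row of the table whose cumulative probability exceeds 0.95.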
Poisson Distribution
• Poisson Process:
– Discrete events in an “interval”
• The probability of One Success in an interval is stable
• The probability of More than One Success in this interval is 0
– The probability of success is independent from interval to interval
– e.g.: number of customers arriving in 15 minutes
– e.g.: number of defects per case of light bulbs
$P(X = x \mid \lambda) = \frac{e^{-\lambda} \lambda^{x}}{x!}$
Excel’s Poisson Function
=POISSON(no. of occurrences, mean, cumulative?)
Example
=POISSON(5,2,FALSE) (=0.036)
“Probability that (exactly) five customers arrive within an hour when the overall average is two”
=POISSON(5,2,TRUE) (=0.983)
“Probability that five or fewer customers arrive within an hour when the overall average is two”
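Both POISSON results follow from the Poisson formula, P(X = x | λ) = e^(-λ) λ^x / x!. A minimal sketch with illustrative function names:

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x | lambda) = e^(-lambda) * lambda^x / x!"""
    return exp(-lam) * lam**x / factorial(x)

def poisson_cdf(x, lam):
    """P(X <= x): sum of the pmf from 0 to x."""
    return sum(poisson_pmf(k, lam) for k in range(x + 1))

exactly_five = poisson_pmf(5, 2)     # like POISSON(5,2,FALSE)
five_or_fewer = poisson_cdf(5, 2)    # like POISSON(5,2,TRUE)
print(round(exactly_five, 3), round(five_or_fewer, 3))
```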
Poisson Setting
Examples
• Number of accidents at an intersection in 6
months
• Number of people entering a bank in a 30-minute interval
• Number of kids ringing the doorbell in 30
minutes for Halloween
• Number of times a Microsoft machine crashes
within 24 hours
• Number of sewing flaws per (100) garment(s)
Halloween
Assume: on average 4 kids/hour (=lambda)
Note: the tabulated probabilities match a Poisson distribution with mean 8, which is consistent with, for example, counting arrivals over a two-hour window at 4 kids/hour.

x     Probability   Cum. Prob.
0     0.000335      0.000335
1     0.002684      0.003019
2     0.010735      0.013754
3     0.028626      0.04238
4     0.057252      0.099632
5     0.091604      0.191236
6     0.122138      0.313374
7     0.139587      0.452961
8     0.139587      0.592547
9     0.124077      0.716624
10    0.099262      0.815886
11    0.07219       0.888076
12    0.048127      0.936203
13    0.029616      0.965819
14    0.016924      0.982743
15    0.009026      0.991769
16    0.004513      0.996282
17    0.002124      0.998406
18    0.000944      0.99935
19    0.000397      0.999747
20    0.000159      0.999906

[Figure: “A Poisson Distribution”, plotting Probability and Cum. Prob. on an axis from 0 to 1.]
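The Halloween probabilities can be regenerated from the Poisson formula. The mean of 8 used below is an assumption: it is the value that reproduces the tabulated numbers (the first entry, 0.000335, equals e^(-8)), whereas the stated rate is 4 kids per hour.

```python
from math import exp, factorial

lam = 8   # assumed mean: the value that reproduces the tabulated probabilities

cumulative = 0.0
rows = []
for x in range(21):
    pmf = exp(-lam) * lam**x / factorial(x)   # P(X = x)
    cumulative += pmf                         # P(X <= x)
    rows.append((x, round(pmf, 6), round(cumulative, 6)))

for row in rows:
    print(*row)
```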