Transcript Statistics
Statistics and Data
Analysis
Professor William Greene
Stern School of Business
IOMS Department
Department of Economics
2: Descriptive Statistics
Statistics and Data Analysis
Part 2 – Descriptive Statistics
Summarizing data with useful statistics
Use random samples and basic descriptive statistics.
What is the ‘breach rate’ in a pool of tens of thousands of mortgages? (‘Breach’ = improperly underwritten or serviced or otherwise faulty mortgage.)
The forensic analysis was an examination of
statistics from a random sample of 1,500 loans.
Descriptive Statistics Agenda
Populations and Random Samples
Descriptive Statistics for a Variable
  Measures of location: Mean, median, mode
  Measure of dispersion: Standard deviation
Measuring Correlation of Two Variables
  Understanding correlation
  Measuring correlation
  Scatter plots and regression
Populations and Samples
Population: Collection of all possible observations (data
points) on a variable
Sample: A subset of the data points in the population
Random sample: Defined by the way the sample data are
obtained. All points in the population are equally likely to
be drawn in any particular sample.
What is the purpose of obtaining a sample?
To describe or learn about the population.
The sample is observed.
The population is assumed.
In order to learn confidently about the population from
a sample, the sample must be ‘random.’
Random Sampling
A production process produces circuit boards. Boards are
produced in each hour with an average of 2 defects per board
when the process is in control. Each hour, the engineer
examines a random sample of 100 circuit boards. The average
number of defects per board in a particular 30 hour week is
Hour 1: Mean of 100 boards = 1.95
Hour 2: Mean of 100 boards = 2.65
Hour 3: Mean of 100 boards = 1.80
…
Hour 30: Mean of 100 boards = 2.35
(These are estimates of the defect rate per board.)
The objective of drawing the sample is to determine whether the process is in control or not. (The process is under control if the defect rate is < 2.)
Method: Assuming the process is in control, would we
expect to see this rate of defects?
Random samples of behavior are difficult
to obtain, especially by telephone.
Nonrandom Samples
Nonrandom samples produce tainted,
sometimes not believable results
Biased with respect to the population
May describe a specific subset of the population that is not useful.
(Non)Randomness of Samples
Sources of bias in samples (generally related)
Bad sample design – e.g., home phone
surveys conducted during working hours
Survey (non)response bias – e.g., opinion
surveys about service quality
Participation bias – e.g., voluntary
participation in a survey
Self selection – volunteering for a trial or an
opinion sample. (Shere Hite’s cultural
revolution)
Attrition bias from clinical trials - e.g., if the
drug works, the subject does not come back.
Nonrandom results in incubator funds.
The “NYU No Action Letter”
Nonscientific, Nonrandom “(non)Sampling”
A Cultural Revolution …
“3000 women, ages 14 to 78 describe in their own words …”
http://www.amazon.com/The-Hite-Report-National-Sexuality/dp/1583225692
http://en.wikipedia.org/wiki/Shere_Hite
The Lesson…
Having a really big sample does not
assure you of an accurate result. It may
assure you of a really solid, really bad
(inaccurate) result.
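The lesson can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the population, the 50% true rate, and the response probabilities (people holding opinion 1 are three times as likely to volunteer) are all hypothetical and are not taken from the Hite study.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 people, exactly half hold opinion "1".
population = [1] * 50_000 + [0] * 50_000
true_rate = sum(population) / len(population)

# Small but RANDOM sample: every person is equally likely to be drawn.
random_sample = random.sample(population, 500)
random_estimate = sum(random_sample) / len(random_sample)

# Huge but SELF-SELECTED sample: people holding opinion "1" are assumed
# (for illustration only) to be three times as likely to respond.
response_prob = {1: 0.6, 0: 0.2}
volunteers = [x for x in population if random.random() < response_prob[x]]
biased_estimate = sum(volunteers) / len(volunteers)

print(f"true rate               = {true_rate:.3f}")
print(f"random sample (n=500)   = {random_estimate:.3f}")
print(f"volunteers (n={len(volunteers)}) = {biased_estimate:.3f}")
```

The volunteer sample is far larger, yet its estimate settles near 0.75 rather than 0.50: more self-selected data only makes the wrong answer more precise.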
How do ASCAP, BMI and SESAC allocate the royalty pool to specific authors and publishers?
The following relates to terrestrial radio, which, as a group, pays a lump sum into the pool,
which is then allocated by the PRSs.
http://old.cni.org/docs/ima.ip-workshop/Massarsky.html
A Descriptive Statistic
Is … ?
Describes what?
The sample data
The population that the data came from
Measures of Location
These are 30 hours of average defect data on sets of circuit boards. Roughly what is the typical value?
1.45  2.35  1.90  1.65  1.70  1.55  1.50  1.90  1.95  2.25
1.45  1.60  1.65  1.40  2.05  1.60  2.60  2.05  2.30  2.05
1.70  2.20  1.70  2.30  2.70  1.05  1.30  1.70  2.35  2.35
Location and central tendency
There exists a distribution of values
We are interested in the “center” of the distribution
Two measures are the sample mean and the sample
median
They look similar, and measure the same thing.
They differ systematically (and predictably) when the data
are not ‘symmetric.’
The Sample Mean
These are 30 hours of average defect data on sets of circuit
boards. Roughly what is the typical value?
1.45  2.35  1.90  1.65  1.70  1.55  1.50  1.90  1.95  2.25
1.45  1.60  1.65  1.40  2.05  1.60  2.60  2.05  2.30  2.05
1.70  2.20  1.70  2.30  2.70  1.05  1.30  1.70  2.35  2.35
There are N observations (data points) in the sample.
Sample data: $y = [y_1, y_2, y_3, y_4, \ldots, y_N]$
In this sample, N = 30. The sample mean is
$$\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i = \frac{1}{N}\,[\,y_1 + y_2 + y_3 + y_4 + \cdots + y_N\,] = \frac{1}{30}(1.45 + \cdots + 2.35) = \frac{56.30}{30} = 1.8767$$
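As a quick check of the arithmetic, the sample mean can be recomputed directly from the 30 values on the slide. The variable names are illustrative only.

```python
# The 30 hourly average defect values from the slide.
defects = [
    1.45, 2.35, 1.90, 1.65, 1.70, 1.55, 1.50, 1.90, 1.95, 2.25,
    1.45, 1.60, 1.65, 1.40, 2.05, 1.60, 2.60, 2.05, 2.30, 2.05,
    1.70, 2.20, 1.70, 2.30, 2.70, 1.05, 1.30, 1.70, 2.35, 2.35,
]

n = len(defects)      # N, the number of observations
total = sum(defects)  # y_1 + y_2 + ... + y_N
mean = total / n      # the sample mean

print(f"N = {n}, sum = {total:.2f}, mean = {mean:.4f}")
```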
It may be necessary to ‘weight’ aggregate data.
Average Home Listings
$$\overline{\text{Listing}} = \frac{1}{51}(896{,}800 + 713{,}864 + \cdots + 164{,}326) = 369{,}687$$
Averaging Averages?
Hawaii’s average listing = $896,800
Hawaii’s population = 1,275,194
Illinois’ average listing = $377,683
Illinois’ population = 12,763,371
Illinois and Hawaii each get weight 1/51 = .019607 when the mean is computed.
Looks like Hawaii is getting too much influence.
A Properly Weighted Average
Simple average: $\overline{\text{Listing}} = \sum_{\text{States}} \text{Weight}_{\text{State}} \times \text{Listing}_{\text{State}}$, with $\text{Weight}_{\text{State}} = \frac{1}{51} = .019607$ for every state.
Illinois is 10 times as big as Hawaii. Suppose we use weights that are in proportion to the state's population. (The weights sum to 1.0.)
$\text{Weight}_{\text{State}}$ varies from .001717 for Wyoming to .121899 for California.
New average is 409,234 compared to 369,687 without weights, an error of 11%. Sometimes an unequal weighting of the observations is necessary.
State populations from http://www.factmonster.com/ipka/A0004986.html
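The difference between the two averages can be sketched with just the two states whose figures appear on the earlier slide. This is a two-state illustration, not the full 51-observation calculation, so its results will not match 369,687 or 409,234.

```python
# Average listings and populations for the two states quoted on the slides.
listings = {"Hawaii": 896_800, "Illinois": 377_683}
population = {"Hawaii": 1_275_194, "Illinois": 12_763_371}

# Simple average: every state gets the same weight (here 1/2).
simple = sum(listings.values()) / len(listings)

# Population-weighted average: weights are proportional to population
# and sum to 1.0.
total_pop = sum(population.values())
weights = {state: population[state] / total_pop for state in listings}
weighted = sum(weights[state] * listings[state] for state in listings)

print(f"simple average   = {simple:,.0f}")
print(f"weighted average = {weighted:,.0f}")
print({state: round(w, 4) for state, w in weights.items()})
```

With population weights, Illinois dominates and the average moves toward the Illinois figure, which is the direction the slide's argument predicts.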
Averaging Trending Time Series
Observations Is Usually Not Informative
Note how the mean changes completely depending
on what time interval is used to compute it.
Does the mean over the entire observation period mean anything? (Does it estimate anything meaningful?)
The Sample Median
Median = the middle observation after
data are sorted.
Odd number: Central observation:
Med[1,2,4,6,8,9,17] = 6
Even number: Midpoint between the
two central observations
Med[1,2,4,6,8,9,14,17] = (6+8)/2=7
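The two cases above can be written as a short function. The name `median` is just an illustrative choice here; Python's standard library offers `statistics.median`, which behaves the same way on these inputs.

```python
def median(values):
    """Middle observation after sorting; midpoint of the two central
    observations when the number of observations is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # odd: central observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # even: midpoint

print(median([1, 2, 4, 6, 8, 9, 17]))      # 6
print(median([1, 2, 4, 6, 8, 9, 14, 17]))  # 7.0
print(median([17, 1, 9, 2, 8, 4, 6]))      # 6 (input need not be sorted)
```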
Sample Median of (Sorted) Defects Data
(Sorted values read down each column.)
1.05  1.55  1.70  2.05  2.30
1.30  1.60  1.70  2.05  2.35
1.40  1.60  1.70  2.05  2.35
1.45  1.65  1.90  2.20  2.35
1.45  1.65  1.90  2.25  2.60
1.50  1.70  1.95  2.30  2.70
Mean = 1.8767
Median = 1.8000
[Figure: histogram of DEFECTS; Frequency axis 0 to 12, DEFECTS axis 1.000 to 3.000]
(Let’s deduce estimates of the mean and median from the histogram.)
Tomorrow I will compute the average number of defectives
for a 61st day. What is a good guess of the number I will find?
Skewed Earnings Distribution
Mean vs. Median in Skewed Data
These data are skewed to the right.
Monthly Earnings: N = 595, Median = 800, Mean = 883
The mean will exceed the median when the distribution is skewed to the right. (The skewness is in the direction of the long tail.)
Extreme Observations Distort
Means but Not Medians
Outlying observations distort the mean
Med [1,2,4,6,8,9,17] = 6
Mean[1,2,4,6,8,9,17] = 6.714
Med [1,2,4,6,8,9,17000] = 6 (still)
Mean[1,2,4,6,8,9,17000] = 2432.9 (!)
This typically occurs when there are some outlying
observations, such as in cross sections of income or
wealth and/or when the sample is not very large.
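The four numbers on this slide can be reproduced with Python's standard `statistics` module. This is a minimal check, not part of the original slides.

```python
from statistics import mean, median

typical = [1, 2, 4, 6, 8, 9, 17]
with_outlier = [1, 2, 4, 6, 8, 9, 17000]  # same data, one extreme value

for data in (typical, with_outlier):
    print(data, "median =", median(data), "mean =", round(mean(data), 3))
```

Replacing 17 with 17000 leaves the middle observation where it was, but adds 16,983 to the sum that the mean divides by 7.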
The mean does not give information
about the shape of the distribution.
Two problems with the computations
(1) The data are ratings, not quantitative
(2) The mean does not suggest the
extreme nature of the data
The problem with the mean or median as a description of
a sample – more information is usually needed.
Both data sets have a mean of about 100.
Dispersion of the Observations
These are 30 hours of average defect data on sets of circuit boards.
1.45  2.35  1.90  1.65  1.70  1.55  1.50  1.90  1.95  2.25
1.45  1.60  1.65  1.40  2.05  1.60  2.60  2.05  2.30  2.05
1.70  2.20  1.70  2.30  2.70  1.05  1.30  1.70  2.35  2.35
We quantify the variation of the values around the mean. Note the range is from 1.05 to 2.70. This gives an idea where the data lie. The mean plus a measure of the variation do the same job.
[Figure: Histogram of Defects; Frequency axis 0 to 6, Defects axis 1.2 to 2.8]
The Problem with the Range as a Measure of Dispersion
These two data sets both have 1,000 observations
that range from about 10 to about 180
A Measure of Dispersion
The standard deviation is the interesting value. You need
to compute the variance to get the standard deviation.
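$$\text{Variance} = s_y^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2$$
$$\text{Standard deviation} = s_y = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2}$$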
Note the units of measurement. The standard deviation has the same units
as the mean. The standard deviation is the standard measure for the
dispersion (spread) of a set of values (sample of observations).
The variance is the average squared deviation of the
sample values from the mean. Why is N-1 in the
denominator of s2?
Everyone else does it
Minitab does it
I have totally no idea.
Tendency of the variance to be too
small when computed using 1/N when
the sample size, N, is itself small.
(When N is large, it won’t matter.)
See HOG, p. 37
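The "too small" tendency can be seen exactly, without simulation, on a toy case. The sketch below assumes a hypothetical four-value population and enumerates every possible sample of size 2 drawn with replacement; none of these numbers come from the slides.

```python
from itertools import product
from statistics import pvariance, variance

# Hypothetical population whose variance we can compute exactly.
population = [1, 2, 3, 4]
sigma2 = pvariance(population)  # population variance (divides by 4)

# All 16 equally likely ordered samples of size N = 2, with replacement.
samples = list(product(population, repeat=2))

# Average of the 1/N formula over all samples (pvariance divides by N).
avg_div_n = sum(pvariance(s) for s in samples) / len(samples)

# Average of the 1/(N-1) formula over all samples.
avg_div_n_minus_1 = sum(variance(s) for s in samples) / len(samples)

print("population variance :", sigma2)
print("average using 1/N   :", avg_div_n)
print("average using 1/(N-1):", avg_div_n_minus_1)
```

In this setup the 1/N version averages to half the population variance, while the 1/(N-1) version averages to exactly the population variance. The shortfall factor is (N-1)/N, which is why it stops mattering when N is large.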
Computing a Standard Deviation
Y      Deviation From Mean    Squared Deviation
1      -2.1                   4.41
4       0.9                   0.81
6       2.9                   8.41
0      -3.1                   9.61
3      -0.1                   0.01
2      -1.1                   1.21
6       2.9                   8.41
4       0.9                   0.81
4       0.9                   0.81
1      -2.1                   4.41
SUM     0.0                   38.90
Sum = 31
Mean = 31/10 = 3.1
Sum of squared deviations = 38.90
Variance = 38.90/(10-1) = 4.322
Standard Deviation = 2.079
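The table's steps translate line for line into code. The sketch below follows the same order: mean, deviations, squared deviations, variance with N-1, then the square root.

```python
import math

y = [1, 4, 6, 0, 3, 2, 6, 4, 4, 1]

n = len(y)
mean = sum(y) / n                        # 31 / 10
deviations = [v - mean for v in y]       # column 2 of the table
squared = [d ** 2 for d in deviations]   # column 3 of the table
sum_sq = sum(squared)                    # sum of squared deviations
variance = sum_sq / (n - 1)              # divide by N - 1, not N
std_dev = math.sqrt(variance)

print(f"mean = {mean:.1f}")
print(f"sum of deviations = {sum(deviations):.1f}")
print(f"sum of squared deviations = {sum_sq:.2f}")
print(f"variance = {variance:.3f}, standard deviation = {std_dev:.3f}")
```

The deviations always sum to zero (up to floating-point rounding), which is why they must be squared before averaging.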
Standard Deviation
These are 30 hours of average defect data on sets of circuit boards.
1.45  2.35  1.90  1.65  1.70  1.55  1.50  1.90  1.95  2.25
1.45  1.60  1.65  1.40  2.05  1.60  2.60  2.05  2.30  2.05
1.70  2.20  1.70  2.30  2.70  1.05  1.30  1.70  2.35  2.35
$$\text{Variance} = \frac{1}{30-1}\sum_{i=1}^{30}\left(Y_i - 1.8767\right)^2 = \frac{4.808667}{30-1} = 0.165816$$
$$\text{Standard Deviation} = \sqrt{\frac{1}{30-1}\sum_{i=1}^{30}\left(Y_i - 1.8767\right)^2} = 0.407205$$
Distribution of Values
[Figure: Histogram of Defects; Frequency axis 0 to 6, Defects axis 1.2 to 2.8]
Reliable Rules of Thumb
For most data sets, about two thirds (roughly 68%) of the observations in a sample will lie in the range
[mean - 1 s.d. to mean + 1 s.d.]
About 95% of the observations in a sample will lie in the range
[mean - 2 s.d. to mean + 2 s.d.]
About 99.7% of the observations in a sample will lie in the range
[mean - 3 s.d. to mean + 3 s.d.]
These percentages are exact for bell-shaped (normal) data. When the rules are not met exactly, they will usually almost be met. Data nearly always act this way.
A Reliable Empirical Rule
[Figure: Dotplot of Defects; Defects axis 1.00 to 2.75]
Mean ± 2 s = 1.8767 ± 2(.4072) = 1.06 to 2.69 includes 28/30 = 93%
Mean ± 1 s = (1.47 to 2.28) includes 18/30 = 60%
Minitab: Graph Dotplot …
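The counts on this slide can be reproduced from the defects data. The helper name `count_within` is hypothetical; the sketch uses the standard library's `mean` and `stdev` (which divides by N-1, as in these slides).

```python
from statistics import mean, stdev

defects = [
    1.45, 2.35, 1.90, 1.65, 1.70, 1.55, 1.50, 1.90, 1.95, 2.25,
    1.45, 1.60, 1.65, 1.40, 2.05, 1.60, 2.60, 2.05, 2.30, 2.05,
    1.70, 2.20, 1.70, 2.30, 2.70, 1.05, 1.30, 1.70, 2.35, 2.35,
]

m = mean(defects)
s = stdev(defects)  # sample standard deviation (N - 1 denominator)

def count_within(k):
    """Number of observations in [mean - k*s, mean + k*s]."""
    low, high = m - k * s, m + k * s
    return sum(low <= v <= high for v in defects)

for k in (1, 2, 3):
    inside = count_within(k)
    print(f"mean ± {k} s: {m - k * s:.2f} to {m + k * s:.2f} "
          f"includes {inside}/{len(defects)} = {inside / len(defects):.0%}")
```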
Rules For Transformations
Mean of $a + bY = a + b\,\bar{y}$
Standard deviation of $a + bY = |b|\, s_y$
Which city is warmer, New York (USA) or Old
York (England)? Which is more variable?
Average Temperatures (high + low)/2
Month  NY(f)  OY(c)    Month  NY(f)  OY(c)
Jan    29.5    2.0     Jul    75.5   15.5
Feb    32.0    2.0     Aug    73.5   15.0
Mar    35.0    4.5     Sep    66.0   13.0
Apr    50.0    8.5     Oct    55.0    9.5
May    60.5    9.5     Nov    45.0    6.0
Jun    70.0   13.0     Dec    35.0    3.5

City      Mean    Std.Dev.  Min     Max
Old York  8.500   4.913     2.000   15.50
New York  52.25   16.93     29.50   75.50
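The two cities cannot be compared until they are in the same units. One way to answer the question, sketched below, is to apply the transformation rules with the standard conversion F = 32 + 1.8 C (so a = 32, b = 1.8) to the Old York statistics. The variable names are illustrative.

```python
from statistics import mean, stdev

ny_f = [29.5, 32.0, 35.0, 50.0, 60.5, 70.0, 75.5, 73.5, 66.0, 55.0, 45.0, 35.0]
oy_c = [2.0, 2.0, 4.5, 8.5, 9.5, 13.0, 15.5, 15.0, 13.0, 9.5, 6.0, 3.5]

a, b = 32.0, 1.8  # Fahrenheit = a + b * Celsius

# Transformation rules: mean(a + bY) = a + b*mean(Y), sd(a + bY) = |b|*sd(Y).
oy_mean_f = a + b * mean(oy_c)
oy_sd_f = abs(b) * stdev(oy_c)

# Direct check: convert every month, then recompute the statistics.
oy_f = [a + b * c for c in oy_c]

print(f"Old York: mean = {oy_mean_f:.2f} F, s.d. = {oy_sd_f:.2f} F")
print(f"  (direct: mean = {mean(oy_f):.2f} F, s.d. = {stdev(oy_f):.2f} F)")
print(f"New York: mean = {mean(ny_f):.2f} F, s.d. = {stdev(ny_f):.2f} F")
```

In Fahrenheit, Old York averages about 47.3 with a standard deviation of about 8.8, so on these data New York is both warmer on average and more variable.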
Application – Cost of Defects
These are 30 observations of average defect data on sets of manufactured circuit boards.
1.45  2.35  1.90  1.65  1.70  1.55  1.50  1.90  1.95  2.25
1.45  1.60  1.65  1.40  2.05  1.60  2.60  2.05  2.30  2.05
1.70  2.20  1.70  2.30  2.70  1.05  1.30  1.70  2.35  2.35
Suppose the cost to repair defects is $25 + 10*Defects
I.e., a $25 setup cost plus $10 per defect.
Mean defects = 1.8767   Standard Deviation = 0.407205
Mean Cost = $25 + $10(1.8767) = $43.767
Standard Deviation Cost = $10(.407205) = $4.07205
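As a check on the shortcut, the sketch below computes the cost for each of the 30 observations and compares the resulting mean and standard deviation with the values given by the transformation rules.

```python
from statistics import mean, stdev

defects = [
    1.45, 2.35, 1.90, 1.65, 1.70, 1.55, 1.50, 1.90, 1.95, 2.25,
    1.45, 1.60, 1.65, 1.40, 2.05, 1.60, 2.60, 2.05, 2.30, 2.05,
    1.70, 2.20, 1.70, 2.30, 2.70, 1.05, 1.30, 1.70, 2.35, 2.35,
]

setup, per_defect = 25.0, 10.0  # cost = 25 + 10 * defects

# Long way: transform every observation, then summarize.
costs = [setup + per_defect * d for d in defects]

# Short way: apply the transformation rules to the summary statistics.
mean_cost_rule = setup + per_defect * mean(defects)
sd_cost_rule = abs(per_defect) * stdev(defects)

print(f"mean cost: direct = {mean(costs):.3f}, rule = {mean_cost_rule:.3f}")
print(f"s.d. cost: direct = {stdev(costs):.5f}, rule = {sd_cost_rule:.5f}")
```

The setup cost shifts the mean but has no effect on the standard deviation, because adding a constant moves every observation by the same amount.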
Correlation
Variables Y and X vary together
Causality vs. correlation: Does movement in X
“cause” movement in Y in some metaphysical
sense?
Correlation
Simultaneous movement through a statistical relationship
Simultaneous variation “induced” by the variation of a
common third effect
Samples of House Listings and
Per Capita Incomes at a Particular Time
Scatter Plot Suggests Positive Correlation
[Figure: Scatterplot of Listing vs IncomePC; Listing axis 100000 to 900000, IncomePC axis 15000 to 32500]
Regression Measures Correlation
[Figure: Scatterplot of Listing vs IncomePC with the fitted regression line: Listing = a + b IncomePC]
Correlation Is Not Causation
Price and Income seem to be “positively” related.
The U.S. Gasoline Market. Data are yearly from 1953 to 2004. Plot of per capita income vs. gasoline price index.
[Figure: Scatterplot of Income vs GasPrice; Income axis 10000 to 27500, GasPrice axis 20 to 120]
The Hidden (Spurious) Relationship
Not positively “related” to each other; both positively related to “time.”
[Figure: two panels, Scatterplot of Income vs Year and Scatterplot of GasPrice vs Year; Year axis 1950 to 2010]
Correlation is the interesting number.
We must compute covariance and the two
standard deviations first.
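Standard deviations: $s_X = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2}$, $\quad s_Y = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2}$
Covariance: $s_{XY} = \frac{\sum_{i=1}^{N}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{N-1}$
Correlation: $r_{XY} = \frac{s_{XY}}{s_X\, s_Y}$
$-1 \le r_{XY} \le +1$
Units free. A pure number.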
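The three formulas can be coded directly. The function name `correlation` and the small data sets below are hypothetical, chosen only so the results are easy to verify by hand.

```python
import math

def correlation(x, y):
    """Sample correlation r_XY = s_XY / (s_X * s_Y), using N - 1 throughout."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    return s_xy / (s_x * s_y)

x = [1.0, 2.0, 3.0, 4.0, 5.0]          # made-up values
y_linear = [25 + 10 * v for v in x]    # exact linear function of x
y_reversed = [-v for v in x]           # exact decreasing linear function
y_mixed = [2.0, 1.0, 4.0, 3.0, 5.0]    # related to x, but not exactly

print(round(correlation(x, y_linear), 3))    # 1.0
print(round(correlation(x, y_reversed), 3))  # -1.0
print(round(correlation(x, y_mixed), 3))     # 0.8
```

The endpoints +1 and -1 are reached only when one variable is an exact linear function of the other. Rescaling either variable (for example dollars to cents) leaves r unchanged, which is what "units free" means.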
Correlation
[Figure: Scatterplot of Listing vs IncomePC; Listing axis 100000 to 900000, IncomePC axis 15000 to 32500]
$r_{\text{Income},\text{Listing}} = +0.591$
Correlations
[Figure: three panels: Scatterplot of Noise vs Defects, Scatterplot of cost vs Defects, Scatterplot of Noise vs MoreNoise. Correlations shown on the slide: r = 0.0, r = +1.0, r = +0.5]
Sample Statistics and Population Parameters
The sample has a mean and standard deviation, $\bar{Y}$ and $s_Y$.
The population has a mean, μ, and standard deviation, σ.
The sample “looks like” the population.
The sample statistics resemble the population features.
The bigger the RANDOM sample, the closer the resemblance. We will study this later in the course.
Summary
Statistics to describe location (mean) and spread (standard deviation) of a sample of values.
  Interpretations
  Computations
  Complications
Statistics and graphical tools to describe bivariate (two variable) relationships
  Scatter plots
  Correlation