Class6 - NYU Stern School of Business
Statistics & Data Analysis
Course Number: B01.1305
Course Section: 31
Meeting Time: Wednesday 6-8:50 pm
CLASS #6
Class #6 Outline
Point estimation
Confidence interval estimation
Determining sample sizes
Introduction to Regression and Correlation Analysis
Review of Last Class
Sampling distributions for sample statistics
Review of Notation
                      Population   Sample
Mean                  μ            X̄
Standard Deviation    σ            s
Point and Interval Estimation
Chapter 8
Review
Basic problem of statistical theory is how to infer a population
or process value given only sample data
Any sample statistic will vary from sample to sample
Any sample statistic will differ from the true, population value
Must consider random error in sample statistic estimation
Chapter Goals
Summarize sample data
• Choosing an estimator
• Unbiased estimator
Constructing confidence intervals for means with known standard
deviation
Constructing confidence intervals for proportions
Determining how large a sample is needed
Constructing confidence intervals when standard deviation is not known
Understanding the key assumptions underlying confidence interval
methods
Reminder: Statistical Inference
Problem of Inferential Statistics:
• Make inferences about one or more population parameters based on
observable sample data
Forms of Inference:
• Point estimation: single best guess regarding a population parameter
• Interval estimation: Specifies a reasonable range for the value of the
parameter
• Hypothesis testing: Isolating a particular possible value for the
parameter and testing if this value is plausible given the available data
Point Estimators
Computing a single statistic from the sample data to estimate
a population parameter
Choosing a point estimator:
• What is the shape of the distribution?
• Do you suspect outliers exist?
• Plausible choices:
  • Mean
  • Median
  • Mode
  • Trimmed Mean
Technical Definitions
ESTIMATOR: An estimator θ̂ of a parameter θ is a function of a random sample that yields a point estimate for θ. An estimator is itself a random variable and therefore it has a theoretical sampling distribution.
UNBIASED ESTIMATOR: An estimator θ̂ that is a function of the sample data is called unbiased for the population parameter θ if its expected value equals θ.
EFFICIENT ESTIMATOR: An estimator is called most efficient for a particular problem if it has the smallest standard error of all possible unbiased estimators.
Example
I used R to draw 1,000 samples, each of size 30, from a
normally distributed population having mean 50 and standard
deviation 10.
data.mean = data.median = numeric(0)
for(i in 1:1000) {
data = rnorm(n=30, mean=50, sd=10)
data.mean[i] = mean(data)
data.median[i] = median(data)
}
For each sample the mean and median are computed.
Do these statistics appear unbiased?
Which is more efficient?
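One way to check, as a quick sketch assuming the simulation above has just been run, is to compare the averages and spreads of the 1,000 sample means and medians:

mean(data.mean)    # average of the 1,000 sample means; close to 50 if unbiased
mean(data.median)  # average of the 1,000 sample medians
sd(data.mean)      # spread across samples; roughly 10/sqrt(30) = 1.83
sd(data.median)    # a larger spread here means the median is less efficient for normal data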
Example
I used R to draw 1,000 samples, each of size 30, from an
extremely skewed population with mean and standard
deviation both equal to 2.
data.mean = data.median = numeric(0)
for(i in 1:1000) {
data = rt(n=30, 3)
data.mean[i] = mean(data)
data.median[i] = median(data)
}
For each sample the mean and median are computed.
Do these statistics appear unbiased?
Which is more efficient?
Expressing Uncertainty
Suppose we are trying to make inferences about a population
mean based on a sample of size n.
The sample mean X̄ is a point estimator of the parameter μ. Used
by itself, X̄ is of limited usefulness because it contains no
information about its own reliability.
Furthermore, reporting X̄ alone may leave the false
impression that X̄ estimates μ with complete accuracy.
Confidence Interval
An interval with random endpoints which contains the
parameter of interest (in this case, μ) with a pre-specified
probability, denoted by 1 - α.
The confidence interval automatically provides a margin of
error to account for the sampling variability of the sample
statistic.
Example: A machine is supposed to fill “12 ounce” bottles of
Guinness. To see if the machine is working properly, we
randomly select 100 bottles recently filled by the machine,
and find that the average amount of Guinness is 11.95
ounces. Can we conclude that the machine is not working
properly?
No! By simply reporting the sample mean, we are neglecting
the fact that the amount of beer varies from bottle to bottle
and that the value of the sample mean depends on the luck of
the draw
It is possible that a value as low as 11.95 is within the range
of natural variability for the sample mean, even if the average
amount for all bottles is in fact μ = 12 ounces.
Suppose we know from past experience that the amounts of
beer in bottles filled by the machine have a standard deviation
of σ = 0.05 ounces.
Since n = 100, we can assume (using the Central Limit
Theorem) that the sample mean is normally distributed with
mean μ (unknown) and standard error 0.005
What does the Empirical Rule tell us about the likely values of
the sample mean?
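A small sketch of the arithmetic, using the values given above (σ = 0.05, n = 100):

sigma <- 0.05; n <- 100
se <- sigma / sqrt(n)        # standard error = 0.005
c(12 - 2 * se, 12 + 2 * se)  # if mu were 12, about 95% of sample means fall in 11.99 to 12.01
(11.95 - 12) / se            # the observed mean is 10 standard errors below 12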
Why does it work?
[Figure: the sample mean X̄ falls within about two standard errors of μ roughly 95% of the time, so an interval of plus-or-minus two standard errors around X̄ captures μ about 95% of the time.]
Using the Empirical Rule Assuming Normality
Confidence Intervals
“Statistics is never having to say you're certain”.
• (Tee shirt, American Statistical Association).
Any sample statistic will vary from sample to sample
Point estimates are almost inevitably in error to some degree
Thus, we need to specify a probable range or interval
estimate for the parameter
Confidence Interval
100(1 − α)% CONFIDENCE INTERVAL FOR μ, σ KNOWN
Using the sample mean as an estimate of the population mean, allow for
sampling error with a plus-or-minus term equal to a z-table value times the
standard error of the sample mean:

$\bar{y} - z_{\alpha/2}\,\sigma_{\bar{Y}} \;\le\; \mu \;\le\; \bar{y} + z_{\alpha/2}\,\sigma_{\bar{Y}}$, where $\sigma_{\bar{Y}} = \sigma/\sqrt{n}$
Example
An airline needs an estimate of the average number of
passengers on a newly scheduled flight
Its experience is that data for the first month of flights are
unreliable, but thereafter the passenger load settles down
The mean passenger load is calculated for the first 20
weekdays of the second month after initiation of this particular
flight
If the sample mean is 112 and the population standard
deviation is assumed to be 25, find a 95% confidence interval
for the true, long-run average number of passengers on this
flight
Find the 90% confidence interval for the mean
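A sketch of the two intervals in R, using the numbers given (sample mean 112, σ = 25, n = 20):

ybar <- 112; sigma <- 25; n <- 20
se <- sigma / sqrt(n)
ybar + c(-1, 1) * qnorm(0.975) * se  # 95% CI: about 101.0 to 123.0
ybar + c(-1, 1) * qnorm(0.95) * se   # 90% CI: about 102.8 to 121.2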
Interpretation
The confidence level of a confidence interval refers to the
process of constructing confidence intervals
Each particular confidence interval either does or does not
include the true value of the parameter being estimated
We can’t say that this particular estimate is correct to within
the error
So, we say that we have a XX% confidence that the
population parameter is contained in the interval
Or…the interval is the result of a process that in the long run
has a XX% probability of being correct
Imagine Many Samples
[Figure: confidence intervals computed from many repeated samples; most cover the population mean of 23.29, but a few miss it.]
Example
A signal with value μ is transmitted from California; the value
received in NY is normally distributed with mean μ and
variance 4.
To reduce error, the same value is sent 9 times
If the successive values received are:
• 5, 8.5, 12, 15, 7, 9, 7.5, 6.5, 10.5
Construct a 99% confidence interval for μ
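A sketch of the computation (σ = 2 since the variance is 4, n = 9):

x <- c(5, 8.5, 12, 15, 7, 9, 7.5, 6.5, 10.5)
se <- 2 / sqrt(length(x))                # sigma / sqrt(n) = 2/3
mean(x) + c(-1, 1) * qnorm(0.995) * se   # 99% CI: about 7.28 to 10.72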
Getting Realistic
The population standard deviation is rarely known
Usually both the mean and standard deviation must be
estimated from the sample
Estimate σ with s
However…with this added source of random errors, we need
to handle this problem using the t-distribution (later on)
Confidence Intervals for Proportions
We can also construct confidence intervals for proportions of
successes
Recall that the expected value and standard error of the sample
proportion $\hat{\pi}$ are:

$E(\hat{\pi}) = \pi; \qquad \sigma_{\hat{\pi}} = \sqrt{\pi(1-\pi)/n}$
How can we construct a confidence interval for a proportion?
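Following the same pattern as the interval for a mean, and replacing π in the standard error by the observed proportion, the usual large-sample interval is:

$\hat{\pi} \;\pm\; z_{\alpha/2}\sqrt{\hat{\pi}(1-\hat{\pi})/n}$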
Example
Suppose that in a sample of 2,200 households with one or
more television sets, 471 watch a particular network’s show at
a given time.
Find a 95% confidence interval for the population proportion
of households watching this show.
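A sketch of the computation for this example (471 viewing households out of n = 2,200):

phat <- 471 / 2200                    # about 0.214
se <- sqrt(phat * (1 - phat) / 2200)
phat + c(-1, 1) * qnorm(0.975) * se   # 95% CI: about 0.197 to 0.231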
Example
The 1992 presidential election looked like a very close three-way
race at the time when news polls reported that, of 1,105
registered voters surveyed:
• Perot: 33%
• Bush: 31%
• Clinton: 28%
Construct a 95% confidence interval for Perot?
What is the margin of error?
What happened here?
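A sketch of the margin-of-error arithmetic for Perot's 33% (n = 1,105); the same calculation applies to the other candidates, and the resulting intervals overlap:

me <- qnorm(0.975) * sqrt(0.33 * (1 - 0.33) / 1105)  # margin of error, about 0.028
0.33 + c(-1, 1) * me                                 # 95% CI for Perot: about 0.302 to 0.358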
Example
A survey conducted found that out of 800 people, 46%
thought that Clinton’s first approved budget represented a
major change in the direction of the country.
Another 45% thought it did not represent a major change.
Compute a 95% confidence interval for the percent of people
who had a positive response.
What is the margin of error?
Interpret…
Choosing a Sample Size
Gathering information for a statistical study can be expensive,
time consuming, etc.
So…the question of how much information to gather is very
important
When considering a confidence interval for a population mean
μ, there are three quantities to consider:
• $z_{\alpha/2}$, the table value for the chosen confidence level
• $\sigma_Y$, the population standard deviation
• n, the sample size
Choosing a Sample Size (cont)
Tolerability Width: The margin of acceptable error
• ±3%
• ± $10,000
Derive the required sample size using:
• Margin of error (tolerability width)
• Level of Significance (z-value)
• Standard deviation (given, assumed, or calculated)
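Putting these together gives the standard large-sample formulas (E denotes the margin of error, i.e. half of the tolerable width):

$n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2 \text{ for a mean}; \qquad n = \left(\frac{z_{\alpha/2}}{E}\right)^2 \pi(1-\pi) \text{ for a proportion (use } \pi = 0.5 \text{ as a conservative guess)}$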
Example
Union officials are concerned about reports of inferior wages
being paid to employees of a company under its jurisdiction
How large a sample is needed to obtain a 90% confidence
interval for the population mean hourly wage with width
equal to $1.00? Assume that σ = 4.
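A sketch of the calculation: a total width of $1.00 means a margin of error E = $0.50, and σ = 4 is assumed.

E <- 0.50; sigma <- 4
z <- qnorm(0.95)            # 1.645 for 90% confidence
ceiling((z * sigma / E)^2)  # about 174 workers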
Example
A direct-mail company must determine its credit policies very
carefully.
The firm suspects that advertisements in a certain magazine
have led to an excessively high rate of write-offs.
The firm wants to establish a 90% confidence interval for this
magazine’s write-off proportion that is accurate to ± 2.0%
• How many accounts must be sampled to guarantee this goal?
• If this many accounts are sampled and 10% of the sampled accounts
are determined to be write-offs, what is the resulting 90% confidence
interval?
• What kind of difference do we see by using an observed proportion
over a conservative guess?
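A sketch of the computations, using the conservative guess π = 0.5 for the sample-size step:

z <- qnorm(0.95)                        # 1.645 for 90% confidence
n <- ceiling((z / 0.02)^2 * 0.5 * 0.5)  # conservative guess: about 1,691 accounts
phat <- 0.10
phat + c(-1, 1) * z * sqrt(phat * (1 - phat) / n)  # 90% CI: about 0.088 to 0.112
# Using the observed 10% rate instead of the conservative 50% gives a narrower interval,
# because 0.10 * 0.90 = 0.09 is much smaller than 0.25.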
The t Distribution
Up until now, we have assumed that the population standard
deviation σ is known or that we choose a large enough sample
so the sample standard deviation s can replace σ.
Sometimes a large sample is not possible
So far, we’ve based the confidence interval on the z statistic:

$Z = \dfrac{\bar{Y} - \mu}{\sigma/\sqrt{n}}$
The t Distribution (cont)
When the population standard deviation σ is unknown, it must
be replaced by the sample statistic s
This yields the summary statistic

$t = \dfrac{\bar{Y} - \mu}{s/\sqrt{n}}$
This statistic follows a t-Distribution
The t Distribution (cont)
This statistic was derived by W. S. Gosset
Gosset obtained a post as a chemist in the Guinness brewery
in Dublin in 1899 and did important work on statistics
He invented the t Distribution to handle small samples for
quality control in brewing
He wrote under the name "Student"
Properties of the t Distribution
Symmetric about the mean 0
More variable than the z-distribution

[Figure: density curves of the standard normal and the t distribution over the range −4 to 4; the t curve has the same bell shape but heavier tails.]
Properties of the t Distribution (cont)
There are many different t distributions.
• We specify a particular one by its degrees of freedom
• If a random sample is taken from a normal population, then the statistic:

$t = \dfrac{\bar{Y} - \mu}{s/\sqrt{n}}$

has a t distribution with d.f. = n − 1
As sample size increases, the t-distribution approaches the z-distribution
R functions
• pt(t, df): $P\!\left(\dfrac{\bar{Y}-\mu}{s/\sqrt{n}} \le t\right)$; cumulative distribution function
• qt(p, df): the value t such that $P\!\left(\dfrac{\bar{Y}-\mu}{s/\sqrt{n}} \le t\right) = p$; inverse CDF
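For example (a quick sketch), with n = 8 observations there are 7 degrees of freedom:

qt(0.975, df = 7)    # t-table value for a 95% interval: about 2.365
1 - pt(2.0, df = 7)  # P(t > 2.0) with 7 d.f.: about 0.043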
Degrees of Freedom
Technical definition fairly complex
Intuitively: d.f. refers to the estimated standard deviation and
is used to indicate the number of pieces of information
available for that estimate
The standard deviation is based on n deviations from the
mean, but the deviations must sum to 0, so only n-1
deviations are free to vary
Example
How long should you wait before ordering new inventory?
• If you choose too soon, you pay stocking costs
• If you choose too late, you risk stock-outs
Your supplier says goods will arrive in two weeks (10 business days), but
you made note of how many business days it actually took: 10, 9, 7, 10, 3,
9, 12, 5
Calculate the sample mean, standard deviation, and standard error for this
sample
What is the probability a shipment takes more than two weeks?
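A sketch of the summary statistics in R, and one way to approximate the probability (treating individual delivery times as roughly normal with the estimated mean and standard deviation):

days <- c(10, 9, 7, 10, 3, 9, 12, 5)
mean(days)                           # 8.125 days
sd(days)                             # about 2.95 days
sd(days) / sqrt(length(days))        # standard error, about 1.04
1 - pnorm(10, mean(days), sd(days))  # P(a shipment takes more than 10 days): about 0.26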
Confidence Intervals Using the t Distribution
100(1 − α)% CONFIDENCE INTERVAL FOR μ, σ UNKNOWN
Using the sample mean as an estimate of the population mean, allow for
sampling error with a plus-or-minus term equal to a t-table value times the
estimated standard error of the sample mean:

$\bar{y} - t_{\alpha/2}\, s/\sqrt{n} \;\le\; \mu \;\le\; \bar{y} + t_{\alpha/2}\, s/\sqrt{n}$

where $t_{\alpha/2}$ is the tabulated t value cutting off a right-tail area of α/2 with n − 1 d.f.
Example
How long should you wait before ordering new inventory?
• If you choose too soon, you pay stocking costs
• If you choose too late, you risk stock-outs
Your supplier says goods will arrive in two weeks (10
business days), but you made note of how many business
days it actually took: 10, 9, 7, 10, 3, 9, 12, 5
Calculate a 95% confidence interval for the mean delivery
time
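A sketch of the interval, by hand and with t.test:

days <- c(10, 9, 7, 10, 3, 9, 12, 5)
n <- length(days)
mean(days) + c(-1, 1) * qt(0.975, df = n - 1) * sd(days) / sqrt(n)  # about 5.66 to 10.59 days
t.test(days)$conf.int  # same interval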
Assumptions
Assumptions needed for validity of the Confidence Interval
1. Data are a RANDOM SAMPLE from the population of interest
• (So that the sample can tell you about the population)
2. The sample average is approximately NORMAL
• Either the data are normal (check the histogram)
• Or the central limit theorem applies:
– Large enough sample size n, distribution not too skewed
• (So that the t table is technically appropriate)
Linear Regression and Correlation Methods
Chapter 11
Chapter Goals
Introduction to Bivariate Data Analysis
• Introduction to Simple Linear Regression Analysis
• Introduction to Linear Correlation Analysis
Interpret scatter plots
Motivating Example
Before a pharmaceutical sales rep can speak about a product
to physicians, he must pass a written exam
An HR Rep designed such a test with the hopes of hiring the
best possible reps to promote a drug in a high potential area
In order to check the validity of the test as a predictor of
weekly sales, he chose 5 experienced sales reps and piloted
the test with each one
The test scores and weekly sales are given in the following
table:
Motivating Example (cont)
SALESPERSON   TEST SCORE   WEEKLY SALES
JOHN          4            $5,000
BRENDA        7            $12,000
GEORGE        3            $4,000
HARRY         6            $8,000
AMY           10           $11,000
Introduction to Bivariate Data
Up until now, we’ve focused on univariate data
Analyzing how two (or more) variables “relate” is very
important to managers
• Prediction equations
• Estimate uncertainty around a prediction
• Identify unusual points
• Describe relationship between variables
Visualization
• Scatterplot
Scatterplot
Do Test Score and Weekly Sales appear related?

[Figure: scatterplot of weekly sales ($4,000 to $12,000) against test score (3 to 10).]
Correlation
Boomers' Little Secret Still Smokes Up the Closet
July 14, 2002
…Parental cigarette smoking, past or current, appeared to have a stronger correlation
to children's drug use than parental marijuana smoking, Dr. Kandel said. The
researchers concluded that parents influence their children not according to a simple
dichotomy — by smoking or not smoking — but by a range of attitudes and behaviors,
perhaps including their style of discipline and level of parental involvement. Their own
drug use was just one component among many…
A Bit of a Hedge to Balance the Market Seesaw
July 7, 2002
…Some so-called market-neutral funds have had as many years of negative returns as
positive ones. And some have a high correlation with the market's returns…
Correlation Analysis
Statistical techniques used to measure the strength of the
relationship between two variables
Correlation Coefficient: describes the strength of the
relationship between two sets of variables
• Denoted r
• r assumes a value between −1 and +1
• r = −1 or r = +1 indicates a perfect correlation
• r = 0 indicates no relationship between the two sets of variables
• Direction of the relationship is given by the coefficient’s sign
• Strength of relationship does not depend on the direction
• r measures LINEAR relationships ONLY
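For reference, a standard formula for the sample correlation (R's cor(x, y) computes it directly):

$r = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \,\sum_i (y_i - \bar{y})^2}}$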
Example Correlations
[Figure: six example scatterplots with correlations r = −0.9, r = −0.73, r = −0.25, r = 0.34, r = 0.7, and r = 0.88.]
Correlation Demo
Scatterplot
r = 0.88

[Figure: scatterplot of weekly sales against test score; the correlation is r = 0.88.]
Correlation and Causation
Must be very careful in interpreting correlation coefficients
Just because two variables are highly correlated does not mean that
one causes the other
• Ice cream sales and the number of shark attacks on swimmers are
correlated
• The miracle of the "Swallows" of Capistrano takes place each year at the
Mission San Juan Capistrano on March 19th and is accompanied by a large
number of human births around the same time
• The number of cavities in elementary school children and vocabulary size
have a strong positive correlation.
To establish causation, a designed experiment must be run
CORRELATION DOES NOT IMPLY CAUSATION
Regression Analysis
Simple Regression Analysis is predicting one variable
from another
• Past data on relevant variables are used to create and evaluate a
prediction equation
Variable being predicted is called the dependent
variable
Variable used to make prediction is an independent
variable
Introduction to Regression
Predicting future values of a variable is a crucial management
activity
• Future cash flows
• Needs for raw materials in a supply chain
• Future personnel or real estate needs
Explaining past variation is also an important activity
• Explain past variation in demand for services
• Impact of an advertising campaign or promotion
Introduction to Regression (cont.)
Prediction: Reference to future values
Explanation: Reference to current or past values
Simple Linear Regression: Single independent variable
predicting a dependent variable
• Independent variable is typically something we can control
• Dependent variable is typically something that is linearly related to the
value of the independent variable
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
Introduction to Regression (cont.)
Basic Idea: Fit a straight line that relates dependent variable (y)
and independent variable (x)
Linearity Assumption: Slope of the equation does not change as x
changes
Assuming linearity we can write
$y = \beta_0 + \beta_1 x + \epsilon$
which says that Y is made up of a predictable part (due
to X) and an unpredictable part
Coefficients are interpreted as the true, underlying intercept and
slope
Regression Assumptions
We start by assuming that for each value of X, the corresponding
value of Y is random, and has a normal distribution.
Which Line?
There are many good-fitting lines through these points

[Figure: the test score/weekly sales scatterplot again; several candidate straight lines could plausibly be drawn through the points.]
http://www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/index.html
Least Squares Principle
This method gives a best-fitting straight line by minimizing the
sum of the squares of the vertical deviations about the line
Regression Coefficient Interpretations:
• β₀: Y-intercept; estimated value of Y when X = 0
• β₁: slope of the line; average change in the predicted value of Y for each
change of one unit in the independent variable X
Least Squares Estimates

$\hat{\beta}_1 = \dfrac{S_{xy}}{S_{xx}}; \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$

where

$S_{xx} = \sum_i (x_i - \bar{x})^2; \qquad S_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y})$
Back to the Example
simple.lm(score, sales)

[Figure: scatterplot of sales against score with the fitted least-squares line y = 1133.33 x + 1199.99 superimposed.]
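The same fit can be reproduced with base R's lm function (simple.lm is a convenience wrapper from Verzani's course materials); a sketch using the data from the table above:

score <- c(4, 7, 3, 6, 10)
sales <- c(5000, 12000, 4000, 8000, 11000)
fit <- lm(sales ~ score)  # least-squares fit of sales on score
coef(fit)                 # intercept 1200, slope about 1133.33
cor(score, sales)         # about 0.88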
Back to the Example
Regression Plot: Sales = 1200 + 1133.33 Score (S = 1,955)

[Figure: fitted regression line plotted over the sales-versus-score data.]
Example
It is well known that the more beer you drink, the more your
blood alcohol level rises.
However, how much it rises per additional beer is not
clear.
Student   1      2      3      4      5      6      7      8      9      10
Beers     5      2      9      8      3      7      3      5      3      5
BAL       0.100  0.030  0.190  0.120  0.040  0.095  0.070  0.060  0.020  0.050
Calculate the correlation coefficient
Perform a regression analysis
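A sketch of both analyses in R, using the data above:

beers <- c(5, 2, 9, 8, 3, 7, 3, 5, 3, 5)
bal   <- c(0.100, 0.030, 0.190, 0.120, 0.040, 0.095, 0.070, 0.060, 0.020, 0.050)
cor(beers, bal)         # correlation coefficient, about 0.89
fit <- lm(bal ~ beers)  # regression of BAL on number of beers
coef(fit)               # slope about 0.019 BAL per additional beer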
Homework #6
Hildebrand/Ott
• HO: 7.1, page 204
• HO: 7.2, pages 204-205
• HO: 7.14, page 211
• HO: 7.17, page 211
• HO: 7.18, page 211
• HO: 7.20, page 214
• HO: 7.21, page 214
• HO: 7.30, page 218
• HO: 7.39, page 229
• HO: 7.74, page 244
Verzani
• 13.4 – first part (do not test the
hypothesis). Provide an
interpretation.