Transcript stats2
Data Analysis
Statistics
Inferential statistics
Hypothesis testing
Normal distribution: a probability distribution
About 99.7% of scores are within 3 standard
deviations of the mean
Who cares…
• The normal distribution is the most useful
distribution in inferential statistics.
• We can translate any normal variable, X, into the
standardized value, Z = (X - mean) / standard
deviation, and use it to make inferences about
the whole population. Use it when comparing
means or proportions.
• Example:
• Suppose you were the city police and you wanted
to know how many photo radar tickets you could
expect to collect next year so that you can
develop your budget...
• Last year the mean number of tickets for all
locations was 9000 with a standard deviation of
500 tickets. What is the probability that you will
give out between 7500 tickets (your lowball
guess) and 9625 (your highball guess)?
• Calculate Z score
• …what type of scale must you have to calculate
Z scores?
• …what reasons can you think of for wanting to
calculate a Z score for your research?
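The photo radar example can be worked through in a few lines of Python. This is a minimal sketch, assuming ticket counts are normally distributed with the mean and standard deviation given above. The helper names `z_score` and `normal_cdf` are illustrative, not part of the original slides.

```python
import math


def z_score(x: float, mean: float, sd: float) -> float:
    """Standardize x: how many standard deviations it lies from the mean."""
    return (x - mean) / sd


def normal_cdf(z: float) -> float:
    """Cumulative probability of the standard normal distribution at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


mean_tickets = 9000
sd_tickets = 500

z_low = z_score(7500, mean_tickets, sd_tickets)   # lowball guess
z_high = z_score(9625, mean_tickets, sd_tickets)  # highball guess

# Area under the normal curve between the two Z scores
probability = normal_cdf(z_high) - normal_cdf(z_low)

print(f"Z (low)  = {z_low}")
print(f"Z (high) = {z_high}")
print(f"P(7500 <= tickets <= 9625) = {probability:.3f}")
```

The two Z scores are -3.0 and 1.25, and the area between them is about 0.893, so there is roughly an 89% chance that the ticket count falls between the two guesses.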
Z tests, another application
• You have been asked to conduct a survey
on customer satisfaction at the food court.
Customers indicate their perceptions on a 5
point scale where 1=very unfriendly and
5=very friendly. Assume this is an interval
scale and that previous studies have shown
that a normal distribution of scores is
expected.
Z tests, assumptions about mean
• You think: perhaps customers think that the
service is neither friendly nor unfriendly
• Ho: mean is equal to 3.0
• H1: mean is not equal to 3.0
• Establish the significance level: 0.05 (95%
confidence), therefore critical Z = +/- 1.96 (two-tailed)
• You do a study with a sample of 225 interviews
and the mean is 3.78. The standard deviation is
1.5.
• Do we accept or reject the null hypothesis?
A Sampling Distribution
[Figure: sampling distribution of the mean centered on μ = 3.0, with the lower limit and upper limit marked]
Critical values of μ
Critical value - upper limit
$= \mu + Z S_{\bar{X}}$ or $\mu + Z \frac{S}{\sqrt{n}}$
$= 3.0 + 1.96 \left( \frac{1.5}{\sqrt{225}} \right) = 3.196$
Critical values of μ
Critical value - lower limit
$= \mu - Z S_{\bar{X}}$ or $\mu - Z \frac{S}{\sqrt{n}}$
$= 3.0 - 1.96 \left( \frac{1.5}{\sqrt{225}} \right) = 2.804$
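[Figure: sampling distribution showing the range of acceptability from 2.804 to 3.196 around μ = 3.0]
The sample mean is 3.78, which falls outside this range,
therefore reject Ho and say that the sample results are
significant at the .05 level of significance.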
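The food court test can be reproduced in Python. This is a sketch under the slide's assumptions (interval scale, normal sampling distribution, two-tailed test at 0.05). The function name `z_test_mean` is hypothetical.

```python
import math


def z_test_mean(sample_mean, hypothesized_mean, sd, n, z_critical=1.96):
    """Two-tailed Z test of a sample mean against a hypothesized mean.

    Returns the lower and upper critical values of the mean, the
    observed Z statistic, and whether the null hypothesis is rejected.
    """
    standard_error = sd / math.sqrt(n)
    lower = hypothesized_mean - z_critical * standard_error
    upper = hypothesized_mean + z_critical * standard_error
    z_observed = (sample_mean - hypothesized_mean) / standard_error
    reject_null = abs(z_observed) > z_critical
    return lower, upper, z_observed, reject_null


# Values from the slides: 225 interviews, mean 3.78, standard deviation 1.5
lower, upper, z_observed, reject_null = z_test_mean(3.78, 3.0, 1.5, 225)

print(f"Range of acceptability: {lower:.3f} to {upper:.3f}")
print(f"Observed Z = {z_observed:.1f}")
print("Reject Ho" if reject_null else "Do not reject Ho")
```

The standard error is 1.5 / 15 = 0.1, so the observed Z is about 7.8, far beyond +/- 1.96. Comparing the sample mean with the critical limits and comparing the observed Z with the critical Z are two views of the same decision.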
Type I and Type II Errors
              | Accept null       | Reject null
Null is true  | Correct, no error | Type I error
Null is false | Type II error     | Correct, no error
If sample is small…
• Small usually means less than 30
• Do a t test instead
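As a rough illustration, the same kind of test with a small sample might look like the sketch below. The sample size of 25 is a hypothetical value chosen for illustration, and the critical value 2.064 (two-tailed, 0.05 level, 24 degrees of freedom) is taken from a standard t table rather than computed.

```python
import math


def t_statistic(sample_mean, hypothesized_mean, sample_sd, n):
    """One-sample t statistic; degrees of freedom are n - 1."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error


# Hypothetical small sample: same mean and standard deviation as the
# food court example, but only 25 interviews (assumption, not from the slides)
n = 25
T_CRITICAL_DF24 = 2.064  # two-tailed, alpha = 0.05, df = 24, from a t table

t_observed = t_statistic(3.78, 3.0, 1.5, n)
reject_null = abs(t_observed) > T_CRITICAL_DF24

print(f"t = {t_observed:.2f} with {n - 1} degrees of freedom")
print("Reject Ho" if reject_null else "Do not reject Ho")
```

The critical t value is larger than 1.96 because the t distribution has heavier tails when the sample is small.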
Is this statistically significant?
• Chi-square test: a hypothesis test that allows for
investigation of statistical significance in the
analysis of a frequency distribution (or cross
tab)
• Categorical data such as sex, education or
dichotomous answers may be statistically
analyzed
• Tests the “goodness of fit” of the sample with
expected population results
Chi-square example
• Through observation research we have identified that, of
the sample of 100 people who got photo radar tickets, 60
were female and 40 were male. We expected that the
proportions should be equal (.5 probability for each sex),
that is, 50 female and 50 male. Our null hypothesis is that
the observed frequencies are consistent with these expected
frequencies; we test it at the 0.05 level of significance.
• If the calculated chi square is above the critical chi
square for this level (3.84, with 1 degree of freedom) we
reject the null hypothesis. This is the case: the calculated
chi square is (60-50)²/50 + (40-50)²/50 = 4.0, so the
observed values differ significantly from the expected values
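The chi-square calculation for this example can be sketched in Python. The function name `chi_square` is illustrative. The critical value 3.84 is the one quoted on the slide (0.05 level, 1 degree of freedom).

```python
def chi_square(observed, expected):
    """Chi-square statistic: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [60, 40]   # female, male ticket recipients in the sample
expected = [50, 50]   # equal proportions (.5 each) of 100 people
CHI_CRITICAL = 3.84   # 0.05 level, 1 degree of freedom (from the slide)

chi_sq = chi_square(observed, expected)
reject_null = chi_sq > CHI_CRITICAL

print(f"Calculated chi-square = {chi_sq}")
print("Reject Ho" if reject_null else "Do not reject Ho")
```

Each cell contributes 100 / 50 = 2.0, so the statistic is 4.0, just above the critical value.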
Estimation of population
parameters: Confidence
• The population mean and standard deviation are
unknown; we do know the sample mean and
standard deviation….
• We take a sample of students with children and
ask them to identify how much they would be
willing to pay per hour for on-campus childcare.
Our sample size is 30. The student population
with children is estimated to be 300.
• The sample mean is $2.60.
• This is called a point estimate.
• How close is this sample mean to the
population mean? How confident are we?
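• Confidence level: the percentage indicating
the long run probability that the results will
be correct. Usually 95%. The confidence
interval is the range around the point estimate
within which the population mean is expected
to lie at that level.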
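A 95% confidence interval around the $2.60 point estimate could be sketched as follows. The slides do not give the sample standard deviation, so the value 0.60 below is purely an assumption for illustration. The sketch also uses Z = 1.96 and ignores any adjustment for the small population of 300.

```python
import math


def confidence_interval(sample_mean, sample_sd, n, z=1.96):
    """Interval estimate: sample mean +/- z * (sd / sqrt(n))."""
    margin = z * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


SAMPLE_MEAN = 2.60   # point estimate from the slide
SAMPLE_SIZE = 30     # from the slide
ASSUMED_SD = 0.60    # hypothetical: not given in the slides

ci_lower, ci_upper = confidence_interval(SAMPLE_MEAN, ASSUMED_SD, SAMPLE_SIZE)
print(f"95% confidence interval: ${ci_lower:.2f} to ${ci_upper:.2f}")
```

A larger sample or a smaller standard deviation narrows the interval; a higher confidence level (a larger Z) widens it.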
Relationship between variables
Correlation and regression analysis
Types of questions
• Is employee productivity associated with pay
incentives?
• Is salary level correlated with type of degree or
designation?
• Is willingness to pay student fee levies for
daycare correlated with whether one has a child?
• Are students' grades influenced by length of term?
Measures of association
• A general term that refers to a number of
bivariate statistical techniques used to measure
the strength of a relationship between two
variables
• Correlation coefficient (r): the most popular
measure of association. It measures the covariation
or association between two variables and ranges
from +1 to -1
Measures of association
• Coefficient of determination (r²)
• The proportion of the total variance of a variable
that is accounted for by knowing the value of
another variable. Often shown as a correlation
matrix.
• We have calculated r = -.65 when investigating
whether the number of years of university is
correlated with unemployment. Then r² = (-.65)² =
.42, so we know that about 42% of the variance in
unemployment can be explained by variance in
years of university
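To show how r and r² are obtained, here is a minimal sketch that computes both from paired observations. The data values are hypothetical, invented only to show a negative association, and `pearson_r` is an illustrative helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(variation_x * variation_y)


# Hypothetical data: years of university and unemployment rate (%)
years_of_university = [0, 1, 2, 3, 4, 5, 6]
unemployment_rate = [9.0, 10.5, 7.0, 8.0, 5.5, 7.5, 4.0]

r = pearson_r(years_of_university, unemployment_rate)
r_squared = r ** 2

print(f"r = {r:.2f}, r squared = {r_squared:.2f}")

# The slide's own figure: r = -.65 gives r squared of about .42
slide_r_squared = (-0.65) ** 2
print(f"Slide example: r squared = {slide_r_squared:.4f}")
```

The sign of r gives the direction of the association; squaring it discards the sign and leaves the share of variance explained.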
Regression analysis
• Bivariate linear regression: a measure of linear
association that investigates a straight line
relationship.
• Assuming that there is an association between
students’ performance and length of term, can we
predict a student's GPA given the distribution of
their courses across semesters?
• Uses interval data
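A straight-line relationship can be fitted with the least squares method, as in the sketch below. The term lengths and GPA values are hypothetical and serve only to illustrate the mechanics. `least_squares` is an illustrative name.

```python
def least_squares(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: length of term in weeks (IV) and GPA (DV)
term_weeks = [6, 8, 10, 13, 15]
gpa = [2.8, 3.0, 2.9, 3.2, 3.3]

intercept, slope = least_squares(term_weeks, gpa)

# Forecast the DV for a given value of the IV
predicted_gpa = intercept + slope * 12
print(f"GPA = {intercept:.2f} + {slope:.3f} * weeks")
print(f"Predicted GPA for a 12-week term: {predicted_gpa:.2f}")
```

The fitted line always passes through the point (mean of X, mean of Y), which is why the intercept is computed from the two means once the slope is known.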
Regression analysis
• Multiple regression analysis: an analysis of
association that simultaneously
investigates the effect of two or more
variables on a single, interval-scaled
dependent variable
Summary
• Chi-square allows you to test whether an observed sample distribution fits some given
distribution. Are the groups in your cross tab independent?
• Z and t tests are used to determine whether means or proportions are significantly
different (a sample against a hypothesized value, or two samples against each other).
• Simple correlation measures the relationship of one variable to another. The correlation
coefficient (r) indicates the strength and direction of the association.
The coefficient of determination measures the amount of the total variance in the DV
that is accounted for by knowing the value of the independent variable. The results are
often shown in a correlation matrix.
• Bivariate regression investigates a straight-line relationship between one IV and one
DV. This can be done by plotting a scatter diagram or by the least squares method. It is
used to forecast values of the DV given values of the IV. The goodness of fit may be
evaluated by calculating the coefficient of determination. Multiple regression analysis
allows for simultaneous investigation of the effect of two or more IVs on the DV.
Type of Scale | Numerical Operation | Descriptive Statistics
Nominal | Counting | Frequency; cross tab; Percentage; mode
Ordinal | Rank ordering | (plus…) Median; Range; Percentile
Interval | Arithmetic operations on intervals between numbers | (plus…) Mean; Standard deviation; variance
Ratio | Arithmetic operations on actual quantities | (plus…) Geometric mean; Coefficient of variation
Selecting appropriate univariate
statistical method
Scale: Nominal scale
Business problem: Identify sex of key executives
Statistical question to be asked: Is the number of female executives equal to the number of male executives?
Possible test of statistical significance: Chi-square test
Scale: Nominal scale
Business problem: Indicate percentage of key executives who are male
Statistical question to be asked: Is the proportion of male executives the same as the hypothesized proportion?
Possible test of statistical significance: Z test
Scale: Ordinal scale
Business problem: Compare actual and expected evaluations
Statistical question to be asked: Does the distribution of scores for a scale with categories of poor, good, excellent differ from an expected distribution?
Possible test of statistical significance: Chi-square test
Scale: Interval or Ratio scale
Business problem: Compare actual and hypothetical values of average salary
Statistical question to be asked: Is the sample mean significantly different from the hypothesized population mean?
Possible test of statistical significance: Z-test (sample is large); T-test (sample is small)
Determining Sample Size
• What data do you need to consider
– Variance or heterogeneity of population
– The degree of acceptable error (confidence
interval)
– Confidence level
– Generally, we need to make judgments on all
these variables
Determining Sample Size
• Variance or heterogeneity of population
– Previous studies? Industry expectations? Pilot
study?
– Sequential sampling
– Rule of thumb: the value of standard deviation
is expected to be 1/6 of the range.
Determining Sample Size
• Formula
• $n = \left( \frac{ZS}{E} \right)^2$
Z = standardization value indicating confidence
level
S = sample standard deviation
E = acceptable magnitude of error
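The sample size formula and the range rule of thumb can be combined in a short sketch. The range of $3.60 and the acceptable error of $0.10 are hypothetical inputs chosen for illustration, and both function names are illustrative.

```python
import math


def estimate_sd_from_range(value_range):
    """Rule of thumb from the slides: standard deviation is about 1/6 of the range."""
    return value_range / 6


def sample_size(z, sd, error):
    """n = (Z * S / E)^2, rounded up to the next whole respondent."""
    return math.ceil((z * sd / error) ** 2)


# Hypothetical inputs: answers expected to span $3.60, 95% confidence,
# acceptable error of +/- $0.10
sd_estimate = estimate_sd_from_range(3.60)
n_required = sample_size(z=1.96, sd=sd_estimate, error=0.10)

print(f"Estimated standard deviation: {sd_estimate:.2f}")
print(f"Required sample size: {n_required}")
```

With these inputs the formula gives (1.96 × 0.60 / 0.10)² ≈ 138.3, which rounds up to 139. Halving the acceptable error roughly quadruples the required sample, because E is squared in the denominator.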
It's not the size that matters….