Transcript stats2

Data Analysis
Statistics
Inferential statistics
Hypothesis testing
Type I and Type II Errors
                 Null is true          Null is false
Accept null      Correct (no error)    Type II error
Reject null      Type I error          Correct (no error)
Normal distribution: a
probability distribution
About 99.7% of scores
are within 3 sd
of the mean
Who cares…



The most useful distribution in inferential
statistics.
We can translate any normal variable, X,
into the standardized value, Z to make
assumptions about the whole population.
Use when comparing means or proportions.
Example:
Suppose you were the city police and you
wanted to know how many photo radar
tickets you could expect to collect next year
so that you can develop your budget...




Last year the mean number of tickets for all
locations was 9000 with a standard deviation
of 500 tickets. What is the probability that
you will give out between 7500 tickets (your
low guess) and 9625 (your high guess)?
Calculate Z score
…what type of scale must you have to
calculate Z scores?
…what reasons can you think of for wanting
to calculate a Z score for your research?
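The photo radar question can be worked through numerically. The following is a minimal sketch, assuming ticket counts are normally distributed with the stated mean and standard deviation. The helper name normal_cdf is illustrative and does not come from the slides.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Cumulative probability of the standard normal distribution at z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mean, sd = 9000.0, 500.0      # last year's mean and standard deviation
low, high = 7500.0, 9625.0    # low guess and high guess

# Standardize both guesses: Z = (X - mean) / sd
z_low = (low - mean) / sd
z_high = (high - mean) / sd

# Probability that the ticket count falls between the two guesses
probability = normal_cdf(z_high) - normal_cdf(z_low)
print(f"Z from {z_low} to {z_high}: probability = {probability:.4f}")
```

Under these assumptions the two guesses sit at Z = -3.0 and Z = 1.25, and the probability of landing between them is roughly 0.89.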
Z tests, another
application

You have been asked to conduct a
survey on customer satisfaction at the
food court. Customers indicate their
perceptions on a 5 point scale where
1=very unfriendly and 5=very friendly.
Assume this is an interval scale and
that previous studies have shown that
a normal distribution of scores is
expected.
Z tests, assumptions
about mean




You think: customers believe that the
service is neither friendly nor unfriendly
Ho: mean is equal to 3.0
H1: mean is not equal to 3.0
Establish significance level = 0.05
(95% confidence level),
therefore critical Z = +/- 1.96
You do a study with a sample of 225
interviews and the mean is 3.78. The
standard deviation is 1.5.
Do we accept or reject the null hypothesis?
A Sampling Distribution
[Figure: sampling distribution centered at μ = 3.0, with the lower limit and upper limit marked]
Critical values of μ
Critical value - upper limit
$= \mu + Z S_{\bar{X}}$ or $\mu + Z \frac{S}{\sqrt{n}}$
$= 3.0 + 1.96 \left( \frac{1.5}{\sqrt{225}} \right)$
Critical values of μ
Critical value - lower limit
$= \mu - Z S_{\bar{X}}$ or $\mu - Z \frac{S}{\sqrt{n}}$
$= 3.0 - 1.96 \left( \frac{1.5}{\sqrt{225}} \right)$
[Figure: sampling distribution centered at 3.0, with the range of acceptability running from 2.804 to 3.196]
3.78 sample mean, therefore reject Ho
and say that the sample results are
significant at the .05 level of significance
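The food court Z test can be checked with a few lines of arithmetic. This is a sketch using only the numbers given on the slides (hypothesized mean 3.0, sample mean 3.78, standard deviation 1.5, n = 225, critical Z = 1.96). The variable names are illustrative.

```python
from math import sqrt

mu_0 = 3.0           # hypothesized population mean (null hypothesis)
sample_mean = 3.78   # mean of the 225 interviews
s = 1.5              # sample standard deviation
n = 225              # sample size
z_critical = 1.96    # two-tailed critical value at the 0.05 level

# Standard error of the mean: S / sqrt(n)
standard_error = s / sqrt(n)

# Range of acceptability around the hypothesized mean
lower = mu_0 - z_critical * standard_error
upper = mu_0 + z_critical * standard_error

# Observed Z statistic for the sample mean
z = (sample_mean - mu_0) / standard_error

# Reject the null hypothesis if the sample mean falls outside the range
reject = abs(z) > z_critical
print(f"range {lower:.3f} to {upper:.3f}, z = {z:.1f}, reject null: {reject}")
```

The sample mean of 3.78 lies well above the upper limit, so the null hypothesis is rejected.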
If sample is small…


Small usually means less than 30
Do a t test instead
Is this statistically
significant?



Chi-square test: a hypothesis test that
allows for investigation of statistical
significance in the analysis of a frequency
distribution (or cross tab)
Categorical data such as sex, education or
dichotomous answers may be statistically
analyzed
Tests the “goodness of fit” of the sample
with expected population results
Chi-square example


Through observation research we have identified
that of the sample of 100 people who got photo
radar tickets, 60 were female and 40 were male.
We expected that the proportions should be equal
(.5 probability for each sex). Our null hypothesis is
that the observed frequencies are consistent with the
expected frequencies (50 female and 50 male); we test
it at the 0.05 level of significance.
If the calculated chi-square is above the critical chi-
square for this level (3.84, with 1 degree of freedom)
we reject the null hypothesis. This is the case: the
calculated chi-square is 4.0. The observed values
differ significantly from the expected values.
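The chi-square statistic for the photo radar example can be computed directly from the observed and expected counts. A minimal sketch, using the counts and the critical value (3.84) from the slide; the variable names are illustrative.

```python
# Observed counts from the sample of 100 ticketed drivers
observed = {"female": 60, "male": 40}

# Expected counts under the null hypothesis of equal proportions (.5 each)
total = sum(observed.values())
expected = {group: total * 0.5 for group in observed}

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_square = sum(
    (observed[group] - expected[group]) ** 2 / expected[group]
    for group in observed
)

critical_value = 3.84  # 0.05 level of significance, 1 degree of freedom
reject = chi_square > critical_value
print(f"chi-square = {chi_square:.2f}, reject null: {reject}")
```

Each group contributes (10)^2 / 50 = 2.0, so the statistic is 4.0, which is above 3.84.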
Estimation of population
parameters: Confidence


The population mean and standard
deviation are unknown; we do know the
sample mean and standard deviation….
We take a sample of a number of students
with children and ask them to identify how
much they would be willing to pay per hour
for on-campus childcare. Our sample size is
30. The student population with children is
estimated to be 300.




The sample mean is $2.60.
This is called a point estimate.
How close is this sample mean to the
population mean? How confident are
we?
Confidence level: the percentage
indicating the long-run probability that
the results will be correct. Usually
95%
Relationship between
variables
Correlation and regression
analysis
Types of questions




Is employee productivity associated with
pay incentives?
Is salary level correlated with type of degree
or designation?
Is willingness to pay student fee levies for
daycare correlated with whether one has a
child?
Are students' grades influenced by length of
term?
Measures of association


A general term that refers to a number of
bivariate statistical techniques used to
measure the strength of a relationship
between two variables
Correlation coefficient (r): most popular. Is a
measure of the covariation or association
between two variables. It ranges from +1
to -1
Measures of association



Coefficient of determination (r^2)
The proportion of the total variance of a
variable that is accounted for by knowing
the value of another variable. Often shown
as a correlation matrix.
We have calculated r=-.65 when
investigating whether the number of years
of university is correlated with
unemployment. Since r^2 = (-.65)^2 = .42, we
know that about 40% of the variance in
unemployment can be explained by variance
in years of university
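The correlation coefficient and the coefficient of determination can be sketched as follows. The data values are hypothetical, invented only to illustrate the calculation; they are not from the slides. The function name pearson_r is also illustrative. The last lines square the slide's r = -.65, which gives roughly .42.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(variation_x * variation_y)

# HYPOTHETICAL data: years of university and unemployment rate (%)
years_of_university = [0, 1, 2, 3, 4, 5, 6]
unemployment_rate = [9.0, 8.5, 8.8, 6.0, 6.5, 4.0, 5.0]

r = pearson_r(years_of_university, unemployment_rate)
r_squared = r ** 2  # proportion of variance accounted for
print(f"r = {r:.2f}, r^2 = {r_squared:.2f}")

# The value quoted on the slide: squaring r gives r^2
slide_r = -0.65
slide_r_squared = slide_r ** 2
print(f"slide r = {slide_r}, r^2 = {slide_r_squared:.4f}")
```

A negative r means unemployment tends to fall as years of university rise; r^2 is always between 0 and 1 regardless of the sign of r.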
Regression analysis



Bivariate linear regression: a measure of
linear association that investigates a straight
line relationship.
Assuming that there is an association
between students’ performance and length
of term, can we predict a student’s GPA
given the distribution of their courses across
semesters?
Uses interval data
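A straight-line fit by the least squares method can be sketched as below. The term lengths and GPA values are hypothetical, made up purely for illustration, and the function name least_squares_fit is an assumption, not something from the slides.

```python
def least_squares_fit(xs, ys):
    """Return (intercept, slope) of the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope

# HYPOTHETICAL data: length of term in weeks (IV) and GPA (DV)
term_weeks = [6, 8, 10, 13, 15]
gpa = [2.6, 2.8, 3.0, 3.3, 3.5]

intercept, slope = least_squares_fit(term_weeks, gpa)

# Forecast the DV for a new value of the IV
predicted_gpa = intercept + slope * 12
print(f"GPA = {intercept:.2f} + {slope:.2f} * weeks; 12 weeks -> {predicted_gpa:.2f}")
```

Real data would scatter around the line rather than fall exactly on it; the coefficient of determination then indicates how well the line fits.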
Regression analysis

Multiple regression analysis: an
analysis of association that
simultaneously investigates the effect
of two or more variables on a single,
interval-scaled dependent variable
Summary




Chi-square allows you to test whether an observed sample distribution fits
some given distribution. Are the groups in your cross tab independent?
Z and t tests are used to determine if the means or proportions of two
samples are significantly different.
Simple correlation measures the relationship of one variable to another.
The correlation coefficient (r) indicates the strength and direction of the
association. The coefficient of determination measures the
amount of the total variance in the DV that is accounted for by knowing the
value of the independent variable. The results are often shown in a
correlation matrix.
Bivariate regression investigates a straight-line relationship between one IV
and one DV. This can be done by plotting a scatter diagram or by the least
squares method. It is used to forecast values of the DV given values of the IV.
The goodness of fit may be evaluated by calculating the coefficient of
determination. Multiple regression analysis allows for simultaneous
investigation of the effect of two or more IVs on the DV.
Type of Scale   Numerical Operation                                   Descriptive Statistics
Nominal         Counting                                              Frequency; cross tab; percentage; mode
Ordinal         Rank ordering                                         (plus…) Median; range; percentile
Interval        Arithmetic operations on intervals between numbers    (plus…) Mean; standard deviation; variance
Ratio           Arithmetic operations on actual quantities            (plus…) Geometric mean; coefficient of variation
Selecting appropriate
univariate statistical
method
Scale: Nominal scale
Business problem: Identify sex of key executives
Statistical question to be asked: Is the number of female executives equal to the number of male executives?
Possible test of statistical significance: Chi-square test
Scale: Nominal scale
Business problem: Indicate percentage of key executives who are male
Statistical question to be asked: Is the proportion of male executives the same as the hypothesized proportion?
Possible test of statistical significance: Z test
Scale: Ordinal scale
Business problem: Compare actual and expected evaluations
Statistical question to be asked: Does the distribution of scores for a scale with categories of poor, good, excellent differ from an expected distribution?
Possible test of statistical significance: Chi-square test
Scale: Interval or ratio scale
Business problem: Compare actual and hypothetical values of average salary
Statistical question to be asked: Is the sample mean significantly different from the hypothesized population mean?
Possible test of statistical significance: Z-test (sample is large); T-test (sample is small)
Determining Sample Size

What data do you need to consider?
– Variance or heterogeneity of population
– The degree of acceptable error
(confidence interval)
– Confidence level
– Generally, we need to make judgments
on all these variables
Determining Sample Size

Variance or heterogeneity of
population
– Previous studies? Industry expectations?
Pilot study?
– Sequential sampling
– Rule of thumb: the value of standard
deviation is expected to be 1/6 of the
range.
Determining Sample Size


Formula
$N = (ZS/E)^2$
Z= standardization value indicating
confidence level
S= sample standard deviation
E= acceptable magnitude of error
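The sample size formula can be applied as in the sketch below. The input values are hypothetical: Z = 1.96 for 95% confidence, a standard deviation of 1.5 borrowed from the food court example, and an acceptable error of 0.2 scale points chosen only for illustration. The function name is also an assumption.

```python
from math import ceil

def required_sample_size(z, s, e):
    """Sample size from N = (Z*S/E)^2, rounded up to a whole respondent."""
    return ceil((z * s / e) ** 2)

# HYPOTHETICAL inputs for illustration
z = 1.96   # standardization value for a 95% confidence level
s = 1.5    # estimate of the standard deviation
e = 0.2    # acceptable magnitude of error, in the same units as s

n = required_sample_size(z, s, e)
print(f"required sample size: {n}")
```

Halving the acceptable error quadruples the required sample size, because E enters the formula squared.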
It's not the size that matters…