chi-square test for independence


A PowerPoint Presentation Package to Accompany
Applied Statistics in Business & Economics, 4th edition
David P. Doane and Lori E. Seward
Prepared by Lloyd R. Jaisingh
McGraw-Hill/Irwin
Copyright © 2013 by The McGraw-Hill Companies, Inc. All rights reserved.
Chapter 15
Chi-Square Tests
Chapter Contents
15.1 Chi-Square Test for Independence
15.2 Chi-Square Tests for Goodness-of-Fit
15.3 Uniform Goodness-of-Fit Test
15.4 Poisson Goodness-of-Fit Test
15.5 Normal Chi-Square Goodness-of-Fit Test
15.6 ECDF Tests (Optional)
Chapter Learning Objectives
LO15-1: Recognize a contingency table.
LO15-2: Find degrees of freedom and use the chi-square table of critical values.
LO15-3: Perform a chi-square test for independence on a contingency table.
LO15-4: Perform a goodness-of-fit (GOF) test for a uniform distribution.
LO15-5: Explain the GOF test for a Poisson distribution.
LO15-6: Use computer software to perform a chi-square GOF test for normality.
LO15-7: State advantages of ECDF tests as compared to chi-square GOF tests.
15.1 Chi-Square Test for Independence
LO15-1: Recognize a contingency table.
Contingency Tables
• A contingency table is a cross-tabulation of n paired observations into categories.
• Each cell shows the count of observations that fall into the category defined by its row (r) and column (c) heading.
• For example: [contingency table figure not reproduced in this transcript]
LO15-3: Perform a chi-square test for independence on a contingency table.
LO15-2: Find degrees of freedom and use the chi-square table of critical values.
Chi-Square Test
• In a test of independence for an r × c contingency table, the hypotheses are
  H0: Variable A is independent of variable B
  H1: Variable A is not independent of variable B
• Use the chi-square test for independence to test these hypotheses.
• This non-parametric test is based on frequencies.
• The n data pairs are classified into c columns and r rows, and then the observed frequency fjk is compared with the expected frequency ejk.
• The critical value comes from the chi-square probability distribution with ν degrees of freedom (see Appendix E for table values):
  ν = degrees of freedom = (r – 1)(c – 1)
  where
  r = number of rows in the table
  c = number of columns in the table
Expected Frequencies
• Assuming that H0 is true, the expected frequency of row j and column k is:
ejk = RjCk/n
where
Rj = total for row j (j = 1, 2, …, r)
Ck = total for column k (k = 1, 2, …, c)
n = sample size
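As an illustration of ejk = RjCk/n, here is a minimal Python sketch. The function name `expected_frequencies` and the 2 × 2 counts are hypothetical and are not taken from the textbook.

```python
def expected_frequencies(table):
    """Return e_jk = R_j * C_k / n for a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in table]        # R_j
    col_totals = [sum(col) for col in zip(*table)]  # C_k
    n = sum(row_totals)                             # sample size
    return [[r_j * c_k / n for c_k in col_totals] for r_j in row_totals]


# Hypothetical observed counts (2 rows x 2 columns).
observed = [[10, 20],
            [30, 40]]
expected = expected_frequencies(observed)
print(expected)
```

The expected frequencies have the same row and column totals as the observed table, which is a quick way to check the arithmetic.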
Steps in Testing the Hypotheses
• Step 1: State the Hypotheses.
  H0: Variable A is independent of variable B
  H1: Variable A is not independent of variable B
• Step 2: Specify the Decision Rule.
  Calculate d.f. = (r – 1)(c – 1).
  For a given α, look up the right-tail critical value (χ²R) from Appendix E or by using Excel.
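• Step 3: Calculate the Expected Frequencies.
  ejk = RjCk/n for each cell.
• Step 4: Calculate the Test Statistic.
  The chi-square test statistic is
  $\chi^2_{\text{calc}} = \sum_{j=1}^{r} \sum_{k=1}^{c} \frac{(f_{jk} - e_{jk})^2}{e_{jk}}$
• Step 5: Make the Decision.
  Reject H0 if the test statistic > χ²R or if the p-value ≤ α.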
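The steps can be sketched in Python as follows. This is an illustrative outline only: the counts are hypothetical, the function name `chi_square_independence` is made up, and the critical value 3.841 is the standard table value for d.f. = 1 at a significance level of .05.

```python
def chi_square_independence(table):
    """Return (test statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for j, row in enumerate(table):
        for k, f_jk in enumerate(row):
            e_jk = row_totals[j] * col_totals[k] / n  # expected frequency under H0
            stat += (f_jk - e_jk) ** 2 / e_jk
    df = (len(table) - 1) * (len(table[0]) - 1)       # (r - 1)(c - 1)
    return stat, df


observed = [[10, 20],
            [30, 40]]              # hypothetical counts
stat, df = chi_square_independence(observed)
critical_value = 3.841             # table value for d.f. = 1, alpha = .05
reject_h0 = stat > critical_value  # right-tail decision rule
print(round(stat, 4), df, reject_h0)
```

For these made-up counts the statistic is well below the critical value, so H0 (independence) would not be rejected.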
Small Expected Frequencies
• The chi-square test is unreliable if the expected frequencies are too small.
• Rules of thumb:
  • Cochran’s Rule (conservative) requires that ejk > 5 for all cells.
  • A more relaxed rule allows up to 20% of the cells to have ejk < 5.
• Most agree that a chi-square test is infeasible if ejk < 1 in any cell.
• If this happens, try combining adjacent rows or columns to enlarge the expected frequencies.
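A short Python sketch of these checks follows. It assumes the 20% guideline is read as a more relaxed alternative to Cochran's Rule. The function name `small_expected_check` and the expected frequencies are hypothetical.

```python
def small_expected_check(expected):
    """Apply the rules of thumb to a table of expected frequencies e_jk."""
    cells = [e for row in expected for e in row]
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    return {
        "cochran_ok": all(e > 5 for e in cells),  # every e_jk > 5
        "relaxed_ok": share_below_5 <= 0.20,      # at most 20% of cells with e_jk < 5
        "infeasible": any(e < 1 for e in cells),  # any e_jk < 1
    }


# Hypothetical expected frequencies for a 3 x 2 table.
sparse = [[12.0, 18.0], [4.0, 6.0], [0.8, 1.2]]
result = small_expected_check(sparse)
print(result)
```

When `infeasible` is true, the slide's advice applies: combine adjacent rows or columns and recompute the expected frequencies.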
Test of Two Proportions
• For a 2 × 2 contingency table, the chi-square test is equivalent to a two-tailed z test for two proportions, if the samples are large enough to ensure normality.
• The hypotheses are:
  H0: π1 = π2
  H1: π1 ≠ π2
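The equivalence can be checked numerically: the squared z statistic for two proportions (using the pooled proportion) equals the chi-square statistic for the same 2 × 2 table. The sketch below uses hypothetical counts and made-up function names. The 2 × 2 shortcut formula n(ad – bc)²/(R1·R2·C1·C2) is a standard identity, not something stated on the slide.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: pi1 = pi2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2 x 2 table [[a, b], [c, d]] (shortcut form)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical data: 10 successes in 30 trials versus 30 successes in 70 trials.
z = two_proportion_z(10, 30, 30, 70)
chi2 = chi_square_2x2(10, 20, 30, 40)
print(z ** 2, chi2)  # the two values agree up to rounding error
```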
Cross-Tabulating Raw Data
• Chi-square tests for independence can also be used to analyze quantitative variables by coding them into categories.
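One way to code numerical pairs into categories is sketched below. The cutoffs, the "low"/"high" labels, the data, and the function names are all hypothetical; any sensible coding scheme could be substituted.

```python
from collections import Counter


def code(value, cutoff):
    """Code a numerical value as 'low' or 'high' relative to a cutoff."""
    return "high" if value >= cutoff else "low"


def cross_tabulate(pairs, x_cutoff, y_cutoff):
    """Build a 2 x 2 table of counts (rows = coded X, columns = coded Y)."""
    counts = Counter((code(x, x_cutoff), code(y, y_cutoff)) for x, y in pairs)
    labels = ["low", "high"]
    return [[counts[(r, c)] for c in labels] for r in labels]


# Hypothetical (X, Y) observations.
pairs = [(1.2, 30), (3.4, 55), (2.8, 42), (0.9, 61), (4.1, 70), (3.9, 38)]
table = cross_tabulate(pairs, x_cutoff=3.0, y_cutoff=50)
print(table)
```

The resulting table of counts can then be tested for independence like any other contingency table, provided the expected frequencies are large enough.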
Why Do a Chi-Square Test on Numerical Data?
• The researcher may believe there’s a relationship between X and Y, but doesn’t want to use regression.
• There are outliers or anomalies that prevent us from assuming that the data came from a normal population.
• The researcher has numerical data for one variable but not the other.
Figure 14.6
15.2 Chi-Square Tests for Goodness-of-Fit
Purpose of the Test
• The goodness-of-fit (GOF) test helps you decide whether your sample resembles a particular kind of population.
• The chi-square test will be used because it is versatile and easy to understand.
Multinomial GOF Test
• A multinomial distribution is defined by any k probabilities p1, p2, …, pk that sum to unity. For example,
  H0: p1 = .13, p2 = .13, p3 = .24, p4 = .20, p5 = .16, p6 = .14
  H1: At least one of the pj differs from the hypothesized value.
• Here no parameters are estimated (m = 0) and there are c = 6 classes, so the degrees of freedom are d.f. = c – m – 1 = 6 – 0 – 1 = 5.
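The multinomial example can be worked through in Python as a rough sketch. The hypothesized probabilities come from H0 above. The observed counts are invented for illustration, and 11.070 is the standard table value for d.f. = 5 at a significance level of .05.

```python
probs = [0.13, 0.13, 0.24, 0.20, 0.16, 0.14]  # hypothesized p1..p6 from H0
observed = [30, 22, 50, 38, 35, 25]           # hypothetical sample counts

n = sum(observed)
expected = [n * p for p in probs]             # e_j = n * p_j
stat = sum((f - e) ** 2 / e for f, e in zip(observed, expected))

m = 0                                         # no parameters estimated
df = len(probs) - m - 1                       # c - m - 1
critical_value = 11.070                       # table value for d.f. = 5, alpha = .05
reject_h0 = stat > critical_value
print(round(stat, 4), df, reject_h0)
```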
Hypotheses for GOF
• The hypotheses are:
  H0: The population follows a _____ distribution
  H1: The population does not follow a _____ distribution
• The blank may contain the name of any theoretical distribution (e.g., uniform, Poisson, normal).
Test Statistic and Degrees of Freedom for GOF
• The test statistic is
  $\chi^2_{\text{calc}} = \sum_{j=1}^{c} \frac{(f_j - e_j)^2}{e_j}$
  where fj = the observed frequency of observations in class j and ej = the expected frequency in class j if H0 were true.
• The test statistic follows the chi-square distribution with degrees of freedom d.f. = c – m – 1, where c is the number of classes used in the test and m is the number of parameters estimated.
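The slides rely on Appendix E or Excel for critical values and p-values. As a rough stand-in, the sketch below computes the right-tail p-value with a series expansion of the incomplete gamma function, using only the Python standard library. The helper names `chi2_sf` and `gof_test` and the counts are hypothetical, and the series is meant for illustration rather than production use.

```python
import math


def chi2_sf(x, df):
    """Right-tail probability P(chi-square with df degrees of freedom > x)."""
    if x <= 0:
        return 1.0
    a, t = df / 2.0, x / 2.0
    term = 1.0 / a            # first term of the incomplete-gamma series
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= t / (a + n)
        total += term
        n += 1
    lower = total * math.exp(a * math.log(t) - t - math.lgamma(a))
    return max(0.0, 1.0 - lower)  # loses precision for very large x


def gof_test(observed, expected, m=0):
    """Return (statistic, d.f., p-value) with d.f. = c - m - 1."""
    stat = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
    df = len(observed) - m - 1
    return stat, df, chi2_sf(stat, df)


# Hypothetical counts in c = 5 classes, each with expected frequency 20.
stat, df, p_value = gof_test([18, 22, 20, 25, 15], [20.0] * 5)
print(round(stat, 3), df, round(p_value, 4))
```

H0 is rejected when the p-value is less than or equal to the chosen significance level.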
15.3 Uniform Goodness-of-Fit Test
LO15-4: Perform a goodness-of-fit (GOF) test for a uniform distribution.
Uniform Distribution
• The uniform goodness-of-fit test is a special case of the multinomial in which every value has the same chance of occurrence.
• The chi-square test for a uniform distribution compares all c groups simultaneously.
• The hypotheses are:
  H0: p1 = p2 = … = pc = 1/c
  H1: Not all pj are equal
• The test can be performed on data that are already tabulated into groups.
• Calculate the expected frequency ej for each cell.
• The degrees of freedom are d.f. = c – 1 since there are no parameters for the uniform distribution.
• Obtain the critical value χ²α from Appendix E for the desired level of significance α.
• The p-value can be obtained from Excel.
• Reject H0 if p-value ≤ α.
Uniform GOF Test: Raw Data
• First form c bins of equal width and create a frequency distribution.
• Calculate the observed frequency fj for each bin.
• Define ej = n/c.
• Perform the chi-square calculations.
• The degrees of freedom are d.f. = c – 1 since there are no parameters for the uniform distribution.
• Obtain the critical value from Appendix E for a given significance level α and make the decision.
• Maximize the test’s power by defining bin width as (xmax – xmin)/c. (As a result, the expected frequencies will be as large as possible.)
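The raw-data procedure might look like this in Python. The sketch assumes the c equal-width bins span the observed data range, so the bin width is (xmax – xmin)/c. The function name `uniform_gof_raw` is made up, and the data are hypothetical and far too few for a real test.

```python
def uniform_gof_raw(data, c):
    """Bin raw data into c equal-width bins; return (observed, statistic, d.f.)."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / c                      # assumed bin width: (xmax - xmin) / c
    observed = [0] * c
    for x in data:
        j = min(int((x - lo) / width), c - 1)  # put x_max into the last bin
        observed[j] += 1
    e = len(data) / c                          # e_j = n / c for every bin
    stat = sum((f - e) ** 2 / e for f in observed)
    return observed, stat, c - 1


# Hypothetical raw data (illustration only; n is too small for a real test).
data = [0.0, 0.5, 1.0, 2.5, 3.0, 4.5, 5.0, 6.5, 7.0, 8.0, 9.5, 10.0]
observed, stat, df = uniform_gof_raw(data, c=5)
print(observed, round(stat, 3), df)
```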
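• Calculate the mean and standard deviation of the uniform distribution as:
  $\mu = \frac{a + b}{2}, \qquad \sigma = \sqrt{\frac{(b - a)^2}{12}}$
  where a and b are the lower and upper limits of the hypothesized uniform distribution.
• If the data are not skewed and the sample size is large (n > 30), then the sample mean is approximately normally distributed.
• So, test the hypothesized uniform mean using
  $z_{\text{calc}} = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$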
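A minimal sketch of this z test follows. It uses the standard uniform-distribution results (mean (a + b)/2 and standard deviation (b – a)/√12). The function name `uniform_mean_z`, the limits a and b, and the data are hypothetical.

```python
import math


def uniform_mean_z(sample, a, b):
    """z statistic for testing the mean of a hypothesized uniform(a, b) population."""
    n = len(sample)
    x_bar = sum(sample) / n
    mu = (a + b) / 2                      # mean of uniform(a, b)
    sigma = math.sqrt((b - a) ** 2 / 12)  # standard deviation of uniform(a, b)
    return (x_bar - mu) / (sigma / math.sqrt(n))


# Hypothetical sample of n = 36 values, tested against uniform(0, 40).
sample = list(range(1, 37))
z = uniform_mean_z(sample, a=0, b=40)
print(round(z, 4))
```

The result is compared with the usual two-tailed standard normal critical values, for example ±1.96 at a significance level of .05.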
15.4 Poisson Goodness-of-Fit Test
LO15-5: Explain the GOF test for a Poisson distribution.
Poisson Data-Generating Situations
• In a Poisson distribution model, X represents the number of events per unit of time or space.
• X is a discrete nonnegative integer (X = 0, 1, 2, …).
• Event arrivals must be independent of each other.
• Sometimes called a model of rare events because X typically has a small mean.
Poisson Goodness-of-Fit Test
• The mean λ is the only parameter.
• If λ is unknown, it must be estimated from the sample.
• Use the estimated λ to find the Poisson probability P(X) for each value of X.
• Compute the expected frequencies.
• Perform the chi-square calculations.
• Make the decision.
• You may need to combine classes until expected frequencies become large enough for the test (at least until ej > 2).
Poisson GOF Test: Tabulated Data
• Calculate the sample mean as:
  $\hat{\lambda} = \frac{\sum_{j=1}^{c} x_j f_j}{n}$
• Using this estimated mean, calculate the Poisson probabilities either by using the Poisson formula P(x) = (λ^x e^(–λ))/x! or Excel.
• For c classes with m = 1 parameter estimated, the degrees of freedom are d.f. = c – m – 1.
• Obtain the critical value for a given α from Appendix E.
• Make the decision.
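A sketch of the tabulated-data procedure in Python follows. The counts are hypothetical and the function name `poisson_gof` is made up. Treating the top class as open-ended ("4 or more"), while estimating λ from its lower bound, is a simplifying assumption of this sketch.

```python
import math


def poisson_gof(x_values, freqs):
    """Poisson GOF on tabulated data; the last class is treated as 'x or more'."""
    n = sum(freqs)
    lam = sum(x * f for x, f in zip(x_values, freqs)) / n  # estimated mean
    probs = [lam ** x * math.exp(-lam) / math.factorial(x) for x in x_values[:-1]]
    probs.append(1.0 - sum(probs))                         # open-ended top class
    expected = [n * p for p in probs]
    stat = sum((f - e) ** 2 / e for f, e in zip(freqs, expected))
    df = len(freqs) - 1 - 1                                # c - m - 1 with m = 1
    return lam, expected, stat, df


# Hypothetical tabulated data: number of events X and observed frequency.
x_values = [0, 1, 2, 3, 4]
freqs = [30, 40, 20, 8, 2]
lam, expected, stat, df = poisson_gof(x_values, freqs)
print(round(lam, 2), [round(e, 2) for e in expected], round(stat, 3), df)
```

If any expected frequency is too small, adjacent classes would be combined before computing the statistic, which also reduces c and therefore the degrees of freedom.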
15.5 Normal Chi-Square Goodness-of-Fit Test
LO15-6: Use computer software to perform a chi-square GOF test for normality.
Normal Data Generating Situations
• Two parameters, the mean μ and the standard deviation σ, fully describe the normal distribution.
• Unless μ and σ are known a priori, they must be estimated from a sample.
• Using these statistics, the chi-square goodness-of-fit test can be used.
Method 1: Standardizing the Data
• Transform the sample observations x1, x2, …, xn into standardized values:
  $z_i = \frac{x_i - \bar{x}}{s}$
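Standardizing with the sample mean and sample standard deviation can be sketched as follows. The function name `standardize` and the data are hypothetical.

```python
import statistics


def standardize(sample):
    """Return z_i = (x_i - x_bar) / s using the sample mean and standard deviation."""
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)
    return [(x - x_bar) / s for x in sample]


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations
z_values = standardize(sample)
print([round(z, 3) for z in z_values])
```

The standardized values have mean 0 and standard deviation 1, so they can be compared directly with standard normal bin limits.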
Method 2: Equal Bin Widths
• To obtain equal-width bins, divide the exact data range into c groups of equal width.
• Step 1: Count the sample observations in each bin to get observed frequencies fj.
• Step 2: Convert the bin limits xj into standardized z-values by using the formula $z_j = (x_j - \bar{x})/s$.
• Step 3: Find the normal area within each bin assuming a normal distribution.
• Step 4: Find expected frequencies ej by multiplying each normal area by the sample size n.
• Classes may need to be collapsed from the ends inward to enlarge expected frequencies.
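Steps 1 through 4 can be sketched with Python's `statistics.NormalDist`. Two assumptions of this sketch go beyond the slide: the first and last bins are treated as open-ended so that the normal areas sum to 1, and the mean and standard deviation are estimated from the sample. The function name and the data are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def normal_gof_equal_width(sample, c):
    """Return (observed, expected) frequencies for c equal-width bins."""
    n = len(sample)
    x_bar, s = mean(sample), stdev(sample)
    lo, hi = min(sample), max(sample)
    width = (hi - lo) / c
    # Step 1: observed frequencies.
    observed = [0] * c
    for x in sample:
        j = min(int((x - lo) / width), c - 1)  # put x_max into the last bin
        observed[j] += 1
    # Step 2: standardize the interior bin limits.
    z_limits = [(lo + j * width - x_bar) / s for j in range(1, c)]
    # Step 3: normal areas, treating the two end bins as open-ended (assumption).
    std_normal = NormalDist()
    cumulative = [0.0] + [std_normal.cdf(z) for z in z_limits] + [1.0]
    areas = [cumulative[j + 1] - cumulative[j] for j in range(c)]
    # Step 4: expected frequencies.
    expected = [n * area for area in areas]
    return observed, expected


# Hypothetical sample.
sample = [52, 55, 57, 58, 60, 61, 61, 63, 64, 65, 66, 68, 70, 73, 77]
observed, expected = normal_gof_equal_width(sample, c=5)
print(observed, [round(e, 2) for e in expected])
```

With two parameters estimated (m = 2), the degrees of freedom for the chi-square comparison would be c – 3.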
Method 3: Equal Expected Frequencies
• Define histogram bins in such a way that an equal number of observations would be expected within each bin under the null hypothesis.
• Define bin limits so that ej = n/c.
• A normal area of 1/c in each of the c bins is desired.
• The first and last classes must be open-ended for a normal distribution, so to define c bins, we need c – 1 cut-points.
• The upper limit of bin j can be found directly by using Excel.
• Alternatively, find zj for bin j using Excel and then calculate the upper limit for bin j as x̄ + zj s.
• Once the bins are defined, count the observations fj within each bin and compare them with the expected frequencies ej = n/c.
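The slide uses Excel to find the cut-points. An equivalent sketch in Python uses the inverse standard normal CDF, `NormalDist().inv_cdf`. The function name and the values for the sample mean, standard deviation, and number of bins are hypothetical.

```python
from statistics import NormalDist


def equal_frequency_cutpoints(x_bar, s, c):
    """Return the c - 1 upper limits x_bar + z_j * s that give normal area 1/c per bin."""
    std_normal = NormalDist()
    return [x_bar + std_normal.inv_cdf(j / c) * s for j in range(1, c)]


# Hypothetical sample statistics and number of bins.
cutpoints = equal_frequency_cutpoints(x_bar=100.0, s=15.0, c=4)
print([round(x, 2) for x in cutpoints])
```

The first bin runs from minus infinity to the first cut-point and the last bin from the final cut-point to plus infinity, which matches the open-ended end classes described above.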
15.6 ECDF Tests
LO15-7: State advantages of ECDF tests as compared to chi-square GOF tests.
• There are many alternatives to the chi-square test based on the Empirical Cumulative Distribution Function (ECDF).
• The Kolmogorov-Smirnov (K-S) test uses the largest absolute difference between the actual and expected cumulative relative frequency of the n data values.
• The K-S test is not recommended for grouped data.
• The K-S test assumes that no parameters are estimated.
• If parameters are estimated, use a Lilliefors test.
• Both of these tests are done by computer.
• The Anderson-Darling (A-D) test is widely used to detect non-normality because of its power.
• The A-D test is based on a probability plot.
• When the data fit the hypothesized distribution closely, the probability plot will be close to a straight line.
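The K-S statistic itself is simple to compute. The sketch below finds the largest absolute gap between the ECDF and a fully specified hypothesized CDF (no parameters estimated). The function name, the data, and the choice of a uniform(0, 1) null distribution are hypothetical, and no critical values or p-values are computed here.

```python
def ks_statistic(sample, cdf):
    """Largest absolute difference between the ECDF and a hypothesized CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The ECDF jumps from (i - 1)/n to i/n at x, so check both sides of the step.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


# Hypothetical data tested against a uniform(0, 1) distribution, whose CDF is F(x) = x.
d = ks_statistic([0.1, 0.4, 0.7], cdf=lambda x: x)
print(round(d, 4))
```

Because the statistic uses every individual data value, it avoids the binning choices that the chi-square GOF test requires.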