Chi Square Statistic


Stat 31, Section 1, Last Time
• Inference for Proportions
  – Hypothesis Tests
• 2 Sample Proportions Inference
  – Skipped
• 2-way Tables
  – Sliced populations in 2 different ways
  – Look for independence of factors
  – Chi Square Hypothesis test
Reading In Textbook
Approximate Reading for Today’s Material:
Pages 582-611,
634-667
Approximate Reading for Next Class:
Pages 634-667
Midterm I - Results
Preliminary comments:
• Circled numbers are points taken off
• Total for each problem in brackets
• Points evenly divided among parts
• Page total in lower right corner
• Check those sum to total on front
• Overall score out of 100 points
Midterm I - Results
Interpretation of Scores:
• Too early for letter grades
• These will change a lot:
  – Some with good grades will relax
  – Some with bad grades will wake up
• Don’t believe “A & C” average to “B”
Midterm I - Results
Interpretation of Scores:
• Recall large variation over 2 midterms
  – No exception this semester
Midterm I - Results
Compare Midterm Scores
[Scatterplot of Midterm II score against Midterm I score, both axes running from 40 to 100]
Midterm I - Results
Compare Midterm Scores
[Scatterplot of Midterm II score against Midterm I score, both axes running from 40 to 100, with the Line of Equal Scores drawn in]
Midterm I - Results
Compare Midterm Scores
[Scatterplot of Midterm II score against Midterm I score, both axes running from 40 to 100, with two annotations: “Some have dramatically improved” and “Others have been distracted by other things”]
Midterm I - Results
Interpretation of Scores:
• Recall large variation over 2 midterms
  – No exception this semester
• Get better info from 2 test Total
  – So will report answers in those terms
Midterm I - Results
Histogram of Results: Midterm I + II, Total Score
[Histogram: Frequency (vertical axis) against Total Score (horizontal axis)]
Midterm I - Results
Interpretation of Scores (2 Test total):
170 – 200   A
155 – 168   B
131 – 154   C
120 – 129   D
up to 119   F
Midterm I - Results
Where do we go from here?
• I see 2 rather different groups…
• Which are you in?
• What can you do?
• Most important: It is still early days…
Chapter 9: Two-Way Tables
Main idea: Divide up populations in two ways
– E.g. 1: Age & Sex
– E.g. 2: Education & Income
• Typical Major Question: How do divisions relate?
• Are the divisions independent?
  – Similar idea to independence in probability theory
  – Statistical Inference?
Two-Way Tables
Big Question: Is there a relationship?
[Bar chart: Class Example 31 - Counts. Number of bottles purchased (vertical axis, 0 to 45), by wine type (French, Italian, Other) and music (None, French, Italian)]
Note: tallest bars
French Wine ↔ French Music
Italian Wine ↔ Italian Music
Other Wine ↔ No Music
Suggests there is a relationship
Two-Way Tables
General Directions:
• Can we make this precise?
• Could it happen just by chance?
  – Really: how likely to be a chance effect?
• Or is it statistically significant?
  – I.e. music and wine purchase are related?
Two-Way Tables
An alternate view:
Replace counts by proportions (or %-ages)
Class Example 31 (Wine & Music), Part 2
http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg31.xls
Advantage:
May be more interpretable
Drawback:
No real difference (just rescaled)
Two-Way Tables
Testing for independence:
What is it?
From probability theory:
P{A | B} = P{A}
i.e. Chances of A, when B is known, are
same as when B is unknown
Table version of this idea?
Independence in 2-Way Tables
Counts analog of P{A|B}???
Equivalent condition for independence is:
P{A & B} = P{A} × P{B}
So for counts, look for:
Table Prop’n = Row Marg’l Prop’n x Col’n Marg’l Prop’n
i.e. Entry = Product of Marginals
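
The "Entry = Product of Marginals" rule can be sketched in a few lines of Python. The counts below are illustrative assumptions (the slides only link to the class spreadsheet), and the variable names are hypothetical.

```python
# Illustrative counts only (assumed, not read from the class spreadsheet).
# Rows: wine type; columns: music playing.
wines = ["French", "Italian", "Other"]
music = ["None", "French", "Italian"]
observed = [
    [30, 39, 30],
    [11, 1, 19],
    [43, 35, 35],
]

total = sum(sum(row) for row in observed)

# Marginal proportions: how the table splits by row alone and by column alone.
row_prop = [sum(row) / total for row in observed]
col_prop = [sum(col) / total for col in zip(*observed)]

# Under independence: table proportion = row marginal x column marginal.
indep_prop = [[rp * cp for cp in col_prop] for rp in row_prop]

# Rescale the proportions back to counts to get the "expected" table.
expected = [[p * total for p in row] for row in indep_prop]

for wine, row in zip(wines, expected):
    print(wine, [round(e, 1) for e in row])
```

The expected table keeps the same row and column totals as the observed one; only the way counts are spread across cells changes.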
Independence in 2-Way Tables
Visualize Product of Marginals for:
Class Example 31 (Wine & Music), Part 4
http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg31.xls
Shows same structure as marginals
But no match between music & wine
Good null hypothesis
[Bar chart: Class Example 31 - Independent Model. Vertical axis (labeled “# Bottles purchased”) from 0 to 0.18, by wine type (French, Italian, Other) and music (None, French, Italian)]
Independence in 2-Way Tables
Approach:
• Measure “distance between tables”
  – Use Chi Square Statistic
  – Has known probability distribution when table is independent
• Assess significance using P-value
  – Set up as: H0: Indep.   HA: Dependent
  – P-value = P{what saw or m.c. | Indep.}
Independence in 2-Way Tables
Chi-square statistic:
• Based on:
  Observed Counts (raw data), Obs_i
  Expected Counts (under indep.), Exp_i
•
$$ X^2 = \sum_{\text{cells } i} \frac{(\mathrm{Obs}_i - \mathrm{Exp}_i)^2}{\mathrm{Exp}_i} $$
Notes:
– Small for only random variation
– Large for significant departure from indep.
Independence in 2-Way Tables
Chi-square statistic calculation:
$$ X^2 = \sum_{\text{cells } i} \frac{(\mathrm{Obs}_i - \mathrm{Exp}_i)^2}{\mathrm{Exp}_i} $$
Class example 31, Part 5:
http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg31.xls
– Calculate term by term
– Then sum
– Is X2 = 18.3 “big” or “small”?
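
The term-by-term calculation can be sketched as follows. This is a minimal illustration, not the class spreadsheet: the function name `chi_square_statistic` is hypothetical, and the counts are assumed values chosen so that the result lands near the 18.3 quoted on the slide.

```python
def chi_square_statistic(observed):
    """Sum of (Obs - Exp)^2 / Exp over all cells of a two-way table."""
    total = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    x2 = 0.0
    for r, row in enumerate(observed):
        for c, obs in enumerate(row):
            # Expected count under independence: product of marginals.
            exp = row_totals[r] * col_totals[c] / total
            x2 += (obs - exp) ** 2 / exp
    return x2


# Illustrative counts only (assumed). Rows: wine type; columns: music.
observed = [
    [30, 39, 30],
    [11, 1, 19],
    [43, 35, 35],
]
x2 = chi_square_statistic(observed)
print(round(x2, 2))
```

A table whose entries already equal the product of marginals gives a statistic of 0, which matches the "small for only random variation" note.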
Independence in 2-Way Tables
H0 distribution of the X2 statistic:
“Chi Squared” (another Greek letter, χ²)
Parameter: “degrees of freedom”
(similar to T distribution)
Excel Computation:
– CHIDIST (given cutoff, find area = prob.)
– CHIINV (given prob = area, find cutoff)
Independence in 2-Way Tables
For test of independence, use:
degrees of freedom = (#rows – 1) x (#cols – 1)
E.g. Wine and Music:
d.f. = (3 – 1) x (3 – 1) = 4
Independence in 2-Way Tables
E.g. Wine and Music:
P-value = P{Observed X2 or m.c. | Indep.}
= P{X2 = 18.3 or m.c. | Indep.}
= P{X2 >= 18.3 | d.f. = 4}
= 0.0011
Also see Class Example 31, Part 5
http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg31.xls
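
Outside Excel, the same upper-tail area can be checked with a short Python sketch. It relies on a closed form that holds only when the degrees of freedom are even (which covers d.f. = 4 here), so it is not a general replacement for CHIDIST. The function name `chi_square_upper_tail` is hypothetical.

```python
import math


def chi_square_upper_tail(x, df):
    """P{X2 >= x} for a chi-square distribution with an even number of d.f.

    Uses the closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!,
    which is valid only for even, positive degrees of freedom.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers even, positive d.f.")
    half = x / 2.0
    return math.exp(-half) * sum(
        half ** k / math.factorial(k) for k in range(df // 2)
    )


rows, cols = 3, 3                # wine types x music types
df = (rows - 1) * (cols - 1)     # degrees of freedom for the independence test
p_value = chi_square_upper_tail(18.3, df)
print(df, round(p_value, 4))
```

For odd degrees of freedom a statistics library would be needed instead.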
Independence in 2-Way Tables
E.g. Wine and Music:
P-value = 0.001
Yes-No: Very strong evidence against independence, conclude music has a statistically significant effect
Gray-Level: Also very strong evidence
Independence in 2-Way Tables
Excel shortcut:
CHITEST
• Avoids the (obs-exp)^2 / exp calculation
• Automatically computes d.f.
• Returns P-value
Independence in 2-Way Tables
HW:
9.27
9.29
And Now for Something
Completely Different
A statistics joke, from:
GARY C. RAMSEYER'S INTERNET GALLERY
OF STATISTICS JOKES
http://www.ilstu.edu/~gcramsey/Gallery.html
And Now for Something
Completely Different
A somewhat advanced society has figured
how to package basic knowledge in pill
form.
A student, needing some learning, goes to
the pharmacy and asks what kind of
knowledge pills are available.
And Now for Something
Completely Different
The pharmacist says "Here's a pill for
English literature."
The student takes the pill and swallows it
and has new knowledge about English
literature!
And Now for Something
Completely Different
"What else do you have?" asks the student.
"Well, I have pills for art history, biology, and
world history," replies the pharmacist.
The student asks for these, and swallows
them and has new knowledge about
those subjects!
And Now for Something
Completely Different
Then the student asks, "Do you have a pill
for statistics?"
The pharmacist says "Wait just a moment",
and goes back into the storeroom and
brings back a whopper of a pill that is
about twice the size of a jawbreaker
and plunks it on the counter.
"I have to take that huge pill for statistics?"
inquires the student.
And Now for Something
Completely Different
The pharmacist understandingly nods his
head and replies:
"Well, you know statistics always was a little
hard to swallow."
Caution about 2-Way Tables
Simpson’s Paradox:
Aggregation into tables can be dangerous
E.g. from:
http://www.math.sfu.ca/~cschwarz/Stat-301/Handouts/node50.html
Study Admission rates to professional
programs, look for sex bias….
Simpson’s Paradox
Admissions to Business School:
          Male   Female
Admit     480    180
Deny      120     20
% Males admitted = 480 / (480 + 120) * 100% = 80%
% Females admitted = 180 / (180 + 20) * 100% = 90%
Better for females???
Simpson’s Paradox
Admissions to Law School:
          Male   Female
Admit      10    100
Deny       90    200
% Males admitted = 10 / (10 + 90) * 100% = 10%
% Females admitted = 100 / (100 + 200) * 100% = 33.3%
Better for females???
Simpson’s Paradox
Combined Admissions:
          Male   Female
Admit     490    280
Deny      210    220
% Males admitted = 490 / (490 + 210) * 100% = 70%
% Females admitted = 280 / (280 + 220) * 100% = 56%
Better for males???
Simpson’s Paradox
How can the rate be higher for females in each school, yet higher for males overall?
Reason: depends on relative proportions
Notes:
• In Business (male applicants dominant), easier to get in (660 / 800)
• In Law (female applicants dominant), much harder to get in (110 / 400)
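
The reversal can be reproduced directly from the two school tables. The sketch below uses the counts from the slides; the dictionary layout and the helper `admit_rate` are illustrative choices.

```python
# (admitted, denied) counts taken from the Business and Law tables.
admissions = {
    "Business": {"Male": (480, 120), "Female": (180, 20)},
    "Law": {"Male": (10, 90), "Female": (100, 200)},
}


def admit_rate(admitted, denied):
    """Fraction of applicants admitted."""
    return admitted / (admitted + denied)


# Rates within each school: females do better in both.
for school, by_sex in admissions.items():
    for sex, (adm, den) in by_sex.items():
        print(school, sex, f"{admit_rate(adm, den):.1%}")

# Aggregate over schools: the comparison flips.
combined = {}
for sex in ("Male", "Female"):
    adm = sum(admissions[school][sex][0] for school in admissions)
    den = sum(admissions[school][sex][1] for school in admissions)
    combined[sex] = admit_rate(adm, den)
    print("Combined", sex, f"{combined[sex]:.1%}")
```

Most male applicants went to the school that admits most people, and most female applicants went to the school that rejects most people, so the combined rates are weighted very differently.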
Simpson’s Paradox
Lesson:
Must be very careful about aggregation
Worse: may not be aware that aggregation
has been done….
Recall terminology:
Lurking Variable
Can hide in aggregation…
Could be used for cheating…
Simpson’s Paradox
HW:
9.15
9.17
Inference for Regression
Chapter 10
Recall:
• Scatterplots
• Fitting Lines to Data
Now study statistical inference associated with fit lines
E.g. When is slope statistically significant?
Recall Scatterplot
Toy Scatterplot, Separate Points
For data (x,y): (1,2), (3,1), (-1,0), (2,-1)
View by plot:
[Scatterplot of the four labeled points, x on the horizontal axis and y on the vertical axis]
Recall Linear Regression
Idea:
Fit a line to data in a scatterplot
• To learn about “basic structure”
• To “model data”
• To provide “prediction of new values”
Recall Linear Regression
Recall some basic geometry:
A line is described by an equation:
y = mx + b
m = slope
b = y intercept
Varying m & b gives a “family of lines”,
Indexed by “parameters” m & b (or a & b)
Recall Linear Regression
Approach:
Given a scatterplot of data:
$(x_1, y_1), \ldots, (x_n, y_n)$
Find a & b (i.e. choose a line)
to “best fit the data”
Recall Linear Regression
Given a line, $y = bx + a$, “indexed” by b & a
[Figure: data points $(x_1, y_1)$, $(x_2, y_2)$, $(x_3, y_3)$ plotted around the line]
Define “residuals” = “data Y” – “Y on line” = $y_i - (b x_i + a)$
Now choose b & a to make these “small”
Recall Linear Regression
Excellent Demo, by Charles Stanton, CSUSB
http://www.math.csusb.edu/faculty/stanton/m262/regress/regress.html
More JAVA Demos, by David Lane at Rice U.
http://www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/index.html
http://www.ruf.rice.edu/~lane/stat_sim/comp_r/index.html
Recall Linear Regression
Make residuals non-negative, by squaring
Least Squares: adjust b & a to minimize the “Sum of Squared Errors”
$$ SSE = \sum_{i=1}^{n} \left( y_i - (b x_i + a) \right)^2 $$
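
A minimal sketch of the least squares fit, using the four toy scatterplot points from the earlier slide. The closed-form expressions for b and a are the standard minimizers of SSE rather than something stated on the slides, and the function names are hypothetical.

```python
def least_squares(xs, ys):
    """Return (b, a): slope and intercept minimizing the sum of squared errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # y-intercept
    return b, a


def sse(xs, ys, b, a):
    """Sum of squared residuals y_i - (b*x_i + a)."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))


# Toy points from the scatterplot slide: (1,2), (3,1), (-1,0), (2,-1)
xs = [1, 3, -1, 2]
ys = [2, 1, 0, -1]

b, a = least_squares(xs, ys)
print("slope b =", round(b, 4), "intercept a =", round(a, 4))
print("SSE at fit =", round(sse(xs, ys, b, a), 4))
```

Nudging b or a away from the fitted values and recomputing `sse` shows the sum of squares going up, which is the sense in which this line is the "best fit".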
Least Squares in Excel
Computation:
1. INTERCEPT (computes y-intercept a)
2. SLOPE (computes slope b)
Revisit Class Example 14
http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg14.xls
HW: 10.17a
Inference for Regression
Goal: develop
• Hypothesis Tests and Confidence Intervals
• For slope & intercept parameters, a & b
• Also study prediction
Inference for Regression
Idea: do statistical inference on:
– Slope a
– Intercept b
Model: $Y_i = a X_i + b + e_i$
Assume: $e_i$ are random, independent and $N(0, \sigma_e)$
Inference for Regression
Viewpoint:
Data generated as:
[Figure: the line y = ax + b, with each $Y_i$ chosen from a distribution centered on the line at $X_i$]
Note: a and b are “parameters”
Inference for Regression
Parameters a and b determine the underlying model (distribution)
Estimate with the Least Squares Estimates:
â and b̂
(Using SLOPE and INTERCEPT in Excel,
based on data)
Inference for Regression
Distributions of â and b̂ ?
Under the above assumptions, the sampling distributions are:
$$ \hat{a} \sim N(a, \sigma_a), \qquad \hat{b} \sim N(b, \sigma_b) $$
• Centerpoints are right (unbiased)
• Spreads are more complicated
Inference for Regression
Formula for SD of â :
$$ SD(\hat{a}) = \sigma_a = \frac{\sigma_e}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}} $$
• Big (small) for $\sigma_e$ big (small, resp.)
  – Accurate data ⇒ Accurate est. of slope
• Small for x’s more spread out
  – Data more spread ⇒ More accurate
• Small for more data
  – More data ⇒ More accuracy
Inference for Regression
Formula for SD of b̂ :
$$ SD(\hat{b}) = \sigma_b = \sigma_e \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}} $$
• Big (small) for $\sigma_e$ big (small, resp.)
  – Accurate data ⇒ Accurate est. of intercept
• Smaller for $\bar{x} \approx 0$
  – Centered data ⇒ More accurate intercept
• Smaller for more data
  – More data ⇒ More accuracy
Inference for Regression
One more detail:
Need to estimate $\sigma_e$ using data
For this use:
$$ s_e = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{a} x_i - \hat{b})^2}{n - 2}} $$
• Similar to earlier sd estimate, s
• Except variation is about fit line
• $n - 2$ is similar to $n - 1$ from before
Inference for Regression
Now for Probability Distributions,
Since we are estimating $\sigma_e$ by $s_e$
Use TDIST and TINV
With degrees of freedom = $n - 2$
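
The pieces above can be put together in a short Python sketch: fit the line, estimate $\sigma_e$ by $s_e$, and plug $s_e$ into the SD formulas to get standard errors. The data values are made up for illustration, the function name is hypothetical, and the last step (turning the t statistic into a P-value with n − 2 degrees of freedom) is left to TDIST or a statistics library.

```python
import math


def regression_standard_errors(xs, ys):
    """Least squares fit of y = a*x + b, plus s_e and standard errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    a_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx  # slope
    b_hat = y_bar - a_hat * x_bar                                         # intercept
    # Residual standard deviation: variation about the fit line, n - 2 in the denominator.
    s_e = math.sqrt(
        sum((y - a_hat * x - b_hat) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    )
    se_a = s_e / math.sqrt(sxx)                       # SD formula for slope, with s_e for sigma_e
    se_b = s_e * math.sqrt(1 / n + x_bar ** 2 / sxx)  # SD formula for intercept, with s_e for sigma_e
    return {"a_hat": a_hat, "b_hat": b_hat, "s_e": s_e,
            "se_a": se_a, "se_b": se_b, "df": n - 2}


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

fit = regression_standard_errors(xs, ys)
t_slope = fit["a_hat"] / fit["se_a"]  # t statistic for H0: slope = 0
print(fit)
print("t for slope:", round(t_slope, 2), "with d.f. =", fit["df"])
```

Spreading the x values out or adding more data makes `sxx` larger, so `se_a` shrinks, in line with the bullet points on the SD slides.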
Inference for Regression
Convenient Packaged Analysis in Excel:
Tools → Data Analysis → Regression
Illustrate application using:
Class Example 27,
Old Text Problem 8.6 (now 10.12)