Transcript Stat New

What we won’t cover
• Lots of maths!!!
• Students coming to statistics support usually
want help with using SPSS, choosing the right
analysis and interpreting output
• They often find maths scary and so you need
to think of ways of explaining without the
maths
Data and variables
DATA: the answers to questions or measurements from the experiment
VARIABLE: a measurement which varies between subjects, e.g. height or gender
One row per subject; one variable per column
Data types
Variables fall into two broad types:
• Scale: measurements / numerical / count data
• Categorical: appear as categories (e.g. tick boxes on questionnaires)
Data types
Variables
• Scale
  – Continuous: measurements, takes any value
  – Discrete: counts / integers
• Categorical
  – Ordinal: obvious order
  – Nominal: no meaningful order
Populations and samples
• Taking a sample from a population
Sample data ‘represents’ the whole population
Point estimation
Sample data is used to estimate parameters of a population
Statistics are calculated using sample data.
Parameters are the characteristics of population data
The sample mean $\bar{x}$ estimates the population mean $\mu$; the sample SD $s$ estimates the population SD $\sigma$.
“outliers”
• minority cases, so different from the
majority that they merit separate
consideration
– are they errors?
– are they indicative of a different pattern?
• think about possible outliers with care, but
beware of mechanical treatments…
• significance of outliers depends on your
research interests
summaries of distributions
• graphic vs. numeric
– graphic may be better for visualization
– numeric are better for statistical/inferential
purposes
general characteristics
• kurtosis [“peakedness”]: a ‘leptokurtic’ distribution is strongly peaked, a ‘platykurtic’ distribution is flatter
[Figure: density curves of leptokurtic and platykurtic distributions]
• skew (skewness): right (positive) skew has a long right tail, left (negative) skew has a long left tail
[Figure: density curves of right-skewed and left-skewed distributions]
Descriptive Statistics
An Illustration:
Which Group is Smarter?
Class A--IQs of 13 Students:
102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110
Class B--IQs of 13 Students:
127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109
Each individual may be different. If you try to understand a group by remembering the qualities
of each member, you become overwhelmed and fail to understand the group.
Descriptive Statistics
Which group is smarter now?
Class A--Average IQ: 110.54
Class B--Average IQ: 110.23
They’re roughly the same!
With a summary descriptive statistic, it is much easier to
answer our question.
Descriptive Statistics
Types of descriptive statistics:
• Organize Data
– Tables
– Graphs
• Summarize Data
– Central Tendency
– Variation
Descriptive Statistics
Types of descriptive statistics:
• Organize Data
– Tables
• Frequency Distributions
– Graphs
• Bar Chart or Histogram
• Frequency Polygon
Frequency Distribution
Frequency Distribution of IQ for Two Classes
IQ        Frequency
82.00     1
87.00     1
89.00     1
93.00     2
96.00     1
97.00     1
98.00     1
102.00    1
103.00    1
105.00    1
106.00    1
107.00    1
109.00    1
111.00    1
115.00    1
119.00    1
120.00    1
127.00    1
128.00    1
131.00    2
140.00    1
162.00    1
Total     24
Descriptive Statistics
Summarizing Data:
– Central Tendency (or Groups’ “Middle Values”)
• Mean
• Median
• Mode
– Variation (or Summary of Differences Within Groups)
• Range
• Interquartile Range
• Variance
• Standard Deviation
Mean
Most commonly called the “average.”
Add up the values for each case and divide by the total
number of cases.
$\bar{Y} = \dfrac{Y_1 + Y_2 + \dots + Y_n}{n} = \dfrac{\sum_i Y_i}{n}$
Mean
What’s up with all those symbols, man?
$\bar{Y} = \dfrac{Y_1 + Y_2 + \dots + Y_n}{n} = \dfrac{\sum_i Y_i}{n}$
Some Symbolic Conventions in this Class:
• Y = your variable (could be X or Q or even “Glitter”)
• “-bar” or line over symbol of your variable = mean of that variable
• Y1 = first case’s value on variable Y
• “. . .” = ellipsis = continue sequentially
• Yn = last case’s value on variable Y
• n = number of cases in your sample
• Σ = Greek letter “sigma” = sum or add up what follows
• i = a typical case or each case in the sample (1 through n)
Mean
Class A--IQs of 13 Students:
102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110
$\sum Y_i = 1437$, so $\bar{Y}_A = \dfrac{\sum Y_i}{n} = \dfrac{1437}{13} = 110.54$

Class B--IQs of 13 Students:
127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109
$\sum Y_i = 1433$, so $\bar{Y}_B = \dfrac{\sum Y_i}{n} = \dfrac{1433}{13} = 110.23$
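The arithmetic can be checked with a minimal Python sketch (the class lists are the ones from the slides above):

```python
# Minimal sketch: reproduce the Class A and Class B means from the slides.
from statistics import mean

class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

print(sum(class_a), round(mean(class_a), 2))  # 1437 110.54
print(sum(class_b), round(mean(class_b), 2))  # 1433 110.23
```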
Mean
1. Means can be badly affected by outliers
(data points with extreme values unlike the
rest)
2. Outliers can make the mean a bad measure
of central tendency or common experience
[Figure: income in the U.S.; “all of us” cluster at the low end, the mean is pulled upward, and Bill Gates sits far to the right as an outlier]
Median
The middle value when a variable’s values are ranked in
order; the point that divides a distribution into two equal
halves.
When data are listed in order, the median is the point at
which 50% of the cases are above and 50% below it.
The 50th percentile.
Median
Class A--IQs of 13 Students (ranked):
89, 93, 97, 98, 102, 106, 109, 110, 115, 119, 128, 131, 140
Median = 109 (six cases above, six below)
Median
If the first student were to drop out of Class A, there would be a new median. With 12 students remaining, the median is the average of the two middle (6th and 7th ranked) scores:
Median = (109 + 110) / 2 = 219 / 2 = 109.5
(six cases above, six below)
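A minimal sketch of the median calculations, using Python's statistics module and the Class A scores above (dropping the first listed student for the second call):

```python
# Minimal sketch: medians for Class A, with 13 scores and after one drops out.
from statistics import median

class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]

print(median(class_a))      # 109   (middle of 13 ranked scores)
print(median(class_a[1:]))  # 109.5 (first listed student removed; average of 6th and 7th)
```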
Median
1. The median is unaffected by outliers, making
it a better measure of central tendency,
better describing the “typical person” than
the mean when data are skewed.
[Figure: skewed income distribution; the median sits among “all of us” and is not pulled toward the outlier (Bill Gates)]
Median
2. If the recorded values for a variable form a
symmetric distribution, the median and
mean are identical.
3. In skewed data, the mean lies further toward
the skew than the median.
[Figure: in a symmetric distribution the mean and median coincide; in a skewed distribution the mean lies further toward the tail than the median]
Median
The middle score or measurement in a set of ranked scores
or measurements; the point that divides a distribution
into two equal halves.
Data are listed in order—the median is the point at which
50% of the cases are above and 50% below.
The 50th percentile.
Mode
The most common data point is called the
mode.
The combined IQ scores for Classes A & B:
80, 87, 89, 93, 93, 96, 97, 98, 102, 103, 105, 106, 109, 109, 109, 110, 111, 115, 119, 120, 127, 128, 131, 131, 140, 162
The mode is 109, which occurs three times. A la mode!!
BTW, it is possible to have more than one mode!
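A minimal sketch for finding the mode(s) of the combined scores; statistics.multimode assumes Python 3.8 or later:

```python
# Minimal sketch: find the mode(s) of the combined IQ scores.
from statistics import multimode

combined = [80, 87, 89, 93, 93, 96, 97, 98, 102, 103, 105, 106, 109, 109, 109,
            110, 111, 115, 119, 120, 127, 128, 131, 131, 140, 162]

print(multimode(combined))  # [109] -- 109 occurs three times
```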
Mode
It may not be at the center of a distribution.
The data distribution on the right is “bimodal” (even statistics can be open-minded).
[Figure: bar chart of the count of each IQ value in the combined classes, showing more than one peak]
Mode
1. It may give you the most likely experience rather than the “typical” or “central” experience.
2. In symmetric distributions, the mean, median, and mode are the same.
3. In skewed data, the mean and median lie further toward the skew than the mode.
[Figure: in a symmetric distribution the mode, median, and mean coincide; in a skewed distribution the mode sits at the peak, with the median and then the mean further toward the tail]
Descriptive Statistics
Summarizing Data:
 Central Tendency (or Groups’ “Middle Values”)
Mean
Median
Mode
– Variation (or Summary of Differences Within Groups)
• Range
• Interquartile Range
• Variance
• Standard Deviation
Range
The spread, or the distance, between the lowest and highest
values of a variable.
To get the range for a variable, you subtract its lowest value from
its highest value.
Class A--IQs of 13 Students:
102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110
Class A Range = 140 − 89 = 51

Class B--IQs of 13 Students:
127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109
Class B Range = 162 − 80 = 82
Interquartile Range
A quartile is the value that marks one of the divisions that breaks a series of values into four
equal parts.
The median is a quartile and divides the cases in half.
25th percentile is a quartile that divides the first ¼ of cases from the latter ¾.
75th percentile is a quartile that divides the first ¾ of cases from the latter ¼.
The interquartile range is the distance or range between the 25th percentile and the 75th
percentile. Below, what is the interquartile range?
[Figure: a distribution running from 0 to 1000, divided into four groups of 25% of cases at the values 250, 500, and 750; here the interquartile range is 750 − 250 = 500]
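A minimal sketch of the range and IQR using the Class B scores above; note that different packages use slightly different quartile conventions, so the IQR may not match a hand calculation exactly:

```python
# Minimal sketch: range and interquartile range for Class B.
# Different packages use slightly different quartile conventions,
# so the IQR can differ a little from hand calculations.
import numpy as np

class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

rng = max(class_b) - min(class_b)          # 162 - 80 = 82
q1, q3 = np.percentile(class_b, [25, 75])  # lower and upper quartiles
print(rng, q3 - q1)
```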
Variance
A measure of the spread of the recorded values on a variable. A measure of
dispersion.
The larger the variance, the further the individual cases are from the mean; the smaller the variance, the closer the individual scores are to the mean.
[Figure: two dot plots around the same mean, one widely spread and one tightly clustered]
Variance
Variance is a number that at first seems complex
to calculate.
Calculating variance starts with a “deviation.”
A deviation is the distance of a case’s score from the mean: $Y_i - \bar{Y}$
Variance
The deviation of 102 from 110.54 is? Deviation of 115?
Class A--IQs of 13 Students:
102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110
$\bar{Y}_A = 110.54$
Variance
The deviation of 102 from 110.54 is: 102 − 110.54 = −8.54
The deviation of 115 is: 115 − 110.54 = 4.46
(Class A--IQs of 13 Students: 102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110; $\bar{Y}_A$ = 110.54)
Variance
• We want to add these to get total deviations, but if we were
to do that, we would get zero every time. Why?
• We need a way to eliminate negative signs.
Squaring the deviations will eliminate negative signs...
A Deviation Squared: $(Y_i - \bar{Y})^2$
Back to the IQ example, the squared deviation for 102 is: $(102 - 110.54)^2 = (-8.54)^2 = 72.93$
and for 115: $(115 - 110.54)^2 = (4.46)^2 = 19.89$
Variance
If you were to add all the squared deviations
together, you’d get what we call the
“Sum of Squares.”
Sum of Squares (SS) = $\sum_i (Y_i - \bar{Y})^2 = (Y_1 - \bar{Y})^2 + (Y_2 - \bar{Y})^2 + \dots + (Y_n - \bar{Y})^2$
Variance
Class A, sum of squares:
$(102 - 110.54)^2 + (115 - 110.54)^2 + (128 - 110.54)^2 + (109 - 110.54)^2 + (131 - 110.54)^2 + (89 - 110.54)^2 + (98 - 110.54)^2 + (106 - 110.54)^2 + (140 - 110.54)^2 + (119 - 110.54)^2 + (93 - 110.54)^2 + (97 - 110.54)^2 + (110 - 110.54)^2$
SS = 2891.23
(Class A--IQs of 13 Students: 102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110; $\bar{Y}$ = 110.54)
Variance
The last step…
The variance is (roughly) the average squared deviation:
SS/N = variance for a population
SS/(n − 1) = variance for a sample
Variance = $\dfrac{\sum_i (Y_i - \bar{Y})^2}{n - 1}$
Variance
For Class A, Variance = SS / (n − 1) = 2891.23 / 12 = 240.94
How helpful is that???
Standard Deviation
To convert variance into something of meaning, let’s create
standard deviation.
The square root of the variance reveals the average deviation of
the observations from the mean.
s.d. = $\sqrt{\dfrac{\sum_i (Y_i - \bar{Y})^2}{n - 1}}$
Standard Deviation
For Class A, the standard deviation is:
$\sqrt{240.94} \approx 15.52$
The average deviation of a person from the mean IQ of 110.54 is about 15.52 IQ points.
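A minimal sketch that walks through the same steps (deviations, sum of squares, variance, standard deviation) for Class A:

```python
# Minimal sketch: deviations, sum of squares, sample variance and s.d. for Class A.
from math import sqrt

class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]

y_bar = sum(class_a) / len(class_a)           # mean, ~110.54
ss = sum((y - y_bar) ** 2 for y in class_a)   # sum of squares, ~2891.23
variance = ss / (len(class_a) - 1)            # sample variance, ~240.94
sd = sqrt(variance)                           # standard deviation, ~15.52
print(round(ss, 2), round(variance, 2), round(sd, 2))
```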
Review:
1. Deviation
2. Deviation squared
3. Sum of squares
4. Variance
5. Standard deviation
Standard Deviation
1. Larger s.d. = greater amounts of variation around the mean. For example:
   [Figure: two distributions, each with $\bar{Y}$ = 25; one with s.d. = 3 (most values between 19 and 31) and one with s.d. = 6 (most values between 13 and 37)]
2. s.d. = 0 only when all values are the same (only when you have a constant and not a “variable”).
3. If you were to “rescale” a variable, the s.d. would change by the same factor: for example, if the values above were multiplied by 10 so the mean equaled 250, the s.d. on the left would be 30 and on the right 60.
4. Like the mean, the s.d. will be inflated by an outlier case value.
Standard Deviation
• Note about computational formulas:
– Textbooks provide useful short-cut formulas for computing the variance and standard deviation.
– These are intended to make hand calculations as quick as possible.
– They obscure the conceptual understanding of our statistics.
– SPSS and the computer are the “computational formulas” now.
Practical Application for Understanding
Variance and Standard Deviation
Even though we live in a world where we pay real MONEY IN RUPEES for
goods and services (not percentages of income), most INDIAN employers
issue raises based on percent of salary.
Why do supervisors think the most fair raise is a percentage raise?
Answer: 1) Because higher paid persons win the most money.
2) The easiest thing to do is raise everyone’s salary by a
fixed percent.
If your budget went up by 5%, salaries can go up by 5%.
The problem is that the flat percent raise gives unequal increased rewards. . .
Practical Application for Understanding
Variance and Standard Deviation
TANDROOST Toilet Cleaning Services
Salary Pool: Rs. 200,000
Incomes:
President: Rs. 100K; Manager: Rs. 50K; Secretary: Rs. 40K; and Toilet Cleaner: Rs. 10K
Mean: Rs. 50K
Range: Rs. 90K
Variance: 1,050,000,000 (rupees squared)
Standard Deviation: Rs. 32.4K
(The range, variance and standard deviation can be considered “measures of inequality”.)
Now, let’s apply a 5% raise.
Practical Application for Understanding
Variance and Standard Deviation
After a 5% raise, the pool of money increases by Rs. 10K to Rs. 210,000.
Incomes:
President: Rs. 105K; Manager: Rs. 52.5K; Secretary: Rs. 42K; and Toilet Cleaner: Rs. 10.5K
Mean: Rs. 52.5K – went up by 5%
Range: Rs. 94.5K – went up by 5%
Variance: 1,157,625,000 (rupees squared)
Standard Deviation: Rs. 34K – went up by 5%
The flat percentage raise increased inequality. The top earner got 50% of the new money; the bottom earner got 5% of the new money. The measures of inequality went up by 5%.
Last year’s statistics:
TANDROOST Toilet Cleaning Services annual payroll of Rs. 200K
Incomes: Rs. 100K, 50K, 40K, and 10K
Mean: Rs. 50K
Range: Rs. 90K; Variance: 1,050,000,000; Standard Deviation: Rs. 32.4K
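A minimal sketch of the raise example; it uses the population variance and s.d. (dividing by N), which is what the figures above are based on:

```python
# Minimal sketch: how a flat 5% raise scales the "measures of inequality".
# Salaries in rupees, taken from the TANDROOST example above.
from statistics import mean, pstdev, pvariance

salaries = [100_000, 50_000, 40_000, 10_000]   # before the raise
raised = [s * 1.05 for s in salaries]          # after a flat 5% raise

for pay in (salaries, raised):
    rng = max(pay) - min(pay)
    print(round(mean(pay)), rng, round(pvariance(pay)), round(pstdev(pay), 1))
# The mean, range and s.d. all scale by exactly 1.05; the variance scales by 1.05**2.
```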
Descriptive Statistics
Summarizing Data:
 Central Tendency (or Groups’ “Middle Values”)
 Mean
 Median
 Mode
 Variation (or Summary of Differences Within Groups)
 Range
 Interquartile Range
 Variance
 Standard Deviation
– …Wait! There’s more
Box-Plots
A way to graphically portray almost all the
descriptive statistics at once is the box-plot.
A box-plot shows:
Upper and lower quartiles
Mean
Median
Range
Outliers
Box-Plots
[Figure: box-plot of the IQ data; minimum 82, lower quartile 96.5, median M = 110.5, upper quartile 123.5, maximum 162; IQR = 27; there is no outlier]
The Data: monthly sales for 2 regions
The Calculations: calculate the Min, Max, Median, Quartile 1 and Quartile 3, then calculate the box heights
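A minimal sketch of the box-plot calculations; the monthly sales figures here are made-up placeholders, since the actual regional data is not reproduced in the transcript:

```python
# Minimal sketch: five-number summary behind a box-plot.
# The sales figures below are hypothetical placeholder data.
import numpy as np

region_a = [12, 15, 14, 18, 21, 19, 16, 22, 17, 20, 13, 24]  # hypothetical monthly sales

q1, med, q3 = np.percentile(region_a, [25, 50, 75])
print(min(region_a), q1, med, q3, max(region_a))  # five-number summary
print(q3 - q1)                                    # box height = IQR
```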
SAMPLING
Sampling denotes the selection of a part of the aggregate statistical material with a view to obtaining information about the whole. This aggregate, or totality of statistical information on a particular character of all the members covered by an investigation, is called the population.
Types of sampling
Simple random, Stratified, Systematic, Cluster, and Multi-stage.
Simple random
It is the simplest of all sampling techniques: every unit in the population has an equal chance of being included in the sample.
There are two methods:
1. With replacement; 2. Without replacement
• In the ‘with replacement’ method, the probability of selecting any particular member of the population at any draw remains a constant 1/N, because before any draw the population contains all N members.
• Interestingly, this result is also true in the ‘without replacement’ method, although the population size varies at each stage of selection. Thus the probability of obtaining the population member $X_k$ (say) at the i-th draw is a constant 1/N in both cases, i.e.
$P(X_i = X_k) = 1/N$ for $i = 1, 2, \dots, n$ and $k = 1, 2, \dots, N$.
Random number series – 1:
98 89 87 75 51 69 41 10 35 8
79 100 98 85 31 95 29 17 99 57
65 33 98 67 42 62 60 72 79 14
22 78 11 78 11 70 90 50 1 8
72 28 33 47 61 80 13 59 81 91
41 79 35 98 58 8 51 27 34 46
41 79 81 28 33 46 44 87 46 85
82 32 17 57 12 93 69 28 30 47
93 7 48 26 82 76 15 21 11 30
15 75 61 69 91 15 26 94 15 47
Random number series – 2:
297 117 273 66 214 293 256 140 108 80
42 169 15 33 281 156 13 214 165 241
299 284 198 122 279 237 197 163 203 47
26 112 58 138 44 39 98 15 274 79
198 81 113 60 114 142 149 91 150 269
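Instead of reading a printed random number table, a computer can draw the sample directly; a minimal sketch for both methods of simple random sampling:

```python
# Minimal sketch: a simple random sample of size n from a population of N units,
# with and without replacement.
import random

population = list(range(1, 101))  # units serially numbered 1..N, here N = 100
n = 10

with_replacement = [random.choice(population) for _ in range(n)]
without_replacement = random.sample(population, n)

print(with_replacement)
print(without_replacement)
```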
Stratified sampling is generally used when the population is heterogeneous, but can be subdivided into strata within each of which the heterogeneity is not so prominent. Some prior knowledge is necessary for this subdivision, termed stratification.
If a proper stratification can be made such that the strata differ from one another as much as possible, but there is much homogeneity within each of them, then a stratified sample will yield better estimates than a random sample of the same size. This is because in stratified sampling the different sections of the population are suitably represented through the sub-samples, whereas in random sampling some of these sections may be over- or under-represented, or may even be omitted.
The principal purposes of stratification are:
I. To increase the precision of the overall estimates,
II. To ensure that all sections of the population are adequately represented,
III. To avoid heterogeneity of the population.
Solve this problem??
A company has a total of 360 employees in four different categories:
Managers: 36; Drivers: 54; Administrative Staff: 90; Production Staff: 180
How many from each category should be included in a stratified random sample of size 20?
To create a sample of size 20 we need 20/360, or 1/18, of the workforce, so we take this fraction of the number of employees in each category:
Managers: 1/18 × 36 = 2
Drivers: 1/18 × 54 = 3
Administrative Staff: 1/18 × 90 = 5
Production Staff: 1/18 × 180 = 10
TOTAL = 20
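A minimal sketch of the proportional allocation above; with fractions that do not divide exactly, the rounded counts may need a small adjustment so they still sum to the sample size:

```python
# Minimal sketch: proportional allocation for the stratified sample above.
strata = {"Managers": 36, "Drivers": 54, "Administrative Staff": 90, "Production Staff": 180}
sample_size = 20
total = sum(strata.values())  # 360

allocation = {name: round(sample_size * count / total) for name, count in strata.items()}
print(allocation, sum(allocation.values()))  # 2, 3, 5, 10 -> total 20
```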
Systematic sampling: Systematic sampling involves the selection of sample units at equal intervals, after all the units in the population have been arranged in some order. If the population size is finite, the units may be serially numbered and arranged. From the first k of these, a single unit is chosen at random. This unit and every k-th unit thereafter constitute the systematic sample. In order to obtain a systematic sample of 500 villages out of 40,000 in Assam, i.e. one out of 80 on average, all the villages have to be numbered serially. From the first 80 of these a village is selected at random, say with serial number 27. Then the villages with serial numbers 27, 107, 187, 267, 347, … constitute the systematic sample.
If the characteristic under study is independent of the order of arrangement of the units, then a systematic sample is practically equivalent to a random sample, and the actual selection of the sample is easier and quicker.
Systematic sampling is suitable when the units are described on serially numbered cards, e.g. workers listed on cards; the sample can then be drawn easily by looking at the serial numbers. The sample may be biased if there are periodic features associated with the sampling interval.
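A minimal sketch of a 1-in-k systematic sample along the lines of the Assam villages example:

```python
# Minimal sketch: 1-in-k systematic sampling from N serially numbered units.
import random

N, n = 40_000, 500
k = N // n                             # sampling interval, 80
start = random.randint(1, k)           # random start among the first k units, e.g. 27
sample = list(range(start, N + 1, k))  # start, start + k, start + 2k, ...
print(len(sample), sample[:5])
```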
Multi-stage Sampling: Multi-stage sampling refers to a sampling procedure which is carried out in several stages. The population is divided into large groups, called first-stage units. These 1st-stage units are again divided into smaller units, called 2nd-stage units; the 2nd-stage units into 3rd-stage units, and so on, until we reach the ultimate units. For example, in order to introduce a scheme on an experimental basis in villages, we may have to select a few villages from the whole of the state. If we apply 3-stage sampling, sub-divisions may be used as 1st-stage units.
Cluster sampling: It involves grouping the population and then selecting the groups, or clusters, rather than individual elements for inclusion in the sample. Suppose a department store wishes to sample its credit card holders. It has issued cards to 15,000 customers, and the sample size is to be, say, 450. For cluster sampling, this list of 15,000 card holders could be formed into 100 clusters of 150 card holders each; three clusters might then be selected randomly. The sample size must be larger than for simple random sampling to ensure the same level of accuracy, because the possibility of both sampling and non-sampling error is greater. The clustering approach can make the sampling procedure relatively easier and increase the efficiency of field work, especially in the case of personal interviews.
How can exam score data be summarised?
Exam marks for 60 students (marked out of 65)
mean = 30.3
sd = 14.46
Summary statistics
• Mean: $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
• Standard deviation (s) is a measure of how much the individuals differ from the mean:
$s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
Large SD = very spread out data
Small SD = there is little variation from the mean
For exam scores, mean = 30.5, SD = 14.46
IQ is normally distributed
Mean = 100, SD = 15.3
95% of values lie within 1.96 SDs of the mean:
mean − 1.96 × SD = 100 − 1.96 × 15.3 ≈ 70
mean + 1.96 × SD = 100 + 1.96 × 15.3 ≈ 130
P(score > 130) = 0.025
[Figure: normal curve for IQ with the central 95% of values between 70 and 130 and 2.5% in each tail]
95% of people have an IQ between 70 and 130
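A minimal sketch of the same calculations, assuming SciPy is available:

```python
# Minimal sketch (assuming SciPy is available): the IQ calculations above.
from scipy.stats import norm

mu, sigma = 100, 15.3
lower, upper = mu - 1.96 * sigma, mu + 1.96 * sigma
print(round(lower), round(upper))         # ~70 and ~130
print(norm.sf(130, loc=mu, scale=sigma))  # P(score > 130) ~ 0.025
```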
Assessing Normality
Charts can be used to informally assess whether data are normally distributed or skewed.
The mean and median are very different for skewed data; sometimes the median makes more sense.
[Figure: right-skewed UK income distribution; roughly 2/3 of people fall below the mean, while 50% fall below the median]
Source: Households Below Average Income: An analysis of the income distribution 1994/95 – 2011/12, Department for Work and Pensions
Choosing summary statistics
Which average and measure of
spread?
• Scale, normally distributed: Mean (Standard deviation)
• Scale, skewed data: Median (Interquartile range)
• Categorical, ordinal: Median (Interquartile range)
• Categorical, nominal: Mode (none)
Hypothesis Testing
Hypothesis testing
• An objective method of making
decisions or inferences from sample
data (evidence)
• Sample data used to choose between
two choices i.e. hypotheses or
statements about a population
• We typically do this by comparing
what we have observed to what we
expected if one of the statements
(Null Hypothesis) was true
Hypothesis testing Framework
What the text books might say!
• Always two hypotheses:
HA: Research (Alternative) Hypothesis
• What we aim to gather evidence of
• Typically that there is a
difference/effect/relationship etc.
H0: Null Hypothesis
• What we assume is true to begin with
• Typically that there is no
difference/effect/relationship etc.
Discussion
• How could you help a student
understand what hypothesis
testing is and why they need to
use it?
Could try explaining things in the
context of “The Court Case”?
• Members of a jury have to decide whether a person is guilty or innocent based on evidence
Null: The person is innocent
Alternative: The person is not innocent (i.e. guilty)
• The null can only be rejected if there is enough evidence to doubt it
• i.e. the jury can only convict if the evidence against the null of innocence is beyond reasonable doubt
• They do not know whether the person is really guilty or innocent, so they may make a mistake
Types of Errors
• H0 is true (difference does NOT exist in the population):
  – Study reports NO difference (do not reject H0): correct decision
  – Study reports IS a difference (reject H0): Type I Error (typically restricted to a 5% risk = level of significance)
• HA is true (difference DOES exist in the population):
  – Study reports NO difference (do not reject H0): Type II Error (controlled via sample size; = 1 − power of test)
  – Study reports IS a difference (reject H0): correct decision; probability of this = power of test
Steps to undertaking a Hypothesis test
1. Define the study question
2. Set the null and alternative hypotheses
3. Choose a suitable test
4. Calculate a test statistic
5. Calculate a p-value
6. Make a decision and interpret your conclusions
What does it mean for two categorical variables
to be related?
• Remember that Chi-Square is used to test for a relationship
between 2 Categorical variables.
Ho: There is no relationship between the variables.
Ha: There is a relationship between the variables.
• If two categorical variables are related, it means the chance
that an individual falls into a particular category for one
variable depends upon the particular category they fall into
for the other variable.
• Let’s say that we wanted to determine if there is a
relationship between religion (Christian, Jew, Muslim, Other)
and smoking. When we test if there is a relationship between
these two variables, we are trying to determine if being part
of a particular religion makes an individual more likely to be a
smoker. If that is the case, then we can say that Religion and
Smoking are related or associated.
Chi-squared test statistic
• The chi-squared test is used when we want to see if
two categorical variables are related
• The test statistic for the Chi-squared test uses the
sum of the squared differences between each pair of
observed (O) and expected values (E)
$\chi^2 = \sum_{i=1}^{n} \dfrac{(O_i - E_i)^2}{E_i}$
The resulting value is compared against the chi-square distribution (Table A).
Chi-Square test for 2-way tables
• Suppose we are studying two categorical variables in a population, where the first variable has r levels (i.e. possible outcomes) and the second one has c levels.
• We can summarize a sample from this population using a table with r rows and c columns.
• A two-way table, also called a contingency table, displays the counts of how many individuals fall into each possible combination of categories of the two categorical variables. So, each cell of the table (the total number of cells is r × c) represents a combination of categories of the two variables.
• The following table presents the data on race and smoking. The two variables of interest, race and smoking, have r = 4 and c = 2, resulting in 4 × 2 = 8 combinations of categories.
Race        NSmoke    Smoke
Caucasian   620       75
Black       240       41
Hispanic    130       29
Other       190       38
Chi-Square test for 2-way tables
• By considering the number of observations falling into each category, we will see how to test hypotheses of the form:
H0: The two variables are not associated.
Ha: The two variables are associated.
• Two different experimental situations will lead to contingency tables:
1. If we have two populations under study, both of which have a
particular trait with respect to a categorical variable. In this case the
null hypothesis is a statement of homogeneity among the two
populations.
2. If we have one population under study, and we are interested to
check the relationship between two categorical variables. In this case
the null hypothesis is a statement of independence between the two
variables.
• For sufficiently large samples, the same test is appropriate for both of these situations. This test is called the chi-square test, and in the following we will go over the steps for testing the relationship between two variables.
Some Notation!
• For i taking values from 1 to r (number of rows) and j taking
values from 1 to c (number of columns), denote:
Ri = total count of observations in the i-th row.
Cj = total count of observations in the j-th column.
Oij = observed count for the cell in the i-th row and the j-th column.
Eij = expected count for the cell in the i-th row and the j-th column if the two
variables were independent, i.e if H0 was true. These counts are calculated
as
$\text{Expected} = \dfrac{\text{Row total} \times \text{Column total}}{\text{Total sample size}}$, thus $E_{ij} = \dfrac{R_i \times C_j}{n}$
Example
Race        NSmoke        Smoke        Total
Caucasian   O11 = 620     O12 = 75     R1 = 695
Black       O21 = 240     O22 = 41     R2 = 281
Hispanic    O31 = 130     O32 = 29     R3 = 159
Other       O41 = 190     O42 = 38     R4 = 228
Total       C1 = 1180     C2 = 183     n = 1363

E11 = (695 × 1180)/1363    E12 = (695 × 183)/1363
E21 = (281 × 1180)/1363    E22 = (281 × 183)/1363
E31 = (159 × 1180)/1363    E32 = (159 × 183)/1363
E41 = (228 × 1180)/1363    E42 = (228 × 183)/1363
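A minimal sketch that computes the expected counts E_ij = (row total × column total)/n for the table above:

```python
# Minimal sketch: expected counts for the race-by-smoking table,
# E_ij = (row total * column total) / n.
observed = {
    "Caucasian": (620, 75),
    "Black":     (240, 41),
    "Hispanic":  (130, 29),
    "Other":     (190, 38),
}
col_totals = [sum(row[j] for row in observed.values()) for j in range(2)]  # [1180, 183]
n = sum(col_totals)                                                        # 1363

for race, row in observed.items():
    r_total = sum(row)
    expected = [r_total * c / n for c in col_totals]
    print(race, [round(e, 1) for e in expected])
```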
Chi-Square Analysis Details
The 5 Steps in a Chi-Square Test:
•
Step 1: Write the null and alternative hypothesis.
H0: There is no relationship between the variables.
Ha: There is a relationship between the variables.
• Step 2: Compute expected values.
• Step 3: Calculate the test statistic and p-value.
The test statistic measures the difference between the observed counts and the expected counts assuming independence:
$\chi^2 = \sum_{\text{all cells } i,j} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}} = \sum \dfrac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
This is called the chi-square statistic because, if the null hypothesis is true, it has a chi-square distribution with (r − 1) × (c − 1) degrees of freedom.
Chi-Square Analysis Details
• Step 3 Cont. Find the p-value.
 If the χ²-statistic is large, it implies that the observed counts are not close to the counts we would expect to see if the two variables were independent. Thus, a “large” χ² gives evidence against the null hypothesis and supports the alternative.
 The p-value of the chi-square test is the probability that the χ²-statistic is as large as or larger than the value we obtained, if H0 is true.
 Thus, the p-value for the chi-square test is ALWAYS the area to the right of the test statistic under the curve, i.e. p-value = P(X > χ²), where X has a chi-square distribution with (r − 1) × (c − 1) df.
 To get this probability we need to use a chi-square distribution with (r − 1) × (c − 1) df (Table A).
Chi-Square Analysis Details
• Step 4: Decide whether or not the result is statistically significant. The results are statistically significant if the p-value is less than alpha, where alpha is the significance level (usually α = 0.05).
• Step 5: Report the conclusion in the context of the situation.
 “The p-value is ______ which is < α; this result is statistically significant. Reject H0 and conclude that (the two variables) are related.”
 “The p-value is ______ which is > α; this result is NOT statistically significant. We cannot reject H0 and cannot conclude that (the two variables) are related.”
Detailed Example
• Derek wants to know if the geographical area that a student grew
up in is associated with whether or not that the student drinks
alcohol. Below are the results he obtained from a random sample
of PSU students
              No     Yes    Total
Big City      21     65     86
Rural         11     130    141
Small Town    18     198    216
Suburban      37     345    382
Total         87     738    825
Detailed Example
1. Ho: There is no relationship between the geographical area that a student grew up in and whether or not the student drinks alcohol.
Ha: There is a relationship between the geographical area that a student grew up in and whether or not the student drinks alcohol.
2. To check the conditions we need to calculate the expected counts for each cell.
E11 = (R1 × C1)/n = (86 × 87)/825 = 9.07,
E12 = (R1 × C2)/n = (86 × 738)/825 = 76.93, …
E32 = (R3 × C2)/n = (216 × 738)/825 = 193.22, …
Detailed Example
3. Chi-square statistic and p-value:
χ² = Σ {(Observed − Expected)²/Expected}
   = (21 − 9.07)²/9.07 + (65 − 76.93)²/76.93
   + (11 − 14.87)²/14.87 + (130 − 126.13)²/126.13
   + (18 − 22.78)²/22.78 + (198 − 193.22)²/193.22
   + (37 − 40.28)²/40.28 + (345 − 341.72)²/341.72
   = 20.091
df = (4 − 1) × (2 − 1) = 3
p-value = P(X > 20.091) < P(X > 16.17) = 0.001 (see Table A)
4. Since the p-value < 0.05, the test is significant, and we can reject the null.
5. We can conclude that there is a relationship between the geographical area that a student grew up in and whether or not the student drinks alcohol.
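A minimal sketch, assuming SciPy is available, that reproduces the same chi-square test from the observed table:

```python
# Minimal sketch (assuming SciPy is available): chi-square test for the
# area-by-drinking table, reproducing chi-square ~ 20.09 with 3 df.
from scipy.stats import chi2_contingency

table = [[21, 65],    # Big City
         [11, 130],   # Rural
         [18, 198],   # Small Town
         [37, 345]]   # Suburban

chi2, p, df, expected = chi2_contingency(table)
print(round(chi2, 3), df, p)  # ~20.091, 3, p well below 0.001
```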
Example: Titanic
• The ship Titanic sank in 1912 with the loss of
most of its passengers
• 809 of the 1,309 passengers and crew died
= 61.8%
• Research question: Did class (of travel) affect
survival?
Chi squared Test?
• Null:
There is NO association between
class and survival
• Alternative: There IS an association between
class and survival
What would be expected if the null is true?
• Same proportion of people would have died in each class!
• Overall, 809 people died out of 1309 = 61.8%
Chi-Squared Test Actually Compares Observed and
Expected Frequencies
Expected number dying in each class = 0.618 * no. in class
Using SPSS
Analyse → Descriptive Statistics → Crosstabs
Click on ‘Statistics’ button & select Chi-squared
Test Statistic = 127.859
p-value
p < 0.001
Note: Double clicking on the output will display the p-value to
more decimal places
Hypothesis Testing: Decision Rule
• We can use statistical software to undertake a
hypothesis test e.g. SPSS
• One part of the output is the p-value (P)
• If P < 0.05 reject H0 => Evidence of HA being
true (i.e. IS association)
• If P > 0.05 do not reject H0 (i.e. NO association)
Comparing means
T-tests
Paired or Independent (Unpaired) Data?
T-tests are used to compare two population means
– Paired data: same individuals studied at two different times or under two conditions → PAIRED T-TEST
– Independent: data collected from two separate groups → INDEPENDENT SAMPLES T-TEST
Comparison of hours worked in 1988 to today
Paired or unpaired?
If the same people have reported their hours for 1988 and 2014, we have PAIRED measurements of the same variable (hours).
Paired null hypothesis: The mean of the paired differences = 0
If different people are used in 1988 and 2014, we have independent measurements.
Independent null hypothesis: The mean hours worked in 1988 is equal to the mean for 2014, i.e. $H_0: \mu_{1988} = \mu_{2014}$
SPSS data entry
Paired Data
Independent Groups
What is the t-distribution?
 The t-distribution is similar to the standard normal distribution
but has an additional parameter called degrees of freedom (df
or v)
For a paired t-test, v = number of pairs – 1
For an independent t-test, $v = n_{\text{group1}} + n_{\text{group2}} - 2$
 Used for small samples and when the population standard
deviation is not known
 With small sample sizes, the t-distribution has heavier tails
Relationship to normal
• As the sample size gets big, the t-distribution
matches the normal distribution
[Figure: t-distribution curves approaching the normal curve as the degrees of freedom increase]
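A minimal sketch of both kinds of t-test, assuming SciPy is available; the hours-worked values are hypothetical, since the survey data itself is not included here:

```python
# Minimal sketch (assuming SciPy is available) with made-up hours-worked data:
# a paired t-test for the same people measured twice, and an
# independent-samples t-test for two separate groups.
from scipy.stats import ttest_rel, ttest_ind

hours_1988 = [38, 40, 42, 37, 45, 39, 41, 44]   # hypothetical paired measurements
hours_2014 = [36, 41, 40, 35, 43, 38, 40, 42]   # same people, later year
print(ttest_rel(hours_1988, hours_2014))        # paired t-test

group_1988 = [38, 40, 42, 37, 45, 39, 41, 44]   # hypothetical independent groups
group_2014 = [36, 41, 40, 35, 43, 38, 40, 42, 39]
print(ttest_ind(group_1988, group_2014))        # independent-samples t-test
```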
Oneway ANOVA
• Analysis of variance is used to test for differences among
more than two populations. It can be viewed as an extension
of the t-test we used for testing two population means.
• The specific analysis of variance test that we will study is often
referred to as the oneway ANOVA. ANOVA is an acronym for
ANalysis Of VAriance. The adjective oneway means that
there is a single variable that defines group membership
(called a factor). Comparisons of means using more than one
variable is possible with other kinds of ANOVA analysis.
Why Not Use Multiple T-tests?
• It might seem logical to use multiple t-tests if we wanted to
compare a variable for more than two groups. For example,
if we had three groups, we might do three t-tests: group 1
versus group 2, group 1 versus group 3, and group 2 versus
group 3.
• However, doing three hypothesis tests to compare groups
changes the probability that we are making an error (the
alpha error rate). When conducting multiple tests of
significance, the chance of making at least one alpha error
over the series of tests is greater than the selected alpha
level for each individual test. Thus, if we do multiple t-tests
on the same variables with an alpha level of 0.05, the
chances that we are making a mistake in applying our
findings to the population is actually greater than 0.05.
Step 1. Assumptions for the Test
• Level of measurement of the group variable can be any level
of variable that identifies groups.
• Level of measurement of the test variable is interval.
• The test variable is normally distributed in the population:
– skewness and kurtosis between –1.0 and +1.0, or
– number in each group is greater than 10 (central limit theorem)
• The variances (dispersion) of the groups are equal.
Step 2. Hypotheses and alpha
• The research hypothesis is that the mean of at least one of
the population groups is different from the means of the
other groups.
• The null hypothesis is that the means of all of the
population groups are equal.
• If we don’t have a specific reason for setting the level of
significance to a specific probability, we can use the
traditional benchmark of 0.05. This means that we are
willing to risk making a mistake in our decision to reject the
null hypothesis if it only happens once in every 20
decisions, or our decision would be correct 19 out of 20
times.
Step 3. Sampling distribution and test
statistic
• In the ANOVA test, the probability is obtained from the “F”
distribution instead of the normal curve distribution.
• The test statistic is also referred to as the F-ratio or F-test
because it follows the F-distribution.
Step 4. Computing the Test Statistic
• Conceptually the test statistic is computed in a way similar to
the independent samples t-test. Both are computed by
dividing the differences in means by the measure of variability
among the groups.
• We identify the probability of the test statistic from the SPSS
statistical output.
Step 5. Decision and Interpretation
• If the probability of the test statistic is less than or equal to
the probability of the level of significance (alpha error rate),
we reject the null hypothesis and conclude that our data
supports the research hypothesis.
• If the probability of the test statistic is greater than the
probability of the level of significance (alpha error rate), we
fail to reject the null hypothesis and conclude that our data
does not support the research hypothesis.
Interpreting Differences in
Population Means
• If we fail to reject the null hypothesis, we can state that we
found no differences among the means for the population
groups for this characteristic. We do not say they are equal.
• If we reject the null hypothesis, we can conclude that the
mean for at least one population group is different from the
others.
The ANOVA test itself does NOT tell us which group means are
different. To determine this, we use a Post Hoc test, such as
the Tukey HSD (honestly significant differences), LSD (least
significant difference) Post Hoc Test.
Post Hoc Test for Difference in Means
• Just as we used a post hoc test to identify which cells in a
frequency table were responsible for the statistically
significant result, we use a post hoc test to identify the
differences in pairs of means that produce a statistically
significant result in an ANOVA table.
• We only look at the post hoc test when the probability of
the ANOVA statistic causes us to reject the null hypothesis,
i.e. the probability of the test statistic is less than the level
of significance.
• The Post Hoc Test may NOT reveal differences among
group means even when we reject the null hypothesis in
the ANOVA test.
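A minimal sketch of a one-way ANOVA, assuming SciPy is available; the three groups are hypothetical illustrative data:

```python
# Minimal sketch (assuming SciPy is available): one-way ANOVA on three
# hypothetical groups; the factor is simply "which list a value belongs to".
from scipy.stats import f_oneway

group1 = [23, 25, 27, 22, 26]
group2 = [30, 28, 31, 29, 32]
group3 = [24, 27, 26, 25, 28]

f_stat, p_value = f_oneway(group1, group2, group3)
print(round(f_stat, 2), round(p_value, 4))
# If p_value < 0.05 we reject H0; a post hoc test (e.g. Tukey HSD) would then
# identify which pairs of group means differ.
```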
Inflation of Type I Error (Alpha)
• Type I Error: Probability of falsely rejecting null hypothesis
when it is true.
• The only time you need to worry about inflation of Type I
error rate is when you look for a lot of effects in your data.
• The more effects you look for, the more likely it is that you
will turn up an effect that doesn't really exist (Type I error!).
• Doing all possible pair-wise comparisons (t-test) on a oneway ANOVA would increase the overall Type I error rate.
ANOVA post hoc Test in SPSS (1)
Next step is to examine the
distribution of the dependent
variable. You can check whether the
dependent variable is normally
distributed or not in:
Analyze > Descriptive Statistics >
Descriptives…
ANOVA post hoc Test in SPSS (2)
After moving [age] into
“Variable(s):” box, click
“Options…” button to
select the distribution
statistics.
ANOVA post hoc Test in SPSS (3)
Select “Kurtosis” and
“Skewness” to examine
whether [age] is normally
distributed or not.
Then, click “Continue”
and “OK” buttons.
ANOVA post hoc Test in SPSS (4)
[Age] satisfied the criteria for a normal
distribution. The skewness of the distribution
(.590) was between -1.0 and +1.0 and the
kurtosis of the distribution (-.150) was
between -1.0 and +1.0.
ANOVA post hoc Test in SPSS (5)
You can conduct ANOVA by
clicking:
Analyze > Compare Means >
One-Way ANOVA…
ANOVA post hoc Test in SPSS (6)
Now, click “Post Hoc…”
button to select post hoc
test option.
ANOVA post hoc Test in SPSS (7)
Select “Tukey” in
“Equal Variances
Assumed” panel.
Enter alpha in the “Significance level:” textbox. It is the same as the alpha level (.01) in the problem.
Then, click “Continue” and “OK”
buttons.
Data Sheet
[Table: raw observations, means and standard deviations for a Control group and treatments T1–T4]
Arrangement for ANOVA
Treatment     Observation
Control-1     32
Control-2     34
Control-3     31
T2R1          34
T2R2          34
T2R3          33
T3R1          33
T3R2          33
T3R3          33
T4R1          33
T4R2          32
Correlation
Correlation quantifies the extent to which two quantitative variables, X and Y, “go together.” When high values of X are associated with high values of Y, a positive correlation exists. When high values of X are associated with low values of Y, a negative correlation exists.
Now, we have some data, but how do we start?
The first step is to create a scatter plot of the data.
Let us deal with an example!!
We use the following data set to illustrate correlational methods. In this cross-sectional data set, each observation represents a district of Assam. The X variable is socioeconomic status, measured as the percentage of children in a neighborhood receiving mid-day meals at school. The Y variable is the percentage of school children owning a bicycle. Twelve districts are considered:
District     X (% receiving mid-day meal)    Y (% owning bicycle)
Dhubri       50                              22.1
Kokrajahr    11                              35.9
Dhemaji      2                               57.9
Dibrugarh    19                              22.2
Morigaon     26                              42.4
Kamrup       73                              5.8
Goalpara     81                              3.6
Sonitpur     51                              21.4
Sivsagar     11                              55.2
Darrang      2                               33.3
Nagaon       19                              32.4
Barpeta      25                              38.4
[Scatter plot: X = % receiving mid-day meal, Y = % owning a bicycle]
A scatter plot of the illustrative data reveals that high values of X are associated with low values of Y; that is, as the percentage of children receiving mid-day meals rises, the percentage owning a bicycle falls, indicating a negative correlation.
Correlation Coefficient
Correlation coefficients (denoted r) are statistics that quantify the relation between X and Y in unit-free terms. When all points of a scatter plot fall directly on a line with an upward incline, r = +1. When all points fall directly on a downward incline, r = −1.
Such perfect correlation is seldom encountered. We still need to measure correlational strength, defined as the degree to which data points adhere to an imaginary trend line passing through the “scatter cloud.” Strong correlations are associated with scatter clouds that adhere closely to the imaginary trend line; weak correlations are associated with scatter clouds that adhere only loosely to the trend line.
The closer r is to +1, the stronger the positive correlation. The closer r is to −1, the stronger the negative correlation. Examples of strong and weak correlations are shown below.
Note: Correlational strength cannot be quantified visually. It is too subjective and is easily influenced by axis scaling; the eye is not a good judge of correlational strength.
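A minimal sketch that quantifies the correlation for the district data above, assuming SciPy is available:

```python
# Minimal sketch (assuming SciPy is available): Pearson's r for the Assam district data.
from scipy.stats import pearsonr

x = [50, 11, 2, 19, 26, 73, 81, 51, 11, 2, 19, 25]                           # % receiving mid-day meal
y = [22.1, 35.9, 57.9, 22.2, 42.4, 5.8, 3.6, 21.4, 55.2, 33.3, 32.4, 38.4]   # % owning bicycle

r, p = pearsonr(x, y)
print(round(r, 2), round(p, 4))  # r is strongly negative for these data
```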