Transcript Chapter 2

Psy B07
DESCRIBING AND EXPLORING DATA
Chapter 2
Slide 1
Psy B07
Outline
• Plotting data
• Grouping data
• Terminology
• Notation
• Measures of Central Tendency
• Measures of Variability
• Properties of a Statistic
Chapter 2
Slide 2
Psy B07
Plotting Data
• Once a set of data has been collected, the raw numbers must be organized or summarized in some fashion to make them more informative.
• Several options are available, including plotting the data and calculating descriptive statistics.
Chapter 2
Slide 3
Psy B07
Plotting Data
• Raw data of typical age and weight in a second-year course (made-up data)

Age  Weight    Age  Weight
18   107       20   108
26   115       21   110
21   108       20   109
21   111       19   127
25   163       19   143
18   119       21   121
20   119       22   112
21   200       19   136
18   178       20   161
21   135       20   131
21   143       19   144
21   113       19   123
20   103       19   101
21   166       20   193
20   112       20   127
23   151       19   158
22   192       20   149
20   135       20   138
21   117       20   129
22   138       22   138
24   137       22   137
26   161       19   156
19   117       23   122
19   142       20   132
Slide 4
Psy B07
Plotting Data
• Often, the first thing one does with a set of raw data is to plot a frequency distribution.
• Usually this is done by first creating a table of the frequencies broken down by values of the relevant variable; the frequencies in the table are then plotted in a histogram.
Chapter 2
Slide 5
Psy B07
Plotting Data
• Example: Typical age in a second-year course

Age  Frequency
18   3
19   10
20   14
21   10
22   5
23   2
24   1
25   1
26   2

• Note: The frequencies in the table above were calculated by simply counting the number of subjects having the specified value for the age variable.
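The counting step described in the note can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original slides: the list `ages` retypes the 48 ages from the raw data slide, and the name `age_freq` is an arbitrary choice.

```python
from collections import Counter

# The 48 ages from the raw data slide (made-up data).
ages = [18, 26, 21, 21, 25, 18, 20, 21, 18, 21, 21, 21,
        20, 21, 20, 23, 22, 20, 21, 22, 24, 26, 19, 19,
        20, 21, 20, 19, 19, 21, 22, 19, 20, 20, 19, 19,
        19, 20, 20, 19, 20, 20, 20, 22, 22, 19, 23, 20]

# Count how many subjects have each age value.
age_freq = Counter(ages)

# Print the frequency table in order of age.
for age in sorted(age_freq):
    print(age, age_freq[age])
```

The printed rows should match the Age/Frequency table, and the frequencies add up to the 48 subjects in the raw data.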
Slide 6
Psy B07
Plotting Data
[Figure: histogram of the age frequency table, with Age (18 to 26) on the x-axis and Frequency (0 to 16) on the y-axis.]
Chapter 2
Slide 7
Psy B07
Grouping Data
• Plotting is easy when the variable of interest has a relatively small number of distinct values (as our age variable did).
• However, some variables take many distinct values (they are closer to continuous), and plotting one bar per value in the above manner then gives an uninformative frequency plot.
Chapter 2
Slide 8
Psy B07
Grouping Data
• For example, our weight variable ranges from about 100 lb. to 200 lb. If we used the previously described technique, we would end up with about 100 bars, most of them with a frequency of less than 2 or 3 (and many with a frequency of zero).
• We can get around this problem by grouping our values into bins. Try for around 10 bins with natural splits.
Chapter 2
Slide 9
Psy B07
Grouping Data

Weight Bin   Midpoint   Frequency
100 - 109    104.5      6
110 - 119    114.5      10
120 - 129    124.5      6
130 - 139    134.5      10
140 - 149    144.5      5
150 - 159    154.5      3
160 - 169    164.5      4
170 - 179    174.5      1
180 - 189    184.5      0
190 - 199    194.5      2
200 - 209    204.5      1
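The grouping into bins can be sketched as follows. This is a minimal illustration under stated assumptions, not the course's own procedure: the list `weights` retypes the 48 weights from the raw data slide, and `BIN_WIDTH` and `bin_freq` are hypothetical names.

```python
from collections import Counter

# The 48 weights (lb) from the raw data slide (made-up data).
weights = [107, 115, 108, 111, 163, 119, 119, 200, 178, 135, 143, 113,
           103, 166, 112, 151, 192, 135, 117, 138, 137, 161, 117, 142,
           108, 110, 109, 127, 143, 121, 112, 136, 161, 131, 144, 123,
           101, 193, 127, 158, 149, 138, 129, 138, 137, 156, 122, 132]

BIN_WIDTH = 10  # bins of 10 lb: 100-109, 110-119, ...

# Map each weight to the lower edge of its bin, then count per bin.
bin_freq = Counter((w // BIN_WIDTH) * BIN_WIDTH for w in weights)

# Print bin, midpoint, and frequency (a Counter gives 0 for empty bins).
for low in range(100, 210, BIN_WIDTH):
    high = low + BIN_WIDTH - 1
    midpoint = (low + high) / 2
    print(f"{low} - {high}  {midpoint}  {bin_freq[low]}")
```

Changing `BIN_WIDTH` (for example to 20) is a quick way to see how the choice of bin width changes the shape of the resulting frequency table.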
Slide 10
Psy B07
Grouping Data
[Figure: histogram of the grouped weight data, with Weight (lbs), labelled by bin midpoint (104.5 to 204.5), on the x-axis and Frequency (0 to 12) on the y-axis.]
• Check out this demo, which shows how the bin width you select can affect the “look” of the data.
• Here is another, similar demonstration of the effects of bin width.
• See the section in the text on cumulative frequency distributions.
Chapter 2
Slide 11
Psy B07
Terminology
• Frequency histograms often have a roughly symmetrical bell shape; such distributions are called normal or Gaussian.

Height (Inches), bin midpoint   Frequency
60.5                            3
62.5                            8
64.5                            7
66.5                            12
68.5                            7
70.5                            6
72.5                            4
74.5                            0
76.5                            1

[Figure: histogram of the height data above, with Height (Inches) on the x-axis and Frequency (0 to 14) on the y-axis.]
Chapter 2
Slide 12
Psy B07
Terminology
• Sometimes the bell shape is not symmetrical.
• The term positive skew refers to the situation where the “tail” of the distribution is to the right; negative skew is when the “tail” is to the left.
Chapter 2
Slide 13
Psy B07
Terminology
[Figure: two frequency histograms. The first repeats the height data (bin midpoints 60.5 to 76.5 inches); the second shows a distribution with bin midpoints from 0.75 to 20.75. Frequency (0 to 14) is on the y-axis.]
Slide 14
Psy B07
Notation
• Variables
• When we describe a set of data corresponding to the values of some variable, we will refer to that set using a letter such as X or Y.
• When we want to talk about a specific data point within that set, we identify it by adding a subscript to the letter, as in $X_1$.
Chapter 2
Slide 15
Psy B07
Notation
$X_1 = 5$, $X_2 = 8$, $X_3 = 12$, $X_4 = 3$, $X_5 = 6$, $X_6 = 8$, $X_7 = 7$
Slide 16
Psy B07
Notation
• The Greek letter sigma, which looks like $\Sigma$, means “add up” or “sum” whatever follows it.
• Thus, $\sum X_i$ means “add up all the $X_i$s”.
• If we use the $X_i$s from the previous example, $\sum X_i = 49$ (or just $\sum X$).
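As a small illustration (not part of the original slides), the same subscript and summation notation maps onto a Python list. The one thing to watch is that Python counts positions from 0, so $X_1$ is `X[0]`; the variable names here are arbitrary.

```python
# The seven data points from the previous example: X1 ... X7.
X = [5, 8, 12, 3, 6, 8, 7]

# Python indexes from 0, so the subscript i corresponds to X[i - 1].
x_4 = X[4 - 1]

# "Sigma X" simply means: add up every value in the set.
total = sum(X)

print(x_4, total)
```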
Chapter 2
Slide 17
Psy B07
Nasty Example
Student   Midterm Mark (X)   Real Mark (Y)
1         82                 84
2         66                 51
3         70                 72
4         81                 56
5         61                 73
Slide 18
Psy B07
Nasty Example
$\sum X = 360$
$\sum Y = 336$
$\sum (X - Y) = 24$
$\sum X^2 = 26262$
$(\sum X)^2 = 129600$
Chapter 2
Slide 19
Psy B07
Your turn
$\sum XY = 24283$
$(\sum (X - Y))^2 = 576$
$\sum (X^2 - Y^2) = 2956$
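The sums in this example can be checked with a short script. This is a sketch for self-checking only; the variable names are hypothetical. The key point it illustrates is the order of operations: squaring each value and then summing is not the same as summing and then squaring.

```python
X = [82, 66, 70, 81, 61]  # midterm marks
Y = [84, 51, 72, 56, 73]  # real marks

sum_x = sum(X)                                    # sum of X
sum_y = sum(Y)                                    # sum of Y
sum_diff = sum(x - y for x, y in zip(X, Y))       # sum of (X - Y)
sum_x_sq = sum(x ** 2 for x in X)                 # square each X, then sum
sq_sum_x = sum(X) ** 2                            # sum the Xs, then square
sum_xy = sum(x * y for x, y in zip(X, Y))         # sum of the products XY
sq_sum_diff = sum_diff ** 2                       # (sum of (X - Y)) squared
sum_sq_diff = sum(x ** 2 - y ** 2 for x, y in zip(X, Y))  # sum of (X^2 - Y^2)

print(sum_x, sum_y, sum_diff, sum_x_sq, sq_sum_x)
print(sum_xy, sq_sum_diff, sum_sq_diff)
```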
Chapter 2
Slide 20
Psy B07
Notation
• Sometimes things are made more complicated because a letter (e.g., X) is used to refer to an entire data set (as opposed to a single variable), and multiple subscripts are used to identify specific data points.
Chapter 2
Slide 21
Psy B07
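Notation
[Table: scores for each student (rows) in each week (columns, Weeks 1 to 5). The individual cell values are not reliably recoverable from this transcript.]
$X_{24} = 3$
$\sum X$ or $\sum\sum X_{ij} = 61$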
Chapter 2
Slide 22
Psy B07
Measures of Central Tendency
• While distributions provide an overall picture of a data set, it is sometimes desirable to summarize the entire data set using a few descriptive statistics.
• The first descriptive statistics we will discuss are those used to indicate where the centre of the distribution lies.
Chapter 2
Slide 23
Psy B07
Measures of Central Tendency
[Figure: histogram of the height data, with Height (Inches), labelled by bin midpoint (60.5 to 76.5), on the x-axis and Frequency (0 to 14) on the y-axis.]
Chapter 2
Slide 24
Psy B07
Measures of Central Tendency
• There are, in fact, three different measures of central tendency.
• The first of these is called the mode.
• The mode is simply the value of the relevant variable that occurs most often (i.e., has the highest frequency) in the sample.
Chapter 2
Slide 25
Psy B07
Measures of Central Tendency
• Note that if you have done a frequency histogram, you can often identify the mode simply by finding the value with the highest bar.
• However, that will not work when grouping was performed prior to plotting the histogram (although you can still use the histogram to identify the modal group, just not the modal value).
Chapter 2
Slide 26
Psy B07
Measures of Central Tendency
• Create a non-grouped frequency table as described previously, then identify the value with the greatest frequency.
• Example: Class height.

Value  Freq    Value  Freq
61     3       69     3
62     4       70     2
63     4       71     4
64     4       72     4
65     3       73     0
66     7       74     0
67     5       75     0
68     4       76     1
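Finding the mode from a frequency table amounts to picking the value with the largest count. The sketch below is an illustration only; the dictionary `height_freq` retypes the class height table, and the names are hypothetical.

```python
# Non-grouped frequency table for class height: value -> frequency.
height_freq = {61: 3, 62: 4, 63: 4, 64: 4, 65: 3, 66: 7, 67: 5, 68: 4,
               69: 3, 70: 2, 71: 4, 72: 4, 73: 0, 74: 0, 75: 0, 76: 1}

# The mode is the value whose frequency is largest.
# (If several values tied for the top frequency, max() would return
# only the first of them, so ties would need separate handling.)
mode = max(height_freq, key=height_freq.get)

print(mode, height_freq[mode])
```

For this table the top frequency is unique, so the tie caveat does not come into play.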
Slide 27
Psy B07
Measures of Central Tendency
• A second measure of central tendency is called the median.
• The median is the point corresponding to the score that lies in the middle of the distribution (i.e., there are as many data points above the median as there are below the median).
Chapter 2
Slide 28
Psy B07
Measures of Central Tendency
• To find the median, the data points must first be sorted into either ascending or descending numerical order.
• The position of the median value can then be calculated using the following formula:
$\text{Median Location} = \frac{N + 1}{2}$
Chapter 2
Slide 29
Psy B07
Measures of Central Tendency
1) If there is an odd number of data points:
(1, 3, 3, 4, 4, 5, 6, 7, 12)
$\text{Median Location} = \frac{9 + 1}{2} = 5$
The median is the item in the fifth position of the ordered data set; therefore the median is 4.
Chapter 2
Slide 30
Psy B07
Measures of Central Tendency
2) If there is an even number of data points:
(1, 3, 3, 3, 5, 5, 6, 7)
$\text{Median Location} = \frac{8 + 1}{2} = 4.5$
We take the average of the two values on either side of that location (the fourth and fifth values, 3 and 5), in this case giving us 4.
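The median-location rule for both the odd and even cases can be sketched as a small function. This is an illustrative sketch that assumes a non-empty list of numbers; the function name `median_by_location` is hypothetical, and Python's standard `statistics.median` is used only as a cross-check.

```python
import statistics

def median_by_location(values):
    """Median via the (N + 1) / 2 location rule; assumes a non-empty list."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) / 2  # 1-based position of the median
    if n % 2 == 1:
        # Odd N: the location is a whole number, so take that item.
        return ordered[int(location) - 1]
    # Even N: the location ends in .5, so average the items on either side.
    lower = ordered[int(location) - 1]
    upper = ordered[int(location)]
    return (lower + upper) / 2

odd_median = median_by_location([1, 3, 3, 4, 4, 5, 6, 7, 12])
even_median = median_by_location([1, 3, 3, 3, 5, 5, 6, 7])
print(odd_median, even_median)

# Cross-check against the standard library.
print(statistics.median([1, 3, 3, 3, 5, 5, 6, 7]))
```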
Chapter 2
Slide 31
Psy B07
Measures of Central Tendency
• Finally, the most commonly used measure of central tendency is called the mean (denoted $\bar{X}$ for a sample, and $\mu$ for a population).
• The mean is the same as what most of us call the average, and it is calculated in the following manner:
$\bar{X} = \frac{\sum X}{N}$
Chapter 2
Slide 32
Psy B07
Measures of Central Tendency
• For example, given the data set that we used to calculate the median (odd number example), the corresponding mean would be:
$\bar{X} = \frac{\sum X}{N} = \frac{45}{9} = 5$
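The same calculation in Python, as a minimal sketch with arbitrary variable names:

```python
# The odd-number example data set used for the median.
X = [1, 3, 3, 4, 4, 5, 6, 7, 12]

sum_x = sum(X)   # numerator: the sum of all the values
n = len(X)       # denominator: the number of values
mean = sum_x / n

print(sum_x, n, mean)
```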
Chapter 2
Slide 33
Psy B07
Measures of Central Tendency
• When a distribution is fairly symmetrical, the mean, median, and mode will be quite similar.
• However, when the underlying distribution is not symmetrical, the three measures of central tendency can be quite different.
Chapter 2
Slide 34
Psy B07
Measures of Central Tendency
• This raises the issue of which measure is best.
Example: Pizza Eating

Value  Freq    Value  Freq
0      4       8      5
1      2       10     2
2      8       15     1
3      6       16     1
4      6       20     1
5      6       40     1
6      5

Mode = 2 slices per week
Median = 4 slices per week
Mean = 5.7 slices per week
• Note that if you were calculating these values, you would show all your steps (it’s good to be a prof!).
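One way to show the steps is to expand the frequency table back into raw scores and compute each measure. The sketch below assumes the pizza table is read as thirteen value/frequency pairs (0 through 6, then 8, 10, 15, 16, 20, 40); the names are hypothetical. Under that reading the mean works out to about 5.65, close to the 5.7 quoted on the slide.

```python
import statistics

# Pizza-eating frequency table: slices per week -> number of people.
pizza_freq = {0: 4, 1: 2, 2: 8, 3: 6, 4: 6, 5: 6, 6: 5,
              8: 5, 10: 2, 15: 1, 16: 1, 20: 1, 40: 1}

# Expand the table into one raw score per person.
slices = [value for value, freq in pizza_freq.items() for _ in range(freq)]

mode = statistics.mode(slices)      # most frequent value
median = statistics.median(slices)  # middle of the ordered scores
mean = statistics.mean(slices)      # sum of scores divided by N

print(len(slices), mode, median, round(mean, 2))
```

Note how the few very large values (15 to 40 slices) pull the mean above the median, while the mode stays at the most typical value.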
Chapter 2
Slide 35
Psy B07
Measures of Central Tendency
• Here is a demonstration that allows you to change a frequency histogram while simultaneously noting the effects of those changes on the mean versus the median.
• As you use the demo, you should easily be able to think about how these changes are also affecting the mode, right?
Chapter 2
Slide 36
Psy B07
Measures of Variability
• In addition to knowing where the centre of the distribution is, it is often helpful to know the degree to which individual values cluster around the centre.
• This is known as variability.
Chapter 2
Slide 37
Psy B07
Measures of Variability
• There are various measures of variability, the most straightforward being the range of the sample:
Range = highest value minus lowest value
• While the range provides a good first pass at variability, it is not the best measure because of its sensitivity to extreme scores (see text).
Chapter 2
Slide 38
Psy B07
Measures of Variability
• One approach to estimating variability is to directly measure the degree to which individual data points differ from the mean and then average those deviations.
• This is known as the average deviation:
$\frac{\sum (X - \bar{X})}{N}$
Chapter 2
Slide 39
Psy B07
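Measures of Variability
• However, if we try to do this with real data, the result will always be zero, because the positive and negative deviations cancel out:
Example: (2, 3, 3, 4, 4, 6, 6, 12), with $\bar{X} = 5$
$\frac{\sum (X - \bar{X})}{N} = \frac{\sum (-3, -2, -2, -1, -1, 1, 1, 7)}{8} = \frac{0}{8} = 0$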
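The cancellation can be seen directly by computing the deviations. This is a minimal sketch with hypothetical variable names:

```python
X = [2, 3, 3, 4, 4, 6, 6, 12]

mean = sum(X) / len(X)

# Signed deviation of each data point from the mean.
deviations = [x - mean for x in X]

# Averaging the signed deviations: negatives and positives cancel.
average_deviation = sum(deviations) / len(X)

print(mean, deviations, average_deviation)
```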
Chapter 2
Slide 40
Psy B07
Measures of Variability
• One way to get around the problem with the average deviation is to use the absolute value of the differences, instead of the differences themselves.
• The absolute value of some number is just the number without any sign:
For example: |-3| = 3
And: |+3| = 3
Chapter 2
Slide 41
Psy B07
Measures of Variability
• Thus, we could re-write and solve our average deviation question as follows:
$\text{MAD} = \frac{\sum |X - \bar{X}|}{N} = \frac{3 + 2 + 2 + 1 + 1 + 1 + 1 + 7}{8} = \frac{18}{8} = 2.25$
• Therefore, this data set has a mean of 5 and a MAD of 2.25.
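A short sketch of the mean absolute deviation (MAD) calculation, using Python's built-in `abs`; the variable names are arbitrary:

```python
X = [2, 3, 3, 4, 4, 6, 6, 12]

mean = sum(X) / len(X)

# Absolute deviations drop the sign, so they can no longer cancel.
abs_deviations = [abs(x - mean) for x in X]

mad = sum(abs_deviations) / len(X)

print(abs_deviations, sum(abs_deviations), mad)
```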
Chapter 2
Slide 42
Psy B07
Measures of Variability
• Although the MAD is an acceptable measure of variability, the most commonly used measure is the variance (denoted $s^2$ for a sample and $\sigma^2$ for a population) and its square root, termed the standard deviation (denoted $s$ for a sample and $\sigma$ for a population).
Chapter 2
Slide 43
Psy B07
Measures of Variability
• The computation of variance is also based on the basic notion of the average deviation. However, instead of getting around the “zero problem” by using absolute deviations (as in MAD), the “zero problem” is eliminated by squaring the differences from the mean:
$\sigma^2 = \frac{\sum (X - \bar{X})^2}{N}$
Slide 44
Psy B07
Measures of Variability
• Example: (2, 3, 4, 4, 4, 5, 6, 12)
$\sigma^2 = \frac{\sum (X - \bar{X})^2}{N} = \frac{9 + 4 + 1 + 1 + 1 + 0 + 1 + 49}{8} = 8.25$
Chapter 2
Slide 45
Psy B07
Measures of Variability
• To convert the variance into the SD, we simply take its square root:
$\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N}} = \sqrt{\frac{9 + 4 + 1 + 1 + 1 + 0 + 1 + 49}{8}} = \sqrt{8.25} = 2.87$
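The variance and standard deviation steps can be sketched as follows, using the example data set (2, 3, 4, 4, 4, 5, 6, 12) and N in the denominator. The variable names are hypothetical; `statistics.pvariance` is the standard-library population variance, used here only as a cross-check.

```python
import math
import statistics

X = [2, 3, 4, 4, 4, 5, 6, 12]

mean = sum(X) / len(X)

# Squaring each deviation removes the sign (and weights large deviations more).
squared_deviations = [(x - mean) ** 2 for x in X]

variance = sum(squared_deviations) / len(X)  # N in the denominator
sd = math.sqrt(variance)

print(squared_deviations, variance, round(sd, 2))
print(statistics.pvariance(X))  # cross-check
```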
Chapter 2
Slide 46
Psy B07
Measures of Variability
• This demonstration allows you to play with the mean and standard deviation of a distribution. Note that changing the mean of the distribution simply moves the entire distribution to the left or right without changing its shape. In contrast, changing the standard deviation alters the spread of the data but does not affect where the distribution is “centered”.
DEMO
Chapter 2
Slide 47
Psy B07
Measures of Variability
• Population vs. Sample
• As mentioned, we usually deal with statistics, not parameters. $\sigma^2$ and $\sigma$ are parameters. Their counterparts when dealing with samples are $s^2$ and $s$. The formulae are slightly different:
$s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$
$s = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}$
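To see how the two denominators differ in practice, the sketch below treats the earlier example data set (2, 3, 4, 4, 4, 5, 6, 12) as a sample; that choice of data is an assumption made for illustration, and the variable names are hypothetical. `statistics.variance` is the standard-library sample variance (N - 1 denominator), used as a cross-check.

```python
import math
import statistics

X = [2, 3, 4, 4, 4, 5, 6, 12]
n = len(X)
mean = sum(X) / n

# Sum of squared deviations from the sample mean.
sum_sq = sum((x - mean) ** 2 for x in X)

pop_variance = sum_sq / n            # population formula: N
sample_variance = sum_sq / (n - 1)   # sample formula: N - 1
sample_sd = math.sqrt(sample_variance)

print(pop_variance, round(sample_variance, 2), round(sample_sd, 2))
print(statistics.variance(X))  # cross-check for the N - 1 version
```

Dividing by N - 1 always gives a slightly larger value than dividing by N, and the gap shrinks as the sample gets larger.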
Slide 48
Psy B07
Properties of a Statistic
• So, the mean ($\bar{X}$) and variance ($s^2$) are the descriptive statistics that are most commonly used to represent the data points of some sample.
• The real reason that they are the preferred measures of central tendency and variability is that they have certain properties as estimators of their corresponding population parameters, $\mu$ and $\sigma^2$.
Chapter 2
Slide 49
Psy B07
Properties of a Statistic
• Four properties are considered desirable in a population estimator: sufficiency, unbiasedness, efficiency, and resistance.
• Both the mean and the variance are the best estimators in their class in terms of the first three of these four properties.
• To understand these properties, you first need to understand a concept in statistics called the sampling distribution.
Chapter 2
Slide 50
Psy B07
Properties of a Statistic
• We will discuss sampling distributions off and on throughout the course, and I only want to touch on the notion now.
• Basically, the idea is this: in order to examine the properties of a statistic, we often want to take repeated samples from some population of data and calculate the relevant statistic on each sample. We can then look at the distribution of the statistic across these samples and ask a variety of questions about it.
• Check out this demonstration, which I hope makes the concept of sampling distributions clearer.
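The repeated-sampling idea can also be sketched as a small simulation. Everything specific here is an assumption made for illustration and does not come from the slides: the simulated population (10,000 normally distributed scores with mean 100 and SD 15), the sample size of 25, the 2,000 repetitions, and all variable names.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population of 10,000 scores.
population = [random.gauss(100, 15) for _ in range(10_000)]

SAMPLE_SIZE = 25
N_SAMPLES = 2_000

# Take repeated samples and calculate the statistic (here, the mean) on each.
sample_means = [
    statistics.mean(random.sample(population, SAMPLE_SIZE))
    for _ in range(N_SAMPLES)
]

# The collection of sample means is an approximation of the
# sampling distribution of the mean; we can now ask questions about it.
print("population mean:        ", statistics.mean(population))
print("mean of sample means:   ", statistics.mean(sample_means))
print("SD of sample means:     ", statistics.stdev(sample_means))
```

The same loop works for any statistic: replacing `statistics.mean` with `statistics.median` or `statistics.variance` gives an approximate sampling distribution for that statistic instead.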
Chapter 2
Slide 51
Psy B07
Properties of a Statistic
1) Sufficiency
• A sufficient statistic is one that makes use of all of the information in the sample to estimate its corresponding parameter.
Chapter 2
Slide 52
Psy B07
Properties of a Statistic
2) Unbiasedness
• A statistic is said to be an unbiased estimator if its expected value (i.e., the mean of the statistic across a very large number of samples; for the mean, this is the mean of the sample means) is equal to the population parameter it is estimating.
• Explanation of N-1 in $s^2$ formula.
Chapter 2
Slide 53
Psy B07
Properties of a Statistic
• Using this procedure, the mean can be shown to be an unbiased estimator (see p 47).
• However, if the $\sigma^2$ formula (with N in the denominator) is used to calculate $s^2$, it turns out to underestimate $\sigma^2$.
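The underestimation can be illustrated with a simulation. This is a sketch under assumed settings, none of which come from the slides: a simulated population of 10,000 normally distributed scores, samples of size 5, and 5,000 repetitions. `statistics.pvariance` divides by N and `statistics.variance` divides by N - 1.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population and its true variance (sigma squared).
population = [random.gauss(50, 10) for _ in range(10_000)]
sigma2 = statistics.pvariance(population)

SAMPLE_SIZE = 5
N_SAMPLES = 5_000

n_denominator = []          # variance estimates using N
n_minus_1_denominator = []  # variance estimates using N - 1
for _ in range(N_SAMPLES):
    sample = random.sample(population, SAMPLE_SIZE)
    n_denominator.append(statistics.pvariance(sample))
    n_minus_1_denominator.append(statistics.variance(sample))

# Average each estimator across all of the samples.
mean_n = statistics.mean(n_denominator)
mean_n_minus_1 = statistics.mean(n_minus_1_denominator)

print("population variance:      ", sigma2)
print("average estimate (N):     ", mean_n)
print("average estimate (N - 1): ", mean_n_minus_1)
```

With samples this small, the N version should average well below the population variance, while the N - 1 version should land close to it.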
Chapter 2
Slide 54
Psy B07
Properties of a Statistic
• The reason for this bias is that, when we calculate $s^2$, we use $\bar{X}$, an estimator of the population mean.
• The chances of $\bar{X}$ being EXACTLY the same as $\mu$ are virtually nil, and the squared deviations taken around $\bar{X}$ never sum to more than those taken around $\mu$, which results in the bias.
• To compensate, we use N-1.
• Note that this is only true when calculating $s^2$; if you have a measurable population and you want to calculate $\sigma^2$, you use N in the denominator, not N-1.
Chapter 2
Slide 55
Psy B07
Properties of a Statistic
• Degrees of Freedom
• The mean of 6, 8, & 10 is 8.
• If I allow you to change as many of these numbers as you want BUT the mean must stay 8, how many of the numbers are you free to vary?
Chapter 2
Slide 56
Psy B07
Properties of a Statistic
• The point of this exercise is that when the mean is fixed, it removes a degree of freedom from your sample; this is like actually subtracting 1 from the number of observations in your sample.
• It is for exactly this reason that we use N-1 in the denominator when we calculate $s^2$ (i.e., the calculation requires that the mean be fixed first, which effectively removes, or fixes, one of the data points).
Chapter 2
Slide 57
Psy B07
Properties of a Statistic
3) Efficiency
• The efficiency of a statistic is reflected in the variance that is observed when one examines the values of that statistic (e.g., the means) across many independently chosen samples. The smaller the variance, the more efficient the statistic is said to be.
Chapter 2
Slide 58
Psy B07
Properties of a Statistic
4) Resistance
• The resistance of an estimator refers to the degree to which that estimate is affected by extreme values (the less it is affected, the more resistant it is).
• As mentioned previously, both $\bar{X}$ and $s^2$ are highly sensitive to extreme values.
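A quick way to see this sensitivity is to compute the statistics with and without a single extreme score. The sketch below reuses the variance example data set, treating the value 12 as the extreme score; that framing and the variable names are assumptions made for illustration. The median is included only as a point of comparison.

```python
import statistics

scores = [2, 3, 4, 4, 4, 5, 6]
with_extreme = scores + [12]  # add one extreme value

for label, data in (("without 12", scores), ("with 12", with_extreme)):
    print(
        label,
        "mean =", statistics.mean(data),
        "median =", statistics.median(data),
        "s^2 =", round(statistics.variance(data), 2),
    )
```

One extreme score shifts the mean and inflates the sample variance, while the median of this data set does not move.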
Chapter 2
Slide 59
Psy B07
Properties of a Statistic
4) Resistance
• Despite this, they are still the most commonly used estimates of the corresponding population parameters, mostly because of their superiority over other measures in terms of sufficiency, unbiasedness, and efficiency.
Chapter 2
Slide 60