introduction to Statistics
Introduction to statistics I
Sophia King
Rm. P24 HWB
[email protected]
Using statistics in Psychology
Carrying out psychological research involves collecting
data. Statistics are a way of making use of these data.
Descriptive Statistics: used to describe characteristics of
our sample
• Statistics describe samples
Inferential Statistics: used to generalise from our sample
to our population
• Parameters describe populations
Any samples used should therefore be representative of
the target population
Descriptive Statistics
Statistical procedures used to summarise, organise, and
simplify data. This should be done in a way that
reflects the overall findings
Raw data is made more manageable
Raw data is presented in a logical form
Patterns can be seen from organised data
• Frequency tables
• Graphical techniques
• Measures of Central Tendency
• Measures of Spread (variability)
Plotting Data: describing spread of data
A researcher is investigating short-term memory capacity:
the number of symbols remembered is recorded for 20
participants:
4, 6, 3, 7, 5, 7, 8, 4, 5, 10
10, 6, 8, 9, 3, 5, 6, 4, 11, 6
We can describe our data by using a Frequency
Distribution. This can be presented as a table or a graph.
It always presents:
• The set of categories that make up the original measurement scale
• The frequency of each score/category
Three important characteristics: shape, central tendency,
and variability
Frequency Distribution Tables
 X    f    fX
11    1    11
10    2    20
 9    1     9
 8    2    16
 7    2    14
 6    4    24
 5    3    15
 4    3    12
 3    2     6
Highest score is placed at the top
All observed scores are listed
Gives information about distribution, variability, and centrality
X = score value
f = frequency
fX = total value associated with each frequency
$\Sigma f = N$
$\Sigma X = \Sigma fX$
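The table above can be built directly from the raw scores. The following is a minimal Python sketch, assuming the 20 memory scores are held in a list; the variable names are illustrative and not part of the original slides.

```python
from collections import Counter

# The 20 memory scores from the example
scores = [4, 6, 3, 7, 5, 7, 8, 4, 5, 10,
          10, 6, 8, 9, 3, 5, 6, 4, 11, 6]

# Count how often each score value X occurs (f)
freq = Counter(scores)

# List from the highest score down, as in the table
print(f"{'X':>3} {'f':>3} {'fX':>4}")
for x in sorted(freq, reverse=True):
    print(f"{x:>3} {freq[x]:>3} {x * freq[x]:>4}")

n = sum(freq.values())                        # sum of f, which equals N
sum_x = sum(x * f for x, f in freq.items())   # sum of fX, which equals sum of X
print("N =", n, " sum of X =", sum_x)
```

Summing the f column recovers the number of participants, and summing the fX column recovers the total of all scores.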
Frequency Table Additions
 X    f    fX     p      %
11    1    11    0.05    5%
10    2    20    0.10   10%
 9    1     9    0.05    5%
 8    2    16    0.10   10%
 7    2    14    0.10   10%
 6    4    24    0.20   20%
 5    3    15    0.15   15%
 4    3    12    0.15   15%
 3    2     6    0.10   10%
Frequency tables can display more detailed information about the distribution
Percentages and proportions
p = fraction of the total group associated with each score (relative frequency)
$p = f/N$
As a percentage: $p(100) = 100(f/N)$
What does this tell us about this distribution of scores?
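As a rough illustration of the proportion and percentage columns, here is a short Python sketch that applies p = f/N to the memory scores. The names `rows`, `p`, and `pct` are assumptions made for this example.

```python
from collections import Counter

scores = [4, 6, 3, 7, 5, 7, 8, 4, 5, 10,
          10, 6, 8, 9, 3, 5, 6, 4, 11, 6]
freq = Counter(scores)
n = len(scores)

rows = []
for x in sorted(freq, reverse=True):
    p = freq[x] / n      # proportion (relative frequency): p = f/N
    pct = 100 * p        # percentage: 100(f/N)
    rows.append((x, freq[x], p, pct))
    print(f"{x:>3} {freq[x]:>3} {p:>5.2f} {pct:>4.0f}%")
```

The proportions always sum to 1 (and the percentages to 100), which is a quick check that no score has been missed.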
Grouped Frequency Distribution Tables
X        f
95-99    1
90-94    1
85-89    0
80-84    1
75-79    2
70-74    4
65-69    7
60-64    0
55-59    6
50-54    3
Sometimes the spread of data is too wide to list every score
Grouped tables present scores as class intervals
Aim for about 10 intervals
The interval width should be a simple round number (2, 5, 10, etc.), and all intervals should have the same width
The bottom score of each interval should be a multiple of the width
Class intervals represent a continuous variable X:
E.g. a score of 51 is bounded by real limits of 50.5-51.5
If X is 8 and f is 3, this does not mean all three have exactly the same score: they all fell somewhere between 7.5 and 8.5
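The grouping rules can be sketched in Python. The raw scores behind the grouped table are not given, so this sketch groups the 20 memory scores instead, with an assumed interval width of 2 chosen purely for illustration (data this narrow would not normally need grouping).

```python
from collections import Counter

scores = [4, 6, 3, 7, 5, 7, 8, 4, 5, 10,
          10, 6, 8, 9, 3, 5, 6, 4, 11, 6]
width = 2  # assumed interval width, for illustration only

# Bottom score of each interval is the largest multiple of width <= score
lower_limits = Counter((x // width) * width for x in scores)

grouped = []
lowest, highest = min(lower_limits), max(lower_limits)
for lo in range(highest, lowest - 1, -width):
    hi = lo + width - 1
    f = lower_limits.get(lo, 0)   # empty intervals are still listed
    # Real limits extend half a unit beyond the apparent limits
    grouped.append((lo, hi, f, lo - 0.5, hi + 0.5))
    print(f"{lo}-{hi}: f = {f} (real limits {lo - 0.5}-{hi + 0.5})")
```

Because each bottom score is computed as a multiple of the width, the intervals are equal in width and never overlap.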
Percentiles and Percentile Ranks
 X    f    cf    c%
11    1    20   100%
10    2    19    95%
 9    1    17    85%
 8    2    16    80%
 7    2    14    70%
 6    4    12    60%
 5    3     8    40%
 4    3     5    25%
 3    2     2    10%
X values = raw scores, without context
Percentile rank = the percentage of the sample with scores at or below the particular value
This can be represented by a cumulative frequency (cf) column
Cumulative percentage is obtained by:
$c\% = \frac{cf}{N}(100)$
This gives information about relative position in the data distribution
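The cumulative columns can be computed by keeping a running total of f from the lowest score upwards. This is a minimal Python sketch on the memory scores; `cf` and `c_pct` are names chosen for this example.

```python
from collections import Counter
from itertools import accumulate

scores = [4, 6, 3, 7, 5, 7, 8, 4, 5, 10,
          10, 6, 8, 9, 3, 5, 6, 4, 11, 6]
freq = Counter(scores)
n = len(scores)

values = sorted(freq)                             # lowest to highest
cf = list(accumulate(freq[x] for x in values))    # running total of f
c_pct = {x: 100 * c / n for x, c in zip(values, cf)}  # c% = (cf/N)(100)

# Print from the highest score down, as in the table
for x in reversed(values):
    print(f"X = {x:>2}  c% = {c_pct[x]:.0f}%")
```

Read this way, a score of 6 has a percentile rank of 60%: 12 of the 20 participants scored 6 or below.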
Representing data as graphs
Frequency Distribution Graph presents all the info available in a Frequency Table (can be fitted to a grouped frequency table)
Uses Histograms
[Figure: histogram of the memory scores; x-axis: memory score, y-axis: Frequency]
Bar width corresponds to real limits of intervals
Histograms can be modified to include blocks representing individual scores
[Figure: histogram of grouped scores; x-axis: score, y-axis: Frequency]
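A histogram can be approximated in plain text, with one block per individual score. The Python sketch below draws it sideways (scores down the side, frequency as bar length), which is an assumption made for simplicity and not the layout used on the slide.

```python
from collections import Counter

scores = [4, 6, 3, 7, 5, 7, 8, 4, 5, 10,
          10, 6, 8, 9, 3, 5, 6, 4, 11, 6]
freq = Counter(scores)

lines = []
for x in range(min(scores), max(scores) + 1):
    bar = "#" * freq[x]          # one '#' block per participant with score x
    lines.append(f"{x:>2} | {bar}")

print("\n".join(lines))
```

Each bar has no gap to its neighbour, reflecting that histograms treat the scores as points on a continuous scale.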
Frequency Distribution Polygons
Shows same information with lines: traces ‘shape’ of distribution
Both histograms and polygons represent continuous data
[Figure: frequency distribution polygon; x-axis: scores 1 to 11, y-axis: frequency]
For non-numerical data, frequency distribution can be represented by bar graphs
Bar graphs have spaces between adjacent bars to represent distinct categories
[Figure: bar graph; categories: phone numbers, historical dates, family dates]
Frequencies of Populations and Samples
Population
All the individuals of interest to the study
Sample
The particular group of participants you are testing:
selected from the population
Although it is possible to have graphs of population distributions, unlike graphs of sample distributions, exact frequencies are not normally possible. However, you can:
• Display graphs of relative frequencies (categorical data)
• Use smooth curves to indicate relative frequencies (interval or ratio data)
Frequency Distribution: the Normal Distribution
Bell-shaped: specific shape that can be defined as an equation
Symmetrical around the midpoint, where the greatest frequency of scores occurs
The tails of the perfect curve are asymptotic: they never quite meet the horizontal axis
Normal distribution is an assumption of parametric testing
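The slide notes that the bell shape can be defined as an equation but does not state it. For reference, the standard form of the normal density is given below, with $\mu$ the population mean and $\sigma$ the population standard deviation; it is added here as background and is not taken from the slides.

```latex
f(X) = \frac{1}{\sigma\sqrt{2\pi}} \, \exp\!\left( -\frac{(X - \mu)^2}{2\sigma^2} \right)
```

The curve peaks at $X = \mu$, is symmetrical because $(X - \mu)^2$ is the same on either side of the mean, and approaches but never reaches zero as $X$ moves away from $\mu$.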
Frequency Distribution: Different Distribution shapes
Measures of Central Tendency
A way of summarising the data using a single value that is
in some way representative of the entire data set
It is not always possible to follow the same procedure in
producing a central representative value: this changes
with the shape of the distribution
Mode
Most frequent value
Does not take into account exact scores
Unaffected by extreme scores
Not useful when there are several values that occur
equally often in a set
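The mode, and its weakness when several values tie, can be shown with Python's standard `statistics` module. The second data set (`tied`) is a hypothetical example invented for this sketch.

```python
from statistics import multimode

scores = [4, 6, 3, 7, 5, 7, 8, 4, 5, 10,
          10, 6, 8, 9, 3, 5, 6, 4, 11, 6]

# multimode returns every value that shares the highest frequency
modes = multimode(scores)
print(modes)

# Hypothetical set where every value occurs equally often
tied = [1, 1, 2, 2, 3, 3]
tied_modes = multimode(tied)
print(tied_modes)
```

For the memory scores there is a single mode, but in the tied set every value qualifies, so the mode says nothing useful about the centre.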
Measures of Central Tendency
Median
The value that falls exactly at the midpoint of a ranked
distribution
Does not take into account exact scores
Unaffected by extreme scores
In a small set it can be unrepresentative
Mean (Arithmetic average)
Sample mean: $M = \frac{\Sigma X}{n}$
Population mean: $\mu = \frac{\Sigma X}{N}$
Takes into account all values
Easily distorted by extreme values
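To illustrate how an extreme value distorts the mean but not the median, the Python sketch below replaces the highest memory score (11) with a hypothetical extreme score of 60. That value is invented for the example.

```python
from statistics import mean, median

scores = [4, 6, 3, 7, 5, 7, 8, 4, 5, 10,
          10, 6, 8, 9, 3, 5, 6, 4, 11, 6]

# Hypothetical extreme score: replace the single 11 with 60
with_extreme = [60 if x == 11 else x for x in scores]

print("original:     mean =", mean(scores), " median =", median(scores))
print("with extreme: mean =", mean(with_extreme), " median =", median(with_extreme))
```

The mean is pulled upwards by the one extreme score, while the median stays where it was because only the rank order of the scores matters to it.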
Measures of Central Tendency
For our set of memory scores:
4, 6, 3, 7, 5, 7, 8, 4, 5,10
10, 6, 8, 9, 3, 5, 6, 4, 11, 6
Mode = 6: Median = 6: Mean = 6.35
The mean is the preferred measure of central tendency,
except when:
• There are extreme scores or skewed distributions
• The data are not interval data
• The variables are discrete
Central Tendencies and Distribution Shape
Describing Variability
Describes in an exact quantitative measure, how spread
out/clustered together the scores are
Variability is usually defined in terms of distance
How far apart scores are from each other
How far apart scores are from the mean
How representative a score is of the data set as a whole
Describing Variability: the Range
Simplest and most obvious way of describing variability
Range = Highest - Lowest
The range only takes into account the two extreme scores
and ignores any values in between. To counter this, the
distribution is divided into quarters (quartiles):
Q1 = 25%, Q2 = 50%, Q3 = 75%
• The Interquartile range: the distance spanned by the middle two
quarters of the distribution (Q3 - Q1)
• The Semi-Interquartile range: one half of the
Interquartile range
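The range and quartile-based measures can be sketched in Python for the memory scores. Quartiles are computed here by taking the median of each half of the ranked scores, which is only one of several conventions in use; other conventions can give slightly different values.

```python
from statistics import median

scores = sorted([4, 6, 3, 7, 5, 7, 8, 4, 5, 10,
                 10, 6, 8, 9, 3, 5, 6, 4, 11, 6])

score_range = max(scores) - min(scores)   # Range = Highest - Lowest

# Median-of-halves convention (an assumption of this sketch)
half = len(scores) // 2
q1 = median(scores[:half])     # median of the lower half
q2 = median(scores)            # overall median
q3 = median(scores[-half:])    # median of the upper half

iqr = q3 - q1                  # Interquartile range
semi_iqr = iqr / 2             # Semi-Interquartile range
print(score_range, q1, q2, q3, iqr, semi_iqr)
```

Unlike the range, the interquartile range ignores the top and bottom quarters of the scores, so a single extreme score does not change it.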
Describing Variability: Deviation
A more sophisticated measure of variability is one that
shows how scores cluster around the mean
Deviation is the distance of a score from the mean
$X - \mu$, e.g. 11 - 6.35 = 4.65, 3 - 6.35 = -3.35
A measure representative of the variability of all the scores
would be the mean of the deviation scores
Add all the deviations and divide by n: $\frac{\Sigma (X - \mu)}{n}$
• However the deviation scores add up to zero (as mean
serves as balance point for scores)
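A quick Python check on the memory scores shows why the mean deviation is not usable as it stands. The comparison against a small tolerance is there because floating-point arithmetic may leave a tiny rounding residue in place of an exact zero.

```python
scores = [4, 6, 3, 7, 5, 7, 8, 4, 5, 10,
          10, 6, 8, 9, 3, 5, 6, 4, 11, 6]

m = sum(scores) / len(scores)          # mean = 6.35
deviations = [x - m for x in scores]   # X minus the mean, for every score
total = sum(deviations)

# Positive and negative deviations cancel out
print("sum of deviations is zero:", abs(total) < 1e-9)
```

Since the total is always zero, dividing it by n gives zero for any data set, whatever its spread.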
Describing Variability: Variance
  X    $X - \mu$    $(X - \mu)^2$
  3      -3.35         11.22
  3      -3.35         11.22
  4      -2.35          5.52
  4      -2.35          5.52
  4      -2.35          5.52
  5      -1.35          1.82
  5      -1.35          1.82
  5      -1.35          1.82
  6      -0.35          0.12
  6      -0.35          0.12
  6      -0.35          0.12
  6      -0.35          0.12
  7       0.65          0.42
  7       0.65          0.42
  8       1.65          2.72
  8       1.65          2.72
  9       2.65          7.02
 10       3.65         13.32
 10       3.65         13.32
 11       4.65         21.62
Sum       0           106.55
To remove the +/- signs we simply square each deviation before finding the average. This is called the Variance:
$\frac{\Sigma (X - \mu)^2}{n} = \frac{106.55}{20} = 5.33$
The numerator is referred to as the Sum of Squares (SS), as it refers to the sum of the squared deviations around the mean value
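The Sum of Squares and the variance from the table can be reproduced with a few lines of Python. This is a sketch of the calculation as described, dividing SS by the number of scores; the variable names are illustrative.

```python
scores = [4, 6, 3, 7, 5, 7, 8, 4, 5, 10,
          10, 6, 8, 9, 3, 5, 6, 4, 11, 6]

n = len(scores)
m = sum(scores) / n

# Sum of Squares: square each deviation, then add them up
ss = sum((x - m) ** 2 for x in scores)

# Variance: the mean of the squared deviations
variance = ss / n
print(round(ss, 2), round(variance, 2))
```

Squaring makes every deviation positive, so the contributions no longer cancel and larger deviations count for more.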
Describing Variability: Population Variance
Population variance is designated by $\sigma^2$
$\sigma^2 = \frac{\Sigma (X - \mu)^2}{N} = \frac{SS}{N}$
Sample Variance is designated by s²
Samples tend to be less variable than the populations they are
drawn from: they therefore give biased estimates of population variability
Degrees of Freedom (df): the number of independent (free
to vary) scores. In a sample, the sample mean must be
known before the variance can be calculated, so once the mean
is fixed the final score is determined by the other scores: df = n - 1
$s^2 = \frac{\Sigma (X - M)^2}{n - 1} = \frac{SS}{n - 1} = \frac{106.55}{20 - 1} = 5.61$
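Python's standard `statistics` module provides both versions of the variance, which makes the effect of dividing by n - 1 easy to see. A minimal sketch on the memory scores:

```python
from statistics import pvariance, variance

scores = [4, 6, 3, 7, 5, 7, 8, 4, 5, 10,
          10, 6, 8, 9, 3, 5, 6, 4, 11, 6]

pop_var = pvariance(scores)    # population formula: SS / N
samp_var = variance(scores)    # sample formula: SS / (n - 1)

print(round(pop_var, 2), round(samp_var, 2))
```

Dividing by n - 1 always gives the larger value, which compensates for the tendency of samples to underestimate population variability.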
Describing Variability: the Standard Deviation
Variance is a measure based on squared distances
In order to get around this, we can take the square root of
the variance, which gives us the standard deviation
Population ($\sigma$) and Sample ($s$) standard deviation
$\sigma = \sqrt{\frac{\Sigma (X - \mu)^2}{N}}$
$s = \sqrt{\frac{\Sigma (X - M)^2}{n - 1}}$
So for our memory score example we simply take the square root of the sample variance:
$s = \sqrt{5.61} = 2.37$
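The same standard-library module gives the standard deviation directly. The sketch below computes the sample and population versions for the memory scores; `s` and `sigma` are names chosen to match the symbols used in these notes.

```python
from statistics import pstdev, stdev

scores = [4, 6, 3, 7, 5, 7, 8, 4, 5, 10,
          10, 6, 8, 9, 3, 5, 6, 4, 11, 6]

s = stdev(scores)        # sample SD: square root of SS / (n - 1)
sigma = pstdev(scores)   # population SD: square root of SS / N

print(round(s, 2), round(sigma, 2))
```

Taking the square root returns the measure to the original units, so the result can be read as a typical distance from the mean in symbols remembered.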
Describing Variability
The standard deviation is the most common measure of
variability, but the others can be used. A good measure of
variability must be stable and reliable: it should not be
greatly affected by little details in the data.
Factors to consider:
• Extreme scores
• Multiple sampling from the same population
• Open-ended distributions
Both the variance and SD are related to other statistical
techniques
Descriptive statistics
A researcher is investigating short-term memory capacity:
the number of symbols remembered is recorded for 20
participants:
4, 6, 3, 7, 5, 7, 8, 4, 5, 10
10, 6, 8, 9, 3, 5, 6, 4, 11, 6
What statistics can we display about this data, and what
do they mean?
Frequency table: show how often different scores occur
Frequency graph: information about the shape of the
distribution
Measures of central tendency and variability
Descriptive statistics
 X    f    fX     p      %
11    1    11    0.05    5%
10    2    20    0.10   10%
 9    1     9    0.05    5%
 8    2    16    0.10   10%
 7    2    14    0.10   10%
 6    4    24    0.20   20%
 5    3    15    0.15   15%
 4    3    12    0.15   15%
 3    2     6    0.10   10%
[Figure: frequency distribution graph of the memory scores; frequency axis 0 to 5, score axis 1 to 12]
References and Further Reading
Gravetter & Wallnau: Chapters 2, 3 and 4