Week 1 Review of basic concepts in Statistics
handout available at
http://homepages.gold.ac.uk/aphome
Trevor Thompson
30-9-2007
Review of following topics:
Population vs. sample
Measurement scales
Plotting data
Mean & Standard deviation
Degrees of freedom
Transforming data
Normal distribution
Reading: Howell (2002), ‘Statistical Methods for Psychology’, Chap. 1-3
Population vs. sample
Population - an entire collection of
measurements
(e.g. reaction times, IQ scores, height or even
height of male Goldsmiths students)
Sample – smaller subset of observations
taken from population
A sample should be drawn randomly if we want to make
inferences about the population. Random
assignment to groups improves the validity of group comparisons
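As a rough illustration of the two ideas above (random sampling and random assignment), here is a minimal Python sketch. The population of heights, the seed, and the sample and group sizes are all made up for the example; they are not from the handout.

```python
import random

# Fixed seed so the sketch is repeatable (an arbitrary choice).
rng = random.Random(42)

# Hypothetical population: 1000 simulated heights in cm (assumed values).
population = [rng.gauss(175, 7) for _ in range(1000)]

# Random sampling: every member has the same chance of being selected.
sample = rng.sample(population, 20)

# Random assignment: shuffle the sample, then split it into two groups.
shuffled = sample[:]
rng.shuffle(shuffled)
group_a, group_b = shuffled[:10], shuffled[10:]

print(len(sample), len(group_a), len(group_b))
```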
Population vs. sample
In general:
population parameters = Greek letters
sample statistics = Roman (English) letters

            Population            Sample
mean        μ (mu)                X̄ (X-bar)
variance    σ² (sigma squared)    s²
-worth learning glossary of other symbols now to
avoid later confusion (e.g. Σ=the sum of)
Measurement scales
Categorical or ‘Nominal’
e.g. male/female, or catholic/protestant/other
Continuous
Ordinal - e.g. private/sergeant/admiral
Interval - e.g. temperature in Celsius
Ratio - e.g. weight, height etc
Plotting data
The basic rule is to select the plot that
represents what you want to say in the
clearest and simplest way
Avoid ‘chart junk’ (e.g. plotting in 3D
where 2D would be clearer)
Popular options include bar charts,
histograms, pie charts etc. - see any
textbook. SPSS charts are discussed in the workshop
Summary statistics
Two essential components of data are:
(i) central tendency of the data &
(ii) spread of the data (e.g. standard deviation)
Although mean (central tendency) and
standard deviation (spread) are most
commonly used, other measures can also
be useful
Measures of central tendency
Mode
the most frequent observation, e.g. 2 in: 1, 2, 2, 3, 4, 5
Median
the middle number of a dataset arranged in numerical
order, e.g. 2 in: 0, 1, 2, 5, 1000
(average of the middle two numbers when there is an even number of scores)
relatively uninfluenced by outliers (the mean of the same data is 201.6)
Mean = ΣX / N (the sum of all scores divided by the number of scores)
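The three measures can be checked quickly with Python's standard `statistics` module. This is only a sketch using the two small example datasets from this slide; the variable names are mine.

```python
import statistics

mode_data = [1, 2, 2, 3, 4, 5]
median_data = [0, 1, 2, 5, 1000]   # 1000 is an outlier

mode_value = statistics.mode(mode_data)          # most frequent score
median_value = statistics.median(median_data)    # middle score
mean_value = statistics.mean(median_data)        # sum of scores / N

# With an even number of scores the median is the average of the middle two.
even_median = statistics.median([0, 1, 2, 5])

print(mode_value, median_value, mean_value, even_median)
```

The outlier pulls the mean far above the median, which is why the median is described as relatively uninfluenced by outliers.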
Measures of dispersion
Several ways to measure spread of data:
Range (max-min), IQR or Inter-Quartile Range (middle 50%),
Average Deviation, Mean Absolute Deviation
Variance – average of the squared deviations
Variance for a population of 3 scores (-10, 0, 10) is 66.67
(200/3)
Standard deviation is simply the square root of the
variance
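The measures of spread listed above can be worked through step by step for the three-score population (-10, 0, 10). The following is a minimal sketch in plain Python; variable names are illustrative only.

```python
import math

scores = [-10, 0, 10]          # treated as a complete population
n = len(scores)
mean = sum(scores) / n

# Range: largest score minus smallest score.
data_range = max(scores) - min(scores)

# Mean absolute deviation: average distance from the mean.
mean_abs_dev = sum(abs(x - mean) for x in scores) / n

# Variance: average of the squared deviations from the mean.
squared_deviations = [(x - mean) ** 2 for x in scores]
population_variance = sum(squared_deviations) / n

# Standard deviation: square root of the variance.
population_sd = math.sqrt(population_variance)

print(data_range, round(mean_abs_dev, 2),
      round(population_variance, 2), round(population_sd, 2))
```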
Calculating sample variance
Population variance (σ²) is the true variance of the
population calculated by
σ² = Σ(X - μ)² / N
-this equation is used when we have all
values in a population (unusual)
However, the variance of a sample tends to be
smaller than that of the population from which it was drawn. So,
for the sample variance (s²) we use this equation:
s² = Σ(X - X̄)² / (N - 1)
The correction factor of ‘N-1’ increases the variance to
be closer to the true population variance (in fact, the
average of all possible sample variances exactly equals
σ²)
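The claim that the average of all possible sample variances equals the population variance can be checked by brute force on a tiny population. The sketch below assumes samples of size 2 drawn with replacement from the population (-10, 0, 10); the sample size and the helper function `variance` are my own choices for illustration.

```python
from itertools import product

def variance(values, denominator_offset):
    """Sum of squared deviations from the mean of `values`, divided by
    len(values) - denominator_offset (0 gives N, 1 gives N-1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - denominator_offset)

population = [-10, 0, 10]
sigma_squared = variance(population, 0)           # 200/3, dividing by N

# All 9 possible ordered samples of size 2, drawn with replacement.
samples = list(product(population, repeat=2))

# Average sample variance when dividing by N (too small on average) ...
avg_var_n = sum(variance(s, 0) for s in samples) / len(samples)
# ... and when dividing by N-1 (matches the population variance).
avg_var_n_minus_1 = sum(variance(s, 1) for s in samples) / len(samples)

print(round(sigma_squared, 2), round(avg_var_n, 2), round(avg_var_n_minus_1, 2))
```

Dividing by N gives an average of about 33.33, half the true value, while dividing by N-1 recovers 66.67.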
Degrees of freedom
Why is ‘N-1’ used to calculate sample variance?
When calculating sample variance, we first calculate the
sample mean, thus making the last number in
the dataset redundant – i.e. we lose a ‘degree of
freedom’ (the last number is not free to vary)
e.g. M=10, sample data: 12, 9, 10, 11, 8
Calculating the sample mean (10) means that we have already
(implicitly) included the last number in our calculations:
given M=10 and the first four scores, the last score must be 8.
If we (knew and) used the population mean rather than the sample
mean this would not be the case so we could use N not N-1.
Howell illustrates this with a worked example (and
mathematical proof can be retrieved with internet search)
Bottom line is whenever we have to estimate a parameter
(e.g. the mean) from the sample we lose a degree of freedom
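The 'not free to vary' idea can be made concrete with the sample from this slide (12, 9, 10, 11, 8). A small sketch, with variable names of my own choosing: once the sample mean is known, the last score can be recovered from the others, and the deviations from the mean are forced to sum to zero.

```python
sample = [12, 9, 10, 11, 8]
n = len(sample)
sample_mean = sum(sample) / n            # 10.0

# Deviations from the sample mean always sum to zero.
deviations = [x - sample_mean for x in sample]
sum_of_deviations = sum(deviations)

# Given the mean and the first n-1 scores, the last score is fixed.
first_four = sample[:-1]
implied_last = n * sample_mean - sum(first_four)

print(deviations, sum_of_deviations, implied_last)
```

Only four of the five deviations carry independent information, which is why the sum of squared deviations is divided by N-1 rather than N.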
Transforming data
One reason we might ‘transform’ data is to convert from
one scale to another
e.g. feet into inches, centigrade into Fahrenheit,
raw IQ scores into standard IQ scores
Scale conversion can usually be achieved by simple
linear transformation (multiplying/dividing by a constant
and adding/subtracting a constant)
Xnew = b*Xold + c
So to convert centigrade data into Fahrenheit we would apply the
following (b = 1.8, c = 32):
Fahrenheit = 1.8*Centigrade + 32
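A short sketch of the centigrade-to-Fahrenheit case, using the standard conversion (multiply by 1.8, add 32). The temperature values are invented for the example. It also shows what a linear transformation does to the summary statistics: the mean is transformed like any single score, while the standard deviation is only multiplied by b (the added constant does not change spread).

```python
import statistics

# Hypothetical centigrade readings (made-up data).
centigrade = [0.0, 10.0, 20.0, 30.0, 40.0]

b, c = 1.8, 32.0                       # Xnew = b*Xold + c
fahrenheit = [b * x + c for x in centigrade]

mean_c = statistics.mean(centigrade)
sd_c = statistics.stdev(centigrade)    # sample SD (N-1 denominator)
mean_f = statistics.mean(fahrenheit)
sd_f = statistics.stdev(fahrenheit)

print(fahrenheit)
print(mean_f, b * mean_c + c)          # mean follows the same rule
print(sd_f, b * sd_c)                  # SD is scaled by b only
```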
Transforming data
Z-transform (standardisation) is one common type of
linear transform, which produces a new variable with
M=0 & SD=1
Z-score = (X - M) / SD
Standardisation is useful when comparing the same
dimension measured on different scales (e.g. anxiety
scores measured on a VAS and questionnaire)
After standardisation these scales could also be added
together (adding two quantities on different scales is
obviously problematic)
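A minimal sketch of standardisation, assuming two hypothetical anxiety measures for the same five people (a 0-100 VAS and a questionnaire total). The data and the helper `z_scores` are invented for illustration; the sample SD (N-1) is used here.

```python
import statistics

def z_scores(values):
    """Return (X - M) / SD for each score, using the sample SD."""
    m = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(x - m) / sd for x in values]

# Hypothetical anxiety scores for the same five people (made-up data).
vas = [20.0, 35.0, 50.0, 65.0, 80.0]          # 0-100 visual analogue scale
questionnaire = [3.0, 4.0, 4.0, 6.0, 8.0]     # questionnaire total

z_vas = z_scores(vas)
z_questionnaire = z_scores(questionnaire)

# Both variables now have M=0 and SD=1, so they can be added together.
composite = [a + b for a, b in zip(z_vas, z_questionnaire)]

print([round(z, 2) for z in z_vas])
print([round(z, 2) for z in z_questionnaire])
print([round(z, 2) for z in composite])
```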
Normal Distribution
Many real-life variables
(height, weight, IQ etc etc)
are distributed like this
Mathematical equation
mimics this normal
(or Gaussian) distribution
Normal Distribution
The mathematical normal distribution is useful
as its known mathematical properties give us
useful info about our real-life variable (assuming
our real-life variable is normally distributed)
For example, scores more than about 2 standard deviations
above the mean (1.96, to be exact) make up the extreme 2.5% of scores
(calculus equations used to derive this)
Consequently, a person with an IQ score of 130
(M=100, SD=15) would be in roughly the top 2.5%
(assuming IQ is normally distributed)
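The IQ example can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a sketch that assumes IQ is exactly normal with M=100 and SD=15, as the slide does.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

# Proportion of scores above 130, i.e. more than 2 SDs above the mean.
prop_above_130 = 1 - iq.cdf(130)

# The IQ score that cuts off exactly the top 2.5% of the distribution.
cutoff_top_2_5 = iq.inv_cdf(0.975)

# The same cutoff expressed in standard deviations (a z-score).
z_cutoff = NormalDist().inv_cdf(0.975)

print(round(prop_above_130, 4), round(cutoff_top_2_5, 1), round(z_cutoff, 2))
```

Under these assumptions an IQ of 130 sits in about the top 2.3%, and the exact 2.5% cutoff is about 1.96 SDs above the mean (an IQ of roughly 129.4), so 'top 2.5%' is a convenient rounding.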
Normal Distribution
Normality is an important assumption (though more
about this next week). Violations of normality
generally take two forms:
SKEWNESS
KURTOSIS
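As a rough sketch of how these two violations can be quantified, the code below uses the simple moment-based definitions (skewness = m3 / m2^1.5, excess kurtosis = m4 / m2² - 3, where m2, m3, m4 are the average 2nd, 3rd and 4th powers of the deviations from the mean). These particular formulas and the example data are my assumptions; statistics packages such as SPSS apply small-sample corrections, so their values will differ slightly.

```python
def central_moment(values, k):
    """Average of the k-th power of deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n

def skewness(values):
    """0 for a symmetric distribution; positive when the right tail is longer."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """0 for a normal distribution; negative for flatter, lighter-tailed data."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]            # made-up symmetric data
right_skewed = [1, 1, 2, 2, 3, 10]     # made-up data with a long right tail

print(skewness(symmetric), round(skewness(right_skewed), 2))
print(round(excess_kurtosis(symmetric), 2))
```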