Distribution William Simpson 11th April 2014


•Distributions &
Descriptive statistics
•Dr William Simpson
•Psychology, University of Plymouth
1
Defining and measuring
variables
2
Independent & dependent
variables
• Independent variable: something we manipulate in
an experiment
• Dependent variable: something we measure
• By manipulating the IV, we expect to produce a
change in the DV
3
Scales of measurement
• variables classified according to type of scale
–type of analysis depends on type of
scale
• Worst to best: Nominal, ordinal, interval,
ratio
4
Nominal
•Nominal data: assign categorical labels to
observations
•Not really measurement
•E.g. male/female;
married/single/widowed/divorced
•Numbers on football jerseys
5
Ordinal
•Ordinal data: values can be ranked (ordered).
Categorical but rankable
•E.g. small, medium, large; movie rating 1-5;
Likert scale
•Can only be ranked. A rating scale is not like cm:
the difference between one pair of adjacent ratings is
not necessarily the same as the difference between
another pair
6
• Averaging a response of "strongly agree" (5) with
two responses of "disagree" (2) would give us
a mean of 3, but what is the meaning of that
number?
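As a quick check of the arithmetic, here is a minimal R sketch (the vector name `responses` is only illustrative). The mean lands on a value nobody chose, while the median is an actual response category.

```r
# One "strongly agree" (5) and two "disagree" (2) responses
responses <- c(5, 2, 2)

mean(responses)    # (5 + 2 + 2) / 3 = 3
median(responses)  # middle value after sorting: 2
```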
7
Interval
•Interval data: ordinary measurement, e.g.
temperature
•Unlike ordinal data, we can say the diff
between 1 & 2 deg C is same as diff between 4
& 5 deg
8
Ratio
•Ordinary measurements, but with an absolute, nonarbitrary zero point
•E.g. weight, length: any scale must start at zero
•deg C: not ratio, because 0 arbitrarily set at freezing pt of
water
9
Discrete & continuous variables
• variables measured on interval & ratio scales are
further identified as either:
–discrete – Integers, no intermediate values. E.g.
#Smarties in a box
–continuous - measurable to any level of
accuracy. E.g. Weight of Smarties contents
10
Frequency distributions
11
•We have a pile of scores
•Not all scores are equally likely
•How were scores distributed?
12
•Subjects were timed (in sec) while completing a
problem-solving task:
•7.6, 8.1, 9.2, 6.8, 5.9, 6.2, 6.1, 5.8, 7.3, 8.1,
8.8, 7.4, 7.7, 8.2
13
Stem & leaf
•Two components: the stem and the leaf
•In problem-solving example, stem = ones, leaf =
tenths
•Stems range between 5 and 9
14
•7.6, 8.1, 9.2, 6.8, 5.9, 6.2, 6.1,
5.8, 7.3, 8.1, 8.8, 7.4, 7.7, 8.2
•
• 5|98
• 6|821
• 7|6347
• 8|1182
• 9|2
•Key: 9|2 means 9.2
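The display above can be rebuilt step by step in R. This is a rough sketch, not the built-in method: the names `times`, `stems`, `leaves` and `leaf_strings` are illustrative, and leaves are kept in data order as on the slide.

```r
# Problem-solving times (sec) from the slide
times <- c(7.6, 8.1, 9.2, 6.8, 5.9, 6.2, 6.1, 5.8, 7.3, 8.1,
           8.8, 7.4, 7.7, 8.2)

stems  <- floor(times)                  # ones digit
leaves <- round((times - stems) * 10)   # tenths digit

# Paste together the leaves that share a stem, in data order
leaf_strings <- sapply(split(leaves, stems), paste, collapse = "")
for (s in names(leaf_strings)) cat(s, "|", leaf_strings[[s]], "\n", sep = "")

# Built-in alternative; its layout may differ from the hand-made display
stem(times)
```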
15
•Heights in cm:154, 143, 148,139, 143,
147, 153, 162, 136, 147, 144, 143, 139,
142, 143, 156, 151, 164, 157, 149, 146
•- Put 2 digits in stem; split stems 0-4, 5-9
•13|969
•14|334323
•14|87796
•15|431
•15|67
•16|24
16
• GSR values: 23.25, 24.13, 24.76, 24.81,
24.98, 25.31, 25.57, 25.89, 26.28, 26.34,
27.09
•- Round the last 2 digits to 1 (i.e. to one decimal place)
•23|3
•24|188
•25|0369
•26|33
•27|1
•Key: 23|3 means 23.3
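A sketch of the rounding step in R, assuming halves are rounded up (so 23.25 becomes 23.3, as in the display). R's own round() may send an exact half to the even digit instead, so the rounding is done by hand here. All variable names are illustrative.

```r
# GSR values from the slide
gsr <- c(23.25, 24.13, 24.76, 24.81, 24.98, 25.31, 25.57, 25.89,
         26.28, 26.34, 27.09)

# Work in whole tenths, rounding halves up (assumption: this is the
# rounding rule used on the slide)
tenths <- floor(gsr * 10 + 0.5)

stems  <- tenths %/% 10   # whole-number part
leaves <- tenths %% 10    # rounded first decimal

leaf_strings <- sapply(split(leaves, stems), paste, collapse = "")
for (s in names(leaf_strings)) cat(s, "|", leaf_strings[[s]], "\n", sep = "")
```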
17
Histogram
•Alternative way to look at distribution
•It is like a version of stem-and-leaf turned 90
deg
18
Example
• Time to complete task (min):
• 8 2 6 12 9 14 1 7 7 9 11 8
12 10 5 7 10 9 10 11 4 8 2
11 10 11 13 13 14 11 13 10
12 13 5 16 11 17 10 6 13 11
5 9 12 14 8 2 12 4
19
•Sort scores into about 10 or so bins
(similar to stem in stem-and-leaf)
20
•Decide on sensible bins
•Count the number of observations in each
bin (length of each leaf in stem-and-leaf)
•This number in each bin is called the
frequency
21
time (min)   frequency
0-1          1
2-3          3
4-5          5
6-7          5
8-9          8
10-11        13
12-13        10
14-15        3
16-17        2
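The counting can be sketched in R with cut() and table(). The break points at half-minutes and the bin labels are choices made here to reproduce the bins in the table; they are not the only sensible ones.

```r
# Task times (min) from the slide
x <- c(8, 2, 6, 12, 9, 14, 1, 7, 7, 9, 11, 8, 12,
       10, 5, 7, 10, 9, 10, 11, 4, 8, 2, 11, 10, 11,
       13, 13, 14, 11, 13, 10, 12, 13, 5, 16, 11, 17,
       10, 6, 13, 11, 5, 9, 12, 14, 8, 2, 12, 4)

# Bins two minutes wide; breaks fall between whole minutes so that
# every score lands unambiguously in one bin
bins <- cut(x, breaks = seq(-0.5, 17.5, by = 2),
            labels = c("0-1", "2-3", "4-5", "6-7", "8-9",
                       "10-11", "12-13", "14-15", "16-17"))

freq <- table(bins)   # frequency = number of observations per bin
freq
```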
22
•This table is then used to make
the histogram
•Histogram is bar chart with
frequency on y axis and score on
x axis
•Sometimes done other ways,
e.g. connect the dots (frequency
distrib polygon)
23
[Figure: histogram of the task times; x axis: Time (min), y axis: Frequency]
24
in R
•x<-c(8, 2, 6, 12, 9, 14, 1, 7, 7, 9, 11, 8, 12,
10, 5, 7, 10, 9, 10, 11, 4, 8, 2, 11, 10, 11,
13,13, 14, 11, 13, 10, 12, 13, 5, 16, 11, 17,
10, 6, 13, 11, 5, 9, 12, 14, 8, 2, 12, 4)
•hist(x)
•stem(x)
•boxplot(x)
25
Probability distributions
•Histogram is estimate of true
probability distribution
•Many theoretical probability
distributions exist
•Basis of statistical models used to
make inferences about population
26
Binomial distribution
• Binomial distribution is a discrete distribution
• the binomial distribution applies when:
–there is a series of n trials (e.g., 10 coin
tosses)
–only 2 possible outcomes per trial
–outcomes are mutually exclusive (head or
tail)
–outcome of each trial independent of others
27
•The binomial distribution gives the chance
of getting each total number of ‘successes’
after doing all the (binary) trials of the expt
•E.g. it gives the chance of getting 1, 2, or 3
girls after giving birth to 6 children
•p = p(success) = p(girl) = 0.5 each trial
•q = p(failure) = p(boy) = 1-p = 0.5
•n = number of trials = 6
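For the six-children example, the probabilities can be tabulated in R with dbinom(). A small sketch; the names `girls` and `prob` are illustrative.

```r
n <- 6      # number of trials (children)
p <- 0.5    # p(girl) on each trial

girls <- 0:n
prob  <- dbinom(girls, size = n, prob = p)
names(prob) <- girls
prob                          # chance of each total number of girls

sum(prob)                     # all outcomes together: 1
sum(prob[c("1", "2", "3")])   # chance of 1, 2, or 3 girls: 41/64
```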
28
• prob distribution where n = 6 and the
prob of each outcome is 0.5 on each
trial looks like:
[Figure: binomial distribution; x axis: number of girls, y axis: probability]
29
•For any probability distribution, the y-axis is given by a formula
•For the binomial, it looks like this:
$$P(k) = \binom{n}{k} p^{k} q^{n-k}$$
• k successes in n trials; $\binom{n}{k}$ is the binomial
coefficient
• you don’t need to know it
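For the curious, the standard binomial formula, choose(n, k) * p^k * q^(n-k), can be written out by hand and compared with R's dbinom(). The function name `binom_prob` is made up for this sketch.

```r
# Chance of k successes in n trials, each with success probability p
binom_prob <- function(k, n, p) {
  q <- 1 - p                        # p(failure)
  choose(n, k) * p^k * q^(n - k)    # choose() is the binomial coefficient
}

binom_prob(3, n = 6, p = 0.5)       # 20/64 = 0.3125

# Agrees with the built-in function for every possible total
all.equal(binom_prob(0:6, 6, 0.5), dbinom(0:6, size = 6, prob = 0.5))
```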
30
Normal distribution
•Continuous probability distribution
•Every probability distribution’s y-axis is given by
a formula
•For normal distribution, the y-axis (probability
density) is:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
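A sketch of the standard normal density formula in R, checked against the built-in dnorm(). The function name `normal_density` and the grid `xs` are illustrative; `mu` and `sigma` stand for the mean and standard deviation.

```r
# Probability density of a normal distribution with mean mu and SD sigma
normal_density <- function(x, mu = 0, sigma = 1) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

xs <- seq(-3, 3, by = 0.5)

# Same values as R's dnorm(), for the default and a shifted/scaled case
all.equal(normal_density(xs), dnorm(xs))
all.equal(normal_density(xs, mu = 1, sigma = 2), dnorm(xs, mean = 1, sd = 2))
```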
31
32
Descriptive statistics
33
•We have a pile of scores
•Have made stem-and-leaf, histogram
•Want to summarise further: descriptive
statistics
34
1. Centre (location)
•What is the ‘typical’ score? If you were to make
a prediction for a new score, what would it be?
35
a) Mean (average)
•Mean = sum(x)/n
36
Mean as balance point
•Imagine that each observation is a toy block
•Place the blocks on a ruler; the position (1, 2,
etc inches) represents the value
•The balance point is the mean
37
•1 2 2 3
•1 2 2 5
•1 2 2 9
•Mean is pulled towards extreme
observation (outlier)
38
b) Median
•Median is middle score; 50th percentile
•useful when extreme scores (outliers) lie in one
tail of distribution (skewed)
39
Calculate the median
•Sort scores
•If odd n, median is middle value
•If even n, median is mean of 2 middle
values
•25 13 9 18 1 -> 1 9 13 18 25; med=13
•25 13 9 18 -> 9 13 18 25
•Median= (13+18)/2 = 15.5
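The odd/even rule can be sketched as a small R function. `median_by_hand` is a made-up name; in practice the built-in median() does the same job.

```r
median_by_hand <- function(scores) {
  s <- sort(scores)                   # step 1: sort
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                    # odd n: the middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2     # even n: mean of the 2 middle values
  }
}

median_by_hand(c(25, 13, 9, 18, 1))   # 13
median_by_hand(c(25, 13, 9, 18))      # 15.5
```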
40
Median and outliers
•1 2 2 3
•1 2 2 5
•1 2 2 9
•Median = 2 in all cases
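A short R sketch comparing the two measures on the three sets of scores (the list name `sets` is illustrative). The mean follows the extreme score; the median stays put.

```r
sets <- list(c(1, 2, 2, 3), c(1, 2, 2, 5), c(1, 2, 2, 9))

sapply(sets, mean)     # 2.0 2.5 3.5  (pulled towards the outlier)
sapply(sets, median)   # 2 2 2        (unchanged)
```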
41
c) Mode
•Mode is most frequently occurring score
•Mean should really be used only for
interval/ratio data. Mode good otherwise
•E.g. mean movie rating – not really
sensible. Mode sensible
•Sometimes no unique mode exists (e.g.
bimodal)
42
•Bimodality can be
due to mixture of two
different populations
(e.g. male and female)
43
Time to complete task (min)
[Figure: histogram; x axis: Time (min), y axis: Frequency]
• Mean = 9.36
• Median = 10
• Mode = 11
44
•mean(x)
•median(x)
•Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x,
ux)))]
}
•Mode(x)
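Mode() as written returns a single value even when several scores tie for most frequent. One possible variant that returns every tied mode is sketched below; `all_modes` is a hypothetical helper, and the small test vectors are made up for illustration.

```r
# Return every score that occurs with the maximum frequency
all_modes <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

all_modes(c(1, 2, 2, 3))      # 2    (one mode)
all_modes(c(1, 2, 2, 3, 3))   # 2 3  (bimodal)
```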
45
Likert scale
• e.g. Brief Psychiatric Rating scale (BPRS)
• Interview + observations of patient's
behaviour over preceding 2–3 days
• Each item scored 0-7
46
• Suppose we have a new treatment
• Does it reduce anxiety?
• Define “anxiety” as score on Q2
47
• We use BPRS on lots of patients
• Compare treatment and placebo
• How? Find mean(treatment) vs
mean(placebo)?
48
NO
49
• The numbers 0-7 are not really numbers!
• They have only rank (order) info
• Ordinal
50
• The “numbers” are really ordered labels:
“normal”, “a bit anxious”, … , “extremely
anxious”
51
• They lack a quantitative distance between
them; calculating a mean level of anxiety for
the group is not really appropriate
52
• It makes sense to find the mode
• Most frequently occurring anxiety score
53
• It makes sense to measure the median: person
in the middle of the group in terms of anxiety,
with half the responses below and the other
half above
54
Example
• Family-Focused Treatment Versus Individual
Treatment for Bipolar Disorder: Results of a
Randomized Clinical Trial
• J. Consulting & Clinical Psychology, 2003,
71, 482– 492
55
“The psychiatrist made ratings of compliance on
a 7-point Likert scale ranging from full
compliance (1) to discontinued medication
against medical advice (7)” p.486
56
• “On the whole, the participants were
quite compliant with their medication,
with at least 78% of the patients scoring
within the compliant range at each
assessment point” p.489
• - Must have made mistake before: 1 is
bad, 7 is good compliance
57
• For each 3-month follow-up period, participants were placed
in one of the following clinical outcome categories:
(a) relapse, defined as a rating of 6 or 7 on the BPRS/SADS-C
core symptoms of depression (depressed mood, loss of
interest), mania (hostility, elevated mood, grandiosity), or
psychosis (unusual thought content, suspiciousness,
hallucinations, conceptual disorganization) and at least two
ancillary symptoms (suicidality, guilt, sleep disturbance,
appetite disturbance, lack of energy, negative evaluation,
discouragement, increased energy activity), or
(b) nonrelapse, defined as a score of 5 or below on all relevant
BPRS/SADS-C core symptoms during the 3-month interval
58
59
60
2. Spread (dispersion)
•Measure of centre (e.g. mean) tells what value
we expect
•Measure of spread tells how close a value will
typically be to the centre
61
a) Interquartile range
•Interquartile range (IQR) is the distance between the
cutoff for the bottom 25% of scores (Q1) and the cutoff
for the top 25% (Q3); it spans the middle 50%
Quartiles
•Quartiles divide the data into quarters
•The median (Q2) divides the data into
2 piles (50% above, 50% below)
•Q1 is the cutoff below which fall the
bottom 25% of scores
•Q3 is the cutoff below which fall the
bottom 75%
– Q1 has 25% of scores below it, Q2 has
50% (i.e. it is the median), and Q3 has
75% of scores below it (25% above)
Finding quartiles
1. Sort the data
2. Find the median = Q2 = value that
splits the data into two equal piles,
half below it and half above
3. Q1 = median of lower half
4. Q3 = median of upper half
5. IQR = Q3 – Q1
•x<-c(8, 2, 6, 12, 9, 14, 1, 7, 7, 9, 11, 8,
12, 10, 5, 7, 10, 9, 10, 11, 4, 8, 2, 11, 10,
11, 13,13, 14, 11, 13, 10, 12, 13, 5, 16,
11, 17, 10, 6, 13, 11, 5, 9, 12, 14, 8, 2,
12, 4)
•x<- sort(x); x
•1 2 2 2 4 4 5 5 5 6 6 7 7 7 8 8 8
8 9 9 9 9 10 10 10 10 10 10 11 11 11
11 11 11 11 12 12 12 12 12 13 13 13 13
13 14 14 14 16 17
67
•n=50
•Q2=(x[25]+x[26])/2 =
(10+10)/2=10
•Q1 = x[13] = 7
•Q3= x[38] =12
•IQR=Q3-Q1=12-7=5
•We expect scores near 10, plus-or-minus ...
68
in R
•fivenum(x)
• 1 7 10 12 17
•= min, Q1, Q2, Q3, max
•IQR(x)
•5
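The median-of-halves procedure can also be sketched directly in R and compared with fivenum() and IQR(). For odd n this sketch leaves the middle score out of both halves, which is one common convention and an assumption here; other quartile definitions can give slightly different answers on other data.

```r
x <- c(8, 2, 6, 12, 9, 14, 1, 7, 7, 9, 11, 8, 12,
       10, 5, 7, 10, 9, 10, 11, 4, 8, 2, 11, 10, 11,
       13, 13, 14, 11, 13, 10, 12, 13, 5, 16, 11, 17,
       10, 6, 13, 11, 5, 9, 12, 14, 8, 2, 12, 4)

s    <- sort(x)                  # 1. sort the data
n    <- length(s)
half <- n %/% 2                  # size of each half

q2 <- median(s)                  # 2. median of all scores
q1 <- median(s[1:half])          # 3. median of lower half
q3 <- median(s[(n - half + 1):n])  # 4. median of upper half

c(Q1 = q1, Q2 = q2, Q3 = q3, IQR = q3 - q1)   # 5. IQR = Q3 - Q1

fivenum(x)   # min, Q1, Q2, Q3, max
IQR(x)
```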
69
•boxplot(x)
70
b) Standard deviation
•Each point is some distance away from mean
•Each distance from the mean is a deviation
•Deviation = score - mean
71
•Each deviation contributes to the
spread of the data about the mean
•Is the total spread just the sum of the
deviations, then?
•No. Mean is a balance point, so
positive and negative deviations cancel
out
•Can find a “sort of” average or
“typical” deviation if we get rid of the
signs
“Average” deviation
•Average deviation actually is zero
because signs cancel. Need to get rid
of signs
•Idea: square each deviation, average,
then take (positive) square root. [RMS]
•That is the standard deviation!
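A small R sketch of the idea, using the scores 1 2 2 3 from the balance-point example (variable names are illustrative). The plain average deviation is zero; squaring first, then averaging, then taking the square root gives a usable "typical" deviation.

```r
scores <- c(1, 2, 2, 3)

deviations <- scores - mean(scores)   # score - mean: -1 0 0 1
sum(deviations)                       # 0: positive and negative cancel

squared <- deviations^2               # squaring removes the signs: 1 0 0 1
rms <- sqrt(mean(squared))            # root of the mean square
rms                                   # sqrt(0.5), about 0.71
```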
Calculating the SD
•Find the deviations
•Square them
•Find the average
•Take the square root to undo the squaring
•In symbols:
$$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$
•Denominator: N or n-1
c) Variance
•Variance = SD squared
•Useful for ANOVA (ANalysis Of VAriance)
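A sketch of the N versus n-1 choice in R, using the small illustrative data set 1 2 2 3. R's built-in sd() and var() divide by n-1, so the divide-by-N version has to be computed by hand. The names `sd_N` and `sd_n1` are made up.

```r
scores <- c(1, 2, 2, 3)
n  <- length(scores)
ss <- sum((scores - mean(scores))^2)   # sum of squared deviations

sd_N  <- sqrt(ss / n)         # divide by N
sd_n1 <- sqrt(ss / (n - 1))   # divide by n - 1

all.equal(sd_n1, sd(scores))           # sd() uses n - 1
all.equal(var(scores), sd(scores)^2)   # variance = SD squared
```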
76
Likert scale
• These “numbers” are not really numbers
• Therefore cannot do operations like
subtraction, division, sqrt
• Use IQR
77
78
79
Statistical Inference
•Usually we are interested in more
than describing or summarising the
numbers we have on hand
•E.g. have a sample, calculate mean.
What is mean of larger pop?
•E.g. have done an expt, means differ.
Is this a fluke or “real”?
80
• The data we have on hand are samples from
some (real or theoretical) population
• We want to make inferences about population
81
Summary
•IV, DV
•Nominal, ordinal, interval, ratio
•Continuous, discrete
•Stem & leaf, histogram
•Probability distribution
•Mean, median, mode
•IQR, SD, variance
82