LECTURE 2 (Week 1)

Review BPS chapter 1
Picturing Distributions with Graphs
• What is Statistics?
• Individuals and variables
• Two types of data: categorical and quantitative
• Ways to chart categorical data: bar graphs and pie charts
• Ways to chart quantitative data: histograms and stem plots
• Interpreting histograms
• Time plots
Example BPS chapter 1
Indicate whether each of the following variables is categorical or
quantitative.
a. We have data on 20 individuals, measuring the amount of time it takes
each of them to climb five flights of stairs.
Quantitative
b. During a clinical trial, an experimental pain relief drug is administered to
individuals. Each individual is then asked whether s/he experienced
any pain relief.
Categorical
Objectives (BPS chapter 2)
Describing distributions with numbers
• Measure of center: mean and median
• Measure of spread: quartiles and standard deviation
• The five-number summary and boxplots
• IQR and outliers
• Choosing among summary statistics
Measure of center: the mean
The mean or arithmetic average
To calculate the average, or mean, add
all values, then divide by the number of
individuals. It is the “center of mass.”
Sum of heights is 1598.3
Divided by 25 women = 63.9 inches
woman (i)   height (x)      woman (i)   height (x)
i = 1       x1 = 58.2       i = 14      x14 = 64.0
i = 2       x2 = 59.5       i = 15      x15 = 64.5
i = 3       x3 = 60.7       i = 16      x16 = 64.1
i = 4       x4 = 60.9       i = 17      x17 = 64.8
i = 5       x5 = 61.9       i = 18      x18 = 65.2
i = 6       x6 = 61.9       i = 19      x19 = 65.7
i = 7       x7 = 62.2       i = 20      x20 = 66.2
i = 8       x8 = 62.2       i = 21      x21 = 66.7
i = 9       x9 = 62.4       i = 22      x22 = 67.1
i = 10      x10 = 62.9      i = 23      x23 = 67.8
i = 11      x11 = 63.9      i = 24      x24 = 68.9
i = 12      x12 = 63.1      i = 25      x25 = 69.6
i = 13      x13 = 63.9      n = 25
Mathematical notation:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
$$\sum_{i=1}^{25} x_i = 1598.3, \qquad \bar{x} = \frac{1598.3}{25} = 63.9$$
Learn right away how to get the mean using your calculators.
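As a check on the arithmetic, here is a minimal Python sketch of the same calculation. The list name `heights` and the helper `mean` are illustrative choices, not part of the lecture; the values are the 25 heights listed above.

```python
# Heights (inches) of the 25 women, in the order x1 ... x25 given above.
heights = [
    58.2, 59.5, 60.7, 60.9, 61.9, 61.9, 62.2, 62.2, 62.4, 62.9,
    63.9, 63.1, 63.9, 64.0, 64.5, 64.1, 64.8, 65.2, 65.7, 66.2,
    66.7, 67.1, 67.8, 68.9, 69.6,
]


def mean(values):
    """Add all values, then divide by the number of individuals."""
    return sum(values) / len(values)


total = sum(heights)
x_bar = mean(heights)
print(f"sum = {total:.1f}, n = {len(heights)}, mean = {x_bar:.1f}")
```

Rounded to one decimal place, the sum and mean agree with the 1598.3 and 63.9 inches shown on the slide.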
Measure of center: the median
The median (M) is the midpoint of a distribution: the number
such that half of the observations are smaller and half are larger.
Sorted observations for the n = 25 case (observation 13, in the middle, is 3.4):
0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4
3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1
1. Sort observations from smallest to largest.
2. Find the location of the median, L = (n+1)/2, where
n = number of observations.
(1). If n is odd, the median is observation (n+1)/2 down the list.
n = 25: L = (n+1)/2 = 26/2 = 13, so M = 3.4
(2). If n is even, the median is the mean of the two center observations.
n = 24: L = (n+1)/2 = 12.5, so M = (3.3+3.4)/2 = 3.35
Sorted observations for the n = 24 case (observations 12 and 13 are 3.3 and 3.4):
0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3
3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6
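The odd and even cases can be sketched in a few lines of Python. This is an illustration only: the function name `median` and the list names `data_25` and `data_24` are assumptions made for the example, and the values are the observations from the slides.

```python
def median(values):
    """Median via the location rule L = (n + 1) / 2 on the sorted list."""
    ordered = sorted(values)  # step 1: sort from smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n odd: L is a whole number, and index mid is observation L.
        return ordered[mid]
    # n even: L ends in .5, so average the two center observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


data_25 = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
           3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]
data_24 = data_25[:-1]  # the same list without the largest value, 6.1

print(f"n = 25: M = {median(data_25):.2f}")
print(f"n = 24: M = {median(data_24):.2f}")
```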
Comparing the mean and the median
The mean and the median are the same if the distribution is exactly
symmetric. In a skewed distribution, the mean is usually farther out in
the long tail than the median. The median is a measure of center that
is resistant to skew and outliers; the mean is not.
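A small experiment makes the idea of resistance concrete. The numbers below are hypothetical, chosen only for illustration and not taken from the lecture; the sketch uses `mean` and `median` from Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical symmetric data (not from the lecture).
symmetric = [2, 3, 4, 5, 6]
# The same data with one extreme value added in the right tail.
with_outlier = symmetric + [40]

for label, values in [("symmetric", symmetric), ("with outlier", with_outlier)]:
    print(f"{label:>12}: mean = {mean(values):.2f}, median = {median(values):.2f}")
```

One extreme value moves the mean from 4 to 10, while the median only moves from 4 to 4.5.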
[Figure, three panels with the mean and median marked: "Mean and median
for a symmetric distribution"; "Mean and median for skewed distributions"
(left skew and right skew).]
Mean and median of a distribution with outliers
[Figure: the distribution plotted without and with the outliers; axis
label: percent of people dying.]
Without the outliers: x̄ = 3.4
With the outliers: x̄ = 4.2
The mean is pulled to the right a lot by the outliers
(from 3.4 to 4.2).
The median, on the other hand, is only slightly pulled to the right
by the outliers (from 3.4 to 3.6).
Impact of skewed data
Mean and median of a symmetric distribution
Disease X: x̄ = 3.4, M = 3.4
Mean and median are the same.
Mean and median of a right-skewed distribution
Multiple myeloma: x̄ = 3.4, M = 2.5
The mean is pulled toward the skew.
Example: STAT 200 Midterm Score
Midterm scores (n = 20):
30 35 40 40 40 40 45 45 45 45 50 50 55 55 60 65 65 70 100 100
Descriptive Statistics: Midterm
Variable   N    Mean    StDev   Minimum   Q1      Median   Q3      Maximum
Midterm    20   53.75   18.98   30.00     40.00   47.50    63.75   100.00
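The summary can be reproduced with Python's standard `statistics` module, as a rough sketch. One assumption to flag: the slides do not say which quartile rule the software used. The `method="exclusive"` option of `statistics.quantiles` places quartile k at position k(n+1)/4 in the sorted data and interpolates, and for this data set it gives the same Q1 and Q3 as the output above.

```python
from statistics import mean, median, quantiles, stdev

midterm = [30, 35, 40, 40, 40, 40, 45, 45, 45, 45,
           50, 50, 55, 55, 60, 65, 65, 70, 100, 100]

# Assumed quartile rule: position k(n + 1)/4 with linear interpolation.
q1, q2, q3 = quantiles(midterm, n=4, method="exclusive")

summary = {
    "N": len(midterm),
    "Mean": mean(midterm),
    "StDev": stdev(midterm),  # sample standard deviation (divides by n - 1)
    "Minimum": min(midterm),
    "Q1": q1,
    "Median": median(midterm),
    "Q3": q3,
    "Maximum": max(midterm),
}
for name, value in summary.items():
    print(f"{name:8s} {value:8.2f}")
```

Other quartile conventions exist, so a calculator or another package may report slightly different values for Q1 and Q3 on the same data.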
Measure of spread: quartiles
The first quartile, Q1, is the value in
the sample that has 25% of the data
at or below it.
The third quartile, Q3, is the value in
the sample that has 75% of the data
at or below it.
Sorted data (n = 25), split around the median:
Lower half (observations 1 to 12): 0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3
M = median = 3.4 (observation 13)
Upper half (observations 14 to 25): 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1
Q1 = first quartile = 2.2 (median of the lower half: (2.1+2.3)/2)
Q3 = third quartile = 4.35 (median of the upper half: (4.2+4.5)/2)
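The quartiles above are the medians of the observations below and above the median's location. A minimal Python sketch of that rule follows; the function names `median_sorted` and `quartiles_by_halves` are made up for this example.

```python
def median_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles_by_halves(values):
    """Return (Q1, M, Q3), with Q1 and Q3 taken as the medians of the
    observations below and above the location of the overall median."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # observations before the median location
    upper = ordered[half + n % 2:]  # skip the middle observation when n is odd
    return median_sorted(lower), median_sorted(ordered), median_sorted(upper)


data = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
        3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]
q1, m, q3 = quartiles_by_halves(data)
print(f"Q1 = {q1:.2f}, M = {m:.2f}, Q3 = {q3:.2f}")
```

Statistical software often interpolates between observations instead, so its quartiles can differ slightly from this hand rule.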
Center and spread in boxplots
[Figure: the 25 sorted observations next to a boxplot for Disease X;
vertical axis: years until death.]
“Five-number summary”:
Largest = max = 6.1
Q3 = third quartile = 4.35
M = median = 3.4
Q1 = first quartile = 2.2
Smallest = min = 0.6
Boxplots for skewed data
Comparing boxplots for a normal
and a right-skewed distribution
[Figure: side-by-side boxplots for Disease X and multiple myeloma;
vertical axis: years until death, 0 to 15.]
Boxplots remain true
to the data and clearly
depict symmetry or
skewness.
IQR and outliers
The interquartile range (IQR) is the distance between the first
and third quartiles (the length of the box in the boxplot)
IQR = Q3 - Q1
An outlier is an individual value that falls outside the overall
pattern.
• How far outside the overall pattern does a value have to fall
to be considered an outlier?
• The 1.5 × IQR rule for outliers:
Low outlier: any value < Q1 − 1.5 × IQR
High outlier: any value > Q3 + 1.5 × IQR
Example: STAT 200 Midterm Score
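IQR = Q3 − Q1 = 63.75 − 40.00 = 23.75
Low outlier: any value < Q1 − 1.5 × IQR = 40.00 − 1.5(23.75) = 4.375
High outlier: any value > Q3 + 1.5 × IQR = 63.75 + 1.5(23.75) = 99.375
Midterm scores:
30 35 40 40 40 40 45 45 45 45 50 50 55 55 60 65 65 70 100 100
Outliers: the two scores of 100 are above 99.375, so both are high
outliers. No score is below 4.375, so there are no low outliers.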
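The 1.5 × IQR rule is easy to apply in code. The sketch below takes Q1 and Q3 from the summary output for the midterm scores rather than recomputing them; the variable names `low_fence` and `high_fence` are illustrative.

```python
midterm = [30, 35, 40, 40, 40, 40, 45, 45, 45, 45,
           50, 50, 55, 55, 60, 65, 65, 70, 100, 100]

# Quartiles as reported in the descriptive statistics for the midterm.
q1, q3 = 40.00, 63.75

iqr = q3 - q1
low_fence = q1 - 1.5 * iqr    # values below this are low outliers
high_fence = q3 + 1.5 * iqr   # values above this are high outliers

outliers = [x for x in midterm if x < low_fence or x > high_fence]
print(f"IQR = {iqr}, low fence = {low_fence}, high fence = {high_fence}")
print("outliers:", outliers)
```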
Measure of spread: standard deviation
The standard deviation is used to describe the variation around the mean.
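1) First calculate the variance s^2:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
2) Then take the square root to get
the standard deviation s:
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
[Figure: distribution of x with the mean and mean ± 1 s.d. marked.]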
Calculations …
Women’s height (inches)
i     x_i    x̄       (x_i − x̄)    (x_i − x̄)^2
1     59     63.4    −4.4         19.0
2     60     63.4    −3.4         11.3
3     61     63.4    −2.4         5.6
4     62     63.4    −1.4         1.8
5     62     63.4    −1.4         1.8
6     63     63.4    −0.4         0.1
7     63     63.4    −0.4         0.1
8     63     63.4    −0.4         0.1
9     64     63.4    0.6          0.4
10    64     63.4    0.6          0.4
11    65     63.4    1.6          2.7
12    66     63.4    2.6          7.0
13    67     63.4    3.6          13.3
14    68     63.4    4.6          21.6
Sum                  0.0          85.2
Mean = 63.4
Sum of squared deviations from mean = 85.2
Degrees of freedom (df) = (n − 1) = 13
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
s^2 = variance = 85.2/13 = 6.55 inches squared
s = standard deviation = √6.55 = 2.56 inches
We’ll never calculate these by hand, so make sure you know
how to get the standard deviation using your calculator.
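For anyone who prefers to check the table with a short script, here is a sketch that follows the same steps (deviations, squared deviations, divide by n − 1, square root). The variable names are illustrative; the 14 heights are the x_i values from the calculation table.

```python
from math import sqrt

# The 14 women's heights (inches) from the calculation table.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
x_bar = sum(heights) / n
deviations = [x - x_bar for x in heights]   # these sum to zero
sum_sq = sum(d ** 2 for d in deviations)    # sum of squared deviations
variance = sum_sq / (n - 1)                 # divide by df = n - 1
s = sqrt(variance)

print(f"mean = {x_bar:.1f}, sum of squares = {sum_sq:.1f}, df = {n - 1}")
print(f"variance = {variance:.2f}, s = {s:.2f}")
```

The script carries the unrounded mean through the calculation, which is why it does not need the rounded column entries to reach the same 85.2 and 2.56.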
Choosing among summary statistics
• Because the mean is not
resistant to outliers or skew, use
it to describe distributions that
are fairly symmetrical and don’t
have outliers.
→ Plot the mean and use the
standard deviation for error bars.
• Otherwise, use the median in the
five-number summary, which can
be plotted as a boxplot.
[Figure: height of 30 women (height in inches, 58 to 69), shown as a
boxplot and as mean ± s.d.]
Example 1
Suppose a sample of twelve lab rats is found to
have the following glucose levels:
3 4 4 6 6 6 8 8 9 10 12 15
1. Find the five-number summary of the data
and construct a boxplot.
Min = 3, Q1 = 5, M = 7, Q3 = 9.5, Max = 15
2. Based on the boxplot, the data set is
a. skewed to the left
b. roughly symmetric
c. skewed to the right
Example 2
Suppose a researcher is recording fifty values in a database. Suppose she
records every value correctly except the lowest value, which is supposed to be
“2” but which she incorrectly types as “200”.
In the above scenario, the effect of the researcher’s error on the mean and
median is:
a. Her calculated mean will be lower than it would have been without the error,
but her calculated median will remain unchanged.
b. Her calculated mean will be higher than it would have been without the error,
but her calculated median will remain unchanged.
c. Her calculated mean will remain unchanged, but her calculated median will be
lower than it would have been without the error.
d. Her calculated mean will remain unchanged, but her calculated median will be
higher than it would have been without the error.
Example 2 (continued)
In the above scenario, the effect of the researcher’s error on standard
deviation is:
a. The error will not affect standard deviation.
b. Her calculated standard deviation will be smaller than it would have been
without the error.
c. Her calculated standard deviation will be larger than it would have been
without the error.
d. The error is likely to make the calculated standard deviation negative.
Example 3
There are three children in a room -- ages 3, 4, and 5. If a four-year-old child
enters the room, the
a. mean age and variance will stay the same.
b. mean age and variance will increase.
c. mean age will stay the same but the variance will increase.
d. mean age will stay the same but the variance will decrease.