Looking at data: Describing distributions with numbers
Exploration of Mean & Median
Go to the website of “Introduction to the
Practice of Statistics”
Click on the link to “Statistical Applets”
Select the “Mean and Median” applet
Perform exercises 1.51 and 1.52 on p.58
of your textbook.
Percentiles
Median: 50th percentile, boundary value
separating bottom and top halves of
population
xth percentile separates the bottom x%
from the top (100-x)%
First quartile (Q1): 25th percentile (marks
boundary for lower 4th of data)
Third quartile (Q3): 75th percentile
(marks boundary for upper 4th of data)
Inter-quartile range (IQR)
IQR = Q3 – Q1
Answers the question, “How far is the
median of the top half of the data from
the median of the bottom half?”
IQR is a resistant measure
Isn’t as affected by outliers as variance or
range.
Useful for identifying outliers:
A data point is a suspected outlier if it lies more than 1.5 × IQR
above Q3 or more than 1.5 × IQR below Q1.
This method is called the 1.5 × IQR criterion.
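The 1.5 × IQR criterion can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's or Minitab's exact procedure: it assumes the quartiles are taken as the medians of the lower and upper halves of the sorted data (software packages often use slightly different quartile rules), and the data set and the function names `quartiles` and `iqr_outliers` are made up for the example.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: when n is odd, the overall median is left out of both halves.
    """
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two observations")
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])

def iqr_outliers(values):
    """Return the IQR, the two cutoffs, and the points flagged as outliers."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low_cutoff = q1 - 1.5 * iqr    # anything below this is flagged
    high_cutoff = q3 + 1.5 * iqr   # anything above this is flagged
    flagged = [x for x in values if x < low_cutoff or x > high_cutoff]
    return iqr, (low_cutoff, high_cutoff), flagged

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]   # hypothetical data
iqr, cutoffs, flagged = iqr_outliers(data)
print("IQR:", iqr)
print("cutoffs:", cutoffs)
print("flagged as outliers:", flagged)
```

Here Q1 = 3 and Q3 = 8, so the IQR is 5 and only the value 30 lies beyond Q3 + 1.5 × IQR = 15.5.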
Five-number summary
Consists of the minimum, Q1,
median, Q3, and maximum.
These numbers give you an idea of
center and spread, though of course
they can’t give the full picture.
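As a rough sketch, the five-number summary can be computed as follows. The function name `five_number_summary` and the data are hypothetical, and the quartiles are assumed to be the medians of the lower and upper halves of the sorted data; other quartile conventions can give slightly different values for Q1 and Q3.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    with the overall median excluded from both halves when n is odd.
    """
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two observations")
    half = len(ordered) // 2
    return (
        ordered[0],
        median(ordered[:half]),
        median(ordered),
        median(ordered[-half:]),
        ordered[-1],
    )

data = [12, 7, 3, 15, 9, 21, 5]   # hypothetical data
print(five_number_summary(data))
```

For this data set the sorted values are 3, 5, 7, 9, 12, 15, 21, so the summary is minimum 3, Q1 5, median 9, Q3 15, maximum 21.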
Boxplot
Represents the 5-number summary
graphically.
A box spans the IQR, with a line in the
middle marking the median.
Lines leave the box from both sides,
going out to the minimum and maximum.
Modified Boxplot (what Minitab creates)
Lines only extend to the smallest and largest
observations that are not outliers.
Outliers are marked separately beyond the
endpoints of these lines.
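The difference between the two kinds of boxplot comes down to where the lines end. The sketch below computes the pieces a modified boxplot would draw, without doing any plotting. It is only an illustration: the helper name `modified_boxplot_parts` and the data are made up, quartiles are assumed to be medians of the lower and upper halves, and Minitab's own quartile rule may differ slightly.

```python
from statistics import median

def modified_boxplot_parts(values):
    """Return the box edges, line endpoints, and outliers of a modified boxplot."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two observations")
    half = len(ordered) // 2
    q1, q3 = median(ordered[:half]), median(ordered[-half:])
    reach = 1.5 * (q3 - q1)
    # Observations that are not outliers under the 1.5 x IQR criterion.
    inside = [x for x in ordered if q1 - reach <= x <= q3 + reach]
    outliers = [x for x in ordered if x < q1 - reach or x > q3 + reach]
    return {
        "box": (q1, median(ordered), q3),
        "line_low": inside[0],     # smallest observation that is not an outlier
        "line_high": inside[-1],   # largest observation that is not an outlier
        "outliers": outliers,      # plotted as separate points
    }

data = [-15, 4, 5, 6, 7, 8, 9, 10, 11, 40]   # hypothetical data
parts = modified_boxplot_parts(data)
for name, value in parts.items():
    print(name, value)
```

A plain boxplot of the same data would run its lines all the way out to -15 and 40; the modified version stops them at 4 and 11 and marks -15 and 40 individually.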
[Figure: Boxplots of Time to Detection of Cancer – Simulation Study. Two boxplots, labeled Exponential and Gompertzian; axis: Years, 0 to 20.]
Sample variance ($s^2$)
Is almost the average squared distance from the mean,
but divide by n - 1 instead of n:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
Standard deviation (s or SD) is the
square root of variance.
Very sensitive to outliers, because of:
Squaring of the deviations.
Use of the mean.
$s^2$ and s can’t be negative!
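The definition can be checked with a short Python sketch. The function names `sample_variance` and `sample_sd` and both data sets are invented for illustration; Python's standard `statistics.variance` uses the same n - 1 divisor and is included only as a cross-check.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_sd(values):
    """Standard deviation: the square root of the sample variance."""
    return math.sqrt(sample_variance(values))

data = [2, 4, 4, 6, 9]            # hypothetical data
with_outlier = [2, 4, 4, 6, 49]   # same data, with one value made extreme

print(sample_variance(data), sample_sd(data))
print(sample_variance(with_outlier), sample_sd(with_outlier))
print(math.isclose(sample_variance(data), statistics.variance(data)))
```

Changing a single observation from 9 to 49 moves the variance from 7 to 407, which shows how strongly the squared deviations and the mean react to one outlier.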
Exercise 1.65 from book
This is a standard deviation contest. You
must choose four numbers from the
whole numbers 0 to 10, with repeats
allowed.
a) Choose four numbers that have the
smallest possible standard deviation.
b) Choose four numbers that have the
largest possible standard deviation.
c) Is more than one choice possible in
either (a) or (b)? Explain.
Linear transformations
What does it mean to say that a
variable Y is a linear transformation
of a variable X?
It means that you can write Y as a
function of X in this format:
Y = a + bX (a and b are constants)
Effects of linear
transformations
Let’s say we know the mean and s
of a variable X.
We perform a linear transformation
of the variable; each of the old X
values changes according to a+bX.
How are the mean and s of Y related
to those of X?
Average Monthly Temperature, NYC
Month   J   F   M   A   M   J   J   A   S   O   N   D
°F     32  33  41  52  62  72  77  75  68  58  47  35
Effects of linear
transformations (cont.)
$\bar{Y} = a + b\bar{X}$. This is also true for the
median and, when b > 0, the percentiles (a negative b reverses their order).
SD(Y) = |b| SD(X). This is also true
for IQR.
Basic shape of the distribution
doesn’t change due to the linear
transformation.
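These rules can be tried out on the NYC monthly temperatures. The sketch below assumes the transformation of interest is the Fahrenheit-to-Celsius conversion, C = (5/9)(F - 32), which has the form a + bX with a = -160/9 and b = 5/9; the slides do not say which transformation was used with this table, so that choice is an assumption. A second, made-up transformation with a negative slope is included to show why the absolute value of b appears in the SD rule.

```python
import math
from statistics import mean, median, stdev

# Average monthly temperatures for NYC in degrees Fahrenheit (January to December).
temps_f = [32, 33, 41, 52, 62, 72, 77, 75, 68, 58, 47, 35]

# Fahrenheit to Celsius written as Y = a + bX (assumed example transformation).
a, b = -160 / 9, 5 / 9
temps_c = [a + b * x for x in temps_f]

print("mean rule holds:  ", math.isclose(mean(temps_c), a + b * mean(temps_f)))
print("median rule holds:", math.isclose(median(temps_c), a + b * median(temps_f)))
print("SD rule holds:    ", math.isclose(stdev(temps_c), abs(b) * stdev(temps_f)))

# A hypothetical transformation with a negative slope: Y = 100 - 2X.
flipped = [100 - 2 * x for x in temps_f]
print("SD with b = -2:   ", math.isclose(stdev(flipped), 2 * stdev(temps_f)))
```

Each line prints True up to floating-point rounding, which is why the comparisons use `math.isclose` rather than exact equality.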