Section 3.1: Elementary Graphical Treatment of Data

Ch3 Elementary Descriptive Statistics
Before doing ANYTHING with data:
• Understand the question.
– An approximate answer to the exact question is
always better than an exact answer to an
approximate question. John Tukey.
• Know how the experiment was conducted.
The FIRST thing to do with the data is to
PLOT THE DATA
– Plot all individual points.
– If there are connections between points, e.g.
points are from same pairs (or sometimes
separate blocks), show connections between
related points.
Plotting data is an extremely important step.
• More often than not, the data I get when
consulting have problems, such as incorrect values
or attributes no one told me about.
• Plotting helps reveal relationships and
answers.
• Plotting is a very effective way to present
results.
– “A picture is worth a thousand words.”
Example:
8 lb. test fishing line question: Which type(s) of line are strongest?
Listing numerical data
Trilene XL   11.5 11.3 11.7 11.6 11.7 11.4 11.5 11.5 11.6 11.4
Trilene XT   11.6 11.8 11.7 11.7 11.5 11.6 11.6 11.8 11.5 11.7
Stren        11.1 11.1 11.2 11.0 11.1 11.3 11.2 10.9 11.0 11.1
It’s hard to see what’s happening without organizing the data.
A "dot" diagram
        XL      XT      Stren
11.8            **
11.7    **      ***
11.6    **      ***
11.5    ***     **
11.4    **
11.3    *               *
11.2                    **
11.1                    ****
11.0                    **
10.9                    *
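The dot diagram can be rebuilt directly from the listed breaking strengths. The sketch below is one way to do it in plain Python; the function name dot_diagram and the column widths are illustrative choices, not part of the original slides.

```python
from collections import Counter

# Breaking strengths copied from the fishing line listing.
lines = {
    "XL":    [11.5, 11.3, 11.7, 11.6, 11.7, 11.4, 11.5, 11.5, 11.6, 11.4],
    "XT":    [11.6, 11.8, 11.7, 11.7, 11.5, 11.6, 11.6, 11.8, 11.5, 11.7],
    "Stren": [11.1, 11.1, 11.2, 11.0, 11.1, 11.3, 11.2, 10.9, 11.0, 11.1],
}

def dot_diagram(groups):
    """Return text rows: one row per observed value, one star per observation."""
    counts = {name: Counter(vals) for name, vals in groups.items()}
    # Every distinct value seen in any group, largest first.
    levels = sorted({v for vals in groups.values() for v in vals}, reverse=True)
    rows = ["      " + "".join(f"{name:<8}" for name in groups)]
    for level in levels:
        stars = "".join(f"{'*' * counts[name][level]:<8}" for name in groups)
        rows.append(f"{level:<6.1f}" + stars)
    return [row.rstrip() for row in rows]

print("\n".join(dot_diagram(lines)))
```

Every individual point stays visible, which is the purpose of the plot: both Trilene lines sit above all of the Stren values.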
Stem and leaf plot
It shows the distribution shape and at the
same time preserves the original values.
In the gears’ runouts example, for the gears
hung group, we have data points of
7, 8, 8, 10, 10, 10, 10, 11, 11, 11, 12, 13…
A stem and leaf plot is
0 | 788
1 | 000011123
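A stem and leaf plot is easy to build by hand or in code. The sketch below assumes integer data with tens digits as stems and ones digits as leaves, and uses only the runout values listed above (the listing is truncated, so the full data set is not reproduced). The name stem_and_leaf is illustrative.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Split each integer into a tens stem and a ones leaf; keep every stem in range."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    stems = range(min(leaves), max(leaves) + 1)
    return [f"{s} | {''.join(str(d) for d in leaves[s])}" for s in stems]

# Only the gears-hung runouts that appear in the listing above.
runouts = [7, 8, 8, 10, 10, 10, 10, 11, 11, 11, 12, 13]
print("\n".join(stem_and_leaf(runouts)))
```

Because each leaf is a digit of an original value, the data can be read back from the plot, unlike a histogram.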
Two groups can be compared with back to
back stem and leaf diagrams
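E.g. Stopping distances of bikes
[Back-to-back stem and leaf diagram: stems 34 to 40 down the middle, treaded tire leaves on the left (5, 64, 1, 20), smooth tire leaves on the right (189, 5, 5, 1). The stem belonging to each group of leaves is not recoverable here.]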
Or dot diagrams
[Figure: two dot diagrams of stopping distance on a common axis from 340 to 400, one for treaded tires and one for smooth tires.]
When there are associations between sets of data values,
plot the data accordingly.
E.g., Snowfall for Duluth and White Bear Lake, 1972-2000
A not very good way to plot the data
[Figure: back-to-back dot diagram of snowfall totals on a shared scale from 20 to 130, WB Lake on the left and Duluth on the right.]
Snowfall plot
[Figure: snow_total (0 to 140) versus year (1972 to 1997), with one series for Duluth and one for White Bear.]
A study of trace metals in South Indian River
T=top water zinc concentration (mg/L)
B=bottom water zinc (mg/L)

Site   Top     Bottom
1      0.415   0.430
2      0.238   0.266
3      0.390   0.567
4      0.410   0.531
5      0.605   0.707
6      0.609   0.716
• One of the first things to do when analyzing data is
to PLOT the data
[Figure: Zinc (0 to 0.8) plotted separately for Top and Bottom.]
• This is not a useful way to plot the data. There is no
clear distinction between bottom water and top
water zinc, even though Bottom > Top at all 6
locations.
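The pairing can be checked numerically. The sketch below uses the six zinc pairs from the table; the variable names are illustrative. It shows why the unpaired plot hides the effect: the two groups overlap as a whole, yet every within-site difference is positive.

```python
# Zinc concentrations (mg/L) at sites 1 to 6, copied from the table.
top    = [0.415, 0.238, 0.390, 0.410, 0.605, 0.609]
bottom = [0.430, 0.266, 0.567, 0.531, 0.707, 0.716]

# Within-site differences: this is the comparison the pairing supports.
diffs = [round(b - t, 3) for t, b in zip(top, bottom)]
for site, d in enumerate(diffs, start=1):
    print(f"site {site}: bottom - top = {d:+.3f} mg/L")

all_positive = all(d > 0 for d in diffs)      # Bottom > Top at every site
groups_overlap = min(bottom) < max(top)       # yet the two groups overlap
print("Bottom > Top at every site:", all_positive)
print("Top and Bottom ranges overlap:", groups_overlap)
```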
A better way
Connect points in the same pair.
[Figure: Zinc (0.2 to 0.7) for Top and Bottom, with a line joining the two points from each site.]
Another way (scatter plot)
[Figure: scatter plot of top and bottom zinc, both axes 0 to 0.8, with the reference line Bottom=Top.]
• The following plot would imply a natural ordering of
sites from 1 to 6.
[Figure: Zinc (0 to 0.8) versus Site (1 to 6), with separate series for Top and Bottom.]
• This would not be the best way to plot the data unless
the sites 1-6 correspond to a natural ordering such as
distance downstream of a factory.
Run charts (a version of scatter plot)
• The variable on the x axis is a time variable.
• Table: 30 consecutive outer diameters turned on a
lathe
Joint   Diameter (inches above nominal)     Joint   Diameter (inches above nominal)
 1      -0.005                              16       0.015
 2       0                                  17       0
 3      -0.01                               18       0
 4      -0.03                               19      -0.015
 5      -0.01                               20      -0.015
 6      -0.025                              21      -0.015
 7      -0.03                               22      -0.015
 8      -0.035                              23      -0.015
 9      -0.025                              24      -0.01
10      -0.025                              25      -0.015
11      -0.025                              26      -0.035
12      -0.035                              27      -0.025
13      -0.04                               28      -0.02
14      -0.035                              29      -0.025
15      -0.035                              30      -0.015
Moving through time, the outer diameters tend to get smaller until
part 16, where there is a large jump upward, followed again by a pattern of
diameters generally decreasing in time.
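The jump at part 16 can also be found by looking at the change from each part to the next. The sketch below uses the 30 diameters from the table and prints a crude text run chart; the variable names and the 0.005 inch plotting step are illustrative choices.

```python
# Diameters (inches above nominal) for parts 1 to 30, in production order.
diameters = [
    -0.005, 0.0, -0.01, -0.03, -0.01, -0.025, -0.03, -0.035, -0.025, -0.025,
    -0.025, -0.035, -0.04, -0.035, -0.035, 0.015, 0.0, 0.0, -0.015, -0.015,
    -0.015, -0.015, -0.015, -0.01, -0.015, -0.035, -0.025, -0.02, -0.025, -0.015,
]

# changes[k] is the step from part k + 1 to part k + 2.
changes = [b - a for a, b in zip(diameters, diameters[1:])]
biggest = max(range(len(changes)), key=lambda k: abs(changes[k]))
jump_part = biggest + 2
print(f"largest change: {changes[biggest]:+.3f} in. going into part {jump_part}")

# Crude run chart: one row per part, star position proportional to diameter.
for part, d in enumerate(diameters, start=1):
    offset = round((d + 0.04) / 0.005)   # -0.04 maps to column 0
    print(f"{part:>2} {' ' * offset}*")
```

A run chart keeps the time order that a dot diagram or histogram of the same 30 values would throw away.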
[Figure: run chart of Diameter (inches above nominal), from -0.05 to 0.02, versus part number, from 0 to 35.]
Section 3.2: Quantiles and Related
Graphical Tools
Quantile:
Roughly speaking, for a number p between 0 and
1, the p quantile of a distribution is a number
such that a fraction p of the distribution lies to the
left and a fraction 1-p of the distribution lies to
the right.
p quantile = 100*p th percentile
Q(0.10) = 0.10 quantile = 10th percentile
Q(0.50) = 0.50 quantile = 50th percentile = median
Q(0.25) = 0.25 quantile = 25th percentile = first quartile
Q(0.75) = 0.75 quantile = 75th percentile = third quartile
The p quantile is the ordered data point with index i = n*p + 0.5.
So the cumulative probability corresponding to the i th ordered point is p = (i - 0.5)/n.
Consider the following n=10 points
Q(0.25) = 0.25 quantile = 8572
Q(0.50) = median =
.
Q(0.75) = 9614
IQR = Interquartile Range = Q(0.75) - Q(0.25) = 9614 - 8572 = 1042
To find the 93rd percentile:
0.93 is 0.8 of the way from 0.85 to 0.95, since (0.93 - 0.85)/(0.95 - 0.85) = 0.8.
So Q(0.93) is 0.8 of the way from Q(0.85) to Q(0.95):
Q(0.93) = Q(0.85) + 0.8(Q(0.95) - Q(0.85))
= 0.2*Q(0.85) + 0.8*Q(0.95)
= 0.2(9614) + 0.8(10,688)
= 10,473.2, or about 10,473.
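The interpolation rule can be written as a small function. The sketch below assumes the convention that the i th ordered point is Q((i - 0.5)/n) and interpolates linearly in between; clamping to the smallest or largest point outside that range is an added assumption. The function name quantile is illustrative, and since the n=10 data set for this example is not reproduced here, the demonstration borrows the ten 950-stress spring lifetimes from the normal plot example in Section 3.2.3.

```python
import math

def quantile(values, p):
    """Q(p) where the i-th ordered point (1-based) is Q((i - 0.5)/n)."""
    xs = sorted(values)
    n = len(xs)
    i = n * p + 0.5                 # possibly fractional 1-based index
    if i <= 1:                      # assumption: clamp below the first point
        return xs[0]
    if i >= n:                      # assumption: clamp above the last point
        return xs[-1]
    lower = math.floor(i)
    frac = i - lower                # how far along toward the next point
    return xs[lower - 1] + frac * (xs[lower] - xs[lower - 1])

# The arithmetic from the 93rd percentile example, using the two quoted quantiles.
q93_slide = 0.2 * 9614 + 0.8 * 10688
print(f"Q(0.93) from the example: {q93_slide:.1f}")

# Same rule on the 950-stress spring lifetimes.
lifetimes_950 = [117, 135, 135, 162, 162, 171, 189, 189, 198, 225]
q1, med, q3 = (quantile(lifetimes_950, p) for p in (0.25, 0.50, 0.75))
print(f"Q(0.25)={q1}  median={med}  Q(0.75)={q3}  IQR={q3 - q1}")
print(f"Q(0.93)={quantile(lifetimes_950, 0.93):.1f}")
```

Other software uses slightly different quantile conventions, so small disagreements with a package's output are expected.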
• Boxplots are useful summaries, particularly when
there are too many points for a dot plot.
• To make a boxplot, we need essentially 5 numbers.
Section 3.2.3 Q-Q Plots and Comparing
Distributional Shapes
• Most of the statistical tools we will use in this
class assume normal distributions (a bell
shaped distribution for the population of
possible values).
• In order to know if these are the right tools for
a particular job, we need to be able to assess
if the data appear to have come from a
normal population.
• With large amounts of data, one can draw a
histogram of the measured values and see if it
is bell-shaped.
• A normal plot is a method for assessing
normality that works well with big or small
data sets. It gives a good visual check for
normality.
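Simulation: 100 observations,
normal with mean=5, st dev=1
# draw 100 values from a normal distribution with mean 5 and sd 1
x <- rnorm(100, mean=5, sd=1)
# normal plot: ordered x values against standard normal quantiles
qqnorm(x)
[Figure: normal plot of x (about 2 to 8) versus Quantiles of Standard Normal (-2 to 2).]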
• A normal plot is a plot of the data in a way
such that data from normal populations will
come out pretty much in a straight line.
• We plot the corresponding quantiles of a
"standard normal" distribution versus the ordered
y values.
In other words
In order to plot the data and check for
normality, we compare
• our observed data to
• what we would expect from a sample of
standard normal data.
A standard normal distribution is a normal distribution with
• mean μ = 0
• standard deviation σ = 1.
Any normal population can be thought of as a rescaled
standard normal population.
For example if Z is standard normal, then
• 100 + 5Z will have
• μ = 100 and σ = 5.
Multiplying all values by 5 multiplies the standard deviation by 5.
Adding 100 to every number adds 100 to the mean.
• So if we plot ordered values from a normal
population against corresponding quantiles of
a standard normal population, we expect to
get a reasonably straight line, since any
normal distribution is linearly related to the
standard normal distribution.
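The linear relationship can be checked directly on quantiles. The sketch below uses Python's standard library class statistics.NormalDist and the 100 + 5Z example; the list of probabilities is an arbitrary illustrative choice.

```python
from statistics import NormalDist

z = NormalDist()                    # standard normal: mean 0, standard deviation 1
y = NormalDist(mu=100, sigma=5)     # the rescaled population 100 + 5Z

for p in (0.05, 0.25, 0.50, 0.75, 0.95):
    zq = z.inv_cdf(p)               # p quantile of the standard normal
    yq = y.inv_cdf(p)               # p quantile of the rescaled normal
    print(f"p={p:.2f}  z={zq:+.3f}  y={yq:.3f}  100+5z={100 + 5 * zq:.3f}")
```

The last two columns agree, so plotting one set of quantiles against the other gives an exact straight line with intercept 100 and slope 5. Real data from a normal population scatter around such a line.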
In Excel, normal quantiles can be found with the NORMINV function.
NORMDIST finds the probability of being less than a particular value.
NORMINV is the inverse function: it finds the value that has a given
probability of being less than it.
A cell is assigned, for example, the formula
• = NORMINV(C3, 0, 1)
• The 0, 1 indicates μ = 0 and σ = 1
   o A standard normal quantile
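For readers without Excel, the same pair of calculations can be sketched with Python's statistics.NormalDist. Here inv_cdf plays the role of NORMINV and cdf plays the role of the cumulative form of NORMDIST; the probability 0.05 is just an illustrative input.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

p = 0.05
q = std_normal.inv_cdf(p)        # comparable to =NORMINV(0.05, 0, 1)
back = std_normal.cdf(q)         # comparable to =NORMDIST(q, 0, 1, TRUE)

print(f"quantile for p={p}: {q:.3f}")
print(f"probability below that quantile: {back:.3f}")
```

Applying one function and then the other returns the starting probability, which is what "inverse function" means here.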
The textbook plots the
• standard normal quantiles on the vertical axis and
• the ordered data points on the horizontal axis.
Many software packages and other books plot the
standard normal quantiles on the horizontal axis and
the ordered data points on the vertical axis.
Either way, the plot should look "fairly" straight if the
data are from a normal distribution.
Here are ordered lifetimes of springs under 2 levels of stress. (page 379)
n = 10

i    (i-0.5)/n   Normal Quantile   950 stress Lifetime   900 stress Lifetime
1    0.05        -1.645            117                   153
2    0.15        -1.036            135                   162
3    0.25        -0.674            135                   189
4    0.35        -0.385            162                   216
5    0.45        -0.126            162                   216
6    0.55         0.126            171                   216
7    0.65         0.385            189                   225
8    0.75         0.674            189                   225
9    0.85         1.036            198                   243
10   0.95         1.645            225                   306
Since n=10 for both sets, the corresponding normal quantiles are the same for both sets.
To construct normal plots for these two data sets, we plot
• each ordered data set versus
• the standard normal quantiles from Excel.
[Figure: normal plots of Life-length (50 to 350) versus Normal Quantiles (-2 to 2) for 950 stress and 900 stress.]
Since both plots are fairly straight, these data are fairly normal.
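The coordinates for these two normal plots can be computed without a spreadsheet. The sketch below pairs each ordered lifetime with the standard normal quantile at (i - 0.5)/n, following the table; the function name normal_plot_points is illustrative, and the plotting itself is left to whatever tool is at hand.

```python
from statistics import NormalDist

stress_950 = [117, 135, 135, 162, 162, 171, 189, 189, 198, 225]
stress_900 = [153, 162, 189, 216, 216, 216, 225, 225, 243, 306]

def normal_plot_points(data):
    """Pair the i-th ordered value with the standard normal quantile at (i - 0.5)/n."""
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist()
    return [(std_normal.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(xs, start=1)]

points_950 = normal_plot_points(stress_950)
points_900 = normal_plot_points(stress_900)

print("quantile   950   900")
for (q, life_950), (_, life_900) in zip(points_950, points_900):
    print(f"{q:+.3f}    {life_950:>4}  {life_900:>4}")
```

Plotting the first column against either of the other two gives the normal plot for that stress level.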
Excel File of Lifetime of Springs Data
n = 10

i    (i-0.5)/n   Normal Quantile   E(Z)     Ordered 900 stress   Ordered 950 stress
1    0.05        -1.645            -1.539   153                  117
2    0.15        -1.036            -1.001   162                  135
3    0.25        -0.674            -0.656   189                  135
4    0.35        -0.385            -0.376   216                  162
5    0.45        -0.126            -0.123   216                  162
6    0.55         0.126             0.123   216                  171
7    0.65         0.385             0.376   225                  189
8    0.75         0.674             0.656   225                  189
9    0.85         1.036             1.001   243                  198
10   0.95         1.645             1.539   306                  225
Section 3.3: Numerical Summaries
Measures of Location: Around what value are the data
spread?
Median = Q(0.50) = 50th percentile.
Sample mean = arithmetic mean = average: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
The mean is more affected by unusual values
than the median.
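A quick way to see the difference in sensitivity is to corrupt one value and recompute both summaries. The sketch below uses the ten 900-stress spring lifetimes from Section 3.2.3; the corrupted value 3060 is a hypothetical recording error invented for illustration, not part of the data.

```python
from statistics import mean, median

lifetimes = [153, 162, 189, 216, 216, 216, 225, 225, 243, 306]

# Hypothetical recording error: the largest value 306 typed as 3060.
with_error = lifetimes[:-1] + [3060]

print(f"original:   mean={mean(lifetimes):.1f}  median={median(lifetimes):.1f}")
print(f"with error: mean={mean(with_error):.1f}  median={median(with_error):.1f}")
```

The single bad value more than doubles the mean but leaves the median where it was.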
Measures of Spread:
• R = Range = Biggest – Smallest
• The size of the range can be affected by how
many values we have. A larger set of numbers will tend to
have a larger range than a smaller set.
• IQR = Interquartile Range = Q(0.75) - Q(0.25)
The range that includes the middle half of the values.
• Sample variance = $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
Essentially an average squared deviation from
the mean.
• Sample standard deviation = $s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
Example: x_1 = 8, x_2 = 9, x_3 = 4
$\bar{x} = \frac{8 + 9 + 4}{3} = 7$
$s^2 = \frac{(8-7)^2 + (9-7)^2 + (4-7)^2}{3-1} = \frac{1 + 4 + 9}{2} = 7$
$s = \sqrt{7} \approx 2.65$
Statistics and Parameters
A statistic is a numerical summary of the
sample data.
$\bar{x}$ = sample mean
$s^2$ = sample variance
A parameter is a summary of an entire
population or a theoretical distribution, for
example a normal distribution.
μ = population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
σ² = population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$
Average squared deviation from the mean.
σ = population standard deviation: $\sigma = \sqrt{\sigma^2}$
• For a sample of size n, the sample variance is
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
• Why divide by n - 1? This makes s² an
unbiased estimator of σ². Unbiased means correct on
average.
Suppose we have a large population of ball bearings with diameters
having μ = 1 cm and σ = 0.02, so σ² = 0.0004.

Sample    Mean    s²
1         0.98    0.00032
2         1.03    0.00031
3         1.01    0.00045
4         1.02    0.00052
.         .       .
.         .       .
∞
------------------------
Average   1.00    0.0004

If we knew μ we would find
$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$
Fact: $\min_m \sum (x_i - m)^2 = \sum (x_i - \bar{x})^2$
So $\sum (x_i - \bar{x})^2 \le \sum (x_i - \mu)^2$ and $\frac{\sum (x_i - \bar{x})^2}{n}$
would be too small for σ².
Dividing by n-1 makes s² come out right (σ²) on average.
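A small simulation illustrates the "on average" claim. The sketch below draws many samples from a normal population with μ = 1 and σ = 0.02, as in the ball bearing example. The sample size n = 5, the number of repetitions, and the random seed are assumptions chosen for illustration; the slides do not state them.

```python
import random

random.seed(0)                      # fixed seed so the run is repeatable
mu, sigma = 1.0, 0.02               # population from the ball bearing example
n, reps = 5, 20000                  # assumed sample size and number of samples

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean."""
    xbar = sum(sample) / len(sample)
    return sum((x - xbar) ** 2 for x in sample)

total = 0.0
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    total += sum_sq_dev(sample)

avg_div_n_minus_1 = total / (reps * (n - 1))   # average of s^2 over all samples
avg_div_n = total / (reps * n)                 # same sums divided by n instead

print(f"sigma^2               = {sigma ** 2:.6f}")
print(f"average with n - 1    = {avg_div_n_minus_1:.6f}")
print(f"average with n        = {avg_div_n:.6f}")
```

With the n divisor the long-run average settles near (n - 1)/n times σ², which is too small; the n - 1 divisor removes that shortfall.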
Notice that s² is undefined if n=1; we can't
divide by zero.
This makes sense.
If we have only one number, that number
tells us nothing about potential spread in the
population.
Plotting summary statistics over time is useful
for issues such as quality control.
Read section 3.3.4 for general information.