Statistical Analysis - Graphical Techniques

Systems Engineering Program
Department of Engineering Management, Information and Systems
EMIS 7370/5370 STAT 5340:
PROBABILITY AND STATISTICS FOR SCIENTISTS AND ENGINEERS
Statistical Analysis - Graphical Techniques
Dr. Jerrell T. Stracener, SAE Fellow
Leadership in Engineering
• Time Series Graph or Run Chart
• Box Plot
• Histogram and Relative Frequency Histogram
• Frequency Distribution
• Probability Plotting
Time Series Graph or Run Chart
• A plot of the data set x1, x2, …, xn in the order
in which the data were obtained
• Used to detect trends or patterns in the data
over time
Box Plot
• A pictorial summary used to describe the
most prominent statistical features of the data
set, x1, x2, …, xn, including its:
- Center or location
- Spread or variability
- Extent and nature of any deviation from symmetry
- Identification of ‘outliers’
Box Plot
• Shows only certain statistics rather than all the
data, namely
- median
- quartiles
- smallest and greatest values in the sample
• Immediate visuals of a box plot are the center,
the spread, and the overall range of the data
Box Plot
Given the following random sample of size 25:
38, 10, 60, 90, 88, 96, 1, 41, 86, 14, 25, 5, 16,
22, 29, 34, 55, 36, 37, 36, 91, 47, 43, 30, 98
Arranged in order from least to greatest:
1, 5, 10, 14, 16, 22, 25, 29, 30, 34, 36, 36, 37,
38, 41, 43, 47, 55, 60, 86, 88, 90, 91, 96, 98
Box Plot
• First, find the median, the value exactly in the
middle of the ordered data. With n = 25 values, the
median is the 13th ordered value.
The median is 37.
• Next, consider only the 12 values to the left of
the median:
1, 5, 10, 14, 16, 22, 25, 29, 30, 34, 36, 36
The median of this group is (22 + 25)/2 = 23.5,
which is the lower quartile.
Box Plot
• Now consider the 12 values to the right of the
median:
38, 41, 43, 47, 55, 60, 86, 88, 90, 91, 96, 98
The median of this group is (60 + 86)/2 = 73, which
is the upper quartile.
• The interquartile range (IQR) is the difference
between the upper and lower quartiles:
IQR = 73 - 23.5 = 49.5
Box Plot
The lower quartile is 23.5
The median is 37
The upper quartile is 73
The interquartile range is 49.5
The mean is 45.1
[Figure: box plot of the sample on a horizontal axis from 0 to 100, with the lower extreme, lower quartile, median, mean, upper quartile, and upper extreme labeled]
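The box plot statistics above can be reproduced with a short script. This is a minimal sketch, assuming the median-of-halves rule used on the slides (the median itself is excluded from both halves when n is odd); the function name `box_plot_summary` is hypothetical. Other quartile conventions, such as those in Python's `statistics.quantiles`, can give slightly different values.

```python
from statistics import mean, median

# Random sample of size 25 from the slides
sample = [38, 10, 60, 90, 88, 96, 1, 41, 86, 14, 25, 5, 16,
          22, 29, 34, 55, 36, 37, 36, 91, 47, 43, 30, 98]

def box_plot_summary(data):
    """Box plot statistics using the median-of-halves quartile rule."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                              # values left of the median
    upper = xs[half + 1:] if n % 2 else xs[half:]  # values right of the median
    q1, q3 = median(lower), median(upper)
    return {
        "min": xs[0],
        "q1": q1,
        "median": median(xs),
        "mean": mean(xs),
        "q3": q3,
        "max": xs[-1],
        "iqr": q3 - q1,
    }

summary = box_plot_summary(sample)
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```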
Histogram
A graph of the observed frequencies in the data
set x1, x2, …, xn versus data magnitude, used to
visually indicate its statistical properties, including
- shape
- location or central tendency
- scatter or variability
Guidelines for Constructing Histograms – Discrete Data
• If the data x1, x2, …, xn are from a discrete
random variable with possible values y1, y2, …, yk,
count the number of occurrences of each value
of y and associate the frequency fi with yi,
for i = 1, …, k.
Note that
$$\sum_{i=1}^{k} f_i = n$$
Guidelines for Constructing Histograms – Continuous Data
• If the data x1, x2, …, xn are from a continuous
random variable
- select the number of intervals or cells, r,
to be a number between 3 and 20; as an
initial value use r = √n, where n is the
number of observations
- establish r intervals of equal width, starting
just below the smallest value of x
- count the number of values of x within
each interval to obtain the frequency fi
associated with interval i
- construct the graph by plotting the frequency fi
against interval i, for i = 1, 2, …, r
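The guidelines for continuous data can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed procedure: the function name `equal_width_frequencies` and the `pad` parameter (how far "just below the smallest value" the first interval starts) are hypothetical, and the data are the 25 values from the box plot example.

```python
import math

# Sample of size 25 from the box plot example
sample = [38, 10, 60, 90, 88, 96, 1, 41, 86, 14, 25, 5, 16,
          22, 29, 34, 55, 36, 37, 36, 91, 47, 43, 30, 98]

def equal_width_frequencies(data, pad=0.5):
    """Return (interval edges, frequencies) for r equal-width intervals.

    r starts at sqrt(n) and is kept between 3 and 20, per the guidelines.
    `pad` is an assumed margin so the first interval starts just below
    the smallest value and the last ends just above the largest.
    """
    n = len(data)
    r = min(20, max(3, round(math.sqrt(n))))
    lo = min(data) - pad
    hi = max(data) + pad
    width = (hi - lo) / r
    counts = [0] * r
    for x in data:
        i = min(int((x - lo) / width), r - 1)  # guard the upper edge
        counts[i] += 1
    edges = [lo + j * width for j in range(r + 1)]
    return edges, counts

edges, counts = equal_width_frequencies(sample)
for j, f in enumerate(counts):
    print(f"[{edges[j]:6.1f}, {edges[j + 1]:6.1f})  f = {f}")
```

The frequencies always sum to n, which is a quick check that no observation fell outside the intervals.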
Histogram and Relative Frequency Example
To illustrate the construction of a relative frequency distribution,
consider the following data which represent the lives of 40 car
batteries of a given type recorded to the nearest tenth of a year.
The batteries were guaranteed to last 3 years.
Car Battery Lives (years)
2.2  3.4  2.5  3.3  4.7  4.1  1.6  4.3
3.1  3.8  3.5  3.1  3.4  3.7  3.2  4.5
3.2  3.3  3.8  3.6  2.9  4.4  3.2  2.6
3.9  3.7  3.1  3.3  4.1  3.0  3.0  4.7
3.9  1.9  4.2  2.6  3.7  3.1  3.4  3.5
Histogram and Relative Frequency Example
For this example, using the guidelines for constructing a histogram,
the number of classes selected is 7 with a class width of 0.5. The
frequency and relative frequency distribution for the data are shown
in the following table.
Relative Frequency Distribution of Battery Lives

Class interval   Class midpoint   Frequency, f   Relative frequency
1.5-1.9          1.7              2              0.050
2.0-2.4          2.2              1              0.025
2.5-2.9          2.7              4              0.100
3.0-3.4          3.2              15             0.375
3.5-3.9          3.7              10             0.250
4.0-4.4          4.2              5              0.125
4.5-4.9          4.7              3              0.075
Total                             40             1.000
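The relative frequency distribution can be checked with a short script. This is a sketch that assumes the seven classes of width 0.5 starting at 1.5, as stated on the slide; the function name `battery_frequency_table` is hypothetical. Values are converted to integer tenths of a year so that class boundaries are not affected by floating-point rounding.

```python
# Lives of 40 car batteries, recorded to the nearest tenth of a year
battery_lives = [
    2.2, 3.4, 2.5, 3.3, 4.7, 4.1, 1.6, 4.3, 3.1, 3.8,
    3.5, 3.1, 3.4, 3.7, 3.2, 4.5, 3.2, 3.3, 3.8, 3.6,
    2.9, 4.4, 3.2, 2.6, 3.9, 3.7, 3.1, 3.3, 4.1, 3.0,
    3.0, 4.7, 3.9, 1.9, 4.2, 2.6, 3.7, 3.1, 3.4, 3.5,
]

def battery_frequency_table(data, start_tenths=15, width_tenths=5, classes=7):
    """Count observations per class; work in tenths to avoid float issues."""
    freq = [0] * classes
    for x in data:
        tenths = round(x * 10)                       # e.g. 3.4 -> 34
        freq[(tenths - start_tenths) // width_tenths] += 1
    n = len(data)
    rel = [f / n for f in freq]
    return freq, rel

freq, rel = battery_frequency_table(battery_lives)
for k, (f, r) in enumerate(zip(freq, rel)):
    low = (15 + 5 * k) / 10
    print(f"{low:.1f}-{low + 0.4:.1f}  f = {f:2d}  relative = {r:.3f}")
```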
Histogram and Relative Frequency
The following diagram is a relative frequency histogram of the battery
lives with an approximate estimate of the probability density function
superimposed.
[Figure: Relative frequency histogram. Horizontal axis: Battery Lives (years), class midpoints 1.7 to 4.7. Vertical axis: Relative Frequency, 0.000 to 0.400]
Probability Plotting
• Data are plotted on special graph paper
designed for a particular distribution
- Normal
- Weibull
- Lognormal
- Exponential
• If the assumed model is adequate, the plotted
points will tend to fall in a straight line
• If the model is inadequate, the plot will not
be linear, and the type and extent of the departures
can be seen
• Once a model appears to fit the data
reasonably well, percentiles and parameters can
be estimated from the plot
Probability Plotting General Procedure
To plot the data on probability paper, we need an
estimate of the cumulative probability F(t) for each
sample value. These estimates are obtained from what
are called median ranks.
Median ranks represent the 50% confidence level (“best
guess”) estimate for the true value of F(t), based on the
total sample size and the order number (first, second,
etc.) of the data.
Benard’s Approximation
There is an approximation that can be used to estimate
median ranks, called Benard’s approximation. It has the
form:
$$\hat{F}(x_i) = \mathrm{MR}_i = \frac{i - 0.3}{n + 0.4} \times 100\%$$
where n is the sample size and i is the sample order
number. Tables of median ranks can be found in many
statistics and reliability texts.
Probability Plotting Procedure
• Step 1: Obtain special graph paper, known as
probability paper, designed for the distribution under
examination. Weibull, Lognormal and Normal paper
are available at:
http://www.weibull.com/GPaper/index.htm
• Step 2: Rank the sample values from smallest
to largest in magnitude, i.e., X1 ≤ X2 ≤ … ≤ Xn.
Probability Plotting General Procedure
• Step 3: Plot the Xi’s on the paper versus
$$\hat{F}(x_i) = \frac{i - 0.3}{n + 0.4} \times 100\% \quad \text{or} \quad \hat{F}(x_i) = \frac{i - 0.3}{n + 0.4},$$
depending on whether the marked axis
on the paper refers to the % or the proportion
of observations. The axis of the graph paper on
which the Xi’s are plotted is referred to as
the observational scale, and the axis for
$\hat{F}(x_i)$ as the cumulative scale.
Probability Plotting General Procedure
• Step 4: If a straight line appears to fit the data,
draw a line on the graph, ‘by eye’.
• Step 5: Estimate the model parameters from
the graph.
Weibull Probability Plotting Paper
If
$$T \sim W(\beta, \theta)$$
the cumulative probability distribution function is
$$F(t) = 1 - e^{-(t/\theta)^{\beta}}$$
We now need to linearize this function into the form
y = ax + b
Weibull Probability Plotting Paper
Then
$$\ln[1 - F(t)] = \ln e^{-(t/\theta)^{\beta}}$$
$$\ln[1 - F(t)] = -\left(\frac{t}{\theta}\right)^{\beta}$$
$$\ln\{-\ln[1 - F(t)]\} = \beta \ln\left(\frac{t}{\theta}\right)$$
$$\ln \ln\left[\frac{1}{1 - F(t)}\right] = \beta \ln t - \beta \ln \theta$$
which is the equation of a straight line of the form
y = ax + b
Weibull Probability Plotting Paper
where
$$y = \ln \ln\left[\frac{1}{1 - F(t)}\right], \quad a = \beta, \quad x = \ln t, \quad \text{and} \quad b = -\beta \ln \theta,$$
i.e.,
Weibull Probability Plotting Paper
$$y = \beta x - \beta \ln \theta$$
which is a linear equation with a slope of β and an
intercept of −β ln θ. Now the x- and y-axes of the Weibull
probability plotting paper can be constructed. The x-axis
is simply logarithmic, since x = ln t, and the y-axis is scaled as
$$y = \ln \ln\left[\frac{1}{1 - F(t)}\right]$$
Weibull Probability Plotting Paper
[Figure: Weibull probability plotting paper. Vertical axis: cumulative probability (in %). Horizontal axis: x]
Probability Plotting - Example
To illustrate the process let 10, 20, 30, 40, 50, and 80 be a
random sample of size n = 6.
Probability Plotting - Example
Based on Benard’s approximation,
$$\hat{F}(x_i) = \mathrm{MR}_i = \frac{i - 0.3}{n + 0.4} \times 100\%$$
we can now calculate $\hat{F}(x_i)$ for each observed value of X.
For example, for x2 = 20,
$$\hat{F}(20) = \frac{2 - 0.3}{6 + 0.4} \times 100\% = 26.6\%$$
Probability Plotting - Example
In summary,

i    xi    F̂(xi)
1    10    10.9%
2    20    26.6%
3    30    42.2%
4    40    57.8%
5    50    73.4%
6    80    89.1%
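The median ranks in the summary table follow directly from Benard's approximation. A minimal sketch, with the hypothetical helper name `median_rank`, returning the proportion rather than the percentage:

```python
def median_rank(i, n):
    """Benard's approximation to the median rank of the i-th ordered value."""
    return (i - 0.3) / (n + 0.4)

# Sample of size n = 6 from the example, ranked smallest to largest
sample = sorted([10, 20, 30, 40, 50, 80])
n = len(sample)

for i, x in enumerate(sample, start=1):
    print(f"{i}  {x:>3}  {100 * median_rank(i, n):.1f}%")
```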
Probability Plotting - Example
Now that we have y-coordinate values to go with the
x-coordinate sample values, we can plot the
(x, F̂(x)) points on Weibull probability paper.
[Figure: sample points plotted on Weibull probability paper. Vertical axis: F̂(x) (in %). Horizontal axis: x]
Probability Plotting - Example
The line represents the estimated relationship between x
and F(x):
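[Figure: Weibull probability plot of the sample with the fitted straight line. Vertical axis: F̂(x) (in %). Horizontal axis: x]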
Probability Plotting - Example
In this example, the points on Weibull probability paper fall
in a fairly linear fashion, indicating that the Weibull
distribution provides a good fit to the data. If the points
did not seem to follow a straight line, we might want to
consider using another probability distribution to analyze
the data.
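The graphical fit can also be approximated numerically. The sketch below transforms the sample to the linearized Weibull coordinates, x = ln t and y = ln ln[1/(1 − F)], and fits a straight line by ordinary least squares. Note that least squares is swapped in here for the line drawn "by eye" in the procedure, so it is an assumption of this example rather than the slides' method, and the function name `weibull_plot_fit` is hypothetical. The slope estimates β, and θ follows from the intercept b = −β ln θ.

```python
import math

def weibull_plot_fit(times):
    """Least-squares line through the Weibull probability plot points.

    Returns (beta, theta) estimated from slope and intercept.
    Least squares stands in for the by-eye line (an assumption).
    """
    ts = sorted(times)
    n = len(ts)
    # Benard's median ranks as proportions
    F = [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]
    x = [math.log(t) for t in ts]
    y = [math.log(-math.log(1.0 - f)) for f in F]
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx                      # slope a = beta
    intercept = y_bar - beta * x_bar      # b = -beta * ln(theta)
    theta = math.exp(-intercept / beta)
    return beta, theta

sample = [10, 20, 30, 40, 50, 80]
beta, theta = weibull_plot_fit(sample)
print(f"beta  (shape) ~ {beta:.2f}")
print(f"theta (scale) ~ {theta:.1f}")
```

Estimates read from a hand-drawn line on probability paper will generally differ somewhat from the least-squares values.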
[Figure: Probability Paper - Normal]
[Figure: Probability Paper - Lognormal]
[Figure: Probability Paper - Exponential]
Example - Probability Plotting
Given the following random sample of size n=8, which
probability distribution provides the best fit?
i    xi
1    79.40968
2    88.12093
3    91.06394
4    98.73094
5    104.1536
6    105.1019
7    106.5036
8    112.0354
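One way to compare candidate distributions for this sample without graph paper is to compute how straight each probability plot is. The sketch below is an illustration only: using the Pearson correlation coefficient of the plotted points as a measure of linearity is an assumption of this example, not part of the procedure in the slides, and the names `pearson_r` and `plot_linearity` are hypothetical. The normal and lognormal plots use standard normal quantiles of the median ranks; the Weibull plot uses the ln ln[1/(1 − F)] scale.

```python
import math
from statistics import NormalDist

sample = [79.40968, 88.12093, 91.06394, 98.73094,
          104.1536, 105.1019, 106.5036, 112.0354]

def pearson_r(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def plot_linearity(data):
    """Correlation of the probability-plot points for three models."""
    xs = sorted(data)
    n = len(xs)
    F = [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]   # Benard
    z = [NormalDist().inv_cdf(f) for f in F]               # normal scale
    w = [math.log(-math.log(1.0 - f)) for f in F]          # Weibull scale
    log_x = [math.log(v) for v in xs]
    return {
        "normal": pearson_r(xs, z),
        "lognormal": pearson_r(log_x, z),
        "weibull": pearson_r(log_x, w),
    }

for model, r in plot_linearity(sample).items():
    print(f"{model:>9}: r = {r:.4f}")
```

A correlation closer to 1 indicates a straighter plot, but with only eight points the differences between models can be small, so the plots themselves should still be inspected.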
40 Specimens
40 specimens are cut from a plate for tensile tests. The tensile tests
were made, resulting in Tensile Strength, x, as follows:
i    x       i    x       i    x       i    x
1    48.5    11   55.0    21   53.1    31   54.6
2    54.7    12   55.7    22   49.1    32   49.9
3    47.8    13   49.9    23   55.6    33   44.5
4    56.9    14   54.8    24   46.2    34   52.9
5    54.8    15   49.7    25   52.0    35   54.4
6    57.9    16   58.9    26   56.6    36   60.2
7    44.9    17   52.7    27   52.9    37   50.2
8    53.0    18   57.8    28   52.2    38   57.4
9    54.7    19   46.8    29   54.1    39   54.8
Perform a statistical analysis of the tensile strength data.
40 Specimens
Time Series plot:
[Figure: time series plot of the tensile strength data. Vertical axis from 30.0 to 65.0, horizontal axis from 0 to 40]
By visual inspection of the time series plot, there seems to be no trend.
Therefore, the sample appears to be a random sample.
40 Specimens
Using the descriptive statistics function in Excel, the following
were calculated:
Descriptive Statistics
Count                 40
Minimum               42.35
Maximum               61.18
Range                 18.84
Sum                   2104.82
Mean                  52.62
Median                53.03
Sample Variance       19.83
Standard Deviation    4.45
Kurtosis              2.51
Skewness              -0.34
40 Specimens
Using the histogram feature of Excel, the following frequencies were calculated:

Bin     Frequency
40      0
45      3
50      10
55      16
60      9
More    2

and the graph:
[Figure: Histogram of Tensile Strengths. Horizontal axis: bins 40, 45, 50, 55, 60, More. Vertical axis: frequency, 0 to 18]
From looking at the Histogram and the
Normal Probability Plot, we see that
the tensile strength can be approximated
by a normal distribution.
40 Specimens
Box Plot
The lower quartile is 49.45
The median is 53.03
The mean is 52.6
The upper quartile is 55.3
The interquartile range is 5.86
[Figure: box plot of the tensile strength data on a horizontal axis from 40 to 65, with the lower extreme, lower quartile, median, mean, upper quartile, and upper extreme labeled]
40 Specimens
Normal Probability Plot
[Figure: normal probability plot of the tensile strength data. Cumulative scale from 0.10% to 99.90%, observational scale from 40 to 65]
40 Specimens
LogNormal Probability Plot
[Figure: lognormal probability plot of the tensile strength data. Cumulative scale from 0.10% to 99.90%, logarithmic observational scale from 10 to 100]
40 Specimens
Weibull Probability Plot
[Figure: Weibull probability plot of the tensile strength data. Cumulative scale from 0.10% to 99.90%, observational scale from 41 to 61]
40 Specimens
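The tensile strength distribution can be
estimated by
$$X \sim N(\hat{\mu} = 52.62,\ \hat{\sigma} = 4.45)$$
[Figure: estimated cumulative distribution function F̂(x) and probability density function f̂(x). Vertical axis from 0 to 1, horizontal axis from 49 to 55]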
Solve the Example using Minitab
http://www.minitab.com/en-US/default.aspx