Module Two: Graphical and Numerical Methods for One Variable
Module Two: Graphical and numerical exploration of
univariate variables
In this module, we will demonstrate how to use Minitab to produce
graphical and descriptive summaries of data.
In particular, we shall focus on:
• How to construct each tool and when to apply it.
• For most situations, we will skip the ‘WHY’ (the theory behind),
except for a few essential concepts.
• The tools introduced here are general. They can be applied to
inter-laboratory data analysis as well as to other data.
In general, the process of an inter-laboratory study can be
described as follows:
[Diagram: each laboratory (Lab I and Lab J are shown) has its own
operator, instruments, and measured system; all laboratories are
subject to the external environment.]
When planning an inter-laboratory study, one should:
1. Clearly identify the purpose of the study and list the needed
instruments.
2. Make the uncontrollable environment as uniform as possible, in
order to minimize unexplainable errors and uncertainties.
3. Make the experimental units as homogeneous as possible, in
order to reduce the random errors.
4. Make sure the testing process is under statistical control.
5. Collect variables that directly address the purpose of the study.
6. Collect potential confounding factors and covariates that
may affect the response variables.
7. Plan the study for a reasonable time period and a reasonable
cost.
Once the experiment is conducted and the data are collected, a
typical procedure for data analysis includes:
1. Data screening, to correct trivial mistakes and identify possible
causes of these mistakes.
2. Graphical and descriptive summaries, to check for
nontrivial outliers and identify their possible causes. If there is
no clear reason for an unusual data value, it
should not be deleted immediately.
3. Checking the validity of statistical assumptions, such as:
   • whether the data follow a normal curve,
   • whether the variances are approximately constant across different
   factor levels.
4. Conduct a thorough and appropriate data analysis. In many
situations, the type of analysis has already been determined
when the design of the experiment was determined.
5. Properly interpret the results based on both the quantitative
evidence and the qualitative aspects of the study. Purely
empirical results without a close connection to the study can often
be misleading. It is extremely important that the results and
their interpretation follow logically from the study itself, and are
not interpreted purely on the basis of numbers.
6. Write an appropriate report, depending on the audience. The
report could be for a general audience or for technical experts.
7. In many situations, the results may lead to
improvement or modification of the testing procedure, the testing
process and operator training, and/or lead to further studies.
In this module, we will analyze some inter-laboratory data
and the data we collected in class, using graphical and
descriptive summaries.
When we collect data for a study, the first thing we should be aware of
is the type of data we collect. Different types of data should be
summarized and analyzed differently.
• Qualitative (categorical) data, e.g., blood type, gender.
• Discrete data, e.g., the number of defective parts found when
inspecting 10 parts; the number of bacteria A in one ml of water.
• Continuous data, e.g., temperature, strength of a fiber. In
most inter-laboratory testing, we most likely collect this type
of response.
Graphical methods to show uncertainty
and distribution of measurements
For qualitative (categorical) data:
• Some useful tools are pie charts, bar charts, Pareto charts, stem &
leaf plots.
For continuous data:
• Some useful tools are stem & leaf plots, histograms, box-plots, time-series plots.
For demonstrating the relationship between two or more responses:
• Some useful tools are scatter plots, matrix plots.
For checking whether a variable follows a normal curve:
• Some useful tools are Q-Q plots, normal probability plots.
Graphical tools are powerful for detecting outliers
For checking outliers (unusual cases) in a one-response situation:
• Some useful tools include box-plots, h-plots, k-plots.
For checking outliers (unusual cases) in a two-sample situation:
• Some useful tools include Youden’s plot, confidence region
plots, side-by-side box-plots, scatter plots.
A detailed discussion will be given in a separate module for
outlier detection.
Pie Chart, Bar Chart and Pareto Chart:
A survey asked 400 individuals to rate the quality of schools in the USA.
The data are summarized below:

Rating   Frequency   Relative Frequency
A           35
B          260
C           93
D           12

Draw a pie chart, a bar chart and a Pareto chart.
[Figure: pie chart with slices labeled B, C, A, D]
Pareto chart: an ordered bar chart, starting with the highest bar and going down. It
provides a quick check of which categories occur most often.
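The empty Relative Frequency column and the Pareto ordering can be checked with a short script. This is a minimal sketch in Python rather than Minitab; the variable names are illustrative and only the four frequencies come from the survey table.

```python
# Frequencies from the survey table (400 respondents in total).
ratings = {"A": 35, "B": 260, "C": 93, "D": 12}

total = sum(ratings.values())

# Relative frequency = frequency / total number of respondents.
relative = {r: f / total for r, f in ratings.items()}

# Pareto order: categories sorted from the highest bar down.
pareto = sorted(ratings, key=ratings.get, reverse=True)

# Cumulative percentage, often drawn as a line on a Pareto chart.
cumulative = []
running = 0
for r in pareto:
    running += ratings[r]
    cumulative.append(100 * running / total)

for r, cum in zip(pareto, cumulative):
    print(f"{r}: freq={ratings[r]:3d}  rel={relative[r]:.4f}  cum={cum:.2f}%")
```

The same relative frequencies are the slice proportions of the pie chart, so one table serves all three charts.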
Hands-on exercise – Identify types of data, use Minitab to
construct Pie, Bar, Pareto charts.
A sample of 20 students is randomly chosen from a university, and the
measurements in the following table are recorded.
1. Identify qualitative variables:
2. Identify discrete variables:
3. Identify continuous variables:
4. Construct a pie chart, bar chart and Pareto chart for the variable Year.
Student   GPA   Gender   Year   Major        Hour study/wk
1         2.0   M        3      Biology      3.6
2         3.2   F        1      Biology      7.4
3         2.5   F        4      Biology      4.8
4         2.8   M        3      Accounting   5.0
5         3.6   F        2      Accounting   6.5
6         3.1   M        3      Law          4.2
7         2.8   M        2      Law          3.8
8         2.4   M        2      Math         2.5
9         2.8   F        1      Math         5.2
10        2.6   M        3      Math         3.5
11        3.0   F        4      Math         6.8
12        3.2   M        2      Computer     9.3
13        3.7   M        1      Computer     7.2
14        2.7   F        3      Computer     5.3
15        2.9   M        2      Computer     4.2
16        2.5   M        4      Language     2.8
17        2.8   F        4      Language     3.8
18        3.2   F        1      Language     4.8
19        3.4   M        3      Engineer     4.4
20        3.1   F        3      Engineer     7.4
A Quick Guide to Using Minitab
Minitab (version 13) has four windows:
• Worksheet window: holds data in column/row format.
• Session window: holds results as text, which is editable.
• Graph window: holds graphs, with full-screen editing capability.
• Project Manager window: keeps all the history and program code.
Minitab has nine pull-down menus. Most of them are self-explanatory (File, Edit,
Manip, Editor, Window, and Help). Three menus are related to data analysis:
• Calc menu: for data transformations, a variety of statistical distributions,
and random number generation.
• Stat menu: consists of a variety of statistical methods.
• Graph menu: for constructing a variety of graphs.
Create, manipulate, and retrieve data
Once Minitab is open, the Session and Worksheet windows are open by default.
• To create a data set, e.g., the 20 student records:
  - C1, C2, ... are the columns for variables. The row underneath C1, C2, ... and above the first case is for the
  variable name (which can have up to 512 characters).
  - Define your variable names and begin to enter the data just as in any spreadsheet.
  For the 20 student records, you can define the variables to be ID in C1, GPA in C2, Gender in C3, Year
  in C4, Major in C5, and Hourstudy in C6. Note: C5 is a non-numeric (text) variable. Its column label changes
  to C5-T once you enter the data. The same applies to any other text column, such as Gender in C3.
• To save the worksheet:
  - Go to the File menu. You can save your file as a Minitab Worksheet, a Minitab Project, or other types of data.
  A Minitab Worksheet saves only the data file, and has the extension .MTW. A Minitab Project saves everything
  you have created, including any results or graphs from the analysis. It has the extension .MPJ.
• To edit data points:
  - Simply move the cursor to the data cell and modify the data as needed. In addition, the Edit menu and the
  Editor menu can be used for editing your data.
  - If you need to compute new variables or transform an existing variable: under the Calc menu, you can use
  Calculator or Column/Row Statistics to manipulate the data.
• To retrieve existing data:
  - If the existing data are in Minitab format, go to File, choose Open Project or Open Worksheet, find your data
  file, and double click to open it. You are ready to work.
  - If your data are in EXCEL or in TEXT format (.txt or .dat), go to the File menu and click on Open Worksheet. In the
  dialog, click on ‘Files of type’, choose the correct file format, find the file, and open it. You are ready to work.
The basic structure of handling data, outputs
and graphs in Minitab
The default windows are the Session window and the Worksheet window.
The Graph window will appear whenever a graph is created.
A typical procedure is:
1. Create a new data set or open an existing data set.
2. Conduct data screening, data manipulations and transformations as needed.
3. Conduct descriptive and graphical summaries:
   1. Go to the Stat menu, choose Basic Statistics, then Display Descriptive Statistics, and complete the dialog box.
   The results will be held in the Session window.
   2. Go to the Graph menu, choose an appropriate graphical tool, and complete the dialog box. The result will
   appear in the Graph window.
4. Conduct a specific data analysis procedure:
   1. Go to the Stat menu, choose the appropriate statistical procedure, and complete the dialog box.
   For example, Basic Statistics is for descriptive summaries, one-sample and two-sample tests, and normality tests.
   Regression is for model building and model selection. ANOVA is for balanced and generalized analysis of
   variance, and post-hoc tests. DOE is for generating a variety of designs for experiments. Control Charts
   offers a variety of control charts, and Quality Control is for capability analysis, Gage R&R analysis, and so
   on. Tables is for cross-tabulation and frequency summaries. And there are many others.
How to edit a graph?
1. To edit a graph, click on the graph twice; two editing palettes will appear. The Tool palette gives you drawing
tools, and the Attribute palette gives you tools for modifying the graph. There are other tools and functions in
the Editor menu for editing graphs.
2. To add any text, first select the ‘T’ icon on the Tool palette, then go to the position where the text will be created,
press the left mouse button and drag a text box, and enter your text, such as the title of the graph.
3. To edit any part of the graph, first click on the ‘Arrow’ sign on the Tool palette. Move the cursor to the graph,
then highlight the part of the graph to be edited by pressing the left mouse button and dragging over the area. A dotted box
will appear. You can thicken the line, enlarge the font, change the color, and so on.
4. By default, the graph created from the data values is locked. You may unlock it: go to Editor, then
choose ‘Unlock Data Display’ for editing.
5. To rotate the X-axis markers, the Y-axis markers, or any text in the graph, highlight the text, go to the
Editor menu, and choose Rotate Left or Rotate Right for a variety of angles.
6. Once the graph is edited, you can save it: go to the File menu, then Save Graph. You can also print it or paste it
into your reports.
How to integrate output and graphs into the report document?
1. You can copy the graph and paste it into your report: first, go to Editor and select ‘View’, then go to Edit and choose
‘Copy Graph’. Then go to your report document and paste the graph.
2. You can edit any output in the Session window just like a word processing file: highlight and copy any part of the
output from the Session window, and paste it into your report document.
Revisit: How to construct these charts using Minitab?
In Minitab,
1. Enter School rate in C1 and Frequency in C2:

C1            C2
School rate   Frequency
A             35
B             260
C             93
D             12

2. For Pie Chart: Go to Graph, choose Pie Chart. Click ‘Chart Table’, enter Categories in C1 and
Frequencies in C2. You may leave the remaining options as they are, or change them to your liking.
(NOTE: If the data are not summarized in table form, for example the variables Gender, Year and
Major in the student information data, enter the variable into ‘Chart Data in ____’.)
3. For Bar Chart: Go to Graph, choose Chart. In the Graph box, choose ‘Mean’ for Function, enter
C2 for Y, and enter C1 for X. You may change the Display box for different types of displays of
the bar chart. Now, click on Annotation, and choose ‘Data Label’, then ‘Show Data Label’. This
will show the frequencies on the chart.
4. For Pareto Chart: Go to Stat, choose Quality Tools, then choose Pareto Chart. In the dialog box,
click on ‘Chart Defect Table’, and enter Labels in C1 and Frequencies in C2.
Hands-on exercise Revisit: use Minitab to construct Pie,
Bar, Pareto charts for the student information data.
1. Construct a pie chart, bar chart and Pareto chart for the variable Year.
Steps for Bar Chart: Go to Graph, choose Chart. In the Graph box, choose ‘Count’ for Function, enter
‘Year’ for Y and also enter ‘Year’ for X.
Note: Minitab constructs only one pie chart at a time. However, we can use Chart or Pareto
Chart to construct side-by-side charts for the different categories of another variable.
2. Construct bar graphs and Pareto charts for Year by Gender.
For Bar Graph:
1. Go to Graph, choose Chart. Choose ‘Count’ for Function, enter ‘Year’ for Y and also enter ‘Year’ for X.
2. In the Display box, choose ‘Group’ under ‘For each’; the group variable is ‘Gender’.
3. Click on Options. In the Options dialog box, under ‘Groups within X’, click on ‘Cluster’ and enter ‘Gender’.
You can also click on ‘Annotation’ to display Data Labels.
For Pareto Chart:
1. Go to Stat, choose Quality Tools, choose Pareto Chart.
2. In the dialog box, click ‘Chart Defects in’ and enter ‘Year’, and enter ‘Gender’ in ‘By Variable’.
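The counts that these bar and Pareto charts display can be tabulated directly as a check on the Minitab output. The following is a minimal sketch in Python, not part of the Minitab workflow; the `records` list simply copies the Gender and Year columns of the 20 student records.

```python
from collections import Counter

# (Gender, Year) for students 1..20, copied from the student table.
records = [
    ("M", 3), ("F", 1), ("F", 4), ("M", 3), ("F", 2),
    ("M", 3), ("M", 2), ("M", 2), ("F", 1), ("M", 3),
    ("F", 4), ("M", 2), ("M", 1), ("F", 3), ("M", 2),
    ("M", 4), ("F", 4), ("F", 1), ("M", 3), ("F", 3),
]

# Counts for the single-variable charts of Year.
year_counts = Counter(year for _, year in records)

# Counts for the clustered (Year by Gender) charts.
by_gender = Counter(records)

print("Year  Total  M  F")
for year in sorted(year_counts):
    print(f"{year:4d}  {year_counts[year]:5d}  "
          f"{by_gender[('M', year)]}  {by_gender[('F', year)]}")
```

Each printed row corresponds to one cluster of bars in the Year-by-Gender chart.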
Another example:
The amount of money expended in fiscal year 1995 by the U.S. Department
of Defense in various categories is shown in Table 1.6. Construct both a pie
chart and a bar chart to describe the data. Compare the two forms of
presentation.
Table 1.2
Category                      Amount (in billions)
Military personnel            $70.8
Operation and maintenance      90.0
Procurement                    55.0
Research and development       34.7
Military construction           6.8
Total                        $258.2

Refresh:
• The category of expenditure is (qualitative or quantitative).
• The amount of the expenditure is (qualitative or quantitative).
How to use a pie chart for these data?
Each “pie slice” represents the proportion of the total expenditures ($258.2
billion) corresponding to its particular category. For example, for the
research and development category, the angle of the sector is
\[ \frac{34.7}{258.2} \times 360^\circ \approx 48.4^\circ \]
In percentage:
\[ \frac{34.7}{258.2} \times 100\% \approx 13.4\% \]
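The sector angle and percentage above follow from one proportion. A minimal sketch in Python, using only the research and development amount and the stated total from the table (the variable names are illustrative):

```python
total = 258.2          # total expenditure stated in the table, in $ billions
research_dev = 34.7    # research and development, in $ billions

share = research_dev / total      # proportion of the whole pie
angle_degrees = share * 360       # sector angle of the pie slice
percent = share * 100             # the same share as a percentage

print(f"angle = {angle_degrees:.1f} degrees, share = {percent:.1f}%")
```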
Bar Chart for representing the amount in each category of expenditure
Graphical summary for continuous data:
Stem-Leaf Plots and Histograms
The following lists the prices (in dollars) of 19 different brands of walking shoes.
Construct a stem and leaf plot to display the distribution of the data.

90  65  75  70  70  68  70  70  60  68
70  74  65  75  70  40  70  95  65

Solution
[Figure: stem-and-leaf plot of the 19 shoe prices]
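A stem-and-leaf display for the 19 shoe prices can be built by splitting each price into a tens digit (stem) and a units digit (leaf). The sketch below is in Python and is only one possible layout; the function name `stem_and_leaf` is illustrative, and Minitab's own display adds a count column that is not reproduced here.

```python
from collections import defaultdict

# The 19 walking-shoe prices (in dollars) listed above.
prices = [90, 65, 75, 70, 70, 68, 70, 70, 60, 68,
          70, 74, 65, 75, 70, 40, 70, 95, 65]

def stem_and_leaf(values):
    """Return {stem: sorted leaves}, with stem = tens digit, leaf = units digit."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    # Include empty stems so that gaps in the data stay visible.
    return {s: groups.get(s, []) for s in range(min(groups), max(groups) + 1)}

plot = stem_and_leaf(prices)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Keeping the empty stems (5 and 8 here) makes the isolated low value 40 and the two high values stand out.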
Hands-on activity for Stem & Leaf Plots
Use the student information data. Use Minitab to construct a stem-leaf plot for the variable
Hourstudy, one for Male and one for Female.
1. In Minitab, go to Graph, choose Stem and Leaf.
2. In the dialog box, enter ‘Hourstudy’ in the Variables box.
3. Click on By Variable and enter ‘Gender’ into this box. You will notice that ‘Gender’ does not
appear in the available list of variables. This is because ‘Gender’ is a text variable. For a
stem-leaf plot, the ‘By Variable’ must be a numeric variable. Before you can do this, you will
need to convert the Gender variable from text to numeric.
   Converting Gender from text to numeric:
   • Go to the Manip menu, choose Code, then select ‘Text to Numeric’.
   • In the dialog box, enter ‘Gender’ into the ‘Code data from columns’ box, and a new
   column, say C7, into the ‘Into columns’ box.
   • Enter Original value: M, New: 1 and Original value: F, New: 2.
4. Now, go back to step 3 and replace ‘Gender’ by ‘C7’. You are in business.
Relative Frequency Histograms
A relative frequency histogram for a quantitative data set is a bar graph in which
the height of the bar (Y-axis) represents the proportion or relative frequency
of occurrence for a particular class or subinterval of the variable being
measured. The classes or subintervals of the variable are plotted along the X-axis.
Constructing a relative frequency histogram:
1. Choose the number of classes, usually between 5 and 15.
2. Calculate the approximate class width by dividing the difference between the largest and
smallest values (Range = largest – smallest) by the number of classes.
3. Round the approximate class width up to a convenient number.
4. If discrete, assign one or more integers to a class.
5. Locate the class boundaries.
If continuous, use the method of left inclusion: include the left class boundary point but not
the right boundary point in the class.
   – NOTE: Different methods may be used in different software. Some may use right
   inclusion. Some may add an additional decimal place for the class boundary.
6. Construct a table containing the classes, their boundaries, and their relative frequencies.
7. Construct the histogram like a bar graph. X: the class boundaries, Y: the relative
frequency. Each rectangular bar represents the relative frequency (or frequency) of the
variable in each class.
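The steps above can be sketched in a few lines of Python for the Hourstudy values of the 20 students. The choices of 5 classes, a class width of 1.5 and a starting boundary of 2.0 are assumptions made for illustration; other convenient choices are equally valid and give a different table.

```python
# Hours of study per week for the 20 students (Hourstudy column).
hours = [3.6, 7.4, 4.8, 5.0, 6.5, 4.2, 3.8, 2.5, 5.2, 3.5,
         6.8, 9.3, 7.2, 5.3, 4.2, 2.8, 3.8, 4.8, 4.4, 7.4]

num_classes = 5                                          # step 1 (assumed choice)
approx_width = (max(hours) - min(hours)) / num_classes   # step 2: range / classes
width = 1.5     # step 3: approx_width rounded up to a convenient number (assumed)
start = 2.0     # assumed lowest boundary, at or below the smallest value

# Step 5: class boundaries; left inclusion means lo <= x < hi.
boundaries = [start + i * width for i in range(num_classes + 1)]
classes = list(zip(boundaries[:-1], boundaries[1:]))

# Step 6: frequency and relative frequency for each class.
counts = [sum(lo <= x < hi for x in hours) for lo, hi in classes]
rel_freq = [c / len(hours) for c in counts]

for (lo, hi), c, r in zip(classes, counts, rel_freq):
    print(f"[{lo:.1f}, {hi:.1f})  freq={c:2d}  rel={r:.2f}")
```

With left inclusion, the value 3.5 falls in the class starting at 3.5 rather than the class ending at 3.5; software that uses right inclusion would count it differently.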
Hands-on activities for constructing Histograms using Minitab
Data: the 20 student information
• Construct histograms for GPA and Hourstudy.
  1. Go to Graph, choose Histogram, and enter ‘GPA’ and ‘Hourstudy’ as graph variables into the X box.
  2. Use Annotation for Data Labels. Use Options to modify the type of histogram.
• Construct histograms for Hourstudy by Gender with the same X and Y scales.
  1. Go to Graph, choose Histogram, and enter ‘Hourstudy’ as the graph variable in the X box.
  2. In the Display box, select ‘Group’ in the ‘For each’ box, and enter ‘Gender’ into the ‘Group
  Variables’ box.
  3. Go to Frame, choose Multiple Graphs.
  4. Choose ‘Same X and Same Y’.
Exercise: Compare the histogram distributions of GPA between Male and Female.
Interpreting Graphs – What to observe from a graph?
What to look for as you describe the data:
- the degree of uncertainty: how widely the data are spread.
- the center of the data set, such as the mean, median, or mode.
- the shape of the distribution: is a normal curve a reasonable
distribution?
- outliers: are there rare or unusual data values? What may be the
causes?
• Distributions are often described by their shapes:
- symmetric
- skewed to the right (long tail goes right)
- skewed to the left (long tail goes left)
- unimodal, bimodal, multimodal (one peak, two peaks, many
peaks)
What kind of information can a histogram provide?
Relative frequency can give us information such as:
• the proportion of measurements that fall in a particular class
or group of classes,
• the probability that a measurement drawn at random from a set will fall in a particular
class or group of classes.
Different samples from the same population will produce different histograms.
The shape of a histogram describes the distribution of the variable of interest. The following are
common shapes we may find in real world applications:
[Figure: example histograms labeled Symmetric, Skew-to-right, Skew-to-left, Bimodal, and Skew-to-right with outliers]
• Skew-to-right: Most values are small. Only a few are much larger. The long tail is on the right side.
• Skew-to-left: Most values are large. Only a few are much smaller. The long tail is on the left side.
• Bimodal: Two peaks.
• Outliers: Observations that lie extremely far from the majority. (We will discuss how to identify them
more specifically.)
Q: Based on your common experience, what would you say about the distribution shape
of the following variables, if we observe 200 data values
(Symmetric, Skew-to-right, Skew-to-left)?
Adult height:
Entry level salary:
Salary for individuals who are 40 years or older:
Hours on the net per week:
Scores from an easy test:
Scores from a difficult test:
Q: Can you find a variable that has a distribution shape that is
(a) Symmetric:
(b) Skew-to-right:
(c) Skew-to-left:
Describing Data with Numerical Measurements
When introducing numerical measurements, we cannot avoid introducing some
commonly used notation. The main reason we compute numerical
measurements is to use the sample information to make good sense of
the unknown nature (or population). For each sample measure we compute,
there is a corresponding population measure. For example, when we compute a
sample mean from a data set, we usually use this sample mean to estimate
the true mean of the unknown nature (the population mean). To make this
clear, we use two different terms, one for the sample and one for the
population.
• Measurements summarized from sample data: we call them statistics.
• Measurements from the unknown population: we call them parameters.
A table of some commonly used notations

Measurements from sample data | Corresponding population measurements
Sample mean (average), \(\bar{x} = \frac{\sum x_i}{n}\) | Population mean, \(\mu\)
Sample median, m, the middle value when the data are in ascending order | Population median
Sample mode, M, the observation that occurs most frequently | Population mode
Sample variance, \(s^2\), a measure of uncertainty | Population variance, \(\sigma^2\)
Sample standard deviation, s, a measure of uncertainty | Population standard deviation, \(\sigma\)
Sample range, R, a measure of uncertainty | Population range
Relative frequency histogram | Probability distribution, P(x) or f(x)
Sample percentile, e.g., the 70th percentile: 70% of the observations are less than the 70th percentile, and 30% are larger | Population percentile
Measure of Center of a data set:
• Sample mean: \(\bar{x} = \frac{\sum x_i}{n}\)
• Sample median, m, of a set of n measurements is the value of x that falls in the middle position when
the measurements are ordered from smallest to largest.
  – The value .5(n + 1) indicates the position of the median in the ordered data set.
  – If .5(n + 1) is an integer, the observation at that position in the ordered data set is the median.
  – If .5(n + 1) is not an integer, the median is the average of the two nearby middle
  observations.
• Sample mode: the data value that occurs most frequently. There may be more than one mode in a data set.
Measure of relative standing: Percentile
• A set of n measurements on the variable x has been arranged in order of magnitude. The pth percentile
is the value of x that exceeds p% of the measurements and is less than the remaining (100 - p)%.
• The value p(n + 1), with p written as a proportion (e.g., .25 for the 25th percentile), indicates the
position of the pth percentile in the ordered data set.
  – If p(n + 1) is an integer, the observation at that position in the ordered data set is the pth percentile.
  – If p(n + 1) is not an integer, the pth percentile lies between the two nearby observations and is found
  by averaging or interpolating between them.
Some commonly used percentiles:
25th percentile (Q1, or 1st quartile),
50th percentile (Q2, median, or 2nd quartile), 75th percentile (Q3, or 3rd quartile).
Hands-on activity:
Obtain the average, median, Q1, Q3, 60th percentile, and mode for the following
inter-laboratory testing results (in mg) from 8 labs by hand:

Notation:  x1  x2  x3  x4  x5  x6  x7  x8
Results:   84  74  78  84  92  86  84  80

Q-a. The value 78 was a typo. It should be 48. Compute the average, median, Q1, Q3 and mode
again. How does the change of the value from 78 to 48 affect the average, median, Q1, Q3 and
mode?
Q-b. Each sample was tested twice. The results from the 2nd test indicate that every lab measured 5 mg
lower than in the 1st test. Compute the average, median and mode. How does the reduction of 5 mg for
each lab affect the average, median, Q1, Q3 and mode?

                     Mean   Median   Mode   Q1   Q3   60th percentile
1st test-incorrect
1st test-correct
2nd test

Explain your observations:
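Hand calculations for this activity can be checked with a short script. The sketch below is in Python; the helper `percentile` is a hypothetical function written for this illustration, and it assumes the position rule (p/100)(n + 1) with linear interpolation between the two adjacent ordered observations. Other software, including Minitab, may use a slightly different rule.

```python
from statistics import mean, median, multimode

def percentile(values, p):
    """p-th percentile (p from 0 to 100) via the position rule (p/100)*(n+1)."""
    data = sorted(values)
    n = len(data)
    pos = p / 100 * (n + 1)        # 1-based position in the ordered data
    if pos <= 1:
        return data[0]
    if pos >= n:
        return data[-1]
    lower = int(pos)               # whole part of the position
    frac = pos - lower             # fractional part of the position
    # Interpolate between the two adjacent ordered observations.
    return data[lower - 1] + frac * (data[lower] - data[lower - 1])

# Results as first recorded (with the value 78).
results = [84, 74, 78, 84, 92, 86, 84, 80]

summary = {
    "mean": mean(results),
    "median": median(results),
    "mode": multimode(results),
    "Q1": percentile(results, 25),
    "Q3": percentile(results, 75),
    "P60": percentile(results, 60),
}
print(summary)
```

Replacing 78 by 48 in `results`, or subtracting 5 from every value, and rerunning gives the other two rows of the table.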
Relative frequency distribution showing the effect
of extreme values on the mean and median
Note: Median is less sensitive than average to extreme values.
Why?
Measures of Variability
• Variability or dispersion is a very important characteristic of data.
It measures the spread of data values.
• Example:
  – Scores of 20 students are all 80%: there is no variability.
  – Scores of 20 students range from 30% to 100%: there is a large variability.
Measure of Variability: The range, R, of a set of n measurements is
defined as the difference between the largest and the smallest
measurements.
Range = Largest – Smallest.
Visualizing Variability using Histogram
Numerical Measures of Variability
Variance and Standard Deviation
What are they? How can they be used to measure data spread?
The following figure shows the deviations of points from the mean.
Sample variance:
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]
Standard deviation:
\[ s = \sqrt{s^2} \]
Measure of Variability for Population:
The variance of a population of N measurements is denoted by \(\sigma^2\) and is given by
the formula
\[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \]
• The population standard deviation is \(\sigma = \sqrt{\sigma^2}\).
• This measure will be relatively large for highly spread data and relatively small
for less spread data.
Measure of Variability for Sample:
The variance of a sample of n measurements is given by
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]
The sample standard deviation is given by:
\[ s = \sqrt{s^2} \]
• The shortcut method for calculating \(s^2\):
\[ s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1} \]
where \(\sum x_i^2\) = sum of the squares of the individual
measurements
and \((\sum x_i)^2\) = square of the sum of the individual measurements.
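The definition and the shortcut formula give the same sample variance. The following sketch in Python checks this numerically on the eight inter-laboratory results used earlier in this module; the variable names are illustrative.

```python
import math
from statistics import variance

# Inter-laboratory testing results (in mg) from 8 labs.
x = [84, 74, 78, 84, 92, 86, 84, 80]
n = len(x)
x_bar = sum(x) / n

# Definition: sum of squared deviations from the mean, divided by n - 1.
s2_definition = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

# Shortcut: (sum of squares - (square of the sum) / n), divided by n - 1.
s2_shortcut = (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n) / (n - 1)

s = math.sqrt(s2_definition)   # sample standard deviation

print(f"definition: {s2_definition:.4f}")
print(f"shortcut:   {s2_shortcut:.4f}")
print(f"s:          {s:.4f}")
```

The shortcut avoids computing each deviation, which is convenient by hand, although it can lose precision on a computer when the values are large relative to their spread.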
The measures of center (mean, median, mode), the measure of relative standing (pth
percentile), and the measures of variability (range, \(s^2\), s) can be easily computed
using Minitab.
In Minitab:
1. Go to Stat, choose Basic Statistics, then choose Display Descriptive Statistics.
2. In the dialog box, enter the variable names.
3. You can request graphs of the variables, such as a histogram, by
clicking on ‘Graphs’ in the dialog box and choosing the graphs you want.
Hands-on activity for Numerical measurements using Minitab
The blood pressures of 8 patients with high blood pressure, before and after
a medication, are recorded:

Patient   1    2    3    4    5    6    7    8
Before    220  245  186  190  245  264  252  248
After     155  180  172  162  165  178  210  158
Improve   65

Q-a: Find the improvement. Compute the average and median improvement.
Q-b: Compute the sample variance, \(s^2\), and sample standard deviation, s, of the
improvement.
Q-c: Compute sample average – 2(s) and sample average + 2(s). What
percentage of patients have improvements within this interval?
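The Minitab results for this activity can be cross-checked with a few lines of Python. This is a sketch for checking only; it takes the improvement to be Before minus After, as in the first entry of the table (220 - 155 = 65).

```python
from statistics import mean, median, stdev, variance

before = [220, 245, 186, 190, 245, 264, 252, 248]
after = [155, 180, 172, 162, 165, 178, 210, 158]

# Q-a: improvement = before - after, for each patient.
improvement = [b - a for b, a in zip(before, after)]
avg = mean(improvement)
med = median(improvement)

# Q-b: sample variance and sample standard deviation (divisor n - 1).
s2 = variance(improvement)
s = stdev(improvement)

# Q-c: interval average +/- 2s and the percentage of patients inside it.
low, high = avg - 2 * s, avg + 2 * s
inside = sum(low <= d <= high for d in improvement)
percent_inside = 100 * inside / len(improvement)

print(improvement)
print(f"mean={avg}, median={med}, s2={s2:.1f}, s={s:.2f}")
print(f"interval=({low:.2f}, {high:.2f}), inside={percent_inside:.0f}%")
```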
Points to remember about variance and standard deviation:
- The value of s is always greater than or equal to zero.
- The larger the value of \(s^2\) or s, the greater the variability of the data set.
- If \(s^2\) or s is equal to zero, all measurements must have the same value.
- The standard deviation s is computed in order to have a measure of
variability measured in the same units as the observations.
In real world applications, the shape of the distribution is usually related to the
mean, median and standard deviation.
An Example:
Gas prices are a concern for many people. A random sample of 40 stations gives the
following data summary:
Sample mean = $1.85
Median = $1.82
s = $0.15
Q: Is the distribution of the gas prices more likely to be
(a) symmetric, (b) skewed-to-right, or (c) skewed-to-left?
And WHY?
On the Practical Significance of the
Standard Deviation
NOTE:
s measures the uncertainty of the observed data. \(\bar{x} \pm s\) provides an interval of
potential blood pressures. Furthermore, it tells us that approximately 68% of
blood pressures will be within the interval. It is a way of reporting
measurement uncertainty.
By the same reasoning, \(\bar{x} \pm 2s\) is an interval of blood pressures that
will cover approximately 95% of all possible blood pressures.
WHY?
The Empirical Rule, and the Normal Curve
The above claim is correct in many real world situations. This is due to the fact
that many real world variables follow a distribution, the Normal Curve, which
says, most observations are in the middle, around the mean. A few are small,
and a few are large. And, the approximate proportion can be determined using
the Normal Curve.
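This is described as the Empirical Rule. The following graph shows the rule.
[Figure: normal curve with the horizontal axis marked at \(\mu - 2\sigma\), \(\mu - \sigma\), \(\mu\), \(\mu + \sigma\), \(\mu + 2\sigma\); about 34% of the area lies on each side of \(\mu\) within one \(\sigma\), and about 2.5% lies in each tail beyond \(2\sigma\).]
Empirical Rule: Given a distribution of measurements that is
approximately mound-shaped:
- The interval \((\mu \pm \sigma)\) contains approximately 68% of the
measurements.
- The interval \((\mu \pm 2\sigma)\) contains approximately 95% of the
measurements.
- The interval \((\mu \pm 3\sigma)\) contains almost all of the measurements.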
The Empirical Rule is often applied to identify rare (unusual, extreme) observations.
If an observation falls outside the two-standard-deviation range, it is considered rare,
since only about 5% of observations are expected to fall there.
We will discuss the Normal Curve and learn how to apply it to real world situations.
In most of our discussions on analyzing inter-laboratory testing data, detecting
outliers, and quality control, the response variable
will be assumed to follow a normal curve.
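The 68%, 95% and "almost all" figures of the Empirical Rule are the areas under a normal curve within one, two and three standard deviations of the mean. A minimal sketch in Python using the standard library's `statistics.NormalDist` (available in Python 3.8 and later) shows where the numbers come from:

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
z = NormalDist()

# Area within k standard deviations of the mean, for k = 1, 2, 3.
coverage = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, area in coverage.items():
    print(f"within {k} s.d.: {100 * area:.2f}%")
```

For data that are only approximately mound-shaped, these percentages are approximations rather than exact values.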
Hands-on Activity for using Empirical Rule to identify rare cases
2.5 A Check on the Calculation of s
• Range ≈ 4s, or s ≈ Range / 4.
• Use the range approximation to detect gross errors in calculation, such as the
failure to divide the sum of squares
of deviations by (n - 1) or the failure to take the square root
of \(s^2\).
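As an illustration of this check, the sketch below (in Python) compares s with Range / 4 for the eight inter-laboratory results used earlier in this module. The approximation is rough, so only the order of magnitude is expected to agree.

```python
from statistics import stdev

# Inter-laboratory testing results (in mg) from 8 labs.
results = [84, 74, 78, 84, 92, 86, 84, 80]

data_range = max(results) - min(results)   # Range = largest - smallest
rough_s = data_range / 4                   # range approximation to s
s = stdev(results)                         # calculated sample standard deviation

# A ratio far from 1 would suggest a gross error in calculating s,
# e.g. forgetting the divisor (n - 1) or the square root.
ratio = s / rough_s

print(f"range={data_range}, range/4={rough_s}, s={s:.2f}, ratio={ratio:.2f}")
```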
Measures of Relative Standing
Definition: The sample z-score is a measure of relative
standing
defined by
\[ z\text{-score} = \frac{x - \bar{x}}{s} \]
• A z-score measures the distance between an observation
and the mean, measured in units of standard deviation.
• An outlier is an unusually large or small observation.
• z-scores between -2 and +2 are highly likely.
• z-scores exceeding 3 in absolute value are very unlikely.
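A short sketch in Python shows how z-scores single out an unusual value. It uses the eight inter-laboratory results with the value 48 from Q-a of the earlier hands-on activity. Flagging |z| > 2 as "unusual" is an assumption made for this illustration; the text above only states that values beyond 3 are very unlikely.

```python
from statistics import mean, stdev

# Inter-laboratory results (in mg) with the value 48 from Q-a.
results = [84, 74, 48, 84, 92, 86, 84, 80]

x_bar = mean(results)
s = stdev(results)

# z-score: distance from the mean in units of standard deviation.
z_scores = [(x - x_bar) / s for x in results]

# Assumed screening threshold for this sketch: |z| > 2.
unusual = [x for x, z in zip(results, z_scores) if abs(z) > 2]

for x, z in zip(results, z_scores):
    print(f"x={x:3d}  z={z:+.2f}")
print("unusual:", unusual)
```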
Definition: A set of n measurements on the variable x has been arranged in order of
magnitude. The pth percentile is the value of x that exceeds p% of the
measurements and is less than the remaining (100 - p)%.
• The value p(n + 1), with p written as a proportion, indicates the position of the
pth percentile in the ordered data set.
  – If p(n + 1) is an integer, the observation at that position in the ordered data set is the pth percentile.
  – If p(n + 1) is not an integer, the pth percentile lies between the two nearby
  observations and is found by averaging or interpolating between them.
Example 2.13 is an example of the use of a percentile. Figure 2.12 shows a
percentile on a relative frequency histogram. Figure 2.13 illustrates the location of
quartiles.
Example 2.13
Suppose you have been notified that your score of 610 on the Verbal Graduate
Record Examination placed you at the 60th percentile in the distribution of scores.
Where does your score
of 610 stand in relation to the scores of others who took the examination?
Solution
Scoring at the 60th percentile means that 60% of all examination scores were
lower than yours and 40% were higher.
• The median is the same as the 50th percentile.
• The 25th and 75th percentiles are called the lower and
upper quartiles.
Figure 2.12
Figure 2.13
Definition: A set of n measurements on the variable x has been arranged
in order of magnitude.
• The lower quartile (first quartile), Q1, is the value of x that exceeds
one-fourth of the measurements and is less than the remaining three-fourths.
• The second quartile is the median.
• The upper quartile (third quartile), Q3, is the value of x that exceeds
three-fourths of the measurements and is less than the remaining
one-fourth.
• When the measurements are arranged in order of magnitude, the lower
quartile, Q1, is the value of x in the position .25(n + 1).
• The upper quartile, Q3, is the value of x in the position
.75(n + 1).
• When these positions are not integers, the quartiles are found by
interpolation, using the values in the two adjacent positions.
• See Example 2.14 to illustrate the determination of the lower and
upper quartiles. Figure 2.14 gives the Minitab output for the example.
Definition: The interquartile range (IQR) for a set of measurements is the
difference between the upper and lower quartiles; that is, IQR = Q3 - Q1.
• The trimmed mean is the mean of the middle 90% of the
measurements after excluding the smallest 5% and the
largest 5%.
Data: Beta-Carotene In serum.MTW

Row   Laboratory   Sample   Material A   Material B   Material C   Material D
1     1            1        0.066        0.146        0.472        0.986
2     1            2        0.062        0.143        0.436        0.904
3     2            1        0.070        0.140        0.390        0.840
4     2            2        0.070        0.140        0.390        0.820
5     3            1        0.089        0.213        0.390        0.840
6     3            2        0.082        0.196        0.523        1.241
7     4            1        0.044        0.120        0.452        1.292
8     4            2        0.050        0.120        0.472        1.131
9     5            1        0.064        0.142        0.411        0.883
10    5            2        0.058        0.148        0.416        0.874
11    6            1        0.076        0.149        0.399        0.886
12    6            2        0.073        0.145        0.396        0.859
13    7            1        0.080        0.230        0.390        0.830
14    7            2        0.080        0.250        0.380        0.780
15    8            1        0.062        0.140        0.370        0.890
16    8            2        0.057        0.150        0.390        0.910
17    9            1        0.060        0.170        0.450        1.040
18    9            2        0.070        0.170        0.460        1.070
19    10           1        0.071        0.155        0.458        1.093
20    10           2        0.074        0.159        0.444        1.061
21    11           1        0.050        0.140        0.420        0.970
22    11           2        0.060        0.140        0.420        0.980
23    12           1        0.060        0.080        0.180        0.320
24    12           2        0.050        0.060        0.190        0.870
25    13           1        0.051        0.145        0.371        0.832
26    13           2        0.062        0.145        0.328        0.870
27    14           1        0.100        0.240        0.520        1.380
28    14           2        0.090        0.230        0.590        1.180
29    15           1        0.063        0.146        0.426        0.899
30    15           2        0.060        0.149        0.458        1.002
31    16           1        0.095        0.173        0.437        0.969
32    16           2        0.097        0.177        0.447        0.978
33    17           1        0.070        0.138        0.389        0.914
34    17           2        0.069        0.149        0.393        0.919
35    18           1        0.040        0.090        0.230        0.540
36    18           2        0.040        0.090        0.230        0.530
Project One: Analysis of Data using graphical and numerical
summaries
Study: An inter-laboratory test was conducted to investigate
the Beta-Carotene content of four materials, A, B, C, D. Two
samples were tested for each material.
Purpose: To study whether there is a difference between labs and within
labs for each material tested.
Things to investigate:
1. Comparing the distributions of the four materials.
2. Comparing the means and medians of the four materials.
3. Comparing the variability of the four materials.
4. Comparing the distributions of each material between the two samples.
5. Comparing the variability of each material between the two samples.
6. Are there any unusual observations from any lab for each material?
We will have team presentations for each project.
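As one possible starting point for the project, the sketch below (in Python, outside Minitab) looks at Material D only. It computes each laboratory's mean of its two samples (a between-lab view) and the absolute difference between the two samples (a within-lab view). The choice of Material D and of these two summaries is an assumption made for illustration, not a prescribed analysis.

```python
from statistics import mean

# Material D results (sample 1, sample 2) for laboratories 1..18,
# copied from the Beta-Carotene data table.
material_d = {
    1: (0.986, 0.904),  2: (0.840, 0.820),  3: (0.840, 1.241),
    4: (1.292, 1.131),  5: (0.883, 0.874),  6: (0.886, 0.859),
    7: (0.830, 0.780),  8: (0.890, 0.910),  9: (1.040, 1.070),
    10: (1.093, 1.061), 11: (0.970, 0.980), 12: (0.320, 0.870),
    13: (0.832, 0.870), 14: (1.380, 1.180), 15: (0.899, 1.002),
    16: (0.969, 0.978), 17: (0.914, 0.919), 18: (0.540, 0.530),
}

# Between-lab view: the mean of the two samples for each laboratory.
lab_means = {lab: mean(pair) for lab, pair in material_d.items()}

# Within-lab view: the absolute difference between the two samples.
within_diff = {lab: abs(a - b) for lab, (a, b) in material_d.items()}

lowest_lab = min(lab_means, key=lab_means.get)
highest_lab = max(lab_means, key=lab_means.get)
most_variable_lab = max(within_diff, key=within_diff.get)

print(f"lowest mean:  lab {lowest_lab} ({lab_means[lowest_lab]:.3f})")
print(f"highest mean: lab {highest_lab} ({lab_means[highest_lab]:.3f})")
print(f"largest within-lab difference: lab {most_variable_lab} "
      f"({within_diff[most_variable_lab]:.3f})")
```

Laboratories singled out this way are candidates for a closer look with box-plots or z-scores, not automatic outliers.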