What Every Auditor Should Know About Statistics


Thoratec Workshop in Applied Statistics for QA/QC, Mfg, and R&D
Part 1 of 3: Basic Statistical Concepts
Instructor: John Zorich
www.JOHNZORICH.COM
[email protected]
Part 1 was designed for students who
know high-school algebra but who have
never had a college-level statistics course.
John Zorich's Qualifications:
 20 years as a "regular" employee in the medical device
industry (R&D, Mfg, Quality)
 ASQ Certified Quality Engineer (since 1996)
 Statistical consultant and trainer (since 1999) for many
companies, including Siemens Medical, Boston
Scientific, Stryker, and Novellus
 Instructor in applied statistics for Ohlone College,
Silicon Valley Polytechnic Institute, and KEMA/DEKRA
 Past instructor in applied statistics for UC Santa Cruz
Extension, ASQ Silicon Valley Biomedical Group, & TUV.
 Publisher of 9 commercial, formally validated, statistical
application Excel spreadsheets that have been purchased
by over 80 companies, worldwide. Applications include:
Reliability, Normality Tests & Normality Transformations,
Sampling Plans, SPC, Gage R&R, and Power.
 You’re invited to “connect” with me on LinkedIn.
Objectives
PART 1 (today's topics):
Obtain an understanding of BASIC Statistics in
general (its vocabulary, methods, & uses) as
needed to understand Parts 2 and 3.
PART 2 (not today's topics):
Learn INTERMEDIATE statistical applications and
tests as needed to understand Part 3; also
includes "reliability calculations", power
calculations, and sample size determinations.
PART 3 (not today's topics):
Become familiar with commonly used
ADVANCED Statistical applications (Reliability
Plotting, Sampling plans, SPC, Process
Capability calculations, Equipment control).
Self-teaching & Reference Texts
RECOMMENDED by John Zorich
Clements: Handbook of Statistical Methods in
Manufacturing
Kaminsky et al.: Statistics and Quality Control for
the Workplace
Mlodinow: The Drunkard’s Walk --- How
Randomness Rules Our Lives
Motulsky: Intuitive Biostatistics
NIST Engineering Statistics Internet Handbook, at...
http://www.itl.nist.gov/div898/handbook/index.htm
Philips: How to Think about Statistics
Main Topics in Today's Workshop
• Regulatory Requirements
• Population vs. Sample
• Parameter vs. Statistic
• Probability
• Law of Large Numbers
• Distributions (Charting and Graphing)
• Binomial Distribution
• Hypergeometric Distribution
• Normal Distribution
• Central Limit Theorem
• Standard Deviation and Standard Error
• Linear Regression & Correlation Coefficients
Regulatory Requirements
ISO 9001:2008 (8.1), and ISO 13485:2003(8.1)
" The organization shall plan and implement the
monitoring, measurement, analysis and
improvement processes needed to demonstrate
conformity [to requirements]....This shall include
determination of applicable methods, including
statistical techniques, & the extent of their use."
21CFR820.250 (FDA)
" Where appropriate, each manufacturer shall
establish and maintain procedures for identifying
valid statistical techniques required for
establishing, controlling, and verifying the
acceptability of process capability and product
characteristics."
(as used in this class...)
Sample means part of a Population
The sample could be the part of an individual batch or lot that
was purchased or produced; you inspect the sample prior to
applying an "approved" label on the entire batch or lot.
"Representative Sample": a sample represents the
population --- it is typically not a "Random Sample" but rather
is usually taken evenly from thruout the population
(e.g., a few items taken from each box in the batch).
"Sample size" can be anything from 1 to over 1,000,000.
The term "one sample" or "a single sample" means the entire
sample, no matter what the sample size is.
(as used in this class...)
Statistic is a mathematical summary value
calculated from data taken from a Sample.
All of the following are statistics:
 Avg thickness of every 100th cable produced last week.
 Range of thicknesses in that sample
 Median thickness in that sample.
Parameter is a mathematical summary value
calculated from data taken from the entire Population;
that is, every data point in the entire population
(e.g., average thickness of all cables produced last week).
"Statistics" as a science is the mathematical analysis of
"statistics", not of parameters. Statistics is the science of
using "statistics" to guesstimate "parameters".
As a group, let's discuss...
Which are parameters and which statistics?
1. Baseball "stats"
Answer: Parameters, because baseball "stats" are
calculated using all the data.
2. United States Census data
Answer: Some are statistics, because they are just a
sample of the population (this is the preference of
Democrats) whereas others are parameters because
we attempt to count the entire US population (this is the
preference of Republicans).
3. Average age of the people in this class
Depends -- is this class a population or sample?
Probability (as used in this class) means...
 The same as "chance" or "odds", but not based on a hunch or intuition or what has historically occurred.
• The following statements are not using probability in the
sense we mean here today:
 He'll probably come home before 9pm.
 They'll probably win tonight's game.
 They haven’t won a game in 6 weeks---they’re due!!
• Those are examples of “Adverbial Probability”
(see Inductive Logic, by Hibbens, 1896, chapter 15)
• Instead, the Science of Statistics uses...
“Mathematical Probability”.
 “Mathematical Probability” is the same as the
"theoretical expected frequency", that is, the # of times one
type of event would happen (if no cheating occurs) divided by
the total number of all possible equi-probable events; e.g.,...
Probability
1:1 = "Fifty-Fifty" = 1 / 2 = 0.50 = 50 %
Those terms (above) all mean the same thing.
They all mean that you have the same chance at winning as
you have at losing, as opposed to...
1 / 4 = 0.25 = 25 % chance or odds of...
1 / 10 = 0.10 = 10 % probability of...
1 / 3 = 0.3333 (rounded) = 33.33 %
1 / 6 = 0.1667 (rounded) = 16.67 %
(By definition...) Probability / chance / likelihood...
never can exceed 1.00 = 100%, and
never can be less than 0.00 = 0%
Probability
[Histogram: Probability of rolling a given number on 1 toss of 1 die. The null hypothesis is that the die is "honest". Each of the outcomes 1 through 6 has probability 1/6 ≈ 0.167. X-axis: number observed on the 1 die; Y-axis: probability.]
Probability
[Histogram: Probability of observing a given sum on 1 toss of 2 dice. The null hypothesis is that the dice are "honest". X-axis: sum of numbers observed on 2 dice (2 through 12); Y-axis: probability.]
Probability
[Histogram: Probability of observing heads on a flip of 1 coin. The null hypothesis is that the coin is "honest". Each outcome (0 heads, 1 head) has probability 0.5.]
Probability
[Histogram: Probability of multiple heads in one toss of many coins. The null hypothesis is that the coins are honest, i.e., probability of heads = 0.50. X-axis: number of observed heads (0 to 30); Y-axis: probability.]
When tossing 30 honest coins, the "true" average is 15 heads, but by chance we may see some other result.
Probability of Independent Events
The MULTIPLICATIVE RULE:
The probability of Event A happening
and Event B and Event C (assuming
that they are independent events), is
the multiplication of their probabilities:
Pa x Pb x Pc (where Pa is the
probability of Event A, and so on).
-- Class Exercise --
Let's try answering these questions:
The MULTIPLICATIVE RULE (examples):
The chance of rolling 2 dice and obtaining a 5 on both of them
is...
1 / 6 x 1 / 6 = 1 / 36 = 0.028 = 2.8%
The probability of flipping a coin 4 times and obtaining "heads"
every time is...
1 / 2 x 1 / 2 x 1 / 2 x 1 / 2 = 1 / 16 = 0.0625 = 6.25%
Let's try that (flipping 4 coins & counting heads)
The likelihood of drawing 3 good parts from a lot of
100 million parts, 99% of which are good, is...
0.99 x 0.99 x 0.99 = 0.9703 = 97.03%
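(Not part of the original slides: a minimal Python sketch of the multiplicative rule, with a Monte Carlo check of the "4 heads in 4 flips" example above; the trial count of 100,000 is an arbitrary choice.)

    import random

    # Exact probability of 4 heads in 4 independent flips of an honest coin
    exact = (1 / 2) ** 4  # 0.0625

    # Simulation check: fraction of trials in which all 4 flips land heads
    trials = 100_000
    hits = sum(
        all(random.random() < 0.5 for _ in range(4))
        for _ in range(trials)
    )
    print(exact, hits / trials)  # the two values should agree to ~2 decimals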
Probability
The MULTIPLICATIVE RULE (corollary):
Conditional Probability: If the probability changes after each
sampling event, then the separate probabilities are not
identical, because they are "conditional" not "independent"; e.g.
What is the probability of drawing 3 good parts from a lot of
100 parts, 99% of which are good (that is 99 of which are good
and one of which is bad)?
(1st draw) 99 / 100 x (2nd draw) 98 / 99 x (3rd draw) 97 / 98 = 0.9700
The probability of a given draw is "conditional" based upon
what happened in the previous draw.
( do not use: 99/100 x 99/100 x 99/100 = 0.9703 )
Probability of Independent Events
Assuming that only one event can happen at a time, the sum of
the probabilities of all possible events equals 1.000 exactly.
On a single die, only one number appears face up at a time.
Therefore P1 + P2 + P3 + P4 + P5 + P6 = 1.00, where P1 is
the probability of the #1 being face up, etc.
The ADDITIVE RULE:
The probability of Event A happening or Event B or Event C,
assuming that only one event can possibly happen at a time,
is the sum of their probabilities: Pa + Pb + Pc (where Pa is
the probability of Event A, and so on) --- in this case, there are
assumed to be other possible events, i.e., Pd, Pe, etc.
-- Class Exercise --
Let's try answering these questions:
The ADDITIVE RULE (examples):
The chance of rolling 2 dice and obtaining a total of either 2 or
12 is
1 / 36 + 1 / 36 = 2 / 36 = 0.056 = 5.6%
The probability of flipping 4 coins and obtaining either all heads
or all tails is
1 / 16 + 1 / 16 = 2 / 16 = 0.125 = 12.5%
(based upon our example a few slides ago)
Likelihood that an n = 1 sample is out-of-spec if taken from a lot
with 2% out-of-spec high & 5% out low is...
0.02 + 0.05 = 0.07 = 7%
Probability
[Histogram: Probability of multiple heads in one toss of many coins, assuming the coins are honest !!]
The probability of getting 3 or more heads in a single toss of 4 coins is about 30% = 0.30 = the approximate sum of the individual histogram bar probabilities for getting 3 heads or 4 heads ( 0.25 + 0.05 = 0.30 ).
Calculation of each of these probabilities is simple to do by enumeration (presenter has a demo file; see the sketch below).
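(Not the presenter's demo file: a minimal Python sketch of that enumeration, using only the standard library.)

    from itertools import product

    # All 16 equi-probable outcomes of tossing 4 honest coins
    outcomes = list(product("HT", repeat=4))
    # Outcomes with 3 or more heads
    favorable = [o for o in outcomes if o.count("H") >= 3]
    print(len(favorable) / len(outcomes))  # 5/16 = 0.3125, i.e. the ~30% above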
Probability
[Histogram: Probability of multiple heads in one toss of many coins (assuming the coins are honest !!). X-axis: number of observed heads; Y-axis: probability.]
Calculation of each of these probabilities is done the same way as for 4 coins, but is much more tedious because there are over a million possibilities!!
The probability of getting 22 or more heads in a single toss of 40 coins is about 30% ( ≈ the sum of the individual histogram bar probabilities of 22 and above on the X-axis: 0.10 + 0.075 + 0.06 + 0.03 + 0.02 + 0.01 + 0.005 = 0.30 ).
t-Test of Null Hypothesis
Null Hypothesis: True Average is not greater than the Specification.
[Figure: a smooth probability curve. Y-axis = probability or frequency; X-axis = measured values (Sample Average minus Specification), increasing in magnitude from left to right.]
If the number of possible result values is "infinite" or very large, then the probability histogram is more conveniently represented by a smooth curve (such as this one) rather than a histogram like in previous slides. For example: the individual weights of thousands of coins, or the individual avg weights of thousands of samples taken from a very large population of coins.
Always think of the area under the curve as filled with histogram bars that we are too cheap to print.
t-Test of Null Hypothesis
Null Hypothesis: True Average is not greater than the Specification.
[Figure: the same smooth curve, with the area to the right of a value "A" on the X-axis shaded red; e.g., the X-axis could be the widths of cables made last week.]
The probability of getting a measurement equal to or greater than value "A" on the X-axis is exactly 0.30 = the fraction of the area under the curve that is to the right of that point on the X-axis (the red-shaded area equals 30% of the area under the entire curve).
In the language of calculus, the red area is the integral of the distribution function, from "A" to infinity.
t-Test of Null Hypothesis
Null Hypothesis: True Average is not greater than the Specification.
[Figure: the same curve, with the area to the right of a value "B" on the X-axis shaded red; e.g., the X-axis could be the widths of cables made last week.]
The probability of getting a measurement equal to or greater than value "B" on the X-axis is exactly 0.05 ( = the fraction of the area under the curve that is to the right of that point on the X-axis) (the red-shaded area equals 5% of the area under the entire curve).
We will use this concept many times in Day 2. Do you understand it completely? (Let's examine it with Instructor's Excel files)
The "Law of Large Numbers"
(per JZ) This "law", generalized, is somewhat self evident, and
was known in principle to Archimedes over 2 millennia ago.
It applies to calculated "statistics", such as averages & standard
deviations, and says nothing about the distribution of raw data.
Possibly a better name for this law is the one used over 100
years ago: The “Law of Tendency” ---
“…the law of tendency is that the larger the number of
instances, the greater [= better ] will be the approximation to
an accurate and definite result.”
(quote from pg 240 of Inductive Logic, 1896 by J.G. Hibbens, Scribner & Sons)
This quote shows that the “Law of Large Numbers” is part of our common language, but it is unfortunately often applied incorrectly --- the "Law of Large ("big") Numbers" itself has nothing to do with "statistical significance” (we will discuss "statistical significance” in Day 2 of this workshop).
[Chart, from "Law of Large Numbers.xls" in the Student files: Distribution of sample averages taken from 1st 250 rows (250 sample averages per each sample size). Y-axis: sample average, 70 to 130, with the parameter = 100 marked; X-axis: sample size, 1 to 29. Each mark on each line represents the avg of a different random sample taken from a uniformly distributed population, 75 to 125.]
Law of Large Numbers translates (in this example) as...
The larger the sample size, the closer the calculated value is
likely to be to 100 = the population value (i.e., the closer the
statistic is likely to be to the population parameter).
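(Not the workshop's Excel file: a minimal Python simulation of the same idea, drawing samples from a uniform population on 75 to 125, whose average = 100.)

    import random
    import statistics

    # The sample average wanders near 100 for small n,
    # and hugs 100 more tightly as n grows.
    for n in (3, 10, 100, 1000):
        sample = [random.uniform(75, 125) for _ in range(n)]
        print(n, round(statistics.mean(sample), 2))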
-- Class Exercise --
If this population has an average value of 100, will the average value of a SMALL sample from this population, in the long run, be smaller or larger than 100?
Will the average value of a LARGE sample, in the long run, be larger or smaller than 100?
Answer: In the long run, both small & large sample avgs will be close to 100. In the long run, sample avgs equal population avgs, no matter what the sample size or population shape (that is, in the long run, statistics = parameters).
Graphical Methods used to Describe Variability
Number Line
. . .. ..... ... . .
400    450    500    550    600
The small red squares graphically depict the variability, or the "distribution", of the data.
Histograms
and
Line Charts
Bar Charts
and
Line Charts
Pareto Chart
[Pareto chart: Reasons why customers returned china place settings ordered over the internet from ZTC. Y-axis: number of returns in Jan 2005; bars of height 25, 17, 12, 4, 2, and 1, for reasons including "broken", "wrong product", and "wrong color" (remaining category labels garbled in transcription).]
This is shown here only to "complete" a survey of types of charts. We won't mention Pareto charts in the rest of the workshop.
-- Class Exercise --
If a population distribution looks bimodal, the distribution of data in a SMALL sample from that population will, on average, look like...what?
The distribution of a LARGE sample from that population will, on average, look like...what?
[Figure: a bimodal distribution.]
Answer: On average, both samples will look ≈ bimodal. On average, samples look like the parent population, no matter what the sample size.
"Binomial Distribution" Histogram
PROBABILITY OF MULTIPLE HEADS IN ONE TOSS OF MANY COINS
The Null Hypothesis is that coins are dfdf
honest, i.e. probability of heads = 0.50
0.20
The "binomial distribution"
describes frequencies
when there are only 2
possible outcomes,
(e.g., head or tails on a
coin, or a vote for or
against a proposed law).
Probability
0.15
0.10
0.05
0.00
0
5
10
15
20
25
Number of observed HEADS
30
30 coins at a time
The formula for the "Binomial Distribution" is used to
calculate, e.g., the probability of 26 heads appearing
on a toss of 30 coins. Part of the formula includes the
following calculation:
26 x 25 x 24 x 23 x ....... 4 x 3 x 2 x 1 = ???
= (approximately) 400 Million x Billion x Billion
Prior to computers, such calculations were
"impossible", except by idiot savants ( = the first
"computers" --- they were actually sought after and
well paid !)
Calculation of Binomial Probability
How to easily calculate the height of a single bar
in a Binomial Distribution Probability Histogram…
(MSExcel function)
=binomdist(N,S,B,false)
N = Number of heads observed in a given toss of coins
S = Sample size = number of coins per toss
B = Probability of getting heads on a single coin = 0.5
false = (tells Excel to give probability of single histogram bar)
e.g., =binomdist(11,30,0.5,false) = 0.0509
(check that value vs. histogram a couple slides ago)
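(Not part of the original slides: the same bar height can be computed outside Excel; a sketch using scipy's binom.pmf, plus a from-first-principles check with math.comb. Assumes scipy is installed.)

    from math import comb
    from scipy.stats import binom

    # P(exactly 11 heads in one toss of 30 honest coins)
    print(binom.pmf(11, 30, 0.5))    # ≈ 0.0509, matching =binomdist(11,30,0.5,false)
    print(comb(30, 11) * 0.5 ** 30)  # same value, computed from the counting formula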
Binomial distributions are symmetrical when probability = 0.500,
but skewed when probability is any other value (the farther from
0.500, the more extreme is the skewness --- see next slide).
"Binomial Distribution" Histogram
IF A DICE HAD 10 SIDES, ONE OF WHICH HAD A STAR ON IT,
dfdf
PROBABILITY OF MULTIPLE STARS FACE UP IN TOSS OF 30 DICE
0.25
This situation is modeled by the
Binomial distribution because we are
looking at only 2 possible outcomes:
Star or Not-a-Star.
The probability of a star coming face
up is = 1 / 10 = 10%.
The corresponding binomial
histogram has a peak at
30 x 10% = 3, but is not symmetrical
(it is skewed to the right).
Probability
0.20
0.15
0.10
0.05
0.00
0
5
10
15
20
25
Number of Stars Face Up in Toss of 30 Dice
30
"Hypergeometric Distribution"
The "Binomial distribution" describes frequencies of
independent events, where the probability of one result is
NOT influenced by a previous result
(e.g., coin tosses --- reference the "multiplicative rule" of
probability calculation, discussed previously).
The "Hypergeometric distribution" looks almost identical to
the Binomial, but describes frequencies where the probability
of one result is influenced by a previous result, and therefore
are NOT independent (e.g., sampling from a lot of 100 parts,
only 99 of which are good --- reference the "multiplicative
rule" "corollary", discussed previously).
The Hypergeometric Distribution
is very difficult to calculate by hand, but...
The MS Excel function of the probability for the
"Hypergeometric distribution" is...
=hypgeomdist(N,S,D,P)
N = Observed number of items in the Sample that exhibit
the sought-after characteristic (e.g., 7 "good" parts)
S = Sample size (e.g., 8 parts)
D = # of items in the Population that exhibit the sought-after characteristic (e.g., 99 “good” parts )
P = Population Size (e.g., 100 parts in the lot)
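(Not part of the original slides: the same function is available outside Excel; a sketch using scipy's hypergeom.pmf. Note that scipy's argument order differs from Excel's. Assumes scipy is installed.)

    from scipy.stats import hypergeom

    # P(3 good parts in a sample of 3, from a lot of 100 containing 99 good);
    # scipy order: pmf(k, M, n, N) = (observed, population size,
    #                                 "good" items in population, sample size)
    print(hypergeom.pmf(3, 100, 99, 3))  # 0.97, matching =hypgeomdist(3,3,99,100)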
"Hypergeometric Distribution"
(back in the discussion on "probability" we asked...)
What is the probability of drawing 3 good parts from a lot of
100 parts, 99% of which are good (that is 99 of which are
good and one of which is bad)?
Back then, we calculated it like so:
1st draw
2nd draw
3rd draw
99 / 100 x 98 / 99 x 97 / 98 = 0.9700
Now we can use the hypergeometric Excel function instead:
=hypgeomdist( 3, 3, 99, 100 ) = 0.9700
If we had instead used the binomial Excel function, we would
have obtained this wrong answer:
=binomdist( 3, 3, 0.99, false ) = 0.9703
( which equals 99/100 x 99/100 x 99/100 )
Binomial vs. Hypergeometric Formula
 As long as sample size is not more than 1% of lot size,
the two formulae give the "same" result. For example...
SmplSize = 10, LotSize = 1000 (= Sample is 1% of Lot)
=hypgeomdist( 10, 10, 990, 1000 ) = 0.904
=binomdist( 10, 10, 0.99, false ) = 0.904
SmplSize = 100, LotSize = 1000 (= Sample is 10% of Lot)
=hypgeomdist( 100, 100, 990, 1000 ) = 0.347 (right)
=binomdist( 100, 100, 0.99, false ) = 0.366 (wrong)
FYI: MS Excel cannot calculate every combination of
Hypergeometric values --- for example...
=hypgeomdist( 135, 135, 9900, 10000 ) = #NUM!
=binomdist( 135, 135, 0.99, false ) = 0.258
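(Not part of the original slides: scipy's hypergeom has no trouble with the combination that makes Excel return #NUM!; a sketch, assuming scipy is installed. The ≈ 0.255 value is my computed result, not from the slides.)

    from scipy.stats import binom, hypergeom

    print(hypergeom.pmf(135, 10000, 9900, 135))  # ≈ 0.255 (Excel: #NUM!)
    print(binom.pmf(135, 135, 0.99))             # ≈ 0.258, close because the
                                                 # sample is only ~1% of the lot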
Examples of Normal Distributions
The single most-used distribution in statistical
analysis is the Normal distribution.
Each of these "normal"
curves describes a
population that has the same
average value, but different
degrees of variability within
the population.
( X-axis is in the same units as the raw data.
Y-axis is count, i.e., # of observed items of a given X-value.)
Examples of Normal Distributions
X-axis is in “standard units” (which we will discuss later).
Y-axis is count, i.e., # of observed items of a given X-value.
"Normal Distribution" equation
 The equation for what we now call the
"Normal distribution histogram" was discovered
around 1730, as a way to simplify calculation of the
Binomial distribution; only power & square root tables
were needed (rather than idiot savants).
 The Normal distribution histogram has the "same" shape
as the Binomial when sample size is large and the
probabilities of the outcomes are exactly 50:50
(for example, a histogram describing the various possible number of heads in a toss of 10,000 coins).
 The larger the sample (e.g., the more coins), the closer
the Normal histogram shape is to the Binomial histogram
shape.
"Normal Distribution" equation
 Independently re-discovered ≈ 1800 by 2 astronomers
(Gauss & Laplace); nowadays, sometimes called the
" Gaussian curve "
 They used it to describe the distribution of errors in
measurements; it became known as the " error curve "...
 ...because errors in measurements act like a binomial
situation, that is, a very precise measurement can be only
one of two possibilities, namely either greater than the true
value or less than the true value (ignoring the remote
possibility of being exactly equal to the true value).
 Renamed the " Normal Distribution " around 1900 after it
was discovered that the "error curve" closely described the
typical (i.e., the normal) distribution of many biological
values (e.g., heights of humans, weights of walruses,
lengths of lizards).
"Normal Distribution Histogram"
Y = # of items expected at X (divide by N to get probability)
N = # of items examined (e.g., 225 people)
This
equation
your "student"
files
i=
width
of eachlooks
singleintimidating,
bar ( = lengthbut
of interval)
on histogram
(for binomial
& other discreet
distributions,
i = 1 ) for you,
contain
a spreadsheet
that does
the calculations
X = x-axis
of a given histogram
andmidpoint
then automatically
createsbar
the histogram!
μ = average or expected value of all N items
σ = standard deviation of all N items (we'll explain in a few
minutes what a "standard deviation" is)
If a histogram of your measurement
data does not mimic the histogram
created by this equation, then your
data may actually not be "normal" !!
Let's examine
"Student Normal
Histogram.xls"
"Normal (quantity) Histogram"
Normal QUANTITY
Distribution HISTOGRAM
This was created using the
Normal Distribution Histogram
equation, with N = 225, i = 0.1,
Avg = 5.5, & StdDev = 0.33.
This could represent the
distribution of a heights of 225
randomly selected people.
The sum of all these
bars = N = 225.
30
QUANTITY
25
20
15
10
5
0
4.0
4.5
5.0
5.5
6.0
6.5
7.0
"Normal (probability) Histogram"
Normal PROBABILITY
Distribution HISTOGRAM
0.14
This was created from
the previous chart by
dividing each quantity
by N. The sum of all
these bars = 1.000,
no matter what the
sample size is
( N = 225, or
N = 1,000,000,000 ).
PROBABILITY
0.12
0.10
0.08
0.06
0.04
0.02
0.00
4.0
4.5
5.0
5.5
6.0
6.5
7.0
"Normal (probability) Curve"
ddf
This was created from the
previous chart by drawing a
smooth line from top to top
of each bar, and then
deleting the bars. The sum
of the area under this
curve is defined as = 1.000
Normal PROBABILITY Curve
0.14
PROBABILITY
0.12
0.10
0.08
0.06
0.04
0.02
0.00
4.0
4.5
5.0
5.5
6.0
6.5
7.0
Always view such curves as really a histogram
whose bars we are too cheap to print.
The "Central Limit Theorem"
(The text above is a scanned image from
Bowker & Lieberman, Engineering Statistics, 2nd ed., p. 100)
Let's examine STUDENT file: Central Limit Theorem.xls
CENTRAL LIMIT THEOREM translates as...
 for any population of raw data with any shaped distribution...
 in regards to the distribution of a large number of statistics
taken repeatedly from the population (e.g., averages, ranges,
standard deviations, etc.)...
 the distribution of the statistics will look more and more "normal" ("bell"-shaped) the larger and larger the sample size gets;
 that is true because the value of a statistic will be
somewhere near the parameter, either larger or smaller than
it (ignoring the unlikely event of equaling the parameter);
i.e., it has a binomial distribution, which as we saw before, is
modeled by the "Error Curve", which in modern-times is
called the "Normal distribution".
 the distribution of the statistics will never "be" Normal, except
in cases when N is very large and the raw data population
distribution is "normal". Often, the distribution of statistics is
“ t ” shaped, as we will see in Day 2 of this course.
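(Not the workshop's "Central Limit Theorem.xls": a minimal Python simulation of the theorem, using a decidedly non-normal (uniform) population.)

    import random
    import statistics

    def sample_averages(n, reps=2000):
        """Averages of many size-n samples from a uniform(75, 125) population."""
        return [statistics.mean(random.uniform(75, 125) for _ in range(n))
                for _ in range(reps)]

    for n in (2, 30):
        avgs = sample_averages(n)
        print(n, round(statistics.stdev(avgs), 2))
    # A histogram of `avgs` looks more and more bell-shaped as n grows,
    # and its spread shrinks roughly as 1 / sqrt(n).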
Distribution of Sample Avgs. vs. Population
Theoretical distribution of thousands of individual avgs taken from the population: its shape is due to the Central Limit Theorem; its width is due to the Law of Large Numbers.
[Chart: Distribution of sample averages taken from 1st 250 rows (250 sample averages per each sample size). Y-axis: sample average, 70 to 130; X-axis: sample size, 1 to 29.]
Let's look at this in more detail using MS Excel.
Numerical Expressions
Range ( 1848 ? )
Standard Deviation ( 1893 ? )
Standard Error ( 1897 ? )
Another important term is the " Mean ",
which is another way to say the "average".
"Mean" in that sense was coined about 1750.
What is an "average" ?
About a hundred years ago, "average" usually meant the
"median" ("the median home price in Dallas is..."). However, in
more modern times, the word "average", by itself, always refers
to the sum of all the values, divided by the number of values
(i.e., the "arithmetic mean"):
Value#1 + Value#2 + Value#3 + ( etc. ) + Value#N = Sum of all Values
Average = Sum of all Values / N
What is a "range" ?
The "range" of a set of numbers refers to the difference
between the largest and smallest value in that set:
Range =
Largest Value – Smallest Value
The range of height of people in this room is
approximately...? (different for women than men...? )
What is a "range" ?
. . .. ..... ... . .
400    450    500    550    600
This "number line" uses small red squares to graphically depict the variability, that is, the "distribution", of the data in a small sample. The width of the difference between the value on the far left-hand side and the value on the far right-hand side is the "range". In this data, the range looks to be about 200 units.
“Standard” calculations
Standard XXX (the mathematical definition, for a population parameter):
Standard XXX = √[ ∑ ( Xi – Mean )² / ( # of data points in the Mean ) ]
Standard XXX (when using a sample to guess what the population parameter Standard XXX is):
Standard XXX = √[ ∑ ( Xi – Mean )² / ( # of data points in the Mean, minus Y ) ]
Y = a whole number, greater than zero; its value depends on which "standard" statistic is being calculated.
Standard Deviation & Standard Error
Standard XXX (from a previous slide)
XXX = "Deviation" (that is "Standard Deviation") when
talking about raw data (e.g., heights of humans, and
lengths of lizards).
XXX = "Error" (that is, "Standard Error") when talking about
calculated values (i.e., Statistics), for example:
-- sample means ("Standard Error of the Mean"), or
-- sample standard deviations ("Standard Error of the
Standard Deviation").
[Chart: Standard Deviation Calculated (Y-axis, 0 to 12) vs. Sample Size (X-axis, log scale, 1 to 10000). Random samples taken from a normal population with Std Dev = 10.0; 100 samples per data point (each point = average Std Dev of all 100). Two curves: "Standard Deviation (n-1)" and "Standard Deviation (n)".]
Other random samples would produce differently shaped curves; but, on average, the "n" curve would be farther away (on the low side) from the true value than the "n-1" curve. That is, the "n-1" statistic is a better estimator of the parameter than the "n" one.
This is another example of how to think about the "Law of Large Numbers"; that is, the larger the sample size, the closer (on average) the "statistic" is to the "parameter".
(revisited)
Distribution of Sample Avgs. vs. Population
Theoretical distribution of thousands of
individual avgs taken from the population.
 As was stated on a previous slide, a distribution (of raw data or of statistics such as "averages") is "normal" if its histogram mimics a "normal" one.
 Said differently, a distribution is "normal" if its distribution has
characteristics that mimic that of the "normal probability
curve", such as...
+/– 1 StdXXX from Avg = 68.3 % of area under curve
+/– 2 StdXXX from Avg = 95.5 % of area under curve
+/– 3 StdXXX from Avg = 99.7 % of area under curve
as seen in next few slides...
Areas under the "normal" curve
[Figure: normal curve over X-axis 70 to 130; the darkened area equals 68.3 % of the area under the curve.]
Areas under the "normal" curve
[Figure: normal curve over X-axis 70 to 130; the darkened area equals 95.5 % of the area under the curve.]
Areas under the "normal" curve
[Figure: normal curve over X-axis 70 to 130; the darkened area equals 99.73 % of the area under the curve.]
If a population with Avg = 100, StdXXX = 10 is believed to be Normally distributed, then (1 – 0.9973) / 2 of the population (≈ 0.135%) is predicted to be below X = 70.
This +/− 3 interval is used extensively in “Statistical Process Control” (SPC).
Areas under the "normal" curve
[Figure: normal curve over X-axis 70 to 130; the darkened area equals 99.0 % of the area under the curve.]
This +/− 2.58 interval is used extensively in “Gage R&R” and other “Metrology” methods.
Areas under the "normal" curve
If Standard XXX is 10, then +/– 1.96 Std XXX equals 95.00% of the area under the curve.
[Figure: normal probability curve; X-axis ≈ 67 to 133; Y-axis: probability, 0 to 0.045; the darkened area equals 95.0 % of the area under the curve.]
This +/− 1.96 interval is used in some Reliability calculations & in some tests of “Significance”.
This is called
a " Z " Table
In a normal distribution,
+/– Z std deviations from
the Parameter Avg
encompasses 2 x A of the
population of numbers.
+/– 1.96 standard
deviations equals
2 x 0.4750 = 95.0%
of the area under
the normal curve
+/– 3.00 standard
deviations equals
2 x 0.4987 = 99.7%
of the area under
the normal curve
Class exercise: Estimation of Std Dev
Assuming this ≈ normal distribution of raw data, approximately what is the Std Deviation?
[Figure: normal curve over X-axis 70 to 130.]
Almost all of the distribution is ≈ Mean +/– 30. If "normal", then 30 ≈ 3 StdDevs; therefore StdDeviation ≈ 10.
Class exercise: Estimation of Std Error
Assuming this ≈ normal distribution of Smpl Avgs, approximately what is the ≈ Standard Error?
[Figure: normal curve over X-axis 85 to 115.]
Almost all of the distribution is ≈ Mean +/– 15. If "normal", then 15 ≈ 3 Std Errors; therefore Std Error ≈ 5.
Calculating a "standard error"
Any statistic from a single sample will likely not be identical to
the parameter. For example, you can expect a sample mean to
be off by some unknown amount from the population mean, i.e.
to have some amount of "error". The "standard" amount of error
to expect is called the "standard error". The theoretical
definitions of two important standard errors are:
Std Error of Mean = Std Dev of all possible (or at least a very
large number of) sample averages (of a single sample size)
taken from a Population.
Std Error of StdDev = Std Dev of all possible (or at least a
very large number of) "n-1" std deviations (of a single sample
size) taken from a Population.
Calculating a "standard error"
Avg#1, Avg#2, Avg#3, Avg#4, etc., Avg#N
--> Std Dev of those Avgs = Std Error of the Mean

StdDev#1, StdDev#2, StdDev#3, StdDev#4, etc., StdDev#N
--> Std Dev of those StdDevs = Std Error of the Std Deviation
Practical formula for "Std Error of Mean"
Standard Error of the (sample) Mean ( estimated from 1 sample ) =
Sample Standard Deviation / √( Sample Size )
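(Not part of the original slides: a Python sketch comparing the one-sample formula above with the "Std Dev of many sample averages" definition from the earlier slide. The population parameters (mean 100, std dev 10) are arbitrary choices.)

    import math
    import random
    import statistics

    n = 25
    sample = [random.gauss(100, 10) for _ in range(n)]
    sem_formula = statistics.stdev(sample) / math.sqrt(n)  # one-sample estimate

    means = [statistics.mean(random.gauss(100, 10) for _ in range(n))
             for _ in range(2000)]
    sem_definition = statistics.stdev(means)  # should be near 10/sqrt(25) = 2.0

    print(round(sem_formula, 2), round(sem_definition, 2))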
Linear Regression
& the
Correlation Coefficient
What is the meaning of a
Linear Regression Correlation Coefficient?
In 2009, a billion dollar manufacturing company submitted
to a government regulatory agency a report from a product
technical file, claiming that performance data between the
stressed and unstressed product were not significantly
different, because the “correlation coefficient” between the
data sets was large (about 0.99).
The regulatory personnel knew that such a claim is
nonsense, and so they officially requested a literature or
textbook reference that explained such a rationale. After a few
rounds of emails and re-writings of the report (and still no
literature reference) the company consulted a professional
statistician, who recommended using a different statistical
method to prove equivalency.
Understanding Linear Regression & the Correlation Coefficient

X (class = grade in school)    Y (minutes of study before being easily distracted)
2                              4.2
3                              5.9
5                              10.4
6                              11.5

Is there a linear relationship between class (= grade) in school and tendency toward distraction? How strong is it? How consistent is the relationship (that is, what is the degree of co-relation, more commonly called "correlation")? Let's use Excel to find out!
Understanding Linear Regression & the Correlation Coefficient
[Scatter plot of the data with fitted line: y = 1.91x + 0.36, R2 = 0.9897. X-axis: 0 to 7; Y-axis: 0 to 14.]
• This is a "linear regression plot" of the data.
• The "regression coefficient" is 1.91
• The "correlation coefficient" is " R " or " r " , that is, " r " = the square root of 0.9897 = 0.995
Understanding Linear Regression & the Correlation Coefficient
[Same scatter plot: y = 1.91x + 0.36, R2 = 0.9897.]
• Linear regression puts the "best" straight line thru a plot of X vs. Y data points.
• The "regression coefficient" (= 1.91 = the slope of this line) tells us how STRONG the relationship is.
Understanding Linear Regression & the Correlation Coefficient
[Same scatter plot: y = 1.91x + 0.36, R2 = 0.9897. Callout: This is an example of "Reliability Plotting", which is discussed in Day 3 of this workshop.]
• The linear regression equation (e.g., Y = 1.91X + 0.36 ) allows us to predict the Y value for a nearby X value.
• CLASS EXERCISE: What Y value do we expect at X = 1.0 ?
ANSWER: ( 1.91 times 1.0 ) + 0.36 = 2.27
Understanding Linear Regression
& the Correlation Coefficient
MS Excel Spreadsheet functions...
linear regression coefficient
=SLOPE( known_y's, known_x's )
correlation coefficient --- same result given by either...
=CORREL( known_y's, known_x's )
or...
=CORREL( known_x's, known_y's )
Notice that the function formula for the slope cares
about which data set is X and which is Y, but the
formula for the correlation coefficient does not.
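(Not part of the original slides: the same two coefficients computed outside Excel; a sketch using numpy, assuming it is installed.)

    import numpy as np

    x = np.array([2, 3, 5, 6])            # class (grade) in school
    y = np.array([4.2, 5.9, 10.4, 11.5])  # minutes of study before distraction

    slope, intercept = np.polyfit(x, y, 1)  # like =SLOPE (plus the intercept)
    r = np.corrcoef(x, y)[0, 1]             # like =CORREL

    print(round(slope, 2), round(intercept, 2), round(r, 3))  # 1.91 0.36 0.995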
Are Correlation Coefficients the same if data sets are the same except for magnitude...???
[Chart: three data sets at different Y magnitudes (Y-axis 0 to 1000; X-axis 0 to 20), each with r = 0.955.]
YES !!
Understanding Linear Regression & the Correlation Coefficient
Does the Correlation Coefficient increase in size with additional data points...??
[Chart: three data sets with increasing numbers of data points; r = 0.955, 0.962, and 0.971.]
NO !!
Does a large Correlation Coefficient indicate that the data is truly linear...???
[Chart: three data sets, each with r = 0.955. Notice how the lower-most 2 data sets show a slight curve; the solid black lines are all straight, not curved.]
NO !!
Understanding Linear Regression & the Correlation Coefficient
If the data is close to the line, is the Correlation Coefficient always large...???
[Chart: three data sets; r = 0.955, 0.791, and 0.064. Note the slight slope of the lowest regression line.]
NO !!
Understanding Linear Regression & the Correlation Coefficient
Does a large Correlation Coefficient indicate that the X,Y data have a strong relationship (i.e., that the regression coefficient is large)...??
[Chart: three data sets, each with r = 0.955; the lowest regression line has only a slight slope & 2 dots per point.]
NO !!
Understanding Linear Regression & the Correlation Coefficient
| r | = Se / Sy
There are at least a dozen different formulas for the Correlation Coefficient. The instructor considers this the best formula for teaching the meaning of Correlation. The next few slides explain it....
Understanding Linear Regression & the Correlation Coefficient
Ye is calculated from the linear regression equation that is used to draw the "straight line" thru the data: Ye = aX + b
[Scatter plot: y = 1.91x + 0.36, R2 = 0.9897.]
The square root of 0.9897 = r = 0.995 = correlation coefficient (this chart & equation were produced by MS Excel).
Understanding Linear Regression
& the Correlation Coefficient
Continuing with the data and equation from the previous slide ( Ye = 1.91 ( X ) + 0.36 ):

observed X    observed Y    Ye
2             4.2           4.18
3             5.9           6.09
5             10.4          9.91
6             11.5          11.82
Std Dev       3.505 = Sy    3.487 = Se

r = 3.487 / 3.505 = 0.995 = same as on previous slide
(this is not a trick; it is just one of many mathematically identical formulas for calculating the magnitude of “ r ”)
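(Not part of the original slides: a quick Python check of the instructor's | r | = Se / Sy identity on the worked example; n-1 standard deviations throughout.)

    import statistics

    x = [2, 3, 5, 6]
    y_obs = [4.2, 5.9, 10.4, 11.5]
    y_est = [1.91 * xi + 0.36 for xi in x]  # the Ye column above

    # Se / Sy = stdev of predicted Y values / stdev of observed Y values
    print(round(statistics.stdev(y_est) / statistics.stdev(y_obs), 3))  # 0.995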
Understanding Linear Regression & the Correlation Coefficient
| r | = Se / Sy
The absolute value of the Correlation Coefficient is the ratio of 2 standard deviations: the numerator is the smallest possible standard deviation that can be expected in the Y data points ( = Se ), and the denominator is the observed standard deviation in the Y data points ( = Sy ).
If the observed data were closer to the linear regression line, then Sy would be smaller and then the Se / Sy ratio would be closer to 1.000. THE CORRELATION COEFFICIENT THEREFORE IS A MEASURE OF VARIABILITY, OF HOW CONSISTENTLY THE PLOTTED DATA TRACKS TO THE LINEAR REGRESSION LINE.
Understanding Linear Regression
& the Correlation Coefficient
| r | = Se / Sy
The Correlation Coefficient is... the fraction of the observed Y
data variation ( = Sy, the std deviation of the observed Y
values) that is explainable by a linear relationship between X
and Y ( the variation “associated with” or “caused by” that linear
relationship is Se, the std deviation of the predicted Y values).
The rest of the variation in the data is definitely due to
something else (e.g., poor measurement equipment, poor
measurement technique, other factors, random error, or... the
fact that the data are NOT linearly related !!).
Understanding Linear Regression & the Correlation Coefficient
Assuming Y is dependent on X, what is the source (the "cause") of the variation in Y-values?
[Chart: data sets with r = 0.955 and r = 0.064; the r = 0.064 regression line (the lowest) has only a slight slope.]
Sometimes there is no "cause" (e.g., correlation between arm-length and leg-length).
In the lowest data set, almost no variation in Y is "caused" by the relationship between X & Y (something else is the "cause", such as assay variation or measurement error).
What is the meaning of a
(linear regression) Correlation Coefficient?
• The correlation coefficient is...
an indicator of predictability in the data on the Y axis.
• It represents...
the fraction of the variation in the Y-data that can be
explained by an hypothesized linear relationship
between X and Y.
• If that hypothesis is false, i.e., if the relationship between X
and Y is not truly linear, then the Correlation Coefficient is
meaningless.
r = (stdev solid dots) / (stdev hollow dots)
As mentioned earlier, in 2009, a billion dollar company submitted
to a regulatory agency a report in a tech file, claiming that
performance data between the stressed and unstressed product
were not significantly different, because the “correlation
coefficient” between the data sets was large (about 0.99).
Have you learned enough to explain why that is nonsense?
[Two scatter plots of STRESSED (Y-axis) vs. UNSTRESSED (X-axis, 0 to 15), both showing y = 0.5076x - 0.078 and R2 = 0.9937, drawn at two different Y-axis scales.]
Conclusion to: Understanding Linear Regression & the Correlation Coefficient:
[Chart: three data sets, each with r = 0.955.]
• Just because Excel lets you put a Linear Regression line thru data points does not mean the data is a straight line.
• Just because the Correlation Coefficient is large does not mean you have a straight line.
• You must use your judgment to determine if the line is straight, and if "yes", then and only then can you use the Linear Regression Equation and Correlation Coefficient to help you evaluate the relationship between your X and Y values.
How to implement what you learned today?
 A new language (and some of its vocabulary) is
primarily what you learned today.
 Like any language, you must speak it if you are to
learn it well.
 Read your company's SOP (or ??) on statistical
techniques.
 Ask to read some of the validation protocols and
validation reports that relate to your work, and study
their "statistics" section (or it might be called the
"data analysis" section).
 Ask your boss to explain statistical statements made
in meetings, reports, or SOPs.