TMA 4255 Forsøksplanlegging og anvendte statistiske - no

TMA 4255
Applied Statistics
Spring 2010
About the course
Learning outcome
The objective of the course is to give the students a solid foundation for use of basic
statistical methods in science and technology. In addition the students shall be
capable of planning collection of data and to use statistical software for analysing
data.
Learning methods and activities
Lectures and exercises with the use of a computer (computing programme MINITAB).
The lectures may be given in English. Portfolio assessment is the basis for the grade
awarded in the course. This portfolio comprises a written final examination 80% and
selected parts of the exercises 20%. The results for the constituent parts are to be
given in %-points, while the grade for the whole portfolio (course grade) is given by
the letter grading system. Retake of examination may be given as an oral
examination.
Compulsory assignments
Exercises
Recommended previous knowledge
The course is based on ST0103 Statistics with Applications/TMA4240
Statistics/TMA4245 Statistics, or equivalent.
Contents (preliminary list)
Hypotheses testing, simple and multiple linear regression, residual plots and
selection of variables, transformations, design of experiments, 2^k
experiments and fractions of these. Special designs. Graphical methods.
Error propagation formula. Analysis of variance, statistical process control,
contingency tables and nonparametric methods. Use of statistical computer
package, MINITAB.
Lecturer
Professor Bo Lindqvist, Room 1129, Sentralbygg II, NTNU Gløshaugen
Telephone: (735) 93532
Email: [email protected]
Teaching assistant
PhD research fellow Håkon Toftaker, Room 1036, Sentralbygg II, NTNU Gløshaugen
Telephone: (735) 91681
Email: [email protected]
Teaching material
Main book:
Walpole, Myers, Myers and Ye: "Probability and
Statistics for Engineers and Scientists". Eighth Edition.
Pearson International Edition.
Tables:
”Tabeller og formler i statistikk”, 2nd edition. Tapir 2009.
MINITAB:
Information is found on
http://www.ntnu.no/adm/it/brukerstotte/programvare/minitab.
Weekly meetings
Lectures
Tuesdays 12.15 – 14.00 H3
Thursdays 12.15 – 14.00 S4
Exercises
Mondays 17.15 – 18.00 in H3
or a computer lab
Preliminary curriculum, lecturing and progress plan
Week | Topic | Chapter (WMMY) | Exercise
2 | Introduction, motivation and review. Descriptive measures and graphs. Normal plot. | (1-10) Particularly 8.1-8.7 | 1
3 | Two-sample case. Comparing variances. F-distribution. Simple linear regression. | 8.8, 9.8, 9.13, 10.8, 10.13, 10.18. (11.1-11.5) 11.6-11.12 | 2-3
4-5 | Multiple linear regression | 12.1-12.7, 12.9-12.11 | 4-5
6-8 | 2^k experiments and fractions thereof | BHH 10 and 12. Alternatively, WMMY 15 | 6-8
9-10 | Analysis of variance | 13.1-13.4, 13.6, 13.8-13.10, 13.13, 13.15, 14.1-14.4 | 9-10
11 | Statistical process control | 17.1-17.5 | 11
12 | Chi-square tests and contingency tables | 10.14-10.16 | 12
13-14 (Tue) | Easter vacation | |
14 (Thu)-15 | Nonparametric statistics | 16.1-16.3 | 12
16 | Approximation of expectation and variance | |
17 (Tue) | Review | |
The compulsory project: Example
Introduction to course
TMA 4240/45 Statistics and ST 0103:
Probability theory + simple statistics
TMA 4255 (this course):
A little probability + APPLICABLE and APPLIED statistics
The ”classical” statistical methods:
•Regression analysis
•Design of experiments
•Analysis of variance (ANOVA)
•Analysis of discrete data (contingency tables)
•Nonparametric methods
Why is statistics important in science and
industry?
The book emphasizes ”the Japanese industrial miracle”:
•Use of statistical methods in design and production
•Statistical thinking in all parts of the production
In 2000, the highly reputed international medical journal New
England Journal of Medicine named
•Use of statistical methods
as one of the 11 most important medical advances of all time
Originally:
Statistics = ”collection and presentation of data”
Today much more:
•Design and collection of data from statistical investigations
•Modeling of the stochastic mechanisms behind the data
•Drawing conclusions about these mechanisms, based on the data
•Evaluation of the strength of the conclusions (variance, confidence interval,
test power)
•Basic tool: Probability theory
Statistical investigations can be divided into two main
types:
•Experimental studies based on design of experiments (DOE):
Experiments are done under controlled conditions.
•Observational studies: When control of the conditions is not possible.
Statistical studies
Experimental studies:
Clinical trials
Comparison of drugs A and B
Trial group of n persons
r drawn at random are given A
s drawn at random are given B
n-r-s (rest) get placebo
Blind test: Patient does not know kind of drug
Double blind: Examining doctor does not know either
Observational studies:
Epidemiological studies
• Smoking and cancer
• Diet and coronary diseases
Cannot control the conditions; e.g. cannot force people to smoke/not smoke.
Difficulty in interpretation: There may be unknown underlying causes which make the results biased (”confounding”). For example: A gene which increases the need for smoking, and at the same time influences the chance of getting cancer.
Statistics in scientific investigations:
•Generate hypotheses
•Derive consequences
•See whether these are fulfilled in observations
•Generate new hypotheses
•Etc.
Particular matters for statistics in industry:
Product and process engineer:
Off-line: Controlled experiments aiming at optimizing production
Production: Register production data; use them to control
production.
MINITAB 15: Rocket fuel example
Stat > Basic Statistics > Display Descriptive Statistics
Stat > Basic Statistics > Graphical Summary
Summary for X
[Figure: graphical summary of X; the statistics shown in the figure are listed below]
Anderson-Darling Normality Test
  A-Squared       0,59
  P-Value         0,092
  Mean            40,560
  StDev           1,991
  Variance        3,963
  Skewness        -1,55360
  Kurtosis        2,72795
  N               10
  Minimum         35,900
  1st Quartile    39,275
  Median          41,000
  3rd Quartile    42,000
  Maximum         42,600
95% Confidence Interval for Mean:    39,136  41,984
95% Confidence Interval for Median:  39,266  42,037
95% Confidence Interval for StDev:   1,369   3,634
More plots: Rocket fuel data
[Figure: Dotplot of X]
[Figure: Empirical CDF of X, Normal; Mean 40,56, StDev 1,991, N 10]
[Figure: Probability Plot of X, Normal - 95% CI; Mean 40,56, StDev 1,991, N 10, AD 0,589, P-Value 0,092]
Statistical inference with MINITAB: Rocket fuel data
Stat > Basic Statistics > 1-Sample Z
Assume known standard deviation sigma=2
One-Sample Z: X
Test of mu = 40 vs not = 40
The assumed standard deviation = 2
Variable   N    Mean  StDev  SE Mean            95% CI     Z      P
X         10  40,560  1,991    0,632  (39,320; 41,800)  0,89  0,376
One-Sample Z: X
Test of mu = 40 vs > 40
The assumed standard deviation = 2
Variable   N    Mean  StDev  SE Mean  95% Lower Bound     Z      P
X         10  40,560  1,991    0,632           39,520  0,89  0,188
Stat > Basic Statistics > 1-Sample t
Assume unknown standard deviation sigma
One-Sample T: X
Test of mu = 40 vs not = 40
Variable   N    Mean  StDev  SE Mean            95% CI     T      P
X         10  40,560  1,991    0,629  (39,136; 41,984)  0,89  0,397
One-Sample T: X
Test of mu = 40 vs > 40
Variable   N    Mean  StDev  SE Mean  95% Lower Bound     T      P
X         10  40,560  1,991    0,629           39,406  0,89  0,198
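The Z and T statistics and P-values in the MINITAB output can be checked by hand from the summary statistics alone. The following is a minimal sketch, not MINITAB's own algorithm: it uses only the Python standard library, and the helper t_cdf (a hypothetical name) obtains the t distribution function by numerical integration of the density. Because the inputs are the rounded values printed above, agreement is only expected to about three decimals.

```python
import math
from statistics import NormalDist

# Summary statistics copied from the MINITAB output above.
n, mean, s = 10, 40.560, 1.991
mu0 = 40.0    # hypothesised mean
sigma = 2.0   # standard deviation assumed known in the Z test


def t_cdf(t, df, steps=2000):
    """Student t CDF via Simpson's rule on the density between 0 and |t|."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return c * (1 + x * x / df) ** (-(df + 1) / 2)

    h = abs(t) / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(abs(t))
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area


# Z test: sigma known.
z = (mean - mu0) / (sigma / math.sqrt(n))
p_z_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
p_z_greater = 1 - NormalDist().cdf(z)

# t test: sigma estimated by s, with n - 1 degrees of freedom.
t = (mean - mu0) / (s / math.sqrt(n))
p_t_two_sided = 2 * (1 - t_cdf(abs(t), n - 1))
p_t_greater = 1 - t_cdf(t, n - 1)

print(f"Z = {z:.2f}, P(two-sided) = {p_z_two_sided:.3f}, P(greater) = {p_z_greater:.3f}")
print(f"T = {t:.2f}, P(two-sided) = {p_t_two_sided:.3f}, P(greater) = {p_t_greater:.3f}")
```

The one-sided P-value is half the two-sided one here because the observed mean lies on the side of the alternative mu > 40.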
One and two sample tests concerning means
Example: The industrial experiment
An experiment was performed at a manufacturing plant by making, in sequence,
10 batches using the standard production method (A), followed by 10
batches using a modified method (B). The results from these trials are given in
the table on the next slide. What evidence do the data provide that method
B is better than method A?
(Assuming equal variances:)
(Not assuming equal variances:)
Two sample t-test
Paired observations (example)
Test for Equal Variances for X
[Figure: 95% Bonferroni Confidence Intervals for StDevs, and X by Metode (1, 2)]
F-Test: Test Statistic 1,58; P-Value 0,505
Levene's Test: Test Statistic 0,78; P-Value 0,390
Regression analysis
Y: response, dependent variable (WMMY)
xi: explanatory variable, predictor, covariate, independent variable (WMMY), regressor (WMMY)
Goal: Describe Y as a function of the xi.
Linear Regression
Typical data
EXAMPLE
Relationship between stiffness (stivheit)
and density (tettleik) of a wood product.
Plot of stiffness versus density
[Figure: Scatterplot of stivheit vs tettleik]
Least squares regression line
[Figure: Scatterplot of stivheit vs tettleik]
Regression line for logstiv ( = log(stivheit) )
[Figure: Scatterplot of logstiv vs tettleik]
Predicted values for new observations
Multiple Linear Regression
Example – Acid rain in Norwegian lakes
Data from a study of the influence of acid rain on Norwegian lakes,
made in 1986. In total, 1005 lakes were studied. For this example, 26 lakes
were chosen: 16 from Telemark and 10 from Trøndelag.
Variable name | Meaning
x1 | Content of SO4 (mg/l)
x2 | Content of NO3 (mg/l)
x3 | Content of Ca (mg/l)
x4 | Content of latent aluminum (µg/l)
x5 | Content of organic substance (mg/l)
x6 | Area of lake
x7 | Site (0 = Telemark, 1 = Trøndelag)
y | Measured pH
z | Fish status (1 = no damage, 2 = minor damage, 3 = major damage, 4 = no fish)
ROW | x1 | x2 | x3 | x4 | x5 | x6 | x7 | y | z
1 | 4.9 | 39 | 1.54 | 78 | 2.02 | 0.30 | 0 | 5.38 | 3
2 | 4.1 | 75 | 1.55 | 17 | 2.98 | 1.85 | 0 | 5.68 | 1
3 | 3.5 | 80 | 0.83 | 157 | 3.40 | 0.25 | 0 | 5.04 | 3
4 | 3.8 | 75 | 0.53 | 163 | 3.42 | 0.30 | 0 | 4.81 | 4
Model using x1,x2,x3
Simulation
20 observations are simulated
from the model
Y = 1 + ….
with
Epsilon distributed N(0,0.02^2)
X1 uniform on (1,3)
X2 distributed N(1,0.5^2)
The wrong model
Y = beta0 …..
is estimated by MINITAB
At first glance the normal plot looks OK, but it has a systematic convex shape. The
histogram is skewed to the left, while the other plots look fairly OK.
The residuals vs x1 have an inverted-U shape, which suggests that model (1)
is not correct. Residuals vs x2 look better, but there is possibly something wrong
here, too.
The residual plot suggests that
x1 should be transformed. We
use log x1 (which is of course
correct, since we know the
underlying model). Estimate the
given model in MINITAB.
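The residual pattern described here can be reproduced with a small simulation. The sketch below is only an illustration under stated assumptions: the coefficients of the simulated model are elided in the slides, so a hypothetical single-predictor model Y = 1 + 2 log(x) + Epsilon is used, with x on an even grid over (1,3) instead of uniform draws and with the noise standard deviation 0.02 from the slide. A straight line in x is fitted first (the wrong model), then a line in log x.

```python
import math
import random

random.seed(4255)

# Hypothetical stand-in for the simulated data: one predictor, true model in log x.
x = [1 + 2 * i / 19 for i in range(20)]  # 20 points spread over [1, 3]
y = [1 + 2 * math.log(xi) + random.gauss(0, 0.02) for xi in x]


def fit_line(u, v):
    """Least squares intercept and slope for v = a + b*u."""
    u_bar, v_bar = sum(u) / len(u), sum(v) / len(v)
    sxy = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    sxx = sum((ui - u_bar) ** 2 for ui in u)
    b = sxy / sxx
    return v_bar - b * u_bar, b


def residuals(u, v):
    """Residuals from the least squares line of v on u."""
    a, b = fit_line(u, v)
    return [vi - (a + b * ui) for ui, vi in zip(u, v)]


res_wrong = residuals(x, y)                           # wrong model: linear in x
res_right = residuals([math.log(xi) for xi in x], y)  # transformed predictor

for xi, r in zip(x, res_wrong):
    print(f"x = {xi:.2f}  residual = {r:+.3f}")

rss_wrong = sum(r * r for r in res_wrong)
rss_right = sum(r * r for r in res_right)
print(f"RSS wrong model: {rss_wrong:.4f}, RSS with log x: {rss_right:.4f}")
```

With the wrong model the residuals are negative at both ends of the x range and positive in the middle, which is the inverted-U shape; after the log transformation only the noise remains.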
Example: Multicollinearity
Factorial Experiments
Back to two-factor experiment
DOE terminology – main effects:
Interaction effects
Three factors
Columns: z1, z2, z3, z12, z13, z23, z123
Two-factor interaction
Three-factor interaction
Or: using + and – in columns:
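The table of +/- columns for a three-factor experiment can be generated mechanically: each interaction column is the elementwise product of the columns of the factors involved. The sketch below assumes that the factors are named A, B, C (corresponding to z1, z2, z3 above) and that runs are listed in standard order with A changing fastest; both are choices made for illustration.

```python
from itertools import product

FACTORS = "ABC"

# Standard order: A changes fastest, then B, then C.
runs = [dict(zip(FACTORS, reversed(levels))) for levels in product((-1, 1), repeat=3)]


def column(word, run):
    """Sign of an effect column: product of the factor levels named in `word`."""
    sign = 1
    for factor in word:
        sign *= run[factor]
    return sign


WORDS = ["A", "B", "C", "AB", "AC", "BC", "ABC"]
table = [[column(w, run) for w in WORDS] for run in runs]

print("run  " + "  ".join(f"{w:>3}" for w in WORDS))
for i, row in enumerate(table, start=1):
    print(f"{i:>3}  " + "  ".join(f"{'+' if v > 0 else '-':>3}" for v in row))
```

Every column has four plus signs and four minus signs, and any two columns are orthogonal, which is why each effect can be estimated independently of the others.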
Cube plot
Four factors – example
Full Factorial Design
Factors:   4   Base Design:         4; 16
Runs:     16   Replicates:              1
Blocks:    1   Center pts (total):      0
All terms are free from aliasing.
Factorial Fit: Y versus A; B; C; D
Estimated Effects and Coefficients for Y (coded units)
Term       Effect    Coef
Constant           72,250
A          -8,000  -4,000
B          24,000  12,000
C          -2,250  -1,125
D          -5,500  -2,750
A*B         1,000   0,500
A*C         0,750   0,375
A*D        -0,000  -0,000
B*C        -1,250  -0,625
B*D         4,500   2,250
C*D        -0,250  -0,125
A*B*C      -0,750  -0,375
A*B*D       0,500   0,250
A*C*D      -0,250  -0,125
B*C*D      -0,750  -0,375
A*B*C*D    -0,250  -0,125
S = *
Analysis of Variance for Y (coded units)
Source              DF   Seq SS   Adj SS   Adj MS  F  P
Main Effects         4  2701,25  2701,25  675,313  *  *
2-Way Interactions   6    93,75    93,75   15,625  *  *
3-Way Interactions   4     5,75     5,75    1,438  *  *
4-Way Interactions   1     0,25     0,25    0,250  *  *
Residual Error       0        *        *        *
Total               15  2801,00
Normal Probability Plot of the Effects
(response is Y, Alpha = ,05)
[Figure: normal probability plot of the effects; effects marked significant: A, B, D and BD]
Lenth's PSE = 1,125
Pareto Chart of the Effects
(response is Y, Alpha = ,05)
[Figure: Pareto chart of the effects with reference line at 2,89; terms in decreasing order of absolute effect: B, A, D, BD, C, BC, AB, BCD, ABC, AC, ABD, ABCD, ACD, CD, AD]
Lenth's PSE = 1,125
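Since this 2^4 experiment is unreplicated, there are no degrees of freedom left for estimating the error variance, and the plots rely on Lenth's pseudo standard error (PSE) instead. The sketch below applies Lenth's usual definition to the 15 effects listed in the Factorial Fit output: s0 = 1.5 × median|effect|, and PSE = 1.5 × median of those |effect| that are below 2.5 × s0. That MINITAB uses exactly this recipe is an assumption, but it reproduces the printed values. The t quantile 2.571 (97.5% point, 15/3 = 5 degrees of freedom) is taken from a t table.

```python
from statistics import median

# Estimated effects from the "Factorial Fit: Y versus A; B; C; D" output.
effects = {
    "A": -8.0, "B": 24.0, "C": -2.25, "D": -5.5,
    "AB": 1.0, "AC": 0.75, "AD": 0.0, "BC": -1.25, "BD": 4.5, "CD": -0.25,
    "ABC": -0.75, "ABD": 0.5, "ACD": -0.25, "BCD": -0.75, "ABCD": -0.25,
}


def lenth_pse(values):
    """Lenth's pseudo standard error for a set of estimated effects."""
    abs_values = [abs(v) for v in values]
    s0 = 1.5 * median(abs_values)
    # Drop effects that look active before re-estimating the noise level.
    trimmed = [v for v in abs_values if v < 2.5 * s0]
    return 1.5 * median(trimmed)


pse = lenth_pse(effects.values())

# 97.5% point of the t distribution with m/3 = 5 degrees of freedom (table value).
T_975_5DF = 2.571
margin_of_error = T_975_5DF * pse

active = sorted(name for name, e in effects.items() if abs(e) > margin_of_error)

print(f"Lenth's PSE     = {pse:.3f}")
print(f"Margin of error = {margin_of_error:.2f}")
print("Effects beyond the margin:", ", ".join(active))
```

The PSE comes out as 1.125 and the margin of error as about 2.89, matching the reference line in the Pareto chart; the effects beyond it are A, B, D and BD.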
Main Effects Plot (data means) for Y
[Figure: mean of Y at levels -1 and 1 of A, B, C and D]
Interaction Plot (data means) for Y
[Figure: interaction plots for the pairs of factors A, B, C and D]
Example: Three factors and replicate
Factorial Fit: Y versus A; B; C
Estimated Effects and Coefficients for Y (coded units)
Term       Effect    Coef  SE Coef      T      P
Constant           64,250   0,7071  90,86  0,000
A          23,000  11,500   0,7071  16,26  0,000
B          -5,000  -2,500   0,7071  -3,54  0,008
C           1,500   0,750   0,7071   1,06  0,320
A*B         1,500   0,750   0,7071   1,06  0,320
A*C        10,000   5,000   0,7071   7,07  0,000
B*C         0,000   0,000   0,7071   0,00  1,000
A*B*C       0,500   0,250   0,7071   0,35  0,733
S = 2,82843   R-Sq = 97,63%   R-Sq(adj) = 95,55%
Analysis of Variance for Y (coded units)
Source              DF   Seq SS   Adj SS   Adj MS      F      P
Main Effects         3  2225,00  2225,00  741,667  92,71  0,000
2-Way Interactions   3   409,00   409,00  136,333  17,04  0,001
3-Way Interactions   1     1,00     1,00    1,000   0,13  0,733
Residual Error       8    64,00    64,00    8,000
  Pure Error         8    64,00    64,00    8,000
Total               15  2699,00
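With replicates, the columns of this output are tied together by simple formulas, which the following sketch checks. It assumes N = 16 runs (a 2^3 design with two replicates, consistent with the 15 total degrees of freedom) and coded ±1 columns, so that SE Coef = S / sqrt(N), T = Coef / SE Coef, and the sum of squares of a group of terms is N times the sum of their squared coefficients.

```python
import math

# Coefficients and S copied from the "Factorial Fit: Y versus A; B; C" output.
coef = {"A": 11.5, "B": -2.5, "C": 0.75, "AB": 0.75, "AC": 5.0, "BC": 0.0, "ABC": 0.25}
s = 2.82843   # pooled standard deviation from the replicates
n_runs = 16   # assumed: 8 design points, 2 replicates each

# Standard error shared by all coefficients in an orthogonal +/-1 design.
se_coef = s / math.sqrt(n_runs)
t_ratio = {term: c / se_coef for term, c in coef.items()}


def sum_of_squares(terms):
    """Sum of squares explained by a group of orthogonal +/-1 columns."""
    return n_runs * sum(coef[t] ** 2 for t in terms)


ss_main = sum_of_squares(["A", "B", "C"])
ss_two_way = sum_of_squares(["AB", "AC", "BC"])
ss_three_way = sum_of_squares(["ABC"])

print(f"SE Coef = {se_coef:.4f}")
for term, t in t_ratio.items():
    print(f"{term:>3}: T = {t:6.2f}")
print(f"SS main = {ss_main:.2f}, SS 2-way = {ss_two_way:.2f}, SS 3-way = {ss_three_way:.2f}")
```

The sums of squares agree with the Seq SS column of the analysis of variance table (2225, 409 and 1).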
Pareto Chart of the Standardized Effects
(response is Y, Alpha = ,05)
[Figure: Pareto chart of the standardized effects with reference line at 2,31; terms in decreasing order: A, AC, B, AB, C, ABC, BC]
Normal Probability Plot of the Standardized Effects
(response is Y, Alpha = ,05)
[Figure: normal probability plot of the standardized effects; effects marked significant: A, AC and B]
Blocking in 2^k experiments
Full experiment:
Two blocks:
Use column ABC as generator, i.e.
Block 1 consists of experiments with ABC = -1
Block 2 consists of experiments with ABC = 1
The interaction ABC is confounded (”mixed”) with the block effect.
This means that the value of the estimated coefficient of ABC can be
due to both interaction effect and block-effect.
Suppose all Y in block 2 are increased by a value h. Then the estimated
effect of ABC will increase by h. But one cannot know from observations
whether this is due to the interaction ABC or the block effect.
On the other hand, the estimated main effects A,B,C and the two-factor
interactions AB,AC,BC are not changed by the h. These are of most
importance to estimate, so the choice of blocking seems reasonable.
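The argument can be verified numerically. The sketch below builds a 2^3 design, assigns the runs with ABC = 1 to block 2, adds a block effect h to those runs, and compares the estimated effects before and after. The response values and h are hypothetical numbers chosen only to illustrate the arithmetic; an effect is computed as the mean response at + minus the mean response at -.

```python
from itertools import product

# 2^3 design; hypothetical responses used only to illustrate the arithmetic.
runs = [dict(zip("ABC", levels)) for levels in product((-1, 1), repeat=3)]
y = [60.0, 72.0, 54.0, 68.0, 52.0, 83.0, 45.0, 80.0]
h = 5.0  # hypothetical block effect added to every run in block 2


def column(word, run):
    """Sign of an effect column: product of the factor levels named in `word`."""
    sign = 1
    for factor in word:
        sign *= run[factor]
    return sign


def effect(word, response):
    """Estimated effect: mean response where the column is +, minus mean where it is -."""
    plus = [r for run, r in zip(runs, response) if column(word, run) == 1]
    minus = [r for run, r in zip(runs, response) if column(word, run) == -1]
    return sum(plus) / len(plus) - sum(minus) / len(minus)


# Block 2 consists of the runs with ABC = 1.
in_block_2 = [column("ABC", run) == 1 for run in runs]
y_shifted = [yi + h if b else yi for yi, b in zip(y, in_block_2)]

WORDS = ["A", "B", "C", "AB", "AC", "BC", "ABC"]
shift = {w: effect(w, y_shifted) - effect(w, y) for w in WORDS}

for w in WORDS:
    print(f"{w:>3}: change in estimated effect = {shift[w]:+.2f}")
```

Only the ABC estimate moves, and it moves by exactly h, because every other column has two plus and two minus signs inside each block.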
Four blocks in 2^3 experiment
Need two columns of +/- to define 4 blocks. It turns out that the best
option is to use two two-factor interactions, e.g. AB and AC (which
is the default in MINITAB).
Block 1: Experiments where AB = AC = -1
Block 2: Experiments where AB = -1, AC = 1
Block 3: Experiments where AB = 1, AC = -1
Block 4: Experiments where AB = AC = 1
Block structure is as follows:
Interaction effects AB and AC are confounded with the block effect, since they
are generators for the blocks. In addition, their product AB*AC = AABC = BC
is confounded with the block effect. (Note: the BC column is constant within each
block.)
Adding h2 to block 2, h3 to block 3, h4 to block 4 does not change the estimated
effects of A, B, C, and also does not change the third order interaction ABC.
However, e.g. the contrast for AB (the signed sum of the observations) will change
by 2h3+2h4-2h2, so the estimated AB effect changes by (h3+h4-h2)/2, and we do not
know whether this is due to an interaction effect or a blocking effect: This is CONFOUNDING.
How to determine which columns to use for blocking?
Idea: Try to leave estimates for main effects and low order interactions
unchanged by blocking.
Note: I = AA = BB = CC where I is a column of 1’s
Find the blocks for a 2^3 experiment using columns ABC and AC.
The interaction between ABC and AC is
ABC*AC = AA*B*CC = B
which is a main effect, which hence is confounded with the block effect
(in addition to ABC and AC)
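The column algebra used here (letters that appear twice cancel, since AA = I) is easy to mechanise. The sketch below assumes single-letter factor names; the function names multiply and confounded_with_blocks are hypothetical helpers, not part of any statistics package.

```python
from itertools import combinations


def multiply(*words):
    """Multiply effect columns: letters appearing an even number of times cancel (AA = I)."""
    letters = set()
    for word in words:
        letters ^= set(word)
    return "".join(sorted(letters)) or "I"


def confounded_with_blocks(generators):
    """All effects confounded with blocks: every product of one or more generators."""
    found = set()
    for size in range(1, len(generators) + 1):
        for subset in combinations(generators, size):
            found.add(multiply(*subset))
    return sorted(found, key=lambda w: (len(w), w))


print(confounded_with_blocks(["AB", "AC"]))   # generators of the four-block design
print(confounded_with_blocks(["ABC", "AC"]))  # generators tried in the exercise
```

The first choice confounds only two-factor interactions (AB, AC, BC), while the second sacrifices the main effect B, which is why AB and AC are the better generators.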
Example obligatory project
MINITAB plots
Assuming third and fourth order interactions are 0
Interaction plots
From Exam in TMA4260 Industrial Statistics, December 2003, Exercise 2
A company decides to investigate the hardening process in its ball bearing
production.
The following four factors are chosen:
A: Content of added carbon
B: Hardening temperature
C: Hardening time
D: Cooling temperature
Design and results are given below:
a) What is the generator and the defining relation of the design, and what is the
design’s resolution? Write down the alias structure.
Find the estimates of the main effect of A and the interaction effect AC.
b) What is the variance of the main effect A and the interaction AC?
Assume that the standard deviation sigma has been estimated from other experiments,
by s = 0.312 with 9 degrees of freedom (in the exam, this had been done in Exercise 1).
Use this estimate to find out whether the interaction AC is significantly different from
0 (i.e. ”active”). Use a 5% significance level. What is the conclusion of the experiment
so far?
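A sketch of the kind of calculation part (b) asks for, under assumptions that the text does not state explicitly: the first design is taken to be an unreplicated half fraction of the 2^4 design, i.e. N = 8 runs, and each effect is a difference between two averages of N/2 observations, so that Var(effect) = 4 sigma^2 / N. The critical value 2.262 is the two-sided 5% point of the t distribution with 9 degrees of freedom, read from a t table. The design table is not reproduced here, so no effect estimate is filled in; is_active is a hypothetical helper.

```python
import math

s = 0.312    # estimate of sigma from the other experiments
df = 9       # degrees of freedom of that estimate
n_runs = 8   # assumed: unreplicated half fraction of a 2^4 design

# An effect is (mean at +) - (mean at -), each mean over n_runs / 2 observations.
var_effect = 4 * s ** 2 / n_runs
se_effect = math.sqrt(var_effect)

# Two-sided 5% critical value of t with 9 degrees of freedom (table value).
T_975_9DF = 2.262
threshold = T_975_9DF * se_effect


def is_active(effect_estimate):
    """True if the estimated effect differs significantly from 0 at the 5% level."""
    return abs(effect_estimate) > threshold


print(f"Estimated variance of an effect = {var_effect:.4f}")
print(f"Standard error of an effect     = {se_effect:.4f}")
print(f"An effect is declared active if its absolute value exceeds {threshold:.3f}")
```

The estimated AC interaction from the design table would then be passed to is_active to answer the question.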
The company is well satisfied with the results so far and decides to carry out
the other half fraction as well. The result of the other half fraction is given below.
Use this to find unconfounded estimates for the main effects and the two-factor
interactions.
Assume that one would like to estimate the variance of the effects from the
higher order interactions. Explain how this can be done, and find the estimate.
Is it wise to include the four-factor interaction in this calculation? Why (not)?
Later, one of the operators who participated in the experiments asked whether
one could have carried out the first half fraction in (a) in two blocks. This would,
he said, have considerably simplified carrying out the experiments.
What answer would you give to the operator?
From Exam in SIF 5066 Experimental design and…, May 2003, Exercise 1
A company making ball bearings experienced problems with the lifetimes of its
products. In an experiment that they carried out, they considered the factors
A: type of ball – standard (-) or modified (+)
B: type of cage – standard (-) or modified (+)
C: type of lubricant – standard (-) or modified (+)
D: quantity of lubricant – normal (-) or large (+)
The response was the lifetime of the ball bearing in an accelerated life testing
experiment. The results are given on the next page.
a) What type of experiment is this? What is the defining relation? What is the
resolution? Calculate estimates of the main effect of A and the two-factor
interaction AB.
b) Estimated contrasts for B, C, D, AC, AD are, respectively, 0.60, 0.31, 0.22, -0.11,
-0.01. What can you say about the estimated effects for CD, BD, BC, BCD, ACD,
ABD, ABC?
Assume that factors C and D do not influence the response. Explain why this is
then a 2^2 experiment with replicates. Calculate an estimate for the variance of the
effects, and find out whether A, B and AB are now significant.
c) Give an interpretation of the results. The experiment was in fact carried out in two
blocks, where experiments 1-4 were one block and 5-8 the other. How is this
blocking constructed? How will we need to modify the analysis of significance in (b)?
(Assume again that C, D do not influence the response.)
