Stats 845 - The Department of Mathematics & Statistics


Stats 845
Applied Statistics
This Course will cover:
1. Regression
   - Non-linear Regression
   - Multiple Regression
2. Analysis of Variance and Experimental Design
The Emphasis will be on:
1. Learning techniques through examples.
2. Use of common statistical packages:
   • SPSS
   • Minitab
   • SAS
   • S-Plus
What is Statistics?
It is the major mathematical tool of scientific inference - the art of drawing conclusions from data, data that is to some extent corrupted by a component of random variation (random noise).
An analogy can be drawn between data that is affected by random components of variation and signals that are corrupted by noise.

Quite often the sounds heard or received by a radio receiver can be thought of as signals with superimposed noise.

The objective in signal theory is to extract the signal from the received sound (i.e. remove the noise to the greatest extent possible). The same is true in data analysis.
Example A:
Suppose we are comparing the effect of three different diets on weight loss.
An observation on weight loss can be thought of as being made up of two components:
1. A component due to the effect of the diet applied to the subject (the signal).
2. A random component due to other factors affecting weight loss that were not considered (initial weight of the subject, sex of the subject, metabolic makeup of the subject) - the random noise.
Note that random assignment of subjects to diets ensures that this second component acts as a random effect.
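This signal-plus-noise decomposition can be sketched in a short simulation (Python with NumPy; the diet effects and noise level below are hypothetical values chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "signal": mean weight loss (kg) under each of three diets
diet_effect = {"A": 3.0, "B": 5.0, "C": 4.0}
noise_sd = 2.0   # random noise from factors not considered (initial weight, sex, metabolism, ...)

# Randomly assign 30 subjects to the diets and simulate their observed weight losses
diets = rng.choice(list(diet_effect), size=30)
weight_loss = np.array([diet_effect[d] + rng.normal(0, noise_sd) for d in diets])

for d in diet_effect:
    print(d, round(weight_loss[diets == d].mean(), 2))
```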
Example B
In this example we are again comparing the effect of three diets, this time on weight gain. Subjects are randomly divided into three groups and the diets are randomly assigned to the groups. Measurements of weight gain are taken at the following times after commencement of the diet:
- one month
- two months
- six months
- one year
In addition to the two factors, Time and Diet, affecting weight gain, there are two random sources of variation (noise):
- between-subject variation, and
- within-subject variation.
This can be illustrated schematically as follows:

[Schematic] Deterministic factors (Diet, Time) together with random noise (within-subject and between-subject variation) produce the response (weight gain).
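A minimal simulation sketch of this structure (Python with NumPy; the effect sizes and noise levels are hypothetical) makes the two noise sources explicit:

```python
import numpy as np

rng = np.random.default_rng(0)

times = [1, 2, 6, 12]                  # months after commencement of the diet
diet_effect, time_slope = 2.0, 0.5     # hypothetical deterministic effects of Diet and Time
between_sd, within_sd = 1.5, 0.8       # the two random sources of variation

records = []
for subject in range(10):
    subject_effect = rng.normal(0, between_sd)                # between-subject noise
    for t in times:
        gain = (diet_effect + time_slope * t
                + subject_effect + rng.normal(0, within_sd))  # within-subject noise
        records.append((subject, t, round(gain, 2)))

print(records[:4])
```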
The research process can be illustrated schematically as a cycle:

1. Questions arise about a phenomenon.
2. A decision is made to collect data.
3. A decision is made as to how to collect the data. (Statistics)
4. The data is collected.
5. The data is summarized and analyzed. (Statistics)
6. Conclusions are drawn from the analysis.
Notice the two points on the cycle where statistics plays an important role:
1. The analysis of the collected data.
2. The design of a data collection procedure.
The analysis of the collected
data.
• This of course is the traditional use of statistics.
• Note that if the data collection procedure is well
thought out and well designed, the analysis step of
the research project will be straightforward.
• Usually experimental designs are chosen with the
statistical analysis already in mind.
• Thus the strategy for the analysis is usually
decided upon when any study is designed.
• It is a dangerous practice to select the form of analysis after the data has been collected (the choice may be made to favour certain predetermined conclusions, resulting in a considerable loss of objectivity).
• Sometimes however a decision to use a
specific type of analysis has to be made
after the data has been collected (It was
overlooked at the design stage)
The design of a data collection
procedure
• The importance of statistics is quite often ignored at this stage.
• It is important that the data collection
procedure will eventually result in
answers to the research questions.
• It should also result in the most accurate answers possible for the resources available to the research team.
• Note that the success of a research project should be judged not by the answers it comes up with but by the accuracy of those answers.
• The accuracy of the answers is usually the indicator of a valuable research project.
Some definitions
important to Statistics
A population:
this is the complete collection of subjects
(objects) that are of interest in the study.
There may be (and frequently are) more than one population of interest, in which case a major objective is that of comparison.
A case (elementary sampling
unit):
This is an individual unit (subject) of the
population.
A variable:
a measurement or type of measurement
that is made on each individual case in the
population.
Types of variables
Some variables may be measured on a
numerical scale while others are
measured on a categorical scale.
The nature of the variables has a great influence on which analysis will be used.
For variables measured on a numerical scale the measurements will be numbers.
Ex: Age, Weight, Systolic Blood Pressure
For variables measured on a categorical scale the measurements will be categories.
Ex: Sex, Religion, Heart Disease
Types of variables
In addition some variables are labeled as
dependent variables and some variables
are labeled as independent variables.
This usually depends on the objectives of
the analysis.
Dependent variables are output or
response variables while the
independent variables are the input
variables or factors.
Usually one is interested in determining
equations that describe how the dependent
variables are affected by the independent
variables
A sample:
Is a subset of the population
Types of Samples
Different types of samples are determined by how the sample is selected.
Convenience Samples
In a convenience sample the subjects that
are most convenient to the researcher are
selected as objects in the sample.
This is not a very good procedure for
inferential Statistical Analysis but is
useful for exploratory preliminary work.
Quota samples
In quota samples subjects are chosen
conveniently until quotas are met for
different subgroups of the population.
This also is useful for exploratory
preliminary work.
Random Samples
Random samples of a given size are selected in such a way that all possible samples of that size have the same probability of being selected.
Convenience samples and quota samples are useful for preliminary studies. It is, however, difficult to assess the accuracy of estimates based on these types of sampling schemes.
Sometimes one has to be satisfied with a convenience sample and assume that it is equivalent to a random sampling procedure.
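As a small illustration (Python with NumPy; the population here is an artificial set of labelled cases), a simple random sample can be drawn so that every possible sample of the chosen size is equally likely:

```python
import numpy as np

rng = np.random.default_rng(0)

population = np.arange(10_000)                             # labels of all cases in the population
sample = rng.choice(population, size=100, replace=False)   # a simple random sample of 100 cases
# Sampling without replacement in this way gives every possible
# sample of size 100 the same probability of being selected.
print(sample[:10])
```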
Some other definitions
A population statistic
(parameter):
Any quantity computed from the values
of variables for the entire population.
A sample statistic:
Any quantity computed from the values
of variables for the cases in the sample.
Statistical Decision Making
• Almost all problems in statistics can be formulated as a problem of making a decision.
• That is, given some data observed from some phenomenon, a decision will have to be made about that phenomenon.
Decisions are generally broken
into two types:
• Estimation decisions
and
• Hypothesis Testing decisions.
Probability Theory plays a very important role in these decisions and in the assessment of the errors made by these decisions.
Definition:
A random variable X is a
numerical quantity that is
determined by the outcome of a
random experiment
Example :
An individual is selected at
random from a population
and
X = the weight of the individual
The probability distribution of a (continuous) random variable is described by its probability density curve f(x), i.e. a curve which has the following properties:
• 1. f(x) is never negative.
• 2. The total area under the curve f(x) is one.
• 3. The area under the curve f(x) between a and b is the probability that X lies between those two values.
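These three properties can be checked numerically for any density curve; the sketch below (Python with SciPy, using a Normal(50, 15) density purely as an illustrative example) does so by numerical integration:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

f = norm(loc=50, scale=15).pdf          # an example density curve f(x)

x = np.linspace(0, 120, 241)
print((f(x) >= 0).all())                # 1. f(x) is never negative
print(quad(f, -np.inf, np.inf)[0])      # 2. total area under f(x) is 1
a, b = 40, 60
print(quad(f, a, b)[0])                 # 3. area from a to b = P(a < X < b)
```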
[Figure: a probability density curve f(x)]
Examples of some important
Univariate distributions
1. The Normal distribution
A common probability density curve is the “Normal”
density curve - symmetric and bell shaped
f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}

Comment: If μ = 0 and σ = 1 the distribution is called the standard normal distribution.

[Figure: Normal density curves with μ = 50, σ = 15 and with μ = 70, σ = 20]
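As a quick check of the density formula above, it can be coded directly and compared against a library implementation (a minimal sketch in Python; scipy.stats.norm is used only for the comparison):

```python
import numpy as np
from scipy.stats import norm

def normal_pdf(x, mu, sigma):
    """Density curve of the Normal distribution with mean mu and standard deviation sigma."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

x = np.linspace(0, 120, 7)
print(np.allclose(normal_pdf(x, 50, 15), norm.pdf(x, loc=50, scale=15)))  # True
```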
2. The Chi-squared distribution with ν degrees of freedom

f(x) = \frac{1}{\Gamma(\nu/2)\, 2^{\nu/2}}\, x^{(\nu-2)/2} e^{-x/2} \quad \text{if } x \ge 0

[Figure: the chi-squared density curve]
Comment: If z_1, z_2, ..., z_\nu are independent random variables each having a standard normal distribution, then

U = z_1^2 + z_2^2 + \cdots + z_\nu^2

has a chi-squared distribution with ν degrees of freedom.
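A small Monte Carlo sketch (Python with SciPy; ν = 5 is chosen arbitrarily) illustrates this construction:

```python
import numpy as np
from scipy.stats import chi2, kstest

rng = np.random.default_rng(0)
nu = 5
z = rng.standard_normal((100_000, nu))   # nu independent standard normals per row
u = (z ** 2).sum(axis=1)                 # U = z1^2 + z2^2 + ... + z_nu^2
print(kstest(u, chi2(df=nu).cdf).pvalue) # large p-value: consistent with chi-squared(nu)
```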
3. The F distribution with ν1 degrees of freedom in the numerator and ν2 degrees of freedom in the denominator

f(x) = K\, x^{(\nu_1-2)/2} \left(1 + \frac{\nu_1}{\nu_2} x\right)^{-(\nu_1+\nu_2)/2} \quad \text{if } x \ge 0

where K = \frac{\Gamma\left(\frac{\nu_1+\nu_2}{2}\right)}{\Gamma\left(\frac{\nu_1}{2}\right)\Gamma\left(\frac{\nu_2}{2}\right)} \left(\frac{\nu_1}{\nu_2}\right)^{\nu_1/2}

[Figure: the F distribution density curve]
Comment: If U_1 and U_2 are independent random variables having chi-squared distributions with ν1 and ν2 degrees of freedom respectively, then

F = \frac{U_1/\nu_1}{U_2/\nu_2}

has an F distribution with ν1 degrees of freedom in the numerator and ν2 degrees of freedom in the denominator.
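The same kind of Monte Carlo sketch (Python with SciPy; ν1 = 3 and ν2 = 12 are arbitrary choices) illustrates the construction of F:

```python
import numpy as np
from scipy.stats import f as f_dist, kstest

rng = np.random.default_rng(0)
nu1, nu2 = 3, 12
u1 = (rng.standard_normal((100_000, nu1)) ** 2).sum(axis=1)   # chi-squared(nu1)
u2 = (rng.standard_normal((100_000, nu2)) ** 2).sum(axis=1)   # chi-squared(nu2)
ratio = (u1 / nu1) / (u2 / nu2)
print(kstest(ratio, f_dist(dfn=nu1, dfd=nu2).cdf).pvalue)     # consistent with F(nu1, nu2)
```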
4. The t distribution with ν degrees of freedom

f(x) = K \left(1 + \frac{x^2}{\nu}\right)^{-(\nu+1)/2}

where K = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)\sqrt{\pi\nu}}

[Figure: the t distribution density curve]
Comment: If z and U are independent random variables, where z has a standard normal distribution and U has a chi-squared distribution with ν degrees of freedom, then

t = \frac{z}{\sqrt{U/\nu}}

has a t distribution with ν degrees of freedom.
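And the analogous sketch for t (Python with SciPy; ν = 8 is arbitrary):

```python
import numpy as np
from scipy.stats import t as t_dist, kstest

rng = np.random.default_rng(0)
nu = 8
z = rng.standard_normal(100_000)
u = (rng.standard_normal((100_000, nu)) ** 2).sum(axis=1)   # chi-squared(nu)
t_vals = z / np.sqrt(u / nu)
print(kstest(t_vals, t_dist(df=nu).cdf).pvalue)             # consistent with t(nu)
```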
An applet is available showing critical values and tail probabilities for various distributions:
1. Standard Normal
2. t distribution
3. Chi-square distribution
4. Gamma distribution
5. F distribution
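The same critical values and tail probabilities can also be computed with a statistical library; a sketch using scipy.stats (the cut-offs and tail points below are arbitrary examples):

```python
from scipy.stats import norm, t, chi2, gamma, f

print(norm.ppf(0.975))             # standard normal critical value (two-tailed 5% test)
print(t.sf(2.0, df=10))            # upper-tail probability P(T > 2.0), 10 d.f.
print(chi2.ppf(0.95, df=4))        # 95th percentile of chi-square with 4 d.f.
print(gamma.sf(3.0, a=2.0))        # upper-tail probability for a Gamma(shape = 2)
print(f.ppf(0.95, dfn=3, dfd=12))  # F critical value with 3 and 12 d.f.
```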
The Sampling distribution
of a statistic
A random sample from a probability
distribution, with density function
f(x) is a collection of n independent
random variables, x1, x2, ...,xn with a
probability distribution described by
f(x).
If for example we collect a random
sample of individuals from a population
and
– measure some variable X for each of
those individuals,
– the n measurements x1, x2, ...,xn will
form a set of n independent random
variables with a probability distribution
equivalent to the distribution of X across
the population.
A statistic T is any quantity
computed from the random
observations x1, x2, ...,xn.
• Any statistic will necessarily also be a random variable and therefore will have a probability distribution described by some probability density function fT(t).
• This distribution is called the
sampling distribution of the
statistic T.
• This distribution is very important if one is
using this statistic in a statistical analysis.
• It is used to assess the accuracy of a
statistic if it is used as an estimator.
• It is used to determine thresholds for
acceptance and rejection if it is used for
Hypothesis testing.
Some examples of Sampling
distributions of statistics
Distribution of the sample mean for a sample from a Normal population
Let x1, x2, ..., xn be a sample from a normal population with mean μ and standard deviation σ.
Let

\bar{x} = \frac{\sum_i x_i}{n}

Then \bar{x} has a normal sampling distribution with mean

\mu_{\bar{x}} = \mu

and standard deviation

\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
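A short simulation sketch (Python with NumPy; μ = 50, σ = 15 and n = 25 are arbitrary choices) confirms the mean and standard deviation of this sampling distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 50, 15, 25
xbar = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)   # 10,000 sample means
print(xbar.mean(), xbar.std(ddof=1), sigma / np.sqrt(n))      # mean ≈ mu, sd ≈ sigma/sqrt(n)
```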
Distribution of the z statistic
Let x1, x2, ..., xn be a sample from a normal population with mean μ and standard deviation σ.
Let

z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}

Then z has a standard normal distribution.
Comment:
Many statistics T have a normal distribution with mean μ_T and standard deviation σ_T. Then

z = \frac{T - \mu_T}{\sigma_T}

will have a standard normal distribution.
Distribution of the χ² statistic for the sample variance
Let x1, x2, ..., xn be a sample from a normal population with mean μ and standard deviation σ.
Let

s^2 = \frac{\sum_i (x_i - \bar{x})^2}{n-1} = \text{sample variance}

and

s = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n-1}} = \text{sample standard deviation}

Let

\chi^2 = \frac{\sum_i (x_i - \bar{x})^2}{\sigma^2} = \frac{(n-1)s^2}{\sigma^2}

Then χ² has a chi-squared distribution with ν = n - 1 degrees of freedom.
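A simulation sketch of this result (Python with SciPy; the values of μ, σ and n are arbitrary):

```python
import numpy as np
from scipy.stats import chi2, kstest

rng = np.random.default_rng(0)
mu, sigma, n = 50, 15, 12
x = rng.normal(mu, sigma, size=(20_000, n))
chi2_stat = (n - 1) * x.var(axis=1, ddof=1) / sigma ** 2   # (n-1) s^2 / sigma^2
print(kstest(chi2_stat, chi2(df=n - 1).cdf).pvalue)        # consistent with chi-squared(n-1)
```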
[Figure: the chi-squared distribution density curve]
Distribution of the t statistic
Let x1, x2, ..., xn be a sample from a normal population with mean μ and standard deviation σ.
Let

t = \frac{\bar{x} - \mu}{s/\sqrt{n}}

Then t has Student's t distribution with ν = n - 1 degrees of freedom.
Comment:
If an estimator T has a normal distribution with mean μ_T and standard deviation σ_T, and if s_T is an estimator of σ_T based on ν degrees of freedom, then

t = \frac{T - \mu_T}{s_T}

will have Student's t distribution with ν degrees of freedom.
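And a matching sketch for the t statistic (Python with SciPy; μ, σ and n are again arbitrary):

```python
import numpy as np
from scipy.stats import t as t_dist, kstest

rng = np.random.default_rng(0)
mu, sigma, n = 50, 15, 12
x = rng.normal(mu, sigma, size=(20_000, n))
t_stat = (x.mean(axis=1) - mu) / (x.std(axis=1, ddof=1) / np.sqrt(n))
print(kstest(t_stat, t_dist(df=n - 1).cdf).pvalue)   # consistent with Student's t, n-1 d.f.
```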