Stat 502 Topic #1


Topic 1 – Basic Statistics
KKNR Chapters 1-3
1
Topic Overview

Course Syllabus & Schedule

Review: Basic Statistics




Terminology: Being able to Communicate
Distributions: Normal, T, F
Hypothesis Testing & Confidence Intervals
Significance Level & Power
2
Time Requirements

Often the biggest complaint in this class is
that the coursework takes too much time.

As a general rule, if you are staring at a
problem for 10 minutes and not getting very
far, you should seek help.

EMAIL / Office Hours / Appointment

Ask your group-mates how to get started
(remember, please don’t share homework
solutions)
3
Collaborative Learning

Key Premise: If you help each other and
work together, you will learn more.

Requires: Cooperation and some
coordinating on your part.

Note: If something isn’t working, or if you
have suggestions for how this aspect of the
course can improve, please tell me!
4
SAS Software

We’ll use SAS software for the course.

I’ll generally provide template files with each
lecture for the purpose of helping you learn
the appropriate code.

Hopefully working in groups will reduce the
strain/stress people often feel working in
SAS.

Can email me questions about SAS – please
send CODE (*.SAS file) and LOG (*.LOG).
5
Questions?
6
Terminology of Statistics &
Experimental Design
One of the most important
components of a good statistical
analysis is the ability to communicate
your results to others.
7
What is Statistics?

“Scientific process of learning from available
data and making decisions in the face of
variability”

Good statistics involves…

Unbiased collection of information.

Using appropriate statistics to describe the
information.

Using models to help us interpret the data,
make generalizations, and draw conclusions.
8
Key Concept

What is variability? Why is it
useful to us? How do we use it to
our advantage in trying to assess
relationships among variables?
9
Data Collection

Several questions should be answered before
data collection whenever possible…

What is the response variable (the variable
of interest)?

What magnitude of change in response is
important? (Statistical vs. Practical
Significance)

What are potential predictor variables?
10
Data Collection (2)

How do we measure each variable? Are
the variables continuous? Categorical?
Are the variables experimental?
Observational? Mixed? Are there any
sources of BIAS?

How many observations / replicates do
we take? And what are the resources /
costs involved?
11
Data Collection (3)

Unfortunately, I can’t count how many
times students have come to me
AFTER data collection and tried to
resolve the same questions above.

You can see how a little groundwork
goes a long way in real practical
experience.
12
Tradeoffs – Observational
vs. Experimental Study

Observational:
Easy to run
Less expensive
Difficult to draw definitive conclusions (cause/effect)
No control of the predictor variables.

Experimental:
More difficult to run
More expensive
Can plan the experiment to obtain definitive conclusions from the model
Have control of the predictor variables
Can show cause and effect.
13
Classification of Variables

Type of Analysis usually depends on
classification


Let me repeat that: Type of Analysis
DEPENDS on classification!!!
Variables may be classified according to


TYPE: Categorical (qualitative) or
Continuous (quantitative)
DESCRIPTIVE ORIENTATION: Response /
Predictor / Nuisance (or control)
14
Categorical (Qualitative) Variables

Nominal – variables are classified into categories that have no logical ordering.
Examples: Hair Color, Blood Type, Sex

Ordinal – variables are classified into categories that have some logical ordering.
Examples: Letter Grade, Agree/Disagree Survey, Social Class, Age
15
Continuous (Quantitative) Variables


Variables take on numerical values for
which arithmetic operations make sense.

Continuous – Height, Water Level, Age

Sometimes Counts (e.g. # of defects) are
treated as continuous
Note: ANY response variable discussed in
this class for regression or analysis of
variance will be treated as continuous.
16
Descriptive Orientation

Response (or dependent) variable
Variable to be described in terms of other variables.
The objectives of your research determine which variable will be the response.

Predictor (or independent) variables
Variables of experimental interest used to describe the response variable.

Nuisance (or control) variables
Variables may be associated with the response but are not of interest experimentally.
Sometimes called covariates, confounders, or lurking variables.
17
Association vs. Cause

Association: Some values of the response
variable tend to occur more often with
certain values of the predictor variable(s).

Association does NOT always mean
causation. Watch out for lurking variables!

Example: Deaths from heat related illness
are highest when more ice cream is sold.
Lurking variable: __________
18
Summary: Issues to Consider
When Designing An Experiment







Statement of the problem
What is the response variable of interest?
What qualifies as an “important” change in
response?
What predictors are available? Which do we expect
will be important? What are their classifications?
What is the experimental unit? Which variables, if
any, have associated uncertainty or variation?
Are there nuisance variables? Can we block on
control variables?
How many observations are to be taken? What are
the resources / costs?
19
Collaborative Learning Activity
CLG #1.1-1.4
Experiments have been described
for you. Please try (in groups) to
answer all of the relevant questions
from the previous summary slide.
20
CLG #1.1
A baseball scout is interested in determining
variables that can be used in predicting
future power-hitters from their minor league
numbers. He has minor league data
available on 200 current major leaguers.
He wants to relate this data to the number
of homeruns hit by the players in their 4th
full season of major league ball.
21
CLG #1.2
To study the effects of new methods of
germination for strawberry plants, a farm manager randomly (and equally) allocated
200 seeds to four germination methods
(three new methods and a control). After
14 days, the weight of each plant was
measured.
22
CLG #1.3
We want to study the effects of two newly
developed instructional methods. To do this,
each method will be assigned to five different
3rd grade classrooms. There will also be five
classrooms that are used as a control.
Students will be given a test at the beginning of
the semester and an equivalent test at the end
of the semester. It will be assumed that the
difference in their test scores reflects the
amount which they have learned.
23
CLG #1.4
Design your own study, perhaps related to
some research you actually are doing or
plan to do at some point in the future. You
need a response variable, one or more
predictor variables, and reason to believe
they have an association.
24
Descriptive Statistics & Plots
Once we collect data, we need
appropriate methods to analyze it.
We often begin by considering
descriptive statistics and plots
concerning individual variables.
25
Describing the Information

You should recall most of these graphical
and numerical summaries from STAT 501

Descriptive Statistics


SAS: PROC UNIVARIATE or PROC MEANS
Different plots:

Scatter Plots, Histograms, etc.

SAS: PROC GPLOT (and other associated
statements like SYMBOL, AXIS, etc.)
26
Examining Data


Once we have collected data and understand
our different variables from a conceptual
standpoint, we now need to…

Summarize the information in the data

Use statistical procedures to draw relevant
conclusions.
These allow us to examine the distributional
aspects of our variables. They also help us
to identify potential problems (e.g. outliers,
gaps, heavy skewness etc.).
27
Key Concept

A descriptive statistic is any single number
computed from a data set and designed to
describe a feature of the population.

The true feature is known as a population
parameter. Statistics based on a known
sample are used to estimate the population
parameters.
28
Statistics as RV’s

Statistics are random variables!

The observed value of a statistic is based
on a particular sample. Different samples
may result in different values.

By looking at different samples, and the
variation in our statistic, we can get an
idea of the accuracy of our estimate.
29
Review of Random Variables
As we are talking a lot about
random variables and distributions
of statistics, it seems appropriate
to pause for review of some more
important and useful distributions.
30
Random Variables

Random variables are present in any
situation that is probabilistic – where we
cannot predict outcomes with certainty (i.e.
there is variation).

By understanding the underlying
distribution of a statistic that is random, we
may develop estimates for the population
parameter that encompass the variability
(i.e. confidence intervals).
31
Discrete Random Variables

Have a finite or countable number of
outcomes

Distribution often represented
pictorially using a histogram

Examples: Binomial, Poisson
32
Binomial Histogram
33
Continuous Random Variables

Have uncountably many possible
outcomes.

Distribution represented by a density
curve (probability represented by area
under the curve; probability of any
specific value is zero!)

Example: Normal Distribution
34
The Normal Distribution
Commonly denoted N(μ, σ), where μ is
the mean and σ is the standard deviation.

[Figure: three normal density curves plotted for x from -6 to 4 — Magenta = N(0, 1), Green = N(-1, var = 2), Cyan = N(3, var = 0.5).]
35
Normal Distribution Features

Symmetric & Bell-shaped.

Centered at the mean.

68-95-99.7 rule – 68% of observations
fall within one SD of the mean, 95%
within two SD, and 99.7% within 3 SD.

A standard normal distribution has
mean 0 and standard deviation 1.

Compute probabilities by first
standardizing and then using Table A-1
36
Standardization

If X ~ N(μ, σ), then you can transform X
to become Z ~ N(0, 1) by subtracting the
mean and dividing by the standard
deviation:

Z = (X − μ) / σ

This process is called standardization.
Once standardized, probabilities may be
looked up in a table (Table A-1).
37
Example
If the height of a person in this class is normally
distributed with a mean of 70 inches and a standard
deviation of 3 inches, what is the probability that a
randomly chosen person is less than 66 inches tall?
X ~ N(70, 3)

P(X < 66) = P(Z < (66 − 70)/3) = P(Z < −1.33) = 0.0918
38
Example (2)
If the height of this class is normally distributed with a
mean of 70 inches and a standard deviation of 3
inches, what is the probability that a randomly
chosen person will fall within 1.5 standard
deviations of the mean?
P  70  1.5  3  X  70  1.5  3   P  1.5  Z  1.5 
 0.9322  0.0668
 0.8654
39
Example (3)
Approximately what percentage of people
would be taller than 75 inches?
75  70 

P  X  75   P  Z 

3 

 P  Z  1.667   1  0.952  0.048
So we would say that about 4.8% are taller
than 75 inches.
40
Example (4)
What height represents the 80th
percentile?
z_0.80 = 0.84 (inverse reading of the table)
x_0.80 = μ + σ·z_0.80 = 70 + 3(0.84) = 72.52
41
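The four worked height examples above can be checked in code. A minimal sketch using Python's standard library (`statistics.NormalDist` in place of Table A-1; note the course itself uses SAS, so this is purely illustrative) — exact CDF values differ from table lookups only in the last digit, because the tables round z to two decimals:

```python
from statistics import NormalDist

heights = NormalDist(mu=70, sigma=3)   # X ~ N(70, 3)

# Example 1: P(X < 66), probability a person is shorter than 66 inches
p_short = heights.cdf(66)

# Example 2: P(70 - 1.5*3 < X < 70 + 1.5*3), within 1.5 SDs of the mean
p_within = heights.cdf(74.5) - heights.cdf(65.5)

# Example 3: P(X > 75), proportion taller than 75 inches
p_tall = 1 - heights.cdf(75)

# Example 4: 80th percentile of height
x80 = heights.inv_cdf(0.80)

print(round(p_short, 4), round(p_within, 4), round(p_tall, 4), round(x80, 2))
```

These agree with the table-based answers to rounding precision.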
Common Sampling Distributions
Normal Distribution
T Distribution
Chi-Square Distribution
F Distribution
42
T Distribution

Used to describe a standardized random variable
with unknown variance.

Similar to Z, but slightly heavier in the tails
depending on “degrees of freedom”

If X is normal, then

T = (X̄ − μ) / (S/√n) ~ t(n − 1)

If certain assumptions are satisfied, then the T
distribution may also be used for statistics of the
form

T = (θ̂ − θ) / S_θ̂

where S_θ̂ is the estimated standard error of θ̂.

Example: Two sample T-test
43
T - Distribution
44
Chi Square Distribution

Asymmetric; used to describe some non-negative RVs (e.g. variances)

Also used widely in categorical data analysis

Probabilities (based on DF) in Table A-3.

Example: If we have a random sample from
some normal population, then

(n − 1)S²/σ² = Σ(Xᵢ − X̄)²/σ² ~ χ²(n − 1)
45
Chi Square Distribution
46
F Distribution

Used to describe the ratio of independent variances
(S₁²/σ₁²) / (S₂²/σ₂²) ~ F(n₁ − 1, n₂ − 1)



Used extensively in Analysis of Variance (ANOVA)
Critical values based on numerator DF,
denominator DF, and significance level are
found in Table A-4.
Special Property: [t(ν, 1 − α/2)]² = F(1, ν, 1 − α)
47
Types of Descriptive Statistics


Measures of center or location

Mean

Median
Measures of variability

Range = Max – Min

Interquartile Range = Q3 – Q1

Variance or Standard Deviation
48
Measures of Location

The sample mean is the arithmetic average and is
denoted

X̄ = (1/n) Σᵢ₌₁ⁿ Xᵢ

where n is the sample size.

The sample mean is much more sensitive to
extreme X values than the sample median (middle
value). By the CLT, when n is large enough, the
sample mean is approximately normally distributed
(a nice property). If the sample is not large but the underlying population is normal, the standardized sample mean has a T distribution.
49
Measures of Variability

The deviation of the ith observation is given by
Xᵢ − X̄

The average (or sum) of the deviations in a sample
is always zero.

The average of the squared deviations is the
sample variance.
S² = [1/(n − 1)] Σᵢ₌₁ⁿ (Xᵢ − X̄)²
We commonly divide the sum by n – 1 degrees of
freedom so that the estimate remains unbiased.
50
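The two facts on this slide — deviations always sum to zero, and the sample variance divides the squared deviations by n − 1 — can be demonstrated directly. A sketch with made-up numbers:

```python
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)
xbar = sum(x) / n

# The deviations always sum to zero (up to floating-point noise).
deviations = [xi - xbar for xi in x]
print(round(sum(deviations), 10))   # 0.0

# Sample variance: sum of squared deviations over n - 1 degrees of freedom.
s2 = sum(d**2 for d in deviations) / (n - 1)
print(round(s2, 6), round(statistics.variance(x), 6))  # both give 32/7
```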
Measures of Variability (2)

The sample standard deviation is
S S 
2
1
n 1
 X
i
X
2
It is more convenient than the variance
because it is measured in the same units as
the observation.

Both the variance and standard deviation
are sensitive to extreme observations.
51
Measures of Variability (3)

If an underlying distribution is Normal, then
the sample variance is distributed as Chi-Squared.

Additionally, a ratio of variances (e.g.
ANOVA) will generally have an F
distribution.
52
Inference
Point Estimates
Confidence Intervals
Hypothesis Testing
53
Statistical Inference


Two General Categories

Estimation (point estimates, confidence
intervals)

Hypothesis Testing (which can often be
accomplished using CI’s)
Both estimation and hypothesis testing are
based on the premise of repeated
experiments and used heavily in
regression/ANOVA
54
General Goal of Inference

Want to distinguish chance variations from
true differences.

Example: Standardized test given to 500
Purdue students and 500 Notre Dame
students, with average scores of 92 and 88
respectively. Is the difference in scores
random chance? Or truly reflecting
intellect?
55
Point Estimates

Generally speaking, we want to quantify some
population parameter (e.g. the pop. mean  )

A point estimate is a single numerical
summary from a random sample that is used
to estimate a population parameter (e.g. X )

We generally like the point estimate to be
unbiased, that is its expectation is the
parameter of interest.
56
Point Estimates (2)

One problem with point estimates is that (by
themselves) they give no indication of precision.

A quick solution is to associate a standard error
(standard deviation) to the point estimate. For
example, we might consider X (+/- s).

Another nice property for a point estimate to
have is that of minimum variance. That is, of
all unbiased point estimators (and there can be
many), it is the one with the smallest SD.
57
Confidence Intervals

When distributional assumptions can be
made, we combine the point estimate with
its standard error to construct a confidence
interval. A fairly general form for a CI is:
Point Estimate ± (appropriate crit. value) × (Estimated Std. Error)

where the product after the ± sign is the Margin of Error.

Example for the mean:

X̄ ± t(n − 1, 1 − α/2) · s/√n
58
Example (KKMN #3.13)
A random sample of n = 32 persons
attending a diet clinic was found after 3
weeks to have lost an average of X̄ = 30
pounds. The sample standard deviation
was s = 11. We would like to construct a
95% CI for the true mean.
59
Example (2)



The point estimate for the mean is given.
Assuming the underlying population is
normal, and since n is small/medium, and
the SD is estimated, the sample mean will
have a T distribution. The critical value 2.04
is the 97.5th percentile of a T distribution
with ~30 degrees of freedom (see tables).
The standard error of the mean will be
s/√n = 11/√32
60
Example (3)

X̄ ± t(30, 0.975) · s/√n = 30 ± (2.04)(11/√32) = (26.03, 33.97)
61
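The diet-clinic interval can be reproduced in a few lines of Python. A minimal sketch — the critical value 2.04 is taken from the slide's table lookup and hard-coded, since the standard library has no t quantile function:

```python
import math

n, xbar, s = 32, 30.0, 11.0
t_crit = 2.04    # 97.5th percentile of t with ~30 df, from the tables

se = s / math.sqrt(n)          # estimated standard error of the mean
margin = t_crit * se           # margin of error
lower, upper = xbar - margin, xbar + margin
print(round(lower, 2), round(upper, 2))   # (26.03, 33.97)
```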
What is a 95% CI?

There is often some controversy about
what “95% confidence” means. The key
to understanding is to remember two
things:


The population parameter, though
unknown, is a FIXED number. It is not
random – it does not have “probability”
associated to it.
The endpoints of the confidence interval
are RANDOM. If you take a new sample,
you’ll get different endpoints.
62
What is a 95% CI? (2)


Confidence level gives the probability that
the interval will cover the true value of the
parameter. So for a 95% CI, in the long run
95% of the intervals will capture the
parameter (see next slide).
So we expect 95 / 100 intervals to capture
the parameter – but this does not mean
exactly 95 of the next 100 trials will result in
such success. The actual number that do
would be a Binomial random variable.
63
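The long-run coverage interpretation can be illustrated by simulation. A hedged sketch: for simplicity it assumes σ is known and uses the z critical value 1.96 (the standard library lacks t quantiles); the population parameters are made up:

```python
import random
from statistics import NormalDist, mean

random.seed(502)
mu, sigma, n, trials = 50.0, 10.0, 25, 1000
z = NormalDist().inv_cdf(0.975)   # about 1.96, for 95% confidence

covered = 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = mean(sample)
    margin = z * sigma / n ** 0.5   # known-sigma interval for simplicity
    # The endpoints are random; the parameter mu is fixed.
    if xbar - margin <= mu <= xbar + margin:
        covered += 1

# In the long run, about 95% of such intervals capture the true mean.
print(covered / trials)
```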
Be Careful...

Avoid statements like “the probability that the
true mean is in the CI is 0.95”. The true
mean either IS, or IS NOT, in the interval –
and we have no real way of knowing whether
we might have been unlucky in taking our
sample. We know that we will be unlucky
about 5% of the time.
65
Significance Level

Relate confidence to significance level.

The significance level represents the chance
that we will “get unlucky” in our sampling, and
not capture the population parameter with our
interval. It reflects the risk we are willing to take.

We usually set the significance level at 5%, meaning that we will accept the reality that
about 1 of every 20 samples will lead us to an
interval that doesn’t capture the parameter.

More about significance level later...
66
Properties of CI’s

CI by default is the shortest interval for
which we have 100(1 − α)% confidence that
the interval covers the parameter of interest.

Higher confidence level  Wider Interval

The CI gives a region of “likely” values for
the parameter. So if the CI does not contain
some value, then we may say that there is
evidence against the statement that the true
parameter has that value (hypothesis test).
67
Hypothesis Testing
General procedure to test if the
parameter is in a specific region
1. Look at the data & check assumptions.
2. State the null and alternative hypotheses.
The alternative is what you want to show.
3. Specify the desired significance level.
4. Specify the test statistic and its sampling
distribution under the null.
5. Form a decision rule.
6. Compute the statistic and draw a conclusion.
68
Example: Two sample t-test

H₀: μ₁ = μ₂ vs. H_A: μ₁ ≠ μ₂

Collect data – n₁ and n₂ observations:
x_11, x_12, ..., x_1n₁ and x_21, x_22, ..., x_2n₂

x̄_i = (x_i1 + x_i2 + ... + x_in_i) / n_i

Is the observed difference x̄₁ − x̄₂
unusual if μ₁ = μ₂?
69
Two sample t-test (2)


Underlying Assumptions
1. Independent observations
2. Equal variances
3. Normally distributed observations
If the assumptions are true, then the
statistic T under the null hypothesis has a
t-distribution with n1 + n2 – 2 degrees of
freedom.
70
Two sample t-test (3)

Use the statistic

T = (x̄₁ − x̄₂) / [ S_p √(1/n₁ + 1/n₂) ]

where

S_p² = [ (n₁ − 1)S₁² + (n₂ − 1)S₂² ] / (n₁ + n₂ − 2)
Reject H0 if |T| is bigger than the
corresponding critical value of the
t-distribution for the desired significance
level. This is how we quantify “unusual”.
71
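The pooled two-sample statistic can be written as a small function. A sketch with made-up data (the group names and numbers are illustrative only, not from the lecture):

```python
import math
import statistics

def pooled_t(x1, x2):
    """Two-sample t statistic with pooled variance (equal-variance assumption)."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance: weighted average of the two sample variances
    s2p = ((n1 - 1) * statistics.variance(x1) +
           (n2 - 1) * statistics.variance(x2)) / (n1 + n2 - 2)
    se = math.sqrt(s2p * (1 / n1 + 1 / n2))
    return (statistics.mean(x1) - statistics.mean(x2)) / se

# Illustrative data only:
group_a = [5.1, 4.9, 5.6, 5.3, 5.0]
group_b = [4.2, 4.5, 4.1, 4.6, 4.4]
t = pooled_t(group_a, group_b)
# Compare |t| to the critical value of t with n1 + n2 - 2 = 8 df.
print(round(t, 2))
```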
P-values

We quantify “unusual” by considering a
p-value. The p-value reflects the probability
that a randomly drawn t is more extreme
than T.

We reject the null hypothesis if the p-value
is small (less than our choice of significance
level  ).

Can be visualized by considering the red-shaded area on the next slide.
72
P-values
73
Example
In a study of strawberry growth, 20 seedlings were
randomly and equally assigned to be grown in either
normal or nitrogen rich soil. The production from
these 20 plants during their 2nd year is summarized
in the table below. Can we conclude that the extra
nitrogen enhances the production?
Soil      n     Mean (lbs)   SD (lbs)
Normal    10    0.623        0.107
Extra N   10    0.832        0.140
74
Example – Assumptions?


Independent Observations??? No reason to
believe otherwise.
Equal Variances??? Maybe – of the three this
might be the most questionable. If we assume it is
ok, the pooled variance is
9  0.107   9  0.14 

 0.0155
18
2
S

2
pooled
2
Normally Distributed??? Again, no reason to
believe otherwise.
75
Example – Hypotheses
Assuming that our assumptions are ok, we
want to test the hypotheses:
H₀: μ_extra = μ_norm vs. H_a: μ_extra > μ_norm

Our test statistic is then given by

T = (x̄_E − x̄_N) / [ S_p √(1/n₁ + 1/n₂) ]
  = (0.832 − 0.623) / √[ 0.0155 (1/10 + 1/10) ]
  = 3.75
76
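The arithmetic on this slide can be verified from the table's summary statistics alone. A minimal check in Python (illustrative only; the course would do this in SAS):

```python
import math

# Summary statistics from the strawberry table
n1 = n2 = 10
mean_extra, sd_extra = 0.832, 0.140
mean_norm, sd_norm = 0.623, 0.107

# Pooled variance: weighted average of the two sample variances
s2_pooled = ((n1 - 1) * sd_norm**2 + (n2 - 1) * sd_extra**2) / (n1 + n2 - 2)

# Two-sample t statistic for the difference in means
t = (mean_extra - mean_norm) / math.sqrt(s2_pooled * (1 / n1 + 1 / n2))

print(round(s2_pooled, 4), round(t, 2))   # 0.0155 and 3.75, matching the slides
```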
Example – Conclusion
Under the null hypothesis this statistic has a t-distribution with 18 degrees of freedom. Checking
the table, we see that the P-value is between 0.001
and 0.002. Since the p-value is smaller than α, we
reject the null and conclude that extra nitrogen does
enhance the crop.
Note that if p-value were bigger than α we would fail
to reject the null and conclude that there is not
sufficient evidence to state that nitrogen enhances
the crop.
77
Alternative Hypothesis Testing
Hypothesis testing can be accomplished
very easily if we have already computed
a confidence interval.

If Confidence Interval contains the value
specified under the null hypothesis, you must
fail to reject.

If the confidence interval does not contain
the value specified under the null, you reject
the null and conclude the alternative.
78
Potential Errors in Testing
Significance Level
Power & Sample Size
79
Type I and Type II Errors
                 True State of Nature
                 H₀ true                       H_A true
Conclude H₀      Correct Result                False Neg. (Type II Error)
Conclude H_A     False Pos. (Type I Error)     Correct Result
80
Type I Errors

A Type I Error occurs when we incorrectly
reject the null hypothesis. The probability of
a Type I Error is

α = P(reject H₀ | H₀ is true)

We generally set the alpha value to 0.05.
Note that the value of alpha that we choose
defines an “unusual result” that would cause
us to reject our null hypothesis.
81
Multiple Hypothesis Tests

Alpha is set for a single test – more often than not we will
want to perform more than one test on a dataset

When there are multiple tests the effective alpha or
overall Type I error rate (probability of at least one Type I
Error) will be much higher than the alpha for a single test.

Several ways to control the overall Type I Error rate. For
now, we will consider only the Bonferroni correction.
When making this correction for k tests, use α/k as the
significance level for each test. This should be used only
for planned comparisons and is often extremely
conservative if k is large.
82
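The inflation of the overall Type I error rate, and the Bonferroni fix, can be sketched numerically (the closed-form inflation formula below assumes the k tests are independent):

```python
alpha, k = 0.05, 10   # per-test significance level and number of tests

# With k independent tests each at level alpha, the chance of at least
# one Type I error is 1 - (1 - alpha)^k -- far above 0.05.
overall = 1 - (1 - alpha) ** k
print(round(overall, 3))    # about 0.401

# Bonferroni correction: run each test at level alpha / k, which keeps
# the overall Type I error rate at or below alpha.
per_test = alpha / k        # 0.005 for these numbers
bonferroni_overall = 1 - (1 - per_test) ** k
print(round(bonferroni_overall, 4))   # about 0.0489, just under 0.05
```

This also shows why Bonferroni is conservative: the corrected overall rate lands somewhat below α, and the gap grows with k.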
Type II Errors

A Type II Error occurs when we incorrectly
fail to reject the null hypothesis. The
probability of a Type II Error is

β = P(do not reject H₀ | H₀ is false)

The probability that we do NOT make a
Type II error is called the power of the test.
83
Power & Sample Size
Generally speaking, the goal of any test is to detect a
difference of size at least Δ = μ₁ − μ₂ with high probability.
The choice of Δ is fairly subjective (what has practical
significance?).
84
Balancing Sig. Level & Power


The goal for any test is to have a small significance
level and high power. Unfortunately, significance
level and power are directly related

Increasing the significance level will increase
the power.

Decreasing the significance level will decrease
the power.
Furthermore, there is little direct control over the
power other than to increase the sample size. This
can often be cost prohibitive.
85
Collaborative Learning Activity
CLG #1.5-1.8
We may or may not have time for
this in class, but you should take
time to make sure you are able to
do these problems.
86
Questions?
87
Upcoming in Topic 2...
Review: Simple Linear Regression
Related Reading: Chapters 4 – 7
88