U1.1-Introduction - Department of Mathematics & Statistics


Introduction and Data Gathering
(Chapters 1 – 2)
At the end of this lecture, the student should:
• Be able to provide a definition of Statistics.
• Discuss the role of statistics in research.
• Be able to state reasons for using statistics.
• Identify the difference between observational and
experimental studies.
• Be able to organize data into a two-dimensional matrix or
array.
I hear and I forget
I see and I understand
I do and I remember
Chinese Proverb
STA6166-1-1
General Course Approach
• Keep technical demands low.
• Emphasize examples and context more
than mathematical derivations.
• Incorporate active learning exercises.
• Concentrate on a small set of basic but
easily generalizable ideas.
In mathematics,
Context obscures structure.
In data analysis,
Context provides meaning.
STA6166-1-2
Discussion Versus Readings
Applied statistics lends itself naturally to discussion.
To have time for discussion, the student needs
to get most of the material from the readings
in the book.
As is typical of graduate level
courses, discussion with classmates
is encouraged. However, when it
comes to assignments, each student
should do his/her own work!
Learning statistics is a
lot like learning a
foreign language.
STA6166-1-3
What Is Statistics?
• Summary measures, such as totals, averages or
percentages of measurements, counts or ranks.
• A set of methods for obtaining, organizing,
summarizing, presenting and analyzing numerical
facts in order to help make wise decisions in the
face of uncertainty.
• An area of science concerned with the extraction
of information from numerical data and its use in
making inferences about a population from which
the data are obtained.
STA6166-1-4
Scientific Method
• The pursuit of systematic interrelation of facts by
logical arguments from accepted postulates,
observation, and experimentation and a
combination of these three in varying
proportions.
Roles of Statistics
• Aid in creating the “best” research design with
which to generate new data.
• Extract the information from the noise or variability
at the data analysis step.
STA6166-1-5
Logical Arguments
• Deductive argument: Conclusion follows with logical necessity or
certainty from the premises. Nothing new is revealed because we are
arguing from the general to the specific.
• Specialization: Moving from a large set of objects, postulates, or
events, to consideration of a smaller set of objects or events.
• Inductive argument: The premises provide some evidence for the
truth of the conclusions. Discovering general laws by the observation
and combination of particular instances.
• Generalization: Passing from the consideration of one object,
postulate, or occurrence, to the consideration of a set of objects,
postulates, or occurrences.
• Analogy: Consideration of the kind and amount of agreement among
different objects or events.
In statistics we attempt to formalize and use these concepts
in a quantitative way.
STA6166-1-6
Scientific Progress
We gain knowledge by iterating between
models and data.
[Figure: an iterative loop in which a Hypothesis (Model, Conjecture) is checked against Data (Measurements), leading to a New Hypothesis (New Model) and New Data, and so on toward Progress and Understanding.]
STA6166-1-7
Scientific Thinking
A typical PhD research project iterates as follows:
1. Questions are asked and hypotheses formulated based on existing
knowledge.
2. An observational study is performed to examine the validity of the
hypothesis. (knowledge created)
3. Using the observational data, we identify factors that are hypothesized as
“driving” the process under study.
4. An experiment is designed to test the importance of these factors and
possibly shed light on mechanisms. (knowledge created)
[Figure labels: Research Hypothesis; Literature Data; Sample Data; Important Factors; Experimental Data.]
STA6166-1-8
Basic Study Steps
• State the problem. What are the questions?
• Devise a plan of solution. What will I do?
• Implement the plan. How do I do it?
• Analysis of data. What happened?
• Interpretation of results. What does this mean?
• Reexamination. Is my logic correct? What next?
Study design and study implementation may
require iteration.
STA6166-1-9
Graphical Depiction of Scientific Study
[Flow diagram. Labels: Knowledge Base; Problem; Constraints; Objectives & Hypotheses; DESIGN (Experiment, Sample; How to measure?); DATA; STATISTICAL ANALYSIS; Interpretation; Conclusions.]
STATISTICAL ANALYSIS
• Graphics & Visualization
• Modeling
• Estimates and Confidence Intervals
• Formal Statistical Tests
STA6166-1-10
Research Design Categories
• Census (Complete Enumeration): Every individual in the population
of interest is observed.
• Sampling Studies (Mensurative Experiments or Surveys):
Populations to be compared are defined, and individuals are selected
from these populations for measurement. All members of the
populations have a positive probability of selection for inclusion in the
study.
• Experimental Studies (Manipulative Experiments): Individuals in
one or more populations are carefully chosen or created to test
specific manipulations under highly controlled conditions.
STA6166-1-11
Sampling Study Design
• What are the populations of interest?
• How will individuals be selected for measurement?
• What will be measured?
• Which analyses will be performed?
• How many individuals are needed?
• How large an effect will be considered important?
• Are available resources adequate for this study?
Many of these questions are answered by subject
matter experts; some can be answered by a
statistical analysis.
STA6166-1-12
Sampling Study
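(Mensurative Experiment)
[Figure: Sample 1 is drawn from Population 1 and Sample 2 from Population 2 ("How Selected?"). The characteristics measured on each individual ("What is measured?") form one row of the data matrix below.]

Individual  Population  Characteristics
1           1           x x x x x ...
2           1           x x x x x ...
3           1           x x x x x ...
...
n           1           x x x x x ...
1           2           x x x x x ...
2           2           x x x x x ...
3           2           x x x x x ...
...
m           2           x x x x x ...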
STA6166-1-13
How are individuals selected?
• Individually identified (the “sample unit”).
• Randomly chosen (no biases introduced in selection).
Each possible set of individuals has the same
probability of selection (Simple Random Sampling).
Special situations allow for
increased efficacy of selection.
• Stratification (account for an extraneous factor)
• Clusters (select natural groups of sample units)
• Multi-stage (select large units then parts of units)
• Systematic (set pattern)
STA6166-1-14
Simple Random Sampling
A researcher wishes to determine the prevalence of a disease in a
greenhouse of tomato seedlings. Each seedling tested for the disease is
destroyed in the process, hence only a minimal number should be tested.
Expectations are that only about 0.01% of the roughly 50,000 seedlings in
the greenhouse have the disease.
How to select a simple random sample?
1. Number each pot. Use a random number table (or a spreadsheet
random number generator) to produce a list of numbers, in random
order, from 1 to the total number of pots. Measure the plants in the pots
whose numbers are selected (difficult).
2. Align the pots in rows and columns. Use a random number table to
select a list of (row, column) pairs. Measure the plants in the
pots located at the selected (row, column) positions (easier).
Table 2 in Ott and Longnecker.
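Both selection methods can also be carried out with a software random number generator in place of the printed table. The following is a minimal sketch, not the course's procedure: the sample size of 20, the 200 x 250 grid, and the seed are assumptions chosen only for illustration.

```python
import random

def sample_pot_numbers(total_pots, n, seed=None):
    """Method 1: draw n distinct pot numbers from 1..total_pots."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, total_pots + 1), n))

def sample_row_col_pairs(n_rows, n_cols, n, seed=None):
    """Method 2: draw n distinct (row, column) positions from a grid of pots."""
    rng = random.Random(seed)
    # Sample grid cells without replacement, then convert each cell
    # index to a 1-based (row, column) pair.
    cells = rng.sample(range(n_rows * n_cols), n)
    return sorted((cell // n_cols + 1, cell % n_cols + 1) for cell in cells)

# Assumed layout: 50,000 pots arranged as 200 rows by 250 columns.
pots = sample_pot_numbers(total_pots=50_000, n=20, seed=1)
pairs = sample_row_col_pairs(n_rows=200, n_cols=250, n=20, seed=1)
print(pots)
print(pairs)
```

Sampling grid cells rather than drawing rows and columns separately guarantees that no pot is chosen twice.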
STA6166-1-15
Simple Random Sample
Textbook definition.
A simple random sample of n units is defined such that each
possible sample of size n is equally likely to be drawn.
Practical definition.
This sampling principle assures that each unit in the
population has the same probability (likelihood) of being
selected in the sample.
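The practical definition follows from the textbook one by a short counting argument, sketched here for a population of N units (N is a symbol introduced for this note). If every one of the possible samples of size n is equally likely, and a given unit belongs to a fixed number of them, then:

```latex
P(\text{unit } i \text{ is in the sample})
  = \frac{\binom{N-1}{n-1}}{\binom{N}{n}}
  = \frac{n}{N}
```

The converse does not hold in general: a systematic sample with a random start can give every unit the same chance of selection even though most sets of n units can never be drawn together.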
STA6166-1-16
Stratification
Stratification allows us to take into account a factor we already know affects the
response of interest, that is, to “remove a source of known variability”.
[Figure: a pine forest plot divided into three strata: 16 years, healthy; 22 years, healthy; 20 years, diseased.]
Pine forest: Estimate expected yield from plot.
Individuals are selected at random within each stratum.
Variability in the diseased subpopulation is expected to be much
greater than in the healthy areas. Mean yield is greater at 22 years than at 16 years.
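A minimal sketch of stratified selection for the pine forest example: a simple random sample is drawn within each stratum, and the stratum means are combined using each stratum's share of the population. The yield values and the sample allocation are made up for illustration and are not data from the course.

```python
import random
from statistics import mean

def stratified_sample(strata, n_per_stratum, seed=None):
    """Draw a simple random sample separately within each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(units, n_per_stratum[name])
            for name, units in strata.items()}

def stratified_mean(strata, sample):
    """Weight each stratum's sample mean by its share of the population."""
    total = sum(len(units) for units in strata.values())
    return sum(len(strata[name]) / total * mean(sample[name])
               for name in strata)

# Made-up yields per tree, for illustration only.
strata = {
    "16 years, healthy": [4.1, 3.8, 4.4, 4.0, 3.9, 4.2],
    "22 years, healthy": [6.0, 6.3, 5.8, 6.1, 6.2, 5.9],
    "20 years, diseased": [1.2, 5.5, 3.0, 0.4, 4.8, 2.1],
}
# Hypothetical allocation: more trees from the more variable stratum.
n_per_stratum = {"16 years, healthy": 2,
                 "22 years, healthy": 2,
                 "20 years, diseased": 4}

sample = stratified_sample(strata, n_per_stratum, seed=42)
estimate = stratified_mean(strata, sample)
print(sample)
print(round(estimate, 3))
```

Taking more units from the diseased stratum reflects the expectation that its variability is much greater.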
STA6166-1-17
Clusters
Estimate the average sponge size on natural reefs.
[Figure: map of reefs, each labeled with the number of sponges on the reef: 9, 25, 12, 21, 5, 14, 7.]
Selecting sponges at random would be very resource inefficient.
Cheaper to select reefs (sponge clusters) at random with probability
proportional to size. All sponges on selected reefs are measured (a
cheap thing to do that increases the sample size easily).
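Selection with probability proportional to size can be sketched as a weighted random draw. This is an illustration under assumptions: the seven counts are read as per-reef sponge counts, the reef labels A to G are hypothetical, and reefs are drawn with replacement, which keeps the selection probabilities exactly proportional to size.

```python
import random

# Sponge counts per reef; the labels A..G are hypothetical.
reef_sizes = {"A": 9, "B": 25, "C": 12, "D": 21, "E": 5, "F": 14, "G": 7}

def select_reefs_pps(sizes, n_draws, seed=None):
    """Draw reefs with probability proportional to size, with replacement."""
    rng = random.Random(seed)
    names = list(sizes)
    weights = [sizes[name] for name in names]
    return rng.choices(names, weights=weights, k=n_draws)

total_sponges = sum(reef_sizes.values())
selection_probability = {name: size / total_sponges
                         for name, size in reef_sizes.items()}

chosen = select_reefs_pps(reef_sizes, n_draws=3, seed=7)
print(selection_probability)
print(chosen)  # every sponge on each chosen reef would then be measured
```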
STA6166-1-18
MultiStage Sampling
Typically large areas or large complex populations can be more
effectively sampled in stages. At the first stage, natural or synthetic
clusters are selected. At subsequent stages the selected clusters
are subdivided into units and samples of these are selected.
[Figure, three panels: (a) Random Selection; (b) Systematic Selection, with a random starting point and a randomly located grid; (c) Multi-Stage Selection, showing a first-stage unit, a second-stage unit, and measurement units.]
Example: National crop yield survey.
STA6166-1-19
Greenhouse Example
Stratification: Maybe we have observed that plants near the door seem
less healthy than those farther into the greenhouse. Divide the room into
plants near the door and plants “inside”. Take random samples from each
stratum.
Cluster: Suppose plants are arranged on tables. We could select tables
at random, then examine all plants on each table selected. Note that if
one plant on a table is diseased, all plants on that table have an increased
probability of also being diseased.
Multi-Stage: Again suppose plants are on tables. Select some tables at
random. Next select a few plants from each selected table for testing.
The first-stage unit is the table. The second-stage unit is the plant. A third-stage
unit could be the leaf on the plant, etc.
Systematic: Imagine plants arranged on a large table. Randomly pick a
row and column to start. Then, following a systematic route, pick, say,
every 10th plant.
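The systematic and multi-stage schemes can be sketched in a few lines. This is a simplified illustration: the greenhouse of 12 tables with 40 plants each is hypothetical, and the random start is drawn from the first ten positions along the route rather than from a (row, column) pair.

```python
import random

def systematic_sample(route, step, seed=None):
    """Pick a random start among the first `step` units, then every step-th unit."""
    rng = random.Random(seed)
    start = rng.randrange(step)
    return route[start::step]

def two_stage_sample(tables, n_tables, n_plants, seed=None):
    """Stage 1: select tables at random. Stage 2: select plants on each table."""
    rng = random.Random(seed)
    chosen_tables = rng.sample(sorted(tables), n_tables)
    return {t: rng.sample(tables[t], n_plants) for t in chosen_tables}

# Hypothetical greenhouse: plants are labeled (table, position).
tables = {t: [(t, p) for p in range(1, 41)] for t in range(1, 13)}
# The systematic route walks the tables in order, plant by plant.
route = [plant for t in sorted(tables) for plant in tables[t]]

every_10th = systematic_sample(route, step=10, seed=3)
two_stage = two_stage_sample(tables, n_tables=3, n_plants=5, seed=3)
print(len(every_10th), every_10th[:3])
print(two_stage)
```

A cluster sample is the special case of the second function in which every plant on a selected table is examined.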
STA6166-1-20
What is measured?
Variable: Apt or liable to vary or change from individual
to individual, capable of being varied or
changed (factor), alterable, inconsistent,
having much variation or diversity, a quantity
that may assume any given value from a set of
values (the variable’s range).
Examples:
• Plant biomass – varies from plant to plant.
• Blood arsenic level – varies from person to person.
• Gender – we are not all male or all female.
Opposite of variable - Constant
STA6166-1-21
Types of Variables
• Categorical, classification, or qualitative variable:
Discrete; essentially describes some characteristic of a
sample unit. For example: color, gender, age class,
health status, treatment group. (Further subdivided into
nominal or ordinal.)
• Quantitative or amount variable: Either discrete or
continuous; measures the amount or level of a
characteristic of a sample unit. For example: age,
weight, height, temperature, biomass, volume. (Further
subdivided into interval or ratio.)
In STA 6166-7, we will deal primarily with
quantitative variables. STA 6126-7 deals
primarily with categorical variables.
STA6166-1-22
Sampling Study Design Questions
• How is the response (effect) to be measured?
• What characteristics of the response are to be analyzed?
• What factors influence the characteristics to be analyzed?
• Which of these factors will be studied in this investigation?
• How many times should the basic experiment be performed?
• What should be the form of the analysis?
• How large an effect (effect size) will be considered important?
• What resources are available for this study? Are they adequate?
It is important to be able to define the
underlined words.
STA6166-1-23
Terminology
• The response typically refers to the measured variable(s) of primary
interest (e.g. weight, health status, growth, etc).
• Characteristics – Is it change in the average response, the spread of
responses, the maximum response, etc, that will be examined? These
characteristics typically refer to some “statistical” aspect of effects
measured among individuals in the populations being studied.
• A factor refers to the characteristic(s) that primarily differ among the
populations being studied (compared). Some factors we cannot
manipulate (e.g., descriptors such as gender, geographic location, or
genetic makeup). Other factors identify characteristics we have caused
to be different between the two populations (as in an experiment where
we manipulate the populations by giving them different “treatments”).
• Basic Experiment – The selecting of an individual for measurement. In a
sampling study, the basic experiment is the selection and measurement
of an individual from the population. In an Experimental Study, the basic
experiment is the selection of an individual from the “pool”, the
application of a treatment, and the measurement of responses.
STA6166-1-24
Terminology (Cont)
• By the form of the analysis, we refer to the statistical procedure(s) that
match the characteristics of the study design, the characteristics of the
responses measured and the estimates and hypothesis tests needed to
answer the questions of interest. So, when someone asks “What form
will your analysis take?” you might answer with something like “I will be
using regression analysis (the statistical method) to explore
associations between fat intake and cholesterol level (the hypotheses of
interest) in two populations identified geographically and by
gender (study design factors).”
• The size of the effect of interest refers to how big a difference there
must be before I (or others) would conclude that there is a “real”
difference. Typically we are interested in specifying this at the design
phase of a study, since the size of the effect of interest drives the sample
size question. Thus if you say that a difference of less than 2 points in
cholesterol level between gender groups would not be important but
anything greater than 2 would be, you could use this to set the study
sample size. If the difference of interest were raised to 10 points, a much smaller
sample size would be needed.
• Resources – Money, personnel, time, access, material.
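The link between effect size and sample size can be made concrete with the usual normal-approximation formula for comparing two group means, n per group = 2 [sigma (z_alpha/2 + z_beta) / delta]^2. The sketch below is an illustration under assumptions not stated above: a standard deviation of 10 cholesterol points, a two-sided 5% significance level, and 80% power.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate sample size per group to detect a mean difference delta."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = z.inv_cdf(power)
    return ceil(2 * (sigma * (z_alpha + z_beta) / delta) ** 2)

# sigma = 10 is a hypothetical value chosen only for illustration.
n_small_effect = n_per_group(delta=2, sigma=10)
n_large_effect = n_per_group(delta=10, sigma=10)
print(n_small_effect, n_large_effect)
```

Because the required n scales with 1/delta^2, raising the difference of interest from 2 to 10 points cuts the sample size by a factor of about 25.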
STA6166-1-25
Manipulation Experiment
• Manipulation Experiment: A research design
in which the researcher deliberately introduces
certain changes in the levels of factors that are
hypothesized as affecting the process of
interest, and then makes observations to
determine the effect of these changes.
• Experimental Design: A study plan which
assures that measurements will be relevant to
the problem under study.
• Treatments: Changes to those factors which
are suspected of affecting the process under
study.
STA6166-1-26
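Factorial Experiment

FACTORS: Nitrogen Level and Phosphorus Level
LEVELS: nitrogen 0, 10, 20 kg/ha; phosphorus 0, 10 kg/ha
TREATMENTS (nitrogen / phosphorus, in kg/ha):

                     Nitrogen Level
Phosphorus Level     0 kg/ha    10 kg/ha   20 kg/ha
0 kg/ha              0 / 0      10 / 0     20 / 0
10 kg/ha             0 / 10     10 / 10    20 / 10

EXPERIMENTAL UNIT (PLOT): each entry below is one plot and its treatment.

SITE 1 (block 1):  0 / 10, 10 / 0, 20 / 10, 10 / 10, 20 / 0, 0 / 0
SITE 2 (block 2):  10 / 10, 20 / 10, 10 / 0, 0 / 0, 0 / 10, 20 / 0

BLOCKED LAYOUT
(complete block - all treatments in each block)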
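The treatment list and a blocked layout of this kind can be generated programmatically. This is a sketch, assuming that the order of treatments within each site is assigned at random (the slide shows a layout but does not say how it was produced); the seed is arbitrary.

```python
import random
from itertools import product

nitrogen_levels = [0, 10, 20]   # kg/ha
phosphorus_levels = [0, 10]     # kg/ha

# Every nitrogen/phosphorus combination is one treatment (3 x 2 = 6).
treatments = [f"{n} / {p}"
              for p, n in product(phosphorus_levels, nitrogen_levels)]

def randomize_blocks(treatments, blocks, seed=None):
    """Complete block design: each block gets every treatment once, in random order."""
    rng = random.Random(seed)
    layout = {}
    for block in blocks:
        order = list(treatments)
        rng.shuffle(order)
        layout[block] = order
    return layout

layout = randomize_blocks(treatments, ["SITE 1", "SITE 2"], seed=11)
for block, plots in layout.items():
    print(block, plots)
```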
STA6166-1-27
Standard Form for a Data Set
Observation  CATEGORIES                                  AMOUNTS
Number       strata  gender  color   other categorical   weight  other quantitative
                                     variables                   variables
1            1       F       RED     x  x ...            10.2    x  x ...
2            1       F       WHITE   x  x ...            12.9    x  x ...
3            1       M       BLUE    x  x ...            20.1    x  x ...
.            .
.            .
.            .
n            1       F       BLUE    x  x ...            16.0    x  x ...
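In software, the standard form is one record per observation and one field per variable. A minimal sketch using the first three rows of the layout above; the lowercase column names are chosen for illustration, and the unspecified "x" columns are omitted.

```python
import csv
import io

columns = ["obs", "strata", "gender", "color", "weight"]
rows = [
    (1, 1, "F", "RED", 10.2),
    (2, 1, "F", "WHITE", 12.9),
    (3, 1, "M", "BLUE", 20.1),
]

# One record per observation, one field per variable.
records = [dict(zip(columns, row)) for row in rows]

# The same matrix as CSV text, ready for a spreadsheet or statistics package.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
print(csv_text)
```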
STA6166-1-28
Example Data Set in Spreadsheet Format
OBS  ITEMP  IRH    IWB    FWB  REP  BIRD  BN   IBT  ATBT  WEIGHT     SATBT    SITEMP      SIWB
  1  24.47   64   20.2  20.25    1     1   1  40.6  39.7    2.21  -1.24351  -1.28723  -1.27434
  2  24.47   64   20.2  20.25    1     2   2  40.6  40.2   2.265  -0.69343  -1.28723  -1.27434
  3  24.47   64   20.2  20.25    1     3   3  40.9  39.4   2.185  -1.57355  -1.28723  -1.27434
  4  24.45   50  18.55   18.6    2     1   4  40.3  40.1   2.275  -0.80345  -1.29196  -1.67386
  5  24.45   50  18.55   18.6    2     2   5  40.4  39.4   2.264  -1.57355  -1.29196  -1.67386
  6  24.45   50  18.55   18.6    2     3   6  40.1  39.2   2.205  -1.79358  -1.29196  -1.67386
  7  24.68   50  18.45  19.52    3     1   7  41.1  40.5   2.343  -0.36338  -1.23754  -1.69807
  8  24.68   50  18.45  19.52    3     2   8  41.2  40.8   2.193  -0.03334  -1.23754  -1.69807
  9  24.68   50  18.45  19.52    3     3   9  40.9  40.9   2.238   0.07668  -1.23754  -1.69807
 10  24.79   51  18.57   18.2    4     1  10  39.8  39.4    2.32  -1.57355  -1.21151  -1.66902
 11  24.79   51  18.57   18.2    4     2  11  39.6  39.4   2.298  -1.57355  -1.21151  -1.66902
 12  24.79   51  18.57   18.2    4     3  12  39.8  39.8    2.31  -1.13349  -1.21151  -1.66902
 13  25.03   74   21.6   21.8    1     1  13  39.8  38.9   2.212  -2.12363  -1.15472  -0.93536
 14  25.03   74   21.6   21.8    1     2  14  39.8  38.7    2.21  -2.34366  -1.15472  -0.93536
 15  25.03   74   21.6   21.8    1     3  15  39.4  39.4   2.198  -1.57355  -1.15472  -0.93536
 16  24.44   74  21.22   21.5    2     1  16  40.1  39.6   2.235  -1.35352  -1.29433  -1.02737
 17  24.44   74  21.22   21.5    2     2  17  40.1  39.8   2.257  -1.13349  -1.29433  -1.02737
 18  24.44   74  21.22   21.5    2     3  18    40  39.6   2.284  -1.35352  -1.29433  -1.02737
 19  24.43   73   21.2  21.76    3     1  19  39.4  39.9    2.33  -1.02348  -1.29669  -1.03221
 20  24.43   73   21.2  21.76    3     2  20  39.8  40.2   2.314  -0.69343  -1.29669  -1.03221
 21  24.43   73   21.2  21.76    3     3  21  39.5  39.2   2.295  -1.79358  -1.29669  -1.03221
 22  25.24   78  21.91  22.06    4     1  22     .     .   2.149         .  -1.10503   -0.8603
 23  25.24   78  21.91  22.06    4     2  23     .     .    2.12         .  -1.10503   -0.8603
 24  25.24   78  21.91  22.06    4     3  24     .     .   2.127         .  -1.10503   -0.8603
 25  25.35   89  23.78  24.01    1     1  25     .     .   2.213         .    -1.079  -0.40751
 26  25.35   89  23.78  24.01    1     2  26     .     .   2.216         .    -1.079  -0.40751
 27  25.35   89  23.78  24.01    1     3  27     .     .    2.36         .    -1.079  -0.40751

"." = Indicator of missing data
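The columns SATBT, SITEMP and SIWB look like standardized versions of other columns, that is, each value minus a mean, divided by a standard deviation. That reading is an assumption, and the centering and scaling constants used for the spreadsheet are not given. The sketch below shows the general technique on a few values from the data set, with None standing in for the "." missing-data indicator; its output will not reproduce the spreadsheet's standardized values.

```python
from statistics import mean, stdev

def standardize(values):
    """Return z-scores, leaving missing entries (None) as None."""
    observed = [v for v in values if v is not None]
    center, spread = mean(observed), stdev(observed)
    return [None if v is None else (v - center) / spread for v in values]

# A few values from the data set plus one missing entry.
temperatures = [39.7, 40.2, 39.4, 40.1, 39.4, 39.2, None]
z_scores = standardize(temperatures)
print(z_scores)
```

Missing values are excluded when computing the mean and standard deviation, and they stay missing in the result.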
STA6166-1-29
Inventor's Paradox
The more ambitious the plan, the more chances of
success, and the more opportunity for failure.
How does one decide on what to do?
Are there open questions?
Are there available resources?
Does someone really want the answer?
Can a study be done?
Will the study be able to answer the question?
Statistics may help answer the last question!
STA6166-1-30