Transcript Lecture12

Various topics
Petter Mostad
2005.11.14
Overview
• Epidemiology
• Study types / data types
• Econometrics
• Time series data
• More about sampling
– Estimation of required sample size
Epidemiology
• Epidemiology is the study of diseases in a
population
– prevalence
– incidence, mortality
– survival
• Goals
– describe occurrence and distribution
– search for causes
– determine effects in experiments
Some study types
• Observational studies
– Cross-sectional studies
– Cohort studies
– Longitudinal studies
– Case / control studies
• Experimental studies
– Randomized, controlled experiments
– Interventions
Cross-sectional studies
• Examines a sample of persons, at a single
timepoint
• Time effects rely on memory of respondents
• Good for estimating prevalence
• Difficult for rare diseases
• Response rate bias
Cohort studies and longitudinal
studies
• A sample (cohort) is followed over some time
period.
• If queried at specific timepoints: Longitudinal
study
• Gives better information about causal effects, as
report of events is not based on memory
• Requires that a substantial group develops
disease, and that substantial groups differ with
respect to risk factors
• Problem: Long time perspective
Case – control studies
• Starts with a set of sick individuals (cases),
and adds a set of controls, for comparison.
• Cases and controls should be from same
populations
• Matching controls
• Good method for rare diseases
• Problem: Bias from selection
Measures of risk
• Relative risk
• Odds ratio
• Incidence rate ratio
• Attributable risk
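As a rough illustration of three of these measures, the sketch below computes them from a 2x2 table of exposure against disease. The function name and the counts are hypothetical, invented for illustration. Attributable risk is taken here as the risk difference between exposed and unexposed, which is one common definition and may differ from the one used in the lecture. The incidence rate ratio is left out because it needs person-time data.

```python
def risk_measures(a, b, c, d):
    """Risk measures from a 2x2 table (hypothetical helper).

    a = exposed and diseased,   b = exposed and healthy,
    c = unexposed and diseased, d = unexposed and healthy.
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return {
        "relative_risk": risk_exposed / risk_unexposed,
        "odds_ratio": (a * d) / (b * c),
        # Assumption: attributable risk defined as the risk difference.
        "attributable_risk": risk_exposed - risk_unexposed,
    }

# Invented counts, for illustration only.
measures = risk_measures(a=30, b=70, c=10, d=90)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With a rare disease the odds ratio and the relative risk come out close to each other, which is part of why case-control studies, where only the odds ratio can be estimated directly, work well for rare diseases.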
Econometrics
• ”Econometrics is the field of economics that
concerns itself with the application of
mathematical statistics and the tools of statistical
inference to the empirical measurement of
relationships postulated by economic theory”
• Is the unification of
– economic statistics
– quantitative economic theory
– mathematical economics
About econometrics
• Variations and extensions of the regression model
– heteroscedasticity
– autocorrelation models
– panel data
– logistic regression
– non-linear regression models
– multivariate regression
• Matrix computations (linear algebra) are an almost
indispensable tool
• Time series data
• Simultaneous equations models
Heteroscedasticity
• Recall: When the variances of independent errors in the
model vary, the model is heteroscedastic.
• Example: In a regression model of house size against
income, the variance of house sizes might increase with
income
• In case of heteroscedasticity, ordinary regression models
are not optimal.
• Previously, we mentioned variable transformation as a
possible solution
• Much more advanced solutions exist, when the
heteroscedasticity is known or can be estimated:
Generalized least squares,…
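A minimal sketch of the simplest case of generalized least squares: weighted least squares for one predictor, assuming the errors are independent and their variances are known up to a constant. The function name, the data and the choice of weights 1/x^2 (error standard deviation proportional to x) are assumptions made for illustration.

```python
def weighted_least_squares(x, y, w):
    """Fit y = a + b*x by minimising sum of w_i * (y_i - a - b*x_i)^2.

    The weights w_i should be proportional to 1 / Var(error_i).
    """
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b

# Invented data; the spread of y is assumed to grow with x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [5.1, 7.8, 11.4, 13.2, 18.1]
weights = [1.0 / xi ** 2 for xi in x]  # assumption: sd(error) proportional to x
intercept, slope = weighted_least_squares(x, y, weights)
print(intercept, slope)
```

Observations with large error variance get small weights, so they influence the fitted line less than in ordinary least squares.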
Autocorrelations
• Recall: When for example the data is from a time
series, the random errors for adjacent time steps
might be correlated!
• Improvements in the model might reduce the problem
• Standard regression methods are not optimal
• Modelling and estimating the autoregression gives
improved results
Panel data
• Data collected for the same sample, at
repeated time points
• Corresponds to longitudinal
epidemiological studies
• A combination of cross-sectional data and
time series data
• Increasingly popular study type
Analyzing panel data
• Fixed effects: Standard regression, but using a
constant term differing for each individual
– We get a parameter for each person!
• Random effects: A stochastic variable models
variation connected to individual
– The individual variation is assumed drawn from a
distribution with fixed variance
– A generalization of least squares is needed for
computations
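The fixed effects idea can be sketched for a single predictor as follows. Subtracting each individual's own means from x and y removes the individual constant terms, so the common slope can be estimated by ordinary least squares on the demeaned data (often called the within estimator). The function name and the data layout are hypothetical.

```python
from collections import defaultdict

def fixed_effects(records):
    """records: iterable of (individual, x, y).

    Returns (slope, {individual: intercept}) for the model
    y = intercept_individual + slope * x + error.
    """
    groups = defaultdict(list)
    for individual, x, y in records:
        groups[individual].append((x, y))

    # Per-individual means of x and y.
    means = {
        ind: (sum(x for x, _ in obs) / len(obs), sum(y for _, y in obs) / len(obs))
        for ind, obs in groups.items()
    }

    # Least squares on data demeaned within each individual.
    sxy = sxx = 0.0
    for ind, obs in groups.items():
        xbar, ybar = means[ind]
        for x, y in obs:
            sxy += (x - xbar) * (y - ybar)
            sxx += (x - xbar) ** 2
    slope = sxy / sxx

    # One constant term per individual.
    intercepts = {ind: ybar - slope * xbar for ind, (xbar, ybar) in means.items()}
    return slope, intercepts

# Invented panel: two persons observed at three time points each.
panel = [("A", 1, 3.1), ("A", 2, 4.9), ("A", 3, 7.2),
         ("B", 1, 7.0), ("B", 2, 9.1), ("B", 3, 10.8)]
print(fixed_effects(panel))
```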
Analyzing panel data
• Heteroscedasticity might also be a problem
here
• Autocorrelations
• Dynamic models: Lagged variables
Logistic regression
• What if the dependent variable is an
indicator variable?
• The model then has two stages: First, we
predict a value $z_i$ from predictors as before,
then the probability of indicator value 1 is
given by $e^{z_i} / (1 + e^{z_i})$
• Given data, we can estimate coefficients in
a similar way as before
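A small sketch of the two stages, with one predictor: z = b0 + b1*x, followed by the logistic transformation. The coefficients are estimated by maximising the likelihood with plain gradient ascent, which is chosen here only because it is short; statistical software typically uses faster iterative methods. The function names, step size and data are assumptions for illustration.

```python
import math

def logistic(z):
    # Same as e^z / (1 + e^z).
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, steps=5000, rate=0.1):
    """Estimate (b0, b1) in P(y = 1) = logistic(b0 + b1 * x) by gradient ascent."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(steps):
        # Gradient of the average log-likelihood with respect to b0 and b1.
        g0 = sum(yi - logistic(b0 + b1 * xi) for xi, yi in zip(x, y)) / n
        g1 = sum((yi - logistic(b0 + b1 * xi)) * xi for xi, yi in zip(x, y)) / n
        b0 += rate * g0
        b1 += rate * g1
    return b0, b1

# Invented data: the indicator tends to be 1 for larger x.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic(x, y)
print(b0, b1, logistic(b0 + b1 * 7))
```

The example data are deliberately not perfectly separated by x; with perfect separation the maximum likelihood estimates do not exist and the iteration would keep growing.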
Non-linear regression models
• Ordinary regression is very useful, but it is limited
by the linear form of the equations
• Sometimes, variable transformations can bring the
connection between variables to a linear form
• Other times, this is not possible: The relationship
describes the dependent variable as some function
of independent variables and some random error.
• The model may still be estimated by minimizing
the errors. This is non-linear regression.
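As one possible illustration of "minimizing the errors", the sketch below fits the one-parameter model y = exp(b*x) by least squares using Gauss-Newton iteration. The model, the method, the starting value and the data are all assumptions chosen for brevity; the slides do not name a particular algorithm. The data are generated without noise so that the true value b = 0.5 can be recovered.

```python
import math

def fit_exponential_rate(x, y, b=0.3, iterations=20):
    """Least-squares estimate of b in y = exp(b * x) via Gauss-Newton."""
    for _ in range(iterations):
        fitted = [math.exp(b * xi) for xi in x]
        residuals = [yi - fi for yi, fi in zip(y, fitted)]
        # Derivative of exp(b * x) with respect to b is x * exp(b * x).
        jac = [xi * fi for xi, fi in zip(x, fitted)]
        b += sum(j * r for j, r in zip(jac, residuals)) / sum(j * j for j in jac)
    return b

x = [0.0, 1.0, 2.0, 3.0]
y = [math.exp(0.5 * xi) for xi in x]  # noise-free, hypothetical data
b_hat = fit_exponential_rate(x, y)
print(b_hat)
```

In this particular model a log transformation would also give a linear form; the iterative approach is shown because it carries over to models where no such transformation exists.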
Multivariate regression
• Instead of one dependent variable, one can
have a vector of dependent variables
• A theory of multivariate multiple regression
can be developed (with the help of matrix
algebra): Many similar results to ordinary
multiple regressions
• Captures the dependencies between
dependent variables
Simultaneous equations models
• Often, you want to describe interdependencies between
variables, rather than explaining one variable in terms of
others
• Example:
– Demand is a function of various variables, including price
– The same is the case with supply
– Setting demand = supply creates simultaneous equations
• Identifiability?
• Estimation: Least squares is not optimal; other methods
exist
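A toy version of the demand and supply example, assuming both are linear in price with no error terms: demand q = a - b*p and supply q = c + d*p. The function name and all coefficients are hypothetical.

```python
def equilibrium(a, b, c, d):
    """Solve demand q = a - b*p and supply q = c + d*p for (p, q)."""
    price = (a - c) / (b + d)
    quantity = a - b * price
    return price, quantity

price, quantity = equilibrium(a=100.0, b=2.0, c=10.0, d=1.0)
print(price, quantity)  # 30.0 40.0
```

This also hints at the identifiability question: observed (price, quantity) pairs are equilibrium points, so without further information they cannot tell us which curve is which.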
Time series models
• Time series issues:
– Identifying trends, cycles, etc.
– Predicting future values
• Autoregressive models:
– Explicit models for time dependencies:
AR(1): $X_t = \gamma + \phi_1 X_{t-1} + \epsilon_t$
(with $\mathrm{Corr}(X_t, X_{t-j}) = \phi_1^j$)
AR(2): $X_t = \gamma + \phi_1 X_{t-1} + \phi_2 X_{t-2} + \epsilon_t$
• (Box-Jenkins, ARIMA models)
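A short simulation can make the AR(1) correlation structure concrete. The sketch assumes the model X_t = gamma + phi*X_{t-1} + eps_t with independent normal errors; the parameter names gamma, phi, sigma, the seed and the parameter values are choices made here for illustration. For such a series the lag-j autocorrelation should be close to phi^j.

```python
import random

def simulate_ar1(n, gamma, phi, sigma=1.0, seed=0):
    """Simulate n steps of X_t = gamma + phi * X_{t-1} + eps_t."""
    rng = random.Random(seed)
    x = gamma / (1 - phi)  # start at the stationary mean (requires |phi| < 1)
    series = []
    for _ in range(n):
        x = gamma + phi * x + rng.gauss(0.0, sigma)
        series.append(x)
    return series

def autocorrelation(series, lag):
    """Sample autocorrelation at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    cov = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return cov / var

series = simulate_ar1(n=5000, gamma=1.0, phi=0.6)
print(autocorrelation(series, 1), autocorrelation(series, 2))
```

With phi = 0.6 the two printed values should land near 0.6 and 0.36, up to sampling variation.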
The runs test (for random samples)
• In a random sample, the probability that an observation is
above or below the median is independent of whether the
previous observation is.
• A run is a (maximal) sequence of observations such that all
are above the median, or all are below.
• For n observations, the number of runs has a null
distribution under the assumption of no autocorrelation.
With too few runs, the null hypothesis of no
autocorrelation can be rejected. (Table in Newbold).
• For large samples, a formula based on a normal
approximation can be used.
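The runs count and the large-sample normal approximation can be sketched as below. The mean and variance used are the standard ones for the number of runs given n1 observations above and n2 below the median; when n1 = n2 = n/2 the mean reduces to n/2 + 1. Dropping observations equal to the median is a convention assumed here, and the function name is hypothetical.

```python
import math
from statistics import NormalDist, median

def runs_test(data):
    """Return (number of runs, z statistic, lower-tail p-value)."""
    m = median(data)
    # True = above the median, False = below; ties with the median are dropped.
    signs = [v > m for v in data if v != m]
    n1 = sum(signs)
    n2 = len(signs) - n1
    n = n1 + n2
    runs = 1 + sum(1 for prev, cur in zip(signs, signs[1:]) if prev != cur)
    mean = 2 * n1 * n2 / n + 1
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    z = (runs - mean) / math.sqrt(var)
    # Too few runs (large negative z) suggests positive autocorrelation.
    return runs, z, NormalDist().cdf(z)

# A steadily increasing series has only two runs.
print(runs_test(list(range(1, 21))))
```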
Sampling in practice
• Newbold mentions:
1. Information required?
2. Relevant population?
3. Sample selection?
4. Obtaining information?
5. Inferences from sample?
6. Conclusions?
• Sampling / nonsampling errors
Types of sampling
• Simple random sampling
• Stratified sampling
• Cluster sampling
• Two-phase sampling (using pilot studies)
• Each requires somewhat adjusted formulas
for estimation
Correcting for finite population in
estimations
• Our estimates of for example population
variances, population proportions, etc. assumed an
”infinite” population
• When the population size N is comparable to the
sample size n, a correction factor is necessary.
(Why?)
• Examples:
– Variance of population mean estimate:
$\hat{\sigma}_{\bar{X}}^2 = \frac{s^2}{n} \cdot \frac{N - n}{N}$
– Variance of population proportion estimate:
$\hat{\sigma}_p^2 = \frac{p(1 - p)}{n - 1} \cdot \frac{N - n}{N}$
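A minimal sketch of the two corrected estimates, assuming simple random sampling without replacement from a population of size N. Here s2 is the sample variance and p the sample proportion; the function names and numbers are hypothetical.

```python
def mean_variance_estimate(s2, n, N):
    """Estimated variance of the sample mean: (s2 / n) * (N - n) / N."""
    return (s2 / n) * (N - n) / N

def proportion_variance_estimate(p, n, N):
    """Estimated variance of the sample proportion: p(1 - p)/(n - 1) * (N - n)/N."""
    return (p * (1 - p) / (n - 1)) * (N - n) / N

# Invented numbers: sampling 100 out of 1000 shrinks the variance by 10%.
print(mean_variance_estimate(s2=4.0, n=100, N=1000))
print(proportion_variance_estimate(p=0.5, n=101, N=1010))
# Sampling the whole population leaves no sampling uncertainty.
print(mean_variance_estimate(s2=4.0, n=1000, N=1000))
```

The last line shows the intuition behind the correction: when n = N the estimate is based on the entire population, so its sampling variance is zero.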
Estimation of required sample size
• An important part of experimental planning
• The answer will generally depend on the
parameters you want to estimate in the first
place, so only a rough estimate is possible
• However, a rough estimate may sometimes
be very important to do
• A pilot study may be very helpful
Example: Estimating the mean of a
normally distributed population
• We want to estimate mean $\mu$
• We want a confidence interval to extend a distance
a from the estimate
• We guess at the population variance $\sigma^2$
• A sample size estimate:
$n = \frac{Z_{\alpha/2}^2 \sigma^2}{a^2} \approx \frac{4 \sigma^2}{a^2}$ at 95% confidence
• If we have a population of size N, and want a
specified $\sigma_{\bar{X}}^2$, we get
$n = \frac{N \sigma^2}{(N - 1) \sigma_{\bar{X}}^2 + \sigma^2}$
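Both sample size formulas can be sketched in a few lines, rounding up to the next whole observation. The first uses n = z^2 * sigma2 / a^2 with z the normal quantile for the chosen confidence level; the second uses n = N * sigma2 / ((N - 1) * target_var + sigma2) for a finite population. Function names, parameter names and the guessed variance are assumptions for illustration.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma2, a, confidence=0.95):
    """Sample size so the confidence interval extends a distance a from the estimate."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * sigma2 / a ** 2)

def sample_size_finite(sigma2, target_var, N):
    """Sample size giving variance target_var for the mean, population size N."""
    return math.ceil(N * sigma2 / ((N - 1) * target_var + sigma2))

# Guessed population variance 25, desired half-width 1.
print(sample_size_for_mean(sigma2=25.0, a=1.0))                  # 97
# Same variance, target variance 0.25 for the mean, population of 1000.
print(sample_size_finite(sigma2=25.0, target_var=0.25, N=1000))  # 91
```

The rule of thumb 4 * sigma2 / a^2 would give 100 in the first case, slightly more than the 97 obtained with z = 1.96.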