Biostatistics 760 - University of North Carolina at Chapel Hill

Biostatistics 760
Random Thoughts
Upcoming Classes
• Bios 761: Advanced Probability and
Statistical Inference
• Bios 763: Generalized Linear Model
Theory and Applications
• Bios 767: Longitudinal Data Analysis
• Bios 780: Theory and Methods for Survival
Analysis
• Bios 841: Statistical Consulting
Bios 761
• Frequentist and Bayesian decision theory
• Hypothesis testing: UMP tests, etc.
• Bootstrap and other methods of inference (see the sketch after this list)
• Stochastic processes:
– Poisson processes
– Markov chains
– Martingales
– Brownian motion
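A minimal sketch of the nonparametric bootstrap named above, estimating the standard error of a sample median from made-up data (a hypothetical illustration, not course material):

```python
# Nonparametric bootstrap sketch (hypothetical example, not course material):
# estimate the standard error of the sample median by resampling with replacement.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100)   # made-up, skewed data

B = 2000
boot_medians = np.empty(B)
for b in range(B):
    resample = rng.choice(x, size=x.size, replace=True)   # resample the data with replacement
    boot_medians[b] = np.median(resample)                 # recompute the statistic

print("sample median:", np.median(x))
print("bootstrap SE :", boot_medians.std(ddof=1))
```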
Bios 780
• Time-to-event data
• Right censoring
• Counting processes; martingales
• Semiparametric approaches:
– Kaplan-Meier estimator (see the sketch after this list)
– Log-rank statistic
– Cox model
• Data analysis
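A minimal sketch of the Kaplan-Meier product-limit estimator under right censoring, using only numpy and made-up data (a hypothetical illustration, not course material):

```python
# Kaplan-Meier estimator sketch for right-censored data (hypothetical example).
import numpy as np

# times: observed follow-up times; events: 1 = event observed, 0 = right-censored
times  = np.array([3., 5., 5., 8., 12., 15., 15., 20.])
events = np.array([1, 1, 0, 1, 1, 0, 1, 0])

surv = 1.0
print("time   S(t)")
for t in np.unique(times[events == 1]):          # distinct observed event times
    at_risk = np.sum(times >= t)                 # subjects still under observation just before t
    d = np.sum((times == t) & (events == 1))     # events occurring at t
    surv *= 1.0 - d / at_risk                    # product-limit update
    print(f"{t:5.1f}  {surv:.3f}")
```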
Bios 841
• Consulting versus collaboration
• Bringing it all together to solve problems
• Communicating about statistics
– Three real problems
– Three journal style reports
– One final oral presentation
• Real time problem solving
• What is the role of statistical theory?
A Few War Stories
• As a student: thesis on surrogates
• As a postdoc: infectious diseases
• As a new professor: cystic fibrosis (CF)*
• Working on tenure: empirical processes
• Empirical processes and cancer*
• Chair of the DSMC for NICHD
• Artificial intelligence and NSCLC
CF Neonatal Screening
• 1992: Joined Phil Farrell’s CF study team
• 1997: Farrell, Kosorok, Laxova, et al,
published in NEJM
• 2004 (Oct. 15): CDC recommended CF
newborn screening: the 1997 article was
judged the only valid randomized trial
• States offering CF newborn screening: 3 in
1997, 12 in 2004, 45 today
What Role Did “Theory” Play?
• Used state-of-the-art statistical methods that were robust (GEE; see the sketch below)
• In other CF research we have used:
– Current status methods (parametric, robust)
– Constrained regression estimation
– Semiparametric bootstrap inference
– Martingale based survival analysis
– New work using artificial intelligence
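For readers who have not seen GEE, a minimal sketch using statsmodels and simulated clustered binary data (the variables and data here are hypothetical and are not the CF study data):

```python
# GEE sketch with simulated clustered binary outcomes (hypothetical; not the CF study data).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_clusters, per_cluster = 50, 4
cluster = np.repeat(np.arange(n_clusters), per_cluster)
x = rng.normal(size=n_clusters * per_cluster)
cluster_effect = rng.normal(scale=0.5, size=n_clusters)[cluster]   # induces within-cluster correlation
p = 1.0 / (1.0 + np.exp(-(-0.5 + 0.8 * x + cluster_effect)))
y = rng.binomial(1, p)
df = pd.DataFrame({"y": y, "x": x, "cluster": cluster})

# Exchangeable working correlation; GEE reports robust (sandwich) standard errors by default.
model = sm.GEE.from_formula("y ~ x", groups="cluster", data=df,
                            family=sm.families.Binomial(),
                            cov_struct=sm.cov_struct.Exchangeable())
print(model.fit().summary())
```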
Empirical Processes and Cancer
• Non-Hodgkin’s Lymphoma Prognostic
Factors Project (1993, NEJM)
• Cox proportional hazards model used to estimate the risks associated with 5 prognostic factors: age, performance status, serum lactate dehydrogenase level, number of extranodal disease sites, and tumor stage
• Diagnostics show the model fits poorly
What is the Problem?
• Poor survival function prediction
• Possibly incorrect interpretation of risk
factor effects
• A model that adds a single parameter to
the Cox model was developed and fit
• This new model fits well (Kosorok, Lee, and Fine, 2004)
• Inference for the new model is
complicated
What Does Theory Tell Us?
• We can derive valid inferential tools for the
new model: estimation and bootstrap
• Robustness was also studied: we learn
theoretically that the Cox model is robust
to this kind of model misspecification:
– The direction of the regression coefficients is
preserved
– Should use robust variance for Cox model
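A minimal sketch of that last recommendation, with made-up data and assuming the lifelines package (its robust option gives sandwich standard errors for the Cox model):

```python
# Cox model with robust (sandwich) variance, per the recommendation above
# (hypothetical data; assumes the lifelines package).
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(2)
n = 200
age = rng.normal(60, 10, n)
stage = rng.integers(1, 5, n)
rate = np.exp(-4 + 0.03 * age + 0.3 * stage)       # event rate depends on the covariates
event_time = rng.exponential(1.0 / rate)
censor_time = rng.exponential(20.0, n)              # independent right censoring
df = pd.DataFrame({
    "T": np.minimum(event_time, censor_time),
    "E": (event_time <= censor_time).astype(int),
    "age": age,
    "stage": stage,
})

cph = CoxPHFitter()
cph.fit(df, duration_col="T", event_col="E", robust=True)   # robust=True -> sandwich variance
cph.print_summary()
```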
Theory Versus Applications
• The title implies there is conflict between
theory and applications
• This isn’t true!
• Theory provides a basis for correct
thinking and problem solving for
applications
• Applications drive new theoretical
development
Theory Can Be Impractical
• Law of the iterated logarithm: needs a sample size of 10^8 (“asymptopia”); see the statement after this list.
• Sometimes higher order approximations
are needed before it becomes useful.
• Sometimes computational properties of
asymptotically optimal estimators are poor.
• Some hard problems take years to solve.
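For reference, the law of the iterated logarithm cited in the first bullet, for i.i.d. X_1, X_2, ... with mean mu and finite variance sigma^2 and centered partial sums S_n:

```latex
% Law of the iterated logarithm:
% X_1, X_2, \ldots \text{ i.i.d.}, \ E X_i = \mu, \ \operatorname{Var}(X_i) = \sigma^2 < \infty, \ S_n = \sum_{i=1}^{n}(X_i - \mu)
\limsup_{n \to \infty} \frac{S_n}{\sqrt{2\sigma^2\, n \log\log n}} = 1 \quad \text{almost surely.}
```

The log log n factor grows so slowly that this limiting behavior is essentially invisible at practical sample sizes.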
Why Theory is Needed
• Often it does work for practical sample
sizes.
• Can reveal properties that are universally
valid: simulation studies are limited to the
scenarios investigated.
• Theory can lead toward methodological
solutions (Cook and Kosorok, 2004 JASA).
• Theory can drive scientific discovery.
• Some results are beautiful.
Data Mining Versus Inference
• Data mining is summarizing and
representing data no matter how
complicated
• Inference is determining valid measures of
uncertainty
• Patterns obtained from data mining can be misleading (illustrated after this list)
• Inference without data mining may miss
important structure
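A tiny simulation (hypothetical, not from the slides) of how a data-mined pattern can mislead: scanning many pure-noise features for the strongest correlation with an outcome yields an impressive-looking association even though every true correlation is zero.

```python
# A misleading "data mining" pattern: the best of many noise correlations looks strong.
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 1000
y = rng.normal(size=n)
X = rng.normal(size=(n, p))              # 1000 candidate features, all independent of y

corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
print("largest |correlation| among 1000 noise features:", round(float(np.abs(corrs).max()), 2))
# Typically around 0.4-0.5 here, which would look convincing unless the
# inference accounts for having selected the best of 1000 candidates.
```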
The Core of Statistics
• Statistics is the science of science
• How do we learn from our world and draw
meaningful and valid conclusions from it?
• Need both data mining and valid inference
• Requires a unique kind of intuition
• Needs many different intellectual
perspectives
• One of the most challenging of all fields
Everyone Needs Core Literacy
• All statisticians need to know enough theory to
have core literacy about statistics and to be able
to problem solve
• All statisticians need to know enough about
applications to know what is important
• All biostatisticians need to know enough
statistical methods to be useful in practice
• The purpose of a Ph.D. in Biostatistics is to
enable the creation of new methodology
Semiparametric Inference
• The study of statistical models with
parametric and/or nonparametric parts
• Can achieve trade-off between scientific
meaning and model “robustness”
• Estimation and inference are often hard
• There exists an efficiency bound for
parametric and some nonparametric parts
• NPMLE, testing and estimating equations
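The Cox proportional hazards model mentioned earlier is the canonical example of such a model: the hazard at time t for a subject with covariates Z is

```latex
\lambda(t \mid Z) = \lambda_0(t)\, \exp\{\beta^\top Z\},
```

where the regression coefficient beta is the finite-dimensional (parametric) part and the baseline hazard lambda_0 is left unspecified (the nonparametric part).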
Empirical Processes
• Tools for complex model inference and
high dimensional data
• Can determine universal properties of
semiparametric methods:
– Consistency
– Rate of convergence
– Limiting distributions
– Valid inference (empirical process bootstrap)
• Empirical processes are everywhere
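A minimal sketch (hypothetical, not from the slides) of one such tool: an empirical-process bootstrap that resamples the empirical distribution function and uses the bootstrapped supremum distance to form a uniform confidence band.

```python
# Bootstrap confidence band for the empirical CDF (a simple empirical-process bootstrap).
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=200)                 # made-up sample
grid = np.sort(x)                        # evaluate the band at the observed points
n = x.size

def ecdf(sample, t):
    """Empirical CDF of `sample` evaluated at the points in `t`."""
    return np.searchsorted(np.sort(sample), t, side="right") / sample.size

F_hat = ecdf(x, grid)

# Bootstrap the supremum distance sup_t |F*_n(t) - F_hat(t)|.
B = 1000
sup_dist = np.empty(B)
for b in range(B):
    resample = rng.choice(x, size=n, replace=True)
    sup_dist[b] = np.max(np.abs(ecdf(resample, grid) - F_hat))

eps = np.quantile(sup_dist, 0.95)                     # half-width of a uniform 95% band
lower, upper = np.clip(F_hat - eps, 0, 1), np.clip(F_hat + eps, 0, 1)
print("uniform 95% band half-width:", round(float(eps), 3))
print("band at the sample median  : [%.3f, %.3f]" % (lower[n // 2], upper[n // 2]))
```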
The Road Ahead
• Whatever you choose to do, the core
statistical theory classes will help you.
• Be patient as you learn.
• Be willing to work hard (struggle is good).
• It takes many different kinds of thinkers
with different learning styles.
• There are important discoveries to be
made in both applications and theory.