Transcript Review1
Review
We’ve covered three main topics thus far
Data collection
Data summarization
Probability
Data Collection
We’ve talked about three ways of data collection
Survey
Sampling frame, questionnaire, probability sample, convenience sample,
non-response bias, other types of bias
Observational study
No assignment of treatments. No causal conclusions
Randomized experiment
Random assignment of units/subjects to treatments. If done properly, it
supports causal conclusions (though conclusions might not generalize).
Why randomize?
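A minimal sketch of random assignment, assuming a hypothetical list of ten subjects split into two equal groups. The function name, the subject labels, and the seed are illustrative choices, not something from the course materials.

```python
import random

def randomly_assign(subjects, seed=None):
    """Shuffle the subjects and split them into treatment and control groups."""
    rng = random.Random(seed)   # local generator so a seed gives a repeatable split
    shuffled = list(subjects)   # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

subjects = [f"subject_{i}" for i in range(1, 11)]   # hypothetical subjects
treatment, control = randomly_assign(subjects, seed=1)
print("treatment:", treatment)
print("control:  ", control)
```

Because chance alone decides who gets which treatment, lurking variables tend to balance out across the groups on average, which is what lets a well-run experiment support causal conclusions.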
Data summarization
We talked about graphical and numerical summaries for one
variable and for two variables. It is important to identify the type of
each variable.
One categorical/qualitative variable
graphical: pie chart, bar graph
numerical: counts/percents/frequencies
One quantitative variable
graphical: histogram/boxplot (shape, center, spread, outliers)
numerical: mean, median, standard deviation, interquartile range, range,
percentiles
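The numerical summaries for one quantitative variable can be sketched with Python's standard `statistics` module. The data values are made up for illustration, and the quartile rule used here (the module's default "exclusive" method) is an assumption; other software such as JMP may compute quartiles slightly differently.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 12, 30]   # hypothetical values; 30 is a possible outlier

mean = statistics.mean(data)
median = statistics.median(data)
sd = statistics.stdev(data)                    # sample SD (divides by n - 1)
q1, q2, q3 = statistics.quantiles(data, n=4)   # three quartile cut points
iqr = q3 - q1
data_range = max(data) - min(data)

print(f"mean={mean:.3f}  median={median}  sd={sd:.3f}")
print(f"Q1={q1}  Q3={q3}  IQR={iqr}  range={data_range}")
```

With a large value like 30 in the data, the mean is pulled above the median, which is one reason to look at shape and outliers before choosing which summaries to report.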
Two variables
One categorical/qualitative and one quantitative
graphical: side-by-side boxplots
numerical: means, medians, SDs, IQRs, etc. for each category
Two quantitative
graphical: scatterplot (form, direction, strength, outliers)
numerical: means, SDs, etc. for both; correlation coefficient
If the association is linear, model it with a straight line: slope and
intercept of the regression line (prediction, interpretation, extrapolation, etc.)
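A minimal sketch of the two-quantitative-variable summaries, assuming a small made-up data set. The helper names are illustrative; the formulas are the usual ones for the correlation coefficient and the least-squares line.

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient r."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def least_squares_line(x, y):
    """Return (slope, intercept) for the regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x   # the line passes through (mean_x, mean_y)
    return slope, intercept

x = [1, 2, 3, 4, 5]   # hypothetical explanatory values
y = [2, 4, 5, 4, 5]   # hypothetical response values

r = correlation(x, y)
slope, intercept = least_squares_line(x, y)
predicted_at_3 = intercept + slope * 3   # prediction inside the observed x range

print(f"r={r:.3f}  slope={slope:.3f}  intercept={intercept:.3f}")
print(f"predicted y at x=3: {predicted_at_3:.3f}")
```

Plugging in an x far outside 1 to 5 would be extrapolation: the arithmetic still works, but nothing in the data says the line holds there.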
Two categorical/qualitative
graphical: plots we didn’t talk about
numerical: contingency tables; marginal frequencies, conditional frequencies
Also relative risk and odds ratios
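A sketch of the two-categorical-variable summaries for a 2x2 contingency table. The counts and the "exposed"/"event" labels are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Hypothetical 2x2 table: rows are groups, columns are outcomes.
table = {
    "exposed":   {"event": 30, "no_event": 70},
    "unexposed": {"event": 10, "no_event": 90},
}

total = sum(sum(row.values()) for row in table.values())

# Marginal frequency: proportion of everyone who had the event.
marginal_event = sum(row["event"] for row in table.values()) / total

# Conditional frequencies: proportion with the event within each group (the risks).
risk_exposed = table["exposed"]["event"] / sum(table["exposed"].values())
risk_unexposed = table["unexposed"]["event"] / sum(table["unexposed"].values())

relative_risk = risk_exposed / risk_unexposed

# Odds = events / non-events within a group.
odds_exposed = table["exposed"]["event"] / table["exposed"]["no_event"]
odds_unexposed = table["unexposed"]["event"] / table["unexposed"]["no_event"]
odds_ratio = odds_exposed / odds_unexposed

print(f"marginal P(event) = {marginal_event:.2f}")
print(f"risk: exposed = {risk_exposed:.2f}, unexposed = {risk_unexposed:.2f}")
print(f"relative risk = {relative_risk:.2f}, odds ratio = {odds_ratio:.2f}")
```

The relative risk and the odds ratio answer slightly different questions, so they need not agree; they get close to each other only when the event is rare in both groups.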
Probability
To find probability of event A
Enumerate sample space. Count number of outcomes in event
A. Divide by the total number of outcomes
Easy to do if sample space is small
Use probability laws to push symbols around
Independence, mutually exclusive, joint = marginal × conditional
When the sample space is large, this is the only way to approach things
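Both approaches can be sketched on a small sample space. The example assumes two fair six-sided dice, which is an illustration rather than something from the slides: first enumerate and count, then check the multiplication rule and independence.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely (first die, second die) outcomes.
sample_space = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) = outcomes in the event / total outcomes."""
    favorable = [o for o in sample_space if event(o)]
    return Fraction(len(favorable), len(sample_space))

p_sum7 = prob(lambda o: o[0] + o[1] == 7)
p_first_even = prob(lambda o: o[0] % 2 == 0)
p_both = prob(lambda o: o[0] % 2 == 0 and o[0] + o[1] == 7)

# Conditional probability, then the rule joint = marginal x conditional.
p_sum7_given_first_even = p_both / p_first_even
joint_from_rule = p_first_even * p_sum7_given_first_even

# Independence check: does the joint equal the product of the marginals?
independent = p_both == p_sum7 * p_first_even

print("P(sum is 7) =", p_sum7)
print("P(first even and sum is 7) =", p_both)
print("independent:", independent)
```

Counting works here because there are only 36 outcomes; with a large sample space the same laws are applied directly to the symbols instead.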
Duke b-ball
What type of study is this?
Survey? Randomized experiment? observational study?
Might it be reasonable to assume that the opponents are a
random sample of all types of opponents Duke could potentially
face?
If not, then nothing we see can be generalized to teams
Duke might play in the future. (In other words, the population
is the teams that Duke has played so far, and we have
observations on all of them.)
Limitations
Since this is not a designed experiment, what are the limitations?
Can we make causal conclusions?
nope
Is there potential for lurking variables?
Yup. In fact, I’d bet there are some.
What type of information does looking at this type of data
provide?
JMP
Let’s look at a few variables and summarize them graphically
and numerically.
Regression vs correlation coefficient
Does a change of units change the value?
Correlation coefficient (no)
Regression slope (yes)
Does defining the response and explanatory variable matter?
Correlation coefficient (no)
Regression slope (yes)
Provides direction and strength of linear association
Correlation coefficient (yes, yes)
Regression slope (yes, no)
Quantifies how much the response changes per unit change in the explanatory variable
Correlation coefficient (no)
Regression slope (yes)
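The first two comparisons can be checked numerically. This is a sketch on made-up data: the explanatory variable is treated as if it were measured in inches and then converted to centimeters, and the helper name is illustrative.

```python
import math

def r_and_slope(x, y):
    """Return (correlation r, least-squares slope) for the regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy), sxy / sxx

x_in = [1, 2, 3, 4, 5]             # hypothetical explanatory values, "inches"
y = [2, 4, 5, 4, 5]                # hypothetical response values
x_cm = [2.54 * v for v in x_in]    # same measurements in centimeters

r_in, slope_in = r_and_slope(x_in, y)
r_cm, slope_cm = r_and_slope(x_cm, y)      # change of units
r_swap, slope_swap = r_and_slope(y, x_in)  # swap response and explanatory

print(f"inches:  r={r_in:.4f}  slope={slope_in:.4f}")
print(f"cm:      r={r_cm:.4f}  slope={slope_cm:.4f}")
print(f"swapped: r={r_swap:.4f}  slope={slope_swap:.4f}")
```

In all three runs r is the same, while the slope changes with the units and with which variable is treated as the response. The sign of the slope always matches the sign of r, which is the sense in which the slope gives direction but not strength.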
Correlation coefficient vs regression
Influenced by outliers
Correlation coefficient (yes)
Regression slope (yes); such outliers are sometimes called influential points
Can we conclude that the explanatory variable causes change in the response
variable?
Correlation coefficient (no)
Regression slope (no)
Although under a well-designed experiment it is possible
Must both variables be quantitative?
Correlation coefficient (yes)
Regression slope (not necessarily, but I don’t think we’ll be able to
cover the quantitative/qualitative regression often called ANOVA)
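Returning to the point about outliers, here is a small sketch of how a single influential point can change both the correlation coefficient and the regression slope. The data are made up: five points on a perfect line, plus one added point far from the rest.

```python
import math

def r_and_slope(x, y):
    """Return (correlation r, least-squares slope) for the regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy), sxy / sxx

x_clean = [1, 2, 3, 4, 5]    # hypothetical data lying exactly on y = x
y_clean = [1, 2, 3, 4, 5]

x_outlier = x_clean + [10]   # one extra point, far out in x and far off the line
y_outlier = y_clean + [0]

r_clean, slope_clean = r_and_slope(x_clean, y_clean)
r_out, slope_out = r_and_slope(x_outlier, y_outlier)

print(f"without outlier: r={r_clean:.3f}  slope={slope_clean:.3f}")
print(f"with outlier:    r={r_out:.3f}  slope={slope_out:.3f}")
```

In this example the one added point is enough to flip the sign of both r and the slope, which is why it is worth looking at the scatterplot before trusting either number.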