Data Workshop

H397
Data Cleaning
 Inputting data
 Missing Values
 Converting String Variables
 Creating Scales
 Creating Dummy Variables
Inputting and Merging Data
 Inputting
 STATA “insheet using /Users/daphnepenn/Dropbox/CleaningPractice.csv”
 SPSS (dropdown menu EASY)
 Merging
 STATA “merge m:1 sch_no using "C:\Users\dmp869\Desktop\bpsschools.dta"”
 SPSS (dropdown menu EASY)
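A minimal sketch of a many-to-one merge, meant to be run as a single do-file. The tiny school and student files and all variable names other than sch_no are made up for illustration; they are not the workshop data.

```stata
* Build a small school-level lookup file (hypothetical data) and save it
clear
input sch_no str20 sch_name
1 "North High"
2 "South High"
end
tempfile schools
save `schools'

* Build a student-level file: many students can share one school
clear
input student_id sch_no
101 1
102 1
103 2
104 3
end

* m:1 = many rows in the file in memory match one row in the using file
merge m:1 sch_no using `schools'

* _merge shows which rows matched; student 104 has no school record
tab _merge
list
```

Checking _merge after every merge is a cheap way to catch keys that failed to match before they silently turn into missing values.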
Strategies for Missing Data
 Figure out why!
 Analyze only the available data (i.e. ignoring the missing data)
 Imputing the missing data with replacement values, and treating these as if they were observed
 Imputing the missing data and accounting for the fact that these were imputed with uncertainty
 Using statistical models to allow for missing data, making assumptions about their relationships with the available data.
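The strategies above might look like this in Stata. This is a rough sketch on Stata's built-in auto dataset (where rep78 has some missing values), not on the workshop data; the choice of imputation model, the 20 imputations, and the seed are arbitrary assumptions for illustration.

```stata
sysuse auto, clear

* Step 1: figure out how much is missing, and where
misstable summarize

* Available-data analysis: regress silently drops rows with any missing value
regress price mpg rep78

* Multiple imputation: fill in rep78 several times and carry the
* imputation uncertainty into the final standard errors
mi set wide
mi register imputed rep78
mi impute regress rep78 mpg price, add(20) rseed(12345)
mi estimate: regress price mpg rep78
```

Treating rep78 as continuous in the imputation model is a simplification; a different mi impute method may suit an ordinal or categorical variable better.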
Converting String Variables
 Summarizing string variables…
 You can’t!
 Convert them into numeric variables
 “describe”
 “destring, replace” (for the entire dataset)
 “destring var, replace” (for a particular variable)
 “destring schoolethnicityw2, replace”
 “encode schoolethnicityw2, generate(schoolethnicityw22)”
 “encode lowincomestatus, generate(lowincomestatus2)”
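A small sketch of the difference between the two commands, using made-up data: destring is for strings that contain numbers, while encode is for strings that contain category names. The variable names here are hypothetical.

```stata
clear
input str4 gpa_str str10 ethnicity
"3.2" "Asian"
"2.8" "Black"
"3.9" "Hispanic"
"3.1" "Black"
end

* Both variables start out as strings
describe

* Digits stored as text: destring converts them to a numeric variable
destring gpa_str, replace

* Category names: encode assigns integer codes and keeps the text as value labels
* generate() must name a new variable, not the original one
encode ethnicity, generate(ethnicity_num)

label list ethnicity_num
tab ethnicity_num
summarize gpa_str
```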
Creating Scales
 Stata
 Average – “egen avg = rowmean(v1 v2 v3 v4)”
 Sum – “egen total = rowtotal(v1 v2 v3 v4)”
 SPSS
 Average – “COMPUTE MPW2=mean(MP1W2,MP2W2,MP3W2,MP4W2,MP5W2,MP6W2,MP7W2,MP8W2,MP9W2R).”
 Sum – “COMPUTE AGW2=AG1W2+AG2W2+AG3W2+AG4W2+AG5W2+AG6W2+AG7W2.”
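One thing worth checking when building scales is how missing items are handled. The sketch below uses made-up item responses to show what the two Stata egen functions do with them; the extra variables total_m and n_answered are illustrative additions.

```stata
clear
input v1 v2 v3 v4
4 5 3 4
2 . 3 1
. . . .
end

* rowmean() averages over the non-missing items only
egen avg = rowmean(v1 v2 v3 v4)

* rowtotal() treats missing items as 0, so an all-missing row sums to 0
egen total = rowtotal(v1 v2 v3 v4)

* With the missing option, an all-missing row gets a missing total instead
egen total_m = rowtotal(v1 v2 v3 v4), missing

* Count how many items each respondent actually answered
egen n_answered = rownonmiss(v1 v2 v3 v4)

list
```

A count such as n_answered makes it possible to keep scale scores only for respondents who answered enough items.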
Creating Dummy Variables
 STATA
 “gen newvar = oldvar==__”
 “gen male = 0”
 “replace male = 1 if schoolgenderw2=="M"”
 SPSS
 Dropdown menu
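A caution on the two-step approach: starting from 0 also codes students with a missing gender as 0. The sketch below, on made-up data, shows one way to keep those cases missing. It is an alternative for illustration, not the workshop's own code.

```stata
clear
input str1 schoolgenderw2
"M"
"F"
""
"M"
end

* One-line dummy: 1 if "M", 0 otherwise, left missing when gender is blank
gen male = (schoolgenderw2 == "M") if schoolgenderw2 != ""
tab male, missing

* tabulate can also create one dummy per category (gender_1, gender_2, ...)
tab schoolgenderw2, generate(gender_)
list
```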
Summarizing Data and Choosing Tests
 tabstat ytdgpaw2, stat(me min med max)
 tab schoolgenderw2 schoolethnicityw2
 tab schoolethnicityw22 lowincomestatus2
 tabstat ytdgpaw2, s(me med sd co) by(schoolethnicityw22)
 http://www.som.soton.ac.uk/learn/resmethods/statisticalnotes/which_test.htm
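The same kinds of summaries, written with full statistic names and run on Stata's built-in auto dataset so they can be tried without the workshop file. The variables price, foreign, and rep78 stand in for the GPA, gender, and ethnicity variables above.

```stata
sysuse auto, clear

* Summary statistics for one continuous variable
tabstat price, statistics(mean min median max)

* Two-way table of counts for two categorical variables
tab foreign rep78

* Summary statistics for a continuous variable, broken out by group
tabstat price, statistics(mean median sd count) by(foreign)
```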
Using appropriate statistics and graphs
 Which statistics and graphs to report depends on the types of the variables of interest:
 For continuous (Normally distributed) variables
 N, mean, standard deviation, minimum, maximum
 histograms, dot plots, box plots, scatter plots
 For continuous (skewed) variables
 N, median, lower quartile, upper quartile, minimum, maximum, geometric mean
 histograms, dot plots, box plots, scatter plots
 For categorical variables
 frequency counts, percentages
 one-way tables, two-way tables
 bar charts
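Using appropriate statistics and graphs…
[Figure: grid of chart types by variable type (Y categorical or continuous; X categorical, continuous, or time; optional categorical grouping variable Z). Some combinations are marked N/A, and one is marked "Use 3-Way Table".]
All these graphs are available in Chart Builder, from the Choose from: list.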
Flow chart of commonly used descriptive statistics and graphical illustrations
Exploring data
 Descriptive statistics
  Categorical data
   Frequency
   Percentage (Row, Column or Total)
  Continuous data: Measure of location
   Mean
   Median
  Continuous data: Measure of variation
   Standard deviation
   Range (Min, Max)
   Inter-quartile range (LQ, UQ)
 Graphical illustrations
  Categorical data
   Bar chart
   Clustered bar charts (two categorical variables)
   Bar charts with error bars
  Continuous data
   Histogram (can be plotted against a categorical variable)
   Box & Whisker plot (can be plotted against a categorical variable)
   Dot plot (can be plotted against a categorical variable)
   Scatter plot (two continuous variables)
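A rough sketch of how several of these graphs could be drawn in Stata, again on the built-in auto dataset. Each command replaces the previous graph in the Graph window, and the bar chart syntax assumes a reasonably recent Stata version.

```stata
sysuse auto, clear

* Categorical data: bar chart of counts per category
graph bar (count), over(rep78)

* Continuous data: histogram, overall and split by a categorical variable
histogram price
histogram price, by(foreign)

* Box and whisker plot against a categorical variable
graph box price, over(foreign)

* Dot plot against a categorical variable
dotplot price, over(foreign)

* Scatter plot of two continuous variables
scatter price mpg
```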
Choosing appropriate statistical test
 Having a well-defined hypothesis helps to distinguish the outcome variable and the exposure variable
 Answer the following questions to decide which statistical test is appropriate to analyze your data
 What is the variable type for the outcome variable?
 Continuous (Normal, Skew) / Binary / If more than one outcome, are they paired or related?
 What is the variable type for the main exposure variable?
 Categorical (1 group, 2 groups, >2 groups) / Continuous
 For 2 or >2 groups: Independent (Unrelated) / Paired (Related)
 Any other covariates, confounding factors?
Flow chart of commonly used statistical tests
 Outcome variable: Continuous
  Exposure variable: 1 group
   Normal: One-sample t test
   Skew: Sign test / Signed rank test
  Exposure variable: 2 groups
   Normal: Two-sample t test
   Skew: Mann-Whitney U test
  Exposure variable: Paired
   Normal: Paired t test
   Skew: Wilcoxon signed rank test
  Exposure variable: >2 groups
   Normal: One-way ANOVA test
   Skew: Kruskal Wallis test
  Exposure variable: Continuous
   Normal: Pearson Corr / Linear Reg
   Skew: Spearman Corr / Linear Reg
 Outcome variable: Categorical
  Exposure variable: 1 group
   Chi-square test / Exact test
  Exposure variable: 2 groups
   Chi-square test / Fisher’s exact test / Logistic regression
  Exposure variable: Paired
   McNemar’s test / Kappa statistic
  Exposure variable: >2 groups
   Chi-square test / Fisher’s exact test / Logistic regression
  Exposure variable: Continuous
   Logistic regression / Sensitivity & specificity / ROC
 Outcome variable: Survival
  Exposure variable: 2 groups
   KM plot with Log-rank test
  Exposure variable: >2 groups
   KM plot with Log-rank test
  Exposure variable: Continuous
   Cox regression
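For orientation, here is how a few of the tests in the flow chart map to Stata commands. This is a sketch on the built-in auto dataset, chosen only so the commands run as written; the variable pairings are for illustration and are not a recommended analysis.

```stata
sysuse auto, clear

* Continuous outcome, 2 independent groups
ttest price, by(foreign)       // Normal: two-sample t test
ranksum price, by(foreign)     // Skew: Mann-Whitney U (Wilcoxon rank-sum) test

* Continuous outcome, more than 2 groups
oneway price rep78             // Normal: one-way ANOVA
kwallis price, by(rep78)       // Skew: Kruskal-Wallis test

* Continuous outcome, continuous exposure
pwcorr price mpg, sig          // Pearson correlation with p-value
spearman price mpg             // Spearman rank correlation
regress price mpg              // linear regression

* Categorical outcome
tabulate foreign rep78, chi2 exact   // chi-square and Fisher's exact test
logit foreign mpg                    // logistic regression, continuous exposure
```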
Other Issues
 Organizing Quantitative Data
 Choosing the right tests
 Sampling
Favorite Stats Resources
 YouTube
 http://www.ats.ucla.edu/stat/stata/