Data Workshop
H397
Data Cleaning
Inputting data
Missing Values
Converting String Variables
Creating Scales
Creating Dummy Variables
Inputting and Merging Data
Inputting
STATA “insheet using /Users/daphnepenn/Dropbox/CleaningPractice.csv”
SPSS (dropdown menu EASY)
Merging
STATA “merge m:1 sch_no using "C:\Users\dmp869\Desktop\bpsschools.dta"”
SPSS (dropdown menu EASY)
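A minimal sketch of the many-to-one merge above, using made-up data so it runs without the workshop files. The variable names other than sch_no (id, gpa, schname) and the values are assumptions for illustration only. In Stata 13 and later, import delimited is the documented successor to insheet, so the commented line shows that form with a hypothetical file name.

```stata
* Newer equivalent of insheet (file name is a placeholder):
* import delimited using "CleaningPractice.csv", clear

* Hypothetical school-level file: one row per school
clear
input sch_no str10 schname
1 "North"
2 "South"
end
tempfile schools
save `schools'

* Hypothetical student-level file: many rows per school
clear
input id sch_no gpa
101 1 3.2
102 1 2.8
103 2 3.6
end

* Many students match one school record, hence m:1
merge m:1 sch_no using `schools'

* _merge == 3 marks rows found in both files
tab _merge
```

Checking _merge after every merge is a quick way to catch school numbers that appear in only one of the two files.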
Strategies for Missing Data
Figure out why!
Analyze only the available data (i.e. ignoring the missing data)
Imputing the missing data with replacement values, and treating these as if they were observed
Imputing the missing data and accounting for the fact that these were imputed with uncertainty
Using statistical models to allow for missing data, making assumptions about their relationships with the available data.
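Before choosing among these strategies, it helps to see how much is missing and where. The sketch below shows one way to do that in Stata and what "analyze only the available data" looks like in practice. The dataset and variable names (gpa, attend) are made up for illustration; imputation itself is not shown.

```stata
* Hypothetical data with a few missing values (.)
clear
input id gpa attend
1 3.2 95
2 .   88
3 2.9 .
4 3.6 91
end

* Report the number of missing values per variable
misstable summarize gpa attend

* Count missing values in each row across the analysis variables
egen nmiss = rowmiss(gpa attend)
count if nmiss == 0        // rows with complete data

* Available-data analysis: summarize skips the missing gpa
summarize gpa
```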
Converting String Variables
Summarizing string variables…
You can’t!
Convert them into numeric variables
“describe”
“destring, replace” (for the entire dataset)
“destring var, replace” (for a particular variable)
“destring schoolethnicityw2, replace”
“encode schoolethnicityw2, generate(schoolethnicityw22)”
“encode lowincomestatus, generate(lowincomestatus2)”
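The difference between the two commands can be sketched on a small made-up dataset: destring is for numbers that were stored as text, while encode is for category labels. The variable names and values here are assumptions for illustration. Note that encode needs a new variable name in generate().

```stata
* Hypothetical data: both variables are stored as strings
clear
input str4 gpa_str str8 ethnicity
"3.2" "Black"
"2.8" "White"
"3.6" "Asian"
"3.0" "Black"
end

* gpa_str holds digits stored as text, so destring applies
destring gpa_str, generate(gpa)

* ethnicity holds labels, so encode assigns integer codes
* (in alphabetical order) and attaches a value label
encode ethnicity, generate(ethnicity2)

describe
tab ethnicity2
summarize gpa
```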
Creating Scales
Stata
Average – “egen avg = rowmean(v1 v2 v3 v4)”
Sum – “egen total = rowtotal(v1 v2 v3 v4)”
SPSS
Average – “COMPUTE MPW2=mean(MP1W2,MP2W2,MP3W2,MP4W2,MP5W2,MP6W2,MP7W2,MP8W2,MP9W2R).”
Sum – “COMPUTE AGW2=AG1W2+AG2W2+AG3W2+AG4W2+AG5W2+AG6W2+AG7W2.”
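The scale commands handle missing items differently, which matters when some respondents skipped questions. The sketch below uses made-up item scores (v1 to v4 follow the slide; the values are assumptions) to show how rowmean, rowtotal, and a plain sum behave.

```stata
* Hypothetical item responses; row 2 skips one item, row 3 skips all
clear
input v1 v2 v3 v4
4 5 3 4
2 . 3 1
. . . .
end

* rowmean averages the non-missing items only
egen avg = rowmean(v1 v2 v3 v4)

* rowtotal treats missing items as 0, so row 3 totals 0
egen total = rowtotal(v1 v2 v3 v4)

* the missing option returns missing when every item is missing
egen total_m = rowtotal(v1 v2 v3 v4), missing

* a plain sum is missing if any single item is missing
gen total_plus = v1 + v2 + v3 + v4

list
```

The SPSS commands follow a similar split: the mean function uses the valid values, while adding variables with + gives a missing result if any item is missing.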
Creating Dummy Variables
STATA
“gen newvar = oldvar==__”
gen male = 0
replace male = 1 if schoolgenderw2=="M"
SPSS
Dropdown menu
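One thing to watch with the Stata approach above: starting from "gen male = 0" codes students with a blank gender as 0 as well. A minimal sketch of a one-line alternative that keeps blanks as missing, using made-up values for schoolgenderw2:

```stata
* Hypothetical data; the third student has no gender recorded
clear
input str1 schoolgenderw2
"M"
"F"
""
"M"
end

* The comparison evaluates to 1 or 0; the if qualifier
* leaves blank entries as missing instead of 0
gen male = (schoolgenderw2 == "M") if schoolgenderw2 != ""

tab male, missing
```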
Summarizing Data and Choosing Tests
tabstat ytdgpaw2, stat(me min med max)
tab schoolgenderw2 schoolethnicityw2
tab schoolethnicityw22 lowincomestatus2
tabstat ytdgpaw2, s(me med sd co) by(schoolethnicityw22)
http://www.som.soton.ac.uk/learn/resmethods/statisticalnotes/which_test.htm
Using appropriate statistics and graphs
Which statistics and graphs to report depends on the types of variables of interest:
For continuous (Normally distributed) variables
N, mean, standard deviation, minimum, maximum
histograms, dot plots, box plots, scatter plots
For continuous (skewed) variables
N, median, lower quartile, upper quartile, minimum, maximum, geometric mean
histograms, dot plots, box plots, scatter plots
For categorical variables
frequency counts, percentages
one-way tables, two-way tables
bar charts
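The list above can be sketched in Stata with the auto dataset that ships with the software. This is an illustration only: treating mpg as roughly Normal and price as skewed is an assumption made for the example, not a result from the workshop data.

```stata
sysuse auto, clear    // example dataset installed with Stata

* Continuous, treated as roughly Normal: N, mean, SD, min, max
tabstat mpg, statistics(n mean sd min max)

* Continuous, skewed: N, median, quartiles, min, max
tabstat price, statistics(n p50 p25 p75 min max)

* Categorical: one-way and two-way tables with row percentages
tab foreign
tab foreign rep78, row

* Graphs for continuous variables
histogram price
graph box price, over(foreign)
scatter mpg weight
```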
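Using appropriate statistics and graphs…
[Figure: grid of recommended graphs by variable type. Y is categorical or continuous; X is categorical, continuous, or time; an optional grouping variable Z is categorical. Some cells read "Use 3-Way Table" or "N/A".]
All these graphs are available in Chart Builder, from the "Choose from:" list.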
Flow chart of commonly used descriptive statistics and graphical illustrations
Exploring data
Descriptive statistics
  Categorical data
    Frequency
    Percentage (Row, Column or Total)
  Continuous data: Measure of location
    Mean
    Median
  Continuous data: Measure of variation
    Standard deviation
    Range (Min, Max)
    Inter-quartile range (LQ, UQ)
Graphical illustrations
  Categorical data
    Bar chart
    Clustered bar charts (two categorical variables)
    Bar charts with error bars
  Continuous data
    Histogram (can be plotted against a categorical variable)
    Box & Whisker plot (can be plotted against a categorical variable)
    Dot plot (can be plotted against a categorical variable)
    Scatter plot (two continuous variables)
Choosing appropriate statistical test
Having a well-defined hypothesis helps to distinguish the outcome variable and the exposure variable
Answer the following questions to decide which statistical test is appropriate to analyze your data
What is the variable type for the outcome variable?
Continuous (Normal, Skew) / Binary / If there is more than one outcome, are they paired or related?
What is the variable type for the main exposure variable?
Categorical (1 group, 2 groups, >2 groups) / Continuous
For 2 or >2 groups: Independent (Unrelated) / Paired (Related)
Any other covariates, confounding factors?
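As a worked example of answering these questions, take a continuous outcome and a categorical exposure with two independent groups. The sketch below uses Stata's built-in auto dataset, with mpg as the outcome and foreign as the exposure; the choice of variables is an assumption for illustration.

```stata
sysuse auto, clear    // example dataset installed with Stata

* Outcome: mpg (continuous). Exposure: foreign (2 independent groups).

* If the outcome looks roughly Normal: two-sample t test
ttest mpg, by(foreign)

* If the outcome is skewed: Mann-Whitney U (Wilcoxon rank-sum) test
ranksum mpg, by(foreign)
```

A histogram of the outcome within each group is a simple way to decide between the two.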
Flow chart of commonly used statistical tests
Outcome variable: Continuous
  Exposure variable   Normal                      Skew
  1 group             One-sample t test           Sign test / Signed rank test
  2 groups            Two-sample t test           Mann-Whitney U test
  Paired              Paired t test               Wilcoxon signed rank test
  >2 groups           One-way ANOVA test          Kruskal Wallis test
  Continuous          Pearson Corr / Linear Reg   Spearman Corr / Linear Reg
Outcome variable: Categorical
  Exposure variable   Test
  1 group             Chi-square test / Exact test
  2 groups            Chi-square test / Fisher’s exact test / Logistic regression
  Paired              McNemar’s test / Kappa statistic
  >2 groups           Chi-square test / Fisher’s exact test / Logistic regression
  Continuous          Logistic regression / Sensitivity & specificity / ROC
Outcome variable: Survival
  Exposure variable   Test
  2 groups            KM plot with Log-rank test
  >2 groups           KM plot with Log-rank test
  Continuous          Cox regression
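Several branches of the flow chart map directly onto Stata commands. The sketch below runs a few of them on the built-in auto dataset; the pairing of variables to each test is an assumption for illustration, and the survival branch is not shown.

```stata
sysuse auto, clear    // example dataset installed with Stata

* Continuous outcome, >2 groups
oneway mpg rep78              // one-way ANOVA (Normal)
kwallis mpg, by(rep78)        // Kruskal Wallis (skewed)

* Continuous outcome, continuous exposure
pwcorr mpg weight, sig        // Pearson correlation
spearman mpg weight           // Spearman correlation
regress mpg weight            // linear regression

* Categorical outcome, categorical exposure
* (adding the exact option requests Fisher's exact test)
tab foreign rep78, chi2

* Binary outcome, continuous exposure
logistic foreign mpg
```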
Other Issues
Organizing Quantitative Data
Choosing the right tests
Sampling
Favorite Stats Resources
YouTube
http://www.ats.ucla.edu/stat/stata/