Northwestern Medicine - Feinberg School of Medicine
Preparing and Formatting your
Research Data
May 15, 2015
Hannah Palac, MS
[email protected]
Overview
• Importance
• Deciding What to Measure
• Deidentification
• Data Entry and Organization
• Keeping a Codebook
• Updating your Data
• Brief Intro to REDCap
• Preparing for Statistical Collaboration
Importance
• Garbage In Garbage Out
• Ethics
• Scientific integrity
Ensures consistency
Instills confidence in funders, participants, readers, etc.
Lays groundwork for smooth data cleaning and
statistical analysis
What do you want to know?
• What is your research question and primary hypothesis?
Be clear and specific when stating your question
Example:
Non-specific: Is NSAID use associated with
complications after myocardial infarction?
Specific: Is NSAID use associated with bleeding and
cardiovascular events in patients receiving
antithrombotic therapy after myocardial infarction?
What do you want to know about each
person/unit?
• Let the research question and hypothesis drive
the variables you collect
• Consider confounding variables and effect
modifiers
What variables will you need to assess for
these possible effects?
• Categorize variables into meaningful groups (e.g.
demographics, lab values, medical history, etc.)
Longitudinal Studies
• List out what data is collected at each timepoint
in an event grid
• Example:
What type of data is best suited for each
variable?
• Nominal/Categorical: Variables with 2+ categories with
no intrinsic order
Race, Sex, Marital Status
• Ordinal: Variables with 2+ categories that can be ordered
or ranked
Disease stage or severity, Education
• Continuous: Variables measured on a continuum or
interval scale
Laboratory values, Age, Weight, Height
What type of data is best suited for each
variable?
• String/Character: Words
Many statistical software packages read string variables exactly “as is,”
meaning that differences in spacing, capitalization, and spelling will be
treated as separate values
• Example: “Male”, “male”, “1” and “M” mean the same thing, but
the software reads them as four separate groups
• Numeric: Numbers
Raw values (e.g. Age, Labs, etc.)
Coded values (e.g. 0/1 coding for Female/Male)
• Avoid using symbols, such as $ or %, in your data
In general, formats are not recommended except for date variables, in
which all data points should follow a consistent date format
• Do NOT mix string and numeric data types in the same variable
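To illustrate the point about string variables, here is a minimal Python sketch. The raw entries and the 0/1 mapping are made up for illustration; the idea is only that inconsistent text splits one variable into many groups, while a single numeric code does not.

```python
from collections import Counter

# Hypothetical raw entries for a "sex" column, typed inconsistently.
raw_sex = ["Male", "male", "1", "M", "Female", "F", "female ", "0"]

# Read "as is", every distinct spelling becomes its own group.
raw_counts = Counter(raw_sex)

# Assumed mapping for this sketch: 0 = Female, 1 = Male.
SEX_CODES = {"male": 1, "m": 1, "1": 1, "female": 0, "f": 0, "0": 0}


def code_sex(value):
    """Return the numeric code for a raw entry, or None if unrecognized."""
    return SEX_CODES.get(value.strip().lower())


coded_sex = [code_sex(v) for v in raw_sex]
coded_counts = Counter(coded_sex)

print(len(raw_counts), "groups before cleaning")
print(len(coded_counts), "groups after cleaning")
```

Entering the numeric code at data collection avoids this cleanup step entirely.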
Deidentification and Security
• Use a unique subject ID number rather than identifying information,
such as MRN or name
If you must link the ID to the participant, maintain a key separate
from the database
Use the unique participant ID for all study related documents
Avoid sending MRNs or other PHI to the statistician
Do not use the Excel row number as an identifier
Example Key:
• Do not transfer data or messages containing PHI to Gmail/Yahoo/etc.
e-mail addresses
• Do not store data containing PHI on flash drives, personal devices, etc.
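One possible way to separate the linking key from the study data is sketched below in Python. The records, the field names, and the sequential ID scheme starting at 100001 are all assumptions for illustration, not a prescribed procedure.

```python
import csv
import io

# Hypothetical identified records; names and MRNs are made up.
identified = [
    {"mrn": "A0001", "name": "Pat One", "ldl": 179},
    {"mrn": "A0002", "name": "Pat Two", "ldl": 101},
]


def deidentify(records, start_id=100001):
    """Split records into a linking key and a study dataset without PHI."""
    key_rows, data_rows = [], []
    for offset, rec in enumerate(records):
        study_id = start_id + offset  # assumed ID scheme for this sketch
        key_rows.append({"study_id": study_id, "mrn": rec["mrn"], "name": rec["name"]})
        data_rows.append({"study_id": study_id, "ldl": rec["ldl"]})
    return key_rows, data_rows


def to_csv(rows):
    """Serialize a list of dicts to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


key_rows, data_rows = deidentify(identified)
key_csv = to_csv(key_rows)    # keep in a secure location, apart from the database
data_csv = to_csv(data_rows)  # the only file that needs to reach the statistician
print(data_csv)
```

Only the second file carries study data, and it contains no MRN or name.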
Data Entry and Organization
• In general, statistical software packages prefer numeric data
• Code data using numeric codes and use these codes during data
collection and entry
Examples:
• 0=Female; 1=Male
• 0=No diabetes; 1=Diabetes
• 0=Underweight; 1=Normal weight; 2=Overweight; 3=Obese
• Be consistent with your codes, such that similar variables use the
same codes
• Use consistent codes for “other” values, N/A, patient refusal,
patient does not know, or other missing data
Do not enter N/A or other text
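The coding advice above might look like this in practice. This is a sketch only: the special codes 96 to 99 are an assumed convention chosen to fall outside any plausible data value, and the answer labels are hypothetical.

```python
# Assumed special codes for this sketch; reuse the same ones for every variable.
SPECIAL_CODES = {"not applicable": 96, "refused": 97, "does not know": 98, "missing": 99}

DIABETES_CODES = {"no diabetes": 0, "diabetes": 1}


def encode(answer, codes):
    """Map a text answer to its numeric code, falling back to the special codes."""
    text = (answer or "missing").strip().lower()
    if text in codes:
        return codes[text]
    if text in SPECIAL_CODES:
        return SPECIAL_CODES[text]
    raise ValueError(f"No code defined for {answer!r}")


answers = ["Diabetes", "No diabetes", "Refused", None, "does not know"]
encoded = [encode(a, DIABETES_CODES) for a in answers]
print(encoded)
```

Raising an error for undefined answers, rather than storing the text, keeps the column purely numeric.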
Data Entry and Organization
• Always enter the raw and intact data field rather than entering
the calculation or categorizing right away
• Do not use symbols in your data (e.g. >60 for normal eGFR)
Instead enter 60 or categorize values into meaningful groups
(Stage 1, Stage 2, etc.)
• For variables where only one value is possible, create one
variable only
Example: Disease stage or severity
• For variables where there can be multiple values at the same
time, create separate variables for each item
Example: Medications, side effects, diagnoses
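As a rough illustration of both ideas, the Python sketch below stores raw weight and height, derives the weight category afterwards, and expands a multi-valued medication list into one 0/1 variable per item. The records, the medication list, and the column names are hypothetical; the cut points are the standard BMI thresholds (18.5, 25, 30).

```python
# Hypothetical raw records: weight and height are entered, the category is not.
records = [
    {"id": 100005, "weight_kg": 54.0, "height_m": 1.75, "meds": ["aspirin"]},
    {"id": 100141, "weight_kg": 95.0, "height_m": 1.70, "meds": ["aspirin", "statin"]},
]

MEDICATIONS = ["aspirin", "statin", "nsaid"]  # assumed list for this sketch


def bmi_category(weight_kg, height_m):
    """Derive the 0-3 weight category from raw values."""
    bmi = weight_kg / height_m ** 2
    if bmi < 18.5:
        return 0  # Underweight
    if bmi < 25:
        return 1  # Normal weight
    if bmi < 30:
        return 2  # Overweight
    return 3      # Obese


rows = []
for rec in records:
    row = {"id": rec["id"], "weight_kg": rec["weight_kg"], "height_m": rec["height_m"]}
    row["bmi_cat"] = bmi_category(rec["weight_kg"], rec["height_m"])
    # One 0/1 variable per medication, since several can apply at once.
    for med in MEDICATIONS:
        row[f"med_{med}"] = 1 if med in rec["meds"] else 0
    rows.append(row)

for row in rows:
    print(row)
```

Because the raw values are kept, the categories can be redefined later without re-entering data.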
Working with “Others”
• For coded variables, develop a consistent scheme for coding
“other” values that can be implemented across all variables
• Keep a separate text column for “others” next to the variable of
interest
• Examples:
Variables with only one possible value:
Variables with multiple possible values:
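One way to pair a coded variable with a separate "other" text column is sketched below. The variable (insurance type), its codes, and the choice of 95 as the shared "other" code are assumptions made for illustration.

```python
OTHER = 95  # assumed "other" code, reused across all coded variables

# Hypothetical coded variable.
INSURANCE_CODES = {"private": 0, "medicare": 1, "medicaid": 2}


def code_with_other(answer, codes):
    """Return (code, other_text); unlisted answers get OTHER plus the original text."""
    text = answer.strip()
    key = text.lower()
    if key in codes:
        return codes[key], ""
    return OTHER, text


print(code_with_other("Medicare", INSURANCE_CODES))
print(code_with_other("Military", INSURANCE_CODES))
```

The numeric column stays analyzable, and the free text is preserved next to it for later review.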
Naming your Variables
• Names do not necessarily have to be
descriptive of the variable, but it is
nice if they are
• Start with a letter
• Keep it short
• Different variables should have different names
Example: If collecting data on side effects for multiple
medications, use a systematic naming scheme such as
“[medname]_nausea” rather than “nausea” for all
medications.
Naming your Variables
• If variables have more than 1 component, such as BP,
create multiple variables (SBP, DBP)
• If necessary, add a separate variable for comments
Do NOT include comments in the variable
• Use caution with calculations in Excel
• Do not use spaces or symbols
• Do not color code
• Do not include blank rows
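The naming rules can be checked automatically before a spreadsheet is submitted. The following Python sketch is one possible check; the 32-character cap is an arbitrary stand-in for "keep it short", and the header row is invented.

```python
import re

# Assumed rule: a letter first, then letters, digits, or underscores, 32 characters at most.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,31}")


def check_names(names):
    """Return (name, reason) pairs for names that break the rules."""
    problems = []
    seen = set()
    for name in names:
        if not NAME_PATTERN.fullmatch(name):
            problems.append((name, "must start with a letter, no spaces or symbols"))
        if name.lower() in seen:
            problems.append((name, "duplicate name"))
        seen.add(name.lower())
    return problems


# Hypothetical header row from a spreadsheet.
header = ["id", "sex", "SBP", "DBP", "aspirin_nausea", "statin_nausea",
          "2nd visit", "LDL %", "sex"]
problems = check_names(header)
for name, reason in problems:
    print(f"{name!r}: {reason}")
```

Names are compared case-insensitively here because some statistical packages do not distinguish case in variable names.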
Keeping a Codebook
• Keep a codebook separate from the database (such as in a
separate tab in Excel) that lists each variable's label, name, units,
possible values (codes), plausible ranges, and formulas
Example:
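A codebook can also be written in a form that software can check against the data. The Python sketch below is hypothetical: the entries, units, and plausible range are illustrative and are not taken from the original slide.

```python
# Hypothetical codebook entries; labels, units, and ranges are illustrative only.
CODEBOOK = {
    "sex": {"label": "Sex", "codes": {0: "Female", 1: "Male"}},
    "ldl_base": {"label": "LDL cholesterol at baseline", "units": "mg/dL",
                 "range": (20, 400)},
}


def check_row(row):
    """Return (variable, value, reason) for values the codebook does not allow."""
    issues = []
    for name, value in row.items():
        entry = CODEBOOK.get(name)
        if entry is None:
            continue  # variable not described in the codebook
        if "codes" in entry and value not in entry["codes"]:
            issues.append((name, value, "undefined code"))
        if "range" in entry:
            low, high = entry["range"]
            if not low <= value <= high:
                issues.append((name, value, "outside plausible range"))
    return issues


print(check_row({"id": 100005, "sex": 0, "ldl_base": 179}))
print(check_row({"id": 100141, "sex": 2, "ldl_base": 1010}))
```

Running such a check during data entry catches typos, such as an extra digit, before they reach analysis.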
Wide vs. Long Formats
• “Wide” or “Horizontal” data consists of one observation per
participant, with all data entered in a single row
Example:
ID      sex  ldl_base  ldl_wk4  ldl_wk8  ldl_wk12
100005  0    179       175      162      169
100141  1    101       121      104      113
• “Long” or “Vertical” data consists of multiple
observations per participant entered in separate rows
Example:
ID      sex  time      ldl
100005  0    Baseline  179
100005  0    Week 4    175
100005  0    Week 8    162
100005  0    Week 12   169
100141  1    Baseline  101
100141  1    Week 4    121
100141  1    Week 8    104
100141  1    Week 12   113
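Converting between the two layouts is mechanical. The Python sketch below reshapes the wide LDL example into the long one using only the standard library; the helper name wide_to_long and the column-to-label mapping are choices made for this illustration.

```python
# The wide example from above, one dict per participant.
wide = [
    {"ID": 100005, "sex": 0, "ldl_base": 179, "ldl_wk4": 175, "ldl_wk8": 162, "ldl_wk12": 169},
    {"ID": 100141, "sex": 1, "ldl_base": 101, "ldl_wk4": 121, "ldl_wk8": 104, "ldl_wk12": 113},
]

# Maps each wide column to the time label used in the long layout.
TIME_LABELS = {
    "ldl_base": "Baseline",
    "ldl_wk4": "Week 4",
    "ldl_wk8": "Week 8",
    "ldl_wk12": "Week 12",
}


def wide_to_long(rows):
    """Emit one row per participant per timepoint."""
    long_rows = []
    for row in rows:
        for column, label in TIME_LABELS.items():
            long_rows.append(
                {"ID": row["ID"], "sex": row["sex"], "time": label, "ldl": row[column]}
            )
    return long_rows


long_rows = wide_to_long(wide)
for row in long_rows:
    print(row["ID"], row["sex"], row["time"], row["ldl"])
```

Time-invariant variables such as sex are simply repeated on every row of the long layout.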
Updating your Data
• Invariably, there will be some data cleaning by the statistician
(reformatting, converting from wide to long or long to wide)
• If you need to update your data, please do not add it to the
original database
• Instead, send a new spreadsheet with additions or changes so
the statistician can figure out the best way to merge or append
this data into the clean dataset
Be sure to identify the new or updated data by study ID
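To show why the study ID matters for updates, here is a hedged Python sketch of how an update file might be merged into a clean dataset. The function apply_updates, the corrected value, and participant 100200 are hypothetical.

```python
# Clean dataset held by the statistician, keyed by study ID.
clean = {
    100005: {"sex": 0, "ldl_base": 179},
    100141: {"sex": 1, "ldl_base": 101},
}

# Hypothetical update file: one corrected value and one new participant.
updates = [
    {"ID": 100141, "ldl_base": 110},
    {"ID": 100200, "sex": 0, "ldl_base": 150},
]


def apply_updates(clean_data, update_rows):
    """Return a merged copy plus a log of what changed, matched on study ID."""
    merged = {sid: dict(rec) for sid, rec in clean_data.items()}
    log = []
    for upd in update_rows:
        sid = upd["ID"]
        fields = {k: v for k, v in upd.items() if k != "ID"}
        if sid in merged:
            merged[sid].update(fields)
            log.append((sid, "updated"))
        else:
            merged[sid] = fields
            log.append((sid, "appended"))
    return merged, log


merged, log = apply_updates(clean, updates)
print(log)
```

The original dataset is left untouched, so any update can be traced or reversed.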
What is REDCap?
• Research Electronic Data Capture
• Web-based application used for building and managing research
databases quickly and securely
• Developed at Vanderbilt in 2004
• 1,255 consortium partners and rapidly growing
Why Researchers Use REDCap
• Can create data collection forms and surveys without the need
for a programmer
However, programmers are still needed to facilitate REDCap
maintenance (CDSI)
• Can enter and access data from multicenter studies in one place
• All data stored in REDCap at Northwestern is compliant with
HIPAA standards for security
• Audit trails for tracking history of database and entry
• Suitable for (almost) any study with web-based data entry
capabilities
• Reporting capabilities
• Self-service tool means it’s FREE (to you) to use!
Sample Data Entry Form
Limitations/Considerations
• Some flexibility limitations in form design
• Data entry in online form is only possible with
VPN connection
• Once a project is in “production mode,” any
changes must be submitted to a REDCap
administrator for approval
• No offline version of REDCap; must always be
connected to internet
Resources
• http://project-redcap.org/
Video Tutorials
REDCap Shared Library – A repository for data
collection instruments and forms that can be
downloaded and used (for free) by consortium
partners
• For information on REDCap at NU, contact:
[email protected]
Preparing for Statistical Collaboration
• State your primary and secondary objectives and hypotheses
• Decide a priori if you will be excluding cases or performing a
subgroup analysis
Adjustment will be required to ensure “sub-results” are not
simply due to chance
• Avoid deciding post-hoc to repeat the analysis on a subset of
people
• Submit your spreadsheet with unique variable names in the first
row
• Submit your codebook containing variable labels and codes
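The slides do not say which adjustment to use for subgroup results. As one simple and widely known option, the Python sketch below applies a Bonferroni correction, which compares each p-value to alpha divided by the number of tests. The p-values and subgroup names are invented, and your statistician may prefer a different method.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag which tests remain significant after a Bonferroni correction."""
    threshold = alpha / len(p_values)
    return {name: p <= threshold for name, p in p_values.items()}


# Hypothetical p-values from an overall analysis and two subgroup analyses.
subgroup_p = {"overall": 0.004, "age_65_plus": 0.03, "diabetic": 0.20}
significant = bonferroni(subgroup_p)
print(significant)
```

With three tests the threshold drops from 0.05 to about 0.0167, so a subgroup p-value of 0.03 no longer counts as significant.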
BCC Resources for Maximizing
Statistical Interactions
• Guidelines to help you prepare for Statistical Collaboration:
http://www.feinberg.northwestern.edu/sites/bcc/docs/StatsCollaborationGuideSummary.pdf
• Preliminary Help (Grants and Power):
http://www.feinberg.northwestern.edu/sites/bcc/docs/PowerGuide.pdf
• Database Issues:
http://www.feinberg.northwestern.edu/sites/bcc/docs/DataGuide.pdf
• Analysis and Write-up:
http://www.feinberg.northwestern.edu/sites/bcc/docs/ProjectGuide.pdf
Submitting a Request for Statistical Support
• Submit BCC Appointment Request Form:
http://www.feinberg.northwestern.edu/sites/bcc/contactus/request-form.html
Questions?
https://redcap.nubic.northwestern.edu
Biostatistics Collaboration Center
http://www.feinberg.northwestern.edu/sites/bcc/