Title of Presentation Myriad Pro, Bold, Shadow, 28pt
Download
Report
Transcript Title of Presentation Myriad Pro, Bold, Shadow, 28pt
Preparing Data for Analysis
Nishan Ahmed
Regional Training Workshop on Influenza Data Management
Phnom Penh, Cambodia
July 27 – August 2, 2013
National Center for Immunization & Respiratory Diseases
Influenza Division
Data Cleaning: What is it?
• Check for accuracy of observations and correct or
eliminate inaccuracies
– Important for both simple and complex data
• Questions to ask:
– Are values outside of what you would normally
observe?
– If yes, are values due to inaccuracies in the data or to
real changes in activity (i.e. an outbreak, start of
influenza season)
• Values can be inaccurate due to many factors
• Data Entry mistake
• Incorrect measurement at site
• Incorrect analysis
Data Cleaning: Why do it?
• To prepare your data for regular analysis
– Steps:
• Prepare a copy for temporary cleaning, but also
clean the original data source as corrections are
validated
• If data is not cleaned at source, cleaning will need
to be done each time analysis is attempted (i.e. records
can be temporarily deleted until verified or corrected)
• To finalize a dataset for future analysis/create a
clean copy to be used for research
– Typically a more thorough process than cleaning
during a flu season
Data Cleaning: Why do it?
• To check for validity and consistency of reported
variables
– Ensures that the data collected makes sense
• Examples:
– # of ILI cases is not greater than the # of patient visits
– The date of onset is before data of death
– Only enrolled sites should be reporting & included in
analysis of sentinel data
• To check for data outliers
– A facility that normally sees ~100 patient visits will
probably not see 1,000 patients during a week
• To identify and remove duplicate records
Methods to identify problems
• How do you find data that has problems?
– Eyeball method
– Through quick, simple data queries
• Access or Excel queries as you go
– Statistical methods
– Through pre-programmed automated processes
• Used for elements that are routinely cleaned
• Example: Automated process for deleting duplicate
records
Eyeball Method
Quick and Simple Queries
• To find duplicate records, using Access
Quick and Simple Queries
• To check validity of variables
Automated Processes: Duplicates
Basic Statistic Measures
• Measures of Center
– Mean: Sum of the observations divided by the
number of observations.
– Median: The middle value in an ordered list
– Mode: The most frequently occurring value
Measures of Variation or
Spread
Standard Deviation:
measures variation by
indication how far, on
average, the
observations are from
the mean
Equations in Excel
Mean
Median
Standard Deviation
Data Cleaning Processes
• Example: Checking for outliers
– The US ILI system uses a statistical process to
check for outliers:
• Look at # of patient visits over time from a given
provider
• That # should be consistent within a certain degree of
change (i.e. 4 standard deviations from the mean)
• All values above or below this value are selected and
checked manually to verify whether or not the values
are reasonable and make sense.
Data Outliers in Excel
Data Cleaning
01002: Data could not be disproved, left in.
04099: Fixed data based on returned
workfolder
04108: Data looked OK to surveillance
staff, this was the peak of pandemic, and
we would have expected numbers to be
high
Error Logs
• List of errors found during the cleaning
process
• Helps to keep track of changes made to
records during the cleaning process.
– Keep track of how the data has changed over time
– Used for follow-up on questions to sites
• May be manual or automated
– Based on needs of the data
Example of Error Log
Date
2/9/11
Specimen Patient
State
ID
ID
Field
Prior Value
Current
Value
Your
Reason for Change Initials
SPECIMEN
coinfection H3 and
MD A11009193 995731
id
A11009193 A11009193b
2009 H1N1
SPECIMEN
coinfection H3 and
id
A11005370 A11005370b
2009 H1N1
2/9/11
MD A11005370 991669
2/9/11
SPECIMEN
coinfection B and
MD A11010991 997611
id
A11010991 A11010991B
2009 H1N1
2/9/11
M11VR000
SPECIMEN M11VR0008 M11VR00083 coinfection 2009
SD
830
388182
id
30
0 (a, b, c)
H1N1, H3, and B
Comments
AB
changed one specimen id to
'b' so would be coded as two
separate viruses
AB
changed one specimen id to
'b' so would be coded as two
separate viruses
AB
changed one specimen id to
'b' so would be coded as two
separate viruses
AB
changed one specimen id to
'b' so would be coded as two
separate viruses
Conclusions
• Preparing data for analysis includes finding and
cleaning as many data errors as possible
– Statistical methods, the eyeball method, and simple
queries can all be used to find potential data errors
• Data cleaning is important because data errors
could alter the interpretation of data (i.e. could cause
a perceived increase without a true increase in
disease activity)
• Error logs are useful in accounting for errors and
how they were dealt with