Data Cleaning 101

Download Report

Transcript Data Cleaning 101

Data Cleaning 101
Ron Cody, Ed.D
Robert Wood Johnson Medical School
Piscataway, NJ
Sample Data Set
Variable
Name
Description
PATNO
Patient Number
GENDER Gender
VISIT
Visit Date
HR
Heart Rate
SBP
Systolic Blood Pres.
DBP
Diastolic Blood Pres.
DX
Diagnosis Code
AE
Adverse Event
Type
Character
Character
MMDDYY10.
Numeric
Numeric
Numeric
Character
Character
Valid Values
Numerals
‘M' or 'F'
Any valid date
40 to 100
80 to 200
60 to 120
1 to 3 digits
'0' or '1'
Using PROC FREQ to Look for
Invalid Character Values
The FREQ Procedure
PROC FREQ DATA=PATIENTS;
TITLE "Frequency Counts";
TABLES GENDER DX AE / NOCUM
NOPERCENT;
RUN;
Gender
GENDER
Frequency
------------------2
1
F
12
M
13
X
1
f
2
Frequency Missing = 1
Using PROC PRINT with a WHERE
Statement
PROC PRINT DATA=PATIENTS;
WHERE GENDER NOT IN ('F','M',' ')
VERIFY(DX,' 0123456789') NE 0
AE NOT IN ('0','1',' ');
TITLE "Listing of Invalid Data";
ID PATNO;
VAR GENDER DX AE;
RUN;
OR
OR
Using PROC PRINT with a WHERE
Statement
Listing of Invalid Data
PATNO
GENDER
DX
AE
002
003
XX5
010
013
002
023
987
F
X
M
f
2
F
f
M
X
3
1
1
1
X
0
1
A
0
1.3
0
0
0
Using a Data Step to Identify Invalid
Character Values
DATA _NULL_;
INFILE "C:PATIENTS.TXT" PAD;
FILE PRINT; ***Send output to the output window;
TITLE "Listing of Invalid Data";
INPUT @1 PATNO
$3.
Listing of Invalid
@4 GENDER
$1.
Data
@24 DX
$3.
PATNO=002 DX=X
@27 AE
$1.;
PATNO=003 GENDER=X
***Check GENDER;
PATNO=XX5 AE=A
IF GENDER NOT IN ('F','M',' ') THEN
PATNO=010 GENDER=f
PUT PATNO= GENDER=;
PATNO=013 GENDER=2
***Check DX;
PATNO=002 DX=X
IF VERIFY(DX,' 0123456789') NE 0 THEN
PATNO=023 GENDER=f
PUT PATNO= DX=;
PATNO=987 DX=1.3
***Check AE;
IF AE NOT IN ('0','1',' ') THEN
PUT PATNO= AE=;
RUN;
Using PROC MEANS to Look for
Outliers
PROC MEANS DATA=CLEAN.PATIENTS N NMISS
MIN MAX MAXDEC=3;
TITLE "Checking Numeric Variables";
VAR HR SBP DBP;
RUN;
Checking Numeric Variables
Variable Label
N Nmiss Minimum Maximum
---------------------------------------------------------HR
Heart Rate
27 3
10.000 900.000
SBP
Systolic Blood Pressure 26 4
20.000 400.000
DBP
Diastolic Blood Pressure 27 3
8.000 200.000
----------------------------------------------------------
Using PROC UNIVARIATE with an
ODS Select Statement
ODS SELECT EXTREMEOBS;
PROC UNIVARIATE DATA=CLEAN.PATIENTS;
VAR HR SBP DBP;
ID PATNO; The UNIVARIATE Procedure
Variable: DBP (Diastolic Blood Pressure)
RUN;
Extreme Observations
--------Lowest--------
--------Highest-------
Value
Obs
Value
23
12
14
27
6
106
120
120
180
200
8
20
64
68
68
PATNO
020
011
013
025
006
PATNO
027
004
010
009
321
Obs
28
4
11
10
22
Using the NEXTROBS Option with
PROC UNIVARIATE
ODS SELECT EXTREMEOBS;
PROC UNIVARIATE DATA=CLEAN.PATIENTS
NEXTROBS=3;
VAR HR SBP DBP;
ID PATNO;
RUN; Variable: DBP (Diastolic Blood Pressure)
Extreme Observations
--------Lowest--------
--------Highest-------
Value
Obs
Value
23
12
14
120
180
200
8
20
64
PATNO
020
011
013
PATNO
010
009
321
Obs
11
10
22
Using a WHERE statement with PROC
PRINT to list out-of-range data
PROC PRINT DATA=CLEAN.PATIENTS;
WHERE HR NOT BETWEEN 40 AND 100 AND
HR IS NOT MISSING
OR
SBP NOT BETWEEN 80 AND 200 AND
SBP IS NOT MISSING
OR
DBP NOT BETWEEN 60 AND 120 AND
DBP IS NOT MISSING;
TITLE "Out-of-range Values for Numeric Variables";
ID PATNO;
VAR HR SBP DBP;
RUN;
Using a WHERE statement with PROC
PRINT to list out-of-range data
Out-of-range Values for Numeric Variables
PATNO
HR
SBP
DBP
004
008
009
010
011
014
017
321
020
023
101
210
86
.
68
22
208
900
10
22
200
.
240
40
300
130
.
400
20
34
120
.
180
120
20
90
84
200
8
78
Using a DATA _NULL_ Data Step to list
out-of-range data values
DATA _NULL_;
INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
FILE PRINT; ***output to the output Window;
TITLE "Listing of Patient Numbers and Invalid Data Values";
INPUT @1 PATNO
$3.
@15 HR
3.
@18 SBP
3.
@21 DBP
3.;
***Check HR;
IF (HR LT 40 AND HR NE .) OR HR GT 100 THEN
PUT PATNO= HR=;
***Check SBP;
IF (SBP LT 80 AND SBP NE .) OR SBP GT 200 THEN
PUT PATNO= SBP=;
***Check DBP;
IF (DBP LT 60 AND DBP NE .) OR DBP GT 120 THEN
PUT PATNO= DBP=;
RUN;
Using a DATA _NULL_ Data Step to list
out-of-range data values
Listing of Patient Numbers and Invalid Data Values
PATNO=004 HR=101
PATNO=008 HR=210
PATNO=009 SBP=240
PATNO=009 DBP=180
PATNO=010 SBP=40
PATNO=011 SBP=300
PATNO=011 DBP=20
PATNO=014 HR=22
PATNO=017 HR=208
PATNO=321 HR=900
PATNO=321 SBP=400
PATNO=321 DBP=200
PATNO=020 HR=10
PATNO=020 SBP=20
PATNO=020 DBP=8
PATNO=023 HR=22
PATNO=023 SBP=34
Using User Defined Formats to
Detect Invalid Values
PROC FORMAT;
VALUE $GENDER 'F','M' = 'Valid'
' '
= 'Missing'
OTHER
= 'Miscoded';
VALUE $DX '001' - '999'= 'Valid'
' '
= 'Missing'
OTHER
= 'Miscoded';
VALUE $AE '0','1' = 'Valid'
' '
= 'Missing'
OTHER = 'Miscoded';
RUN;
Gender
GENDER
Frequency
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
Miscoded
4
Valid
25
Frequency Missing = 1
PROC FREQ DATA=CLEAN.PATIENTS;
TITLE "Using FORMATS";
FORMAT GENDER $GENDER.
DX
$DX.
AE
$AE.;
TABLES GENDER DX AE / NOCUM NOPERCENT;
RUN;
Using User-defined Formats and a
PUT Function
DATA _NULL_;
INFILE "C:PATIENTS.TXT" PAD;
FILE PRINT; ***Send output to the
output window;
TITLE "Invalid Data Values";
INPUT @1 PATNO
$3.
@4 GENDER
$1.
@24 DX
$3.
@27 AE
$1.;
IF PUT(GENDER,$GENDER.) = 'Miscoded'
THEN PUT PATNO= GENDER=;
IF PUT(DX,$DX.) = 'Miscoded' THEN PUT
PATNO= DX=;
IF PUT(AE,$AE.) = 'Miscoded' THEN PUT
PATNO= AE=;
RUN;
Invalid Data Values
PATNO=002 DX=X
PATNO=003 GENDER=X
PATNO=004 AE=A
PATNO=010 GENDER=f
PATNO=013 GENDER=2
PATNO=002 DX=X
PATNO=023 GENDER=f
Using PROC RANK to List the Highest
and Lowest n% of the Data
%MACRO HI_LOW_P(DSN,VAR,PERCENT,IDVAR);
***Compute number of groups for PROC RANK;
%LET GRP = %SYSEVALF(100 / &PERCENT,FLOOR);
***Value of the highest GROUP from PROC RANK, equal to the
number of groups - 1;
%LET TOP = %EVAL(&GRP - 1);
PROC FORMAT;
VALUE RNK 0='Low' &TOP='High';
RUN;
PROC RANK DATA=&DSN OUT=NEW GROUPS=&GRP;
VAR &VAR;
RANKS RANGE;
RUN;
***Sort and keep top and bottom n%;
PROC SORT DATA=NEW (WHERE=(RANGE IN (0,&TOP)));
BY &VAR;
RUN;
(continued)
Using PROC RANK to List the Highest
and Lowest n% of the Data
***Produce the report;
PROC PRINT DATA=NEW;
TITLE "Upper and Lower &PERCENT.% Values
for %UPCASE(&VAR)";
ID &IDVAR;
VAR RANGE &VAR;
FORMAT RANGE RNK.;
RUN;
PROC DATASETS LIBRARY=WORK NOLIST;
DELETE NEW;
RUN;
QUIT;
%MEND HI_LOW_P;
Using PROC RANK to List the Highest
and Lowest n% of the Data
%HI_LOW_P(CLEAN.PATIENTS,SBP,10,PATNO)
Upper and Lower 10% Values for SBP
PATNO
RANGE
SBP
020
023
011
321
Low
Low
High
High
20
34
300
400
Detecting Outliers Based on the
Standard Deviation
***Output mean and standard deviations to a
data set;
proc means data=clean.patients noprint;
var hr;
output out=means(drop=_type_ _freq_)
mean=m_hr
std=s_hr;
run;
Data set MEANS contains one observation:
Listing of Data Set MEANS
m_hr
104.871
s_hr
153.026
continued...
Detecting Outliers Based on the
Standard Deviation
%let n_sd = 2;
***Two standard deviations gives
approximately 5% of the outliers;
data _null_;
file print;
title "Statistics for Numeric Variables";
set clean.patients;
if _n_ = 1 then set means;
if hr lt m_hr - &n_sd*s_hr and hr ne . or
hr gt m_hr + &n_sd*s_hr then
put patno= hr=;
run;
Detecting Outliers Based on the
Standard Deviation
Statistics for Numeric Variables
PATNO=321 HR=900
Detecting Outliers Based on Trimmed
Data
A trimmed mean is a mean
computed by first removing
some of the highest and lowest values before doing the
calculation.
proc rank data=clean.patients out=tmp groups=4;
var hr;
ranks r_hr;
run;
proc means data=tmp noprint;
where r_hr in (1,2); ***The middle 50%;
var hr;
output out=means(drop=_type_ _freq_)
mean=m_hr
std=s_hr;
run;
continued...
Detecting Outliers Based on Trimmed Data
%let n_sd = 2;
data _null_;
title "Outliers Based on Trimmed Data";
file print;
set clean.patients;
if _n_ = 1 then set means;
if hr lt m_hr - &n_sd*2.63*s_hr and
hr ne . or
hr gt m_hr + &n_sd*2.63*s_hr then
put patno= hr=;
run;
Detecting Outliers Based on Trimmed
Data
Outliers Based on Trimmed Data
PATNO=008 HR=210
PATNO=014 HR=22
PATNO=017 HR=208
PATNO=321 HR=900
PATNO=020 HR=10
PATNO=023 HR=22
Detecting Outliers Based on Trimmed Data
Program to Trim the Top and Bottom 5% of the Data
proc rank data=clean.patients out=tmp
groups=20;
var hr;
ranks r_hr;
run;
proc means data=tmp noprint;
where r_hr not in (0,19); *The middle 90%;
var hr;
output out=means(drop=_type_ _freq_)
mean=m_hr
std=s_hr;
run;
Defining Interquartile Range
Diastolic Blood Pressure (DBP)
8
20
74
82
180
100
IQR
1.5 x IQR
Outliers
Outliers
Q3 (upper hinge)
Q1 (lower hinge)
Median
200
Outliers Based on the Interquartile Range
%MACRO INTER_Q(DSN,VAR,IDVAR,N_IQR);
PROC MEANS DATA=&DSN NOPRINT;
VAR &VAR;
OUTPUT OUT=TMP Q3=UPPER Q1=LOWER QRANGE=IQR;
RUN;
DATA _NULL_;
TITLE "Outliers Based on &N_IQR Interquartile Ranges";
FILE PRINT;
SET &DSN;
IF _N_ = 1 THEN SET TMP;
IF &VAR LT LOWER - &N_IQR*IQR AND &VAR NE .
OR &VAR GT UPPER + &N_IQR*IQR THEN PUT &IDVAR= &VAR=;
RUN;
PROC DATASETS LIBRARY=WORK NOLIST;
DELETE TMP;
RUN;
QUIT;
%MEND INTER_Q;
Outliers Based on the Interquartile
Range
%INTER_Q(CLEAN.PATIENTS,SBP,PATNO,2);
Outliers Based on 2 Interquartile Ranges
PATNO=011 SBP=300
PATNO=321 SBP=400
Using Perl Regular Expressions
For Data Cleaning
Some PERL Regular Expression Examples
Regular
Expression
Matches
/cat/
the letters "cat"
/cat*/
the letters "ca" followed by zero or more "t"s
/cat+/
the letters "ca" followed by one or more "t"s
/c[aeiou]t/
a "c" followed by a vowel followed by the letter
"t"
/\d\d/
any two digits
/\d\d+/
two or more digits
PRXPARSE Syntax
RE = PRXPARSE("expression");
return code
PERL regular expression
Examples
RETURN = PRXPARSE("/cat/");
RE = PRXPARSE("/\d\d+/");
PRXMATCH Syntax
POS = PRXMATCH(return,string);
Position of the
beginning of the
pattern. If not
found, returns a
zero
Return code
from PRXPARSE
Function
Text
string
Examples
POS = PRXMATCH(RE,STRING);
RETURN = PRXPARSE("/cat/");
P = PRXMATCH(RETURN,"This is a cat");
Value of P is 11
A Simple Example: Locating a SS Number
DATA FIND_SS;
IF _N_ = 1 THEN RETURN =
PRXPARSE("/\d\d\d-\d\d-\d\d\d\d/");
RETAIN RETURN;
INPUT STRING $30.;
POSITION = PRXMATCH(RETURN,STRING);
IF POSITION GT 0 THEN OUTPUT;
DATALINES;
none on this line
yes! 123-45-6789 is one
two 111-22-3333 444-55-6666
;
RETURN
1
1
STRING
yes! 123-45-6789 is one
two 111-22-3333 444-55-6666
POSITION
6
5
Using PRXMATCH without using
PRXPARSE (version 9.1)
DATA FIND_SS;
INPUT STRING $30.;
POSITION =
PRXMATCH("/\d\d\d-\d\d-\d{4}/",STRING);
IF POSITION GT 0 THEN OUTPUT;
DATALINES;
none on this line
yes! 123-45-6789 is one
two 111-22-3333 444-55-6666
;
STRING
POSITION
yes! 123-45-6789 is one
two 111-22-3333 444-55-6666
6
5
Using Perl Regular
Expressions for Data Cleaning
DATA BAD_DATA;
SET CLEAN.PATIENTS;
IF PRXMATCH("/\d |\d\d |\d{3}/",DX) EQ 0
AND NOT MISSING(DX);
RUN;
Listing of data set BAD_DATA
PATNO
DX
002
002
987
986
X
X
1.3
1 5
Some Regular Expression
Solutions
DATA BAD_OBS;
LENGTH ID $ 5;
INPUT ID @@;
*Valid ID's are X, Y, or Z followed by
one or more digits;
IF NOT PRXMATCH("/^X|Y|Z\d+/",ID);
DATALINES;
X12 C334 Y777 78Z 999
;
Listing of data set BAD_OBS
ID
C334
78Z
999
Penn State SAS Users Group
Open to all SAS users in the central PA
area
Visit our website for more information and sign-up for
our listserv.
http://www.pop.psu.edu/help/cacpri/psu
sug/