Data/Select Cases

Download Report

Transcript Data/Select Cases

GAP Toolkit 5
Training in basic drug abuse data management
and analysis
Training session 12
Data cleaning
Objectives
•
•
•
•
To establish methods of uncovering coding errors
To discuss techniques for implementing logical tests
To present methods of selecting cases
To reinforce the SPSS skills presented to date
Boolean operators: AND
• The AND operator is a logical operator in Boolean
algebra
• Imagine two statements: X and Y
• For the operation (X AND Y) to be true X has to be true
and Y has to be true
• The rules for Boolean operators are commonly
displayed in Truth Tables
Truth table: AND
Let: 0 = False ; 1 = True
X
Y
X AND Y
0
0
0
0
1
0
1
0
0
1
1
1
Boolean operators: OR
• The OR operator is a logical operator in Boolean
algebra
• Imagine two statements: X and Y
• For the operation (X OR Y) to be true either X is true or
Y is true or both X and Y are true
Truth table: OR
Let: 0 = False ; 1 = True
X
Y
X OR Y
0
0
0
0
1
1
1
0
1
1
1
1
Data cleaning
• Check the data for errors
• Clean the data before any data analysis
Types of error
• There are two broad areas of error:
– Coding errors
– Logical errors
Coding error
• Data entry errors
• Out-of-range values
Detecting out-of-range values
• For categorical variables, having declared valid values,
frequency counts will highlight any peculiar entries
• For continuous variables, descriptive statistics, in
particular the range and a histogram, will highlight any
peculiar values
Examples
• Age: generate descriptive statistics
• Treatment type: generate a frequency distribution
Descriptives
Statistic
Age
Mean
95% Confidence Interval
for Mean
Std. Error
31.78
Lower Bound
31.16
Upper Bound
32.40
5% Trimmed Mean
31.31
Median
31.00
Variance
Std. Deviation
.315
154.614
12.434
Minimum
1
Maximum
77
Range
76
Interquartile Range
20.00
Skewness
-.427
.062
Kurtosis
-.503
.124
Histogram
300
200
Frequency
100
Std. Dev = 12.43
Mean = 31.8
N = 1563.00
0
0.0
10.0
5.0
Age
20.0
15.0
30.0
25.0
40.0
35.0
50.0
45.0
60.0
55.0
70.0
65.0
75.0
Treatment type
Frequency
Valid
Inpatient
Outpatient
4
Total
Missing
Total
System
Percent
Valid Percent
Cumulative Percent
1027
65.4
65.7
65.7
535
34.1
34.2
99.9
1
.1
.1
100.0
1563
99.5
100.0
8
.5
1571
100.0
Resolving errors
•
•
•
•
The questionnaires should be checked
If possible, return to the interviewer or interviewee
If still unresolved, consider setting the value as missing
Note the importance of ID numbers for linking the
computer to the questionnaire
Selecting cases
• The ability to select a set of cases according to a
criterion is essential in data cleaning
• Generating statistics for subsets of the data is also a
useful analytical tool
Example: Age
• Descriptive statistics of Age indicate that there is a
case with a value of 1 and a case with the value 77
Descriptive Statistics
N
Age
1563
Valid N (listwise)
1563
Minimum
Maximum
1
77
Mean
Std. Deviation
31.78
• It is advisable to check the extreme values
12.434
Example: Age
• It would be reasonable to check for values 10 and under
and 70 and over
• The task is to select those cases and display the results
• Data/Select Cases generates the following dialogue box
Choose these
options to
define selection
criteria.
Data/Select Cases
• SPSS creates a new variable in the data set called
filter_$ which = 1 when AGE<=10 OR AGE >= 70
• All subsequent analysis will be on the reduced data set
until Data/Select Cases/All Cases is chosen
• The filtered cases are identified by a slash through the
case number
Age
Frequency
Valid
Percent
Valid Percent
Cumulative Percent
1
1
7.1
7.1
7.1
7
5
35.7
35.7
42.9
8
1
7.1
7.1
50.0
9
1
7.1
7.1
57.1
10
3
21.4
21.4
78.6
70
1
7.1
7.1
85.7
72
1
7.1
7.1
92.9
77
1
7.1
7.1
100.0
14
100.0
100.0
Total
Generating a report
• Analyse/Reports/Case Summaries
• Select the variables to be included in the summary
Case summariesa
Case
number
ID
Age
Race
Education
Employment
1
16
16
8
White
Secondary
Working fulltime
2
85
85
77
White
Tertiary
3
183
183
70
White
4
184
184
72
5
903
903
6
1041
7
Marital status
Treatment
type
1st most
frequently used
drug
Married liv w.
spouse
Inpatient
ALCOHOL
Pensioner
Widowed
Inpatient
ALCOHOL
Secondary
Pensioner
Married liv w.
spouse
Inpatient
ALCOHOL
White
Tertiary
Pensioner
Married liv w.
spouse
Inpatient
ALCOHOL
1
White
.
Student/pupil
Never married
Inpatient
DAGGA
1041
7
African
Primary
Student/pupil
Never married
Outpatient
DAGGA
1042
1042
7
African
Primary
Student/pupil
Never married
Outpatient
DAGGA
8
1043
1043
7
African
Primary
Student/pupil
Never married
Outpatient
DAGGA
9
1044
1044
7
African
Primary
Student/pupil
Never married
Outpatient
DAGGA
10
1045
1045
7
African
Primary
Student/pupil
Never married
Outpatient
DAGGA
11
1518
1518
9
African
Primary
Student/pupil
Never married
Outpatient
WHITE PIPE
12
1519
1519
10
African
Primary
Student/pupil
Never married
Outpatient
WHITE PIPE
13
1520
1520
10
African
Primary
Student/pupil
Never married
Outpatient
WHITE PIPE
14
1521
1521
10
African
Primary
Student/pupil
Never married
Outpatient
WHITE PIPE
14
14
14
13
14
14
Total
N
a. Limited to first 100 cases.
14
14
Note: All Cases
• Don’t forget that, once certain cases have been
selected, all subsequent analysis is on the selected
cases only
• Once you have finished working with the subset, restore
the file to All Cases before doing any further analysis
– Data/Select Cases…
– Select the All Cases radio button
– OK
Locating a case
• From the Data Editor:
– Data/Go To Case
OR
– Select a variable, then Edit/Find
Logical errors
• Detecting logical errors involves comparing answers to
ensure that they are consistent
• The type of logical checks appropriate to identify
particular errors will depend on the questions in the
questionnaire
Detecting logical errors
• Cross-tabulations between categorical variables can be
used to highlight errors
• Check criteria using conditional statements and the
Compute facility
• Some software, such as SPSS Databuilder, allows tests
for logical and coding errors to be built into a data entry
form
Example: Cross-tabulation
• Cross-tabulations provide a simple method of
investigating the joint distribution of two variables
• The following slide is a cross-tabulation of Drug1
against Mode1 to check that appropriate modes of
ingestion have been reported
Most Frequently Used Drug (Cross-tabulation)
Mode of ingestion Drug1
Most frequently used drug
DAGGA
Swallow
Smoke
1
HEROIN
CODEINE
SEDATIVES &
TRANQUILLIZERS
BENZODIAZEPINES
181
11
29
71
5
2
44
46
97
1
98
4
1
2
7
24
1
25
3
3
16
16
MANDRAX
12
12
VALIUM
2
2
LSD
5
5
SOLVENTS & INHALANTS
2
WHITE PIPE
ALCOHOL
1
3
6
309
309
717
717
ROHYPNOL
3
3
MISC. PRESCRIPTION DRUGS
9
1
MISC. DRUGS
Total
Total
5
CRACK
ECSTASY
Inject
180
31
COCAINE
AMPHETAMINE
Snort
1
791
634
62
10
1
30
1517
Example: conditional statements
• Main.sav contains information on the three most
frequently used drugs: Drug1, Drug2 and Drug3
• In a single case, no drug should appear in more than
one of the three variables
• To check this, generate a test variable on the basis of a
conditional statement; the test variable should take the
value 0 if all three drug variables are different and the
value 1 if there is any duplication
Compute: Test = 0
• Transform/Compute
• Enter the name of the new variable: TEST
• Click the Type and Label button and declare the
variable as numeric with the label: TEST VARIABLE
FOR DRUG DUPLICATION
• Set TEST = 0
Compute: TEST = 1
• If any of the drug options are the same, TEST should
equal 1 EXCEPT when Drug2 = Drug3 = 77 (not
applicable)
• The condition is if
–
–
–
–
Drug1 = Drug2 OR
Drug1 = Drug3 OR
(Drug2 = Drug3 AND Drug2  77)
THEN Test = 1
Click If… button to define
the conditional statement.
Case summariesa
1st most frequently used
drug
2nd most frequently used
drug
3rd most frequently used
drug
ID
1
BENZODIAZEPINES
MISC. PRESCRIPTION
DRUGS
MISC. PRESCRIPTION
DRUGS
734
2
CRACK
CRACK
ECSTASY
807
3
CRACK
WHITE PIPE
CRACK
835
4
HEROIN
SEDATIVES &
TRANQUILLIZERS
SEDATIVES &
TRANQUILLIZERS
1182
5
SEDATIVES &
TRANQUILLIZERS
MISC. PRESCRIPTION
DRUGS
MISC. PRESCRIPTION
DRUGS
1230
6
SEDATIVES &
TRANQUILLIZERS
SEDATIVES &
TRANQUILLIZERS
MISC. PRESCRIPTION
DRUGS
1231
7
MISC. PRESCRIPTION
DRUGS
MISC. PRESCRIPTION
DRUGS
Not Applicable
1245
8
MISC. PRESCRIPTION
DRUGS
MISC. PRESCRIPTION
DRUGS
ALCOHOL
1250
Total
N
a. Limited to first 100 cases.
8
8
8
8
Exercise
• Check for consistency between the drug reported and
the method of ingestion for the second and third drugs
of use
• What additional logical tests could be completed on the
data in main.sav?
Summary
•
•
•
•
•
•
Data entry errors
Out-of-range errors
Logical errors
Conditional statements
Selecting cases
Reports