No Slide Title

Download Report

Transcript No Slide Title

Data Management for Research
Michael A. Kohn, MD, MPP
7 January 2003
Assumptions about Students
• Actively involved in a clinical research
study
• Some experience with entering and
maintaining data in single-table spreadsheet
or statistical software
Housekeeping
• Check Lab sheet to confirm which lab you
are in.
• Labs will be in Library 222.
• Bring a diskette and your syllabus to labs.
• Syllabus and “Learn MS Access 2000” CD
Requirements
• Turn in all 4 assignments on time
• Turn in course evaluation
Course Objective
Learn how to develop a multi-table, relational
database for a research study using
Microsoft Access.
Example: Infant Jaundice Study
Data Management for Clinical Research
We know how to
• define the study population, the independent
variables and the outcome variables;
• measure these variables and anticipate problems with
measurement;
• analyze the results.
Data Management for Clinical Research
The DBMS (Database Management System) is for
• entering and storing the measurements,
• entering and storing the other information necessary to
administer the study (subject contact information, exam
schedules, reimbursement records, etc.),
• monitoring the study, and
• either analyzing the results or formatting the results for
analysis.
Four Types of Research Database
1. Combination of paper files, Excel spreadsheets,
and direct keyboard entry into the statistical
analysis package.
2. Desktop multi-table relational database.
3. Client-Server multi-table relational database.
4. Internet database server.
Advantages of a computerized database
•Ease of data entry
•Automatic data validation
•Automatic error checking
•Alternative is a stack of paper forms
Advantages of a Multi-Table Relational
Database
•Eliminates redundancy
•Ensures data integrity
Note: Unless you plan on doing your analysis long-hand,
you always need a computer database of some sort (a Stata
dataset or an Excel spreadsheet may be adequate); you
don’t always need a multi-table relational DBMS (like
Microsoft Access).
Multi-Table Relational Database
Collection of spreadsheet-like, twodimensional tables.
Rows in Tables = Records
Columns in Tables = Attributes
Tables are related one-to-many, many-to-many,
and one-to-one.
Jaundice and Infant Feeding Study
Cohort study to determine the 5-year
neuropsychiatric sequelae of infants with
neonatal jaundice or feeding disorders.
Infant Jaundice Study
(Our fictional version of JIFee)
Methods:
Design-Cohort study.
Setting-Single, urban medical center
Subjects-Infants with neonatal jaundice and randomly selected
non-jaundiced infants
Independent Variable-Presence or absence of jaundice
Outcome Variable- Neuropsychiatric score (ranging from
55 to 145) at age 5
Analysis- ?
Infant Jaundice Study Data
1. Approximately 400 children
2. 5 examiners (doctors)
3. Approximately 700 neuropsychiatric examinations,
measuring weight, height, and “NPScore” (IQ)
4. Some children to be examined more than once
5. No examiner to see the same child twice
6. If child died before age 5, store age and circumstances of
death
Assignments
Lab 1: Tables and Relationships 1/14/2003
Send Access file LastnameLab1.mdb to Andrew High,
([email protected]) by 1/20/2003.
Lab 2: Forms, Queries, and Reports 1/21/2003
Send Access file LastnameLab2.mdb to Andrew High,
([email protected]) by 1/27/2003.
We will work through these assignments in the labs, so
you don’t need to have Access2000 at home.
Assignments (cont’d)
Lab 3: Exporting and Analyzing Data 1/28/2003
Determine if neonatal jaundice was associated with
the 5-year neuropsychiatric scores and create a table,
figure, or paragraph appropriate for the “Results”
section of a manuscript summarizing the association.
Write a sentence or two for the “Methods” section
on inter-rater reliability. (Use Bland and Altman,
BMJ 1996; 313:744)
Send assignment to [email protected] by 2/3/2003.
Assignments (cont’d)
Class session 5 (not a lab): Planning and Budgeting for
Data Management 2/4/2003
Write a one-page data management section for your
research study protocol and create a budget for data
management. If you do not have your own research
study protocol, write the data management section and
create a budget for the fictional Infant Jaundice Study
protocol. We can compare your budget with the real
budget of JIFee.
Send assignment to [email protected] by 2/10/2003.
Table of Study Subjects
Infant Jaundice Study Table of Subjects = “Baby”
Row = Individual Infant
Columns = ID#, Name, DOB, Sex, Jaundice.
If one set of measurements per infant, put
measurements in subject table.
This is a single-table database.
Demonstration
Label columns and enter rows of data in datasheet view
Demonstration
Table design or “data dictionary” view
Demonstration
Disallowed values
Duplicate primary keys
This automatic error checking and data validation IS
why you need to enter your data into a computer; it is
NOT why you need a relational DBMS. Many singletable products (Filemaker Pro, SAS FSP, even Excel)
can do error checking and data validation.
Acceptable table showing one set of exam results per participant.
Table of Study Subjects
Table of Study Subjects
Row = Individual Infant
Columns = ID#, Name, DOB, Sex, Jaundice
If some infants have more than one exam,
what do you do?
Unacceptable table showing multiple exam results per study participant.
Common Error
• If you find yourself creating multiple
columns for the same measurement, e.g.,
Date1, Score1, Date2, Score2, Date3,
Score3, …
• Or if your table is more than about 30
columns wide,
– It is time to restructure your table.
Unacceptable table with participant-specific data duplicated for each exam.
(Note problem with Helen’s DOB.)
Normalization
If some infants have multiple exams,
“normalize” the records into two tables, one
for subjects and one for examinations.
Figure 6. Data normalized into two tables: one (“Baby”) with rows comprising
subject-specific information; the other (“Exam”) with rows comprising examspecific information. Note that Helen can only have one birth date. Subjects
with no exams, e.g. Alejandro, still appear in the database. “SubjectID”
functions as the primary key in the “Baby” table and as the foreign key in the
“Exam” table.
Figure 7. Relationships diagram showing the one-to-many relationship between
the table of subjects (“Baby”) and the table of measurements (“Exam”).
Demonstration
Inability to create integrity violations with normalized tables.
This IS why you need a multi-table relational DBMS.
Table of Examiners
Neuropsychiatric outcomes are assessed by 5 different
examiners (doctors)
May want to assess whether examiner characteristics
(sex, specialty, age) affect neuropsychiatric scores
Doctor examines many children; each child may have
more than one exam; but a child is never examined by
the same doctor twice.
Table of examiners with multiple examiner-specific fields.
Figure 9. Undesirable table in which examiner-specific data is repeated with
each examination. (Note that Dr. Novello is a female pediatrician for two
examinations and a male internist for an exam in between.)
Figure 10. Normalization into two tables, one for exam-specific information and
one for examiner-specific information. (Note that Dr. Novello cannot change
specialty or gender between examinations.) “DocID” functions as a second
foreign key in the “Exam” table. (The other foreign key is “SubjectID”.)
Figure 11. Relationships diagram showing the relationships between the
table of subjects (Baby), the table of measurements (Exam) and the table
of examiners (Doctor). The “Exam” table functions as a linkage or join
table between “Baby” and “Doctor” creating a “many-to-many”
relationship between study subjects and examiners.
One-to-One Relationship: Infants and Deaths.
Some fields are subject specific but valued for only a few subjects.
Maintaining columns for these fields in the table of subjects leads to empty
fields and wasted space.
Figure 12. Some fields are subject specific but valued for only a few subjects.
Maintaining columns for these fields in the table of subjects leads to empty
fields and wasted space.
Figure 13. Creating a separate table with a one-to-one relationship eliminates the
empty fields and wasted space.
Figure 14. The relationships diagram now includes a table (“Death”) with a oneto-one relationship with the table of subjects (“Baby”). A subject can only have
one record in the one-to-one-related table, but the vast majority of subjects will
not have any “Death” record.
Undesirability of Storing Calculated Values
Store raw data, not calculated fields, e.g., store dates and
times; calculate intervals.
Storing a patient’s birth date allows calculation of his or
her exact age on the date of a particular measurement.
Figure 15. Calculated fields such as “AgeInMonths” are undesirable. What if the birth
date for SubjectID 2322 (Helen) is corrected in the “Baby” table?
Demonstration
• Calculate Age
• Calculate BMI
Standard Data Entry Conventions
Several conventions for data entry and display have
developed over time. Although most users of screen forms
are not aware of these conventions, they have come to
expect them subconsciously. For example, a series of
mutually exclusive, collectively exhaustive choices is
usually displayed as an “option group” consisting of several
different “radio buttons”, whereas choices which are not
mutually exclusive are displayed as check boxes.
N.B. An “option group” of mutually exclusive choices is a
single column or field. A group of N check boxes
represents N yes/no fields.
Use check boxes when
options are not mutually
exclusive. (5 fields)
Use radio buttons when
options are mutually
exclusive. (1 field)
Computer chart abstraction form showing two common data entry
conventions.
Demonstration
Option group for examiner’s medical specialty
Guidelines for Data Management in Clinical Research
Establish the database tables, their rows and columns, and
their relationships correctly at the outset.
A poorly organized database makes data maintenance and
retrieval nearly impossible. Make sure the data are
normalized. The data structures should never require duplicate
data entry or redundant storage.
? MS Genetics Example
Guidelines for Data Management in Clinical Research
Establish and follow naming conventions for columns and
tables.
Short field names without spaces or underscores are
convenient for programming, querying, and other
manipulations. Instead of spaces or underscores, use
“IntraCaps” (upper case letters within the variable name)
to distinguish words, e.g. “StudyID”, “FName”,
“FdDisord”, or “ExamDate”. Table names should be
singular, e.g. “Baby” instead of “Babies”, “Exam” instead
of “Exams”.
Guidelines for Data Management in Clinical Research
Obtain baseline demographic and clinical information about
members of the study population from existing computer
databases.
Avoid re-entering data which are already available (in digital
formats) from other sources. In the JIFee Study, the patient
demographic data and contact information are obtained from
the hospital database. Computer systems can almost always
produce text-delimited or fixed-column-width character files
that the database management system can import.
Guidelines for Data Management in Clinical Research
Minimize the extent to which study measurements are
recorded on paper forms.
Enter data directly into the computer database or move data
from paper forms into the computer database as close to the
data collection time as possible. When you define a variable
in a computer database, you specify both its format and its
domain or range of allowed values. Using these format and
domain specifications, computer data entry forms give
immediate feedback about improper formats and values that
are out of range. The best time to receive this feedback is
when the study subject is still on site.
Guidelines for Data Management in Clinical Research
Back up the database regularly and check the adequacy
of the back up procedure by periodically restoring a file
from the back up medium.
Desktop DBMS
The processing of records is done by the desktop. The server
simply stores files (file server).
Microsoft Access
Claris Filemaker Pro
Paradox
Microsoft Visual FoxPro
Dataease
Client-Server DBMS
The processing of records is done by the server. The desktop
manages the screen, but passes queries on to the server. (Just
to confuse things, MS Access can be a client for SQL Server,
and other enterprise systems. The ultimate in “thin” clients is
a browser (Internet Explorer). In this case, the server is an
intranet or internet database server.)
Microsoft SQL Server
Oracle
Informix
Sybase