Document 62861

Download Report

Transcript Document 62861

Relational Databases and
Statistical Processing
Andrew Westlake
Survey & Statistical Computing
+44 20 8374 4723
[email protected]
www.SaSC.co.uk
Introduction
• Good Statistical Analysis needs Good Data
– Many things contribute to data quality
• Quality of structures and processes for data collection
related to quality of data
• Good Ideas from Computer Science
– Process design methodologies
– Database methodologies
– Data and process modelling systems
• Statisticians can benefit from CS concepts
– Particularly for design of complex structures and
continuous processes
15-Mar-03
Statistical Databases and RDBMS © S&SC
2
Data Structure
• Statistics generally based on a single rectangular dataset
– Much statistical data is conceived and stored in this form
• Database professionals often use Relational Database
systems
– Formal model, based on a set of related rectangular datasets
– Allows the representation of more complex structures, and
easier maintenance and manipulation of the data.
• RDB approach valuable for some statistical situations
– Review RDB concepts and functionality, plus SQL
– Statistical Examples
• Pakistan Fertility Survey
• Processing and analysis system for HIV and AIDS reporting for
England and Wales
15-Mar-03
Statistical Databases and RDBMS © S&SC
3
The Relational Model
• A logical specification of the content and behaviour of
a database management system, including
• The types of structure that can be present in a database
• The properties of elements that can be stored in these structures
• The operations that can be performed on these structures and their
behaviour
• Facilities that must be present in the database management system
• The general nature of the interactions between the database and its
users and administrators
– Codd’s Rules specify properties that a Relational DataBase
Management System (RDBMS) must possess
• SQL
– A standard language with which to interact with a RDBMS.
15-Mar-03
Statistical Databases and RDBMS © S&SC
4
Relational Database System Structure
DataBase
Management
System
User Processes
P
Schema
Design
P
P
Data
Entry
Database
D
Data
Dictionary
D
Data
D
Rules
DBMS
(SQL)
Insertion
Deletion
Updates
Consistency
Retrieval
Selection
Transformation
Sorting
Grouping
Aggregation
P
Transaction
Processing
P
MIS
Reports
User
Processes
Data
Modelling
SQL
Relational
Model
Storage &
Access
Methods
Applications &
Tools, include
functionality,
semantics
appropriate to
application
Expressing the
real task in
terms of the
conceptual
model
Standard
interface to an
RDBMS, syntax
and embedding
Specification of
functionality,
behaviour and
scope
Implementation
issue, affects
performance
15-Mar-03
Statistical Databases and RDBMS © S&SC
5
RDBMS Strengths
• Relational Model
– Precise, formal mathematical specification of structure and
behaviour
– Appropriate for information about Entities, and
Relationships between them
• Data Modelling
– Useful tools for understanding data structures and flows
• SQL
– International Standard (SQL2), widely implemented
• Current Implementations
– Widely available, well supported, good implementations,
integration with other products, add-on market for tools
15-Mar-03
Statistical Databases and RDBMS © S&SC
6
Components of the Relational Model
• Domain
• a set of possible (atomic) values
• Attribute
• defined over a domain, has a name, cf. variable
• Tuple
• a set of values, one associated with each attribute, cf. cases or
records. NULL values supported
• Relation
• defined over a set of attributes, has a name, consists of a set
of tuples
• Relational DataBase
• a set of relations
15-Mar-03
Statistical Databases and RDBMS © S&SC
7
Relations
Relation
Cancer_
Registration
Domain
Registration
ID
ID
Attribute
Tuples
4951
4952
4953
4954
4955
4956
4957
4958
4959
4960
4961
Gender Age in years
Sex
Age_at_
Registration
2
1
2
2
1
1
1
2
1
2
1
58
68
73
52
49
68
87
74
69
92
45
Relation Lung_ Diseases
Domain
Attribute
ICD
Diagnosis
String
Disease_Name
Tuples
162
501
503
Cancer of the Lung
Asbestosis
Mesothelioma
15-Mar-03
ICD
Age in
years
SEG Age_at_ Diagnosis
Diagnosis
Weight in
Kg, 1dp
Weight
SEG
54.5
75.8
62.3
48.7
95.2
112.7
84.2
69.4
53.0
83.0
67.4
3
2
5
2
4
3
4
4
1
5
5
53
60
70
45
47
61
85
70
69
92
41
194
285
162
501
162
196
203
162
186
162
503
• Rectangular Data tables, in which
columns are variables, rows are cases.
• Tables can be linked by the values of
common variables.
Statistical Databases and RDBMS © S&SC
8
SQL
– International Standard, actively revised
– Widely available in good RDBMS software
– Text language, used by programs and people
• Stored or constructed at run-time
– Designed to support tools which are independent of the DBMS
• cf. Client-Server architecture
– User and Programmer skills portable across products and sites
– Has sections for defining database structure (DDL),
Manipulating database content (DML)
Ensuring database integrity, and
Managing database security
15-Mar-03
Statistical Databases and RDBMS © S&SC
9
SQL Data Manipulation
• All manipulation (retrieval) of
data uses the single Select
statement, which has various
components corresponding to
different relational operations.
• The result of a SELECT
statement is a relational table,
which is displayed (by default)
or can be stored or processed in
another statement.
• Retrieval is usually done
through a Query interface,
which generates the SQL.
15-Mar-03
SELECT {DISTINCT}
<expression list>
FROM <table list>
WHERE <condition>
GROUP BY <column list>
HAVING <group condition>
ORDER BY <column list>
Statistical Databases and RDBMS © S&SC
10
Data Manipulation - Join
• The Join operation selects the combination of rows
from two or more tables that meet a matching
condition
SELECT R1.*, R2.*
FROM R1, R2
WHERE A = X
• All matching combinations are retained
– Eg can relate household information to all members via
household number
Join A=X
A
B
C
R1.D
X
R2.D
E
1
b1
c1
d1
1
d3
e1
2
b2
c2
d2
2
d5
e2
3
b3
c3
d3
3
d3
e3
15-Mar-03
Statistical Databases and RDBMS © S&SC
11
Data Manipulation - Aggregation
• Aggregation functions can be used in the variable list
SELECT
COUNT(DIAGNOSIS), COUNT(DISTINCT DIAGNOSIS),
AVE(WEIGHT)
FROM
CANCER_REGISTRATION
– This produces one record for the whole table containing
three values, the number of records with non-null values
for Diagnosis, the number of distinct Diagnosis values, and
the average Weight.
– The available functions are COUNT, SUM, AVE, MIN,
MAX.
– Aggregation functions can be applied to expressions.
15-Mar-03
Statistical Databases and RDBMS © S&SC
12
Data Manipulation - Grouping
• The GROUP BY clause is used to perform aggregation
within subgroups, producing a record for each group.
SELECT
DIAGNOSIS, COUNT(DIAGNOSIS), AVE(WEIGHT)
FROM CANCER_REGISTRATION
GROUP BY DIAGNOSIS
ORDER BY DIAGNOSIS
– This produces a record for each diagnosis containing the number of
registrations and their average weight as well as the diagnosis
code.
– It is sorted into diagnosis code order, using the ORDER BY clause.
– The Group By clause can contain multiple fields (for crosstabulation).
– The expression list can contain variables used in the Group By
clause, and aggregate functions for any variable.
– A WHERE clause selects the records to be aggregated.
– A HAVING clause selects the groups to be returned.
15-Mar-03
Statistical Databases and RDBMS © S&SC
13
Views
• Stored query definition
– Can be evaluated on demand
– Used to present data in the form needed by the user (a
person or a process)
• Dynamic evaluation
– Ensures that the viewed information is up to date
– May be inefficient if the information does not change
15-Mar-03
Statistical Databases and RDBMS © S&SC
14
Database Design
• Choose physical tables and fields
– Capture essential characteristics of real-world system
– Support intended uses of the information (directly or
through views)
– Operate efficiently
• Identify important links between tables
– Provide appropriate Keys
– Choose Indexes and other implementation details
• Retain flexibility for future needs
• Various methodologies for design process
15-Mar-03
Statistical Databases and RDBMS © S&SC
15
RDB Summary
• Relational databases are ubiquitous, and are useful for largescale data collections
• Some manipulation and aggregation operations can be done
more easily than in statistical packages
• Relational model is a useful way of thinking about data
structures
– May suggest different data structure
•
•
•
•
No replacement for statistical packages for statistical analysis
Flexible manipulation of more complex data structures
Useful development tools for processes
Interface to Views for Statistical Analysis
– ODBC or similar
– Extract micro- or macro-data for statistical analysis
15-Mar-03
Statistical Databases and RDBMS © S&SC
16
Databases for Statistics
• Simple situations
– Use statistical or survey packages
• Studies with Complex Structure
– Use RDB for flexible manipulation
– May store static result tables
• Ongoing Data Capture Processes
– Use RDB for dynamic data entry, updating, matching and checking
– Choose data structure to optimise data processes and quality, not
ease of statistical processing
– May take static snapshots for regular statistical analysis
• Statistical Dissemination – another presentation
• Statistical Metadata – yet another presentation
15-Mar-03
Statistical Databases and RDBMS © S&SC
17
PFFPSWID
PK,FK1
PK,FK1
PK,FK1
PFFPS – Main
structure
PFFPSHID
PK,FK1,I1
PK
Weights
PK
Cluster
Weight
15-Mar-03
I10
I1
I9
I3
I4
I5
I6
I11
CLUSTER
H_NO
RECORD_TYP
PROVINCE
AREA
DISTRICT
VISIT
F_DAY
F_MON
F_YEA
INTERVIEW
SUPERVISOR
DEO
RESULT
Q16T
Q16M
Q16F
Q17
Q18
Q19
Q20
Q21
Q22
Q23_1
Q23_2
Q23_3
Q23_4
Q23_5
Q23_6
Q23_7
Q23_8
Q24
Q25_1
Q25_2
Q25_3
Q25_4
Q26
Q27
Statistical Databases and RDBMS © S&SC
PFFPSHHM
PK,FK1,I1
PK,FK1,I1
PK
CLUST_HHM
H_NO_HHM
Q01
RECORD_TYP
PROV_HHM
AREA_HHM
Q03
Q04
Q05
Q06
Q07
Q08
Q09
Q10
Q11
Q12
Q13
Q14
Q15
CLUST_HHM
H_NO_HHM
Q01
RECORD_TYP
PROV_WID
AREA_WID
LINE_WID
DIST_WID
F_DAY_WID
F_MON_WID
F_YEA_WID
VISIT_WID
WOM_RES
Q101H
Q101M
Q102
Q103
Q104
Q105M
Q105Y
Q106
Q107
Q108L
Q108C
Q109
Q110
Q111
Q112
Q113
Q114
Q115
Q201
Q202
Q203S
Q203D
Q204
Q205S
Q205D
Q206
Q207S
Q207D
Q208
Q210
Q218
Q219
Q220
Q221
Q222
Q223
Q224
Q225_1
Q225_2
Q225_3
Q225_4
Q225_5
Q225_6
Q225_7
Q226
Q227
Interview CMC
Birth CMC
18
Data Tables for PFFPS
Weights
PK
PK,FK1
PK,FK1
PK,FK1
PK
Cluster
PFFPSHID
PK,FK1,I1
PK
PFFPSWBH
CLUSTER
H_NO
PFFPSWID
PK,FK1
PK,FK1
PK,FK1
CLUST_WID
H_NO_WID
L_NO_WID
CLUST_WBH
H_NO_WBH
L_NO_WBH
LINE_WBH
PFFPSWPB
PK,FK1
PK,FK1
PK,FK1
PK,FK1
CLUST_WPB
H_NO_WPB
L_NO_WPB
LINE_WPB
PFFPSWC1
PK,FK1
PK,FK1
PK,FK1
PK
CLUST_WC1
H_NO_WC1
L_NO_WC1
Q301
PFFPSHHM
PK,FK1,I1
PK,FK1,I1
PK
CLUST_HHM
H_NO_HHM
Q01
15-Mar-03
PFFPSWC2
PK,FK1
PK,FK1
PK,FK1
CLUST_WC2
H_NO_WC2
L_NO_WC2
Statistical Databases and RDBMS © S&SC
PFFPSWMF
PK,FK1
PK,FK1
PK,FK1
CLUST_WMF
H_NO_WMF
L_NO_WMF
PFFPSWSE
PK,FK1
PK,FK1
PK,FK1
19
CLUST_WSE
H_NO_WSE
L_NO_WSE
PFFPS Meta Data Structure
Record
PK
Category
Record
PK,FK1
PK,I1
Field
Code
Dictionary
15-Mar-03
Field
FK1,I1
Record
Range
Skip
PK,FK1
PK,I1
PK
Field
Code
PK,FK1
PK
Field
Lower
Statistical Databases and RDBMS © S&SC
20
HIV and AIDS reporting section, PHLS
• Redesign of processing and reporting system
• Old system (1994) based on files processed with
Epi-Info and Aspect database
• Requirements
–
–
–
–
15-Mar-03
Simplify management of data
Improve consistency
Simplify regular reporting
Enable ad hoc reporting
Statistical Databases and RDBMS © S&SC
21
Old System
• Input Notifications
• Data Storage
– HIV Tests (standard form)
– AIDS Diagnosis (standard)
– ONS Death Reports
• Some notifications on disc,
most on paper
• Processes
• Reporting
– Quarterly standard reports
– Ad Hoc inquiries (e.g. PQs)
– Occasional in-depth analysis
• Problems
– Continuous Data Entry from
paper
– Regular Linking for reporting
extractions
– Occasional Matching within and
between files
15-Mar-03
– Separate HIV and AIDS files
– Deaths Added
–
–
–
–
Statistical Databases and RDBMS © S&SC
Duplicates and omissions
Name matching
Repeated linking
Inflexible for reporting (eg
progression from HIV to AIDS)
22
Old Structure
HIV Forms
HIV
Notification
File
Quarterly
Extracts
AIDS Forms
Death Reports
on disc
15-Mar-03
AIDS
Notification
File
Statistical Databases and RDBMS © S&SC
23
Initial Thoughts
• Notification forms appropriate for reporting, but not
for primary information store
• Information is about Patients
– Notifications are sources of information
– Record information about events and states
• Do matching once, at initial entry
• Unified database structure
• Additional components for metadata, auditing, contact
information
• Extracts to separate statistical reporting system
15-Mar-03
Statistical Databases and RDBMS © S&SC
24
Full system
15-Mar-03
Overall Structure
Notification Processing
Metadata System
Patient Data
External Organisation
Reporting Facilities
Auditing System
Statistical Databases and RDBMS © S&SC
25
Main Patient Tables
D AIDS Indicator
PK,FK1,I1 Patient ID
PK
Indicator Type Code
PK
Indicator Code
D Patient
PK
Patient ID
I11
I5
I2
I10
I7
I9
I8
I1
I6
I3
I4
FK1,U2,U1
Soundex
Init
DOB
Sex
Occ
PC Out
PC In
DHA
LA
Ethnic Gp
Exp
Audit ID
15-Mar-03
D HIV Infection
PK,FK2
FK2,U2,U1 Audit ID
Patient ID
U3
HivNo
FK3
Hospital ID
FK1,U2,U1 Audit ID
D AIDS Diagnosis
PK,FK2
Patient ID
U1
AidsNo
FK3
Hospital ID
FK1,U3,U2 Audit ID
Statistical Databases and RDBMS © S&SC
D Death Record
PK,FK2
Patient ID
FK3,U1
FK4,U5
U6
U4
I1
FK1,U3,U2
AidsNo
HivNo
ONS ID
Death ID
Date Death
Audit ID
26
Patients and Notifications
Patient
Repeating
Components
Optional
Components
HIV Infection
Alternative
IDs
(Soundex, DoB)
AIDS
Diagnosis
External ID
Source
Notifications
HIV
Notification
(green form)
(Context,
Institution)
AIDS
Indicator
Action
(Type, Date,
Outcome)
AIDS
Notification
Previous
Status
(blue form)
(Date)
ONS Death
Notification
Death
Record
Blood
Donation
Residence
(PostCode, Date)
Blood
Recipient
External
Files
Follow Up
Study
(Vicky)
15-Mar-03
Statistical Databases and RDBMS © S&SC
27
D Alternative ID
PK,FK2,I1
PK,I3
PK,I2
D User
Patient ID
Identifier Type Code
Identifier Value
PK
Full HAP
Data
Structure
User Code
PK
FK1,I1 Session ID
D HIV Infection
PK,FK2
D Session
FK1,I1 User Code
FK1,U2,U1 Audit ID
Patient ID
D Audit Block
U3
HivNo
FK3
Hospital ID
FK1,U2,U1 Audit ID
PK
D Death Record
Patient ID
FK3,U1
FK4,U5
U6
U4
FK1,U3,U2
AidsNo
HivNo
ONS ID
Death ID
Audit ID
PK
Notification ID
FK3
FK2,I2,I1
U3
FK4,I6
FK1,U1,U2
Patient ID
Batch ID
Notification No
Organisation ID
Audit ID
Audit ID
FK1,I1 Session ID
FK2,I2 Creation Session ID
D Notification
PK,FK2
Session ID
D Note
PK
Note ID
FK2,I2,I1
Patient ID
FK1,U2,U1 Audit ID
D Patient
PK
Patient ID
D Action
FK1,U2,U1 Audit ID
PK
Action ID
FK2,I4,I2
Patient ID
FK1,U1,U2 Audit ID
D External Patient ID
D AIDS Indicator
PK,FK1,I1
PK
PK
Patient ID
Indicator Type Code
Indicator Code
FK2,U2,U1 Audit ID
PK,FK2
Patient ID
Person ID
FK1,U2,U1 Audit ID
15-Mar-03
FK1,U1
U1
AidsNo
FK3
Hospital ID
FK1,U3,U2 Audit ID
PK
Organisation ID
FK2,I2
Director
FK3,I3
Primary Contact
FK1,U2,U1 Audit ID
Statistical Databases and RDBMS © S&SC
/ Person
Audit ID
D Contact
D Organisation
D Person
PK
PK,FK3,I3 Patient ID
PK,FK2,I1 Organisation ID
PK,I2
ID in Institution
D AIDS Diagnosis
/ works at
PK,FK3,I2
PK,FK2,I1
Organisation ID
Person ID
FK1,U2,U1 Audit ID
28
New Notification
Data Entry
• Get Patient details from
source
• Look for matching
existing patients
– May need expert to
decide
– Create new Patient if
needed
• Open appropriate form
for source to complete
transfer of Patient and
details
• Forms know how to
handle duplicates
(generally take earliest)
Enter common Patient identification
from a New Notification
External Data Sources
View Matches with existing Patients
Initiate entry of a new Patient or
updating an existing one
New Patient
Create a
new
Patient
record
Transfer
Patient
Transfer
information
from New
Notification
HIV Notification
AIDS Notification
Information about the Patient and
HIV Infection taken from a green
HIV form
Information about the Patient and
AIDS Diagnosis taken from a
blue AIDS form
Death Record
Add or Update Death
Information for a Patient
HIV Infection
15-Mar-03
Notification
Statistical Databases and RDBMS © S&SC
Death Record
AIDS Diagnosis
29
Extraction
• Separate Extraction Database template
– Linked to main database
– Set of associated Excel spreadsheet templates
• New copy each quarter
– Populate by Extraction of information through Views
– Gives a Snapshot of Patient status
• Set of aggregation queries
– Import results into spreadsheets as source
• Reorganise with Pivot tables
– Create publication versions
15-Mar-03
Statistical Databases and RDBMS © S&SC
31
Conclusions
• Good analysis needs good data
– Need right structure for efficient processing
• RDB provides flexible structure and manipulation
– Useful for complex situations
– Can produce the views needed for statistical analysis
– Quick to implement at a basic structural level
• Good framework for implementing production
processes
– Important to get the process right for good quality data
– Not cheap to implement production processes
• Can easily include associated information
– Metadata, other supporting information
15-Mar-03
Statistical Databases and RDBMS © S&SC
32
The End
• Extended versions of presentations available from
– www.sasc.co.uk
– Follow links to MSc in Statistical Computing
15-Mar-03
Statistical Databases and RDBMS © S&SC
33