OpenClinica Data Mart
Download
Report
Transcript OpenClinica Data Mart
OpenClinica sqldatamart
Lindsay Stevens
Viral Hepatitis Clinical Research Program
Kirby Institute, UNSW Australia
1
Agenda
Context
Features
Setup
Client use cases
Questions, or:
Implementation details
Feature ideas
©
#OC15Europe
Context
Clinical research program
Focus on Hepatitis C treatment and prevention
Mainly phase 3, 4, observational studies
2-3 projects in/out per year
Local and International participating institutions
©
#OC15Europe
Problem: Study Setup
Typical scope:
1. OpenClinica study
2. SAS extraction code (from XML to CSV or Excel)
3. Reports (typically Access)
a) Standard reports (subject list, discrepancies, etc.)
b) Study-specific reports
4. Something special (tablet forms, system integrations, etc.)
sqldatamart removes steps 2 and 3a, simplifies 3b
©
#OC15Europe
Problem: Running Reports
Typical steps:
1. Full study extract to XML
2. Run SAS code
3. Open report in Access
sqldatamart removes steps 1 and 2
©
#OC15Europe
Requirements
1. Easier setup per study
2. Easier to run reports
3. Provide access to all clinical and operational data
4. Low maintenance
5. Restrict study data access to staff with OpenClinica role
©
#OC15Europe
Previous Solution
Access generates a report database in postgres
Written using SQL and Access VBA
Limitations:
Access can only handle 255 columns per table
Access is single core only, and a bit slow
Roles used static passwords
Refresh required dropping the report database
Making friends with postgres allows many cool things
©
#OC15Europe
Current Solution
Postgres generates a report database in postgres
Written using SQL and PLPGSQL (postgresql procedural
language)
Limitations:
Synchronous commits during function execution
Can use dblink extension if this becomes a problem
Concurrent materialized view refresh requires unique index
Can use array index
©
#OC15Europe
Feature Overview
Common study data
Subject details, groups, event / CRF status
SDV status, discrepancy notes
Metadata, study user roles
Study specific data
One matview per item group (long item names)
One view per item groups (short item names)
Support functions
Snapshot study to SAS / Stata files
Create pass thru views in Access
©
#OC15Europe
Open Sourced
All the code is on github
github.com/lindsay-stevens-kirby/openclinica_sqldatamart
MIT License
Documentation is in the repository wiki
Code or bug report contributions welcome
Written to suit our OpenClinica usage patterns
If your pattern is incompatible, maybe you or I can add a fix
©
#OC15Europe
Some Postgres Terminology
Matview = Materialized View
A table that has an associated query which can be used to
regenerate the table
Allows the database to cache (store) the results of a query
FT = Foreign Table
A table that is actually a view to a resource outside of the
current database
Allows the database to access other databases, files, etc.
©
#OC15Europe
Data Flow: Diagram
1
2
©
#OC15Europe
3
Data Flow: 1
openclinica_fdw schema
Facilitates connection to OpenClinica database
Copy of some catalog tables to look up the definition of
tables, views and indexes
Copy of these tables, views and indexes
Home of the support functions
©
#OC15Europe
Data Flow: 2
dm schema
Facilitates running the core queries in one go for all data
in the instance, keeping their definitions in one place
Home of the support queries
©
#OC15Europe
Data Flow: 3
Study schemas
Facilitates segmenting data by study
Transforms data into item group matviews
©
#OC15Europe
Some OpenClinica Terminology
ODM Schema Definition = Metadata
Schema Instance = Clinicaldata
Study
My Study
Subject
Subject 1
Event 1
Repeat 1
Event
CRF
CRF 1
ItemGroup
ItemGroup 1
Repeat 1
Item
Item 1 Value
©
#OC15Europe
Common Matviews
From dm schema, copied to each study, currently 15.
1. Clinicaldata: item data with some metadata
2. Discrepancy_notes_all: all parent and child notes
3. Discrepancy_notes_parent: just the parent notes
4. Metadata: study level metadata
5. Metadata_event_crf_ig: filtered metadata, top part
©
#OC15Europe
Common Matviews
6. Metadata_crf_ig_item: filtered metadata, bottom part
7. Metadata_study: study info, plus the schema name
8. Response_set_labels: CRF item value labels
9. Sdv_status_history: current and previous sdv statuses
10.User_account_roles: user accounts with study roles
©
#OC15Europe
Common Matviews
11.Subject_event_crf_expected: subjects x all possible
12.Subject_event_crf_status: what exists
13.Subject_event_crf_join: results of 11+12
14.Subject groups: group class assignment details
15.Subjects: subject details
©
#OC15Europe
Item Group Matviews
Named using item group OID
Row identifier columns
Study
Site
Subject
Event
CRF
Item group
©
#OC15Europe
Item Group Matviews
Item columns
Each item gets a column
If study has item names > 12 characters, name is item_oid
Otherwise, first 12 of item_name then first 45 of description
Coded items get a second “_label” column
my_item
my_item_label
Multi-choice items split into column pairs for each choice
Choice value appended to name
Probably problems if using long coded values
©
#OC15Europe
Item Group Matviews
Item data values are stored as text
There is only one “item value” column varchar(4000)
Each value is cast to a type
TEXT: ST, PDATE, FILE, “_label” columns
NUMERIC: INT, REAL
DATE: DATE
We don’t use PDATE or FILE so they don’t get any
special handling
©
#OC15Europe
Item Group Alias Views
Each item group matview has an twin “Alias” view
Alias view is for helping statistical software
SAS and Stata variable name limits around 32 characters
item_name is probably the shortest meaningful identifier
Alias view renames matview names using item_name
Example:
SELECT myitem_my_first_item_with_a_really_long_name
AS myitem
FROM ig_my_matview
©
#OC15Europe
Group Roles
Group roles are collections of permissions which can be
assigned to Login roles (real users) or other group roles
A dm_admin group role is created, which owns all
sqldatamart objects
A group role is created for each study schema
Study group roles have read only access to objects in
that schema
Named like dm_study_my_study_name
©
#OC15Europe
Login Roles
A login role is created for each OpenClinica user
The local name of the user email is used
E.g. email [email protected] => Lstevens
Login roles are given group roles matching OpenClinica
How these login roles authenticate is up to you
Advanced setup describes single-sign on with SSPI
Could use passwords, certificates, LDAP, etc.
©
#OC15Europe
Basic Setup Overview
Create a select-only role in the OpenClinica database
Allow connections from a report database using that role
Install postgres 9.3+,
Install pgAgent (scheduled task runner)
Run bat script updated with the environment details
Or execute the psql statements manually
Or rewrite for bash
Create pgAgent job to refresh the database as desired
©
#OC15Europe
Advanced Setup Overview
All of basic, plus some bonus how-tos
Encrypted connections
Between OpenClinica and report database server
Between report database server and client
Single sign on using SSPI authentication against AD
Data access logging
Performance settings suggestions
Least-privilege maintenance role
©
#OC15Europe
Maintenance
pgAgent used as job runner
Could implement using Task Scheduler, Cron, etc...
Job runner tasks:
Refresh data (if study not locked recently)
Build schema for new study if not exist
Rebuild study if event or CRF definitions updated
Sync user roles (create / revoke / grant)
Speed:
pgAgent keeps statistics so you can tune refresh frequency
Takes ~5 min for our 11 study database, 500k item_data
©
#OC15Europe
Clients
Anything that can talk to postgresql
Most practical for office environment is ODBC
Many common applications can talk ODBC
SAS, Stata, Microsoft Office, LibreOffice, etc.
Many languages have their own database connection
libraries e.g. python, java, c#, R
©
#OC15Europe
Client Setup
psqlODBC driver
Optional:
If using advanced setup, copy of report database server
issuer certificate for verification
Might be unique to my environment but Access seems to
have very short DNS lookup timeout
Currently solved by adding record to hosts file
I submitted a patch for psqlODBC driver to allow specifying a
postgres service name as a connection parameter and putting
the host IP in that, but the patch has not been accepted ... yet
©
#OC15Europe
Client Helper Scripts
SAS script:
Interacts with special sqldatamart function
Generates code to export a study to .sas7bdat
Accepts general filters e.g. if you want only subject “x”
Runs said code
Result is a “snapshot” of all study data
Stata script and function, same as above except to .dta
These special functions live in the “public” schema so all
users can execute them
©
#OC15Europe
Client Helper Scripts
Access VBA module:
Import into a database
Run the function, providing parameters:
Study schema name
ODBC connection string or path to FileDSN
Creates pass through queries for each study matview
©
#OC15Europe
Client Examples - SAS
SAS can interact with postgres without snapshot files
Create a foreign library for the study:
LIBNAME myStudyNameLibName ODBC
SCHEMA=myStudyNameSchemaName
NOPROMPT="fileDSN reference string, or ODBC connection string";
RUN;
Use this library, e.g. for table of subjects per site by sex:
DATA myStudyNameLibName.subjects ;
PROC FREQ; TABLE site_name*subject_sex; RUN;
©
#OC15Europe
Client Examples - Stata
Stata can interact with postgres without snapshot files
Load a dataset
odbc load, table("myStudyNameSchemaName.subjects") noquote
connectionstring("fileDSN reference string, or ODBC connection
string")
Use the dataset, e.g. for table of subjects per site by sex:
tabulate site_name subject_sex
©
#OC15Europe
Client Examples - Access
Access can interact with postgres via pass thru queries
Run the setup code or manually create each p-t query
Equivalent table of subjects per site by sex:
TRANSFORM Count(subjects.subject_id) AS CountSubject
SELECT subjects.site_name
FROM subjects
GROUP BY subjects.site_name
PIVOT subjects.subject_sex;
Complex queries best done as pass thru since Access
downloads each full set to execute query locally
©
#OC15Europe
Implementation Details
Batch script feeds psql all CREATE FUNCTION files
Batch script feeds psql variables for build script
psql substitutes variables into build script
Build script uses the functions to create database
Create foreign table for catalog to get database definition
Use catalog to create foreign tables for all tables, views
©
#OC15Europe
Implementation Details
Create matview for each foreign table: “cache layer 1”
Create “dm” schema with core queries: “cache layer 2”
Clinicaldata, metadata, metadata-derivatives, etc
Create study schemata
Copy core queries filtered for that study
Create item group matviews: “cache layer 3”
User creation
©
#OC15Europe
Implementation Details
Item group matviews pivot data using max aggregates
Dynamic SQL creates column names, types, case
statements to filter out blanks and null codes
Simplified representation of what happens:
O
C
D
M
Itemgroup
item
Item_value
Visit
visit_date
2015-06-01
Visit
visit_attended
Yes
Visit
visit_reason_not_done
‘’
Itemgroup
visit_date
visit_attended
visit_reason_not_done
Visit
2015-06-01
Yes
[NULL]
©
#OC15Europe
Customisation Examples
Change a core query (in dm schema)
The “dm_create_dm_...” functions are just wrappers
around a CREATE MATERIALIZED VIEW statement
Edit the query and rebuild the database
Change how login roles are created
Currently uses regex ^([A-z]+)@
Edit each “dm_users_...” function where this is used
Should delegate this to a function so it is easier to change
©
#OC15Europe
Customisation Examples
Add a new core query (in dm schema)
Copy/paste/rename existing “dm_create_dm...” function
Put your query between the $query$ string delimiters
Add a call to it in the build script:
After the call to dm_create_ft_openclinica_matview_indexes
Before the call to dm_create_study_schemas
Make sure it has a “study_name” column
Should be copied by dm_create_study_common_matviews
That is, you might not need to rebuild
©
#OC15Europe
Feature Ideas
Variable and value labelling in snapshot scripts
Reorganise clinical/metadata into modular views so that
each child is a composition instead of a “select distinct”
Snapshot script for R / python(pandas)
Add more database IDs everywhere
For unique indexes on item group matviews
For links in reports that go straight to OpenClinica pages
For data cleaning tracking against custom views
Mechanism to automatically re-create post-build custom
study postgres views after rebuild
©
#OC15Europe