OpenClinica Data Mart

Download Report

Transcript OpenClinica Data Mart

OpenClinica sqldatamart
Lindsay Stevens
Viral Hepatitis Clinical Research Program
Kirby Institute, UNSW Australia
1
Agenda
 Context
 Features
 Setup
 Client use cases
 Questions, or:
 Implementation details
 Feature ideas
©
#OC15Europe
Context
 Clinical research program
 Focus on Hepatitis C treatment and prevention
 Mainly phase 3, 4, observational studies
 2-3 projects in/out per year
 Local and International participating institutions
©
#OC15Europe
Problem: Study Setup
 Typical scope:
1. OpenClinica study
2. SAS extraction code (from XML to CSV or Excel)
3. Reports (typically Access)
a) Standard reports (subject list, discrepancies, etc.)
b) Study-specific reports
4. Something special (tablet forms, system integrations, etc.)
 sqldatamart removes steps 2 and 3a, simplifies 3b
©
#OC15Europe
Problem: Running Reports
 Typical steps:
1. Full study extract to XML
2. Run SAS code
3. Open report in Access
 sqldatamart removes steps 1 and 2
©
#OC15Europe
Requirements
1. Easier setup per study
2. Easier to run reports
3. Provide access to all clinical and operational data
4. Low maintenance
5. Restrict study data access to staff with OpenClinica role
©
#OC15Europe
Previous Solution
 Access generates a report database in postgres
 Written using SQL and Access VBA
 Limitations:




Access can only handle 255 columns per table
Access is single core only, and a bit slow
Roles used static passwords
Refresh required dropping the report database
 Making friends with postgres allows many cool things
©
#OC15Europe
Current Solution
 Postgres generates a report database in postgres
 Written using SQL and PLPGSQL (postgresql procedural
language)
 Limitations:
 Synchronous commits during function execution
 Can use dblink extension if this becomes a problem
 Concurrent materialized view refresh requires unique index
 Can use array index
©
#OC15Europe
Feature Overview
 Common study data
 Subject details, groups, event / CRF status
 SDV status, discrepancy notes
 Metadata, study user roles
 Study specific data
 One matview per item group (long item names)
 One view per item groups (short item names)
 Support functions
 Snapshot study to SAS / Stata files
 Create pass thru views in Access
©
#OC15Europe
Open Sourced
 All the code is on github
 github.com/lindsay-stevens-kirby/openclinica_sqldatamart
 MIT License
 Documentation is in the repository wiki
 Code or bug report contributions welcome
 Written to suit our OpenClinica usage patterns
 If your pattern is incompatible, maybe you or I can add a fix
©
#OC15Europe
Some Postgres Terminology
 Matview = Materialized View
 A table that has an associated query which can be used to
regenerate the table
 Allows the database to cache (store) the results of a query
 FT = Foreign Table
 A table that is actually a view to a resource outside of the
current database
 Allows the database to access other databases, files, etc.
©
#OC15Europe
Data Flow: Diagram
1
2
©
#OC15Europe
3
Data Flow: 1
 openclinica_fdw schema
 Facilitates connection to OpenClinica database
 Copy of some catalog tables to look up the definition of
tables, views and indexes
 Copy of these tables, views and indexes
 Home of the support functions
©
#OC15Europe
Data Flow: 2
 dm schema
 Facilitates running the core queries in one go for all data
in the instance, keeping their definitions in one place
 Home of the support queries
©
#OC15Europe
Data Flow: 3
 Study schemas
 Facilitates segmenting data by study
 Transforms data into item group matviews
©
#OC15Europe
Some OpenClinica Terminology
ODM Schema Definition = Metadata
Schema Instance = Clinicaldata
Study
My Study
Subject
Subject 1
Event 1
Repeat 1
Event
CRF
CRF 1
ItemGroup
ItemGroup 1
Repeat 1
Item
Item 1 Value
©
#OC15Europe
Common Matviews
 From dm schema, copied to each study, currently 15.
1. Clinicaldata: item data with some metadata
2. Discrepancy_notes_all: all parent and child notes
3. Discrepancy_notes_parent: just the parent notes
4. Metadata: study level metadata
5. Metadata_event_crf_ig: filtered metadata, top part
©
#OC15Europe
Common Matviews
6. Metadata_crf_ig_item: filtered metadata, bottom part
7. Metadata_study: study info, plus the schema name
8. Response_set_labels: CRF item value labels
9. Sdv_status_history: current and previous sdv statuses
10.User_account_roles: user accounts with study roles
©
#OC15Europe
Common Matviews
11.Subject_event_crf_expected: subjects x all possible
12.Subject_event_crf_status: what exists
13.Subject_event_crf_join: results of 11+12
14.Subject groups: group class assignment details
15.Subjects: subject details
©
#OC15Europe
Item Group Matviews
 Named using item group OID
 Row identifier columns






Study
Site
Subject
Event
CRF
Item group
©
#OC15Europe
Item Group Matviews
 Item columns
 Each item gets a column
 If study has item names > 12 characters, name is item_oid
 Otherwise, first 12 of item_name then first 45 of description
 Coded items get a second “_label” column
 my_item
 my_item_label
 Multi-choice items split into column pairs for each choice
 Choice value appended to name
 Probably problems if using long coded values
©
#OC15Europe
Item Group Matviews
 Item data values are stored as text
 There is only one “item value” column varchar(4000)
 Each value is cast to a type
 TEXT: ST, PDATE, FILE, “_label” columns
 NUMERIC: INT, REAL
 DATE: DATE
 We don’t use PDATE or FILE so they don’t get any
special handling
©
#OC15Europe
Item Group Alias Views
 Each item group matview has an twin “Alias” view
 Alias view is for helping statistical software




SAS and Stata variable name limits around 32 characters
item_name is probably the shortest meaningful identifier
Alias view renames matview names using item_name
Example:
SELECT myitem_my_first_item_with_a_really_long_name
AS myitem
FROM ig_my_matview
©
#OC15Europe
Group Roles
 Group roles are collections of permissions which can be
assigned to Login roles (real users) or other group roles
 A dm_admin group role is created, which owns all
sqldatamart objects
 A group role is created for each study schema
 Study group roles have read only access to objects in
that schema
 Named like dm_study_my_study_name
©
#OC15Europe
Login Roles
 A login role is created for each OpenClinica user
 The local name of the user email is used
 E.g. email [email protected] => Lstevens
 Login roles are given group roles matching OpenClinica
 How these login roles authenticate is up to you
 Advanced setup describes single-sign on with SSPI
 Could use passwords, certificates, LDAP, etc.
©
#OC15Europe
Basic Setup Overview
 Create a select-only role in the OpenClinica database
 Allow connections from a report database using that role
 Install postgres 9.3+,
 Install pgAgent (scheduled task runner)
 Run bat script updated with the environment details
 Or execute the psql statements manually
 Or rewrite for bash
 Create pgAgent job to refresh the database as desired
©
#OC15Europe
Advanced Setup Overview
 All of basic, plus some bonus how-tos
 Encrypted connections
 Between OpenClinica and report database server
 Between report database server and client
 Single sign on using SSPI authentication against AD
 Data access logging
 Performance settings suggestions
 Least-privilege maintenance role
©
#OC15Europe
Maintenance
 pgAgent used as job runner
 Could implement using Task Scheduler, Cron, etc...
 Job runner tasks:




Refresh data (if study not locked recently)
Build schema for new study if not exist
Rebuild study if event or CRF definitions updated
Sync user roles (create / revoke / grant)
 Speed:
 pgAgent keeps statistics so you can tune refresh frequency
 Takes ~5 min for our 11 study database, 500k item_data
©
#OC15Europe
Clients
 Anything that can talk to postgresql
 Most practical for office environment is ODBC
 Many common applications can talk ODBC
 SAS, Stata, Microsoft Office, LibreOffice, etc.
 Many languages have their own database connection
libraries e.g. python, java, c#, R
©
#OC15Europe
Client Setup
 psqlODBC driver
 Optional:
 If using advanced setup, copy of report database server
issuer certificate for verification
 Might be unique to my environment but Access seems to
have very short DNS lookup timeout
 Currently solved by adding record to hosts file
 I submitted a patch for psqlODBC driver to allow specifying a
postgres service name as a connection parameter and putting
the host IP in that, but the patch has not been accepted ... yet
©
#OC15Europe
Client Helper Scripts
 SAS script:





Interacts with special sqldatamart function
Generates code to export a study to .sas7bdat
Accepts general filters e.g. if you want only subject “x”
Runs said code
Result is a “snapshot” of all study data
 Stata script and function, same as above except to .dta
 These special functions live in the “public” schema so all
users can execute them
©
#OC15Europe
Client Helper Scripts
 Access VBA module:
 Import into a database
 Run the function, providing parameters:
 Study schema name
 ODBC connection string or path to FileDSN
 Creates pass through queries for each study matview
©
#OC15Europe
Client Examples - SAS
 SAS can interact with postgres without snapshot files
 Create a foreign library for the study:
LIBNAME myStudyNameLibName ODBC
SCHEMA=myStudyNameSchemaName
NOPROMPT="fileDSN reference string, or ODBC connection string";
RUN;
 Use this library, e.g. for table of subjects per site by sex:
DATA myStudyNameLibName.subjects ;
PROC FREQ; TABLE site_name*subject_sex; RUN;
©
#OC15Europe
Client Examples - Stata
 Stata can interact with postgres without snapshot files
 Load a dataset
odbc load, table("myStudyNameSchemaName.subjects") noquote
connectionstring("fileDSN reference string, or ODBC connection
string")
 Use the dataset, e.g. for table of subjects per site by sex:
tabulate site_name subject_sex
©
#OC15Europe
Client Examples - Access
 Access can interact with postgres via pass thru queries
 Run the setup code or manually create each p-t query
 Equivalent table of subjects per site by sex:
TRANSFORM Count(subjects.subject_id) AS CountSubject
SELECT subjects.site_name
FROM subjects
GROUP BY subjects.site_name
PIVOT subjects.subject_sex;
 Complex queries best done as pass thru since Access
downloads each full set to execute query locally
©
#OC15Europe
Implementation Details
 Batch script feeds psql all CREATE FUNCTION files
 Batch script feeds psql variables for build script
 psql substitutes variables into build script
 Build script uses the functions to create database
 Create foreign table for catalog to get database definition
 Use catalog to create foreign tables for all tables, views
©
#OC15Europe
Implementation Details
 Create matview for each foreign table: “cache layer 1”
 Create “dm” schema with core queries: “cache layer 2”
 Clinicaldata, metadata, metadata-derivatives, etc
 Create study schemata
 Copy core queries filtered for that study
 Create item group matviews: “cache layer 3”
 User creation
©
#OC15Europe
Implementation Details
 Item group matviews pivot data using max aggregates
 Dynamic SQL creates column names, types, case
statements to filter out blanks and null codes
 Simplified representation of what happens:
O
C
D
M
Itemgroup
item
Item_value
Visit
visit_date
2015-06-01
Visit
visit_attended
Yes
Visit
visit_reason_not_done
‘’
Itemgroup
visit_date
visit_attended
visit_reason_not_done
Visit
2015-06-01
Yes
[NULL]
©
#OC15Europe
Customisation Examples
 Change a core query (in dm schema)
 The “dm_create_dm_...” functions are just wrappers
around a CREATE MATERIALIZED VIEW statement
 Edit the query and rebuild the database
 Change how login roles are created
 Currently uses regex ^([A-z]+)@
 Edit each “dm_users_...” function where this is used
 Should delegate this to a function so it is easier to change
©
#OC15Europe
Customisation Examples
 Add a new core query (in dm schema)
 Copy/paste/rename existing “dm_create_dm...” function
 Put your query between the $query$ string delimiters
 Add a call to it in the build script:
 After the call to dm_create_ft_openclinica_matview_indexes
 Before the call to dm_create_study_schemas
 Make sure it has a “study_name” column
 Should be copied by dm_create_study_common_matviews
 That is, you might not need to rebuild
©
#OC15Europe
Feature Ideas
 Variable and value labelling in snapshot scripts
 Reorganise clinical/metadata into modular views so that
each child is a composition instead of a “select distinct”
 Snapshot script for R / python(pandas)
 Add more database IDs everywhere
 For unique indexes on item group matviews
 For links in reports that go straight to OpenClinica pages
 For data cleaning tracking against custom views
 Mechanism to automatically re-create post-build custom
study postgres views after rebuild
©
#OC15Europe