Using Electronic Medical Records Systems for Clinical Research
Using Electronic Medical Records
Systems for Clinical Research:
Benefits and Challenges
Prakash M. Nadkarni
1
Introduction
Opportunities
Availability of clinical, financial and administrative
data in electronic form
Challenges
Using EMR Software for research operations
Using EMR Data for research? Suitability of care-oriented
data to clinical research needs.
EMRs queried directly to answer research
questions
2
EMR/Clinical Research Information
System (CRIS) Differences:
Research Subjects
Subjects are not necessarily “patients”.
Personal Health Information may be
optional.
Not all screened subjects are enrolled.
Simultaneous or sequential enrollment
Eligibility Criteria
3
EMR/CRIS Differences: The Study
Calendar
Events/Visits and Study Calendar: Specific
evaluations or interventions are done at
specific time points ("events") relative to
the start of the study.
Not all patients are enrolled at the same
time.
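A minimal sketch of the study-calendar idea: events are defined as offsets from the start of the study, and each subject's actual dates follow from that subject's own enrollment date. The event names, day offsets and the `schedule_for` helper are hypothetical, chosen only for illustration.

```python
from datetime import date, timedelta

# Hypothetical protocol: event name -> day offset from each subject's study start.
STUDY_CALENDAR = {"Screening": 0, "Week 2 visit": 14, "Week 8 visit": 56}

def schedule_for(enrollment_date: date) -> dict:
    """Return the calendar date of every protocol event for one subject."""
    return {event: enrollment_date + timedelta(days=offset)
            for event, offset in STUDY_CALENDAR.items()}

# Two subjects enrolled on different dates share the same relative calendar.
subject_a = schedule_for(date(2024, 1, 10))
subject_b = schedule_for(date(2024, 3, 1))
```

Because the calendar stores offsets rather than absolute dates, one definition serves every subject regardless of when that subject enrolls.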
4
EMR/CRIS Differences: Electronic
Data Capture (EDC)
CRIS EDC is Far More Structured and Fine-grained – textual comments are only a last
resort.
CRISs may need to Support Real-Time
Self-reporting of Subject Data
CRIS EDC may not always be Real-Time.
Quality Control considerations dictate
many workflow steps.
5
EMR/CRIS Differences: Trans-Institutional Scope
For trans-institutional scope, Web
technology is virtually mandated.
Site restriction in Multi-Site studies – end-users and investigators access only their
own site’s patients.
Trans-National Issues: Software
Localization/Globalization – same
software, different language/layout.
6
EMR/CRIS Differences: User Roles
CRISs support differential access to studies
Most users of a CRIS are unaware of the other
studies in the same database.
Some users have read-only access to the data;
some only view reports.
Only certain users may be allowed to enter data in
particular forms, or even view certain "blinded"
data.
Data analysts typically do not need to access PHI.
However, in multi-institutional studies, they are
not typically site-restricted (see later)
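The access rules on this slide and the previous one (site restriction, and analysts who see all sites but no PHI) could be sketched roughly as follows. The role names, record fields and the `visible_records` function are assumptions made for illustration, not features of any particular CRIS.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class User:
    name: str
    role: str              # hypothetical roles: "coordinator" or "analyst"
    site: Optional[str]    # None means the user is not site-restricted

# Hypothetical subject records: one PHI field ("name") plus study data.
SUBJECTS = [
    {"id": 1, "site": "A", "name": "Subject One", "hba1c": 7.2},
    {"id": 2, "site": "B", "name": "Subject Two", "hba1c": 6.5},
]

def visible_records(user: User) -> list:
    """Apply site restriction first, then strip PHI for analysts."""
    rows = [r for r in SUBJECTS if user.site is None or r["site"] == user.site]
    if user.role == "analyst":
        rows = [{k: v for k, v in r.items() if k != "name"} for r in rows]
    return rows

coordinator = User("coord_a", "coordinator", site="A")
analyst = User("stats", "analyst", site=None)
```

Here the coordinator sees only site A's subject, including the name, while the analyst sees both sites with the PHI field removed.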
7
EMR/CRIS Differences: Summary
EMRs are intended to primarily support
patient care, not research. CRISs are
specifically designed for research protocols.
EMRs may inter-operate with CRISs.
Sub-systems: Laboratory, Pharmacy, Scheduling
EMR *may* be used with structured EDC for
intra-institutional studies if the only alternative is
paper, or if data-entry would otherwise be
duplicated.
Claims by any EMR vendor that their systems
are CRIS-capable should be viewed
skeptically.
8
EMR Data for Research:
The Nature of Electronic EMR Data
Significant dependence on narrative text,
which is often the gold standard for clinical
findings.
Using administrative/billing data as a
surrogate for clinical data
Miscoding, variations in coding
9
Using EMR Data for Research
Primarily hypothesis suggestion/generation
rather than confirmation
Sample size may be too small to achieve
statistical significance
Most data mining tests only show association,
which does not prove causation.
Selection of patients matching complex criteria:
sample size projections for a planned study (a
strength of I2B2 – no IRB approval needed
because only anonymized data is returned).
10
Medical Natural Language Processing 101
NLP is concerned with extraction of meaningful
information from human language input.
Ultimate goal is to transform unstructured text
into a structured form.
Most NLP applications are targeted toward
specific goals – e.g., identification of
medications, adverse drug events.
NLP is not 100% accurate
11
Medical NLP 101: Symbolic/Rule-based Approaches
Linguistic / symbolic NLP approaches
employ hand-crafted grammar rules to
parse text into units of speech (symbols),
which are then processed further.
Still used successfully for limited problems.
This approach does not always scale
Labor-intensive, ambiguous parses, poor
results with telegraphic text.
12
Medical NLP 101: Statistical NLP
Relies on large bodies of text annotated with
the correct answers by humans.
Utilizes probabilistic methods for prediction
The larger and more representative the
training data, the better the results will be.
Approaches include Support Vector Machines
(SVMs), Hidden Markov Models (HMMs), and
Conditional Random Fields (CRFs).
13
Medical NLP 101: Subproblems
NLP software typically works as a pipeline of
modules: modules for low-level tasks
precede those for high-level tasks
Low-Level Tasks
Segmentation – sentence and word boundary
detection, problem-specific boundary detection
Part of speech tagging
Morphological decomposition of compound
words
Aggregation – identification of phrases
14
Medical NLP 101: Subproblems (2)
High-level tasks
Spelling and grammatical error correction
Named Entity Recognition – including medical
concept recognition
Word/abbreviation disambiguation
Negation and uncertainty identification
Relationship extraction
Temporal inferencing
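To make the pipeline idea concrete, here is a toy rule-based sketch in which a low-level step (sentence segmentation) feeds two high-level steps (dictionary-based concept recognition and a crude negation check). The lexicon, the negation cue words and the function names are hypothetical, and real systems are far more sophisticated than this.

```python
import re

# Hypothetical mini-lexicon and negation cues; real systems use far larger resources.
CONCEPTS = {"fever", "cough", "chest pain"}
NEGATION_CUES = {"no", "denies", "without"}

def split_sentences(text):
    """Low-level task: naive sentence boundary detection after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def find_concepts(sentence):
    """High-level tasks: dictionary-based concept recognition, plus a crude
    negation check (any cue word appearing earlier in the same sentence)."""
    lowered = sentence.lower()
    results = []
    for concept in sorted(CONCEPTS):
        pos = lowered.find(concept)
        if pos == -1:
            continue
        preceding_words = set(re.findall(r"[a-z]+", lowered[:pos]))
        results.append((concept, bool(preceding_words & NEGATION_CUES)))
    return results

note = "Patient reports cough. Denies fever or chest pain."
findings = [f for s in split_sentences(note) for f in find_concepts(s)]
# findings == [("cough", False), ("chest pain", True), ("fever", True)]
```

Each finding pairs a concept with a flag saying whether it was negated; the segmentation step matters because it keeps "Denies" from negating "cough" in the preceding sentence.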
15
Medical NLP: Practical Issues
Change of Workflow and Introduction of
Structure can eliminate a difficult problem.
Code Reuse to avoid reinventing wheels.
General vs. Specific Solutions
Tools Need Commoditization
16
Querying EMR Data:
Technological Considerations
A database cannot be simultaneously
designed for rapid query as well as
efficient interactive, multi-user updates.
EMR database designs are transaction-oriented.
EMRs are optimized for "Patient/Entity
Centric", not "Attribute-Centric" queries
17
Data Warehousing 101
Principle: Operating on a separate read-only
copy of the data on separate hardware yields
better query performance.
Structural tweaks include adding extra and precomputation of aggregate values.
Special types of indexes (bitmap indexes) yield
improved query performance.
“Star schemas” characterize most warehouse
designs.
Farmers vs. Explorers (Inmon)
"Virtual" integration ("federation")
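As a rough illustration of precomputing aggregate values in a read-only copy, the sketch below builds a small summary table once at load time so that later queries read the stored aggregates instead of rescanning the detail rows. It uses Python's built-in sqlite3 module; the table names, column names and data are hypothetical.

```python
import sqlite3

wh = sqlite3.connect(":memory:")  # stands in for the separate warehouse copy
wh.execute("CREATE TABLE lab_result (patient_id INTEGER, test TEXT, value REAL)")
wh.executemany("INSERT INTO lab_result VALUES (?, ?, ?)", [
    (1, "glucose", 90.0), (2, "glucose", 110.0), (3, "sodium", 140.0),
])

# Precompute per-test aggregates once, at warehouse load time.
wh.execute("""CREATE TABLE lab_summary AS
    SELECT test, COUNT(*) AS n, AVG(value) AS mean_value
    FROM lab_result GROUP BY test""")

# Later queries read the small summary table instead of the detail rows.
summary = {test: (n, mean) for test, n, mean in
           wh.execute("SELECT test, n, mean_value FROM lab_summary")}
```

The trade-off is that the summary is only as fresh as the last load, which is acceptable for a read-only copy but not for a transactional system.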
18
Data Warehousing: Practical
Considerations
After a warehouse is deployed, the need for
custom reports may increase rather than
decrease.
The critical requirement for effective ad
hoc query is a comprehensive
understanding of the data. This is
generally a full-time effort.
19
Special Considerations: Querying
of Clinical Data
Both EMRs and large-scale CRISs typically
store clinical data in Entity-Attribute-Value
(EAV) form
100,000s of clinical parameters exist across all
medical domains.
The vast majority of parameters will be
inapplicable for a particular subject/patient.
EAV is a triple: Entity=Patient+point in time,
Attribute=Parameter, Value=value of that
parameter.
EPIC Flowsheet data uses EAV.
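A small sketch of the EAV layout, assuming a single table with one row per (patient, time point, parameter, value). It contrasts a patient-centric query with an attribute-centric one, using Python's built-in sqlite3 module; the table, the parameter names and the values are hypothetical and do not reflect any vendor's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE eav (
    patient_id INTEGER, observed_at TEXT, attribute TEXT, value TEXT)""")
# Only parameters actually recorded get a row; inapplicable ones take no space.
conn.executemany("INSERT INTO eav VALUES (?, ?, ?, ?)", [
    (1, "2024-01-05", "systolic_bp", "142"),
    (1, "2024-01-05", "heart_rate", "88"),
    (2, "2024-01-06", "systolic_bp", "118"),
])

# Patient-centric query: everything recorded for one patient.
patient_rows = conn.execute(
    "SELECT attribute, value FROM eav WHERE patient_id = ? ORDER BY attribute",
    (1,),
).fetchall()

# Attribute-centric query: all patients whose value of one parameter meets a
# criterion. Values are stored as text in this sketch, so they are cast first.
high_bp = conn.execute(
    "SELECT patient_id FROM eav WHERE attribute = 'systolic_bp' "
    "AND CAST(value AS REAL) > 140"
).fetchall()
```

The attribute-centric query is the kind research needs, and in an EAV design it must filter on the attribute name and convert the value before it can compare anything.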
20
Standardization
The mere presence of structure does not
solve all problems
Synonyms in narrative text are unavoidable and must be
reduced to the same concept. Controlled
medical vocabularies (UMLS) help.
UMLS is not a panacea
Institutions will therefore evolve their internal
controlled vocabularies.
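Reducing synonyms to one concept can be sketched as a lookup against an internal controlled vocabulary. The concept identifiers below are made-up local codes, not UMLS identifiers, and the `normalize` function is hypothetical.

```python
# Hypothetical local vocabulary: each synonym maps to one internal concept ID.
SYNONYM_TO_CONCEPT = {
    "heart attack": "LOCAL:0001",
    "myocardial infarction": "LOCAL:0001",
    "mi": "LOCAL:0001",
    "high blood pressure": "LOCAL:0002",
    "hypertension": "LOCAL:0002",
}

def normalize(term):
    """Return the concept ID for a term, or None if it is not in the vocabulary."""
    return SYNONYM_TO_CONCEPT.get(term.strip().lower())

same_concept = normalize("Heart attack") == normalize("MI")  # both map to LOCAL:0001
unmapped = normalize("HTN")  # None: this abbreviation was never added
```

The unmapped abbreviation shows why such vocabularies keep evolving: every synonym not yet entered silently falls outside any query on the concept.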
21
Standardization Considerations
Standardizing your definitions
2nd Law of Thermodynamics
Poor definition quality becomes a problem if
pooled-data (or meta-) analysis is intended.
Features of certain systems predispose to
disorder. (Learn As You Go, separate
definitions databases.)
Even the best system is not immune – path of
least resistance.
Consistent definition is difficult to achieve
after the fact – Deming.
22
EMR use as the basis for research
hypotheses
Conflicting evidence regarding EMR
benefit still appears.
A *well designed* EMR may benefit.
Electronic Alerting Systems themselves
may not improve care, unless EMRs also
reduce workload through automatic
actions.
Review vendor-supplied templates
carefully.
23
Conclusions: Future EMR Evolution
EMRs fully supporting CRIS capability are
unlikely to evolve.
No software should attempt to do everything
Differences in storage-engine capabilities
Jack-of-all-trades approach (doing everything in a
mediocre manner) is not viable.
Difficult (or impossible) to devise a logically
consistent user-interface metaphor that
applies to diverse unrelated features.
Example of Microsoft Office.
24
Inter-operation (1)
Co-existing and Inter-operating best-of-breed packages offer the best usability and
feature-set
CRISs, Genomic/Proteomic Data Management
Packages
There may be minimal data duplication, e.g.,
EMRs may pull in very limited summary
information on critical genetic data for selected
patients, so that it is immediately visible.
25
Inter-operation (2)
• CRIS/EMR
Bulk import of laboratory parameters, to avoid
duplicate data entry
Automatic grading of laboratory-based adverse
events (oncology studies) – Richesson et al.
Use for scheduling research subject visits
Pharmacy subsystem for drug dispensation
EMR for primary EDC in intra-institutional studies
if the only alternative is paper, or if data-entry
would otherwise be duplicated.
• EMR/Specialized EMR
• Picture-archiving systems
26
Inter-operation (3)
• Application Programming Interfaces (APIs)
All large packages – CRISs, EMRs, ‘Omics –
require APIs to make inter-operation efficient
APIs are vendor-specific. Inter-operation
standards (e.g., the HL7 Virtual medical record)
have not received much traction.
Currently, many vendors set unreasonable
financial and other barriers to use of their APIs
(e.g., official certification, withholding of
documentation).
EMRs lag in the software industry’s trend toward
open-source.
27
Questions?
28
Further reading
CRIS
Richesson and Andrews. Clinical Research Informatics. Springer, 2012.
NLP
Jurafsky and Martin. Natural Language Processing.
Manning and Schuetze. Foundations of Statistical Natural Language Processing.
Nadkarni, Ohno-Machado and Chapman. Natural Language Processing: An Introduction. Journal of the American Medical Informatics Association, 2011.
Data Warehousing
Larry Greenfield. The Data Warehousing Information Center. www.dwinfocenter.org/
Kimball, Reeves, Ross and Thornthwaite. The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing, and Deploying Data Warehouses. Wiley, 1998.
29