Using Hippocratic Database Technology to
IBM Research
Securing Electronic Health Records
without Impeding the Flow of Information
Rakesh Agrawal*
Microsoft Search Labs
Mountain View, CA
[email protected]
Christopher Johnson
IBM Almaden Research Center
San Jose, CA
[email protected]
* Based on work done while author was at IBM Almaden
IMIA Conference – Security in Health Information Systems | April 29, 2006
Based on joint work with
Roberto Bayardo
Alvin Cheung
Alexandre Evfimievski
Tyrone Grandison
Jerry Kiernan
Kristen Lefevre
Ramakrishnan Srikant
Yirong Xu
Thesis
Technology alone cannot solve the complex
problem of securely managing health
information; at the same time, policy and law
need to be informed of what is technically
feasible and in what timeframe.
By advancing technology, we can:
– change the mix of legislation, societal norms,
market forces, and technology comprising the
solution; and
– improve the overall quality of the solution.
Outline
Illustrate thesis with technology examples
based on Hippocratic database work
Recommendations for
– Policy designers and legislators
– Solution developers
– Scientists and researchers
Hippocratic Database Technologies
GOAL
Create a new generation of information systems that protect the privacy, security, and
ownership of data while not impeding the flow of information.
• Active Enforcement: Data item level enforcement of disclosure policies and patient preferences
• Privacy-Preserving Data Mining: Preserves privacy at individual level, allowing accurate data mining models at aggregate level
• Compliance Auditing: Determine whether data has been accessed in violation of specified policies
• Optimal k-anonymization: De-identifies records in a way that maintains truthful data but is not prone to data linkage attacks
• Sovereign Information Integration: Selective, minimal sharing across autonomous data sources, without trusted third party
Active Enforcement
• Privacy Policy: Organizations define
a set of policies describing who may
access data (users or roles), for what
purposes the data may be accessed
(purposes) and to whom the data may
be disclosed (recipients).
• Consent: Data subjects are given
control, through opt-in and opt-out
choices, over who may see their data
and under what circumstances
#   Name    Age  Phone
1   Adam    25   111-1111
3   Bob     -    333-3333
4   Daniel  40   -
• Provides cell-level disclosure control.
• Application modification not required.
• Database agnostic; does not require
changes to the database engine.
[Diagram stages: Patient Preferences & Data Collection; Policy Creation; Application Data Retrieval]
• Disclosure Control: Database
enforces privacy policies and data
subject consent choices with respect to
all data access.
• Active Enforcement system
intercepts and rewrites incoming
queries to comply with policies,
subject choices, and context.
• Rewritten queries benefit from all of
the optimizations and performance
enhancements provided by the
underlying engine (e.g. parallelism).
VLDB 02, WWW 03, VLDB 04
[Architecture diagram labels: Installation, Policy Parser, Negotiation, Patient Preferences & Policy Matching, Installed Policy, Patient Records, DATABASE, Enforcement, JDBC/ODBC Driver]
Query Modification Example
(Disclose Name only of Patients who have opted-in)

Original query:

    SELECT Name
    FROM Patients
    WHERE Age < 20

Rewritten query:

    SELECT
      CASE WHEN EXISTS
        (SELECT Name_Choice
         FROM Patient_Choices
         WHERE Patients.Patient# = Patient_Choices.Patient#
           AND Patient_Choices.Name_Choice = 1)
      THEN Name ELSE null END
    FROM Patients
    WHERE Age < 20
      AND EXISTS
        (SELECT Patient#_Choice
         FROM Patient_Choices
         WHERE Patients.Patient# = Patient_Choices.Patient#
           AND Patient_Choices.Patient#_Choice = 1)
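The rewriting above can be tried end to end with a small in-memory database. The following is only a sketch under stated assumptions: it uses Python's sqlite3 module rather than the engine used in the original work, the column names patient_id, id_choice, and name_choice stand in for Patient#, Patient#_Choice, and Name_Choice, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Patients (patient_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    -- id_choice: patient agreed to appear at all; name_choice: agreed to disclose Name
    CREATE TABLE Patient_Choices (patient_id INTEGER, id_choice INTEGER, name_choice INTEGER);
    INSERT INTO Patients VALUES (1, 'Adam', 15), (2, 'Bob', 18), (3, 'Carol', 12), (4, 'Daniel', 40);
    INSERT INTO Patient_Choices VALUES (1, 1, 1), (2, 1, 0), (3, 0, 1), (4, 1, 1);
""")

ORIGINAL = "SELECT name FROM Patients WHERE age < 20 ORDER BY patient_id"

# Hand-written equivalent of the rewritten query on the slide:
# the CASE masks the Name cell, the outer EXISTS filters whole rows.
REWRITTEN = """
    SELECT CASE WHEN EXISTS
             (SELECT 1 FROM Patient_Choices c
              WHERE c.patient_id = p.patient_id AND c.name_choice = 1)
           THEN p.name ELSE NULL END
    FROM Patients p
    WHERE p.age < 20
      AND EXISTS
             (SELECT 1 FROM Patient_Choices c
              WHERE c.patient_id = p.patient_id AND c.id_choice = 1)
    ORDER BY p.patient_id
"""

unrestricted_rows = conn.execute(ORIGINAL).fetchall()
enforced_rows = conn.execute(REWRITTEN).fetchall()
print(unrestricted_rows)  # every patient under 20
print(enforced_rows)      # opted-out rows dropped, opted-out names masked as None
```

In this sample, Carol's row disappears because she opted out entirely, while Bob's row stays but his name is returned as NULL, which is the cell-level behavior the slide describes.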
[Figure: Elapsed Time (seconds) vs. Choice Selectivity (%) for three series: Unmodified, Modified External Multiple, Modified Internal]
Measured performance of a query selecting all records from a 5 million-record table
Compared performance of original and modified queries for varied choice selectivity
Not surprisingly, performance is actually better for modified queries when we use
privacy enforcement as an additional selection condition
– Able to use indexes on choice values
Shows the importance of database-level privacy enforcement for performance
Audit Scenario
1. Jane has not been feeling well and decides to consult her doctor.
2. The doctor uncovers that Jane’s blood sugar level is high and suspects diabetes.
3. Sometime later, Jane receives promotional literature from a pharmaceutical company, proposing over-the-counter diabetes tests.
4. Jane complains to the Department of Health and Human Services, saying that she had opted out of the doctor sharing her medical information with pharmaceutical companies for marketing purposes.
5. The doctor must now review disclosures of Jane’s information in order to understand the circumstances of the disclosure, and take appropriate action.
Audit Expression
Who has accessed Jane’s disease information?
    audit T.disease
    from  Customer C, Treatment T
    where C.cid = T.pcid and C.name = 'Jane'
Problem Statement
Given
– A log of queries
– An audit expression specifying sensitive
data
NOT Given
– Log of data accesses
Precisely and Efficiently identify
– Those queries that accessed the data
specified by the audit expression in the
past
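One way to make the problem statement concrete is a deliberately naive check: replay each logged query against the data with and without the rows named by the audit expression, and flag the query if its result changes. This is a simplified stand-in, not the algorithm from the cited work; it ignores column-level checks and past database states, and the table contents, query log, and function names below are hypothetical.

```python
import sqlite3

def build_db(remove_audited_rows):
    """Create a tiny Customer/Treatment database (invented sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Customer (cid INTEGER, name TEXT);
        CREATE TABLE Treatment (pcid INTEGER, disease TEXT);
        INSERT INTO Customer VALUES (1, 'Jane'), (2, 'Tom');
        INSERT INTO Treatment VALUES (1, 'diabetes'), (2, 'asthma');
    """)
    if remove_audited_rows:
        # Rows selected by the audit expression: Jane's Treatment tuples.
        conn.execute("""
            DELETE FROM Treatment
            WHERE pcid IN (SELECT cid FROM Customer WHERE name = 'Jane')
        """)
    return conn

def suspicious_queries(query_log):
    """Return IDs of logged queries whose answer depends on the audited rows."""
    full, reduced = build_db(False), build_db(True)
    flagged = []
    for query_id, sql in query_log:
        with_rows = sorted(full.execute(sql).fetchall())
        without_rows = sorted(reduced.execute(sql).fetchall())
        if with_rows != without_rows:
            flagged.append(query_id)
    return flagged

QUERY_LOG = [
    (1, "SELECT T.disease FROM Customer C, Treatment T "
        "WHERE C.cid = T.pcid AND C.name = 'Jane'"),
    (2, "SELECT T.disease FROM Treatment T WHERE T.disease = 'asthma'"),
    (3, "SELECT C.name FROM Customer C, Treatment T "
        "WHERE C.cid = T.pcid AND T.disease = 'diabetes'"),
]

flagged_ids = suspicious_queries(QUERY_LOG)
print(flagged_ids)
```

Query 3 never mentions Jane by name, yet it is flagged because its answer depends on her Treatment row; a pure text match on the query log would miss it.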
Compliance Auditing
[Diagram: a query with purpose and recipient passes through the Audit Database Layer, which generates an audit record for each query in the Query Audit Log. Updates, inserts, and deletes reach the Data Tables, where database triggers track updates to base tables and feed the Backlog Database Layer. An audit query returns the IDs of log queries having accessed data specified by the audit query.]

• Audits whether particular data has been disclosed in violation of the specified policies.
• Audit expression specifies what potential data disclosures need monitoring.
• Identifies logged queries that accessed the specified data.
• Auditors can analyze the circumstances of violations.
• Make necessary corrections to procedures, policies, security.

Query Audit Log

ID  Timestamp  Query     User        Purpose    Recipient
1   2004-02…   Select …  B. Jones    Marketing  PharmaCo.
2   2004-02…   Select …  S. Roberts  Treatment  S. Roberts
VLDB 04
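The trigger-maintained backlog mentioned on this slide can be sketched as follows. This is a minimal illustration, assuming SQLite triggers through Python's sqlite3 module; the table name Treatment_backlog, its columns, and the sample updates are hypothetical and do not reproduce the schema used in the original system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Treatment (pcid INTEGER PRIMARY KEY, disease TEXT);

    -- Backlog table: one row per change, so past states can be replayed.
    CREATE TABLE Treatment_backlog (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        op TEXT, pcid INTEGER, disease TEXT
    );

    CREATE TRIGGER treatment_ins AFTER INSERT ON Treatment BEGIN
        INSERT INTO Treatment_backlog (op, pcid, disease)
        VALUES ('insert', NEW.pcid, NEW.disease);
    END;
    CREATE TRIGGER treatment_upd AFTER UPDATE ON Treatment BEGIN
        INSERT INTO Treatment_backlog (op, pcid, disease)
        VALUES ('update', NEW.pcid, NEW.disease);
    END;
    CREATE TRIGGER treatment_del AFTER DELETE ON Treatment BEGIN
        INSERT INTO Treatment_backlog (op, pcid, disease)
        VALUES ('delete', OLD.pcid, OLD.disease);
    END;
""")

conn.execute("INSERT INTO Treatment VALUES (1, 'suspected diabetes')")
conn.execute("UPDATE Treatment SET disease = 'diabetes' WHERE pcid = 1")
conn.execute("DELETE FROM Treatment WHERE pcid = 1")

backlog = conn.execute(
    "SELECT op, disease FROM Treatment_backlog ORDER BY seq"
).fetchall()
print(backlog)
```

Even though the base table ends up empty, the backlog still holds every version of the row, which is what lets an audit ask what a past query could have seen.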
Overhead on Updates
[Figure: Time (minutes) vs. # of versions per tuple for four configurations: Composite, Simple, No Index, No Triggers]
7x if all tuples are updated
3x if a single tuple is updated
Negligible by using Recovery Log to build Backlog tables
Audit Query Execution Time
Privacy Preserving Data Mining
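[Diagram: patient records such as Kevin’s LDL and weight (126 | 210 | ...) and Julie’s LDL (128 | 130 | ...) each pass through a Randomizer, e.g. 126+35 gives 161, producing randomized records (161 | 165 | ... and 129 | 190 | ...). The distribution of LDL and the distribution of weight are reconstructed from the randomized values and passed to Data Mining Algorithms, which produce the Data Mining Model.]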
Preserves privacy at the individual patient level, but
allows accurate data mining models to be constructed
at the aggregate level.
Adds random noise to individual values to protect
patient privacy.
EM algorithm estimates original distribution of values
given randomized values + randomization function.
Algorithms for building classification models and
discovering association rules on top of privacy-preserved data with only small loss of accuracy.
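The randomize-then-reconstruct idea can be sketched in a few lines. The code below is an illustration under assumptions, not the published algorithm: it assumes Gaussian noise with a known standard deviation, discretizes values into bins represented by their midpoints, and applies an EM-style iterative update to the bin probabilities. The function name reconstruct, the synthetic data, and all parameter values are hypothetical.

```python
import math
import random

def reconstruct(randomized, midpoints, sigma, iterations=30):
    """Estimate bin probabilities of the original values from noisy values."""
    # likelihood[i][k]: unnormalized noise density of observing randomized[i]
    # when the original value sits at midpoints[k].
    likelihood = [
        [math.exp(-((w - m) ** 2) / (2 * sigma ** 2)) for m in midpoints]
        for w in randomized
    ]
    probs = [1.0 / len(midpoints)] * len(midpoints)  # start from uniform
    for _ in range(iterations):
        updated = [0.0] * len(midpoints)
        for row in likelihood:
            total = sum(l * p for l, p in zip(row, probs))
            for k, l in enumerate(row):
                # Posterior probability that this observation came from bin k.
                updated[k] += l * probs[k] / total
        probs = [u / len(randomized) for u in updated]
    return probs

rng = random.Random(0)
# Synthetic "original" values with two clusters (invented data).
original = [rng.gauss(30, 5) if rng.random() < 0.5 else rng.gauss(70, 5)
            for _ in range(1000)]
sigma = 20.0
randomized = [x + rng.gauss(0, sigma) for x in original]  # what the miner sees

midpoints = [2.5 + 5 * k for k in range(20)]  # bins covering 0..100
estimate = reconstruct(randomized, midpoints, sigma)
for m, p in zip(midpoints, estimate):
    print(f"{m:5.1f}  {p:.3f}")
```

Only the randomized values and the noise model enter reconstruct; the original list is used solely to generate the test data, which mirrors the setting on the slide where individual values stay hidden.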
[Figures: charts comparing Original, Randomized, and Reconstructed distributions; one axis is labeled Randomization Level]
Sigmod00, KDD02, Sigmod05
Optimal k-Anonymization
Goal: De-identify patient data such that it retains its
integrity, but is resistant to data linkage attacks.
Motivation: Naïve de-identification methods are prone
to data linkage attacks, which combine subject data
with publicly available information to re-identify
represented individuals.
Process of k-Anonymization
• Data Suppression - Involves deleting particular cell values or entire tuples.
• Value Generalization - Entails replacing specific values, such as a telephone number, with more general ones, such as the area code alone.

Samarati and Sweeney k-Anonymity* Method
– A k-anonymized data set has the property that each record is indistinguishable from at least k-1 other records within the data set.

Advantages of Optimal k-anonymization
• Truthful - Unlike other disclosure protection techniques that use data scrambling, swapping, or adding noise, all information within a k-anonymized dataset is truthful.
• Secure - More secure than other de-identification methods, which may inadvertently reveal confidential information.

Optimal k-Anonymization
– We have developed a k-anonymization algorithm that finds optimal k-anonymizations under two representative cost measures and variations of k.

Original data:

Name   Address               City   Age  Diagnosis
Eric   7, rue du Mont Dore   Paris  26   Influenza
Paul   13, rue des Canettes  Paris  42   Hypertens.
Marc   48, rue du Four       Paris  47   Diabetes
Henri  21, rue du Mont Dore  Paris  28   Asthma

k-anonymized data (k=2, on name, address, age):

Name  Address       City   Age    Diagnosis
*     17th Arrond.  Paris  20-29  Influenza
*     6th Arrond.   Paris  40-49  Hypertens.
*     6th Arrond.   Paris  40-49  Diabetes
*     17th Arrond.  Paris  20-29  Asthma
* P. Samarati and L. Sweeney. “Generalizing Data to Provide Anonymity when Disclosing Information.” In Proc.
of the 17th ACM SIGMOD-SIGACT-SIGART Symposium on the Principles of Database Systems, 188, 1998.
ICDE05
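The example tables on this slide can be reproduced with a small script that suppresses names, generalizes addresses and ages, and then checks the k-anonymity property. This is a sketch only: it applies one hand-picked generalization and verifies it, and does not search for an optimal anonymization as the cited algorithm does. The street-to-arrondissement mapping is read off the slide's own tables; the function names are hypothetical.

```python
from collections import Counter

# Mapping taken from the two tables on the slide.
STREET_TO_DISTRICT = {
    "rue du Mont Dore": "17th Arrond.",
    "rue des Canettes": "6th Arrond.",
    "rue du Four": "6th Arrond.",
}

def generalize(record):
    """Suppress the name, generalize address to district and age to a decade."""
    name, address, city, age, diagnosis = record
    street = address.split(", ", 1)[1]
    low = age // 10 * 10
    return ("*", STREET_TO_DISTRICT[street], city, f"{low}-{low + 9}", diagnosis)

def is_k_anonymous(records, k, quasi_identifier_indexes):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(
        tuple(record[i] for i in quasi_identifier_indexes) for record in records
    )
    return all(count >= k for count in groups.values())

original = [
    ("Eric", "7, rue du Mont Dore", "Paris", 26, "Influenza"),
    ("Paul", "13, rue des Canettes", "Paris", 42, "Hypertens."),
    ("Marc", "48, rue du Four", "Paris", 47, "Diabetes"),
    ("Henri", "21, rue du Mont Dore", "Paris", 28, "Asthma"),
]
anonymized = [generalize(r) for r in original]

QUASI_IDENTIFIERS = (0, 1, 3)  # name, address, age
print(is_k_anonymous(original, 2, QUASI_IDENTIFIERS))    # False
print(is_k_anonymous(anonymized, 2, QUASI_IDENTIFIERS))  # True
```

Every value in the anonymized rows is still true of the patient it describes, only less specific, which is the "truthful" property claimed above.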
Sovereign Information Integration
Separate databases due to statutory,
competitive, or security reasons.
Minimal Necessary Sharing
Selective, minimal sharing on a
need-to-know basis.
Example: Among those patients who
took a particular drug, how many with
a specified DNA sequence had an
adverse reaction?
Researchers must not learn anything
beyond counts.
• Algorithms for computing joins and join counts while revealing minimal additional information.

[Diagram: R = {a, u, v, x} and S = {b, u, v, y}, so R ∩ S = {u, v}. R must not know that S has b and y; S must not know that R has a and x. For Count(R ∩ S), R and S do not learn anything except that the result is 2. Labels: Medical Research Inst., DNA Sequences, Drug Reactions.]
Sigmod 03, DIVO 04
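One family of techniques for computing such a count uses commutative encryption: each party encrypts hashed values by modular exponentiation with a private key, and values encrypted by both parties match exactly when the originals match. The sketch below simulates both parties in one process and is illustrative only. The prime 2**127 - 1, the hashing scheme, and the use of random.Random for key generation are toy assumptions chosen for readability, not secure parameters, and the function names are hypothetical.

```python
import hashlib
import math
import random

P = 2**127 - 1  # a Mersenne prime; toy-sized, not a secure choice

def hash_to_group(value):
    """Map a string to an integer in [1, P-1]."""
    digest = hashlib.sha256(value.encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def make_key(rng):
    """Pick an exponent coprime to P-1 so that encryption is one-to-one."""
    while True:
        key = rng.randrange(3, P - 1)
        if math.gcd(key, P - 1) == 1:
            return key

def encrypt(values, key):
    # (v ** a) ** b == (v ** b) ** a mod P, so the order of keys does not matter.
    return [pow(v, key, P) for v in values]

def intersection_size(r_values, s_values, rng):
    key_r, key_s = make_key(rng), make_key(rng)
    # Each party encrypts its own hashed values and shuffles before sending.
    r_once = encrypt([hash_to_group(v) for v in r_values], key_r)
    s_once = encrypt([hash_to_group(v) for v in s_values], key_s)
    rng.shuffle(r_once)
    rng.shuffle(s_once)
    # Each party applies its key to the other's ciphertexts.
    r_twice = encrypt(r_once, key_s)
    s_twice = encrypt(s_once, key_r)
    # Doubly encrypted values coincide only for common items.
    return len(set(r_twice) & set(s_twice))

count = intersection_size(["a", "u", "v", "x"], ["b", "u", "v", "y"], random.Random(0))
print(count)
```

Because only shuffled ciphertexts are exchanged, neither side can tell which of its items matched; a real deployment would need vetted group parameters and a cryptographically secure random source such as the secrets module.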
Recommendations
Policy Makers & legislators
– Continuous technology monitoring and understanding to inform policies
and laws (current and new)
– Invest in research
Solution Developers (Technologists)
– Design-in ethical considerations (e.g. respect for privacy, safeguard
against misuse); they can’t be afterthoughts
– Engage in dialog with policy makers and legislators to educate them on
performance implications of the policies/laws
Recommendations for Researchers
Asking questions is easy:
it's answering them that's hard.
Policy Specification
How to determine if the policy specification
accurately captures the intent of the policy
maker? (The person specifying the policy is
usually not a computer scientist.)
How to help the patient understand the
policy and the implications of his or her
choices?
How to design a policy language that
reconciles the goals of understandability
and efficient computation?
Sticky Policies
Healthcare organizations should be assured that
original policy controls will be enforced over data after
transfer to other entities.
Transferees of patient data should be capable of
applying source disclosure policies to any information
in their databases.
Database should enforce source and enterprise
policies and resolve any conflicts among policies.
[Diagram: Hospital 1 sends patient data + policy annotations to Hospital 2, whose Patient Records DB combines patient data and policies to release data compliant with source and enterprise policies]
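A sticky policy can be pictured as an annotation that travels with each record and is checked together with the receiving organization's own policy. The sketch below assumes one simple conflict-resolution rule, that a disclosure is allowed only if both policies allow it; the slide leaves conflict resolution open, so this rule, the class names, and the purposes are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Policy:
    allowed_purposes: frozenset

@dataclass(frozen=True)
class AnnotatedRecord:
    data: dict
    source_policy: Policy  # travels with the data after transfer

def may_disclose(record, enterprise_policy, purpose):
    """Assumed rule: both the source and the enterprise policy must allow the purpose."""
    return (purpose in record.source_policy.allowed_purposes
            and purpose in enterprise_policy.allowed_purposes)

# Hospital 1 attaches its policy before transferring the record.
record = AnnotatedRecord(
    data={"patient": "Jane", "test": "blood sugar"},
    source_policy=Policy(frozenset({"treatment"})),
)
# Hospital 2 has a broader policy of its own.
hospital_2_policy = Policy(frozenset({"treatment", "marketing"}))

print(may_disclose(record, hospital_2_policy, "treatment"))  # True
print(may_disclose(record, hospital_2_policy, "marketing"))  # False
```

The second call is refused even though Hospital 2's own policy permits marketing use, because the source restriction stays attached to the record.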
Data Pointillism
Source tables:

Name   Phone
Bob    394-1015
Alice  396-1012
Alice  396-1112

Phone     Address   City
396-1012  Maple St  Chatham
394-1015  -         Madison
396-1112  Maple St  Madison

Patient  Policy#
Alice    AAA1035
Bob      AAA1035
Alice    UHG1035

Pointillist (combined records):

Bob    394-1015  Maple St  Madison  AAA1035
Alice  396-1012  Maple St  Chatham  UHG1035
Alice  396-1112  Maple St  Madison  AAA1035

• > 14B records with Choicepoint
• Data from > 22,000 sources in RDC’s GRID
• > 550 companies compiling databases of private information
• Accuracy? Limits?
• How to allow someone to verify data?
• Identifying and correcting errors?
• Usage control?
Massively Distributed Data Management
What if patient data is stored on personal devices?
Pervasive monitoring devices will also collect patient data.
How to protect the security of these devices?
Enable selective sharing of information stored on devices?
Distributed backup in the network to prevent data loss?
512MB SanDisk Cruzer
$47.99
Transcend 40GB Portable Hard Disk USB
95mm x 71.5mm x 15mm, $189
Data Life Cycle Management
Healthcare organizations must define data retention
policies based on legal requirements and patient
specifications:
–HIPAA: 6 years (21 years for pediatric care).
–Medicare: 5 to 7 years
–AHA & AHIMA: at least 10 years
Data compression vs. encryption
How to remove expired data and forget persistent
data?
How to establish truthfulness of data?
Interoperability
Sovereign health information systems must be
able to communicate with one another, using
standard data formats and clinical vocabularies.
Examples of current efforts include:
–HL7 messaging standards
–SNOMED-CT vocabularies
–CDA and CCR document standards
Much work remains to be done to make systems
interoperable.
Mass collaboration might be useful in defining
clinical vocabularies and taxonomies.
Concluding Remarks
Hippocratic Database technologies
protect the security of electronic health
records and patient privacy without
impeding the flow of information.
We need not sacrifice security or privacy
to gain value from EHRs for diagnosis,
treatment, and research.
We must focus on:
– Deriving value from bits we know how
to manage.
– Demonstrating what could not be done
before.
Thank you!
Papers: rakesh.agrawal-family.com
Collaborations: [email protected]
[email protected]
References
Active Enforcement
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. "Hippocratic Databases." 28th Int'l Conf. on Very
Large Databases (VLDB), Hong Kong, August 2002.
K. Lefevre, R. Agrawal, V. Ercegovac, R. Ramakrishnan, Y. Xu, D. DeWitt. "Limiting
Disclosure in Hippocratic Databases." Proc. of the 30th Int'l Conf. on Very Large Databases
(VLDB 2004), Toronto, Canada, August 2004.
Compliance Auditing
R. Agrawal, R. Bayardo, C. Faloutsos, J. Kiernan, R. Rantzau and R. Srikant. "Auditing
Compliance with a Hippocratic Database." Proc. of the 30th Int'l Conf. on Very Large
Databases (VLDB 2004), Toronto, Canada, August 2004.
Privacy-Preserving Data Mining
R. Agrawal and R. Srikant. "Privacy-Preserving Data Mining." Proc. of the ACM SIGMOD
Conference on Management of Data, Dallas, May 2000.
A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke. "Privacy Preserving Mining of
Association Rules." Proc. of the 8th ACM SIGKDD Int'l Conference on Knowledge Discovery
in Databases and Data Mining, Edmonton, Canada, July 2002.
Optimal k-Anonymization
R. J. Bayardo and R. Agrawal. "Data Privacy Through Optimal k-Anonymization." To appear in
Proc. of the 21st Int'l Conf. on Data Engineering (ICDE 2005), Tokyo, Japan, April 2005.
Sovereign Information Integration
R. Agrawal, A. Evfimievski, R. Srikant. "Information Sharing Across Private Databases." ACM
Int'l Conf. on Management of Data (SIGMOD), San Diego, California, June 2003.
R. Agrawal, D. Asonov and R. Srikant. "Enabling Sovereign Information Sharing Using Web
Services." Proc. of the ACM SIGMOD Conference on Management of Data, Paris, France,
June 2004.