How To Tell A Secret Without Revealing It
Download
Report
Transcript How To Tell A Secret Without Revealing It
How To Tell A Secret
Without Revealing It
Enhanced Data Privacy in a Distributed Implementation
of the Smith-Waterman Genome Sequence Comparison
Algorithm
Barry Lawson
University of Richmond
Outline
•
•
•
•
•
•
•
Distributed volunteer computing
A problem
Related work
A real-world application
Our enhanced privacy approach
Results & analysis
Conclusions, future & ongoing work
Our Scenario
• You have a very large, compute
intensive project
Your PC
(few GFLOPS)
NOW (GFLOPS)
Supercomputer (< 280.6 TFLOPS)
~$128M
What To Do?
Welcome to the
Internet, my friend.
How can I help you?
Distributed Volunteer
Computing (DVC)
Participants (Bob)
(< 200 TFLOPS)
Supervisor (Alice)
DVC Computation
• Large-scale distributed computation:
– compute intensive
– easily parallelizable
• Supervisor:
– divide computation into tasks (independent)
– ship tasks to participants
– collect significant results
• Participants:
– download and execute tasks (when o/w idle)
– return significant results
Real-world Examples
DVC Group @ Richmond
• Faculty
– Doug Szajda (CS)
– Barry Lawson (CS)
– Jason Owen (Statistics)
• Students
– Current
Mike Pohl, Greg Steffensen, Andy White
– Past
Ed Kenney (CMU), Dan Upton (UVA), Rom Chan, Trin C.
– Four more this summer
Stefan Chipilov, Brittany Williams, Matt King, Ivan Jibaja
Telling a Secret Without
Revealing It
• A problem:
– Participants are untrustworthy
– Code executes outside supervisor’s control
– Computation data may be proprietary
• Goal:
– Participants provide meaningful results
– Supervisor does not divulge data
Related Work
(not exhaustive)
• Computing with encrypted data:
– Alice has x, wants Bob to compute f(x)
But does not want to divulge x
– Alice gives Bob x’ and f’( )
– Bob computes f’(x’)
• The key:
– Alice can determine f(x) from f’(x’)
– Bob cannot determine x from x’ and/or f’(x’)
• Difficult (often impossible) in practice
Less Formally
x
x’
f
f’
“Alice”
?
f(x)
“Alice”
f’(x’)
f’(x’)
“Bob”
Flexibility In Our Context
• The computation:
– Alice (supervisor) has many x’s
– Bob (participant) determines x’s that
are significant
• Alice doesn’t need the value f(x)
• Alice will post-process
– A few false positives are OK
– Sufficient accuracy: flexibility in f’( )
The Adversary
• Assumed to be intelligent
– can decompile, analyze, modify code
– understands task algorithms
– understands enhanced privacy scheme(s)
• Motivation
– may not be obvious: business competitor?
– may not care if leak is detected
General Model
• The Computation:
– evaluate an algorithm f : D
for all x in D
R
• Task T( ):
– partition D into subsets Di
– T(Di) evaluates f(xi) for all xi in D
• Filter function G( ):
– determines “significance”
– returns indices of significant xi
D
D1
D2
D3
Our General Approach
• Transform Di, f, G into Di’, f’, G’
• Replace task T(Di) with T(Di’)
• Desirable properties:
– T(Di) does not leak additional info
about values in Di
– significance in T(Di)
significance in T(Di’)
– any difference is reasonably small
In Reality…
• Providing desired properties is difficult
– even with increased flexibility
– impossible for some apps
• When possible, application-specific
• Bottom line: we have a potential approach
– where few, if any, others exist
Application:
Genome Sequence Comparison
• Compare sequences over genome alphabet
∑ = {A,C,G,T}
• Track evolutionary changes by aligning
columns of sequences (an alignment)
• E.g.:
CTGTTA
CAGTTA
Sequence Evolution
• Deletion:
CTGTTA
CTG–TA
• Insertion:
C–TGTTA
CGTGTTA
• Substitution:
CTGTTA
CAGTTA
indels
Sequence Evolution
• After several “generations”
C–TGT––TA––
CTA–TGCTACG
• Note: # of alignments is huge
(for realistic-length sequences)
Alignment Types
• Global alignment
– considers entire sequence
• Local alignment
– considers substrings
– biologists usually use local
Measuring Alignments
• Scoring function:
+1 if symbols match
-1 if not
• Gap penalty
– g(k) = a + b(k-1)
– k is gap length
(# consecutive dashes in single sequence)
• Alignment score:
– sum of column scores minus gap penalties
A Simple Example
• Global alignment:
– Scoring function: +1 match, -1 no match
– Gap penalty: g(k) = 2 + 1(k-1)
C – T G T – – T A – –
C T A – T G C T A C G
+1 -2
-1
-2
+1
-3
+1 +1
-3
• Alignment score: +4 - 11 = -7
Smith-Waterman
• Dynamic programming algorithm
• Produces an optimal alignment
Global: O(n2)
Local: O(n3)
• Implemented on commercial DVC
platforms
Significance in S-W
• Significance of scores based on probability
• Empirical evidence:
– given randomly-generated sequences
– scores exhibit extreme value distribution
Determining Significance
• Choosing a significance threshold p :
– want small probability that a random score >p
– typically, probability < 0.003
p
A Smith-Waterman Task
• Pairwise comparison of two sets of
sequences, A and B
– A : proprietary sequences
– B : sequences from public database
• Returned: indices of well-matched pairs
• Notation: T(A,B,s,g,p)
Our Transformation
• Use offset sequences:
– compare relative distances b/w specific
nucleotides
• X: GCACTTACGCCCTTACGACG
–
–
–
–
F(X,A) = {3,4,8,3}
F(X,C) = {2,2,4,2,1,1,4,3}
F(X,G) = {1,8,8,3}
F(X,T) = {5,1,7,1}
Modified Tasks
• X: GCACTTACGCCCTTACGACG
F(X,C) = {2,2,4,2,1,1,4,3}
• Y: GCACTCGCCACTTAGCACG
F(Y,C) = {2,2,2,2,1,2,5,2}
• Apply S-W to F(X,C) and F(Y,C)
– Scoring function, gap penalty
– “Goodness” threshold
Intuition
• Similar sequences
similar offsets
– consider effects of indels, substitutions
• What about false positives?
– multiple nucleotides
• e.g., assign A & C tasks to distinct participants
– good match if both tasks indicate
significance
CAGGATCTCAAGC
A
2351
C
1624
CAGCATATCACGT
2323
1352
“Alice”
2351
1624
2323
1352
?
?
“Bob 1”
“Bob 2”
Using Multiple Nucleotides
• Maximum method
– one task for each of A,C,G,T
– result significant if any of the four indicate
• Adding method
– one task for each of A,C,G,T
– result significant if sum of four scores
indicates significance
• Costs reduced in either case
– on average, 1/4 length of original sequence
– runtime for an offset sequence ~1/64
Does This Provide Real
Data Privacy?
• Recall desired properties:
1. T(Di) does not leak additional info
about values in Di
2. significance in T(Di)
significance in
T(Di’)
3. any difference is reasonably small
Data Privacy?
• Property 1 fails:
– T(Di) does leak additional info about values in Di
– adversary knows all info about one nucleotide
• How much info is leaked?
– conditional entropy gives rough estimate
– e.g., N = 600, C∂ = N/4
• 487 bits (of 1200) leaked
• 713 bits of uncertainty remain
Analysis
• Clearly, not provable security
• Suggests two questions:
1. Can adversary determine additional
symbols; if so, how many?
2. How much info leakage is too much?
“4 out of 5 [Biologists] Agree”
• Given only the position of a single
nucleotide literal:
1. No additional nucleotides can be inferred
2. No “biologically useful” information that
can be inferred
• Given current understanding of the structure
and function of the genome
Does It Work?
• In general, yes
– strong correlation b/w our scores and S-W
– not as sensitive as S-W
• some “weak” matches missed
• Via statistical inference:
– very few false positives: < 10-4
– very few false negatives (usually none)
An Extension
• Sequences can be “masked”
– For each task, choose random binary mask
– Remove from sequence all “zero” elements
X: 2 2 4 2 1 1 4 3
11101110
224114
• Our experiments suggest mask with “1” in
90% of positions works well
Simulation Results
• Well-matched sequences artificially generated
– Substring mutated over several generations
– Placed at random location into random sequences
• Scoring function: +1 match, -1 no match
• Gap penalty: g(k) = 2 + 1(k-1)
• 10000 comparisons, no mask, maximum method
• Sequence length 600-800, matching portion length 300, average
of 52.5 subs and 52.5 indels
• 10000 comparisons, no mask, adding method
• Sequence length 600-800, matching portion length 300,
average of 52.5 subs and 52.5 indels
• 1000 comparisons, no mask, maximum method
• Sequence length 2000, matching portion length 1000,
average of 150 subs and 150 indels
• 1000 comp, 90% mask, maximum method
• Sequence length 1000-1300, matching portion length 500,
average of 86.25 subs and 86.25 indels
Conclusions
• Introduced notion of sufficient accuracy
• Presented a strategy for enhancing data
privacy in important real-world
application
• Present important real-world app that:
– requires privacy
– efficiently parallelizable
• Potential first entry for benchmark suite
of apps for privacy study
Future Work
• Solution is less than ideal
– lack of formal privacy model / provable security
– need more testing on real genetic data
• But it’s a start
– general problem is very difficult
– this is a potential avenue of attack
– S-W requires more careful study in this context
• Consider additional apps
Ongoing DVC Work @ UR
• Augmenting BOINC software for
campus-wide distribution
– want to collect participant/server/data info
& patterns
– Greg Steffensen
• Exploring AI to catch malicious behavior
– can we catch omitted results?
– Andy White, Matt Kretchmar (Denison CS)
Thanks
• NSF CyberTrust
• Doug Szajda, Jason Owen
• All the UR students
• UR Biologists:
– Rafael de Sa, Laura Runyen-Janecky,
Joe Gindhart
• Tadayoshi Kohno (UCSD)