Bob - VideoLectures.NET

Download Report

Transcript Bob - VideoLectures.NET

Collusion-Resistant
Anonymous
Data Collection Method
Mafruz Zaman Ashrafi
See-Kiong Ng
Institute for Infocomm Research
Singapore
Introduction
 Quality data is a pre-requisite to obtain
good data mining results.
 Collecting good quality data requires
efforts and money.
 Internet is a convenient and low-cost
platform for large-scale data collection.
Some Motivating Examples
Corporate Survey
A large organization wishes to poll its
employees for sensitive information.
 eg. How satisfied they are with their
bosses’ management skills.
- Individuals need to
rate their bosses.
- However, they are
afraid of the price
to pay for honesty.
Health Information
A drug company wishes to find out adverse
effects of a drug.
 eg. Relationship between the effects of a
drug with other drugs.
- Patients need to
disclose all the drugs
they are taking.
- However, disclosing
drug info may reveal
health condition.
Traffic Monitoring
Individual drivers wish to avoid roads with
problematic conditions.
 eg. Find out the congested road
intersections and other bottlenecks.
- Individuals need to
disclose their GPS
info.
- However, disclosing
GPS info may reveal
current position.
Introduction Cont’d..
 However, collecting data online has its
challenges.
 Privacy is the number-one concern for
online respondents.
 Respondents are reluctant to provide
truthful information if their privacy is
not protected.
Technical Challenges
Objective: Online Data Collection
Two Actors: Data Collector and Respondents
- The data collector wants to obtain the
responses from a set of respondents.
- The respondents submit honest responses only
if the data collector is unable to link a particular
response and its respondent.
Challenges
1. How does the data collector guarantee that it
is unable to associate a particular response to
the corresponding respondent?
2. How can a collusion attack be mitigated?
3. How can an honest respondent pull out his
response without revealing it to the data
collector if he finds a threat to his anonymity?
4. How can we reduce the computational and
communication overhead?
Related Works
1. Randomized Response
- Respondents’ responses are associated with the result of the
toss of a coin.
- Only a respondent knows whether the answer reflects the
toss of the coin or his true experience.
Pros:
- A well-known technique.
- Easy to use.
Cons:
- Adds noise to the result in response set that could distort the
accuracy of the data mining results.
Related Works Cont’d…
2. Cryptographic Techniques
-
Respondents employ two sets of keys to encrypt their
responses before sending to the data collector.
-
Each respondent strips off a layer off encryption sequentially
and shuffles decrypted results.
-
All respondents verify the intermediate results before the
data collector obtains the actual response set.
Pros:
- A deterministic technique.
- The data mining results are accurate.
Cons:
- Vulnerable against collusion attacks.
- Higher communication overhead.
Building Blocks of Our Approach
1. ElGamal Crypto
-
is a asymmetric public key encryption scheme.
-
is a probabilistic encryption.
-
achieves semantic security.
-
is malleable.
2. Substitution Cipher
- Replace a character with another character.
- Example:
The Hybrid Model
- Employs both ElGamal and Substitution Cipher.
- Builds an Onion for a response.
- Removes encryption layer (De-Onion) will result
in the original response.
An Onion
Original response
ElGamal Encryption
Substitution Cipher
ElGamal Encryption
An Onion
Layer
The Hybrid Model Cont’d..
An example
Onion
De-Onion
Original
response
1234567890
9809364789
7893456720
2901560011
1234567890
9809364789
7893456720
2901560011
Original
response
The Protocol
The Protocol
The Protocol has five phases
1. Data Preparation
2. Data Submission
3. Anonymization
4. Verification
5. Decryption
Phase I: Data Preparation
Suppose there are 3 respondents (Alice, Bob and Carol).
Bob’s Data Preparation Process
Bob’s Original
Response
1234
6652
5436
7065
1039
2309
3905
DM’s. Pri key
8902
Bob’s Sec. key
2453
Alice’s Sec. key
8091
Carol’s Sec. key
7609
9081
2098
8893
Bob’s
Encrypted
Response
dBob
Phase I: Data Preparation (cont’d..)
Bob also computes an partial intermediate
verification code WBob
Bob
Bob
Alice
Carol
…
…
Alice
…
Carol
…
…
WBob = 6652
4240
7056
bb
…
Phase II: Data Submission
- Each participant submits an encrypted response i.e.
and W to the data miner.
The Data Miner
- Computes the verification code ΩC = WBobWAlice WCarol
- Encrypts ΩC using its secondary key and sends the result in
encrypted value to each participant.
- Shuffles response set {d1 , d2 , d3 } = {
- Sends {d1 , d2 , d3 } to Carol.
,
,
}
Phase III: Anonymization
- Carol “de-onions” one layer from each of the responses
{d1 , d2 , d3 } . eg,
Intermediate verification
8893
ElGamal
Decryption
3905
Substitution
De-Cipher
7056
ElGamal
Decryption
5607
d’x
Phase III: Anonymization (cont’d..)
- … and computes intermediate verification Vcarol.
Bob
Alice
Carol
Carol
Alice
Bob
VCarol = 7809
2291
….
….
….
….
….
….
- Shuffles the results in set {d’y ,d’z ,d’x} = {
- Sends {d’y ,d’z ,d’x} to the Data Miner.
6790 
VC
,
,
}
Phase III: Anonymization (cont’d..)
- The Data Miner sends the randomize set {d’y ,d’z ,d’x} to
next participant (eg, Alice)
- Similar to Carol, Alice also ‘de-onion’ one layer from each
element of {d’y ,d’z ,d’x}.
- Computes intermediate verification.
- Shuffles the results in set {d’p ,d’q ,d’r}={
- Sends {d’p ,d’q ,d’r} to the Data Miner.
,
,
}
Phase III: Anonymization (cont’d..)
- The data miner sends {d’p ,d’q ,d’r} to the last participant
(i.e. Bob), who ‘de-onion’ another layer from this set.
- Computes intermediate verification, shuffles the result in
set ‘S’= {d’m ,d’n ,d’o} and sends S to data miner.
Phase IV: Verification
- Data miner computes the final secondary encryption value
‘R’ from S.
- Sends ‘R’ along with its secondary secret key to all
participants.
- Bob, Alice and Carol decrypt intermediate verification code
they received at Phase 2.
- They also compute ΩV and check ΩV = ΩC
- If ok, each of them sends their secondary secret key to the
data miner.
Phase V: Decryption
- Data miner uses the respondents’ secondary keys to
strip off remaining encryption layers from S.
- It uses its own primary key to strip off the final layer to
reveal the original responses {….,1234,…..}.
Results and Analysis
Performance Analysis
- Communication Overhead
• Brickell et al. KDD 2006
Complexity
- Computation
-
Respondent’s, O(N)
-
Data Miner, O(N2)
- Communication
-
Participant’s, O(N)
Conclusion
 The privacy of individual is an important
issue in online data collection.
 Ignoring respondents’ privacy will result in
inaccuracy in the data.
 Privacy-preserving online data collection
must be (i) deterministic and (ii) efficient.
Conclusion
 Deterministic: We employ crypto
techniques
 Collusion Resistance: We incorporate
onion/de-onion technique (using ElGama +
Substitution) to create a protective layer
against collusion
 Efficiency: Verification is done on single
values instead of entire datasets
Thank you
Q&A
The Protocol cont’d..
Suppose there are 3 respondents (Alice, Bob and Carol).
1. Data Preparation (Bob’s)
Bob’s Original
Response
DM’s. Pri
key
1234
8902
Bob’s Sec.
key
2453
Alice’s Sec.
key
8091
Carol’s Sec.
key
7609
Bob’s Pri. key
2094
Alice’s Pri. key
5607
Substitution
Cipher
Carol’s Pri.
key
4240
7056
Alice’s Pri.
key
Substitution
Cipher
9081
3905
Bob’s Pri.
key
Carol’s Pri.
key
1039
Substitution
Cipher
8893
Bob’s Encrypted
Response dBob
- Bob generates a random number θ and computes ba
= gθ and bb = gθ+7609
- Bob also generates WBob = 665242407056bb
6652
The Protocol cont’d..
Suppose there are 3 respondents (Alice, Bob and Carol).
1. Data Preparation (Bob’s)
Bob’s Original
Response
DM’s. Pri
key
1234
8902
Bob’s Sec.
key
2453
Alice’s Sec.
key
8091
Carol’s Sec.
key
7609
Bob’s Pri. key
2094
Alice’s Pri. key
5607
Substitution
Cipher
Carol’s Pri.
key
4240
7056
Alice’s Pri.
key
Substitution
Cipher
9081
3905
Bob’s Pri.
key
Carol’s Pri.
key
1039
Substitution
Cipher
8893
Bob’s Encrypted
Response dBob
- Bob generates a random number θ and computes ba
= gθ and bb = gθ+7609
- Bob also generates WBob = 665242407056bb
6652
Related Works Cont’d…
3. Mixed Networks
-
Respondents send response to an intermediate hop.
-
Each hop strips off a layer of encryption, which allows them
to obtain the next hop’s address and forward the result to it.
-
The process continues till the response reached to the data
collector.
Pros:
-
Require less communication overhead.
Cons:
-
Probabilistic approach and only works well if all participants
and honest.
-
Intermediate hops can collaborate to breach an honest
respondent’s anonymity.
The Hybrid Model Cont’d..
An example
Onion
1234567890
ElGamal Encryption
9809364789
Substitution Cipher
7893456720
ElGamal Encryption
2901560011
Original
response
De-Onion
2901560011
ElGamal Decryption
7893456720
Substitution De-cipher
9809364789
ElGamal Decryption
1234567890
Original
response