Transcript Document

Security Problem of Statistical
Databases and Combinatorics
of Finite Sets
Mirka Miller
University of Ballarat
AUSTRALIA
1
Seminar at University of West Bohemia, January 2006
Talk Overview




Confidentiality and Privacy
Statistical Databases
Statistical Compromise
Security Mechanisms


Noise Addition
Query Restriction
 Audit Expert
 Usability
 An application of Combinatorics of Finite Sets to
Solve a Security Problem
 Current Research
Seminar at University of West Bohemia, January 2006
3
Confidentiality and Privacy
 Confidential personal data is kept in










Census databases
Medical databases
Employee databases
Taxation office records
Criminal records
Bank balances and credit records
Phone calls
Shopping habits
Driving records
and many many others…
 Confidentiality of personal data is a major issue
Seminar at University of West Bohemia, January 2006
4
Confidentiality and Privacy
 This data is used for statistical analysis and
knowledge discovery and data mining, and
facilitates research in many areas, including:



Marketing
Medicine
Crime investigation
 Examples of databases:


Census
Medical
Seminar at University of West Bohemia, January 2006
5
Statistical Databases - Census
 Census in Australia goes back more than 200
years





Until mid 1880s musters for convicts, blue books
for free settlers
By 1850 it became a stigma to be the descendant
of a convict
Usually there were about 5 copies of a census –
kept well locked up
Not enough protection: early census books were
ordered destroyed – privacy concerns
Oldest surviving census in Australia is the 1828
NSW census
Seminar at University of West Bohemia, January 2006
6
Statistical Databases - Census (cont.)
 Summary tables based on the census were
published about a year later
 Census in Australia now




Data is kept on a computer
Release of summary tables takes several years
No on-line access
Much more detailed information – privacy
concerns
Seminar at University of West Bohemia, January 2006
7
Statistical Databases - Medical
 Another common concern regarding privacy is
medical records
 Traditionally:

confidentiality was assured by the doctor’s Hippocratic oath
and the privacy of the household
 The situation now:


patient records are usually kept in a computerised database
– access issues
statistical access:: de-personalising: removing patient names
before performing statistics - not enough: de-identified does
not always mean unidentifiable
Seminar at University of West Bohemia, January 2006
8
Statistical Databases
 What is sensitive data?
 Hundred years ago in Australia convict ancestry
 Now more likely to be the amount of money a
person makes or having an illness such as AIDS
or mental illness
 Demand for privacy of personal data is nothing
new
 What is new: our use of computers gives us a
great capacity to gather data, process data,
communicate data, generate statistics based on
the data
Seminar at University of West Bohemia, January 2006
9
Security Problem of SDB
 A statistical database is an ordinary database
that returns only statistical information to user
queries, based on groups of records
 SDB is compromised if it is possible to derive
confidential individual information from statistics.
 The security problem of statistical database
is to control the use of statistical database so
that only statistical information about groups of
records is available and no sequence of queries
is sufficient to infer protected information about
any individual
Seminar at University of West Bohemia, January 2006
10
Statistical Databases
 Various sizes


Few hundreds of records or less, e.g., medical
data collected in an experiment
Hundreds of thousands of records or more, e.g.,
census data, large medical databases
 General purpose or for statistics only
 On-line or off-line
 Static or dynamic
Seminar at University of West Bohemia, January 2006
11
Statistical Database: Example
Table 1. Employee
RowID
Name
Position
Section
Age
Gender
Salary
R1
Adam
Officer
Marketing
29
M
48500
R2
Natasha
Manager
Operations
33
F
62000
R3
Sarah
Officer
Operations
27
F
51000
R4
Smith
Clark
Marketing
24
M
42000
R5
Liz
Officer
Operations
24
F
38500
R6
Martha
Officer
Marketing
32
F
44500
R7
Brown
Clark
Marketing
27
M
33000
R8
Carlyle
Manager
Marketing
35
M
58000
R9
Diane
Clark
Operations
30
F
34500
R10
Joe
Officer
Operations
34
M
37500
Seminar at University of West Bohemia, January 2006
12
Database Compromise: Example
 If anybody knows that Joe is the only Male staff
in the Operations section then it is possible to
disclose his salary using a SUM query
select SUM(Salary)
from Employee
where Section = ‘Operations’ AND Gender = ‘M’;
Answer: 37500
Prevention: Do not allow (do not answer) a query involving
just one record
Seminar at University of West Bohemia, January 2006
13
Database Compromise: Example
 If a statistical query that selects only one record is
not allowed then we can use two queries:
select SUM(Salary)
from Employee
where Section=‘Operations’ AND Rank=‘Officer’ AND Gender=‘F’
Answer: 89500 [rows satisfying the query are R3 and R5]
select SUM(Salary)
from Employee
where Section = ‘Operations’ AND Rank = ‘Officer’;
Answer: 127000 [rows satisfying the query are R3, R5, R10]
Then it can be deduced that Joe’s salary is
127000 − 89500 = 37500
Seminar at University of West Bohemia, January 2006
14
Types of Compromise
 Positive compromise occurs whenever it is
found that an individual belongs to a particular
category or holds a particular data value
 Negative compromise occurs when it is
determined that the individual does not belong to
the category or does not possess a particular
data value
 If everything in a database can be deduced, the
database is completely compromised
 If information concerning at least one individual
can be determined, the database is partially
compromised
Seminar at University of West Bohemia, January 2006
15
Types of Compromise
 If the restricted information about a person is
disclosed, it is considered to be an individual
compromise; on the other hand, the
disclosed information about a group of
individuals is called a group compromise
 A subset S (|S| >1) of records in a database is
relatively compromised if the relative order
of magnitude of the individuals in S is known
Seminar at University of West Bohemia, January 2006
16
Data Inference and Compromise
All statistics contain vestiges of information
about individual records on which they are based
 It is often possible to compare several statistics and be
able to derive some additional information about an
individual – this is called data inference
 If the data inference concerns supposedly
protected (confidential) information then the
statistical database has been compromised
 We are mostly interested in databases used


For day-to-day processing of individual records, as well as
To obtain statistics about various subpopulations of the
database
Seminar at University of West Bohemia, January 2006
17
Security Problem of SDB
 Many information security problems can be
solved by the use of


proper access methods
encryption
but the problem of statistical compromise
cannot be so solved since it concerns an
authorised user performing authorised actions
and the statistics are needed in the clear
Seminar at University of West Bohemia, January 2006
18
Control Mechanisms
 Most data inference techniques rely on set
theory and/or linear algebra
 Many techniques for the prevention of data
compromise have been proposed
 Two basic categories


Noise addition
Query restriction
Seminar at University of West Bohemia, January 2006
19
Noise Addition
 Noise addition techniques include




Data perturbation
Response perturbation
Data swapping
Random queries
 All introduce some error to the answers of
statistical queries
 Problem is, how much error should be used?


If too little then can get approximate
compromise
If too much then the statistics are useless
Seminar at University of West Bohemia, January 2006
20
Query Restriction
 Query restriction techniques include




Query set-size restriction
Query overlap restriction
Maximum order queries
Audit Expert
 Query restriction techniques either answer a
query correctly (precisely) or not at all – possibly
giving an error message
Seminar at University of West Bohemia, January 2006
21
Audit Expert
 The task of auditing may be delegated to the
database system so that the database system
 keeps track of the history of answered
queries and changes in SDB
 checks for possible compromise by every
new query
Query
SDB
Result
User
Seminar at University of West Bohemia, January 2006
22
Audit Expert Advantages
 Absolute Security
By checking the past history of all the answered
queries, auditing allows the SDB to answer a query
only when it is secure to do so
 Maximum Information
Given the previous query history of the SDB,
auditing can provide the maximum information to the
users. This includes accurate answers and as many
query answers to the users as the security of the
SDB permits
Seminar at University of West Bohemia, January 2006
23
Security Problem of SDB - Usability
 A query is answerable if answering it does not
lead to a compromise.
 SDB usability is the ratio of the number of
answerable queries to the number of posable
queries.
[Statistical query types: SUM, COUNT, MAX, MIN, MEAN,
etc.]
 The security problem of SDB is to find a
control mechanism which would prevent
compromise while maximising the usability.
Seminar at University of West Bohemia, January 2006
24
Problem 1
Given: A collection of SUM queries;
Compromise: exact
Question: Is the collection of queries
compromise-free?
Seminar at University of West Bohemia, January 2006
25
Audit Expert
In a database of n+1 records, a collection of k
SUM queries can be thought of as a system of
k linear equations:
1,1x1 + 1,2x2 + . . . + 1,n+1xn+1 = q1
2,1x1 + 2,2x2 + . . . + 2,n+1xn+1 = q2
.
.
.
k,1x1 + k,2x2 + . . . + k,n+1xn+1 = qk
Seminar at University of West Bohemia, January 2006
26
Audit Expert
In matrix form, Mk X = Q
Mk is called the query matrix.
1,1
Mk
=
2,1
...
... .
1,2
.
2,2
.
.
.
1,n+1
.
.
2,n+1
.
...
:
:
k,1
k,2
.
:
.
.
.
k,n+1
In order to avoid compromise we can have at most n
linearly independent queries.
Seminar at University of West Bohemia, January 2006
27
Audit Expert
Table 1. Employee
RowID
Name
Position
Section
Age
Gender
Salary
R1
Adam
Officer
Marketing
29
M
48500
R2
Natasha
Manager
Operations
33
F
62000
R3
Sarah
Officer
Operations
27
F
51000
R4
Smith
Clark
Marketing
24
M
42000
R5
Liz
Officer
Operations
24
F
38500
Salary of all Males
query set
Salary of all Females
Salary of all Officers
Salary of all under 25
(1 0 0 1 0)
(0 1 1 0 1)
(1 0 1 0 1)
(0 0 0 1 1)
Seminar at University of West Bohemia, January 2006
28
Audit Expert
Salary of all Males
query set
Salary of all Females
Salary of all Officers
Salary of all under 25
M4 =
1
0
0
1
0
1
1
0
0
1
1
0
1
0
1
0
0
0
1
1

(1 0 0 1 0)
(0 1 1 0 1)
(1 0 1 0 1)
(0 0 0 1 1)
1
0
0
1
0
0
0
-1
0 . -1
0
0
1
0 . 1
0
0
0
1 . 1
Seminar at University of West Bohemia, January 2006
29
Audit Expert
Using Gaussian elimination we obtain:
Mk = (Ik | M’k)
where Ik is k  k identity matrix and M’k is k(n+1-k)
matrix.
Answer to Problem 1:
Database is compromised if and only if there is a
row in the normalised query matrix Mk that
contains exactly one non-zero element (Chin
and Oszoyoglu, 1982)
Seminar at University of West Bohemia, January 2006
30
Problem 2
Given: A database of n+1 records;
Posable queries: all SUM queries;
Compromise: exact
Question: What is the maximum number
of queries that can be
answered without compromise?
Seminar at University of West Bohemia, January 2006
31
Mathematical Formulation
In a multiset A = {a1, a2, . . . , an}, ai  R, ai  0,
what is the maximum number of partial sums
that are equal to 0 or 1?
What can the numbers a1, a2, . . . , an be to
achieve the maximum?
Seminar at University of West Bohemia, January 2006
32
Mathematical Formulation (cont.)
Where does this formulation come from?
Mn =
1
0 0
0
. . . 0 a1
0
1 0
0
. . . 0 a2
0
0 1
0
. . . 0 a3
0
0 0
1
. . . 0 a4
:
:
:
:
:
0
0 0
0
. . . 1 an
:
Seminar at University of West Bohemia, January 2006
33
Example of Low Usability
Example: n+1=6, maximum number of
answerable queries is 6 (incl. empty query)
M5 =
1
0
0
1
0
0
0
0
0
. 0
1
1
0
0
1
0
. 0
1
0
0
0
1
. 0
1
0
0
0
0
. 1
1
Seminar at University of West Bohemia, January 2006
34
Example of High Usability
Example: n+1=6, maximum number of
answerable queries is 20.
M5 =
1
0
0
1
0
0
0
0
0
. 0
1
-1
0
0
1
0
. 0
1
0
0
0
1
. 0
-1
0
0
0
0
. 1
1
Seminar at University of West Bohemia, January 2006
35
Usability of Secure SDB
Answer to Problem 2:
Theorem (Miller, Roberts, Simpson, 1991)
The maximum usability of a statistical database of
n records in which all queries are posable, for
exact compromise, is
U


 n 
  n  
 2 
 n 
2
Seminar at University of West Bohemia, January 2006
36
Usability of Secure SDB
Proof makes use of a symmetric chain decomposition of an n-set and
the following theorem.
Theorem (Anderson, 1987)
The subsets of a n-set can be partitioned into
(n choose n/2 ) disjoint symmetric chains.
The number of symmetric chains containing
exactly i subsets is equal to
(n choose (n+i)/2 ) - (n choose (n+i+1)/2 )
Seminar at University of West Bohemia, January 2006
37
Symmetric Chain Decomposition
Symmetric chain decomposition of a 4-set S={a,b,c,d}
Ø  {a}  {a,b}  {a,b,c}  {a,b,c,d}
{b}  {b,c}  {b,c,d}
{c}  {c,d}  {a,c,d}
{d}  {a,d}  {a,b,d}
{a,c}
{b,d}
Seminar at University of West Bohemia, January 2006
38
Symmetric Chain Decomposition
Suppose a=-2, b=-1, c=1, d=1. Associate with each
subset the sum of the values of its elements.
Ø  {a}  {a,b}  {a,b,c}  {a,b,c,d}
0
-2
-3
-2
-1
{b}  {b,c}  {b,c,d}
-1
0
1
{c}  {c,d}  {a,c,d}
1
2
0
{d}  {a,d}  {a,b,d}
1
{a,c}
-1
-1 and
-2
{b,d}
0
Seminar at University of West Bohemia, January 2006
39
Symmetric Chain Decomposition
a=-2, b=-1, c=1, d=1. Rearrange the order of the subsets so
that the partial sums in an “antichain” are strictly increasing:
Ø  {a}  {a,b}  {a,b,c}  {a,b,c,d}
{a,b} ~ {b} ~ Ø ~
{c} ~
{c,d}
-3
-1
0
1
2
{b}  {b,c}  {b,c,d}
{c}  {c,d}  {a,c,d}
{d}  {a,d}  {a,b,d}
{a,c}
{b,d}
Seminar at University of West Bohemia, January 2006
40
Symmetric Chain Decomposition
The rest of the proof is a routine application of
Anderson’s Theorem, adding once the number of
singleton antichains plus twice the number of nonsingleton antichains.
{a,b} ~ {b} ~ Ø ~
{c} ~
{c,d}
{a} ~ {a,c} ~ {a,c,d}
{a,b,c} ~ {a,b,c,d} ~ {b,c,d}
{a,b,d} ~ {b,d} ~ {d}
{b,c}
{a,d}
Seminar at University of West Bohemia, January 2006
41
Maximum Usability of Secure SDB
The maximum usability of a statistical database of
n records in which all queries are posable, for
exact compromise, is
U


 n 
  n  
 2 
 n 
2
It is achieved by having roughly half +1’s and half
-1’s.
Seminar at University of West Bohemia, January 2006
42
Our Research
Identifying a new type of compromise called a
relative compromise. We gave techniques
both for achieving it as well as preventing
against it (with Jennie Seberry)
 Finding general upper bounds for the number
of statistical queries that can be safely
answered (with Jamie Simpson, Ian Roberts,
Ljiljana Brankovic)
 For practical reasons we allow to include a
given set of queries before maximising (with
Ljiljana Brankovic, Jozef Siran)

Seminar at University of West Bohemia, January 2006
43
Our Research
Supplementary knowledge is always needed
for the identification of a particular record. We
have given a general framework of data
inference and compromise incorporating
supplementary knowledge
 Lately we became interested in range queries
(with Ljiljana Brankovic, Peter Horak)
 Implementing expert system including legal
requirements (with Vivek Mishra, Andrew
Stranieri, Joe Ryan)
 Theoretical research continues (with Yuliya
Lenard)

Seminar at University of West Bohemia, January 2006
44
Our Research
The security problem of statistical
databases is
 An important practical problem
 Should
be of concern to any owner of a database
that releases statistics based on confidential
personal data
 Is becoming more and more important as new
privacy acts and legislations are introduced
 As insurance companies get involved it will be
necessary to demonstrate that security measures
are used
Seminar at University of West Bohemia, January 2006
45
Our Research
The security problem of statistical
databases is
 An ideal topic for a PhD thesis
 An
excellent vehicle for theoretical research,
practical implementations, comparative study of
security mechanisms
 Links with other researchers in Australia and
overseas
 Information security courses are being introduced in
many universities so chances of employment may
be higher than in other areas
Seminar at University of West Bohemia, January 2006
46
Thank you
Seminar at University of West Bohemia, January 2006
47
Model of a SDB Compromise
Seminar at University of West Bohemia, January 2006
48
Example of Data Inference
Seminar at University of West Bohemia, January 2006
49
Example of a SDB Compromise
Seminar at University of West Bohemia, January 2006
50
Example of a SDB Compromise
Seminar at University of West Bohemia, January 2006
51
Example of a SDB Compromise
Seminar at University of West Bohemia, January 2006
52