
Aspects of Bayesian Inference and Statistical Disclosure Control in Python

Duncan Smith
Confidentiality and Privacy Group
CCSR
University of Manchester
Introduction

- Bayesian Belief Networks (BBNs)
  - probabilistic inference
- Statistical Disclosure Control (SDC)
  - deterministic inference (attribution)
Bayesian Belief Networks

- Decision-making in complex domains
- Hard and soft evidence
- Correlated variables
- Many variables
Bayes’ Rule

P(A|B) = P(B|A)P(A) / P(B)

A prior belief and evidence are combined to give a posterior belief.
Venn Diagram

[Venn diagram of events A and B, showing the regions ‘A only’, ‘both A & B’, ‘B only’ and ‘neither A nor B’]
Inference

1. Prior probability table P(A)

          a     ¬a
        0.7    0.3

2. Conditional probability table P(B|A)

          a     ¬a
   b    3/7    2/3
   ¬b   4/7    1/3
        ---    ---
          1      1

3. Produce joint probability table by multiplication

          a     ¬a
   b    0.3    0.2
   ¬b   0.4    0.1
        ---    ---
        0.7    0.3

4. Condition on evidence (B = b is observed)

          a     ¬a
   b    0.3    0.2    (marginal probability of b = 0.5)

5. Normalise table probabilities to sum to 1

          a     ¬a
   b    0.6    0.4
def Bayes(prior, conditional, obs_level):
    """Simple Bayes for two categorical variables. 'prior' is a
    Python list. 'conditional' is a list of lists (the 'row'
    variable conditional on the 'column' variable). 'obs_level' is
    the index of the observed level of the row variable."""
    levels = len(prior)
    # condition on the observed level of the row variable
    result = conditional[obs_level]
    # multiply the values by the prior probabilities
    result = [result[i] * prior[i] for i in range(levels)]
    # the marginal probability of the observed level
    marg_prob = sum(result)
    # normalise the values to sum to 1
    posterior = [value / marg_prob for value in result]
    return posterior
Note: conditioning can be carried out before calculating the joint probabilities, reducing the cost of inference.
>>> A = [0.7, 0.3]
>>> B_given_A = [[3.0/7, 2.0/3], [4.0/7, 1.0/3]]
>>> Bayes(A, B_given_A, 0)
[0.59999999999999998, 0.39999999999999997]
>>>

- The posterior distribution can be used as a new prior and combined with evidence from further observed variables, as sketched below
- Although computationally efficient, this ‘naïve’ approach implies assumptions that can lead to problems
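For example, the posterior above can be recycled as the prior when a second variable is observed. A minimal sketch, in which the table P(C|A) is invented purely for illustration (output shown rounded):

>>> posterior = Bayes(A, B_given_A, 0)      # evidence B = b
>>> C_given_A = [[0.5, 0.25], [0.5, 0.75]]  # hypothetical P(C|A)
>>> Bayes(posterior, C_given_A, 0)          # further evidence C = c
[0.75, 0.25]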
Naive Bayes

P(A,B,C) = P(C|A)P(B|A)P(A)

[graph: A → B, A → C]

A ‘correct’ factorisation

P(A,B,C) = P(C|A,B)P(B|A)P(A)

[graph: A → B, A → C, B → C]
Conditional independence

- The Naive Bayes example assumes:

  P(C|A,B) = P(C|A)

- But if valid, the calculation is easier and fewer probabilities need to be specified

The conditional independence implies that if A is observed, then evidence on B is irrelevant in calculating the posterior of C.
A Bayesian Belief Network

[graph: R → W, R → H, S → H]

R and S are independent until H is observed.
A Markov Graph

[graph: R — W, R — H, S — H, R — S]

The conditional independence structure is found by marrying parents with common children.
Factoring

- The following factorisation is implied:

  P(H,R,S,W) = P(W|R) P(R) P(S) P(H|R,S)

- So P(S) can be calculated as follows (although there is little point, yet):

  P(S) = P(S) Σ_R P(R) Σ_H P(H|R,S) Σ_W P(W|R)

- If H and W are observed to be in states h and w, then the posterior of S can be expressed as follows (where ε denotes ‘the evidence’):

  P(S|ε) ∝ P(S) Σ_R P(R) P(H=h|R,S) P(W=w|R)
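A minimal sketch of evaluating that last expression in plain Python. The factorisation comes from the slide; every probability below is invented for illustration:

# Hypothetical CPTs for the network R -> W, R -> H <- S (all numbers invented)
P_R = {True: 0.2, False: 0.8}                      # P(R)
P_S = {True: 0.1, False: 0.9}                      # P(S)

def P_W_given_R(w, r):                             # P(W | R)
    p = 0.95 if r else 0.1
    return p if w else 1 - p

def P_H_given_RS(h, r, s):                         # P(H | R, S)
    p = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.85, (False, False): 0.05}[(r, s)]
    return p if h else 1 - p

h, w = True, True                                  # the observed evidence

# unnormalised posterior: P(S) * sum_R P(R) P(H=h|R,S) P(W=w|R)
unnorm = {}
for s in (True, False):
    unnorm[s] = P_S[s] * sum(P_R[r] * P_H_given_RS(h, r, s) * P_W_given_R(w, r)
                             for r in (True, False))

norm = sum(unnorm.values())
posterior = {s: p / norm for s, p in unnorm.items()}  # P(S | H=h, W=w)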
Graph Triangulation

[figure: three stages of triangulating a graph on nodes A, B, C, D, E by adding fill-in edges]
Belief Propagation

[figure: cliques {R,H,S} and {W,R} linked by the separator {R}]

Message passing in a Clique Tree

[figure: a directed tree over {S}, {R,S}, {R,H,S} and {W,R}]

Message passing in a Directed Junction Tree
A Typical BBN
Belief Network Summary

- Inference requires a decomposable graph
- Efficient inference requires a good decomposition
- Inference involves evidence instantiation, table combination and variable marginalisation
Statistical Disclosure Control

- Releases of small area population (census) data
- Attribution occurs when a data intruder can make inferences (with probability 1) about a member of the population
                     Department
Profession         A     B     C   Row sum
Lawyer            18     4     2        24
Accountant         2     3     0         5
Col sum           20     7     2        29
Negative Attribution - An individual who is an accountant does not work for Department C.

Positive Attribution - An individual who works in Department C is a lawyer.

- Release of the full table is not safe from an attribute disclosure perspective (it contains a zero)
- Each of the two marginal tables is safe (neither contains a zero)
- Is the release of the two marginal tables ‘jointly’ safe?
The Bounds Problem

- Given a set of released tables (relating to the same population), what inferences about the counts in the ‘full’ table can be made?
- Can a data intruder derive an upper bound of zero for any cell count?
A non-graphical case

[graph: triangle on nodes A, B, C]

- All 2×2 marginals of a 2×2×2 table
- A maximal complete subgraph (clique) without an individual corresponding table
Marginal table: Var2 × Var1

            Var1
  Var2      A     B
  C         3     9
  D         2     2

Marginal table: Var3 × Var1

            Var1
  Var3      A     B
  E         1    10
  F         4     1

Marginal table: Var3 × Var2

            Var2
  Var3      C     D
  E         8     3
  F         4     1

Full table: Var3 × (Var1 and Var2)

            Var1 and Var2
  Var3    A,C   A,D   B,C   B,D
  E         0     1     8     2
  F         3     1     1     0

- Original cell counts can be recovered from the marginal tables, as the sketch below demonstrates
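A brute-force sketch of that recovery for the tables above: fix x = n(A,C,E), derive every other cell from the margins, and keep only the non-negative, consistent assignments. The dictionaries simply re-encode the three released marginal tables:

# The three released 2x2 marginal tables, re-encoded as dictionaries
M12 = {('A','C'): 3, ('A','D'): 2, ('B','C'): 9, ('B','D'): 2}   # Var1 x Var2
M13 = {('A','E'): 1, ('A','F'): 4, ('B','E'): 10, ('B','F'): 1}  # Var1 x Var3
M23 = {('C','E'): 8, ('C','F'): 4, ('D','E'): 3, ('D','F'): 1}   # Var2 x Var3

solutions = []
for x in range(min(M12['A','C'], M13['A','E'], M23['C','E']) + 1):
    n = {('A','C','E'): x}
    # the remaining cells are forced by the margins
    n['A','C','F'] = M12['A','C'] - x
    n['A','D','E'] = M13['A','E'] - x
    n['B','C','E'] = M23['C','E'] - x
    n['A','D','F'] = M12['A','D'] - n['A','D','E']
    n['B','C','F'] = M12['B','C'] - n['B','C','E']
    n['B','D','E'] = M23['D','E'] - n['A','D','E']
    n['B','D','F'] = M12['B','D'] - n['B','D','E']
    # keep the assignment if it is non-negative and fits every margin
    if (all(v >= 0 for v in n.values())
            and n['A','C','F'] + n['A','D','F'] == M13['A','F']
            and n['B','C','F'] + n['B','D','F'] == M13['B','F']
            and n['B','C','E'] + n['B','D','E'] == M13['B','E']
            and n['A','C','F'] + n['B','C','F'] == M23['C','F']
            and n['A','D','F'] + n['B','D','F'] == M23['D','F']):
        solutions.append(n)

assert len(solutions) == 1   # only one table fits: the original counts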
Upper bounds for the Profession × Department table:

                     Department
Profession         A     B     C   Row sum
Lawyer            20     7     2        24
Accountant         5     5     2         5
Col sum           20     7     2        29

Each cell’s upper bound is the minimum of its relevant margins (Dobra and Fienberg).
SDC Summary

- A set of released tables relating to a given population
- If the set of released tables is graphical (the tables correspond to the cliques of a graph) and that graph is decomposable, then the upper bounds can be derived efficiently
Common aspects

- Graphical representations
  - graphs / cliques / nodes / trees
- Combination of tables
  - pointwise operations

- BBNs: pointwise multiplication
- SDC: pointwise minimum, and pointwise addition / pointwise subtraction (for calculating exact lower bounds)
Coercing Numeric built-ins

- A table is a numeric array with an associated list of variables
- Marginalisation is trivial, using the built-in Numeric.add.reduce() function and removing the relevant variable from the list
- Conditioning is easily achieved using a Numeric.take() slice, appropriately reshaping the array with Numeric.reshape() and removing the variable from the list (both operations are sketched below)
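A sketch of both operations. numpy (the successor to Numeric) is used so the snippet runs today; the calls mirror their Numeric counterparts:

import numpy as np   # numpy stands in for the legacy Numeric package

table = np.array([[18, 4, 2],
                  [ 2, 3, 0]])
variables = ['profession', 'department']

# marginalise: sum out 'profession' and drop it from the variable list
axis = variables.index('profession')
dept = np.add.reduce(table, axis)
dept_vars = [v for v in variables if v != 'profession']
# dept -> array([20, 7, 2]), dept_vars -> ['department']

# condition: slice on department == C (index 2), reshape, drop the variable
axis = variables.index('department')
cond = np.reshape(np.take(table, [2], axis), (2,))
cond_vars = ['profession']
# cond -> array([2, 0])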

Pointwise multiplication

- Numeric.multiply() generates the appropriate table IF the two tables have identical ranks and variable lists
- This is ensured by adding new axes (Numeric.NewAxis) for the ‘missing’ axes and transposing one of the tables (Numeric.transpose()) so that the variable lists match
array([24, 5])   ['profession']   shape (2,)
array([20, 7, 2])   ['department']   shape (3,)

(after preprocessing)

array([[24],
       [ 5]])   ['profession', 'department']   shape (2, 1)
array([[20, 7, 2]])   ['profession', 'department']   shape (1, 3)

>>> prof * dept
array([[480, 168,  48],
       [100,  35,  10]])
['profession', 'department']
>>> (prof * dept).normalise(29)
array([[ 16.551,  5.793,  1.655],
       [  3.448,  1.206,  0.344]])
['profession', 'department']
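The first preprocessing stage shown above can be sketched as follows, with numpy again standing in for Numeric (np.newaxis playing the role of Numeric.NewAxis):

import numpy as np

prof = np.array([24, 5])       # ['profession']
dept = np.array([20, 7, 2])    # ['department']

# add a length-1 axis for each 'missing' variable so that both tables
# share the same rank and (after any transposition) the same variable list
prof_aligned = prof[:, np.newaxis]     # shape (2, 1)
dept_aligned = dept[np.newaxis, :]     # shape (1, 3)

# broadcasting now performs the pointwise multiplication
product = prof_aligned * dept_aligned  # shape (2, 3)
fitted = product / 29.0                # normalise by the population size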

Pointwise minimum / addition / subtraction

- Numeric.minimum(), Numeric.add() and Numeric.subtract() generate the appropriate tables IF the two tables have identical ranks and variable lists AND the two tables also have identical shapes
- This is ensured by a secondary preprocessing stage where the tables from the first preprocessing stage are multiplied by a ‘correctly’ shaped table of ones (this is actually quicker than using Numeric.concatenate())
array([[24],
       [ 5]])   ['profession', 'department']   shape (2, 1)
array([[20, 7, 2]])   ['profession', 'department']   shape (1, 3)

(2nd stage preprocessing)

array([[20, 7, 2],
       [20, 7, 2]])   shape (2, 3)

>>> prof.minimum(dept)
array([[20, 7, 2],
       [ 5, 5, 2]])
['profession', 'department']
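And a sketch of that secondary stage. numpy's minimum would broadcast on its own, but Numeric's pointwise functions wanted identical shapes, hence the table of ones:

import numpy as np

prof_aligned = np.array([[24], [5]])    # shape (2, 1) after the first stage
dept_aligned = np.array([[20, 7, 2]])   # shape (1, 3) after the first stage

# multiply by a 'correctly' shaped table of ones so both tables take on
# the full (2, 3) shape before the pointwise operation
ones = np.ones((2, 3), dtype=int)
prof_full = prof_aligned * ones
dept_full = dept_aligned * ones

bounds = np.minimum(prof_full, dept_full)
# array([[20, 7, 2],
#        [ 5, 5, 2]])  -- the upper bounds table seen earlier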
Summary

- The Bayesian Belief Network software was originally implemented in Python for two reasons:
  1. The author was, at the time, a relatively inexperienced programmer
  2. Self-learning (albeit with some help) was the only option

- The SDC software was implemented in Python because:
  1. Python + Numeric turned out to be a wholly appropriate solution for BBNs (Python is powerful, Numeric is fast)
  2. Existing code could be reused