Automatic Reassembly of Document Fragments via Context
Automated Reassembly of Document Fragments
Kulesh Shanmugasundaram, Nasir Memon
{kulesh,memon}@isis.poly.edu
Outline
Introduction & Motivation
A Framework for Reassembly
Reassembly Problem
Our Solution
Implementation & Experiments
Summary
Introduction
Reassembly of objects from mixed fragments
Common problem in:
Classical Forensics
Failure Analysis
Archaeology
Well studied, automated…
Is there a similar problem in digital forensics?
Motivation
Digital evidence is
malleable
easily scattered
Unintentional Scattering:
File Fragmentation in File Systems
Difficult to reconstruct without proper file table
FAT is used ubiquitously in mini-storage devices
Swap File
Addressing and state information is not available on disk
When fragmented, contents are difficult to reconstruct
Motivation…
Intentional Scattering:
[Figure: a piece of evidence is split into fragments F1, F2, F3, …, Fn-1, Fn using a pass phrase]
Hiding in Slack Space
The criminal splits the document and hides the pieces selectively in slack spaces based on a pass phrase
Peer-to-peer systems
Fragments are assigned a keyword from a pass phrase as the key and scattered across the network
e.g. FreeNet, M-o-o-t
A Framework for Reassembly
[Diagram: fragments F1…Fx are preprocessed into sets G1…Gy, collated into sets H1…Hz, and then reassembled]
Preprocessing: cryptanalysis; transformation to a canonical form
Collating: cluster documents of the same type; group fragments of the same document; weight assignments
Reassembly: rearrange the fragments in an order that forms the original document
The Problem of Reassembly
Suppose we have fragments {A0, A1, …, An} of document A
Reassembly: compute a permutation π such that
A = Aπ(0) || Aπ(1) || … || Aπ(n)
To compute A we need to find adjacent fragments
Adjacent fragments: fragment pairs that are adjacent in the original document A
Finding Adjacent Fragments
One may use a dictionary
A pair of fragments is considered adjacent if the word that straddles their boundary is in the dictionary
Example: the fragment ending "the quick bro" could be followed by "wn fox jumps over the lazy dog." or by "wnie points goes to"
Two issues:
What if we don't have a dictionary?
How do we resolve ambiguities? (both "brown" and "brownie" are valid words here)
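A minimal sketch of this dictionary test (the helper names, the toy dictionary, and the boundary-word heuristic are illustrative assumptions, not part of the original slides):

def boundary_word(left, right):
    """Word formed across the boundary: last partial word of `left` + first partial word of `right`."""
    if left.endswith(" ") or right.startswith(" "):
        return None  # no word actually straddles the boundary
    left_tail = left.split()[-1] if left.split() else ""
    right_head = right.split()[0] if right.split() else ""
    return left_tail + right_head

def looks_adjacent(left, right, dictionary):
    """Dictionary test: the pair is a candidate if the straddling word is a real word."""
    word = boundary_word(left, right)
    return word is not None and word.strip(".,").lower() in dictionary

# Toy example from the slide: both continuations yield valid words, hence the ambiguity.
dictionary = {"brown", "brownie"}
print(looks_adjacent("the quick bro", "wn fox jumps over", dictionary))    # True ("brown")
print(looks_adjacent("the quick bro", "wnie points goes to", dictionary))  # True ("brownie")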
Context-Based Statistical Models
Context-based models are used in data compression
They predict subsequent symbols based on the current context
[Example: in "abracadabra", the context "abr" predicts the next symbol "a"]
They work well on natural languages as well as other data types
We can use a context-based model to find probable adjacent fragments
Candidate probability: the probability that a pair of fragments could have been adjacent in the original document
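As a rough illustration of such a model (a plain fixed-order frequency count in Python, not the PPM implementation used in the paper; the function names are made up):

from collections import defaultdict, Counter

def build_context_model(data, order=2):
    """For every context of length `order`, count how often each next symbol follows it."""
    counts = defaultdict(Counter)
    for i in range(len(data) - order):
        counts[data[i:i + order]][data[i + order]] += 1
    return counts

def predict(counts, context, symbol):
    """Estimate P(symbol | context) from the counts; 0.0 if the context was never seen."""
    seen = counts.get(context)
    return seen[symbol] / sum(seen.values()) if seen else 0.0

model = build_context_model("abracadabra", order=2)
print(predict(model, "ab", "r"))   # 1.0: in "abracadabra", 'ab' is always followed by 'r'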
Adjacency Matrix
The candidate probabilities of all pairs of fragments form a complete graph
A Hamiltonian path that maximizes the sum of candidate probabilities is our solution
But this problem is intractable
We will discuss a near-optimal solution
[Figure: adjacency matrix over fragments a–e with entries c(i,j) such as c(a,b), c(b,a), c(e,a), zeros on the diagonal, and the corresponding complete graph]
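A sketch of how such an adjacency matrix could be filled in; the candidate_probability argument stands for something like the computation sketched on a later slide:

def adjacency_matrix(fragments, candidate_probability):
    """c[i][j]: probability that fragment j follows fragment i; the diagonal stays 0."""
    n = len(fragments)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                c[i][j] = candidate_probability(fragments[i], fragments[j])
    return c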
Putting the pieces together
Build a context model using all the fragments of a document
Compute candidate probabilities for each pair of fragments (the adjacency matrix)
Using the adjacency matrix, find the Hamiltonian path that maximizes the sum of candidate probabilities, or a near-optimal approximation of it
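Putting those steps together in the simplest possible way, a purely greedy approximation (this is only the most basic form of the pruning discussed on later slides; the start fragment is assumed known):

def greedy_reassembly(c, start=0):
    """Repeatedly append the unused fragment with the highest candidate probability."""
    n = len(c)
    order, used = [start], {start}
    while len(order) < n:
        last = order[-1]
        best = max((j for j in range(n) if j not in used), key=lambda j: c[last][j])
        order.append(best)
        used.add(best)
    return order  # a permutation of fragment indices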
Implementation & Experiments
Prediction by Partial Matching (PPM)
Uses a suite of fixed-order context models
Uses one or more orders to predict the upcoming symbol
We process each fragment with PPM
We combine the statistics to form a model for all the fragments
Fixed-order counts for "abracadabra" (symbol, count, probability; ^ is the escape symbol):
order 2: ab: r 2 2/3, ^ 1 1/3 | ac: a 1 1/2, ^ 1 1/2 | ad: a 1 1/2, ^ 1 1/2 | br: a 2 2/3, ^ 1 1/3 | ca: d 1 1/2, ^ 1 1/2
order 1: a: b 2 2/7, c 1 1/7, d 1 1/7, ^ 3 3/7
order 0: a 5 5/12, b 2 2/16, c 1 1/16
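A sketch of the "combine the statistics" step, assuming each fragment has already been turned into a context-count table like the one from the earlier sketch (escape handling, which real PPM needs, is omitted):

from collections import defaultdict, Counter

def merge_models(per_fragment_models):
    """Sum the per-fragment context counts into one model covering all fragments."""
    merged = defaultdict(Counter)
    for model in per_fragment_models:
        for context, counts in model.items():
            merged[context].update(counts)
    return merged

# e.g. merged = merge_models(build_context_model(f) for f in fragments)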
Candidate Probability
Example: fragment i = "abracada", fragment j = "bracdaba"
Slide a window of size d from one fragment into the other
At each position, use the window as context and determine the probability pi of the next symbol
Candidate probability: C(i,j) = p0 * p1 * … * pd
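A sketch of this computation; `predict` stands for any estimator of P(symbol | context), such as the context-model sketch earlier, and the handling of the window at the fragment boundary is my reading of the slide:

def candidate_probability(predict, frag_i, frag_j, d=2):
    """C(i, j): slide a d-symbol window from frag_i into frag_j, multiplying next-symbol probabilities."""
    joined = frag_i + frag_j
    prob = 1.0
    for pos in range(len(frag_i), min(len(frag_i) + d + 1, len(joined))):
        context = joined[max(0, pos - d):pos]   # the d symbols preceding this position
        prob *= predict(context, joined[pos])
    return prob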
Solution Tree
Assumptions:
Fragments are recovered without data loss
The first fragment is known or easily identified
Paths in the complete graph can be represented as a tree
The tree grows exponentially!
We have to prune the tree
[Figure: solution tree rooted at fragment a, branching over fragments b, c, d, e]
Pruning the Solution Tree
At every level, choose the node with the largest candidate probability
We can choose several nodes at each level, by looking at candidate probabilities several levels deep
[Figure: pruned solution tree rooted at a, keeping only the highest-scoring branches at each level]
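A keep-the-best-paths sketch of that pruning, in the spirit of a beam search over the adjacency matrix (beam_width plays the role of "several nodes per level"; the one-level lookahead is a simplification of the deeper lookahead mentioned above):

def pruned_reassembly(c, start=0, beam_width=3):
    """Keep only the beam_width best partial paths at each level of the solution tree."""
    n = len(c)
    beam = [(0.0, [start])]                      # (sum of candidate probabilities, path)
    for _ in range(n - 1):
        candidates = []
        for score, path in beam:
            used = set(path)
            for j in range(n):
                if j not in used:
                    candidates.append((score + c[path[-1]][j], path + [j]))
        candidates.sort(key=lambda t: t[0], reverse=True)
        beam = candidates[:beam_width]           # prune everything else
    return max(beam, key=lambda t: t[0])[1]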
Data Set
OS Forensics: log and history files of various users
Source Code: C/C++ and Java source code
Binary Code: executable and object code (Windows, Linux, Solaris)
Binary Documents: MS Office and PDF documents
Unstructured Documents: unformatted text, chat transcripts
Encrypted & Compressed: encrypted and compressed files
Reassembly for Various Types (On a Single Pass)
[Bar chart: Reassembled (%), 0–80, for OS Forensics, Source Code, Binary Code, Binary Documents, Unstructured Documents, and Encrypted/Compressed data]
Compression Ratio
[Bar chart: compression ratio, 0–10, for OS Forensics, Source Code, Binary Code, Binary Documents, Unstructured Documents, and Encrypted/Compressed data]
Iterative Approach
[Diagram: fragments F1, F2, …, Fn are reassembled, the result is concatenated, and the concatenated fragments are fed back into reassembly until a final sequence is produced]
OS Forensic 4, Source Code 6, Binary Code 9, Binary Doc 10, Unstructured 10
Type          Top 5   Top 10  Top 15  Top 20
OS Forensics  57.7%   58.0%   58.7%   68.0%
Binary Code   30.0%   30.7%   33.4%   33.4%
Binary Doc    23.4%   24.6%   28.4%   28.4%
Unstructured  26.4%   28.3%   29.0%   31.0%
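One way to read that loop as code; the merge rule here (gluing only the more confident joins each pass) is an assumption for illustration, and `order_and_scores` stands for a reassembly routine that also reports the confidence of each join:

def iterative_reassembly(fragments, order_and_scores, passes=3):
    """Each pass: order the pieces, concatenate the confident joins, feed the result back in."""
    pieces = list(fragments)
    for _ in range(passes):
        if len(pieces) <= 1:
            break
        order, scores = order_and_scores(pieces)          # ordering + confidence per join
        threshold = sorted(scores, reverse=True)[len(scores) // 2]
        merged, current = [], pieces[order[0]]
        for idx, score in zip(order[1:], scores):
            if score >= threshold:
                current += pieces[idx]                     # confident join: concatenate
            else:
                merged.append(current)                     # uncertain join: keep separate
                current = pieces[idx]
        merged.append(current)
        pieces = merged
    return pieces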
Conclusion
Introduced the problem of reassembly of
scattered evidence
Proposed a solution based on context-based statistical models
Experiments & results
Future work:
Identifying preprocessing heuristics
Compare performance with other models
Work on reassembling images
Questions…