Automatic Reassembly of Document Fragments via Context

Transcript Automatic Reassembly of Document Fragments via Context

Automated Reassembly of
Document Fragments
Kulesh Shanmugasundaram, Nasir Memon
{kulesh,memon}@isis.poly.edu
Outline






Introduction & Motivation
A Framework for Reassembly
Reassembly Problem
Our Solution
Implementation & Experiments
Summary
2
Introduction


Reassembly of objects from mixed fragments
Common problem in:



Classical Forensics
Failure Analysis
Archaeology

Well studied, automated…

Is there a similar problem in digital forensics?
3
Motivation

Digital evidence is



malleable
easily scattered
Unintentional Scattering:

File Fragmentation in File Systems



Difficult to reconstruct without proper file table
FAT is being ubiquitously in mini-storage devices
Swap File


Addressing and state information is not available on disk
When fragmented contents are difficult to reconstruct
4
Motivation…

Intentional Scattering:
Split
F1
F2
F3
Evidence
Fn-1
pass phrase

Hiding in Slack Space


Fn
Criminal splits the document and hides them selectively
into slack spaces based on a pass phrase
Peer-to-peer systems


Fragments are assigned a keyword from a pass phrase as
the key and scattered across the network
e.g. FreeNet, M-o-o-t
5
A Framework for Reassembly
F1
Gy
Hz
• cryptanalysis
• transformation
to a canonical
form
• cluster documents of
the same type
• group fragments of
the same document
• weight assignments
G1
G1
Gy
reassembly
collating
H1
preprocessing
G1
Fx
F
F1
F2
Fx
H1
H1
Hz
G
H
• rearrange the fragments
in an order to form the
original document
6
The Problem of Reassembly

Suppose we have fragments {A0, A1, … An} of
document A
Reassembly– compute a permutation  such
that:
A= A(0)||A(1)|| … A(n)

In order to compute A:



We need to find adjacent fragments
Adjacent fragments: fragment pairs that are
adjacent in the original document, A
8
Finding Adjacent Fragments

One may use a dictionary

A pair of fragments is considered adjacent if the
word that straddles the boundary is in the
dictionary
wn fox jumps over
the quick bro

Two issues:


the lazy dog.
wnie points goes to
What if we don’t have a dictionary?
How do we resolve ambiguities?
9
Context-Based Statistical Models





Context based models are
used in data compression
abracadabra
Predicts subsequent
symbols based on current
abr
a
context
context
prediction
Works well on natural
languages as well as other
data types
We can use a context-based model to find probable
adjacent fragments
Candidate probability: probability a pair of
fragments could have been adjacent in the original
document
10
Adjacency Matrix




Candidate probabilities of
each pair of fragments form
complete graph
A Hamiltonian path that
maximizes the sum of
candidate probabilities is
our solution
But this problem is
intractable
We will discuss a near
optimal solution
a b c d e
a 0 c(a,b) ….
b c(b,a) 0 ….
c
d
e
c(e,a)
a
e
b
d
c
11
Putting the pieces together



Build context model using all the fragments of a
document.
Compute candidate probabilities for each pair of
candidate fragments (adjacency matrix)
Using the adjacency matrix find the Hamiltonian
Path that maximizes the sum of candidate
probabilities or a near optimal solution
12
Implementation & Experiments
Prediction by Partial Matching (PPM)




Uses a suite of fixed
order context models
Uses one or more
orders to predict
upcoming symbol
We process each
fragment with PPM
Combine the statistics
to form a model for
all the fragments
abracadabra
order=2
order=1
order=0
ab  r 2 2/3
 ^ 1 1/3
a  b 2 2/7
 c 1 1/7
 d 1 1/7
 ^ 3 3/7
 a 5 5/12
 b 2 2/16
 c 1 1/16
ac  a 1 ½
^1 ½
ad  a 1 ½
^1 ½
br  a 2 2/3
 ^ 1 1/3
ca  d 1 ½
^ 1 ½
14
Candidate Probability
fragment i
abracada



fragment j
bracdaba
Slide a window of size d from one fragment
into the other
At each position, use the window as context
and determine the probability (pi) of next
symbol
Candidate prob. C(i,j) = (p0* p1*…pd)
15
Solution Tree

Assumptions:





Fragments are recovered
without data loss
First fragment is known/easily
identified
a
Paths in complete graph can
be represented as a tree
b
Tree grows exponentially!
We have to prune the tree c d
c
d
e
e
c
e
e
c
16
Pruning the Solution Tree



a
At every level choose a node
with the largest candidate
probability
b
c
e
d
e
We can choose  nodes at
each level
c d
b c
e
c
e
b
c
By looking at candidate
probabilities  levels deep
e
c
c
b
17
Data Set
OS Forensics
Log, history files of various users
Source Code
C/C++, Java source code
Binary Code
Executable, object code (Window,
Linux, Solaris)
Binary Document
MS Office, PDF documents
Unstructured
Documents
Unformatted text, chat transcripts
Encrypted &
Compressed
Encrypted, compressed files
18
Reassembly for Various Types
(On a Single Pass)
80
Reassembled (%)
70
60
50
40
30
20
10
0
ns ic
o de
ode
ents
e nts ompr es s
C
C
m
m
e
y
u
u
r
c
For e
c
c
Bina
Sour
d D o ncr yp t/C
ry Do
e
a
r
n
u
i
t
B
E
ruc
Un st
19
Compression Ratio
Compression Ratio
10
9
8
7
6
5
4
3
2
1
0
sic
n
re
o
F
e
rc
u
So
de
o
C
Bi
ry
a
n
s
de
o
C
Bi
ry
na
t
en
m
cu
o
D
U
tru
s
n
s
c
d
re
u
t
t
en
m
cu
o
D
p
ry
c
En
t/
s
es
r
p
m
o
C
20
Iterative Approach
F1
F2
Reassembly
Fn
Concatenate
Final
Sequence
Concatenated
fragments
OS Forensic
4
Type
Top 5
Top 10
Top 15
Top 20
Source Code
6
OS Forensics
57.7%
58.0%
58.7%
68.0%
Binary Code
9
Binary Code
30.0%
30.7%
33.4%
33.4%
Binary Doc
10
Binary Doc
23.4%
24.6%
28.4%
28.4%
Unstructured
10
Unstructured
26.4%
28.3%
29.0%
31.0%
21
Conclusion


Introduced the problem of reassembly of
scattered evidence
Proposed a solution based on context based
statistical models


Experiments & results
Future work:



Identifying preprocessing heuristics
Compare performance with other models
Work on reassembling images
22
Questions…
23

Automatic Reassembly of Document Fragments via Context

Transcript Automatic Reassembly of Document Fragments via Context

Directory