Identifying File

Transcript Identifying File

Identifying File Similarity in
Large Data Sets by Modulo File
Length
Yongtao Zhou, Yuhui Deng, Xiaoguang Cheng, Junjie Xie
Department of Computer Science, Jinan University, Guangzhou, 510632,
P. R.China
Agenda







Motivation
Challenges
Simhash
Traditional sampling algorithm(TSA)
Position-aware similarity algorithm(PAS)
Evaluation
Conclusion
Motivation

The explosive growth of data


Data more and more complex



IDC: 4ZBbytes data in 2014, 50% growth in contrast to 2012
5V: Volume, Variety, Velocity, Value, and Veracity
Structured, semi-structured, and unstructured data
File similarity detection is very important to data management




Cluster similar data in data mining
Find similar file in backup
Employ similarity to enhance the cache hierarchy in clouds
…………
Challenges

Reducing the computing overhead of similarity detection




Reducing the time of similarity detection


Similarity detection is I/O bound and CPU bound task
Require lots of memory space and frequently access disk
The compute overhead increase with the growth of data sets
Some applications requiring real time and high throughput
Achieving both the efficiency and accuracy
Simhash


Fingerprints of similar
files differ in a small
number of bit positions
Determine the similarity
of files by working out
their Hamming distance
Traditional sampling algorithm(TSA)


TSA is described in algorithm
1 by using pseudo-code
Transforming similarity
detection problem into set
intersection problem
Traditional sampling algorithm(TSA)


TSA is simple and fixed overhead, but it is very sensitive to
file modifications
We chosen n = 6, Lenc = 1KB.
Position-aware similarity algorithm(PAS)

Position-aware similarity algorithm(PAS)

PAS is described in
algorithm 2 by using
pseudo-code
Position-aware similarity algorithm(PAS)

We chosen T as 28KB, n = 6, Lenc = 1KB
Evaluation environment




Ubuntu operation system (kernel version is 2.6.32)
VirtualBox virtual machine software (4.3.8.r92456)
1GB memory
2.0 GHz Intel(R) Pentium(R) CPU
Popularity
Data set

Data set D1




Storage Space
Rank
Ext.
%Occur
Ext.
%Occur
1
h
55.30
pdf
77.52
2
pdf
14.7
mkv
4.38
3
Jpg
5.34
rar
4.24
4
c
4.28
mp3
4.01
5
mp3
3.48
zip
2.39
Total
83.1
11.5GB, 2756 files
Table 1. The profile of data set D1
Table 1 summarizes the profile of D1
Figure 4 show the distribution of file size
92.54
Data set D2


14 txt files
128MB
Figure 4. The file size distribution of data set D1
The principle of parameters selection


Comparing the detection probability of PAS against the actual
of matching chunks.
Splitting file into chunks, then maps these chunks into
fingerprint by using MD5 hash function and get fingerprint
set Finger(A). The actual portion of matching chunk
fingerprint of file A and file B is described as follows:
Sampling position impact factor T


The impact of T on the
detection probability
(Lenc = 32byte, N = 8, T
= 2KB, 8KB, 32KB,
128KB, 512KB)
We take T = 512K.


PAS and Simhash parameters configuration

PAS compare to Simhash: Time overhead
The size of file are 2MB, 5MB and 10MB,
respectively
Data set D1
PAS compare to Simhash: CPU and memory usage
CPU and memory utilization of PAS and simhash with data set D1
PAS compare to Simhash: Precision and Recall
PAS
Simhash
Precision
0.875
1
Recall
1
0.125
Conclusion


PAS is very effective in detecting file similarity
The time overhead, CPU and memory occupation of PAS are
much less than that of simhash
We believe that the PAS algorithm is applicable.
Thanks!

Identifying File

Transcript Identifying File

Directory