SCALABLE DECENTRALIZED DE-DUPLICATION STORE
Prakash Chandrasekaran – Anand Gupta
Gautham Narayanasamy – Vijayaraghavan Subbaiah
Motivation

Importance of storage space
Finding enough space to meet customer demand has been a huge challenge for cloud providers.
Saving significant resources during web crawling, indexing, and search.


Backup Strategies


Data must be backed up and replicated across many geographic locations.
Ingenious techniques are needed to use the storage space more efficiently.
Deduplication


Removing duplicate copies of files and storing only pointers to the original copy.
Block-level deduplication
Allows finer granularity and hence offers a greater reduction in storage space.
Requires more processing power than file-level deduplication.
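The block-level idea above can be sketched in a few lines of Python. This is only an illustration, not the system described in these slides: the fixed 4096-byte block size, the SHA-256 hash, and the names `dedup_store` and `restore` are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed block size; the slides do not specify one


def dedup_store(data: bytes, store: dict) -> list:
    """Split data into fixed-size blocks, keep each unique block once,
    and return the ordered list of block hashes (the "recipe")."""
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        block = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # a repeated block is not stored again
        recipe.append(digest)
    return recipe


def restore(recipe: list, store: dict) -> bytes:
    """Rebuild the original data by following the recipe's block pointers."""
    return b"".join(store[digest] for digest in recipe)
```

Two files that share a block store that block once; each file keeps only its recipe of hashes. Hashing every block is where the extra processing cost relative to file-level deduplication comes from.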


Use case
Storage of snapshots of virtual machine (VM) images in a
virtualized cloud environment.
Detecting exact duplicates and near duplicates in web pages.
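For the near-duplicate use case, a common technique is MinHash over character shingles, which estimates the Jaccard similarity of two documents. The sketch below is a generic illustration and not necessarily the variant used in this system: the shingle length, the 64 salted SHA-1 hash functions, and all function names are assumptions.

```python
import hashlib


def shingles(text: str, k: int = 5) -> set:
    """Return the set of overlapping k-character substrings of text."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(items: set, num_hashes: int = 64) -> list:
    """For each salted hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(hashlib.sha1(salt + s.encode("utf-8")).digest()[:8], "big")
            for s in items
        ))
    return signature


def estimated_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of positions where the two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

Identical pages yield identical signatures, so exact duplicates score 1.0, while pages that differ in only a few places agree on most signature positions.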

Architecture
Cassandra Schema

create keyspace minhash;
create column family minhash_chunks with column_type=Super;
create column family minhash_filerecipe with column_type=Super;
create column family minhash_fullhash;

create keyspace files;
create column family files_minhash;
Data Distribution
[Figure: Client / Application, Load Balancing, Cassandra Cluster, Cassandra Nodes]
Data Flow in Cassandra
[Figure: flowchart spanning the Client and the Cassandra Cluster. Labels: Start; File input to Client (OS snapshot file / web page); Chunking Process; Chunks; Compute MinHash and full hash; Check file already exists; Check full hash already exists; Match; Insert operations on <fileid / File Name, minhash>, <minhash, filerecipe>, <minhash, fullhash>, <minhash, chunkData>]
System Implementation
Sequence - put
Sequence – get
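The sequence diagrams did not survive in this transcript, so the following is only one possible reading of the put and get paths, with plain dictionaries standing in for the column families named in the schema. Everything here is an assumption made for illustration: the 4096-byte chunk size, SHA-1 hashing, using the smallest chunk hash as the file's MinHash, and the row layouts noted in the comments.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed; the slides do not give a chunk size

# In-memory stand-ins for the column families (layouts are assumptions).
minhash_chunks = {}      # minhash -> {chunk hash -> chunk bytes}
minhash_filerecipe = {}  # minhash -> {file id -> ordered list of chunk hashes}
minhash_fullhash = {}    # minhash -> set of whole-file hashes
files_minhash = {}       # file id -> minhash


def put(file_id: str, data: bytes) -> None:
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)] or [b""]
    hashes = [hashlib.sha1(c).hexdigest() for c in chunks]
    minhash = min(hashes)  # assumption: single-value MinHash over chunk hashes
    fullhash = hashlib.sha1(data).hexdigest()
    files_minhash[file_id] = minhash
    minhash_filerecipe.setdefault(minhash, {})[file_id] = hashes
    if fullhash in minhash_fullhash.setdefault(minhash, set()):
        return  # exact duplicate: every chunk is already stored
    minhash_fullhash[minhash].add(fullhash)
    row = minhash_chunks.setdefault(minhash, {})
    for chunk_hash, chunk in zip(hashes, chunks):
        row.setdefault(chunk_hash, chunk)  # skip chunks already in this row


def get(file_id: str) -> bytes:
    minhash = files_minhash[file_id]
    recipe = minhash_filerecipe[minhash][file_id]
    row = minhash_chunks[minhash]
    return b"".join(row[chunk_hash] for chunk_hash in recipe)
```

In this reading, keying rows by MinHash tends to place similar files in the same row, so shared chunks are found with a single row lookup. The real system may organize these column families differently.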
System Efficiency



Calculate the total amount of space saved.
Demonstrate the extent of similarity across various snapshots and web pages.
Measure the overhead associated with file storage and retrieval in our system.
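One simple way to express the space saved is the fraction of logical data that never had to be written. The helper below is a hypothetical illustration; the slides do not state which metric the evaluation uses.

```python
def space_saved(logical_bytes: int, stored_bytes: int) -> float:
    """Fraction of the logical data that deduplication avoided storing."""
    if logical_bytes == 0:
        return 0.0
    return 1.0 - stored_bytes / logical_bytes
```

For example, if clients wrote 100 GB in total and the store holds 25 GB of unique chunks, `space_saved` returns 0.75.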
Questions?
