SCALABLE DECENTRALIZED
DE-DUPLICATION STORE
Prakash Chandrasekaran – Anand Gupta
Gautham Narayanasamy – Vijayaraghavan Subbaiah
Motivation
Importance of storage space
Finding enough space to meet the demands of the customers
has been a huge challenge for cloud providers.
Saving significant resources during web crawling, indexing,
and search.
Backup Strategies
Backing up data and replicating it across many
geographical locations.
Need for devising ingenious techniques to use the
storage space more efficiently.
Deduplication
Removing duplicate copies of files and storing only the
pointers to the original copy.
Block-level deduplication
Allows more granularity and hence offers a greater
reduction in storage space.
Requires more processing power than file-level
deduplication.
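The granularity trade-off above can be illustrated with a minimal sketch. It assumes fixed-size chunks and SHA-256 chunk hashes, neither of which is stated on the slides; the file names, `CHUNK_SIZE`, and helper names are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; assumption, kept tiny so the example is easy to follow


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def dedup_store(files: dict):
    """Store each distinct chunk once and keep a per-file list of chunk hashes."""
    chunk_store = {}  # chunk hash -> chunk bytes
    recipes = {}      # file name -> ordered list of chunk hashes
    for name, data in files.items():
        recipe = []
        for chunk in split_chunks(data):
            digest = hashlib.sha256(chunk).hexdigest()
            # A chunk that is already present is not stored again;
            # the recipe entry acts as the pointer to the original copy.
            chunk_store.setdefault(digest, chunk)
            recipe.append(digest)
        recipes[name] = recipe
    return chunk_store, recipes


files = {
    "a.img": b"AAAABBBBCCCC",
    "b.img": b"AAAABBBBDDDD",  # differs from a.img only in the last chunk
}
chunk_store, recipes = dedup_store(files)
logical = sum(len(d) for d in files.values())       # bytes before deduplication
stored = sum(len(c) for c in chunk_store.values())  # bytes actually kept
print(logical, stored)
```

File-level deduplication would save nothing here because the two files are not identical, while the chunk-level store keeps the shared chunks once. The extra cost is one hash computation and one lookup per chunk instead of per file.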
Use case
Storage of snapshots of virtual machine (VM) images in a
virtualized cloud environment.
Detecting exact duplicates and near duplicates in web
pages.
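Near-duplicate detection of the kind mentioned above is commonly done by estimating Jaccard similarity with MinHash signatures. The sketch below is one way to do that, not necessarily the authors' method: the 3-word shingles, the 64 seeded SHA-256 hash functions, and the sample pages are all assumptions made for illustration.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of a text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(items: set, num_hashes: int = 64) -> list:
    """For each seeded hash function, keep the minimum hash value over the set."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(hashlib.sha256(salt + item.encode()).digest()[:8], "big")
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


page_a = "the quick brown fox jumps over the lazy dog near the river bank"
page_b = "the quick brown fox jumps over the lazy dog near the river shore"
page_c = "cassandra stores rows in column families spread over many nodes"

sig_a = minhash_signature(shingles(page_a))
sig_b = minhash_signature(shingles(page_b))
sig_c = minhash_signature(shingles(page_c))

near = estimated_jaccard(sig_a, sig_b)  # pages differing in one word
far = estimated_jaccard(sig_a, sig_c)   # unrelated pages
print(near, far)
```

Exact duplicates produce identical signatures, so an estimate of 1.0 flags them, while a high but smaller estimate flags near duplicates. The threshold separating "near duplicate" from "different" is a tuning choice that the slides do not specify.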
Architecture
Cassandra Schema
create keyspace minhash;
use minhash;
create column family minhash_chunks with column_type=Super;
create column family minhash_filerecipe with column_type=Super;
create column family minhash_fullhash;
create keyspace files;
use files;
create column family files_minhash;
Data Distribution
[Diagram labels: Client / Application; Cassandra Cluster; Load Balancing; Cassandra Nodes]
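The slide only names the components, so the following is a generic sketch of how a Cassandra-style token ring spreads row keys across nodes, not a description of the authors' cluster. The node names, the sample keys, and the use of SHA-256 to derive tokens are assumptions made for illustration.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a position on the ring (hash choice is illustrative)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class Ring:
    """Toy token ring: each node owns the keys up to and including its token."""

    def __init__(self, nodes):
        # Each node is placed on the ring at the token of its own name.
        self._entries = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._entries]

    def node_for(self, key: str) -> str:
        """Return the first node clockwise from the key's token."""
        i = bisect.bisect_left(self._tokens, token(key)) % len(self._tokens)
        return self._entries[i][1]


nodes = ["node-1", "node-2", "node-3"]  # hypothetical node names
keys = ["minhash-a", "minhash-b", "minhash-c", "minhash-d"]  # hypothetical row keys

ring = Ring(nodes)
placement = {k: ring.node_for(k) for k in keys}

# Adding a node only takes over keys in its own range; other keys stay put.
bigger_ring = Ring(nodes + ["node-4"])
moved = {k: bigger_ring.node_for(k) for k in keys}
print(placement, moved)
```

Because placement depends only on the key's hash, any client can compute where a row lives without a central directory, which is what makes the store decentralized.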
Data Flow in Cassandra
[Flowchart; labels recovered from the slide]
Lanes: Client, Cassandra Cluster
Start; File input to Client (OS snapshot file / web page); Check file already exists
Chunking Process; Chunks; Compute MinHash (minhash) and full hash (fullhash)
Check full hash already exists; Match
Insert <fileid / File Name, minhash>; Insert <minhash, filerecipe>; Insert <minhash, chunkData>; Insert <minhash, fullhash>
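The flow above can be sketched with in-memory dictionaries standing in for the four column families named in the schema. Several details are assumptions and not taken from the slides: fixed-size chunks, SHA-256 hashes, a minhash taken as the smallest chunk hash, the order of the checks, and storing the full hash next to the minhash so that `get` can locate the recipe. A `get` helper is included only to show that a stored file can be rebuilt.

```python
import hashlib

CHUNK_SIZE = 4  # assumption: fixed-size chunks, kept tiny for readability

# In-memory stand-ins for the column families; the row layouts are assumed.
files_minhash = {}       # file name -> (minhash, fullhash)
minhash_fullhash = {}    # minhash -> set of full hashes stored under it
minhash_filerecipe = {}  # minhash -> {fullhash: ordered list of chunk hashes}
minhash_chunks = {}      # minhash -> {chunk hash: chunk bytes}


def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def put(name: str, data: bytes) -> str:
    """Store a file and report which branch of the flow was taken."""
    if name in files_minhash:                    # check file already exists
        return "file exists"
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    chunk_hashes = [sha(c) for c in chunks]
    minhash = min(chunk_hashes, default=sha(b""))  # assumed: smallest chunk hash
    fullhash = sha(data)
    files_minhash[name] = (minhash, fullhash)    # insert <file name, minhash>
    if fullhash in minhash_fullhash.get(minhash, set()):  # check full hash exists
        return "exact duplicate"                 # match: only the pointer is kept
    minhash_fullhash.setdefault(minhash, set()).add(fullhash)
    minhash_filerecipe.setdefault(minhash, {})[fullhash] = chunk_hashes
    row = minhash_chunks.setdefault(minhash, {})
    new_chunks = 0
    for digest, chunk in zip(chunk_hashes, chunks):
        if digest not in row:                    # store only chunks not seen before
            row[digest] = chunk
            new_chunks += 1
    return f"stored {new_chunks} new chunk(s)"


def get(name: str) -> bytes:
    """Rebuild a file by following its recipe through the chunk store."""
    minhash, fullhash = files_minhash[name]
    recipe = minhash_filerecipe[minhash][fullhash]
    return b"".join(minhash_chunks[minhash][digest] for digest in recipe)


results = [
    put("vm1.img", b"AAAABBBBCCCC"),
    put("vm1-copy.img", b"AAAABBBBCCCC"),  # same content under a new name
    put("vm1.img", b"AAAABBBBCCCC"),       # same name again
]
print(results)
```

In this sketch a near-duplicate file reuses chunks only when it lands in the same minhash row, which is why the choice of minhash function determines how much similarity the store can exploit.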
System Implementation
Sequence – put
Sequence – get
System Efficiency
Calculating the total amount of space saved.
Demonstrating the extent of similarity across various
snapshots and web pages.
Measuring the overhead associated with file storage and
retrieval in our system.
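The space-saving figure can be computed from three byte counts. The sketch below is a minimal version of that arithmetic; the function name and the sample numbers are hypothetical and are not measurements from the system.

```python
def space_savings(logical_bytes: int, stored_bytes: int, metadata_bytes: int = 0):
    """Return (deduplication ratio, fraction of space saved).

    logical_bytes:  total size of all files as the clients see them
    stored_bytes:   size of the unique chunks actually kept
    metadata_bytes: size of recipes, hashes, and other bookkeeping
    """
    physical = stored_bytes + metadata_bytes
    if logical_bytes <= 0 or physical <= 0:
        raise ValueError("byte counts must be positive")
    ratio = logical_bytes / physical
    saved = 1 - physical / logical_bytes
    return ratio, saved


# Hypothetical numbers, chosen only to make the arithmetic easy to check.
ratio, saved = space_savings(logical_bytes=1000, stored_bytes=400, metadata_bytes=100)
print(ratio, saved)
```

Counting metadata in the physical size matters because block-level deduplication trades chunk storage for recipe and hash storage; ignoring it would overstate the savings.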
Questions?