dCACH: Content Aware Clustered and Hierarchical Distributed Deduplication

Authors: Girum Dagnaw, Ke Zhou, Hua Wang

In deduplication, the index-lookup disk bottleneck is a major obstacle that limits the throughput of backup processes. One way to reduce its effect and boost speed is to deduplicate with very coarse-grained chunks, at the cost of lower storage savings and limited scalability. Another way is to distribute the deduplication process among multiple nodes, but this approach introduces the storage node island effect and incurs high communication cost. In this paper, we explore dCACH, a content-aware clustered and hierarchical deduplication system, which implements a hybrid of inline coarse-grained and offline fine-grained distributed deduplication in which routing decisions are made for a set of files instead of single files. It uses Bloom filters to detect similarity between a data stream and previous data streams and performs stateful routing, which solves the storage node island problem. Moreover, it exploits the fact that chunks from different file types share a negligibly small amount of content: it creates groups of files and deduplicates each group in its own fingerprint index space. It implements hierarchical deduplication to reduce the size of fingerprint indexes at the global level, where only files and large segments are deduplicated. Locality is created and exploited first by using the large segments deduplicated at the global level and second by routing a set of consecutive files together to one storage node. Furthermore, using Bloom filters for similarity detection between streams has low communication and computation cost while achieving duplicate elimination performance comparable to single-node deduplication. dCACH is evaluated using a prototype deployed in a server environment distributed over four separate machines. It is shown to have 10× the speed of Extreme_Binn with minimal communication overhead, while its duplicate elimination effectiveness is on a par with a single-node deduplication system.
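The routing idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `BloomFilter`, `Router`, and `route`, the filter size, the number of hash functions, the use of SHA-1, and the least-loaded tie-break are all assumptions made for illustration. The sketch keeps one Bloom filter per storage node, scores an incoming set of segments against each filter, and sends the whole set to the node that appears to hold the most of it already.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter backed by a Python int (sizes are illustrative assumptions)."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item: bytes):
        # Derive several bit positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: bytes) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


class Router:
    """Hypothetical stateful router: one Bloom filter summarises each storage node."""

    def __init__(self, num_nodes: int):
        self.filters = [BloomFilter() for _ in range(num_nodes)]
        self.load = [0] * num_nodes

    def route(self, segments):
        """Pick one node for a whole set of segments and record them there."""
        fps = [hashlib.sha1(s).digest() for s in segments]
        # Similarity score: how many fingerprints each node has probably seen.
        scores = [sum(fp in f for fp in fps) for f in self.filters]
        best = max(scores)
        # Assumed tie-break: among the most similar nodes, take the least loaded.
        candidates = [n for n, s in enumerate(scores) if s == best]
        node = min(candidates, key=lambda n: self.load[n])
        for fp in fps:
            self.filters[node].add(fp)
        self.load[node] += len(fps)
        return node


router = Router(num_nodes=4)
first = router.route([b"a1", b"a2", b"a3"])
second = router.route([b"b1", b"b2"])
third = router.route([b"a1", b"a2", b"new"])  # overlaps with the first stream
```

In this sketch the third stream shares content with the first, so it follows it to the same node, while the unrelated second stream is placed on a different, less loaded node. Only the compact filters need to be consulted for a routing decision, which is the property the abstract credits for the low communication cost.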


Journal: Journal of Software Engineering and Applications
DOI: 10.4236/jsea.2019.1211029
Paper Id: 96634


