Online Data Deduplication for In-Memory Big-Data Analytic Systems
Given a set of files that show a certain degree of similarity, we consider a novel problem of performing data redundancy elimination across a set of distributed worker nodes in a shared-nothing in-memory big data analytic system. The redundancy elimination scheme is designed in a manner that is: (i) space-efficient: the total space needed to store the files is minimized and, (ii) access-isolation: data shuffling among server is also minimized. In this paper, we first show that finding an access-efficient and space optimal solution is an NP-Hard problem. Following this, we present the file partitioning algorithms that locate access-efficient solutions in an incremental manner with minimal algorithm time complexity (polynomial time). Our experimental verification on multiple data sets confirms that the proposed file partitioning solution is able to achieve compression ratio close to the optimal compression performance achieved by a centralized solution.