Data Duplication Removal Using File Checksum
Data de-duplication technology typically identifies redundant data quickly by computing a checksum over each file. A matching checksum indicates that the data may already be stored, but false positives are possible, since two different chunks can produce the same checksum. To rule out false positives, a new chunk must be compared against the previously stored chunks that share its checksum, and current research extracts file-level checksums to reduce the time this comparison takes. For each target file, the system stores several attributes in the database, such as user id, filename, size, extension, checksum, and date-time. Whenever a user uploads a file, the system first calculates its checksum and cross-verifies it against the checksums stored in the database. If the file already exists, the system updates the existing entry; otherwise, it inserts a new entry into the database.
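The following is a minimal sketch of that upload flow, assuming a SQLite table named files; the schema, column names, and the choice of SHA-256 as the checksum algorithm are illustrative assumptions rather than the system's actual implementation.

```python
# Sketch of the checksum-based upload check described above (illustrative).
import hashlib
import os
import sqlite3
from datetime import datetime

conn = sqlite3.connect("dedup.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS files (
           user_id   TEXT,
           filename  TEXT,
           size      INTEGER,
           extension TEXT,
           checksum  TEXT,
           date_time TEXT)"""
)

def file_checksum(path):
    """Compute the file's checksum by streaming its contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(8192), b""):
            h.update(block)
    return h.hexdigest()

def upload(user_id, path):
    """Cross-verify the checksum with stored entries; update or insert."""
    checksum = file_checksum(path)
    filename = os.path.basename(path)
    existing = conn.execute(
        "SELECT rowid FROM files WHERE checksum = ?", (checksum,)
    ).fetchone()
    now = datetime.now().isoformat()
    if existing:
        # File content already stored: refresh the existing entry.
        conn.execute(
            "UPDATE files SET user_id = ?, filename = ?, date_time = ? "
            "WHERE rowid = ?",
            (user_id, filename, now, existing[0]),
        )
    else:
        # New content: record all attributes of the target file.
        conn.execute(
            "INSERT INTO files VALUES (?, ?, ?, ?, ?, ?)",
            (user_id, filename, os.path.getsize(path),
             os.path.splitext(path)[1], checksum, now),
        )
    conn.commit()
```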
Data de-duplication plays an important role in reducing storage consumption, keeping storage affordable to manage amid today’s explosive data growth. The main goals of this project are: to maximally reduce the number of duplicates in one type of NoSQL database, namely the key-value store; to maximize process performance so that the backup window is only marginally affected; and to design with horizontal scaling in mind so that the system runs competitively on a cloud platform.
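As a minimal sketch of the key-value approach, the chunk checksum can itself serve as the key, so duplicate chunks map to the same key and are stored only once. The in-memory dict, fixed chunk size, and SHA-256 below are assumptions for illustration, not the project's actual store.

```python
# Checksum-keyed de-duplication sketch; a dict stands in for the key-value store.
import hashlib

store = {}          # key: checksum -> value: chunk bytes
CHUNK_SIZE = 4096   # fixed-size chunking, illustrative only

def put_chunk(chunk):
    """Store a chunk only if its content is not already present."""
    key = hashlib.sha256(chunk).hexdigest()
    if key not in store:
        store[key] = chunk   # new content: written exactly once
    return key               # duplicates resolve to the same key

def backup(path):
    """Back up a file as a list of chunk keys; duplicate chunks are not rewritten."""
    keys = []
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
            keys.append(put_chunk(chunk))
    return keys
```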