Deduplication is a critical technology for modern production and research systems. In many domains, such as cloud computing, it is often taken for granted [0]. Deduplication increases the effective amount of data you can store in memory [1], store on disk [2], and transmit across the network [3]. It comes at the cost of more CPU cycles, and potentially more I/O operations at the origin and destination storage backends. Microsoft [4], IBM [5], EMC [6], Riverbed [7], Oracle [8], NetApp [9], and other companies tout deduplication as a major feature and differentiator across the computing industry. So what exactly is deduplication?
Deduplication is a special case of compression [10]. It focuses on removing redundancy by replacing redundant data with small references to canonical copies which already exist within a binary file or dataset. There are three steps in deduplication:
- Chunking: breaking data into chunks (in this article, fixed-size sequences of bytes)
- Hashing: deriving unique names for each chunk
- Deduplicating: keeping only one chunk for each unique hash, and pointing duplicates to this canonical chunk (a toy sketch of all three steps follows this list)
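Before walking through each step on real files, here is a minimal in-memory sketch of all three steps on a toy byte string. The data, chunk size, and variable names are purely illustrative, and it uses chunk names (rather than the chunk positions used later in this article) as references.

import hashlib

data = b"AAAABBBBAAAACCCC"   # toy input: four 4-byte blocks, one of them repeated
size = 4

# Chunking: break the data into fixed-size chunks.
raw_chunks = [data[i:i + size] for i in range(0, len(data), size)]

# Hashing: derive a name for each chunk (MD5 for illustration only).
names = [hashlib.md5(c).hexdigest() for c in raw_chunks]

# Deduplicating: keep one chunk per unique name, point duplicates at it.
store = {}   # name -> canonical chunk
refs = []    # the original data, expressed as a list of names
for name, chunk in zip(names, raw_chunks):
    store.setdefault(name, chunk)
    refs.append(name)

print("%d unique chunks instead of %d" % (len(store), len(raw_chunks)))

Storing store plus refs instead of data is the whole idea; the rest of this article does the same thing with files on disk.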
There are many hashing algorithms we could choose from, but in this article we use MD5 [11] because it is fast and we use it only for illustration. In practice, MD5 should not be used, as it is fairly easy to find a collision: two different pieces of content that hash to the same name.
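If you want to experiment with a stronger hash, Python's hashlib module makes the swap a one-line change. The helper below is a minimal sketch and not part of the article's code; name_chunk and its default algorithm are just examples.

import hashlib

def name_chunk(chunk, algorithm="sha256"):
    # Derive a fixed-length name for a chunk of bytes.
    # SHA-256 is a safer default than the MD5 used for illustration in this article.
    return hashlib.new(algorithm, chunk).hexdigest()

print(name_chunk(b"some chunk of bytes"))          # SHA-256 name
print(name_chunk(b"some chunk of bytes", "md5"))   # MD5 name, as used below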
Chunking
Chunking has one parameter: the size of a chunk. In this article we consider fixed-size chunks; however, in general, chunk sizes can be variable and chunking can leverage semantic information to deduplicate at the file level. For fixed-size chunking, if you pick a small chunk size you might deduplicate more, but you will have many more chunks, which hurts scalability: the index of names to chunks becomes intractably large. If you pick a large chunk size, you might deduplicate less, but you will scale better by having far fewer chunks to keep in the index. Why might deduplication suffer? Because the data might not align well with your chosen chunk size. In practice, chunk sizes are almost always chosen to match the file system block size, because they then align well with data written by the target file system. Today, the most common block size is 4096 bytes (4 kibibytes).
def chunks(f, size):
    buf = f.read(size)
    while buf:
        yield buf
        buf = f.read(size)
Given a chunk size, the above Python generator function (Gist) will continuously generate fixed-size chunks until it reaches the end of file (the last chunk might not be full size).
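As a usage sketch (the file name and the 4096-byte chunk size are example values, not part of the article), counting a file's chunks looks like this:

# Count 4096-byte chunks in a file; "disk.img" is a hypothetical path.
with open("disk.img", "rb") as f:
    count = sum(1 for _ in chunks(f, 4096))
print("%d chunks" % count)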
Hashing
Many hash functions have been developed that can be used to derive unique names for arbitrary chunks of bytes. We use MD5, a 128-bit hash, for performance, but in practice another hash algorithm should be used because it is relatively easy to generate chunks that have the same MD5 hash [12].
from md5 import new as hasher

def hash(f, size):
    for chunk in chunks(f, size):
        print hasher(chunk).hexdigest()
This snippet of Python (Gist) walks through all of the fixed-size chunks within a given file and prints human-readable hex strings representing their binary MD5 hashes. Normally, only the binary digests would be stored, but for this example we output human-readable strings for easy inspection.
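As a usage sketch (the path is hypothetical), and assuming the printed names are redirected to a file such as names.txt for the next step:

# Print one MD5 name per 4096-byte chunk of the file.
# Redirecting this output to a file (e.g. names.txt) produces the list of
# chunk names that the index step in the next section reads back in.
with open("disk.img", "rb") as f:
    hash(f, 4096)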
Deduplicating
def index(f):
    uniques = dict()
    lines = f.readlines()
    for i, line in enumerate(lines):
        if line:
            print uniques.setdefault(line, i)
This snippet of Python (Gist) walks through all of the chunk names, records their canonical copy’s index, and prints the index. Whenever a unique hash is encountered, it gets added to the set of canonical chunks (uniques in the snippet). Future references to the same hash return the first index value. Thus, all duplicates map to a canonical chunk index.
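A usage sketch, assuming the chunk names were redirected to a hypothetical names.txt as described above:

# names.txt holds one chunk name per line (the redirected hashing output).
# For the name sequence a, b, b, a, c this prints 0, 1, 1, 0, 4: each
# duplicate points back at the position of its canonical chunk.
with open("names.txt", "r") as f:
    index(f)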
def dedup(index, original, dedup, size):
    for line in index:
        if line:
            original.seek(int(line) * size)
            dedup.write(original.read(size))
This snippet of Python (Gist) takes the list of canonical chunk positions, the original file, the deduplicated chunk store, and the fixed chunk size. It produces a compact file containing only the canonical chunks taken from the original binary file.
The list of canonical chunks is produced by taking the output of the index function, sorting it, and keeping only unique lines. For example, [0,1,1,1,1,5,6,7] becomes [0,1,5,6,7]. The generated deduplicated chunk store contains the binary fixed-size chunks from positions 0,1,5,6,7 in the original file.
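Putting the pieces together, here is a sketch of the surrounding pipeline. Every file name is hypothetical, and canonical.txt is assumed to hold the sorted, uniquified output of index() (for example, produced with sort -n index.txt | uniq), one canonical chunk position per line.

# Pipeline sketch around dedup(); all paths are hypothetical.
# canonical.txt: sorted, uniquified output of index(), one position per line.
with open("canonical.txt", "r") as idx, \
     open("disk.img", "rb") as original, \
     open("disk.dedup", "wb") as store:
    dedup(idx, original, store, 4096)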
Does it work?
Let’s see deduplication in action on a single virtual machine image (a virtual disk) of 20 gibibytes, containing an installation of Windows 7 Enterprise using the NTFS file system with a block size of 4096 bytes.
Figure 1. Compression ratio as a function of chunk size
Figure 1 shows compression ratio as a function of chunk size. Compression ratio is simply defined as the uncompressed size of a file divided by the compressed size; thus, a compression ratio of 2 means the compressed file is half its uncompressed size. Figure 1 holds several interesting results. gzip [13], a general-purpose open source compression tool, outperforms deduplication alone at every fixed chunk size. However, when gzip is combined with deduplication, the compression ratio increases significantly for chunk sizes greater than 256 bytes. At 256 bytes and below, the chunk index grows large enough to drag down the overall compression ratio, and gzip alone does better; 256 bytes is the crossover point. In these results we compressed the deduplicated data, but we did not compress the index. With a chunk size matching the block size of the NTFS file system, deduplication combined with gzip cuts overall data storage requirements by almost 90% (a compression ratio of 10); gzip was given the entire file without chunking.
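If you want to reproduce this kind of measurement, the ratio is just a division of file sizes. A trivial helper (the function name and paths are not from the article):

import os

def compression_ratio(original_path, compressed_path):
    # Compression ratio = uncompressed size / compressed size.
    # A ratio of 10 means the compressed data is one tenth of the original.
    return float(os.path.getsize(original_path)) / os.path.getsize(compressed_path)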
In cloud computing, deduplication is critical for scalability. Many clouds support tens of thousands of VM instances at any given time. Storage requirements for such large numbers of VM images and their associated backups are prohibitive and grow quickly with every newly created instance. However, many of these VMs run similar operating systems with similar applications, so large parts of their VM images are exactly the same. Thus, deduplication of VM images in the cloud should work very well.
Figure 2 [14] shows how VM images, when added to a central repository, take up less space after deduplication (variable chunk size; file-level deduplication). Instead of scaling linearly in the number of VM images, which are often similarly sized, deduplicated VM image storage scales sub-linearly, which makes it much more tractable for large-scale VM deployments. Deduplication is a key enabler for private and public cloud infrastructure.
Let me leave you with a question; feel free to comment online with your answer: are there any systems techniques like deduplication that you’d like to have explained and explored? Examples include batching, double-buffering, content-addressable storage, column stores, etc.