All deduplication implementations must maintain a catalog and must support some form of block referencing. Many implementations exist, and they differ in subtle ways, often just enough to make each one patentable. The main differences lie in hashing and indexing.
Hashing
Data deduplication begins with a comparison of two data objects. All vendors create small hash values for each new object, and store these values in a catalog.
A hash value is a small number generated from a longer string of data. It’s generated by a mathematical formula in such a way that it is unlikely for two non-identical data objects to produce the same hash value.
A hash value can be as simple as a parity calculation or as elaborate as a cryptographic hash such as SHA-1 or MD5. Once the hash values are created, they can be compared quickly to identify deduplication candidates.
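As a rough illustration of the hashing step, the sketch below splits a byte string into fixed-size blocks, hashes each block with SHA-1, and records the results in a catalog. The fixed 4 KB block size and the function names block_hashes and find_candidates are assumptions made for this example; real products choose their own object boundaries and hash functions.

```python
import hashlib


def block_hashes(data: bytes, block_size: int = 4096):
    """Yield (offset, digest) for each fixed-size block of data."""
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        yield offset, hashlib.sha1(block).hexdigest()


def find_candidates(data: bytes, block_size: int = 4096):
    """Return {digest: [offsets]} for every digest seen more than once."""
    catalog = {}  # hypothetical in-memory hash catalog: digest -> offsets
    for offset, digest in block_hashes(data, block_size):
        catalog.setdefault(digest, []).append(offset)
    # Only digests that occur at two or more offsets are candidates.
    return {d: offs for d, offs in catalog.items() if len(offs) > 1}
```

Blocks that share a digest are only candidates at this point. Whether they are then compared byte for byte or simply trusted is a separate design decision.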
Indexing
Different implementations use varying methods when modifying the data pointer structure. However, all forms of data pointer indexing fall into two broad categories:
Hash catalog: A catalog of hash values used to identify candidates for deduplication. A system process identifies duplicates, and data pointers are modified accordingly. The advantage of catalog deduplication is that the catalog is used only to identify duplicate objects; it is not accessed during the actual reading or writing of the data objects. That task is still handled by the normal file system data structure.
Lookup table: A lookup table extends the hash catalog so that it also indexes each deduplicated object's parent data pointer. The advantage of a lookup table is that it can be used on file systems that do not support multiple block referencing; a single data object can be stored once and referenced many times through the lookup table.
Deduplication indexing is important in your evaluation. When you use lookup tables to index data objects, the table itself becomes a single point of failure. Any table corruption will likely render the entire file system unusable.
Catalog-based deduplication, on the other hand, is used only for discovery and not for actual reading and writing of objects. Catalog deduplication, however, requires that the native file system support multiple block referencing.
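To make the lookup-table trade-off concrete, here is a minimal toy store, not any vendor's design. The class name LookupTableStore, its fields, and the 4 KB default block size are hypothetical. Every read goes through the table, so losing one table entry breaks every file that references that block.

```python
import hashlib


class LookupTableStore:
    """Toy lookup-table deduplication: each unique block is stored once."""

    def __init__(self, block_size: int = 4096):
        self.block_size = block_size
        self.blocks = {}  # lookup table: digest -> block bytes
        self.files = {}   # file name -> ordered list of digests

    def write(self, name: str, data: bytes) -> None:
        digests = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha1(block).hexdigest()
            # Store the block only the first time its digest is seen.
            self.blocks.setdefault(digest, block)
            digests.append(digest)
        self.files[name] = digests

    def read(self, name: str) -> bytes:
        # Every read resolves each digest through the lookup table.
        return b"".join(self.blocks[d] for d in self.files[name])
```

If an entry in blocks is lost or corrupted, read raises KeyError for every file that shares that block. A pure hash catalog avoids this because reads never consult it, at the cost of needing a file system that supports multiple block referencing.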
Inline or Post-processing
When deduplication occurs is another factor to consider. There are two options:
- Inline deduplication: Deduplication is performed as the data is written to the storage system.
With inline deduplication, the hash catalog is usually placed into system memory to facilitate fast object comparisons. Its advantage? It does not require duplicate data to actually be written to disk. The duplicate object is hashed, compared, and re-referenced on the fly. The disadvantage? Significant system resources are required to handle the entire deduplication operation in milliseconds. In most systems duplicate object validation beyond a quick hash compare is generally not feasible when done inline. These systems usually rely on “trusted” hash compares without validating that the objects are indeed identical.
The exception to the above rule is Diligent, a software solution that keeps its index in RAM (the index for up to 1 PB of data fits in 4 GB of RAM). This allows fast comparisons and, ultimately, binary-level validation.
- Post-processing deduplication: Deduplication is performed after the data is written to the storage system.
With post-processing, deduplication can be done at a more leisurely pace, and it does not require heavy utilization of system resources. The disadvantage of post-processing is that all duplicate data must first be written to the storage system, requiring additional, although temporary, physical space on the system.
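The two timing options can be sketched side by side. This is a simplified model under stated assumptions: blocks arrive one at a time, the "disk" is a Python list, and the names InlineStore, post_process, and the verify flag are hypothetical. With verify=False the inline store trusts the hash compare; with verify=True it also compares the bytes before re-referencing.

```python
import hashlib


def digest(block: bytes) -> str:
    return hashlib.sha1(block).hexdigest()


class InlineStore:
    """Deduplicates on the write path, before anything reaches disk."""

    def __init__(self, verify: bool = False):
        self.verify = verify
        self.catalog = {}  # in-memory hash catalog: digest -> disk index
        self.disk = []     # unique blocks physically written
        self.refs = []     # logical block sequence as indices into disk

    def write(self, block: bytes) -> None:
        d = digest(block)
        idx = self.catalog.get(d)
        # Trusted compare, or byte-level validation when verify is set.
        if idx is not None and (not self.verify or self.disk[idx] == block):
            self.refs.append(idx)  # duplicate: re-reference, write nothing
            return
        self.disk.append(block)
        new_idx = len(self.disk) - 1
        self.catalog.setdefault(d, new_idx)
        self.refs.append(new_idx)


def post_process(raw_blocks):
    """Deduplicate blocks that were already written in full.

    Returns (unique_blocks, refs). The caller held all of raw_blocks
    on disk first, which is the temporary space cost of this approach.
    """
    catalog, unique, refs = {}, [], []
    for block in raw_blocks:
        d = digest(block)
        if d not in catalog:
            catalog[d] = len(unique)
            unique.append(block)
        refs.append(catalog[d])
    return unique, refs
```

Both paths end with the same unique blocks and references. The difference is when the work happens and how much raw data sits on disk in the meantime.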
Inline versus post-processing deduplication has more to do with your application than with technical advantages or disadvantages. When performing data backups, your objective is completion of backups within an allowed time window. When adding deduplication to backups, your objective is to free up redundant storage space required for these backups.
These two objectives should not compete: any additional time required for deduplication should not push backups beyond the allotted time window. You should determine whether the time penalty of deduplication is offset by the space savings realized afterward, regardless of whether deduplication is performed inline or as post-processing.
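A back-of-the-envelope check of that trade-off might look like the sketch below. The function names and every number in the usage note are hypothetical placeholders, not measurements; substitute figures from your own environment.

```python
def backup_fits_window(backup_hours: float,
                       dedup_overhead_hours: float,
                       window_hours: float) -> bool:
    """True if backup time plus any deduplication overhead fits the window."""
    return backup_hours + dedup_overhead_hours <= window_hours


def space_saved_gb(logical_gb: float, dedup_ratio: float) -> float:
    """Space freed when logical_gb shrinks by dedup_ratio (4.0 means 4:1)."""
    return logical_gb - logical_gb / dedup_ratio
```

For instance, with an assumed 8-hour window, a 6-hour backup plus 1 hour of deduplication overhead still fits, and an assumed 4:1 ratio on 1,000 GB of backup data would free 750 GB.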

