Data Deduplication Basics
Data deduplication reduces the amount of data that needs to be physically stored by eliminating redundant information. Data deduplication devices inspect data down to the block and bit level. After the initial inspection, only the changed data is saved, while the rest is discarded and replaced with a pointer to the previously saved information. Block- and bit-level deduplication methods are able to achieve reduction ratios of 20x to 60x, or even higher, under the right conditions.
There is also file-level deduplication, called single instance storage. In file-level deduplication, if two files are identical, only one copy of the file is kept and the duplicate is replaced with a reference to it. File-level deduplication is not as efficient as block- and bit-level deduplication because even a single changed bit results in a new copy of the whole file being stored.
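The difference between the two approaches can be illustrated with a minimal sketch. The 4KB block size, the use of SHA-256 and the function names are assumptions made for this example, not a description of any particular product.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this illustration


def file_level_stored_bytes(files):
    """Bytes stored when only byte-identical whole files are collapsed."""
    unique = {}
    for data in files:
        unique.setdefault(hashlib.sha256(data).digest(), len(data))
    return sum(unique.values())


def block_level_stored_bytes(files, block_size=BLOCK_SIZE):
    """Bytes stored when identical fixed-size blocks are collapsed."""
    unique = {}
    for data in files:
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            unique.setdefault(hashlib.sha256(block).digest(), len(block))
    return sum(unique.values())


# Ten distinct blocks, so the original file has no internal redundancy.
original = b"".join(bytes([i]) * BLOCK_SIZE for i in range(10))
# The same file with a single byte changed in the last block.
modified = original[:-1] + b"\xff"

file_level = file_level_stored_bytes([original, modified])    # both files kept in full
block_level = block_level_stored_bytes([original, modified])  # ten blocks plus one changed block
```

In this sketch the file-level approach stores both 40KB files, while the block-level approach stores the ten original blocks plus the one block that changed.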
Practical Benefits of Data Deduplication
Data deduplication reduces the amount of data that has to be stored and backed up. This means that less media has to be bought, and it takes longer to fill up disk and tape backup devices. Data can be backed up more quickly to disk, which means shorter backup windows and faster restores. Reducing the space taken up on disk systems such as VTLs makes longer retention periods possible, which keeps more restores on disk and reduces dependence on tapes and tape backup management. Less data also means less bandwidth utilization, which in turn accelerates remote backup, replication and disaster recovery processes.
What Deduplication Ratios Can Be Achieved?
Deduplication ratios vary greatly, according to the type of data being processed and the time period over which it is retained. Data that contains an abundance of repeated information, such as databases or email, will bring the highest levels of deduplication, with ratios in excess of 30:1, 40:1 or even 50:1. By contrast, data that contains lots of unique information (image files, video) will not contain a great deal of redundancy that can be eliminated.
Advantages of Hardware-Based Deduplication
Purpose-built data deduplication solutions can relieve the processing burden associated with software-based data deduplication products, as well as incorporate deduplication into other types of data protection hardware: backup appliances, VTLs and NAS. While software-based deduplication usually eliminates redundancy in data at its source, hardware-based deduplication emphasizes data reduction at the storage subsystem. For this reason, hardware-based deduplication might not provide the bandwidth savings that might be gained by deduplicating at source, but compression levels are generally better. Hardware-based data deduplication brings high performance, scalability and relatively non-disruptive deployment.
The use of software for this process is typically less expensive to deploy than dedicated hardware and should require no significant changes to the physical network. However, software-based deduplication can be more disruptive to install and more difficult to maintain. Lightweight “agents” are sometimes required on each host system to be backed up, allowing it to communicate with a backup server running the same software. The software will need updating as new versions become available or as each host’s operating environment changes over time.
How Does Inline Differ from Post-Process?
Data deduplication can be carried out inline or post-process. Inline (or in-band) data deduplication removes redundant data as it’s being written to the storage media or device. Inline can be more efficient because data is deduplicated and ingested simultaneously, so the data only passes through once. The trade-off is that, because deduplication happens during ingest, throughput may be slower and the backup may take longer.
Post-process (or out-of-band) data deduplication is carried out after data has been written to disk. This method does not affect the backup window, and sidesteps CPU processing that might create a bottleneck between the backup server and the storage device. Post-process deduplication uses more disk space during the data deduplication process because data is ingested and then deduplicated.
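The disk-space trade-off between the two methods can be sketched as follows. This is a simplified model under stated assumptions: the function names are hypothetical, blocks are held in memory, and the "peak" for post-process only counts the raw landing area, so it is a lower bound.

```python
import hashlib


def inline_ingest(blocks):
    """Deduplicate each block before it is written; return (store, peak_bytes)."""
    store, used, peak = {}, 0, 0
    for block in blocks:
        digest = hashlib.sha1(block).digest()
        if digest not in store:      # only new blocks ever reach the disk
            store[digest] = block
            used += len(block)
        peak = max(peak, used)
    return store, peak


def post_process_ingest(blocks):
    """Write all blocks first, deduplicate afterwards; return (store, peak_bytes)."""
    landing_zone = list(blocks)                 # raw data lands on disk untouched
    peak = sum(len(b) for b in landing_zone)    # disk must hold at least the raw backup
    store = {}
    for block in landing_zone:                  # second pass, after the backup has finished
        store.setdefault(hashlib.sha1(block).digest(), block)
    return store, peak


backup = [b"A" * 1024, b"B" * 1024, b"A" * 1024, b"A" * 1024]
inline_store, inline_peak = inline_ingest(backup)
post_store, post_peak = post_process_ingest(backup)
```

Both paths end with the same two unique blocks; the post-process path simply needed room for all four blocks before deduplication ran.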
Hash-based methods of redundancy elimination process each piece of data using a hash algorithm, such as SHA-1 or MD5. This method generates a number (the hash) that is, for practical purposes, unique to each piece of data, and compares it to an index of existing hash numbers. If that hash number already exists in the index, the data need not be stored again. Otherwise, the new hash number is added to the index and the data is stored.
• SHA-1 was originally devised to create cryptographic signatures for security applications. SHA-1 creates a 160-bit value that’s statistically unique for each piece of data.
• MD5 is a 128-bit hash that was also designed for cryptographic purposes.
Hash collisions occur when two different chunks of data produce the same hash. The chances of this are very slim indeed, but SHA-1 is considered the more secure of the two algorithms.
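To put "very slim" into numbers, the birthday-bound approximation p ≈ n(n-1) / 2^(b+1) gives the chance of at least one accidental collision among n chunks with a b-bit hash. The workload below (1 PiB of unique data in 8 KiB chunks) is an assumption chosen for illustration, and the estimate only covers accidental collisions between ordinary data, not collisions constructed on purpose, which are known for both MD5 and SHA-1.

```python
def collision_probability(num_chunks, hash_bits):
    """Birthday-bound approximation, valid while the result is much smaller than 1."""
    return num_chunks * (num_chunks - 1) / 2 ** (hash_bits + 1)


# Assumed workload: 1 PiB of unique data split into 8 KiB chunks (2**37 chunks).
chunks = 2 ** 50 // 2 ** 13

p_sha1 = collision_probability(chunks, 160)  # 160-bit hash
p_md5 = collision_probability(chunks, 128)   # 128-bit hash
```

Under these assumptions both probabilities are vanishingly small, and the 32 extra bits of SHA-1 make an accidental collision about four billion times less likely than with MD5.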
The best way to compare two chunks of data is to perform a bit-level comparison of the two blocks. The cost involved in doing this is the I/O required to read and compare them.
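A minimal sketch of a hash index combined with a bit-level check might look like this. The class and parameter names are hypothetical, and everything is kept in memory, whereas a real device would have to read the stored chunk from disk to compare it.

```python
import hashlib


class DedupStore:
    """Toy chunk store: hash lookup first, then a byte-for-byte confirmation."""

    def __init__(self, digest_fn=lambda data: hashlib.sha1(data).digest()):
        self.digest_fn = digest_fn
        self.index = {}    # digest -> list of chunk ids that share this digest
        self.chunks = []   # chunk id -> stored bytes

    def put(self, data):
        """Store data if it is new and return the id of the chunk that holds it."""
        digest = self.digest_fn(data)
        for chunk_id in self.index.get(digest, []):
            if self.chunks[chunk_id] == data:   # the comparison that costs extra I/O
                return chunk_id                 # true duplicate: nothing new is stored
        self.chunks.append(data)
        new_id = len(self.chunks) - 1
        self.index.setdefault(digest, []).append(new_id)
        return new_id


store = DedupStore()
first = store.put(b"monday backup block")
again = store.put(b"monday backup block")    # same id as first
other = store.put(b"tuesday backup block")   # new id

# A deliberately weak digest forces collisions; the byte check still keeps the data apart.
colliding = DedupStore(digest_fn=lambda data: b"same")
a = colliding.put(b"chunk one")
b = colliding.put(b"chunk two")
```

The design choice here is to treat a hash match as a hint rather than proof, which trades extra reads for certainty that two different chunks are never merged.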
Some vendors use custom methods to identify duplicate data, such as their own hash algorithm combined with other methods.
What Is the Difference Between Source Deduplication and Target Deduplication?
Data can be deduplicated at the target or at the source. Deduplicating at the target means you can use your current backup software and the backup system operates as usual. The target identifies and eliminates redundant data sent by the backup system. Deduplication at the source involves installing backup client software from the deduplication vendor. The client communicates with a backup server running the same software. If your mix of data is conducive to deduplication and has high deduplication ratios (e.g., 50:1), then the deduplicated data will occupy less storage space, and you can use a smaller backup target system. If you have a mix of data that does not deduplicate well (e.g., 10:1 or less data reduction), then you’ll need a much larger backup target system. What matters is the deduplication ratio achieved in a real-world environment with a real mix of data types.
The deduplication method has a significant impact on deduplication ratio. Not all deduplication approaches are created equal.
Zone-level with byte comparison or 8KB block-level with variable length content splitting will get the best deduplication ratios. The average is a 20:1 deduplication ratio with a general mix of data types.
64KB and 128KB fixed block will produce the lowest deduplication ratio, as the blocks are too big to find many repetitive matches. The average is a 7:1 deduplication ratio.
4KB fixed block will get close to the zone-level and 8KB variable-length approaches, but often suffers a performance hit. A 13:1 deduplication ratio is the average with a general mix of data types.
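One reason variable-length splitting finds more matches than fixed blocks is that it survives insertions. The sketch below is a simplified illustration: it cuts a chunk wherever a CRC32 of the preceding 16 bytes has its low 10 bits clear, which gives chunks of roughly 1KB on average. The window size, mask and function names are assumptions for this example; production systems use their own rolling-hash schemes and chunk sizes.

```python
import random
import zlib


def fixed_chunks(data, size=4096):
    """Split at fixed offsets, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data, window=16, mask=0x3FF):
    """Split wherever the hash of the preceding `window` bytes matches a pattern."""
    chunks, start = [], 0
    for end in range(window, len(data)):
        if zlib.crc32(data[end - window:end]) & mask == 0:
            chunks.append(data[start:end])   # boundary depends only on nearby content
            start = end
    chunks.append(data[start:])
    return chunks


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
shifted = b"X" + original   # one byte inserted at the front of the file

fixed_shared = len(set(fixed_chunks(original)) & set(fixed_chunks(shifted)))
variable_shared = len(set(variable_chunks(original)) & set(variable_chunks(shifted)))
total_variable = len(set(variable_chunks(original)))
```

With fixed blocks, the one-byte insertion shifts every block boundary, so no block matches. With content-defined boundaries, only the first chunk changes and every later chunk is found again.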
The number of weeks of retention you keep impacts deduplication ratio, as well. The longer the retention, the more the system is seeing repetitive data. Therefore, the ratio increases as the retention increases. Most vendors will say they get a deduplication ratio of 20:1, typically based on a 16-week retention period. If you keep only two weeks of retention, you may only get about a 4:1 reduction.
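The effect of retention can be shown with a deliberately simple model: one full backup per week, with an assumed fraction of the data changing between fulls. The 5% weekly change rate and the function name are assumptions for illustration. The model ignores compression, nightly backups and data mix, so it shows the trend rather than reproducing the 20:1 and 4:1 figures quoted above.

```python
def dedup_ratio(weeks, full_tb, weekly_change=0.05):
    """Logical data retained divided by physical data stored, for weekly fulls."""
    logical = weeks * full_tb                                        # what the backup software sent
    physical = full_tb + (weeks - 1) * weekly_change * full_tb       # first full plus changed data
    return logical / physical


two_weeks = dedup_ratio(2, full_tb=30)
sixteen_weeks = dedup_ratio(16, full_tb=30)
```

Each extra week adds a full backup's worth of logical data but only a small amount of new physical data, so the ratio climbs as retention grows.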
Your backup rotation will also impact the size of the deduplication system that you need. If you’re performing rolling full backups each night as opposed to incremental backups on files during the week and a full backup on the weekend, then you would probably need a larger system.
How Does a Deduplication Device Record the Existence of Redundant Data?
Once a deduplication device has identified a redundant piece of data, it has to decide how to record its existence. There are two ways it can do so:
- Reverse referencing, which creates a pointer to the original instance of the data when additional identical pieces of data occur.
- Forward referencing, which writes the latest version of the piece of data to the system, then makes the previous occurrence a pointer to the most recent.
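The two referencing styles can be sketched with a list of slots, each holding either data or a pointer to another slot. The structure and function names are hypothetical, and the index is keyed on the data itself for brevity where a real device would key on a hash.

```python
def reverse_reference(slots, index, data):
    """Keep the first copy; each later occurrence becomes a pointer back to it."""
    if data in index:
        slots.append(("ptr", index[data]))
    else:
        slots.append(("data", data))
        index[data] = len(slots) - 1
    return len(slots) - 1


def forward_reference(slots, index, data):
    """Write the newest copy; the previous holder becomes a pointer forward to it."""
    slots.append(("data", data))
    newest = len(slots) - 1
    if data in index:
        slots[index[data]] = ("ptr", newest)
    index[data] = newest
    return newest


def resolve(slots, slot_id):
    """Follow pointers until a slot that holds real data is reached."""
    kind, value = slots[slot_id]
    while kind == "ptr":
        kind, value = slots[value]
    return value


reverse_slots, reverse_index = [], {}
forward_slots, forward_index = [], {}
for _ in range(3):   # the same block arrives in three successive backups
    reverse_reference(reverse_slots, reverse_index, b"blk")
    forward_reference(forward_slots, forward_index, b"blk")
```

In this sketch, reverse referencing leaves the data in the oldest slot, while forward referencing leaves it in the newest one, so reading the most recent backup needs no pointer hops at the cost of rewriting an older slot on every duplicate.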
How Does Encryption Affect Data Deduplication?
Deduplication works by eliminating redundant files, blocks or bits, and encryption turns data into a data stream that’s random by its nature. Therefore, if you encrypt data first – that is, effectively randomize it and remove similar patterns – then it may be impossible to deduplicate it. So you may find that data should be deduplicated first and then encrypted.
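The effect of ordering can be demonstrated with a toy example. The cipher below is a stand-in built from SHA-256 in a counter arrangement purely to produce random-looking output; it is an assumption for illustration and must not be used for real encryption. The key and function names are likewise hypothetical.

```python
import hashlib

BLOCK = 4096  # assumed deduplication block size


def toy_stream_encrypt(data, key):
    """XOR data with a SHA-256 based keystream. Illustration only, not secure."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


def unique_blocks(data):
    """Number of distinct fixed-size blocks a deduplicator would have to store."""
    return len({hashlib.sha1(data[i:i + BLOCK]).digest()
                for i in range(0, len(data), BLOCK)})


backup = (b"\x00" * BLOCK) * 8               # eight identical blocks
encrypted = toy_stream_encrypt(backup, key=b"example-key")

plain_unique = unique_blocks(backup)         # highly redundant before encryption
cipher_unique = unique_blocks(encrypted)     # redundancy is gone after encryption
```

Before encryption the eight blocks collapse to one; after encryption every block looks different, so there is nothing left to deduplicate.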
Sizing Data Protection Strategy
Correctly sizing a disk backup with deduplication to meet your current and future needs is an important part of your data protection strategy. If you ask the right questions up front and analyze each aspect of your environment that impacts backup requirements, you can avoid the consequences of buying an undersized system that quickly exceeds capacity.
It’s important to understand how to size properly. It’s a bit different than the process of sizing a primary storage system.
The data types you have directly impact the deduplication ratio and, therefore, the system you need. Replication between sites also affects sizing. Here are two scenarios:
• Scenario 1: You’re backing up data at Site A and replicating to Site B for disaster recovery. For example, if Site A holds 30TB and Site B is used only for disaster recovery, then a system that can handle 30TB at Site A and 30TB at Site B is required.
• Scenario 2: If backup data is kept at both Site A (30TB) and at Site B (18TB), and the data from Site A is being replicated to Site B while the data from Site B is being cross-replicated to Site A, then a larger system on both sides is required.
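The arithmetic behind the two scenarios can be sketched as follows, assuming each site must be sized for its own backup data plus whatever is replicated to it. The function name and the dictionary layout are hypothetical, and the figures are the amounts of backup data each system must handle, before any deduplication ratio is applied.

```python
def site_requirements(local_tb, replicated_from):
    """Return {site: TB to handle} as local data plus data replicated in.

    local_tb: {site: TB backed up locally}
    replicated_from: {target_site: [source_sites replicating to it]}
    """
    sites = set(local_tb) | set(replicated_from)
    return {
        site: local_tb.get(site, 0)
        + sum(local_tb[source] for source in replicated_from.get(site, []))
        for site in sites
    }


# Scenario 1: Site A backs up 30TB and replicates one way to Site B.
scenario_1 = site_requirements({"A": 30, "B": 0}, {"B": ["A"]})

# Scenario 2: Site A (30TB) and Site B (18TB) cross-replicate to each other.
scenario_2 = site_requirements({"A": 30, "B": 18}, {"A": ["B"], "B": ["A"]})
```

Under this assumption, Scenario 1 needs 30TB on each side, while in Scenario 2 each site has to handle 48TB, which is why cross-replication calls for a larger system at both sites.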
Questions to Be Addressed while Planning the Right Sized System
- How much data is in your full backup?
- What percentage of the data is compressed (including media files), encrypted, database, unstructured?
- What is the required retention period in weeks/months off-site?
- What is the nightly backup rotation?
- Is data being replicated one way only or backed up from multiple sites and cross-replicated?