The term data deduplication increasingly refers to the technique of reducing data by breaking data streams down into very granular components, such as blocks or bytes, storing only the first instance of each item on the destination media, and recording all other occurrences in an index. Because it works at a more granular level than single-instance storage, the resulting space savings are much higher, which makes for more cost-effective solutions. The savings in space translate directly into reduced acquisition, operation, and management costs.
Data deduplication technologies are deployed in many forms and in many places within the backup and recovery infrastructure. Deduplication has evolved from being delivered within specially designed disk appliances offering post-process deduplication to being a distributed technology found as an integrated part of backup and recovery software. According to CA Technologies, solution suppliers have along the way identified the good and bad points of each evolution and developed what are today high-performance, efficient technologies.
This slideshow looks at data deduplication and five areas, identified by CA Technologies, that you should consider carefully when approaching a data deduplication project.
As with many things in the world of IT, there are numerous techniques in use for deduplicating data. Some are unique to specific vendors, who guard their technology behind patents and copyrights; others use more open methods. The goal of all of them is to identify the maximum amount of duplicate data using the minimum of resources.
The most common technique in use is that of “chunking” the data. Deduplication takes place by splitting the data stream into “chunks” and then comparing the chunks with each other. Some implementations use fixed chunk sizes; others use variable chunk sizes. The latter tend to offer a higher success rate in identifying duplicate data, as they are able to adapt to different data types and environments. The smaller the chunk size, the more duplicates will be found; however, the performance of the backup and, more importantly, of the restore is affected. Vendors therefore spend a lot of time identifying the optimal size for different data types and environments, and the use of variable chunk sizes often allows tuning to occur, sometimes automatically.
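The difference between fixed and variable chunk sizes can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's method: the function names, the chunk-size limits, and the toy fingerprint (a sum over the last few bytes) are all assumptions made for the example, and real products typically use a proper rolling hash.

```python
def fixed_chunks(data: bytes, size: int = 8) -> list[bytes]:
    """Split data into consecutive chunks of `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data: bytes, window: int = 4, mask: int = 0x07,
                    min_size: int = 4, max_size: int = 32) -> list[bytes]:
    """Content-defined chunking: cut wherever the trailing bytes match a pattern.

    All parameters are illustrative assumptions, not tuned values.
    """
    chunks, start = [], 0
    for i in range(len(data)):
        length = i - start + 1
        if length < min_size:
            continue
        # Toy fingerprint: sum of the trailing `window` bytes (not a real rolling hash).
        fingerprint = sum(data[max(0, i - window + 1):i + 1])
        # Cut on a content match, or force a cut once the chunk reaches max_size.
        if (fingerprint & mask) == 0 or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # whatever is left becomes the final chunk
    return chunks


original = b"the quick brown fox jumps over the lazy dog " * 4
shifted = b"X" + original  # one byte inserted at the front

# Chunks that both versions have in common, for each strategy.
shared_fixed = set(fixed_chunks(original)) & set(fixed_chunks(shifted))
shared_variable = set(variable_chunks(original)) & set(variable_chunks(shifted))
```

With fixed sizes, a single inserted byte moves every later chunk boundary, so previously seen chunks may no longer line up. Because variable boundaries are derived from the content itself, they can realign after the insertion, which is one reason this approach tends to find more duplicates.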
During deduplication, every chunk of data is processed using a hash algorithm and assigned an identifier that is treated as unique, which is then compared to an index. If that hash value is already in the index, the piece of data is considered a duplicate and does not need to be stored again; instead, a link is made to the original data. Otherwise, the new hash value is added to the index and the new data is stored on disk. When the data is read back and a link is found, the system simply replaces that link with the referenced data chunk. The deduplication process is intended to be transparent to end users and applications.
Deduplication can occur in any of three places: “at the client,” where the source data sits; “in-line,” as the data travels to the target; or “on the target,” after the data has been written. The last is often referred to as “post-process.” All three locations offer advantages and disadvantages, and one or more of these techniques will be found in the deduplication solutions available on the market today. The choice of which type of deduplication an organization deploys is governed by its infrastructure, its budgets, and, perhaps most importantly, its business process requirements.
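The index lookup described above can be sketched as follows. This is a simplified model under stated assumptions: the class name `DedupStore` and its methods are hypothetical, SHA-256 is chosen only for illustration (the article does not name a hash algorithm), and an in-memory dictionary stands in for both the index and the disk.

```python
import hashlib


class DedupStore:
    """Hypothetical in-memory model of a deduplicating target."""

    def __init__(self):
        self.index = {}  # hash value -> stored chunk (stands in for index + disk)

    def write(self, chunks):
        """Store each new chunk once; return the list of links (hashes)."""
        links = []
        for chunk in chunks:
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.index:
                self.index[digest] = chunk  # first occurrence: store the data
            links.append(digest)            # every occurrence: record a link
        return links

    def read(self, links):
        """Replace each link with the chunk it references."""
        return b"".join(self.index[digest] for digest in links)


store = DedupStore()
links = store.write([b"aaaa", b"bbbb", b"aaaa", b"aaaa"])
restored = store.read(links)
```

Four chunks are written here, but only two distinct chunks are stored; reading the links back reassembles the original stream, which is what keeps the process transparent to the application.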
Post-process deduplication
This works by first capturing and storing all the data, and then processing it at a later time to look for duplicate chunks. This requires a larger initial disk capacity than in-line solutions. However, because the processing of duplicate data happens after the backup is complete, there is no real performance hit on the data protection process. The CPU and memory needed for deduplication are also consumed on the target, away from the original application, and therefore do not interfere with business operations. As the target device may be the destination for data from many file and application servers, post-process deduplication offers the additional benefit of comparing data from all sources. This global deduplication increases the amount of storage space saved even further.
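As a rough sketch of this trade-off, the following Python models a target that has already staged raw chunks from several sources and then deduplicates them in a later pass. The function name `post_process`, the source names, and the use of SHA-256 are assumptions made for the example.

```python
import hashlib


def post_process(staged):
    """Deduplicate chunks that were already written to the target.

    `staged` maps a source name to the raw chunks it sent (hypothetical layout).
    Returns one shared index plus a list of links per source.
    """
    index = {}   # one index for all sources, giving a "global" view
    links = {}
    for source, chunks in staged.items():
        links[source] = []
        for chunk in chunks:
            digest = hashlib.sha256(chunk).hexdigest()
            index.setdefault(digest, chunk)  # keep only the first copy
            links[source].append(digest)
    return index, links


staged = {
    "fileserver": [b"os-block", b"documents"],
    "appserver": [b"os-block", b"database"],
}
staged_bytes = sum(len(c) for chunks in staged.values() for c in chunks)

index, links = post_process(staged)
stored_bytes = sum(len(c) for c in index.values())
```

The staging area has to hold `staged_bytes` before the pass runs, while only `stored_bytes` remain afterwards. The block shared by both servers is kept once because the index spans all sources.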
In-line deduplication
The analysis of the data, the calculation of the hash value, and the comparison with the index all take place as the data travels from source to target. The benefit is that less storage is required, because the data is already deduplicated when it is first placed on the target disk. On the negative side, because so much processing has to occur, the speed of moving the data can be slowed down. In reality, the efficiency of in-line processing has increased to the point that the performance impact on the backup job is so small as to be inconsequential. Historically, the main issue with in-line deduplication was that it often focused only on the data stream being transported and did not always take into account data from other sources. This could result in less “global” deduplication and therefore in more disk space being consumed than necessary.
Client-side deduplication
Sometimes referred to as source deduplication, this takes place where the data resides. The deduplication hash calculations are initially performed on the client (source) machines. Files that have identical hashes to files already on the target device are not sent; the target device just creates the appropriate internal links to reference the duplicated data, which results in less data being transferred to the target. This efficiency does, however, incur a cost. The CPU and memory resources required to analyze the data are also needed by the application being protected, so application performance will most likely be negatively affected during the backup process.
As data deduplication solutions have evolved, they have been packaged into a variety of products. The first major deployment of deduplication technology came in disk storage arrays. These units consist of a processor to manage the deduplication process, a set of disks to store the data, and a number of data connections through which the source data travels. Different vendors deployed different techniques: some delivered post-processing capabilities, and others did their deduplication in-line. Some tuned their boxes to store generic data, while others built in intelligence that helped recognize specific data types to increase deduplication efficiency. Most vendors offered connections to servers over Fibre Channel and iSCSI, and others included Network Attached Storage (NAS) options. The units either looked like standard disk or emulated tape libraries. The latter allowed for seamless integration with backup and recovery solutions already in place at a customer’s site.
Deduplication technology has now moved beyond the hardware disk array and is built into a number of backup and recovery solutions. Putting deduplication within the backup application offers numerous advantages, not least the extra efficiencies that are often gained in performance and the absence of any reliance on proprietary disk drives. The latter frequently leads to a more cost-effective solution. These software solutions can be found as “standalone” software appliances or as fully integrated components of the backup product.
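The exchange between client and target might look like the sketch below. The `Target` class, its `missing` and `store` methods, and the `client_backup` function are hypothetical names invented for this example, and SHA-256 is an assumed choice of hash; no particular product's protocol is implied.

```python
import hashlib


class Target:
    """Hypothetical backup target holding an index of known hashes."""

    def __init__(self):
        self.index = {}

    def missing(self, digests):
        """Tell the client which hashes the target has not seen yet."""
        return {d for d in digests if d not in self.index}

    def store(self, digest, chunk):
        self.index[digest] = chunk


def client_backup(chunks, target):
    """Hash on the client, then send only the data the target lacks."""
    digests = [hashlib.sha256(c).hexdigest() for c in chunks]  # client CPU cost
    needed = target.missing(digests)
    bytes_sent = 0
    for digest, chunk in zip(digests, chunks):
        if digest in needed:
            target.store(digest, chunk)
            needed.discard(digest)  # do not resend a repeat within this backup
            bytes_sent += len(chunk)
    return digests, bytes_sent


target = Target()
_, first_sent = client_backup([b"alpha", b"beta", b"alpha"], target)
_, second_sent = client_backup([b"alpha", b"gamma"], target)
```

In the second backup only the new item crosses the wire, since the target already holds the other one. The hashing itself runs on the client, which is where the cost to the protected application comes from.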
The data growth challenges that every organization faces are pushing the implementation of new backup and recovery technologies to help meet service level agreements around availability and performance. Shrinking budgets are pulling IT departments in the opposite direction. Data deduplication is a technology that helps organizations balance these opposing demands. Disk-based backups can be rolled out to reduce the backup window and improve recovery time, and deduplication means the investment in those disk-based targets is maximized.
Organizations should review the different deduplication technologies on the market and choose a solution that is able to integrate seamlessly into their backup environment in a cost-effective way, making sure that the investment does not tie them into a hardware solution that is difficult to expand as the organization’s data grows.
In designing the deployment of deduplication, careful thought should be put into the following five areas.
Depending on the technology in use, different types of data will affect the efficiency of the deduplication process. For example, backing up images of multiple virtual machines often leads to high disk saving figures, as there is a lot of duplication of operating system data across the images. In comparison, heavily compressed or encrypted data may not offer as much in savings, because its content is closer to unique. The use of different “chunk” sizes during the deduplication phase can alleviate some of these issues. Highly distributed source data can also prove difficult to deduplicate if the technology being used does not provide a “global” view.
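The effect of data type can be illustrated with a small, hedged experiment. The blocks below are synthetic: repeated byte patterns stand in for operating system data shared between two virtual machine images, and random bytes from `os.urandom` stand in for encrypted or heavily compressed content. The helper name `unique_fraction` is an assumption for this sketch.

```python
import hashlib
import os


def unique_fraction(chunks):
    """Fraction of chunks that must actually be stored (lower is better)."""
    digests = [hashlib.sha256(c).digest() for c in chunks]
    return len(set(digests)) / len(digests)


# Two synthetic VM images sharing the same ten "operating system" blocks.
os_blocks = [bytes([i]) * 4096 for i in range(10)]
vm_a = os_blocks + [b"app-a" * 100]
vm_b = os_blocks + [b"app-b" * 100]

# Random bytes as a stand-in for encrypted or compressed data.
opaque = [os.urandom(4096) for _ in range(22)]

vm_fraction = unique_fraction(vm_a + vm_b)   # 12 distinct blocks out of 22
opaque_fraction = unique_fraction(opaque)    # effectively every block is distinct
```

Roughly half of the virtual machine blocks need to be stored, whereas the random blocks offer nothing to deduplicate.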
The different technologies available for deduplication each have a different impact on the performance of the backup job. The post-process techniques would, on the face of it, appear to offer the least impact. However, given the need for additional staging space on the target disk and the subsequent additional I/Os to run the comparison jobs, the effect on the start-to-finish process can be extreme. In-line (in-band) processing, although done as part of the backup job, is beginning to prove more efficient as deduplication vendors improve their algorithms and index lookup technology.
Restore performance is often overlooked in the decision-making process, but it is realistically the most important aspect. The restore speed of deduplicated data has a direct impact on the availability of business applications.
The deduplication ratio is the headline of many pieces of deduplication marketing, as it speaks directly to the space savings that the organization is going to achieve. The reality is that the ratio is affected by many factors: the data itself, the change rate of the data (fewer changes mean more data to deduplicate), the location of the data, the backup schedule (more full backups make the reduction effect higher), and, perhaps least of all, the deduplication algorithm. Reductions in disk usage of 95% are achievable; however, organizations should be pleased if the savings are over 50%, as these still represent big reductions in capital and operational spending.
Even with highly efficient deduplication technology in place, an organization will eventually run out of capacity. Therefore, before choosing a technology, understand the implications of outgrowing your capacity. Will it mean maintaining numerous “silos of storage,” or will it require a forklift upgrade to a new system? Will you be tied to a specific hardware vendor, or will you be able to add capacity flexibly as needed?
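Marketing material often quotes a ratio (such as 20:1) rather than a percentage, so it helps to convert between the two. The arithmetic is standard; the function names below are only illustrative.

```python
def savings_percent(ratio: float) -> float:
    """Percentage of disk space saved for a given N:1 deduplication ratio."""
    return (1 - 1 / ratio) * 100


def ratio_from_savings(percent: float) -> float:
    """The N in an N:1 ratio that corresponds to a percentage saving."""
    return 1 / (1 - percent / 100)


modest = savings_percent(2)          # a 2:1 ratio saves 50%
headline = ratio_from_savings(95)    # a 95% saving is roughly a 20:1 ratio
```

The relationship is far from linear: going from 2:1 to 20:1 multiplies the ratio by ten, yet the space saved rises only from 50% to 95%, which is why savings above 50% already capture much of the benefit.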