Using patented high-speed inline deduplication technology, Data Domain systems identify redundant data as it is being stored, creating a storage footprint that is 10x to 30x smaller on average than the original dataset and reducing the WAN bandwidth needed for replication by up to 99%. Originally an ideal solution for backup and disaster recovery applications, Data Domain deduplication storage is now being deployed more broadly as a storage tier, including near-line file storage, backup, disaster recovery (DR), and long-term retention of enterprise data for reference, litigation support, and regulatory compliance. The Data Domain product family ranges from the low-end DD140 system to the high-end Global Deduplication Array.
A Data Domain appliance is a storage system with shelves of disks and a controller. It is optimized first for backup and second for archive applications, and supports most of the industry-leading backup and archiving applications. The list on the slide is composed primarily of leading backup applications: not only EMC's NetWorker, but also offerings from Symantec, CommVault, and others. On the way into the storage system, data can pass through either Ethernet or Fibre Channel. With Ethernet it can use standard protocols such as NFS or CIFS; it can also use optimized protocols such as Data Domain Boost, a custom integration with leading backup applications. Because the data is deduplicated during the storage process, it can then be replicated for disaster recovery, sending to the target tier only the compressed, deduplicated unique data segments that were filtered out during the write process. Within the hardware, best-of-class approaches apply commodity hardware to maximum effect; Data Domain supports a RAID 6 implementation.
The end result of identifying duplicate segments and compressing the data before storing it is a significant reduction in the data stored on disk. The overall reduction is viewed as compression, and it is sometimes discussed in two parts: global and local. Global compression refers to the deduplication process that compares received data to data already stored on disk. Data that is new is then locally compressed before being written to disk. To see how the effect of global compression increases over time, consider a backup stream from a first full backup that contains five segments: A, B, C, another copy of B, and D. This is stored on disk as A, B, C, D, with a reference to B instead of a second copy. Global compression at this point is the ratio of the size of the 5 segments received (A+B+C+B+D) to the size of the 4 segments stored on disk (A+B+C+D). If the next backup is an incremental backup that includes copies of A and B as well as a new segment E, only E needs to be stored. A and B are already on disk, so the system simply creates references to the previously stored segments. Global compression for this backup is quite good, since it is the ratio of the 3 received segments (A+B+E) to the single stored segment E. The second full backup is when the savings from global compression start to become very large. A, B, C, D, and E are recognized as duplicates from the previous two backups, and only the new segment F is stored. Global compression of this second full backup is very high, with 6 segments coming in but only the one new segment being stored. Global compression taken over all three backups is the ratio of all 14 segments sent by the backup software to the 6 segments stored to represent all the data received over time. Local compression further reduces the space needed for the 6 stored segments, by as much as another 2:1.
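The accounting above can be sketched in a few lines of Python. This is a minimal illustration of the segment bookkeeping, using the example's segment labels; it is not Data Domain's actual implementation, and real systems track segments by fingerprint, not by name.

```python
def dedup(backups):
    """Return (received, stored) segment counts across a backup sequence."""
    stored = []      # unique segments kept on disk, in arrival order
    received = 0
    for backup in backups:
        for seg in backup:
            received += 1
            if seg not in stored:
                stored.append(seg)  # new segment: store it
            # duplicate: only a reference is kept, nothing new is written
    return received, stored

backups = [
    ["A", "B", "C", "B", "D"],        # first full backup: 5 in, 4 stored
    ["A", "B", "E"],                  # incremental: 3 in, 1 stored
    ["A", "B", "C", "D", "E", "F"],   # second full: 6 in, 1 stored
]
received, stored = dedup(backups)
print(received, len(stored))                               # 14 6
print(f"global compression {received / len(stored):.2f}:1")  # 2.33:1
```

Running this reproduces the narration's totals: 14 segments received, 6 stored, for an overall global compression of about 2.33:1 before local compression is applied.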
In a post-process architecture, data is written to disk before deduplication. After it is stored, it is read back internally, deduplicated, and written again to a different area. This approach may sound appealing because it seems to allow faster backups with fewer resources, but post-process deduplication actually requires many more disks to hold the multiple pools of data and to sustain speed. In the inline approach, all data is filtered before it is stored to disk, which improves overall performance.
The Data Domain operating system (DD OS) is purpose-built for data protection; its design elements form an architecture whose goal is data invulnerability. Since every component of a storage system can introduce errors, an end-to-end test is the simplest path to ensuring data integrity. End-to-end verification means reading data after it is written and comparing it to what it is supposed to be, proving that it is reachable through the file system to disk. When DD OS receives a write request from backup software, it computes a checksum for the data. After analyzing the data for redundancy, it stores the new data segments and all of the checksums. After the backup is complete and all the data has been synchronized to disk, DD OS verifies that it can read the entire file from the disk platter through the Data Domain file system, and that the checksums of the data read back match the checksums of the data written. This ensures that the data on the disks is readable and correct, and that the file system metadata structures used to find the data are also readable and correct. The data is correct and recoverable at every level of the system. If there is a problem anywhere along the way, for example a bit flipped on a disk drive, it will be caught. In most cases it can be corrected through the self-healing feature. If for any reason it cannot be corrected, it is reported immediately, and the backup can be repeated while the data is still valid on the primary store. Conventional, performance-optimized storage systems cannot afford such rigorous verification. The tremendous data reduction achieved by Data Domain Global Compression reduces the amount of data that needs to be verified and makes such verification possible.
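The checksum-on-write, verify-on-read-back idea can be sketched as follows. This is a simplified illustration only, with an in-memory dictionary standing in for the disk; the function names and the choice of SHA-256 are assumptions for the sketch, not DD OS internals.

```python
import hashlib

disk = {}  # segment_id -> (data, checksum); stands in for physical storage


def write_segment(seg_id, data):
    """Compute a checksum on ingest and store it alongside the data."""
    checksum = hashlib.sha256(data).hexdigest()
    disk[seg_id] = (data, checksum)
    return checksum


def verify_segment(seg_id):
    """Read the data back and confirm it still matches its checksum."""
    data, stored_checksum = disk[seg_id]
    return hashlib.sha256(data).hexdigest() == stored_checksum


write_segment("A", b"backup data segment A")
print(verify_segment("A"))   # read-back matches: data is readable and correct

# Simulate a flipped bit on disk: verification catches the corruption.
data, checksum = disk["A"]
disk["A"] = (b"backup data segment a", checksum)
print(verify_segment("A"))   # corruption detected
```

The point of the sketch is the ordering: the checksum is computed before the data reaches storage, and verification reads back through the same path the data took in, so any corruption introduced along the way is detected rather than silently returned.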
Once data is stored on a Data Domain system, a variety of replication options can move the compressed, deduplicated changes to a secondary or tertiary site, enabling restore from multiple locations for disaster recovery. Collection replication performs whole-system mirroring in a one-to-one topology, continuously transferring changes in the underlying collection, including all of the logical directories and files of the Data Domain file system. The most popular option is a directory- or tape-pool-oriented approach that lets you select part of the file system, or a virtual tape library or tape pool, and replicate only that. A single system can therefore serve as both a backup target and a replica for another Data Domain system. The graphic shows a number of smaller sites all replicating into one hub site. In that case, each source system asks the hub whether it already has a given segment of data. If it does not, the source sends the data; if the destination already has the segment, the source does not have to send it again. In this many-to-one configuration, with multiple systems replicating to one system, there is cross-site deduplication, further reducing the WAN bandwidth required and the cost.
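The hub exchange described above can be sketched as a simple ask-before-send protocol. This is an illustrative model only; class and method names are hypothetical, and real replication identifies segments by fingerprint over a WAN link rather than in-process calls.

```python
import hashlib


class Hub:
    """Stand-in for the hub-site Data Domain system."""

    def __init__(self):
        self.segments = {}                    # fingerprint -> segment data

    def has(self, fingerprint):
        return fingerprint in self.segments   # "do you have this segment yet?"

    def receive(self, fingerprint, data):
        self.segments[fingerprint] = data


def replicate(hub, segments):
    """Ship only segments the hub does not already hold; return count sent."""
    sent = 0
    for data in segments:
        fp = hashlib.sha256(data).hexdigest()
        if not hub.has(fp):                   # ask first...
            hub.receive(fp, data)             # ...send only if missing
            sent += 1
    return sent


hub = Hub()
site_a = [b"seg1", b"seg2"]
site_b = [b"seg2", b"seg3"]     # seg2 is shared across the two sites
print(replicate(hub, site_a))   # 2 segments sent
print(replicate(hub, site_b))   # 1 segment sent: cross-site deduplication
```

Because the second site's copy of seg2 is already on the hub, only one of its two segments crosses the WAN, which is the cross-site deduplication effect the narration describes.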
EMC Data Domain Boost software distributes part of the deduplication process to the DD Boost library that runs on backup servers. Traditional backup is a three-tier system: a backup client, a backup server, and a storage array. The whole stream of backup data from the client has to go through the backup server, across two LAN hops, to a storage device. Traditionally with Data Domain, since all of the deduplication occurs on the array, the network and each system along the way has to ship the whole dataset over both hops of the backup LAN. DD Boost distributes some of the deduplication processing to the backup server, so the last hop carries only deduplicated, compressed data. This makes the backup network more efficient, makes Data Domain systems 50% faster, and makes the whole system more manageable. It works across the entire Data Domain product line.
Today's IT environments face the combined challenge of data growth and shrinking backup windows. Recovery time objectives (RTOs) and recovery point objectives (RPOs) are also becoming more stringent, increasing the importance of a highly reliable, high-performance backup environment. As a complement to tape for long-term, offsite storage, backup-to-disk solutions such as the EMC Disk Library products have emerged as powerful options. Customers seeking the advanced virtual tape library (VTL) functionality of the Disk Library as well as the ROI benefits of deduplication can pair a Disk Library deployment with Data Domain. This lets customers move data to Data Domain deduplication storage systems for longer-term retention and network-efficient replication. The figure on the slide shows a Disk Library with Data Domain deployment scenario. In this deployment, data in the Disk Library virtual tape cartridges is migrated or copied to the Data Domain system, where it is deduplicated to remove redundancies, providing longer data retention than a stand-alone Disk Library. The Data Domain system does not need to be dedicated to the Disk Library: while operations are occurring from the Disk Library to the Data Domain system, concurrent NAS or VTL jobs can run in parallel on the Data Domain system.
The most common scenarios for using the Disk Library with the Data Domain system are shown on the slide.

1. Copying data from the Disk Library to the Data Domain system: In this scenario, either one or two engines write data to the Data Domain system. Data is migrated from the Disk Library (using tape caching) or copied (using the embedded media managers) to the Data Domain system. With the Automated Tape Caching feature, the backup application sees the local copy of data, and data access is through the Disk Library. With the embedded storage node or embedded media server, the backup application is aware of both copies of data, and data access is through the backup application.

2. Copying data from the Disk Library to Data Domain and to a physical tape library: In this scenario, data is copied to the Data Domain system and to a physical tape library via the embedded storage node/media server. In this configuration, the data can reside on each of the three units for different retention periods. Each engine must see both the Data Domain system and the physical tape library, since data is seen by each engine individually. Multiple engines can be used in a dual-engine configuration, with each writing to its own Data Domain system and physical tape unit.

3. Copying data to the Data Domain and replicating to another Data Domain: In this scenario, data is written to the Data Domain system and then replicated to another Data Domain system. Data is either migrated from the Disk Library (using tape caching) or copied (using the embedded media managers) to the Data Domain system. The data is then automatically replicated to another Data Domain system. A dedicated Disk Library on the target side is not required, although in some tape caching environments a Disk Library on the target side may be required.