E-Guide

Data reduction techniques and performance metrics

Data reduction technologies include anything that reduces the footprint of your data on disk. In primary storage, three data reduction techniques are commonly used: compression, file-level deduplication and sub-file-level deduplication. This e-guide explores the challenges of data reduction, the three techniques and how to choose the best one for your data storage environment.

Sponsored By:
Table of Contents

Data reduction techniques for primary data storage systems
Performance metrics: Evaluating your data storage efficiency
Resources from IBM
Data reduction techniques for primary data storage systems

W. Curtis Preston

The No. 1 rule to keep in mind when introducing a change in your primary data storage system is primum non nocere, or "First, do no harm." Data reduction techniques can help save money on disk systems and on power and cooling costs, but if introducing these technologies degrades the user experience, the benefits of data reduction will seem far less attractive.

The next challenge for data reduction in primary data storage is the expectation that space-saving ratios will be comparable to those achieved with data deduplication for backups. They won't be. Most backup software creates enormous amounts of duplicate data, with multiple copies stored in multiple places. Although there are exceptions, that's not typically the case in primary storage. Many people feel that any reduction beyond 50% (a 2:1 reduction ratio) should be considered gravy. This is why most vendors of primary data reduction systems don't talk much about ratios; they're more likely to cite reduction percentages. (For example, a 75% reduction in storage sounds a whole lot better than a 3:1 reduction ratio.)

If you're considering implementing data reduction technologies in primary data storage, the bottom line is this: Compared to deploying deduplication in a backup environment, the job is harder and the rewards are fewer. That's not to suggest you shouldn't consider primary storage data reduction, but you need to set expectations properly before making a commitment.

Primary storage data reduction technologies

The following are three primary storage data reduction technologies:

Compression. Compression technologies have been around for decades, but compression is typically applied to data that isn't accessed very much. That's because the act of
compressing and uncompressing data can be a very CPU-intensive process that tends to slow down access to the data. However, backup is one area of the data center where compression is widely used. Every modern tape drive can dynamically compress data during backups and uncompress it during restores. Not only does compression not slow down backups, it actually speeds them up. How is that possible? The secret is that the drives use a chip that compresses and uncompresses at line speed. Compressing the data by approximately 50% essentially halves the amount of data the tape drive has to write. Because the tape head is the bottleneck, compression actually increases the effective speed of the drive.

Compression systems for primary data storage use the same concept. Products such as Ocarina Networks' ECOsystem appliances and Storwize Inc.'s STN-2100 and STN-6000 appliances compress data as it's being stored and uncompress it as it's being read. If they can do this at line speed, they shouldn't slow down write or read performance. They should also be able to reduce the amount of disk necessary to store files by between 30% and 75%, depending on the algorithms they use and the type of data they're compressing. The advantage of compression is that it's a very mature and well-understood technology. The disadvantage is that it only finds patterns within a file, not between files, which limits its ability to reduce the size of data.

File-level deduplication. A system employing file-level deduplication examines the file system to see if two files are exactly identical. If it finds two identical files, one of them is replaced with a link to the other. The advantage of this technique is that there should be no change in access times, because the file doesn't need to be decompressed or reassembled before being presented to the requester; there are simply two different links to the same data.
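The file-level approach just described can be sketched in a few lines. This is only an illustration, assuming a POSIX filesystem that supports hard links; it is not how any particular vendor implements the feature, and the function names are invented for this example.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Hash a whole file; exactly identical files produce identical digests."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def dedupe_tree(root):
    """Replace exact-duplicate files under `root` with hard links.

    Returns the number of files replaced. Illustrative only: a real
    system would track links in metadata and handle concurrent
    writes and link-count limits safely.
    """
    seen = {}      # digest -> path of the first copy seen
    replaced = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = file_digest(path)
            if digest in seen:
                os.remove(path)              # drop the duplicate copy...
                os.link(seen[digest], path)  # ...and link it to the original
                replaced += 1
            else:
                seen[digest] = path
    return replaced
```

After running this, both names still exist and open normally, but only one copy of the data remains on disk, which is why access times are unaffected.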
The disadvantage of this approach is that it obviously won't achieve the same reduction rates as compression or sub-file-level deduplication.

Sub-file-level deduplication. Sub-file-level deduplication is very similar to the technology used in hash-based data deduplication systems for backup. It breaks all files down into segments, or chunks, and runs those chunks through a cryptographic hashing algorithm to create a numeric value that's then compared to the numeric value of every other chunk the deduplication system has ever seen. If the hashes from two different
chunks are the same, one of the chunks is discarded and replaced with a pointer to the other, identical chunk.

Depending on the type of data, a sub-file-level deduplication system can reduce the size of data quite a bit. The most dramatic results are achieved with virtual system images, and especially virtual desktop images; it's not uncommon to achieve reductions of 75% to 90% in such environments. In other environments, the amount of reduction will depend on the degree to which users create duplicates of their own data. Some users, for example, save multiple versions of their files in their home directories. They get to a "good point" and save the file, and then save it a second time with a new name. That way, they know that no matter what they do, they can always revert to the previous version. But this practice can result in many versions of an individual file -- and users rarely go back and remove older versions. In addition, many users download the same file as their coworkers and store it in their home directories. These activities are why sub-file-level deduplication works even within a typical user home directory.

The advantage of sub-file-level deduplication is that it finds duplicate patterns all over the place, no matter how the data has been saved. The disadvantage is that it works at the macro level, as opposed to compression, which works at the micro level. It might identify a redundant 8 KB segment of data, for example, but a good compression algorithm might reduce the size of that segment to 4 KB. That's why some data reduction systems use compression in conjunction with some type of data deduplication.

Overall, each primary storage data reduction technique has its pros and cons, and no single technique is better than the others. Deciding which technique is right for you comes down to your individual data storage environment and how these reduction techniques will fit in.

About this author: W. Curtis Preston (a.k.a. "Mr. Backup"), Executive Editor and Independent Backup Expert, has been singularly focused on data backup and recovery for more than 15 years. From starting as a backup admin at a $35 billion credit card company to being one of the most sought-after consultants, writers and speakers in this space, it's hard to find someone more focused on recovering lost data. He is the webmaster of BackupCentral.com, the author of hundreds of articles, and the books "Backup and Recovery" and "Using SANs and NAS."
Building the engines of a Smarter Planet: How midsize businesses get more from their data, while paying less to store it.

On a smarter planet, information doesn't just grow; it evolves. That's why midsize businesses need a storage system designed to grow with both their business and their increasingly complex information. Enter the IBM Storwize V7000, a compact midrange disk system designed and priced for midsize companies. The IBM Storwize V7000 includes advanced features like storage virtualization, thin provisioning and automated tiering at no additional cost, helping midsize companies store their data in a way that's simple, flexible and affordable. Here's how:

1. Improve application throughput by up to 200%.(1) Automated tiering moves frequently used information to faster drives, which can provide quicker search results and lower costs for storing data.
2. Maximize the potential of your infrastructure. With essential technologies like virtualization and thin provisioning, you can maximize storage potential without having to choose between performance and efficiency.
3. Simplify your storage management. A graphical user interface can simplify configuration, provisioning, tiering and upgrades, making users more productive, resources better utilized and growth easier to manage.

IBM Storwize V7000: A compact midrange disk system designed and priced for the growing needs of midsize companies. Starting at $1,250 per month for 36 months.

Midsize businesses are the engines of a Smarter Planet. To learn more about products like the IBM Storwize V7000, connect with an IBM Business Partner today. Call 1-877-IBM-ACCESS or visit ibm.com/engines/storage

(1) Based on IBM internal study. Actual results may be different based on storage, server and database configuration. Prices subject to change and valid in the U.S. only. Actual costs will vary depending on individual customer configurations and environment.
IBM Global Financing offerings are provided through IBM Credit LLC in the United States and other IBM subsidiaries and divisions worldwide to qualified commercial and government customers. Rates are based on a customer's credit rating, financing terms, offering type, equipment type and options, and may vary by country. Other restrictions may apply. Rates and offerings are subject to change, extension or withdrawal without notice. IBM, the IBM logo, ibm.com, Smarter Planet, the planet icon and Storwize are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at www.ibm.com/legal/copytrade.shtml. © International Business Machines Corporation 2010.
Performance metrics: Evaluating your data storage efficiency

By Greg Schulz

Performance metrics can help data storage pros judge the effectiveness of their enterprise data storage resources. For example, data storage efficiency can be measured in terms of capacity utilization or productivity (such as performance). Likewise, quality of service (QoS) can indicate compliance with data protection and other application service requirements.

Examples of metrics and measurements for storage efficiency and optimization include the following:

- Macro (e.g., facilities measures such as power usage effectiveness) and micro (device or component level)
- Time (performance or activity) vs. availability vs. space (capacity)
- Performance metrics, including IOPS, bandwidth, and response time or latency
- Additional performance metrics, including reads, writes, random, sequential or I/O size
- Storage capacity metrics, including percent utilization as well as reduction ratios
- Other capacity metrics, including raw, formatted, free, allocated or allocated-but-unused

Metrics can be obtained from in-house, third-party, or operating system and application-specific tools. Other metrics can be estimated or simulated; for example, benchmarks running specific workloads such as those from the Transaction Processing Performance Council (TPC), Storage Performance Council (SPC), Standard Performance Evaluation Corporation (SPEC) or Microsoft Exchange Solution Reviewed Program (ESRP). Compound metrics, those made up of multiple metrics, include cost per GB and cost per IOP, along with capacity per watt or activity per watt, such as IOPS or bandwidth per watt of energy used.
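Compound metrics are simple ratios of the measurements above. A minimal sketch, using made-up numbers purely for illustration (the systems and figures here are hypothetical, not benchmark results):

```python
def cost_per_gb(total_cost, usable_gb):
    """Compound metric: acquisition cost divided by usable capacity."""
    return total_cost / usable_gb


def iops_per_watt(iops, watts):
    """Compound metric: measured activity divided by power draw."""
    return iops / watts


# Hypothetical systems, for illustration only.
capacity_tier = {"cost": 50_000.0, "usable_gb": 200_000, "iops": 5_000, "watts": 1_000}
performance_tier = {"cost": 80_000.0, "usable_gb": 20_000, "iops": 100_000, "watts": 1_600}

for name, s in [("capacity tier", capacity_tier), ("performance tier", performance_tier)]:
    print(name,
          round(cost_per_gb(s["cost"], s["usable_gb"]), 2),  # $/GB
          round(iops_per_watt(s["iops"], s["watts"]), 1))    # IOPS/W
```

The point of a compound metric is that neither raw number alone (cost, capacity, IOPS or watts) tells you whether a system is efficient; the ratio does.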
A list of common storage performance metrics

Here is a list of common storage performance metrics:

- IOPS: I/O operations per second, where the I/Os can be of various sizes
- Latency: The response time, where lower is better for time-sensitive applications
- MTBF: Mean time between failures, which indicates reliability or availability
- MTTR: Mean time to repair or replace a failed component or storage device
- Quality of service (QoS): Refers to performance, availability or the general service experience
- Recovery point objective (RPO): To what point in time data is saved or lost
- Recovery time objective (RTO): How quickly data or applications can be made available
- SPC: Storage Performance Council workloads (IOPS, bandwidth and others)
- TPC: Transaction Processing Performance Council workload comparisons

Other metrics include uptime, planned or unplanned downtime, errors or defects, and missed windows for data protection or other infrastructure resource management tasks.

Remember to keep idle and active modes of operation in perspective when comparing tiered storage. Applications that rely on performance or data access need to be compared on an activity basis, while applications and data that are focused more on data retention should be compared on a cost-per-capacity basis. For example, active, online and primary data that needs to deliver performance should be evaluated on an activity-per-watt-per-footprint cost basis, while inactive or idle data should be evaluated on a capacity-per-watt-per-footprint cost basis.

Given that productivity is also a tenet of storage efficiency, metrics that shed light on how effectively resources are being used are important. For example, QoS, performance, transactions, IOPS, files serviced or other activity-based metrics should be examined to determine how effective and productive storage resources are.
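The tiering guidance above (compare active data on activity per watt, idle data on capacity per watt) can be expressed as a tiny scoring function. The systems and figures below are hypothetical, for illustration only:

```python
def efficiency(system, tier):
    """Score a system by the metric appropriate to its storage tier:
    activity per watt for active data, capacity per watt for idle data."""
    if tier == "active":
        return system["iops"] / system["watts"]  # IOPS per watt
    if tier == "idle":
        return system["gb"] / system["watts"]    # GB per watt
    raise ValueError("unknown tier: %s" % tier)


# Hypothetical systems, for illustration only (not measured results).
fast_array = {"iops": 120_000, "watts": 1_500, "gb": 30_000}   # performance-optimized
dense_array = {"iops": 8_000, "watts": 900, "gb": 250_000}     # capacity-optimized

# The ranking flips depending on which tier's metric you apply:
print(efficiency(fast_array, "active"), efficiency(dense_array, "active"))
print(efficiency(fast_array, "idle"), efficiency(dense_array, "idle"))
```

Scoring either system by the other tier's metric is the apples-to-oranges trap: the performance-optimized array looks wasteful per GB, and the capacity-optimized array looks wasteful per IOPS, even though each is efficient for its intended workload.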
Tips for using data storage resource metrics

Here are three other storage efficiency tips to remember:

- Look beyond cost-per-capacity comparisons.
- Remember that GB per watt can refer to capacity or to performance bandwidth.
- While high hit rates may indicate good utilization, they don't necessarily mean effective performance.

It can be easy to end up with an apples-to-oranges comparison. Storage products optimized for idle or low activity may have good capacity per watt but poor performance, with low IOPS or bandwidth per watt. Likewise, a high-performance storage system may have good IOPS or bandwidth per watt but may not be as attractive when compared on a capacity basis.

Remember that in the future, more information will have to be processed, stored and protected, in multiple locations and at a lower cost. Performance efficiency can therefore enable more effective storage capacity at a given QoS level, for both active and idle storage.
Resources from IBM

IBM System Storage: Hardware, Software and Services Solutions
IBM Real-time Compression - Storage efficiency solutions for primary, active data
IBM ProtecTIER Deduplication Solutions

About IBM

At IBM, we strive to lead in the creation, development and manufacture of the industry's most advanced information technologies, including computer systems, software, networking systems, storage devices and microelectronics. We translate these advanced technologies into value for our customers through our professional solutions and services businesses worldwide.