Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads


WHITE PAPER
Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads
December 2014
Western Digital Technologies, Inc., 951 SanDisk Drive, Milpitas, CA 95035
www.sandisk.com

Table of Contents

Executive Summary
Apache HBase
YCSB Benchmark
Test Design
Test Environment
Technical Component Specifications
  Compute Infrastructure
  Network Infrastructure
  Storage Infrastructure
  HBase Configuration
Test Validation and Results Summary
  Test Methodology
  Update-Heavy Workload (Workload A)
  Read-Intensive Workload (Workload B)
  Read-Only Workload (Workload C)
Conclusion
Summary
References

Executive Summary

In today's hyper-connected world, a significant amount of data is being collected, and later analyzed, to make business decisions. This explosion of data has led to a variety of analytics technologies that can operate on this Big Data, which stretches into the petabyte (PB) range within organizations, and into the zettabyte (ZB) range worldwide. Traditional database systems and data warehouses are being augmented with newer scale-out technologies such as Apache Hadoop and with NoSQL databases such as Apache HBase, Cassandra and MongoDB to manage the massive scale of data being processed and analyzed by the organizations that use these databases.

This technical paper describes the benchmark testing conducted by SanDisk, using SanDisk SSDs in conjunction with an Apache HBase deployment. The testing used the Yahoo! Cloud Serving Benchmark (YCSB) with two dataset sizes (one smaller and one larger than the server's available memory) running on a single-node Apache HBase deployment, and focused on update-heavy and read-heavy YCSB workloads. The primary goal of this paper is to show the benefits of using SanDisk solid-state drives (SSDs) within an HBase environment. As the testing demonstrated, SanDisk SSDs provided significant performance improvements over mechanically driven hard disk drives (HDDs) when running I/O-intensive workloads, especially mixed workloads with random read and write data accesses. For more information, see www.sandisk.com.

SanDisk SSDs help boost performance in HBase environments by providing 30x to 40x more transactions per second (TPS) than HDD configurations, and very low latencies of less than 50 microseconds (µs), compared to more than 1,000 µs with HDDs. These performance improvements translate into significantly more efficient data-analytics operations by providing faster time to results, and thus into cost savings related to reduced operational expenses for business organizations.

Apache HBase

Apache HBase is an open-source, distributed, scalable, big-data database. It is a non-relational database (also called a NoSQL database), unlike many traditional relational database management systems (RDBMS). In distributed mode, Apache HBase uses the Hadoop Distributed File System (HDFS); it can also be deployed in standalone mode, in which case it uses the local file system. It is used for random, real-time read/write access to large quantities of sparse data. Apache HBase uses Log-Structured Merge trees (LSM trees) to store and query the data. It features compression, in-memory caching, and very fast scans. Importantly, HBase tables can serve as both the input and output for MapReduce jobs.

Some major features of HBase that have proven effective in supporting analytics workloads are:

- Linear and modular scalability
- Strictly consistent reads and writes
- Automatic and configurable sharding of tables, with shards being smaller and more manageable parts of large database tables

- Support for fault-tolerant and highly available processing, through automatic failover between RegionServers
- Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables
- A Java API for client access, as well as Thrift and REST gateway APIs, providing applications with a variety of access methods to the data
- Near-real-time query support
- Support for exporting metrics via the Hadoop metrics subsystem to files or to the Ganglia open-source cluster management and administration tool

The HBase architecture defines two different storage layers, the MemStore and the StoreFile. An object is written into the MemStore first, and when the MemStore is filled to capacity, a flush-to-disk operation is requested. This flush-to-disk operation results in the MemStore contents being written to disk. HBase performs compaction/compression during the flush-to-disk operations (which are more frequent for heavy write workloads), merging different StoreFiles together. HBase read requests first try to read an object from the MemStore; on a read miss, the read operation is fulfilled from the StoreFiles. With a heavy read workload, the number of concurrent read operations from disk increases accordingly.

The HBase storage architecture for a distributed HBase instance is shown in Figure 1. The figure shows the various components within HBase, including the HBase Master (HMaster) and the HBase RegionServers, and it also shows how the various components access the underlying files residing in HDFS. Please see the References section for detailed information about the HBase architecture.

Figure 1: HBase Components and HDFS (diagram showing the client, ZooKeeper, the HMaster, HRegionServers with HRegions, HLog, MemStores, Stores with StoreFiles/HFiles, and DFS clients writing to HDFS DataNodes)
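The Java API mentioned above is the most common way applications read and write HBase data. The following is a minimal sketch using the older HTable-based client API (newer HBase releases use ConnectionFactory and Table instead); the table name, column family and row key are illustrative placeholders, not details taken from this testing. The put is buffered in the MemStore, and the get is served from the MemStore or, on a miss, from the StoreFiles, as described above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseClientSketch {
        public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml from the classpath for connection settings.
            Configuration conf = HBaseConfiguration.create();

            // "usertable" and column family "cf" are illustrative names only.
            HTable table = new HTable(conf, "usertable");

            // Write path: the update is appended to the write-ahead log and
            // buffered in the MemStore; it reaches a StoreFile only on a flush.
            Put put = new Put(Bytes.toBytes("user1000"));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("field0"), Bytes.toBytes("value0"));
            table.put(put);

            // Read path: served from the MemStore (or block cache) if present,
            // otherwise fulfilled from the StoreFiles on disk.
            Get get = new Get(Bytes.toBytes("user1000"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("field0"))));

            table.close();
        }
    }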

YCSB Benchmark

The Yahoo! Cloud Serving Benchmark (YCSB) consists of two components:

- The client, which generates the load according to a workload type and records the latency and throughput associated with that workload.
- The workload files, which define the workload type by describing the size of the data set, the total number of requests, and the ratio of read and write queries.

There are six major workload types in YCSB:

1. Update-Heavy Workload A: 50/50 update/read ratio
2. Read-Heavy Workload B: 5/95 update/read ratio
3. Read-Only Workload C: 100% read-only; maps closely to Workload B
4. Read-Latest Workload D: 5/95 insert/read ratio, with the read load skewed towards the end of the key range; has similarities to Workload B
5. Short-Ranges Workload E: 5/95 insert/read ratio, with reads over short ranges of up to 100 records
6. Read-Modify-Write Workload F: 50/50 write/read ratio; similar to Workload A, but the writes are actual updates/modifications rather than blind writes as in Workload A

For additional information about YCSB and its workload types, please visit the official YCSB page, as listed in the References section of this paper.

This technical paper covers the first three workload types (A, B and C) and reports the latency and transactions-per-second results for these workloads using different storage configurations. These workloads provide a good mix of update-heavy and read-heavy test scenarios.

Test Design

A standalone HBase instance using the local file system was set up for the purpose of determining the benefits of using solid-state disks within an HBase environment. The testing was conducted with the YCSB benchmark, using different workloads (Workloads A, B and C) and two workload dataset sizes (one smaller than the server's memory and one larger). The testing also compared HBase deployed on different storage configurations. The transactions-per-second (TPS) and latency results for each of these tests were collected and analyzed. The results are summarized in the Test Validation and Results Summary section of this paper. (An illustrative YCSB workload file is sketched following the Test Environment description below.)

Test Environment

The test environment consisted of one Dell PowerEdge R720 server with two Intel Xeon CPUs (12 cores per CPU) and 96GB of DRAM, which served as the HBase server, and one Dell PowerEdge R720, which served as the client to the HBase server under test, running the YCSB workloads. For testing purposes, a 1GbE network interconnect was used between the server and the client. The local storage configuration on the HBase server varied between an all-HDD and an all-SSD configuration (more details are provided in the Test Methodology section of this paper). Two YCSB dataset sizes were used on the client: one smaller than the server's available memory and one larger.
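The workload files described in the YCSB Benchmark section above are plain-text properties files read by the YCSB client. The following is a minimal sketch of what a Workload A-style definition could look like; the record and operation counts shown are placeholders rather than the values used in this testing, and only the read/update proportions and the requestdistribution setting correspond to the workload descriptions in this paper:

    # Core workload class used by the standard YCSB workloads
    workload=com.yahoo.ycsb.workloads.CoreWorkload

    # Dataset and run size (placeholder values, not those used in this paper)
    recordcount=1000000
    operationcount=1000000

    # Workload A: 50/50 update/read ratio
    readproportion=0.5
    updateproportion=0.5
    scanproportion=0
    insertproportion=0

    # Access pattern: zipfian (skewed towards a hot subset) or uniform
    requestdistribution=zipfian

Changing the proportions to 0.95 read / 0.05 update approximates Workload B, and setting readproportion=1.0 with no updates approximates Workload C, per the workload definitions above.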

Technical Component Specifications

Hardware | Software (if applicable) | Purpose | Quantity
Dell Inc. PowerEdge R720 (two Intel Xeon CPU E5-2620 0 @ 2.00GHz, 96GB memory) | CentOS 5.10, 64-bit; Apache HBase 2.4.9 | HBase server | 1
Dell Inc. PowerEdge R720 (two Intel Xeon CPU E5-2620 0 @ 2.00GHz, 96GB memory) | CentOS 5.10, 64-bit; YCSB 0.1.4 | YCSB client | 1
Dell PowerConnect 2824 24-port switch (1GbE network switch) | - | Data network | 1
500GB 7.2K RPM Dell SATA HDDs, configured as a single RAID0 volume | - | HBase local file system | 6
480GB CloudSpeed Ascend SATA SSDs, configured as a single RAID0 volume | - | HBase local file system | 6

Table 1: Hardware components

Software | Version | Purpose
CentOS Linux | 5.10 | Operating system for server and client
Apache HBase | 2.4.9 | HBase database server
YCSB | 0.1.4 | Client workload benchmark

Table 2: Software components

Compute Infrastructure

The Apache HBase server is a Dell PowerEdge R720 with two Intel Xeon E5-2620 CPUs at 2.00GHz and 96GB of memory. The client's compute infrastructure is the same as the server's.

Network Infrastructure

The client and the HBase server are connected to a 1GbE network via their onboard 1GbE NICs, using a Dell PowerConnect 2824 24-port switch. This network was used as the primary means of communication between the client and the HBase server.

Storage Infrastructure

The Apache HBase server uses six 500GB 7.2K RPM Dell SATA HDDs in a RAID0 configuration for the all-HDD configuration. The HDDs are replaced by six 480GB CloudSpeed Ascend SATA SSDs for the all-SSD test configuration. The RAID0 volume is formatted with the XFS file system and mounted for use within HBase.

HBase Configuration

Most of the default HBase configuration parameters were used for testing purposes across all of the workloads and storage configurations under test. A few HBase configuration parameters were modified during the testing, and these are listed in the tables below.

Name of parameter | Default | Modified
hbase.hregion.memstore.chunkpool.maxsize | 0 | 0.5
hbase.hregion.max.filesize | 1073741824 | 53687091200
hbase.regionserver.handler.count | 10 | 150

Table 3: HBase configuration parameters for hbase-env.sh

Name of parameter | Default | Modified
hbase.hregion.memstore.chunkpool.maxsize | 0 | 0.5
hbase.hregion.max.filesize | 1073741824 | 53687091200
hbase.regionserver.handler.count | 10 | 150
hbase.block.cache.size | 0.2 | 0.4
hbase.hregion.memstore.mslab.enabled | False | True
hbase.hregion.memstore.mslab.maxallocation | 262144 | 262144

Table 4: HBase configuration parameters for hbase-site.xml

These parameters are specified in HBase's XML configuration files; an illustrative hbase-site.xml fragment is sketched after the storage configuration descriptions below.

Test Validation and Results Summary

Test Methodology

The primary goal of this technical paper is to showcase the benefits of using SSDs within an HBase environment. To achieve this goal, SanDisk tested three main YCSB workloads (Workloads A, B and C) against multiple test configurations. The distinct test configurations were set up by varying the underlying storage configuration for the standalone HBase server (using either an all-HDD or an all-SSD configuration) and the size of the dataset (both a smaller, in-memory dataset and a larger dataset were used in the testing). These storage configuration and dataset size variations are described below.

Storage Configuration

The storage configuration for the standalone HBase instance was varied between an all-HDD and an all-SSD configuration:

1. All-HDD configuration: The HBase server uses a single RAID0 volume with 6 HDDs for the single-instance HBase.
2. All-SSD configuration: In this configuration, the HDDs of the first configuration are replaced with SSDs, and a corresponding RAID0 volume is created for use with the HBase instance.
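For a standalone HBase instance such as the one tested here, the storage volume is exposed to HBase through the hbase.rootdir property in hbase-site.xml, alongside tuning parameters like those listed in Table 4. The fragment below is a sketch only: the mount point /data/hbase is a hypothetical path for the RAID0 volume, not a detail taken from this paper.

    <configuration>
      <!-- Point standalone HBase at the locally mounted, XFS-formatted RAID0
           volume (hypothetical mount point; the actual path is not specified). -->
      <property>
        <name>hbase.rootdir</name>
        <value>file:///data/hbase</value>
      </property>
      <!-- One of the parameters modified for this testing (see Table 4). -->
      <property>
        <name>hbase.regionserver.handler.count</name>
        <value>150</value>
      </property>
    </configuration>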

YCSB Workload Dataset Size

Two YCSB dataset sizes were used:

1. Smaller dataset size: This dataset is smaller than the available memory. As a result, all of the data can be held in memory, avoiding disk I/O as much as possible.
2. Larger dataset size: This dataset is larger than the available memory, and therefore requires more disk I/O operations than the smaller dataset size.

Update-Heavy Workload (Workload A)

YCSB Workload A is referred to as an update-heavy workload. It is a workload with a 50/50 ratio of updates and reads. The distribution of the update and read operations can be spread out uniformly across the entire dataset (a uniform distribution), or it can follow a zipfian distribution, in which case a portion of the dataset is accessed more frequently than the rest of the dataset. A zipfian distribution is named after Zipf's Law, an empirical law of mathematical statistics regarding the statistical distribution of data. Workload A models the behavior of applications like a session store, where recent user actions are recorded.

Throughput (TPS) Results

The throughput results for YCSB Workload A are summarized in Figure 2. The X axis on the graph shows the two dataset sizes. The Y axis shows the throughput in terms of transactions per second. For each dataset size, the results for the uniform and zipfian distributions are shown for the all-HDD and all-SSD configurations via different colored bars (see the legend in the figure for more details). These same results are summarized in Table 5.

Figure 2: Throughput comparison for Workload A (bar chart: transactions per second for the all-SSD and all-HDD configurations; Y axis from 0 to 30,000 TPS)

YCSB Workload Type | Dataset Size | TPS All-HDD | TPS All-SSD
Workload A (50r/50w) | Smaller | 3,263 | 21,681
Workload A (50r/50w) | Smaller | 5,705 | 27,561
Workload A (50r/50w) | Larger | 411 | 13,925
Workload A (50r/50w) | Larger | 748 | 23,769

Table 5: Throughput results summary for Workload A (for each dataset size, the two rows correspond to the uniform and zipfian distributions)

Latency Results

The read latency results for YCSB Workload A are summarized in Figure 3. The X axis on the graph shows the two dataset sizes, for both the uniform and zipfian distributions, and the Y axis shows the read latency in microseconds (µs). These same results are summarized in Table 6.

The write latency results for YCSB Workload A are not shown in the figures. HBase uses a write-ahead log, which writes any updates/edits to memory, and these updates are later flushed to disk when the flush interval is reached. Due to the use of memory for updates/edits, the write latencies are below 1 millisecond (ms) and are recorded as 0 in the results. As a result, the write latencies are not shown in the figures. They are, however, recorded in Table 6.

Figure 3: Latency comparisons for Workload A (bar chart: read latency in µs for the all-SSD and all-HDD configurations; Y axis from 0 to 1,200)

YCSB Workload Type | Storage Configuration | Smaller dataset: Read / Write (µs) | Larger dataset: Read / Write (µs)
Workload A (50r/50w) | HDD | 505 / 0 | >1,000 / 0
Workload A (50r/50w) | SSD | 36 / 0 | 38 / 0
Workload A (50r/50w) | HDD | 278 / 9 | >1,000 / 0
Workload A (50r/50w) | SSD | 33 / 0 | 30 / 0

Table 6: Latency results summary for Workload A (for each storage configuration, the two rows correspond to the uniform and zipfian distributions)

Read-Intensive Workload (Workload B)

YCSB Workloads B and C are read-intensive workloads. Workload B generates 95% read operations, with just 5% update operations. Workload C is a 100% read workload. Again, for both of these workloads, the frequency of access across the dataset can be uniform, or a zipfian distribution can be used, in which part of the data is accessed more frequently than the rest. These workloads simulate applications like photo tagging, where most operations involve reading tags (Workload B), and user-profile caches, where 100% of the workload is reading from the cache (Workload C). These types of workloads are increasingly common today, especially for Web-enabled cloud-computing workloads and social-media workloads.

Throughput Results

The throughput results for YCSB Workload B are summarized in Figure 4. The X axis on the graph shows the two dataset sizes. The Y axis shows the throughput in terms of transactions per second. For each dataset size, the results for the uniform and zipfian distributions are shown for the all-HDD and all-SSD configurations via different colored bars. These same results are summarized in Table 7.

Figure 4: Throughput comparison for Workload B (bar chart: transactions per second for the all-SSD and all-HDD configurations; Y axis from 0 to 25,000 TPS)

YCSB Workload Type | Dataset Size | TPS All-HDD | TPS All-SSD
Workload B (95r/5w) | Smaller | 2,186 | 16,051
Workload B (95r/5w) | Smaller | 3,870 | 20,169
Workload B (95r/5w) | Larger | 219 | 8,193
Workload B (95r/5w) | Larger | 410 | 14,943

Table 7: Throughput results summary for Workload B (for each dataset size, the two rows correspond to the uniform and zipfian distributions)

Latency Results

The read latency results for YCSB Workload B are summarized in Figure 5. The X axis on the graph shows the two dataset sizes, for both the uniform and zipfian distributions, and the Y axis shows the read latency in microseconds (µs). These same results are summarized in Table 8.

The write latency results for YCSB Workload B are not shown in the figures. Since this is a read-heavy workload, writes make up only a very small proportion of the operations, so write latencies are not significant for this workload. For the small proportion of writes, HBase writes any updates/edits to memory, with sub-millisecond latencies that are recorded by YCSB as 0.

Figure 5: Latency comparisons for Workload B (bar chart: read latency in µs for the all-SSD and all-HDD configurations; Y axis from 0 to 1,200)

YCSB Workload Type | Storage Configuration | Smaller dataset: Read / Write (µs) | Larger dataset: Read / Write (µs)
Workload B (95r/5w) | HDD | 427 / 0 | >1,000 / 0
Workload B (95r/5w) | SSD | 29 / 0 | 37 / 0
Workload B (95r/5w) | HDD | 76 / 0 | >1,000 / 0
Workload B (95r/5w) | SSD | 23 / 0 | 31 / 0

Table 8: Latency results summary for Workload B (for each storage configuration, the two rows correspond to the uniform and zipfian distributions)

Read-Only Workload (Workload C)

Throughput Results

The throughput results for YCSB Workload C are summarized in Figure 6. The X axis on the graph shows the two dataset sizes. The Y axis shows the throughput in terms of transactions per second. For each dataset size, the results for the uniform and zipfian distributions are shown for the all-HDD and all-SSD configurations via different colored bars (see the legend in the figure for more details). These same results are summarized in Table 9.

Figure 6: Throughput comparison for Workload C (bar chart: transactions per second for the all-SSD and all-HDD configurations; Y axis from 0 to 25,000 TPS)

YCSB Workload Type | Dataset Size | TPS All-HDD | TPS All-SSD
Workload C (100r/0w) | Smaller | 1,865 | 15,334
Workload C (100r/0w) | Smaller | 3,582 | 19,756
Workload C (100r/0w) | Larger | 209 | 7,799
Workload C (100r/0w) | Larger | 373 | 14,214

Table 9: Throughput results summary for Workload C (for each dataset size, the two rows correspond to the uniform and zipfian distributions)

Latency Results

The read latency results for YCSB Workload C are summarized in Figure 7. The X axis on the graph shows the two dataset sizes, for both the uniform and zipfian distributions, and the Y axis shows the read latency in microseconds (µs). These same results are summarized in Table 10.

The write latency results for YCSB Workload C are not shown in the figures. Since this is a read-only workload, there are no write operations in Workload C, and write latencies are therefore not applicable.

Figure 7: Latency comparisons for Workload C (bar chart: read latency in µs for the all-SSD and all-HDD configurations; Y axis from 0 to 1,200)

YCSB Workload Type | Storage Configuration | Smaller dataset: Read / Write (µs) | Larger dataset: Read / Write (µs)
Workload C (100r/0w) | HDD | 463 / 0 | >1,000 / 0
Workload C (100r/0w) | SSD | 28 / 0 | 37 / 0
Workload C (100r/0w) | HDD | 124 / 0 | >1,000 / 0
Workload C (100r/0w) | SSD | 22 / 0 | 31 / 0

Table 10: Latency results summary for Workload C (for each storage configuration, the two rows correspond to the uniform and zipfian distributions)

Conclusion

The key takeaways from the YCSB tests for Apache HBase, comparing all-HDD and all-SSD storage configurations, are as follows:

1. SanDisk SSDs show a dramatic increase (30x-40x) in the number of transactions per second for update-heavy and read-heavy workloads, for different types of data access, whether uniform across the dataset or targeted to only a portion of the dataset; the largest gains were seen on the dataset that exceeds the server's memory. This translates to a higher rate of query processing in a variety of different data-analytics applications.
2. SanDisk SSDs also benefit HBase workloads with significantly lower latencies than mechanically driven HDDs. Lower latencies allow quicker query-response times, thereby increasing the productivity and efficiency of analytics operations.
3. Following is a summation of the key takeaways for this SanDisk test of the server/storage platform supporting the Apache HBase workload:

30x-40x more transactions per second with SanDisk SSDs: SanDisk SSDs within the HBase environment provide dramatic improvements (30x-40x) in transactions per second when compared to HDDs. Improved performance was seen for update-heavy workloads as well as read-heavy workloads, for both the uniform and zipfian distributions.

Significantly lower latencies (<50 µs) with SanDisk SSDs: SanDisk SSDs consistently provide read latencies of less than 50 µs, whereas with HDDs the latency is consistently greater than 1,000 µs (for the larger datasets). Lower latencies translate to much quicker response times for queries.

Summary

The higher transactions per second and lower latencies seen with CloudSpeed SSDs directly translate to faster query processing, thus reducing the time-to-results. In an organization's IT department, this results in significant improvements in business-process efficiency and therefore in cost savings.

The tests described in this paper clearly demonstrate the benefits of deploying SanDisk's CloudSpeed SATA solid-state drives (SSDs) within HBase environments. Using CloudSpeed drives will help deploy a cost-effective and process-efficient data-analytics environment by providing increased throughput and reduced latency for data-analytics queries.

It is important to understand that workloads vary significantly in terms of their I/O access patterns. Therefore, not all workloads may benefit equally from the use of SSDs for an HBase deployment. For the most effective deployments, customers should develop a proof of concept for their deployment and workloads, so that they can evaluate the benefits that SSDs can provide to their organization.

References

- SanDisk website: www.sandisk.com
- Apache Hadoop Distributed File System: http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
- Apache HBase: http://hbase.apache.org
- Hortonworks HBase: http://hortonworks.com/hadoop/hbase/
- Yahoo! Labs, Yahoo! Cloud Serving Benchmark: http://research.yahoo.com/web_information_management/ycsb/
- HBase storage architecture: http://files.meetup.com/1228907/nyc%20hadoop%20meetup%20-%20Introduction%20to%20HBase.pdf
- HBase performance (writes): http://hbase.apache.org/book/perf.writing.html

Specifications are subject to change.
© 2014-2016 Western Digital Corporation or its affiliates. All rights reserved. SanDisk and the SanDisk logo are trademarks of Western Digital Corporation or its affiliates, registered in the U.S. and other countries. CloudSpeed and CloudSpeed Ascend are trademarks of Western Digital Corporation or its affiliates. Other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective holder(s). 20160623
Western Digital Technologies, Inc. is the seller of record and licensee in the Americas of SanDisk products.