Micron and Hortonworks Power Advanced Big Data Solutions

Similar documents
NVMe SSDs Future-proof Apache Cassandra

NVMe SSDs A New Benchmark for OLTP Performance

Three Paths to Better Business Decisions

Micron Quad-Level Cell Technology Brings Affordable Solid State Storage to More Applications

BUILD BETTER MICROSOFT SQL SERVER SOLUTIONS Sales Conversation Card

Upgrade to Microsoft SQL Server 2016 with Dell EMC Infrastructure

Huge Benefits With All-Flash and vsan 6.5

Lenovo Database Configuration for Microsoft SQL Server TB

SUPERMICRO, VEXATA AND INTEL ENABLING NEW LEVELS PERFORMANCE AND EFFICIENCY FOR REAL-TIME DATA ANALYTICS FOR SQL DATA WAREHOUSE DEPLOYMENTS

ADVANCED IN-MEMORY COMPUTING USING SUPERMICRO MEMX SOLUTION

Flash Storage Complementing a Data Lake for Real-Time Insight

Apache Spark Graph Performance with Memory1. February Page 1 of 13

LATEST INTEL TECHNOLOGIES POWER NEW PERFORMANCE LEVELS ON VMWARE VSAN

Doubling Performance in Amazon Web Services Cloud Using InfoScale Enterprise

INTEL NEXT GENERATION TECHNOLOGY - POWERING NEW PERFORMANCE LEVELS

Accelerating Enterprise Search with Fusion iomemory PCIe Application Accelerators

Selecting Server Hardware for FlashGrid Deployment

Lenovo Database Configuration

Lenovo Database Configuration

Evaluation Report: HP StoreFabric SN1000E 16Gb Fibre Channel HBA

Optimizing Apache Spark with Memory1. July Page 1 of 14

stec Host Cache Solution

Consolidating OLTP Workloads on Dell PowerEdge R th generation Servers

BIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE

Deep Learning Performance and Cost Evaluation

HPE ProLiant DL360 Gen P 16GB-R P408i-a 8SFF 500W PS Performance Server (P06453-B21)

Supermicro All-Flash NVMe Solution for Ceph Storage Cluster

TECHNICAL OVERVIEW ACCELERATED COMPUTING AND THE DEMOCRATIZATION OF SUPERCOMPUTING

Implementing SQL Server 2016 with Microsoft Storage Spaces Direct on Dell EMC PowerEdge R730xd

Deep Learning Performance and Cost Evaluation

Ceph in a Flash. Micron s Adventures in All-Flash Ceph Storage. Ryan Meredith & Brad Spiers, Micron Principal Solutions Engineer and Architect

Dell PowerEdge R730xd Servers with Samsung SM1715 NVMe Drives Powers the Aerospike Fraud Prevention Benchmark

Accelerating Hadoop Applications with the MapR Distribution Using Flash Storage and High-Speed Ethernet

Choosing the Best Network Interface Card for Cloud Mellanox ConnectX -3 Pro EN vs. Intel XL710

Hortonworks Data Platform Reference Architecture

IBM Emulex 16Gb Fibre Channel HBA Evaluation

IBM Power AC922 Server

Table 1 The Elastic Stack use cases Use case Industry or vertical market Operational log analytics: Gain real-time operational insight, reduce Mean Ti

Expand In-Memory Capacity at a Fraction of the Cost of DRAM: AMD EPYCTM and Ultrastar

Innovative Edge Storage Solutions for the Video Surveillance Industry

Technical Note. Improving Random Read Performance Using Micron's SNAP READ Operation. Introduction. TN-2993: SNAP READ Operation.

SanDisk Enterprise Solid State Drives: Providing Cost-Effective Performance Increases for Microsoft SQL Server Database Deployments

Cisco and Cloudera Deliver WorldClass Solutions for Powering the Enterprise Data Hub alerts, etc. Organizations need the right technology and infrastr

SOLUTION BRIEF. QxStack vsan ReadyNode -Solution Brief for MSSQL

Automated Netezza Migration to Big Data Open Source

All-Flash Storage Solution for SAP HANA:

Extending the Benefits of GDDR Beyond Graphics

All-NVMe Performance Deep Dive Into Ceph + Sneak Preview of QLC + NVMe Ceph

ACCELERATE YOUR ANALYTICS GAME WITH ORACLE SOLUTIONS ON PURE STORAGE

Accelerating Microsoft SQL Server 2016 Performance With Dell EMC PowerEdge R740

Oracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data

Accelerate Big Data Insights

By John Kim, Chair SNIA Ethernet Storage Forum. Several technology changes are collectively driving the need for faster networking speeds.

How Architecture Design Can Lower Hyperconverged Infrastructure (HCI) Total Cost of Ownership (TCO)

IBM Power Advanced Compute (AC) AC922 Server

STAR-CCM+ Performance Benchmark and Profiling. July 2014

Reduce Costs & Increase Oracle Database OLTP Workload Service Levels:

New Approach to Unstructured Data

DataON and Intel Select Hyper-Converged Infrastructure (HCI) Maximizes IOPS Performance for Windows Server Software-Defined Storage

TPC-E testing of Microsoft SQL Server 2016 on Dell EMC PowerEdge R830 Server and Dell EMC SC9000 Storage

SPC BENCHMARK 2 FULL DISCLOSURE REPORT VEXATA INC. VX100-F SCALABLE NVME FLASH ARRAY SPC-2 V1.7.0 SUBMITTED FOR REVIEW: AUGUST 29, 2018

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic

Consolidating Microsoft SQL Server databases on PowerEdge R930 server

vsan 6.6 Performance Improvements First Published On: Last Updated On:

QLIKVIEW SCALABILITY BENCHMARK WHITE PAPER

QLE10000 Series Adapter Provides Application Benefits Through I/O Caching

VEXATA FOR ORACLE. Digital Business Demands Performance and Scale. Solution Brief

Four-Socket Server Consolidation Using SQL Server 2008

EMC XTREMCACHE ACCELERATES VIRTUALIZED ORACLE

Teradata Aster Database Platform/OS Support Matrix, version AD

Accelerating Digital Transformation with InterSystems IRIS and vsan

TECHNICAL OVERVIEW ACCELERATED COMPUTING AND THE DEMOCRATIZATION OF SUPERCOMPUTING

Innovations in Non-Volatile Memory 3D NAND and its Implications May 2016 Rob Peglar, VP Advanced Storage,

A New Metric for Analyzing Storage System Performance Under Varied Workloads

Open storage architecture for private Oracle database clouds

Analyze Big Data Faster and Store It Cheaper

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

HP ProLiant DL380 Gen8 and HP PCle LE Workload Accelerator 28TB/45TB Data Warehouse Fast Track Reference Architecture

VMware vsan Ready Nodes

EMC VMAX 400K SPC-2 Proven Performance. Silverton Consulting, Inc. StorInt Briefing

Accelerate Applications Using EqualLogic Arrays with directcache

Monitoring Ready/Busy Using the READ STATUS (70h) Command

Hyper-converged storage for Oracle RAC based on NVMe SSDs and standard x86 servers

SNAP Performance Benchmark and Profiling. April 2014

Fast and Easy Persistent Storage for Docker* Containers with Storidge and Intel

Low Latency Evaluation of Fibre Channel, iscsi and SAS Host Interfaces

HP SAS benchmark performance tests

... IBM Advanced Technical Skills IBM Oracle International Competency Center September 2013

HP ProLiant BladeSystem Gen9 vs Gen8 and G7 Server Blades on Data Warehouse Workloads

Microsoft Office SharePoint Server 2007

2 to 4 Intel Xeon Processor E v3 Family CPUs. Up to 12 SFF Disk Drives for Appliance Model. Up to 6 TB of Main Memory (with GB LRDIMMs)

Storwize/IBM Technical Validation Report Performance Verification

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

TPC Benchmark H Full Disclosure Report. SPARC T4-4 Server Using Oracle Database 11g Release 2 Enterprise Edition with Partitioning

Certification Document macle GmbH IBM System xx3650 M4 03/06/2014. macle GmbH IBM System x3650 M4

HPE ProLiant ML350 Gen P 16GB-R E208i-a 8SFF 1x800W RPS Solution Server (P04674-S01)

Intel Solid State Drive Data Center Family for PCIe* in Baidu s Data Center Environment

Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads

Executive Brief June 2014

Optimizing the Data Center with an End to End Solutions Approach

Transcription:

Micron and Hortonworks Power Advanced Big Data Solutions Flash Energizes Your Analytics Overview Competitive businesses rely on the big data analytics provided by platforms like open-source Apache Hadoop software to get the answers they need faster. The volume, velocity and variety of big data demand a lot of storage. For years, legacy HDDs have been the primary storage technology for big data solutions, with price taking precedence over performance (this has been reinforced by the predominance of batch-processed analytics). However, companies are now seeing the value of real-time analytics and faster time to insights. Historically, one of the major challenges for data scientists has been providing CPUs or GPUs with data fast enough to fully utilize these expensive resources, thereby reducing idle time. CPU/GPU idle time is inefficient and prevents us from attaining real-time results. This is where judicious, cost-effective additions of flash SSDs to existing big-data deployments can help. Fast Facts Adding NVMe SSDs to an all-hdd data platform resulted in: TCP-DS benchmarks completing 1.7X faster Complex queries experiencing up to 2.5X better response times Large clusters experiencing reductions in node count requirements of up to 40% resulting in: Reduced power consumption Reduced cooling Improved reliability Reduced data center space Micron s leadership and expertise in developing enterprise SSDs enable us to harness the benefits of advanced memory and flash storage to help reduce the time to actionable answers. Attaining the benefits of real-time analytics requires faster storage, which can be provided by our NVMe SSDs. In this technical brief, we will discuss how we added an NVMe SSD to an all-hdd Hortonworks Data Platform (HDP) a leading Hadoop platform implementation for managing large data repositories as a cost-efficient way to achieve results faster. Micron 9200 MAX SSD With NVMe Faster Time to Insights Adding a single NVMe SSD to an existing Hadoop cluster can improve time to insights by more than 2X depending on the types of queries. 1

Background We used a standard benchmark based on the Transaction Processing Council s TPC-DS query set to show that introducing SSDs into existing big data clusters that currently leverage typical high-performance 15K RPM SAS HDDs brings real value. This benchmark simulates a typical data analytics workload. Our goal was to determine if CPU wait times could be reduced by adding a flash SSD as a YARN cache to the Hortonworks cluster. This method of adding flash to an existing cluster gives quick access to the benefits of faster storage without incurring the higher costs of an all-flash replacement. Test Configuration For this analysis, we used an early beta version of HDP 3.0. The Hadoop cluster software consisted of a Hortonworks HDP 3.0 Hive database on HDFS/YARN deployed on two separate four-node clusters configured as shown in Table 1. The two clusters differed only in that one cluster used a group of 15K SAS HDDs and the second cluster used the same HDD configuration plus a single Micron 9200 MAX SSD with NVMe added to each node as a YARN cache. To ensure true measurement of the storage I/O, the database size-to-memory ratio was targeted at about 2-to-1 (2TB of data with an aggregate cluster memory of 822GB available after operating system overhead). Cluster 1 Node Configuration Cluster 2 Node Configuration Software Operating System Red Hat Enterprise Server 7.5 (7.5.1804) Hortonworks HDP HDP 3.0.0 (beta) Hive 3.0.0 HDFS/YARN 3.1.0 Hardware Server CPU Memory Network Data Storage Supermicro SYS- 2028U-TNRT+ 2x Intel E5-2690v4 at 2.6 GHz 256GB (8x 32GB) Micron DDR4-2666 2x Broadcom BCM957304 dual-port NIC 15x 300GB 15K SAS HDDs YARN Cache Storage N/A 1x 3.2TB Micron 9200 MAX SSD with NVMe Table 1: Test Cluster Configurations The test performed 94 of the 99 queries used within the TPC-DS benchmark and measured the total completion time for each query on both clusters as well as the individual query completion times. Of the 99 benchmark queries queued, 94 completed with enough confidence to publish the results. Each query was processed three times and a mean result for each query was calculated for this brief. 2

The Results 1 Our testing showed that the introduction of a single NVMe SSD to the all-hdd cluster resulted in an average overall 1.7X improvement in benchmark completion times as shown in the figure below. HDDs + NVMe Cluster HDD-Only Cluster Figure 1: Completion Time Is Faster With NVMe SSD The primary contributor to this improvement is the near elimination of CPU wait time. Error! Reference source not found. and 3 show system CPU I/O wait time for each cluster. As you can see in Figure 3, the SSDenhanced cluster CPU wait time is nearly zero, meaning it was not a limiting factor to cluster performance. Figure 2: HDD-Only Cluster CPU Wait Time Figure 3: HDDs + NVMe Cluster CPU Wait Time 1. All test results quoted in this brief are based on benchmark tests as described and are for illustrative purposes only. Your actual experience may differ from the results discussed in this brief. 3

Individual queries for the HDD + NVMe cluster can experience a more than 2X performance improvement over the HDD-only cluster. The majority (55%) of queries resulted in improvement between 1.25X and 1.75X. The queries that showed more than 2X performance improvement are shown in Figure 4. SSD performance is affected by many factors such as query complexity, distribution of data across nodes and read percentage; therefore, actual improvement of real-world workloads may differ. NVMe Advantage in Complex DSS Queries 120% 100% 2.2X 2.3X 2.1X 2.1X 2.6X 2.0X 3X 80% 2X 60% 40% 1X 20% 0% Query 5 Query 12 Query 44 Query 50 Query 67 Query 75 0X 15K HDD NVMe Cached Figure 4: DSS Queries With Largest Improvements Using SSDs This improvement can directly affect real-world big data solutions. As shown in the figure below, it was determined that an existing 200-node big data cluster used for manufacturing efficiency analytics 2 would require an additional 80 HDD-based nodes to reach the same performance as adding an NVMe SSD to each existing 200 nodes. 3 Figure 5: Adding NVMe SSD Cache Is Equivalent to Adding 80 HDD-Based Nodes 2. See Dataworks Summit How to use flash drives with Apache Hadoop 3.x: Real world use cases and proof points presentation for details. 3. Based on real-world analysis of existing production Hadoop cluster using Apache Hive for data management. Your individual results may vary from those mentioned here. 4

What could this mean in real dollars? Using the advertised prices of the server configuration described earlier in this brief, 80 nodes would cost approximately $1.8 million. Adding a 3.2TB Micron 9200 MAX SSD to each sever in the existing 200-node cluster would cost only $372,000. 4 This is an 80% potential cost reduction. 200-Node Hadoop Cluster Upgrade Additional Cost for Equivalent Performance Add 80 Additional Servers $1,800,000 Add 200 NVMe Drives to Existing Servers $372,000 80% Lower Cost The Bottom Line Figure 6: Massive Cost Advantage With NVMe SSDs Adding SSDs can be a cost-effective way to scale existing Hadoop analytics deployments like Hortonworks HDP. Consider adding SSDs if you want faster time to insights than the current all-hdd Hadoop solution can provide. Choosing to add additional servers to your compute cluster can be cost-prohibitive. There are better ways to improve results while staying within budget. For more information about Micron s SSDs and how they can potentially improve the performance of your existing open-source solutions like Hadoop, visit micron.com. 4. Based on advertised server cost for a server configured as describe in table for Cluster 1, all-hdd cluster versus advertised cost of 3.2TB Micron 9200 MAX SSD as of the date of publishing of this brief. This technical brief is published by Micron and has not been authorized, sponsored, or otherwise approved by Hortonworks. Products are warranted only to meet Micron s production data sheet specifications. Products, programs and specifications are subject to change without notice. Dates are estimates only. 2018 Micron Technology, Inc. All rights reserved. Performance metrics may vary depending on systems, configurations and workloads. All information herein is provided on an "AS IS" basis without warranties of any kind. Micron and the Micron logo are trademarks of Micron Technology, Inc. All other trademarks are the property of their respective owners. Rev. A 10/18 CCM004-676576390-11187 5