
Optimizing Apache Spark with Memory1
July 2016

Abstract

The prevalence of Big Data is driving increasing demand for real-time analysis and insight. Big data processing platforms like Apache Spark rely on application memory to deliver the required performance. Unfortunately, because of that heavy reliance on memory, the potential of these platforms is constrained by the cost and capacity limitations of DRAM. This paper demonstrates that by leveraging Diablo Technologies Memory1 to maximize the available memory, Inspur servers can do more work (a 75% efficiency improvement), unleashing the full potential of real-time big data processing.

Introduction: Bigger, Faster Data

Today, data is being generated at unprecedented rates and from a growing variety of sources. As a result, the term "big data" has become ingrained in our standard lexicon. But just how big is Big Data? To put its speed and size into perspective, it has been estimated that over 90% of the world's data was generated in the past two years. Current estimates put our daily data generation rate at 2.5 exabytes [1]. That's two and a half billion gigabytes of new data, every single day.

Accordingly, modern businesses face both a unique opportunity and a significant challenge. The opportunity lies in transforming Big Data into actionable knowledge. The adage "knowledge is power" has never been more true than in today's data-saturated society. The ability to effectively leverage data impacts our lives in numerous ways, from stopping disease (by analyzing medical statistics to identify key patterns and indicators) to finding the perfect restaurant, car, outfit, or mate (by turning data-driven insights into targeted, personalized recommendations).

However, performing these data-to-knowledge transformations is no easy task. In many areas, our ability to analyze data has lagged behind the increasing size and speed of the data itself.
The sheer density and velocity of the incoming information, coupled with the need for accurate, real-time analysis, can make it very difficult to process and manage.

[1] http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html

Apache Spark and In-Memory Computing

To handle increasing data rates and demanding user expectations, big data processing platforms like Apache Spark have emerged and quickly gained popularity. Spark provides a general-purpose clustered computing framework that can rapidly ingest and process real-time streams of data, enabling instantaneous analytics and decision-making.

How Spark Works

Creation, transformation, and manipulation of in-memory data is significantly faster than alternative, storage-centric approaches. Consequently, to provide its uniquely high performance, Apache Spark relies on in-memory data management. In Spark, the main data abstraction is the Resilient Distributed Dataset (RDD). RDDs are simply collections of data that can be partitioned across the nodes of a cluster and operated upon in parallel. These RDD-based operations are central to Spark functionality and performance. Keeping Spark RDDs in memory (as opposed to on disk) keeps data closest to the CPUs, enabling the fastest access and the most optimized system-level performance.

Spark's Memory Capacity Problem

As one might expect, Spark operations are extremely dependent on memory capacity. To facilitate rapid data retrieval, objects in Spark need to be quickly created, cached, sorted, grouped, and/or joined, creating a need for massive amounts of application memory. Unfortunately, the memory available in a single server is insufficient to handle most Spark jobs, largely due to the cost and capacity constraints imposed by DRAM. Those constraints are among the key reasons that Spark deployments are often distributed across many servers. Deploying many servers enables system designers to create memory footprints much larger than a single server could provide. Though its distributed architecture enables larger pools of memory, there are still issues that subvert the full potential of Spark's disaggregated approach.
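The partition-and-parallelize idea behind RDDs can be illustrated with a small pure-Python sketch. This is not Spark code; the names `partition` and `parallel_map` are ours, and a real RDD adds lineage tracking, fault tolerance, and distribution across machines.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, num_partitions):
    """Split a dataset into roughly equal chunks, as Spark partitions an RDD."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def parallel_map(func, partitions):
    """Apply a transformation to every in-memory partition concurrently."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(lambda p: [func(x) for x in p], partitions))

# Keep the "RDD" in memory and operate on its partitions in parallel.
rdd = partition(list(range(10)), num_partitions=4)
squared = parallel_map(lambda x: x * x, rdd)
flattened = [x for part in squared for x in part]
```

Because every partition stays resident in memory, each worker reads its data at memory speed; the same operation against on-disk partitions would be bounded by storage latency instead.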
Key issues include:

- Cost of DRAM: The high cost of DRAM deters system designers from providing the larger, single-server memory footprints that would minimize cluster sizes and fully optimize performance.

- Cost of additional nodes: Creating large clusters adds expense due to the cost of the servers, the associated networking, and increased operational expenses (e.g., added power consumption).

- Networking overhead: Too many network hops can bottleneck Spark performance. Splitting Spark jobs across a large cluster requires data transfer and coordination between cluster nodes. When clusters grow too large, this overhead can negatively impact performance.

In this paper, we demonstrate how expanded application memory capacity, enabled by Memory1 from Diablo Technologies, addresses key issues faced by Spark in traditional, DRAM-only deployments.

Introducing Memory1

What Is Memory1?

Diablo Technologies Memory1 is the first memory DIMM to expose NAND flash as standard application memory. This revolutionary solution provides the industry's most economical and highest-capacity byte-addressable memory modules. Memory1 provides up to 4X more memory capacity than other DIMMs, enabling dramatic increases in application memory per server. This yields significant performance advantages due to increased data locality and reduced access times. Memory1 also minimizes Total Cost of Ownership (TCO) by reducing the number of servers required to support memory-constrained applications such as Apache Spark.

Memory1 DIMMs interface seamlessly with existing hardware and software. They are JEDEC-compatible DDR4 DIMMs and are deployed into standard DDR4 DIMM slots. Processors, motherboards, operating systems, and applications do not need to change; target applications simply leverage the expanded memory capacity as they see fit.

Figure 1: 128GB Memory1 DIMM

By leveraging flash's massive cost, power, and capacity advantages over DRAM, Memory1 DIMMs drastically change the economics of server memory, allowing applications to use huge pools of local memory that were previously infeasible to provide.
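The arithmetic behind the "up to 4X" capacity claim is straightforward. The sketch below assumes a typical 16-slot server and compares fully populating it with 32GB DRAM DIMMs versus 128GB Memory1 DIMMs:

```python
DIMM_SLOTS = 16

dram_only_gb = DIMM_SLOTS * 32   # 32GB DRAM modules per slot
memory1_gb = DIMM_SLOTS * 128    # 128GB Memory1 modules per slot

# Same slots, 4X the per-server application memory.
capacity_multiple = memory1_gb / dram_only_gb
```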

Solving Spark with Memory1

Benchmarking SORT Performance

Because most big data applications must perform a SORT on the entire dataset, SORT performance is crucial. The speed at which data can be organized into easily searchable configurations is often Spark's primary performance bottleneck. The more rapidly SORT operations can be performed on large datasets, the more quickly data can be ingested, retrieved, manipulated, and analyzed. Therefore, to simulate the critical demands of a typical Spark workload, we used the industry-standard spark-perf benchmark to perform SORT operations on a typically-sized 500GB dataset. For comparison purposes, the SORT jobs were performed on both DRAM-only and Memory1 hardware configurations.

Hardware Setup

To efficiently sort datasets of a given magnitude, Spark requires significantly more memory than the dataset size. This is due to the additional memory needed for RDD creation and object manipulation, and to support Spark management processes. Clustered Spark servers also require additional memory capacity to absorb the overhead created by intra-cluster coordination and synchronization activities.

To facilitate the effective sorting of a 500GB total dataset, the DRAM-only configuration was sized to provide 1.5TB of application memory, using a cluster of three 2-socket servers, each with 512GB of DRAM. This 3-to-1 ratio of available memory to dataset size enables a 500GB SORT job to complete in an acceptable timeframe (i.e., under 1 hour). Each cluster node represents a typical 2-socket, 16-DIMM-slot server, fully populated with 32GB DRAM modules (16 DIMM slots * 32GB = 512GB).

To demonstrate the improved efficiency and economics provided by Memory1, the Memory1 setup included only a single Inspur NF5180M4 2-socket server, populated with eight 16GB DRAM DIMMs and eight 128GB Memory1 DIMMs, providing a total of 1TB of application memory.
Note that the single-server Memory1 configuration incurs no additional overhead from intra-cluster coordination and synchronization. Therefore, in this case, we expected a 2-to-1 ratio of available memory to dataset size to provide acceptable performance when sorting a 500GB dataset. The CPU and memory configurations for both setups are summarized in Table 1 below.
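The sizing rule of thumb used here (1.5TB of memory for a 500GB dataset on a cluster, 1TB on a single coordination-free server) can be written as a small calculation. The function name and the bare ratios are ours, distilled from the configurations above:

```python
def required_memory_gb(dataset_gb, clustered):
    """Estimate application memory needed for a Spark SORT of a given dataset.

    Rule of thumb from the test configurations: a cluster needs ~3x the
    dataset size (extra headroom for intra-cluster coordination), while a
    single server needs only ~2x.
    """
    ratio = 3 if clustered else 2
    return dataset_gb * ratio

dram_cluster_gb = required_memory_gb(500, clustered=True)     # 1500 GB = 1.5 TB
memory1_server_gb = required_memory_gb(500, clustered=False)  # 1000 GB = 1 TB
```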

Table 1: Server configuration details

Testing With DRAM-Only

DRAM-only: Test Setup

To simulate a typical customer configuration in today's DRAM-only deployments, our 3-server Spark cluster was deployed using the aforementioned setup and presented with a 500GB dataset generated by the spark-perf benchmark.

DRAM-only: Results

Using the DRAM-only cluster, sorting the 500GB dataset took 27.5 minutes, as shown in Figure 2.

Figure 2: DRAM-only SORT time for 500GB dataset

DRAM-only: Total Cost of Ownership (TCO)

The total CAPEX for the 3-server DRAM-only setup (based on typical server and memory costs) was $47,400. This represents the cost of servers, processors, memory, and other associated hardware. Operational costs are also important to consider, so we calculated a simple OPEX based purely on the electrical costs associated with the deployment. For the DRAM-only cluster, the 3-year OPEX totaled nearly $3,500, as shown in Table 2 below. Note that in a real-world deployment, OPEX would be even higher once additional expenses for server management, physical space, cooling, and so on are considered.

Table 2: OPEX for DRAM-only configuration (1.5TB total application memory)

Adding the $47,400 CAPEX and the $3,434 3-year OPEX yields a 3-year TCO of $50,834. In summary, a 3-server cluster based solely on DRAM can sort 500GB of Spark data in 27.5 minutes at a 3-year cost of $50,834. To complete the analysis, we also calculated several efficiency metrics, as shown in Table 3 below.
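The DRAM-only TCO figure is simple arithmetic over the reported CAPEX and 3-year electrical OPEX:

```python
capex_usd = 47_400    # three 2-socket servers with 512GB DRAM each
opex_3yr_usd = 3_434  # electrical costs over three years (Table 2)

tco_3yr_usd = capex_usd + opex_3yr_usd  # $50,834
```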

Table 3: Efficiency metrics for DRAM-only configuration (1.5TB total application memory)

Testing With Memory1

Memory1: Test Setup

To test Memory1, we again sorted a 500GB dataset generated by the spark-perf benchmark. This time, however, the entire SORT job was handled within the expanded memory of a single Memory1 server.

Memory1: Results

Using the Inspur NF5180M4 with Memory1, sorting the 500GB dataset took just 19.5 minutes. This represents more than a 29% reduction in SORT time versus the DRAM-only configuration.

Figure 3: DRAM-only and Memory1 SORT times

Memory1: Total Cost of Ownership (TCO)

The total CAPEX for this setup (based on typical server and memory costs) was $16,496, including the server, processors, eight DRAM DIMMs, and eight Memory1 DIMMs. Again, operational costs are critical, so we calculated OPEX based on the electrical costs associated with the Memory1 deployment. For the Memory1 server, the 3-year OPEX totaled just $1,144, as shown in Table 5 below.

Table 5: OPEX for Memory1 configuration (1TB total application memory)

Adding the $16,496 CAPEX and the $1,144 3-year OPEX yields a 3-year TCO of $17,640. In summary, a single 1-terabyte Memory1 server can sort 500GB of Spark data in 19.5 minutes at a 3-year cost of $17,640. To complete the picture, we also calculated several efficiency metrics, as shown in Table 6 below.

Table 6: Efficiency metrics for Memory1 configuration (1TB total application memory)

Comparing DRAM-Only vs. Memory1

As shown in Table 7 below, a side-by-side comparison of performance, cost, and efficiency is very telling. Compared to the 3-server DRAM-only cluster, the single Memory1 server sorted the data faster and at a significantly reduced TCO. The improvement in both cost and power efficiency is dramatic and demonstrates a clear advantage over the DRAM-only configuration.

Table 7: TCO comparison between the DRAM-only and Memory1 configurations
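The headline savings follow directly from the two measured TCO and SORT-time figures; the sketch below reproduces the roughly 65% TCO reduction and the 29% SORT-time reduction:

```python
dram_tco_usd, memory1_tco_usd = 50_834, 17_640  # 3-year TCO per configuration
dram_sort_min, memory1_sort_min = 27.5, 19.5    # 500GB SORT times, minutes

tco_reduction = 1 - memory1_tco_usd / dram_tco_usd            # ~0.65, i.e. 65%
sort_time_reduction = 1 - memory1_sort_min / dram_sort_min    # ~0.29, i.e. 29%
```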

As evidenced by the test results, Memory1-enabled servers provide compelling advantages in all facets of an Apache Spark deployment. Solution cost, power consumption, and SORT efficiency are all significantly improved by the Memory1 configuration.

Spark Shuffle Architecture: The Memory1 Advantage

Apache Spark allows large datasets to be acted upon in memory, making it a faster alternative to Hadoop MapReduce and other storage-bound approaches. Spark follows a process similar to the MapReduce paradigm implemented in Hadoop, but allows greater flexibility by giving architects the ability to persist, or cache, Resilient Distributed Datasets (RDDs) to memory ("persist memory") or to storage ("persist disk"). The results discussed thus far were obtained in persist-memory mode. However, many Spark architects will necessarily persist to storage instead, potentially shifting the performance bottleneck to storage.

Using a shuffle-sort model, the data is first mapped: the job is divided among multiple nodes in the cluster, and each node assigns a key to each piece of data, writing to a separate file (or bucket) for each key. This causes a shuffle write. Once the data has been mapped into these files, it is sorted by key, and data from matching buckets is combined in the reduce stage, causing a shuffle read. In persist-disk mode, shuffle data is written to and read from storage, dramatically slowing the processing of data.

Memory1 allows the shuffle data to be accelerated as well, even when persisting to disk. Because Memory1 is application memory, a RAMDisk can be created and used for Spark's shuffle data, removing the storage bottleneck and making the full performance of Memory1 available for shuffle data. To illustrate the use of Memory1 in persist-disk mode, we performed the SORT test on the Inspur NF5180M4 in persist-disk mode using Memory1 as a RAMDisk.
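The map / shuffle-write / shuffle-read flow described above can be mimicked in a few lines of pure Python. This is an illustration of the pattern, not Spark internals: records are hashed into per-key buckets during the map stage (the shuffle write), then each bucket is read back and combined by key in the reduce stage (the shuffle read).

```python
from collections import defaultdict

def shuffle_write(records, key_func, num_buckets):
    """Map stage: assign each record a key and append it to that key's bucket.

    In Spark these buckets are files on storage (or a RAMDisk); here they
    are simply in-memory lists.
    """
    buckets = defaultdict(list)
    for record in records:
        buckets[hash(key_func(record)) % num_buckets].append(record)
    return buckets

def shuffle_read(buckets, key_func):
    """Reduce stage: read each bucket back and group its records by key."""
    grouped = defaultdict(list)
    for bucket in buckets.values():
        for record in bucket:
            grouped[key_func(record)].append(record)
    return grouped

words = ["spark", "sort", "spark", "shuffle", "sort", "spark"]
buckets = shuffle_write(words, key_func=lambda w: w, num_buckets=4)
counts = {k: len(v) for k, v in shuffle_read(buckets, key_func=lambda w: w).items()}
```

In a real cluster, every bucket is written and re-read across the shuffle boundary, which is why the medium backing those buckets (disk versus memory-backed RAMDisk) dominates performance in persist-disk mode.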
First, we performed a SORT on a 200GB dataset to establish a baseline. We then reran the test using 500GB, 1TB, and 1.5TB datasets. Figure 4 shows the results of these tests.

Figure 4: Results of Memory1 in persist-disk mode

As can be seen in Figure 4, completion times scale nearly linearly as the dataset grows. By writing shuffle data to a Memory1 RAMDisk, Spark takes advantage of the latency and bandwidth benefits of the processors' memory controllers. Because Memory1 is connected directly to the server's processors, latency is significantly reduced and bandwidth is considerably higher. Apache Spark architects now have an option to increase the performance of their operations by mapping Memory1 as a RAMDisk.
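On a Linux host, redirecting Spark's shuffle data to a memory-backed RAMDisk might look like the configuration sketch below. The mount point and size are illustrative assumptions, not values from the test setup; spark.local.dir is the standard Spark property for shuffle and scratch space.

```shell
# Create a RAMDisk backed by (Memory1-expanded) application memory.
mount -t tmpfs -o size=512g tmpfs /mnt/ramdisk

# Point Spark's shuffle/scratch space at the RAMDisk
# (in conf/spark-defaults.conf):
#   spark.local.dir  /mnt/ramdisk
```

Because tmpfs pages live in application memory, shuffle writes and reads then proceed at memory speed rather than storage speed.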

What We've Shown

Diablo Technologies Memory1 economically enables a dramatic expansion of the application memory available in each server. By enabling a 65% decrease in TCO and a 75% increase in SORT efficiency per dollar, Memory1 eliminates the hardware cost concerns traditionally faced in multi-server Spark deployments. In addition, having more memory per server enables each Spark server to perform more work, which mitigates the impact of networking overhead by reducing the number of servers required.

In summary, Memory1 allows Spark system designers to:

- Avoid the high cost of DRAM-only implementations
- Reduce the number of Spark servers required for a given job, minimizing the cost of additional servers
- Minimize the number of network hops, thereby minimizing the bandwidth and latency impact of networking overhead

These benefits are only possible with the improved capacity and economics provided by Memory1. By massively increasing application memory per server, Memory1 improves Spark's Return on Investment (ROI), both maximizing performance and minimizing Total Cost of Ownership (TCO).

About Inspur

Inspur Systems Inc., located in Fremont, CA, is part of Inspur Group, a leading cloud computing and global IT solutions provider. Inspur was founded in 1945 and has since provided IT products and services to over 85 countries. Inspur is ranked by Gartner as one of the top 5 largest server manufacturers in the world and #1 in China. Inspur provides its global customers with data center server and storage solutions that deliver Tier 1 quality and performance, are energy efficient and cost effective, and are built for specific workloads and data center environments. As a leading total solutions and services provider, Inspur is capable of providing solutions at the IaaS, PaaS, and SaaS levels with high-end servers, mass storage systems, cloud operating systems, and information security technology. For more information, visit www.inspursystems.com.

About Diablo Technologies

Diablo Technologies is a leading developer of high-performance memory products that solve urgent business problems by wringing more performance out of fewer servers. Diablo's Memory1 combines the highest-capacity memory modules with Diablo's leading Software Defined Memory platform. Memory1 enables a dramatic reduction in datacenter expenses along with significant increases in server and application capability. Diablo's products and technology are included in solutions from leading server vendors such as Inspur. Diablo is best known for its innovative Memory Channel Storage (MCS) architecture, which decreased storage access times by more than 80% by attaching flash storage directly to the CPU's memory controller.

© 2016. All Rights Reserved. The dt logo, Diablo Technologies, and Memory1 are trademarks or registered trademarks of Diablo Technologies, Incorporated. The Inspur logo is a trademark of Inspur Group. All other trademarks are property of their respective owners.