
Optimizing Apache Spark with Memory1
July 2016

Abstract

The prevalence of Big Data is driving increasing demand for real-time analysis and insight. Big data processing platforms like Apache Spark rely on application memory to deliver the required performance. Unfortunately, because of that heavy reliance on memory, the potential of these platforms is constrained by the cost and capacity limitations of DRAM. This paper demonstrates that by leveraging Diablo Technologies Memory1 to maximize the available memory, Inspur servers can do more work (a 75% efficiency improvement), unleashing the full potential of real-time big data processing.

Introduction: Bigger, Faster Data

Today, data is being generated at unprecedented rates and from a growing variety of sources. As a result, the term "big data" has become ingrained in our standard lexicon. But just how big is Big Data? To put its speed and size into perspective, it has been estimated that over 90% of the world's data was generated in the past two years. Current estimates put our daily data generation rate at 2.5 exabytes [1]. That's two and a half billion gigabytes of new data, every single day.

Accordingly, modern businesses face both a unique opportunity and a significant challenge. The opportunity lies in transforming Big Data into actionable knowledge. The adage "knowledge is power" has never been more true than in today's data-saturated society. The ability to effectively leverage data impacts our lives in numerous ways, from stopping disease (by analyzing medical statistics to identify key patterns and indicators) to finding the perfect restaurant, car, outfit, or mate (by turning data-driven insights into targeted, personalized recommendations).

However, performing these data-to-knowledge transformations is no easy task. In many areas, our ability to analyze data has lagged behind the increasing size and speed of the data itself.
The sheer density and velocity of the incoming information, coupled with the need for accurate, real-time analysis, can make it very difficult to process and manage.

[1] http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html

Apache Spark and In-Memory Computing

To handle increasing data rates and demanding user expectations, big data processing platforms like Apache Spark have emerged and quickly gained popularity. Spark provides a general-purpose clustered computing framework that can rapidly ingest and process real-time streams of data, enabling instantaneous analytics and decision-making.

How Spark Works

Creation, transformation, and manipulation of in-memory data is significantly faster than alternative, storage-centric approaches. Consequently, to provide its uniquely high performance, Apache Spark relies on in-memory data management. In Spark, the main data abstraction is the Resilient Distributed Dataset (RDD). RDDs are simply collections of data that can be partitioned across the nodes of a cluster and operated upon in parallel. These RDD-based operations are central to Spark functionality and performance. Keeping Spark RDDs in memory (as opposed to on disk) keeps data closest to the CPUs, enabling the fastest access and the most optimized system-level performance.

Spark's Memory Capacity Problem

As one might expect, Spark operations are extremely dependent on memory capacity. To facilitate rapid data retrieval, objects in Spark need to be quickly created, cached, sorted, grouped, and/or joined, creating a need for massive amounts of application memory. Unfortunately, the memory available in a single server is insufficient to handle most Spark jobs, largely due to the cost and capacity constraints imposed by DRAM. Those constraints are among the key reasons that Spark deployments are often distributed across many servers. Deploying many servers enables system designers to create memory footprints much larger than a single server could provide. Though its distributed architecture enables larger pools of memory, there are still issues that subvert the full potential of Spark's disaggregated approach.
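The partition-and-parallelize idea behind RDDs can be illustrated with a small pure-Python sketch. This is not Spark code; the names `partition` and `parallel_map` are ours, and a real RDD adds lineage tracking, fault tolerance, and distribution across machines.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, num_partitions):
    """Split a dataset into roughly equal chunks, as Spark partitions an RDD."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def parallel_map(func, partitions):
    """Apply a transformation to every in-memory partition concurrently."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(lambda p: [func(x) for x in p], partitions))

# Keep the "RDD" in memory and operate on its partitions in parallel.
rdd = partition(list(range(10)), num_partitions=4)
squared = parallel_map(lambda x: x * x, rdd)
flattened = [x for part in squared for x in part]
```

Because every partition stays resident in memory, each worker reads its data at memory speed; the same operation against on-disk partitions would be bounded by storage latency instead.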
Key issues include:

- Cost of DRAM: The high cost of DRAM deters system designers from providing the larger, single-server memory footprints that would minimize cluster sizes and fully optimize performance.

- Cost of additional nodes: Creating large clusters adds expense due to the cost of the servers, the associated networking, and increased operational expenses (e.g., added power consumption).

- Networking overhead: Too many network hops can bottleneck Spark performance. Splitting Spark jobs across a large cluster requires data transfer and coordination between cluster nodes. When clusters grow too large, this overhead can negatively impact performance.

In this paper, we demonstrate how expanded application memory capacity, enabled by Memory1 from Diablo Technologies, addresses key issues faced by Spark in traditional, DRAM-only deployments.

Introducing Memory1

What Is Memory1?

Diablo Technologies Memory1 is the first memory DIMM to expose NAND flash as standard application memory. This revolutionary solution provides the industry's most economical and highest-capacity byte-addressable memory modules. Memory1 provides up to 4X more memory capacity than other DIMMs, enabling dramatic increases in application memory per server. This yields significant performance advantages due to increased data locality and reduced access times. Memory1 also minimizes Total Cost of Ownership (TCO) by reducing the number of servers required to support memory-constrained applications such as Apache Spark.

Memory1 DIMMs interface seamlessly with existing hardware and software. They are JEDEC-compatible DDR4 DIMMs and are deployed into standard DDR4 DIMM slots. Processors, motherboards, operating systems, and applications do not need to change; target applications simply leverage the expanded memory capacity as they see fit.

Figure 1: 128GB Memory1 DIMM

By leveraging flash's massive cost, power, and capacity advantages over DRAM, Memory1 DIMMs drastically change the economics of server memory, allowing applications to use huge pools of local memory that were previously infeasible to provide.
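The arithmetic behind the "up to 4X" capacity claim is straightforward. The sketch below assumes a typical 16-slot server and compares fully populating it with 32GB DRAM DIMMs versus 128GB Memory1 DIMMs:

```python
DIMM_SLOTS = 16

dram_only_gb = DIMM_SLOTS * 32   # 32GB DRAM modules per slot
memory1_gb = DIMM_SLOTS * 128    # 128GB Memory1 modules per slot

# Same slots, 4X the per-server application memory.
capacity_multiple = memory1_gb / dram_only_gb
```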

Solving Spark with Memory1

Benchmarking SORT Performance

Because most big data applications must perform a SORT on the entire dataset, SORT performance is crucial. The speed at which data can be organized into easily searchable configurations is often Spark's primary performance bottleneck. The more rapidly SORT operations can be performed on large datasets, the more quickly data can be ingested, retrieved, manipulated, and analyzed. Therefore, to simulate the critical demands of a typical Spark workload, we used the industry-standard spark-perf benchmark to perform SORT operations on a typically-sized 500GB dataset. For comparison purposes, the SORT jobs were performed on both DRAM-only and Memory1 hardware configurations.

Hardware Setup

To efficiently sort datasets of a given magnitude, Spark requires significantly more memory than the dataset size. This is due to the additional memory needed for RDD creation and object manipulation, and to support Spark management processes. Clustered Spark servers also require additional memory capacity to absorb the overhead created by intra-cluster coordination and synchronization activities.

To facilitate the effective sorting of a 500GB total dataset, the DRAM-only configuration was sized to provide 1.5TB of application memory, using a cluster of three 2-socket servers, each with 512GB of DRAM. This 3-to-1 ratio of available memory to dataset size enables a 500GB SORT job to complete in an acceptable timeframe (i.e., under 1 hour). Each cluster node represents a typical 2-socket, 16-DIMM-slot server, fully populated with 32GB DRAM modules (16 DIMM slots * 32GB = 512GB).

To demonstrate the improved efficiency and economics provided by Memory1, the Memory1 setup included only a single Inspur NF5180M4 2-socket server, populated with eight 16GB DRAM DIMMs and eight 128GB Memory1 DIMMs, providing a total of 1TB of application memory.
Note that the single-server Memory1 configuration incurs no additional overhead from intra-cluster coordination and synchronization. Therefore, in this case, we expected a 2-to-1 ratio of available memory to dataset size to provide acceptable performance when sorting a 500GB dataset. The CPU and memory configurations for both setups are summarized in Table 1 below.
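The sizing rule of thumb used here (1.5TB of memory for a 500GB dataset on a cluster, 1TB on a single coordination-free server) can be written as a small calculation. The function name and the bare ratios are ours, distilled from the configurations above:

```python
def required_memory_gb(dataset_gb, clustered):
    """Estimate application memory needed for a Spark SORT of a given dataset.

    Rule of thumb from the test configurations: a cluster needs ~3x the
    dataset size (extra headroom for intra-cluster coordination), while a
    single server needs only ~2x.
    """
    ratio = 3 if clustered else 2
    return dataset_gb * ratio

dram_cluster_gb = required_memory_gb(500, clustered=True)     # 1500 GB = 1.5 TB
memory1_server_gb = required_memory_gb(500, clustered=False)  # 1000 GB = 1 TB
```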

Table 1: Server configuration details

Testing With DRAM-Only

DRAM-only: Test Setup

To simulate a typical customer configuration in today's DRAM-only deployments, our 3-server Spark cluster was deployed using the aforementioned setup and presented with a 500GB dataset generated by the spark-perf benchmark.

DRAM-only: Results

Using the DRAM-only cluster, sorting the 500GB dataset took 27.5 minutes, as shown in Figure 2.

Figure 2: DRAM-only SORT time for 500GB dataset

DRAM-only: Total Cost of Ownership (TCO)

The total CAPEX for the 3-server DRAM-only setup (based on typical server and memory costs) was $47,400. This represents the cost of servers, processors, memory, and other associated hardware. Operational costs are also important to consider, so we calculated a simple OPEX based purely on the electrical costs associated with the deployment. For the DRAM-only cluster, the 3-year OPEX totaled nearly $3,500, as shown in Table 2 below. Note that in a real-world deployment, OPEX would be even higher once additional expenses for server management, physical space, cooling, and so on are considered.

Table 2: OPEX for DRAM-only configuration (1.5TB total application memory)

Adding the $47,400 CAPEX and the $3,434 3-year OPEX yields a 3-year TCO of $50,834. In summary, a 3-server cluster based solely on DRAM can sort 500GB of Spark data in 27.5 minutes at a 3-year cost of $50,834. To complete the analysis, we also calculated several efficiency metrics, as shown in Table 3 below.
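The DRAM-only TCO figure is simple arithmetic over the reported CAPEX and 3-year electrical OPEX:

```python
capex_usd = 47_400    # three 2-socket servers with 512GB DRAM each
opex_3yr_usd = 3_434  # electrical costs over three years (Table 2)

tco_3yr_usd = capex_usd + opex_3yr_usd  # $50,834
```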

Table 3: Efficiency metrics for DRAM-only configuration (1.5TB total application memory)

Testing With Memory1

Memory1: Test Setup

To test Memory1, we again sorted a 500GB dataset generated by the spark-perf benchmark. This time, however, the entire SORT job was handled within the expanded memory of a single Memory1 server.

Memory1: Results

Using the Inspur NF5180M4 with Memory1, sorting the 500GB dataset took just 19.5 minutes. This represents more than a 29% reduction in SORT time versus the DRAM-only configuration.

Figure 3: DRAM-only and Memory1 SORT times

Memory1: Total Cost of Ownership (TCO)

The total CAPEX for this setup (based on typical server and memory costs) was $16,496, including the server, processors, eight DRAM DIMMs, and eight Memory1 DIMMs. Again, operational costs are critical, so we calculated OPEX based on the electrical costs associated with the Memory1 deployment. For the Memory1 server, the 3-year OPEX totaled just $1,144, as shown in Table 5 below.

Table 5: OPEX for Memory1 configuration (1TB total application memory)

Adding the $16,496 CAPEX and the $1,144 3-year OPEX yields a 3-year TCO of $17,640. In summary, a single 1-terabyte Memory1 server can sort 500GB of Spark data in 19.5 minutes at a 3-year cost of $17,640. To complete the picture, we also calculated several efficiency metrics, as shown in Table 6 below.

Table 6: Efficiency metrics for Memory1 configuration (1TB total application memory)

Comparing DRAM-Only vs. Memory1

As shown in Table 7 below, a side-by-side comparison of performance, cost, and efficiency is very telling. Compared to the 3-server DRAM-only cluster, the single Memory1 server sorted the data faster and at a significantly reduced TCO. The improvement in both cost and power efficiency is dramatic and demonstrates a clear advantage over the DRAM-only configuration.

Table 7: TCO comparison between the DRAM-only and Memory1 configurations
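The headline savings follow directly from the two measured TCO and SORT-time figures; the sketch below reproduces the roughly 65% TCO reduction and the 29% SORT-time reduction:

```python
dram_tco_usd, memory1_tco_usd = 50_834, 17_640  # 3-year TCO per configuration
dram_sort_min, memory1_sort_min = 27.5, 19.5    # 500GB SORT times, minutes

tco_reduction = 1 - memory1_tco_usd / dram_tco_usd            # ~0.65, i.e. 65%
sort_time_reduction = 1 - memory1_sort_min / dram_sort_min    # ~0.29, i.e. 29%
```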

As evidenced by the test results, Memory1-enabled servers provide compelling advantages in all facets of an Apache Spark deployment. Solution cost, power consumption, and SORT efficiency are all significantly improved by the Memory1 configuration.

Spark Shuffle Architecture: The Memory1 Advantage

Apache Spark allows large datasets to be acted upon in memory, making it a faster alternative to Hadoop MapReduce and other storage-bound approaches. Spark follows a process similar to the MapReduce paradigm implemented in Hadoop, but allows greater flexibility by giving architects the ability to persist, or cache, Resilient Distributed Datasets (RDDs) to memory ("persist memory") or to storage ("persist disk"). The results discussed thus far were obtained in persist-memory mode. However, many Spark architects will necessarily persist to storage instead, potentially shifting the performance bottleneck to storage.

Using a shuffle-sort model, the data is first mapped: the job is divided among multiple nodes in the cluster, and each node assigns a key to each piece of data, writing to a separate file (or bucket) for each key. This causes a shuffle write. Once the data has been mapped into these files, it is sorted by key, and data from matching buckets is combined in the reduce stage, causing a shuffle read. In persist-disk mode, shuffle data is written to and read from storage, dramatically slowing the processing of data.

Memory1 allows the shuffle data to be accelerated as well, even when persisting to disk. Because Memory1 is application memory, a RAMDisk can be created and used for Spark's shuffle data, removing the storage bottleneck and making the full performance of Memory1 available for shuffle data. To illustrate the use of Memory1 in persist-disk mode, we performed the SORT test on the Inspur NF5180M4 in persist-disk mode using Memory1 as a RAMDisk.
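The map / shuffle-write / shuffle-read flow described above can be mimicked in a few lines of pure Python. This is an illustration of the pattern, not Spark internals: records are hashed into per-key buckets during the map stage (the shuffle write), then each bucket is read back and combined by key in the reduce stage (the shuffle read).

```python
from collections import defaultdict

def shuffle_write(records, key_func, num_buckets):
    """Map stage: assign each record a key and append it to that key's bucket.

    In Spark these buckets are files on storage (or a RAMDisk); here they
    are simply in-memory lists.
    """
    buckets = defaultdict(list)
    for record in records:
        buckets[hash(key_func(record)) % num_buckets].append(record)
    return buckets

def shuffle_read(buckets, key_func):
    """Reduce stage: read each bucket back and group its records by key."""
    grouped = defaultdict(list)
    for bucket in buckets.values():
        for record in bucket:
            grouped[key_func(record)].append(record)
    return grouped

words = ["spark", "sort", "spark", "shuffle", "sort", "spark"]
buckets = shuffle_write(words, key_func=lambda w: w, num_buckets=4)
counts = {k: len(v) for k, v in shuffle_read(buckets, key_func=lambda w: w).items()}
```

In a real cluster, every bucket is written and re-read across the shuffle boundary, which is why the medium backing those buckets (disk versus memory-backed RAMDisk) dominates performance in persist-disk mode.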
First, we performed a SORT on a 200GB dataset to establish a baseline. We then reran the test using 500GB, 1TB, and 1.5TB datasets. Figure 4 shows the results of these tests.

Figure 4: Results of Memory1 in persist-disk mode

As can be seen in Figure 4, completion times scale nearly linearly as the dataset grows. By writing shuffle data to a Memory1 RAMDisk, Spark takes advantage of the latency and bandwidth benefits of the processors' memory controllers. Because Memory1 is connected directly to the server's processors, latency is significantly reduced and bandwidth is considerably higher. Apache Spark architects now have an option to increase the performance of their operations by mapping Memory1 as a RAMDisk.
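On a Linux host, redirecting Spark's shuffle data to a memory-backed RAMDisk might look like the configuration sketch below. The mount point and size are illustrative assumptions, not values from the test setup; spark.local.dir is the standard Spark property for shuffle and scratch space.

```shell
# Create a RAMDisk backed by (Memory1-expanded) application memory.
mount -t tmpfs -o size=512g tmpfs /mnt/ramdisk

# Point Spark's shuffle/scratch space at the RAMDisk
# (in conf/spark-defaults.conf):
#   spark.local.dir  /mnt/ramdisk
```

Because tmpfs pages live in application memory, shuffle writes and reads then proceed at memory speed rather than storage speed.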

What We've Shown

Diablo Technologies Memory1 economically enables a dramatic expansion of the application memory available in each server. By enabling a 65% decrease in TCO and a 75% increase in SORT efficiency per dollar, Memory1 eliminates the hardware cost concerns traditionally faced in multi-server Spark deployments. In addition, having more memory per server enables each Spark server to perform more work, which mitigates the impact of networking overhead by reducing the number of servers required.

In summary, Memory1 allows Spark system designers to:

- Avoid the high cost of DRAM-only implementations
- Reduce the number of Spark servers required for a given job, minimizing the cost of additional servers
- Minimize the number of network hops, thereby minimizing the bandwidth and latency impact of networking overhead

These benefits are only possible with the improved capacity and economics provided by Memory1. By massively increasing application memory per server, Memory1 improves Spark's Return on Investment (ROI), both maximizing performance and minimizing Total Cost of Ownership (TCO).

About Inspur

Inspur Systems Inc., located in Fremont, CA, is part of Inspur Group, a leading cloud computing and global IT solutions provider. Inspur was founded in 1945 and has since provided IT products and services to over 85 countries. Inspur is ranked by Gartner as one of the top 5 largest server manufacturers in the world and #1 in China. Inspur provides its global customers with data center server and storage solutions that deliver Tier 1 quality and performance, are energy efficient and cost effective, and are built for specific workloads and data center environments. As a leading total solutions and services provider, Inspur is capable of providing solutions at the IaaS, PaaS, and SaaS levels with high-end servers, mass storage systems, cloud operating systems, and information security technology. For more information, visit www.inspursystems.com.

About Diablo Technologies

Diablo Technologies is a leading developer of high-performance memory products that solve urgent business problems by wringing more performance out of fewer servers. Diablo's Memory1 combines the highest-capacity memory modules with Diablo's leading Software Defined Memory platform. Memory1 enables a dramatic reduction in datacenter expenses along with significant increases in server and application capability. Diablo's products and technology are included in solutions from leading server vendors such as Inspur. Diablo is best known for its innovative Memory Channel Storage (MCS) architecture, which decreased storage access times by more than 80% by attaching flash storage directly to the CPU's memory controller.

© 2016. All Rights Reserved. The dt logo, Diablo Technologies, and Memory1 are trademarks or registered trademarks of Diablo Technologies, Incorporated. The Inspur logo is a trademark of Inspur Group. All other trademarks are property of their respective owners.