Accelerate Big Data Insights


Mary Walsh
6 years ago

Executive Summary

An abundance of information isn't always helpful when time is of the essence. In the world of big data, the ability to accelerate time-to-insight can not only give businesses an immediate competitive advantage, it can also allow mission-critical systems to better protect life, property and country. Cancun Systems provides an in-memory SDML (software-defined memory lake) platform that delivers massive acceleration, cost efficiency and deployment flexibility to big data workloads. By employing Cancun MemoryLake technology, organizations can get insights considerably faster to improve decision making, minimize risk, and increase profits.

ACCELERATE TIME-TO-INSIGHT FROM HOURS TO MINUTES

"Cancun Systems MemoryLake™ demonstrated significantly faster time-to-insight while also eliminating infrastructure inefficiencies."
David Vennergrund, Director, Data Science, CSRA

Challenges of Large Datasets

Whether the cluster runs MapReduce, Hive, or Spark, most big data sets are much larger than the cluster's physical memory capacity, which creates a bottleneck in memory and storage I/O. This leads to poor application performance, inefficient architectures, and expensive scaling requirements. Cancun's MemoryLake™ SDML platform makes intelligent use of available resources across memory and storage and allows analytics workloads to access data at the speed of memory but at the cost efficiency of disk. This allows organizations to query more data faster and more efficiently.

Cancun MemoryLake™: An In-Memory Software Platform for Accelerated Insights

Cancun's MemoryLake software platform delivers an SDML that enables applications to run up to 10X faster, allowing customers to accelerate time to insights and enjoy tremendous infrastructure efficiencies.
Cancun's MemoryLake™ provides immediate benefits in three areas:

Faster Time to Insights: By pooling and virtualizing available memory and storage resources within or across nodes, Cancun can create a software-defined memory lake. In-memory applications like Spark can now run significantly faster because applications are accelerated and pipelined at memory speed, enabling workflows to complete in a fraction of the time.

Infrastructure Efficiency and Savings: Whether deployed on-premises or in the cloud, Cancun's MemoryLake™ software delivers unprecedented infrastructure efficiency. Existing build-outs can run more jobs and query more data without the purchase of additional infrastructure. New build-outs require only a fraction of the expected infrastructure. For cloud deployments, customers can experience both faster insights and immediate savings because they are able to complete jobs and decommission clusters much faster.

Rev#071717


Deployment Simplicity and Flexibility: Cancun enables businesses to deploy MemoryLake™ software in private, public, or hybrid cloud environments, and to ingest data directly from various sources (e.g. HDFS, NAS, cloud object stores) for richer insights. Installation is simple and takes only minutes. Deployment is frictionless, requiring no changes to application code or underlying infrastructure.

Figure 1: Cancun MemoryLake™ technology delivers massive speed, agility, and cost efficiency to existing Big Data frameworks

Virtualizing Multiple Tiers of Memory

Cancun abstracts the physical memory and storage resources resident in a node to give the impression of a very large memory pool available for memory-speed data access. It can also pool memory and storage from a remote node, which makes deployment very easy. For example, in existing deployments, customers can add a new memory/SSD-dense node and dramatically improve the performance of the entire cluster at once.

Figure 2: Cancun MemoryLake leverages memory and different classes of storage (1) and uses simple policies (2) to manage data so applications access data at memory speed (3)
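The idea of a small fast tier backed by a larger slow tier, with a simple policy moving data between them, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not say which policies MemoryLake uses, so least-recently-used (LRU) eviction is an assumption here, and the class name `TieredStore` and its methods are hypothetical, not part of any Cancun API.

```python
from collections import OrderedDict


class TieredStore:
    """Toy two-tier store: a small fast tier backed by a larger slow tier.

    Hypothetical illustration only. LRU is an assumed policy, not
    something the white paper specifies.
    """

    def __init__(self, fast_capacity):
        if fast_capacity < 1:
            raise ValueError("fast_capacity must be at least 1")
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # stands in for RAM, kept in LRU order
        self.slow = {}             # stands in for SSD/HDD

    def put(self, key, block):
        # New or updated blocks always land in the fast tier.
        self.slow.pop(key, None)   # drop any stale copy in the slow tier
        self.fast[key] = block
        self.fast.move_to_end(key)
        self._evict_if_needed()

    def get(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)  # mark as most recently used
            return self.fast[key]
        block = self.slow.pop(key)      # raises KeyError for unknown keys
        self.put(key, block)            # promote back to the fast tier
        return block

    def _evict_if_needed(self):
        # Demote least-recently-used blocks until the fast tier fits.
        while len(self.fast) > self.fast_capacity:
            old_key, old_block = self.fast.popitem(last=False)
            self.slow[old_key] = old_block


store = TieredStore(fast_capacity=2)
for name in ("a", "b", "c"):
    store.put(name, name.upper())
# "a" was the least recently used block, so it now sits in the slow tier.
print(sorted(store.fast), sorted(store.slow))
```

The caller only ever uses `put` and `get`; which tier a block lives in is decided by the policy, which mirrors the paper's point that applications see one large pool.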


The Cancun MemoryLake™ platform automatically caches or evicts data using simple policies so that applications see a memory footprint that is orders of magnitude larger. Cancun supports RAM, NVMe, SSD, and HDD, and has built-in support for the upcoming 3D XPoint for even faster acceleration.

Off-Heap Memory Management for Big Data

When large amounts of data are involved, issues with Java memory management can arise, resulting in a significant hit to performance. Java's inability to handle large data sizes in the JVM results in frequent, expensive garbage collection, during which performance drops significantly. In addition, if the JVM crashes, all data in memory is lost and must be rebuilt from disk, a slow and cumbersome process.

Figure 3: Garbage collection (1) slows applications; if the JVM crashes (2), data must be rebuilt from disk

Cancun MemoryLake™ technology significantly speeds up jobs by avoiding garbage collection. Data blocks are moved off heap to remove the load on the garbage collector, and a persistent distributed cache ensures that data can be fetched quickly. In addition, if the JVM crashes, data is retrieved quickly from off-heap memory without a read from slow HDFS, because with Cancun MemoryLake™ the data remains in memory, making crash recovery much faster.

Figure 4: Cancun avoids garbage collection (1) and data is quickly retrieved (2) if the JVM crashes

Cancun MemoryLake™ Accomplishes Data Transfer via a Fast, In-Memory File System

Big data workloads are typically built as pipelines. The output of one stage is fed into the next stage, and this output is written to disk, which becomes a chokepoint. With Cancun MemoryLake™, data transfer across stages is done via an in-memory file system (see notation 1 in Figure 5), which is an order of magnitude faster than disk-based file systems. For disk access within a stage (see notation 2 in Figure 5), Cancun allows the numerous intermediate writes to disk to be done via the in-memory file system, so that pipelined jobs complete dramatically faster.

Figure 5: Disk access is done via an in-memory file system for blazing performance

Other Performance Enhancements

Compression: Cancun applies advanced compression techniques to the data that it manages. These techniques reduce the storage footprint and optimize network traffic, because smaller packets are transferred between the nodes.

Dedicated mount point for shuffle traffic: Big Data jobs spend considerable time in the shuffle stage. Cancun allows a high-speed, dedicated mount point for shuffle traffic, which accelerates the shuffle stage of the job.
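The effect of compression on the number of bytes stored or sent between nodes can be illustrated with a short, hedged sketch. The paper does not name the compression techniques Cancun uses, so Python's standard `zlib` module stands in here purely as an assumption, and the sample record and the function name `compression_ratio` are made up for illustration.

```python
import zlib


def compression_ratio(payload: bytes, level: int = 6) -> float:
    """Return original size divided by compressed size for one data block.

    zlib is an assumed stand-in; the white paper does not name an algorithm.
    """
    if not payload:
        raise ValueError("payload must not be empty")
    compressed = zlib.compress(payload, level)
    # Round-trip check: compression must be lossless.
    if zlib.decompress(compressed) != payload:
        raise RuntimeError("round trip failed")
    return len(payload) / len(compressed)


# Hypothetical, highly repetitive block, similar in spirit to log or CSV data.
block = b"2017-07-17,sensor-42,OK\n" * 10_000
print(f"{len(block)} bytes shrink by a factor of {compression_ratio(block):.0f}")
```

The ratio depends heavily on the data: repetitive text compresses very well, while already-compressed or random data may barely shrink at all.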

Proof Points

Shown below are the results of Spark and Hadoop tests that were run with and without Cancun Software-Defined Memory Lake (SDML) technology.

1. Real Customer ETL Scenario

1.1 Performance

The customer scenario was to JOIN more than 1 billion rows, spread across two data files, and then use the joined file for downstream analysis. The test was run in two scenarios: one without Cancun technology and one with it. On an infrastructure consisting of 8 nodes, the run time without Cancun was 14 minutes, while the run time with Cancun was 1.3 minutes, a more than 10x improvement in performance.
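The improvement factor quoted above is simply the baseline run time divided by the accelerated run time. A minimal sketch of that arithmetic, using the two run times reported for the ETL scenario (the function name `speedup` is chosen here for illustration):

```python
def speedup(baseline_minutes: float, accelerated_minutes: float) -> float:
    """Return how many times faster the accelerated run is than the baseline."""
    if baseline_minutes <= 0 or accelerated_minutes <= 0:
        raise ValueError("run times must be positive")
    return baseline_minutes / accelerated_minutes


# Run times reported for the 8-node ETL join: 14 minutes vs 1.3 minutes.
print(f"ETL join: {speedup(14, 1.3):.1f}x")  # prints "ETL join: 10.8x"
```

The result, roughly 10.8x, is consistent with the "more than 10x" figure stated in the text.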

1.2 Infrastructure Efficiency

Efficiency testing on this scenario also demonstrated significant improvements. The run time on an infrastructure consisting of 8 nodes without Cancun was 14 minutes, while the run time on an infrastructure consisting of only 4 nodes with Cancun was 2.3 minutes. In essence, the Cancun environment cut run time by roughly 6x while using half of the existing infrastructure footprint.

2. HiBench Spark TeraSort Benchmark

TeraSort is a popular benchmark that measures the amount of time needed to sort large, randomly distributed data on a given computer system. The following tests show the benchmark results in different environments. Characteristics of the test bed are:

Software: Apache Hadoop
Nodes: 1 master + 4 worker nodes
Dataset Size: 320GB
EC2 Instances: M4.4xlarge (64GB RAM, 16 vCPUs per instance)
Number of executors per node: 8
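As a toy illustration of what a TeraSort-style benchmark measures, the sketch below generates random fixed-size records, sorts them by key, and times only the sort. It is a single-process stand-in, not HiBench or Hadoop TeraSort itself; the 100-byte record with a 10-byte key follows the usual TeraGen row layout, and the function names are hypothetical.

```python
import random
import time

RECORD_SIZE = 100  # bytes per record, as in TeraGen-style rows
KEY_SIZE = 10      # leading bytes of each record used as the sort key


def generate_records(count, seed=0):
    """Create `count` pseudo-random records (seeded for repeatability)."""
    rng = random.Random(seed)
    return [
        bytes(rng.getrandbits(8) for _ in range(RECORD_SIZE))
        for _ in range(count)
    ]


def sort_records(records):
    """Sort records by their leading KEY_SIZE bytes."""
    return sorted(records, key=lambda record: record[:KEY_SIZE])


def timed_sort(count):
    """Generate records, then return the sorted list and the sort time in seconds."""
    records = generate_records(count)
    start = time.perf_counter()
    ordered = sort_records(records)
    elapsed = time.perf_counter() - start
    return ordered, elapsed


ordered, elapsed = timed_sort(10_000)
print(f"sorted {len(ordered)} records in {elapsed:.4f} s")
```

At benchmark scale the data no longer fits in one machine's memory, which is why the real test stresses storage and network I/O as well as CPU.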

2.1 Amazon EMR

On an EMR cluster, the TeraGen run without Cancun completed in 7 minutes, while the run with Cancun finished in 2m 46s, a more than 2.5x improvement in performance. The Spark TeraSort test was run both without and with Cancun on a 320GB dataset. On an infrastructure consisting of 5 nodes, the run time without Cancun was 20 minutes, while the run time with Cancun was 9 minutes, a more than 2x improvement in performance.

2.2 Amazon EC2 + HDFS on EBS

On an EC2 cluster with HDFS storage, the TeraGen run without Cancun completed in 52 minutes, while the run with Cancun finished in 3m 16s, a more than 15x improvement in performance. The HiBench Spark TeraSort test was run both without and with Cancun on a 320GB dataset. On an infrastructure consisting of 5 nodes, the run time without Cancun was slightly over 2 hours, while the run time with Cancun was 9m 35s, a more than 12x improvement in performance.

2.3 Amazon EC2 + S3

On an EC2 cluster with S3 storage, the TeraGen run without Cancun completed in 26 minutes, while the run with Cancun finished in 2m 34s, a more than 10x improvement in performance. The HiBench Spark TeraSort test was run both without and with Cancun on a 32GB dataset. On an infrastructure consisting of 5 nodes, the run time without Cancun was 33 minutes, while the run time with Cancun was 9 minutes, a more than 3.6x improvement in performance.

3. TestDFSIO Benchmark

TestDFSIO is a standard benchmark that is run on a cluster to test I/O performance to and from HDFS. The following tests show the benchmark results in different environments. Characteristics of the test bed are:

Software: Apache Hadoop
Nodes: 1 master + 4 worker nodes
Dataset Size: 800GB
EC2 Instances: M4.4xlarge (64GB RAM, 16 vCPUs per instance)
Number of executors per node: 2

3.1 Amazon EMR

On an EMR cluster, the Hadoop TestDFSIO test was run in various scenarios without and with Cancun. For a DFSIO write test on an infrastructure consisting of 5 nodes and an 800GB dataset, the run time without Cancun was 15 minutes, while the run time with Cancun was 1m 15s, a 12x improvement in performance. For reads, the run times were 28 minutes and 1m 34s, respectively, a nearly 18x improvement when using the Cancun MemoryLake™ platform.

3.2 Amazon EC2 + HDFS on EBS

On an EC2 cluster with HDFS storage, the Hadoop TestDFSIO test was run in various scenarios without and with Cancun. For a DFSIO write test on an infrastructure consisting of 5 nodes and an 800GB dataset, the run time without Cancun was 2h 7m, while the run time with Cancun was 1m 16s, a more than 98x improvement in performance. For reads, the run times were 14h 33m and 1m 54s, respectively, a more than 452x improvement when using the Cancun MemoryLake™ platform.
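Because the run times in this section mix hours, minutes, and seconds, it can help to convert them to seconds before comparing. The sketch below is one way to do that for the EC2 + HDFS on EBS figures quoted above; the helper names `to_seconds` and `improvement` are hypothetical, and the parser handles only the `h`/`m`/`s` shorthand used in this paper.

```python
import re

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def to_seconds(text: str) -> int:
    """Convert a duration such as '2h 7m' or '1m 16s' to whole seconds."""
    parts = re.findall(r"(\d+)\s*([hms])", text)
    if not parts:
        raise ValueError(f"no duration found in {text!r}")
    return sum(int(value) * _UNIT_SECONDS[unit] for value, unit in parts)


def improvement(baseline: str, accelerated: str) -> float:
    """Return the baseline duration divided by the accelerated duration."""
    return to_seconds(baseline) / to_seconds(accelerated)


# Run times reported for TestDFSIO on EC2 + HDFS on EBS.
print(f"write: {improvement('2h 7m', '1m 16s'):.1f}x")
print(f"read:  {improvement('14h 33m', '1m 54s'):.1f}x")
```

With the durations as printed in the text, the ratios come out at roughly 100x for writes and 459x for reads, which is in line with the "more than 98x" and "more than 452x" figures; the small gap presumably reflects rounding of the reported times.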

3.3 Amazon EC2 + S3

On an EC2 cluster with S3 storage, the Hadoop TestDFSIO test was run in various scenarios without and with Cancun. For a DFSIO write test on an infrastructure consisting of 5 nodes and an 800GB dataset, the run time without Cancun was 32m 10s, while the run time with Cancun was 1m 17s, a more than 25x improvement in performance. For reads, the run times were 12m 40s and 1m 54s, respectively, a more than 6.6x improvement when using the Cancun MemoryLake™ platform.

Conclusion

Cancun Systems MemoryLake is a transparent software-defined memory lake layer that delivers memory-speed access to data. Optimized for big data workflows, Cancun MemoryLake™ addresses the inefficiency of memory management and data I/O in today's big data infrastructures. The Cancun platform allows big data jobs to take advantage of fast, in-memory resources to deliver the performance that enables organizations to improve decision making, minimize risk, and increase profits. For more information or to request a demo, visit us at
