Accelerate Big Data Insights

Executive Summary

An abundance of information isn't always helpful when time is of the essence. In the world of big data, the ability to accelerate time-to-insight can not only provide businesses with an immediate competitive advantage, it can also allow mission-critical systems to better protect life, property, and country.

Cancun Systems provides an in-memory SDML (software-defined memory lake) platform that delivers massive acceleration, cost efficiency, and deployment flexibility to big data workloads. By employing Cancun MemoryLake™ technology, organizations can gain insights considerably faster to improve decision making, minimize risk, and increase profits.

ACCELERATE TIME-TO-INSIGHT FROM HOURS TO MINUTES

"Cancun Systems MemoryLake™ demonstrated significantly faster time-to-insight while also eliminating infrastructure inefficiencies."
David Vennergrund, Director, Data Science, CSRA

Challenges of Large Datasets

Whether on a MapReduce, Hive, or Spark cluster, most big data sets are much larger than the physical memory capacity of the cluster, causing a bottleneck in memory and storage I/O. This leads to poor application performance, inefficient architectures, and expensive scaling requirements.

Cancun's MemoryLake™ SDML platform makes intelligent use of available resources across memory and storage, allowing analytics workloads to access data at the speed of memory but at the cost efficiency of disk. Organizations can therefore query more data, faster and more efficiently.

Cancun MemoryLake™: An In-Memory Software Platform for Accelerated Insights

Cancun's MemoryLake™ software platform delivers an SDML that enables applications to run up to 10X faster, allowing customers to accelerate time to insights and enjoy tremendous infrastructure efficiencies.
Cancun's MemoryLake™ provides immediate benefits in three areas:

Faster Time to Insights: By pooling and virtualizing available memory and storage resources within or across nodes, Cancun creates a software-defined memory lake. In-memory applications like Spark can now run significantly faster, because applications are accelerated and pipelined at memory speed, enabling workflows to complete in a fraction of the time.

Infrastructure Efficiency and Savings: Whether deployed on-premises or in the cloud, Cancun's MemoryLake™ software delivers unprecedented infrastructure efficiency. Existing build-outs can run more jobs and query more data without purchasing additional infrastructure, and new build-outs require only a fraction of the expected infrastructure. For cloud deployments, customers experience both faster insights and immediate savings, because jobs complete and clusters can be decommissioned much sooner.

Rev#071717

Deployment Simplicity and Flexibility: Cancun enables businesses to deploy MemoryLake™ software in private, public, or hybrid cloud environments and to ingest data directly from a variety of sources (e.g., HDFS, NAS, cloud object stores) for richer insights. Installation is simple and takes only minutes, and deployment is frictionless, requiring no changes to application code or underlying infrastructure.

Figure 1: Cancun MemoryLake™ technology delivers massive speed, agility, and cost efficiency to existing big data frameworks

Virtualizing Multiple Tiers of Memory

Cancun abstracts the physical memory and storage resources resident in a node to give the impression of a very large memory pool available for memory-speed data access. It can also pool memory and storage from remote nodes, which makes deployment very easy: in existing deployments, for example, customers can add a new memory/SSD-dense node and dramatically improve the performance of the entire cluster at once.

Figure 2: Cancun MemoryLake™ leverages memory and different classes of storage (1) and uses simple policies (2) to manage data so that applications access data at memory speed (3)

The Cancun MemoryLake™ platform automatically caches or evicts data using simple policies so that applications see an orders-of-magnitude larger memory footprint. Cancun supports RAM, NVMe, SSD, and HDD, and has built-in support for the upcoming 3D XPoint technology for even faster acceleration.

Off-Heap Memory Management for Big Data

When large amounts of data are involved, Java memory management can become a significant performance problem. The JVM's difficulty handling large data sizes results in frequent, expensive garbage collection, during which performance drops sharply. In addition, if the JVM crashes, all data in memory is lost and must be rebuilt from disk, a slow and cumbersome process.

Figure 3: Garbage collection (1) slows applications; if the JVM crashes (2), data must be rebuilt from disk

Cancun MemoryLake™ technology significantly speeds up jobs by avoiding garbage collection. Data blocks are moved off heap to remove the load on the garbage collector, and a persistent distributed cache ensures that data can be fetched quickly. If the JVM crashes, data is retrieved from off-heap memory without having to read from slow HDFS; because with Cancun MemoryLake™ the data remains in memory, crash recovery is much faster.

Figure 4: Cancun avoids garbage collection (1), and data is quickly retrieved (2) if the JVM crashes
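The cache-or-evict policy described above can be illustrated with a minimal, hypothetical sketch (this is not Cancun's actual policy engine): a capacity-bounded LRU memory tier that spills evicted blocks to a slower storage tier, so hot data is served at memory speed while cold data remains reachable.

```python
from collections import OrderedDict

class TieredCache:
    """Toy two-tier cache: hot blocks live in a bounded 'memory' tier (LRU order);
    evicted blocks spill to a 'storage' tier, modeled here as a plain dict."""

    def __init__(self, memory_capacity):
        self.memory_capacity = memory_capacity
        self.memory = OrderedDict()   # block_id -> data, least recently used first
        self.storage = {}             # slower tier

    def put(self, block_id, data):
        self.memory[block_id] = data
        self.memory.move_to_end(block_id)           # mark as most recently used
        while len(self.memory) > self.memory_capacity:
            victim, vdata = self.memory.popitem(last=False)  # evict LRU block
            self.storage[victim] = vdata

    def get(self, block_id):
        if block_id in self.memory:                 # memory-speed hit
            self.memory.move_to_end(block_id)
            return self.memory[block_id]
        data = self.storage[block_id]               # miss: promote from storage
        self.put(block_id, data)
        return data

cache = TieredCache(memory_capacity=2)
cache.put("a", b"block-a")
cache.put("b", b"block-b")
cache.put("c", b"block-c")   # "a" is least recently used, so it spills to storage
print("a" in cache.storage)  # True
```

A real SDML layer would of course manage many tiers (RAM, NVMe, SSD, HDD) and persist off-heap, but the promote-on-access and evict-on-pressure flow is the same idea.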

Cancun MemoryLake™ Accomplishes Data Transfer via a Fast, In-Memory File System

Big data workloads are typically built as pipelines: the output of one stage is fed into the next, and because this output is written to disk, the disk becomes a chokepoint. With Cancun MemoryLake™, data transfer across stages is done via an in-memory file system (see notation 1 in Figure 5), which is an order of magnitude faster than disk-based file systems. For disk access within a stage (see notation 2 in Figure 5), Cancun allows numerous intermediate writes to disk to be done via the in-memory file system, so pipelined jobs complete dramatically faster.

Figure 5: Disk access is done via an in-memory file system for blazing performance

Other Performance Enhancements

Compression: Cancun applies advanced compression techniques to the data it manages. These techniques reduce the storage footprint and optimize network traffic, since smaller packets are transferred between nodes.

Dedicated mount point for shuffle traffic: Big data jobs spend considerable time in the shuffle stage. Cancun allows a high-speed, dedicated mount point for shuffle traffic, which accelerates the shuffle stage of the job.
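The stage-to-stage transfer idea can be modeled with a toy sketch (paths and stage names here are invented for illustration): pipeline stages read and write through a file-system-like interface, but intermediate outputs live in memory rather than on disk, so no stage blocks on disk I/O between steps.

```python
class InMemoryFS:
    """Toy in-memory 'file system': intermediate stage outputs are held as
    bytes in a dict instead of being written to disk between stages."""

    def __init__(self):
        self.files = {}

    def write(self, path, data):
        self.files[path] = data

    def read(self, path):
        return self.files[path]

def stage_extract(fs):
    # Stage 1: produce raw records (hypothetical path)
    fs.write("/mem/stage1.out", b"1,2,3,4")

def stage_transform(fs):
    # Stage 2: consume the previous stage's output directly from memory
    nums = [int(x) for x in fs.read("/mem/stage1.out").split(b",")]
    fs.write("/mem/stage2.out", str(sum(nums)).encode())

fs = InMemoryFS()
stage_extract(fs)
stage_transform(fs)
print(fs.read("/mem/stage2.out"))  # b'10'
```

In a disk-backed pipeline the two `write`/`read` pairs would each traverse the storage stack; routing them through a memory-resident namespace is what removes the inter-stage chokepoint described above.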

Proof Points

Shown below are the results of Spark and Hadoop tests run with and without Cancun Software-Defined Memory Lake (SDML) technology.

1. Real Customer ETL Scenario

1.1 Performance

The customer scenario was to JOIN 1B+ rows, spread across two data files, and then use the joined file for downstream analysis. The test was run in two scenarios: one without Cancun technology and a second with Cancun technology. The run time on an infrastructure consisting of 8 nodes without Cancun was 14 minutes, while the run time with Cancun was 1.3 minutes, a more than 10x improvement in performance.
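For readers unfamiliar with the workload shape, the customer scenario above is an equi-join of two datasets on a shared key. At toy scale (the field names below are invented; the real test joined 1B+ rows), a hash join looks like this:

```python
from collections import defaultdict

def hash_join(left, right, key):
    """Equi-join two lists of row dicts on `key`: build a hash index over the
    right side, then probe it once per left row."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    joined = []
    for lrow in left:
        for rrow in index.get(lrow[key], []):
            joined.append({**lrow, **rrow})   # merge matching rows
    return joined

# Hypothetical miniature stand-ins for the two customer data files
orders = [{"cust_id": 1, "amount": 40}, {"cust_id": 2, "amount": 75}]
customers = [{"cust_id": 1, "name": "Acme"}, {"cust_id": 2, "name": "Beta"}]
result = hash_join(orders, customers, "cust_id")
print(len(result))  # 2
```

At billion-row scale a framework like Spark distributes exactly this build-and-probe pattern across executors, which is why the join is dominated by memory and shuffle I/O, the bottleneck Cancun targets.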

1.2 Infrastructure Efficiency

Efficiency testing on this scenario also demonstrated significant improvements. The run time on an infrastructure consisting of 8 nodes without Cancun was 14 minutes, while the run time on an infrastructure consisting of only 4 nodes with Cancun was 2.3 minutes. In essence, the Cancun environment dramatically cut run time while using only a fraction of the existing infrastructure footprint.

2. HiBench Spark TeraSort Benchmark

TeraSort is a popular benchmark that measures the time taken to sort large, randomly distributed data on a given computer system. The following tests show the benchmark's performance results in different environments. Characteristics of the test bed:

Software: Apache Hadoop
Nodes: 1 master + 4 worker nodes
Dataset Size: 320GB
EC2 Instances: M4.4xlarge (64GB RAM, 16 vCPUs per instance)
Number of executors per node: 8

2.1 Amazon EMR

On an EMR cluster, the Teragen run without Cancun completed in 7 minutes, while the run with Cancun finished in 2m 46s, a more than 2.5x improvement in performance. The Spark TeraSort test was then run both without and with Cancun on a 320GB dataset. The run time on an infrastructure consisting of 5 nodes without Cancun was 20 minutes, while the run time with Cancun was 9 minutes, a more than 2x improvement in performance.
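The speedup figures quoted throughout these proof points are simple run-time ratios. A small helper (the function names are ours, not part of HiBench) reproduces the EMR numbers and can be reused for the other environments:

```python
def to_seconds(minutes=0, seconds=0):
    """Convert a 'Xm Ys' run time to seconds."""
    return minutes * 60 + seconds

def speedup(baseline_s, accelerated_s):
    """Speedup factor: baseline run time divided by accelerated run time."""
    return baseline_s / accelerated_s

# Teragen on EMR: 7 minutes without Cancun vs 2m 46s with Cancun
print(round(speedup(to_seconds(7), to_seconds(2, 46)), 2))  # 2.53, i.e. "more than 2.5x"

# Spark TeraSort on EMR: 20 minutes vs 9 minutes
print(round(speedup(to_seconds(20), to_seconds(9)), 2))     # 2.22, i.e. "more than 2x"
```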

2.2 Amazon EC2 + HDFS on EBS

On an EC2 cluster with HDFS storage, the Teragen run without Cancun completed in 52 minutes, while the run with Cancun finished in 3m 16s, a more than 15x improvement in performance. The HiBench Spark TeraSort test was then run both without and with Cancun on a 320GB dataset. The run time on an infrastructure consisting of 5 nodes without Cancun was slightly over 2 hours, while the run time with Cancun was 9m 35s, a more than 12x improvement in performance.

2.3 Amazon EC2 + S3

On an EC2 cluster with S3 storage, the Teragen run without Cancun completed in 26 minutes, while the run with Cancun finished in 2m 34s, a more than 10x improvement in performance. The HiBench Spark TeraSort test was then run both without and with Cancun on a 320GB dataset. The run time on an infrastructure consisting of 5 nodes without Cancun was 33 minutes, while the run time with Cancun was 9 minutes, a more than 3.6x improvement in performance.

3. TestDFSIO Benchmark

TestDFSIO is a standard benchmark run on a cluster to test I/O performance to and from HDFS. The following tests show the benchmark's performance results in different environments. Characteristics of the test bed:

Software: Apache Hadoop
Nodes: 1 master + 4 worker nodes
Dataset Size: 800GB
EC2 Instances: M4.4xlarge (64GB RAM, 16 vCPUs per instance)
Number of executors per node: 2

3.1 Amazon EMR

On an EMR cluster, the Hadoop TestDFSIO test was run in several scenarios, without and with Cancun. For a DFSIO write test on an infrastructure consisting of 5 nodes with an 800GB dataset, the run time without Cancun was 15 minutes, while the run time with Cancun was 1m 15s, a more than 12x improvement in performance. For reads, the run times were 28 minutes and 1m 34s, respectively, demonstrating a nearly 18x improvement when using the Cancun MemoryLake™ platform.
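Since TestDFSIO is fundamentally a throughput benchmark, the run times above also imply an aggregate bandwidth figure. The back-of-the-envelope conversion below assumes the full 800GB dataset is moved in each run and uses 1 GB = 1024 MB; TestDFSIO's own reported per-file throughput may differ.

```python
def throughput_mb_s(dataset_gb, runtime_s):
    """Aggregate throughput implied by moving `dataset_gb` in `runtime_s`, in MB/s."""
    return dataset_gb * 1024 / runtime_s

# 800GB DFSIO write on EMR: 15 minutes without Cancun vs 1m 15s with Cancun
print(round(throughput_mb_s(800, 15 * 60)))  # 910 MB/s
print(round(throughput_mb_s(800, 75)))       # 10923 MB/s, roughly 10.7 GB/s
```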

3.2 Amazon EC2 + HDFS on EBS

On an EC2 cluster with HDFS storage, the Hadoop TestDFSIO test was run in several scenarios, without and with Cancun. For a DFSIO write test on an infrastructure consisting of 5 nodes with an 800GB dataset, the run time without Cancun was 2h 7m, while the run time with Cancun was 1m 16s, a more than 98x improvement in performance. For reads, the run times were 14h 33m and 1m 54s, respectively, demonstrating a more than 452x improvement when using the Cancun MemoryLake™ platform.

3.3 Amazon EC2 + S3

On an EC2 cluster with S3 storage, the Hadoop TestDFSIO test was run in several scenarios, without and with Cancun. For a DFSIO write test on an infrastructure consisting of 5 nodes with an 800GB dataset, the run time without Cancun was 32m 10s, while the run time with Cancun was 1m 17s, a more than 25x improvement in performance. For reads, the run times were 12m 40s and 1m 54s, respectively, demonstrating a more than 6.6x improvement when using the Cancun MemoryLake™ platform.

Conclusion

Cancun Systems MemoryLake™ is a transparent software-defined memory lake layer that delivers memory-speed access to data. Optimized for big data workflows, Cancun MemoryLake™ solves the inefficiency of memory management and data I/O in today's big data infrastructures. The innovative Cancun platform allows big data jobs to take advantage of fast, in-memory resources to deliver stunning performance, enabling organizations to improve decision making, minimize risk, and increase profits.

For more information or to request a demo, visit us at www.cancunsystems.net