VIRT1351BE: New Architectures for Virtualizing Spark and Big Data Workloads on vSphere
Justin Murray, Mohan Potheri
VMworld 2017 Content: Not for publication
#VMworld #VIRT1351BE

Disclaimer
This presentation may contain product features that are currently under development. This overview of new technology represents no commitment from VMware to deliver these features in any generally available product. Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind. Technical feasibility and market demand will affect final delivery. Pricing and packaging for any new technologies or features discussed or presented have not been determined.

Agenda
1. Introductions
2. Existing and New Approaches in the Big Data World
3. Traditional Deployment Reference Architectures
4. New Architectures Changing the Paradigm
5. Proof of Concept: Testing in the VMware Solutions Lab
6. Introduction to Machine Learning
7. Conclusions

Why the Interest in Big Data?
- Enterprises want to get off existing costly data platforms
- Older data warehouse technology is not serving your needs
- Want to do queries and analytics against many different forms of data (structured, unstructured, streaming)
- Provide data access to our end customers
- Integrate systems that have been islands until now
- Single source of truth for the enterprise
- Exploit new application architectures for developer productivity
- Want to do data science, machine learning, deep learning

The Existing Hadoop Architecture
[Diagram: a client submits a job to the ResourceManager (master scheduler); the NameNode is the master file system index. Worker Nodes 1 to 3 each run a NodeManager and a DataNode; the workers host AppMaster-1, Container-2 and Container-3, and store HDFS Blocks 1 to 3.]

High Level View of Spark
[Figure]

The Spark Architecture (Standalone)
[Diagram: a Driver submits a job to Worker Nodes 1 to 3; each worker node runs two Executor JVMs.]

The Spark Architecture (on YARN)
[Diagram: a job is submitted to a cluster with a NameNode and a ResourceManager. Worker Nodes 1 to 3 each run a NodeManager and a DataNode; the workers host AppMaster-1 (with the Driver) and Executors in Container-2 and Container-3, and store HDFS Blocks 1 to 3.]

Traditional Reference Architectures

Two Virtual Machines on a Host Server
[Diagram: one vSphere host server runs two virtual machines, Hadoop Node 1 and Hadoop Node 2. Each VM runs a NodeManager and a DataNode over six Ext4 file systems, each backed by a VMDK. Local DAS disks/devices are allocated to a virtual machine.]

Data/Compute Separation (with External Access to HDFS)
[Diagram: one virtualization host runs Hadoop Virtual Node 1 (ResourceManager) and Hadoop Virtual Nodes 2 and 3 (NodeManagers). Each VM has an OS image VMDK and Ext4 temp space on VMDKs. HDFS requests go to external Isilon data nodes, each of which presents a NameNode (NN).]

Concerns with HDFS (the Hadoop Distributed File System)
- Difficult to separate compute from data storage concerns
- Three-way block replication for each 256 MB data block (or 512 MB block)
- Triples the input data size, at least, to achieve safety
- Re-balance of data when you add new data node processes
- Data must be ingested into HDFS from legacy systems (can be time consuming)
- Site-to-site replication not inherent
- NameNode process (which holds the central index of all files) can be sensitive to higher numbers of small files
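The replication overhead listed above is simple arithmetic. A minimal sketch, assuming a 5 TB input (the data set size quoted later in this deck), counted in binary units so the numbers come out round, with a replication factor of 3 and 256 MB blocks. The function names are illustrative and are not part of any Hadoop API.

```python
TIB = 1024 ** 4  # assumption: binary units, chosen only for round numbers
MIB = 1024 ** 2

def hdfs_raw_bytes(logical_bytes: int, replication: int = 3) -> int:
    """Raw disk space consumed when every block is stored `replication` times."""
    return logical_bytes * replication

def hdfs_block_count(logical_bytes: int, block_bytes: int) -> int:
    """Number of distinct blocks (before replication) for one large file."""
    # Integer ceiling division: a partial final block still counts as a block.
    return (logical_bytes + block_bytes - 1) // block_bytes

logical = 5 * TIB
print(hdfs_raw_bytes(logical) // TIB)        # 15 (TiB of raw capacity)
print(hdfs_block_count(logical, 256 * MIB))  # 20480 blocks
```

The same ceiling division explains the small-file concern: a file far smaller than the block size still needs its own entries in the NameNode's index, so a million tiny files cost a million block entries regardless of their total size.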

Developers and Data Scientists
- Work on their code or on their data analysis model
- Don't need a multi-tenant cluster
- Don't care about job scheduling for other users
- Want to scale out to see the effect on their work
- Want to use the latest tools and newer versions (Python, R, Scala, ML kits)
- Experiment with different data models, code, algorithms, data sets
- Training the analysis model is separated from testing it; they are interested in the time taken for each
- May not need the full Hadoop cluster set

New Architectures for Big Data

Key Trends in Big Data Infrastructure
- Decoupling of compute and storage clusters
- Separate compute virtual machines from storage VMs
- Data is processed and scaled independently of compute
- Dynamic scaling of compute nodes used for analysis, from dozens to hundreds
- Spark and other newer Big Data platforms can work with regular filesystems
- Newer platforms store and process data in memory
- New platforms can leverage distributed filesystems that can use local or shared storage
- Need for High Availability & Fault Tolerance for master components

Apache Spark Platform Capabilities
- Open-source cluster computing framework
- In-memory data processing engine
- ETL, analytics, ML and graph processing
- Batch and stream processing
- Rich APIs for Scala, Python, Java, R, and SQL
- Distributed platform for complex multi-stage applications
Reference: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-overview.html

HDFS Replacement Needed for the Next-Generation Distributed File System
- What candidates present themselves? S3, Ceph, Gluster, etc.
- GlusterFS used in POC:
  - Mature solution
  - Native GlusterFS filesystem for Linux
  - Layers on top of any traditional storage
  - Truly distributed and resilient file system
  - Supports many common client protocols

GlusterFS
- GlusterFS is a scale-out distributed filesystem that can support thousands of clients
- The file system can run on DAS or shared storage
- Fault-tolerant distributed file system
- Provides multiprotocol support: Native, NFS, CIFS, HDFS, S3, FTP
Reference: https://www.slideshare.net/shubhendutripathi040980/glusterfs-hadoop

HDFS vs Ceph vs Gluster: IOZONE Performance Comparison
[Chart] Source: http://iopscience.iop.org/article/10.1088/1742-6596/513/4/042014/pdf

Spark with GlusterFS POC Architecture on Pure FC SAN
[Diagram: four VMware vSphere hosts run one Spark Master, eight Spark Workers and three Gluster Nodes; the Gluster Nodes form a GlusterFS layer backed by Pure M50 storage on Fibre Channel.]

Spark with GlusterFS POC Architecture on Virtual SAN
[Diagram: four VMware vSphere + vSAN hosts run one Spark Master, eight Spark Workers and three Gluster Nodes; the Gluster Nodes form a GlusterFS layer backed by a clustered vSAN datastore.]

TPC-DS on Spark on GlusterFS

TPC-DS with Spark-SQL and Apache Spark
- IBM has helped integrate the TPC-DS benchmark (v2) into the spark-sql-perf test kit
- The 99 queries were generated using the TPC-DS query generator and are based on the 100 GB scale factor
- The spark-sql-perf test kit can be used to evaluate and compare the infrastructure for its performance
- We leveraged a subset of TPC-DS queries to evaluate our POC and solution

Test Setup
- Spark nodes: 1 master and 8 worker nodes with 16 vCPU and 128 GB each
- 3-node GlusterFS cluster with a 2 TB shared filesystem mounted across all Spark nodes
- Storage (two use cases):
  1. GlusterFS backed by Pure Storage LUNs (16 Gbps FC fabric with Pure M50 array)
  2. GlusterFS backed by vSAN (Western Digital NVMe cache, high-capacity flash for persistence)
- TPC-DS data set: 5 TB
- Queries: interactive TPC-DS query set (q19, q42, q52, q55, q63, q68, q73 and q98)
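As a rough sanity check on the setup above, the aggregate worker resources can be compared with the data set size. This is only a sketch: it assumes 5 TB means 5,000 GB, it ignores how much memory was actually granted to Spark executors (the slides do not say), and the variable names are illustrative.

```python
# Figures taken from the test setup slide.
WORKERS = 8
VCPU_PER_WORKER = 16
MEM_GB_PER_WORKER = 128
DATASET_GB = 5 * 1000  # assumption: 5 TB counted in decimal gigabytes

total_vcpu = WORKERS * VCPU_PER_WORKER       # 128 vCPUs across the workers
total_mem_gb = WORKERS * MEM_GB_PER_WORKER   # 1024 GB across the workers
data_to_memory = DATASET_GB / total_mem_gb   # raw data size relative to memory

print(total_vcpu, total_mem_gb, round(data_to_memory, 2))
```

Under these assumptions the raw data set is several times larger than the combined worker memory, which is one reason the performance of the shared file system underneath still matters for a largely memory-resident engine.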

Apache Spark Web Console
[Figure]

Spark Job Details
[Figure]

TPC-DS Test Results (5 TB Data Set)
[Chart: Query Time Comparison between FC SAN and vSAN. Series: Pure, vSAN. Queries: q19, q42, q52, q55, q63, q68, q73, q98. Vertical axis from 0 to 3.]

TPC-DS (vSAN on Premises versus VMware Cloud on AWS)
[Chart: TPC-DS On Premises vs VMware Cloud on AWS. Series: On-Prem, VMware Cloud on AWS. Queries: q19, q42, q52, q55, q63, q68, q73, q98. Vertical axis from 0 to 3.5.]

Demo

Section Conclusion
- Modern Big Data platforms like Spark are mostly memory resident
- GlusterFS provides a high-performance distributed filesystem for Spark and newer big data workloads
- GlusterFS supports a wide range of protocols that make it the ideal storage platform for data lakes
- Layering GlusterFS on top of shared storage or vSAN helps leverage all the vSphere platform features
- Dedicated hardware with local storage is no longer required for modern big data applications
- TPC-DS testing showed similar performance for Spark SQL on vSAN and FC

Introduction to Machine Learning


What Is Machine Learning?
[Diagram: big training data (samples from history) is used to train several mathematical models; a new sample (transaction data) is used for testing; the output is a classification or prediction.]
- Machine Learning algorithms try to make predictions based on training data that is given to a mathematical model (e.g. a linear regression algorithm)
- Find the minimum of the difference between the model's prediction and the already known outcomes (minimize the loss or objective function)
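To make "minimize the loss" concrete, here is a minimal sketch of fitting a one-variable linear regression by gradient descent on mean squared error. The data points, learning rate and step count are made-up illustrations, not values from the presentation, and a real workload would use a library such as Spark MLlib rather than hand-written loops.

```python
def mse(xs, ys, w, b):
    """Mean squared error between predictions w*x + b and known outcomes."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y ~ w*x + b by repeatedly stepping against the loss gradient."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical training data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3), mse(xs, ys, w, b))
```

The loss starts at its value for w = 0, b = 0 and shrinks toward zero as the parameters approach the line that generated the data.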

Example: Machine Learning Model for a Customer Applying for Credit
[Diagram: big training data and a new application for credit are fed to several mathematical models, which produce a classification or prediction.]
- Training data contains many features that have each been given a numeric value (e.g. zip code = 99)
- Several models are used against the training data and the best one is chosen (minimal loss or error)
- One kind of outcome is a binary classification (a good credit application or a bad one)

Training Data
Training data examples x_i. The columns are features (feature variables). "Passed Valid Check" is the known outcome, "Model's Estimate as Valid" is computed/learned, and "Error (Loss)" records whether the two disagree.
Acct Number | Txn ID | Txn Location Code | Age | Home Zip Code | Balance | Annual Salary | Passed Valid Check | Model's Estimate as Valid | Error (Loss)
1234        | 45     | 94312             | 21  | 94304         | 100     | 80            | Y                  | N                         | 1
5678        | 89     | UK                | 31  | 12116         | 5000    | 110           | N                  | Y                         | 1
9012        | 150    | 12126             | 61  | 31024         | 1400    | 50            | Y                  | Y                         | 0

Test Data Should Always Be Separated from Training Data
[Table: the same example rows and columns as on the previous slide (accounts 1234, 5678 and 9012), now divided into training data examples x_i and test data.]
GOLDEN RULE: Don't TEST on your TRAINING DATA
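One common way to follow the golden rule is to shuffle the examples once and hold a fraction out before any training happens. The sketch below uses only the Python standard library; the function name, the 25% test fraction and the fixed seed are illustrative choices, not something the presentation prescribes.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(rows)          # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical example: 20 record identifiers.
records = list(range(20))
train, test = train_test_split(records)
print(len(train), len(test))  # 15 5
```

Because the two lists are slices of the same shuffled copy, no record can appear in both, which is exactly the property the slide asks for.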

Example: A Linear Classifier
$f(x_i, W, b) = W x_i + b$
where $x_i$ is the example data, $W$ the weights, and $b$ the bias.
Source: Stanford University class cs231n
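The scoring function above can be sketched in a few lines of plain Python. Each row of W holds the weights for one class, so W x + b yields one score per class and the predicted class is the one with the highest score. The weights, bias and input below are made-up numbers for illustration only.

```python
def linear_scores(W, x, b):
    """Compute f(x, W, b) = W x + b, returning one score per row of W."""
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(W, b)
    ]

def predict(W, x, b):
    """Return the index of the class with the highest score."""
    scores = linear_scores(W, x, b)
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical two-class, two-feature example.
W = [[1.0, -1.0],
     [-1.0, 1.0]]
b = [0.0, 0.5]
x = [2.0, 1.0]
print(linear_scores(W, x, b))  # [1.0, -0.5]
print(predict(W, x, b))        # 0
```

Training consists of adjusting W and b so that the scores for the known examples produce as little loss as possible.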

Deployment Platform for Machine Learning
[Diagram: big training data and a new application for credit are fed, through Spark, to several mathematical models, which produce a classification or prediction.]
- Spark is the runtime platform for the models and for ingestion of the training data
- Different Machine Learning algorithms are available from the MLlib library that comes with Spark
- Application and data are distributed out to many nodes (virtual machines)

Introducing vSphere Scale-Out for Big Data and HPC Workloads
New package that provides all the core features required for scale-out workloads at an attractive price point
- Features: Hypervisor, vMotion, vShield Endpoint, Storage vMotion, Storage APIs, Distributed Switch, I/O Controls & SR-IOV, Host Profiles / Auto Deploy and more
- Packaging: sold in packs of 8 CPUs at a cost-effective price point
- Licensing: EULA enforced for use with Big Data/HPC workloads only

Conclusions
- New architectures for big data are emerging beyond the existing documented ones
- Spark changes the profile of I/O and persistence for the newer applications
- This lends itself well to virtualization and separation of compute from data
- Traditional values in vSphere can be used in a big data context
- We would like to explore how these new architectural ideas will fit in your environment

jmurray@vmware.com bigdata@vmware.com

BACKUP SLIDES NOT FOR PRESENTATION

Placeholder: Key Requirements for Big Data Architecture
- Performance
- Scaling to dozens or hundreds of nodes (VMs)
- Robustness: distributed file system, no one process is a single point of failure
- High Availability
- Fault Tolerance
- Capable of handling new workloads with new compute demands

Placeholder: Key Requirements for Big Data Architecture
- Can we use a distributed file system that is not HDFS?
- Can we use a lighter-weight framework than full Hadoop, e.g. Spark?
- Can we keep as much data in memory as possible and avoid I/O? Avoid spills
- Are shared file systems like vSAN useful?
- How do we achieve the performance requirements without losing functionality?

vSAN Optimization

Hardware Configuration
All-Flash vSAN
- (4) Node Dell R730XD
- (2) E5-2699V4 22-core 2.2GHz
- 1TB Memory
- (4) 10 Gb/s Ethernet connections
- PERC H730mini
- SDCard System Drive
- vSphere 6.5 Update 1
vSAN disk configuration
- (2) Disk groups per node
- (1) 1.6TB* Ultrastar SN100 cache drive
- (2) 3.84TB Optimus MAX capacity drive
* 1TB=1,000GB, 1GB=1,000,000,000 bytes. Actual usable capacity less.
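The raw flash capacity of this cluster follows directly from the counts above. The sketch below assumes that the cache and capacity drive counts are per disk group, which is how vSAN disk groups are normally described but is not stated explicitly on the slide. It reports raw capacity only; usable capacity depends on the storage policy and is lower, as the slide's footnote says.

```python
# Counts from the hardware configuration slide.
NODES = 4
DISK_GROUPS_PER_NODE = 2
CACHE_DRIVES_PER_GROUP = 1       # assumption: per disk group
CAPACITY_DRIVES_PER_GROUP = 2    # assumption: per disk group
CACHE_DRIVE_TB = 1.6             # decimal terabytes, per the slide's footnote
CAPACITY_DRIVE_TB = 3.84

disk_groups = NODES * DISK_GROUPS_PER_NODE
raw_cache_tb = disk_groups * CACHE_DRIVES_PER_GROUP * CACHE_DRIVE_TB
raw_capacity_tb = disk_groups * CAPACITY_DRIVES_PER_GROUP * CAPACITY_DRIVE_TB

print(disk_groups, round(raw_cache_tb, 2), round(raw_capacity_tb, 2))
```

In vSAN the cache tier does not add to the datastore's capacity, so only the capacity-drive total is relevant when sizing the data set that the cluster can hold.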

vSAN Disk Group Configuration
[Figure]

vSAN Network
Dual vSAN VMkernel adapters
[Diagram: two port groups on a virtual switch.]
These are not necessarily for redundancy (like an air-gap network with redundant physical interfaces routed to multiple VMKs) but for performance, to pull from two physical interfaces at once.

vSAN VMK Configuration
[Figure]

vSAN Port Group Uplink Maps
- The vDS contained 4 uplinks: 2 dedicated to normal operation, 2 dedicated to vSAN communication
- vds-comp01-private: active uplink dvuplink3, standby uplink dvuplink4
- vds-comp01-private2: active uplink dvuplink4, standby uplink dvuplink3
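The uplink map above is a mirrored active/standby arrangement: each port group is active on the uplink the other one keeps as standby, so both physical links carry vSAN traffic in normal operation and either can take over for the other. The sketch below only models that mapping as plain data to make the symmetry explicit; it is not vSphere API code, and the helper name is illustrative.

```python
# Port group names and uplinks as given on the slide.
uplink_map = {
    "vds-comp01-private":  {"active": "dvuplink3", "standby": "dvuplink4"},
    "vds-comp01-private2": {"active": "dvuplink4", "standby": "dvuplink3"},
}

def is_mirrored(a, b):
    """True when each port group's active uplink is the other's standby."""
    return a["active"] == b["standby"] and a["standby"] == b["active"]

first = uplink_map["vds-comp01-private"]
second = uplink_map["vds-comp01-private2"]
print(is_mirrored(first, second))  # True
```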

HCIBench Results: Network, 100% Read IOPS and Latency
[Chart: IOPS (axis from 0 to 700,000) and latency in ms (axis from 0 to 4) by block size (4K, 8K, 32K, 64K) on vSAN 6.6.1. Series: Baseline, Multiple vSAN VMK, 1500 MTU, 10Gb Ethernet, 10Gb Eth Multiple vSAN VMK, each with a matching latency series.]

What Have We Seen So Far?
- We can use a file system other than HDFS for big data
- With the right storage, we can use the vMotion/DRS/HA/FT features of vSphere
- vSAN can provide the storage underpinning big data (particularly for newer workloads)
- A number of different workloads were exercised on this new architecture: analytical queries, batch jobs and machine learning
- Testing is still in progress on all of the above; more to come