VIRT1351BE
New Architectures for Virtualizing Spark and Big Data Workloads on vSphere
Justin Murray, Mohan Potheri
VMworld 2017 Content: Not for publication
#VMworld #VIRT1351BE

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitment from VMware to deliver these features in any generally available product. Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind. Technical feasibility and market demand will affect final delivery. Pricing and packaging for any new technologies or features discussed or presented have not been determined.

Agenda
1. Introductions
2. Existing and New Approaches in the Big Data World
3. Traditional Deployment Reference Architectures
4. New Architectures Changing the Paradigm
5. Proof of Concept: Testing in the VMware Solutions Lab
6. Introduction to Machine Learning
7. Conclusions

Why the Interest in Big Data?
- Enterprises want to get off existing costly data platforms
- Older data warehouse technology is not serving your needs
- Want to do queries and analytics against many different forms of data (structured, unstructured, streaming)
- Provide data access to our end customers
- Integrate systems that have been islands till now
- Single source of truth for the enterprise
- Exploit new application architectures for developer productivity
- Want to do data science, machine learning, deep learning

The Existing Hadoop Architecture
[Diagram: a client submits a job to the ResourceManager (master scheduler); the NameNode is the master file system index. Three worker nodes each run a NodeManager and a DataNode. The workers host AppMaster 1, Container 2 and Container 3, and store HDFS blocks 1, 2 and 3.]

High Level View of Spark

The Spark Architecture: Standalone
[Diagram: a driver submits a job to three worker nodes, which together run six executor JVMs.]

The Spark Architecture (on YARN)
[Diagram: a job is submitted to the ResourceManager, with the NameNode alongside. Three worker nodes each run a NodeManager and a DataNode. The workers host AppMaster 1, Container 2 and Container 3, which hold the Spark driver and two executors, and store HDFS blocks 1, 2 and 3.]

Traditional Reference Architectures

Two Virtual Machines on a Host Server
[Diagram: one vSphere host server runs two Hadoop node virtual machines. Each VM runs a NodeManager and a DataNode over six ext4 file systems, with twelve VMDKs in total across the two VMs. Local DAS disks/devices are allocated to a virtual machine.]

Data/Compute Separation (with External Access to HDFS)
[Diagram: a virtualization host runs three Hadoop virtual nodes: node 1 with the ResourceManager, nodes 2 and 3 with a NodeManager each. Every node has an OS image VMDK plus ext4 file systems on VMDKs for temp data. HDFS requests leave the host and go to external Isilon nodes, labeled NN and data node.]

Concerns with HDFS (The Hadoop Distributed File System)
- Difficult to separate compute from data storage concerns
- Three-way block replication for each 256 MB data block (or 512 MB block)
- Triples the input data size, at least, to achieve safety
- Re-balance of data when you add new data node processes
- Data must be ingested into HDFS from legacy systems (can be time consuming)
- Site-to-site replication not inherent
- NameNode process (which holds the central index of all files) can be sensitive to higher numbers of small files
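
The cost of three-way replication can be made concrete with a little arithmetic. The sketch below is illustrative only: the helper name `hdfs_footprint` and the 10 TiB input size are assumptions made for the example, while the replication factor of 3 and the 256 MB block size come from the slide. It ignores metadata and temp space.

```python
import math

def hdfs_footprint(input_bytes, block_bytes=256 * 1024**2, replication=3):
    """Return (logical_blocks, stored_replicas, raw_bytes) for a data set.

    Hypothetical helper for illustration; not part of any Hadoop API.
    """
    logical_blocks = math.ceil(input_bytes / block_bytes)  # blocks indexed by the NameNode
    stored_replicas = logical_blocks * replication         # block copies held on DataNodes
    raw_bytes = input_bytes * replication                  # raw disk consumed
    return logical_blocks, stored_replicas, raw_bytes

# Assumed example input: 10 TiB of source data.
blocks, replicas, raw = hdfs_footprint(10 * 1024**4)
print(blocks, replicas, raw / 1024**4)
```

Under these assumptions, 10 TiB of input needs 30 TiB of raw disk before any headroom is added.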

Developers and Data Scientists
- Work on their code or on their data analysis model
- Don't need a multi-tenant cluster
- Don't care about job scheduling for other users
- Want to scale out to see the effect on their work
- Want to use the latest tools and newer versions (Python, R, Scala, ML kits)
- Experiment with different data models, code, algorithms, data sets
- Training the analysis model is separated from testing it; they are interested in the time taken for each
- May not need the full Hadoop cluster set

New Architectures for Big Data

Key Trends in Big Data Infrastructure
- Decoupling of compute and storage clusters
- Separate compute virtual machines from storage VMs
- Data is processed and scaled independently of compute
- Dynamic scaling of compute nodes used for analysis, from dozens to hundreds
- Spark and other newer big data platforms can work with regular filesystems
- Newer platforms store and process data in memory
- New platforms can leverage distributed filesystems that can use local or shared storage
- Need for high availability and fault tolerance for master components

Apache Spark Platform Capabilities
- Open-source cluster computing framework
- In-memory data processing engine
- ETL, analytics, ML and graph processing
- Batch and streams processing
- Rich APIs for Scala, Python, Java, R, and SQL
- Distributed platform for complex multi-stage applications
Reference: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-overview.html

HDFS Replacement Needed for the Next-Generation Distributed File System
- What candidates present themselves? S3, Ceph, Gluster, etc.
- GlusterFS used in POC:
  - Mature solution
  - Native GlusterFS filesystem for Linux
  - Layers on top of any traditional storage
  - Truly distributed and resilient file system
  - Supports many common client protocols

GlusterFS
- GlusterFS is a scale-out distributed filesystem that can support thousands of clients
- The file system can run on DAS or shared storage
- Fault-tolerant distributed file system
- Provides multiprotocol support: Native, NFS, CIFS, HDFS, S3, FTP
https://www.slideshare.net/shubhendutripathi040980/glusterfs-hadoop

HDFS vs Ceph vs Gluster: IOZONE Performance Comparison
Source: http://iopscience.iop.org/article/10.1088/1742-6596/513/4/042014/pdf

Spark with GlusterFS POC Architecture on Pure FC SAN
[Diagram: one Spark master, eight Spark workers and three Gluster nodes (forming GlusterFS) spread across four VMware vSphere hosts, backed by Pure M50 storage on Fibre Channel.]

Spark with GlusterFS POC Architecture on Virtual SAN
[Diagram: one Spark master, eight Spark workers and three Gluster nodes (forming GlusterFS) spread across four VMware vSphere + vSAN hosts, backed by a clustered vSAN datastore.]

TPC-DS on SPARK on GlusterFS

TPC-DS with Spark-SQL and Apache Spark
- IBM has helped integrate the TPC-DS benchmark (v2) into the spark-sql-perf test kit
- The 99 queries were generated using the TPC-DS query generator and are based on the 100-GB scale factor
- The spark-sql-perf test kit can be used to evaluate and compare the performance of infrastructure
- We used a subset of the TPC-DS queries to evaluate our POC and solution

Test Setup
- Spark nodes: 1 master and 8 slave (worker) nodes with 16 vCPUs and 128 GB each
- 3-node GlusterFS cluster with a 2 TB shared filesystem mounted across all Spark nodes
- Storage (two use cases):
  1. GlusterFS backed by Pure Storage LUNs (16 Gbps FC fabric with Pure M50 array)
  2. GlusterFS backed by vSAN (Western Digital NVMe cache, high-capacity flash for persistence)
- TPC-DS data set: 5 TB
- Queries: interactive TPC-DS query set (q19, q42, q52, q55, q63, q68, q73 and q98)
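
As a rough sizing check on this setup, the aggregate worker resources can be compared with the data set size. This is a back-of-the-envelope sketch, not part of the test kit: it assumes the per-node figures apply to the eight worker nodes, that 128 GB is memory per node, and that 5 TB means decimal terabytes. The variable names are made up for the example.

```python
WORKERS = 8            # worker nodes, from the slide
VCPUS_PER_NODE = 16    # from the slide
MEM_GB_PER_NODE = 128  # from the slide, assumed to be memory per node
DATASET_GB = 5 * 1000  # 5 TB data set, assuming decimal terabytes

total_vcpus = WORKERS * VCPUS_PER_NODE
total_mem_gb = WORKERS * MEM_GB_PER_NODE
data_to_memory = DATASET_GB / total_mem_gb  # how many times the data exceeds worker memory

print(f"{total_vcpus} vCPUs, {total_mem_gb} GB memory, "
      f"data/memory ratio {data_to_memory:.1f}")
```

Under these assumptions the data set is several times larger than the combined worker memory, so the queries cannot be served from memory alone and the storage layer under GlusterFS still matters.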

Apache Spark Web Console

Spark Job Details

TPC-DS Test Results (5 TB Data Set)
[Chart: query time comparison between FC SAN (Pure) and vSAN for queries q19, q42, q52, q55, q63, q68, q73 and q98, on a vertical axis from 0 to 3. The plotted values are not recoverable from this text.]

TPC-DS (vSAN On Premises versus VMware Cloud on AWS)
[Chart: TPC-DS on premises vs VMware Cloud on AWS for queries q19, q42, q52, q55, q63, q68, q73 and q98, on a vertical axis from 0 to 3.5. The plotted values are not recoverable from this text.]

Demo

Section Conclusion
- Modern big data platforms like Spark are mostly memory resident
- GlusterFS provides a high-performance distributed filesystem for Spark and newer big data workloads
- GlusterFS supports a wide range of protocols that make it the ideal storage platform for data lakes
- Layering GlusterFS on top of shared storage or vSAN helps leverage all the vSphere platform features
- Dedicated hardware with local storage is no longer required for modern big data applications
- TPC-DS testing showed similar performance for Spark SQL on vSAN and FC

Introduction to Machine Learning

What Is Machine Learning?
[Diagram: training data (big; samples from history) is used to train several mathematical models. A new sample (transaction data) is then tested against the models to produce a classification or prediction.]
- Machine learning algorithms try to make predictions based on training data that is given to a mathematical model (e.g. a linear regression algorithm)
- Training finds the minimum of the difference between the model's predictions and the already known outcomes (it minimizes the loss or objective function)
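
To make "minimize the loss" concrete, here is a minimal sketch of fitting a one-variable linear regression by gradient descent on the mean squared error. It is an illustration only, not the method used in the talk: the data points, learning rate, step count and the function names `mse` and `fit` are all assumptions chosen for the example.

```python
def mse(w, b, xs, ys):
    """Mean squared error between predictions w*x + b and known outcomes."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit(xs, ys, lr=0.05, steps=2000):
    """Gradient descent on the MSE loss; returns the learned (w, b)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the MSE with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up training data that follows y = 2x + 1 exactly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit(xs, ys)
print(round(w, 3), round(b, 3), mse(w, b, xs, ys))
```

Each step moves the parameters a little in the direction that reduces the loss, which is the idea behind the "find the minimum" bullet.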

Example: Machine Learning Model for a Customer Applying for Credit
[Diagram: training data (big) feeds several mathematical models; a new application for credit is passed to the models, which produce a classification or prediction.]
- Training data contains many features that have each been given a numeric value (e.g. zip code = 99)
- Several models are used against the training data and the best one is chosen (minimal loss or error)
- One kind of outcome is a binary classification (a good credit application or a bad one)

Training Data Examples (x_i)

Acct Number  Txn ID  Txn Location Code  Age  Home Zip Code  Balance  Annual Salary  Passed Valid Check  Model's Estimate as Valid  Error (Loss)
1234         45      94312              21   94304          100      80             Y                   N                          1
5678         89      UK                 31   12116          5000     110            N                   Y                          1
9012         150     12126              61   31024          1400     50             Y                   Y                          0

The features (feature variables) and the Passed Valid Check outcome are knowns; the model's estimate and the error (loss) are computed/learned.

Test Data Should Always Be Separated from Training Data
[Table: the same example rows and columns as on the previous slide, now divided into training data examples (x_i) and test data.]
GOLDEN RULE: Don't TEST on your TRAINING DATA
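
The golden rule can be sketched as a simple holdout split. The helper below is hypothetical (the name `train_test_split`, the 25% test fraction and the fixed seed are assumptions for illustration); it shuffles the examples once and sets part of them aside so that no row appears in both sets.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) with no overlap."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in for 100 labeled examples, identified here by row number.
rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), len(test))
```

The model is then trained only on `train` and scored only on `test`, so the measured error reflects examples the model has not seen.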

Example: A Linear Classifier
f(x_i, W, b) = W x_i + b
- x: example data
- W: weights
- b: bias
Source: Stanford University class cs231n
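
The score function above can be sketched in a few lines of plain Python. The weights, bias and example vector below are made-up numbers for a two-class, three-feature case, and `linear_scores` is a hypothetical helper name; a real model would learn W and b from training data.

```python
def linear_scores(W, x, b):
    """Compute f(x, W, b) = W x + b: one score per class (one row of W per class)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

# Made-up parameters: 2 classes x 3 features.
W = [[1.0, -1.0, 0.0],
     [0.5,  0.5, 0.5]]
b = [0.0, -1.5]
x = [2.0, 1.0, 1.0]  # one example with three feature values

scores = linear_scores(W, x, b)
predicted = max(range(len(scores)), key=scores.__getitem__)  # index of the highest score
print(scores, predicted)
```

The predicted class is simply the row of W whose score is largest for the given example.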

Deployment Platform for Machine Learning
[Diagram: training data (big) and a new application for credit are fed to mathematical models running on Spark, which produce a classification or prediction.]
- Spark is the runtime platform for the models and for ingestion of the training data
- Different machine learning algorithms are available from the MLlib library that comes with Spark
- Application and data are distributed out to many nodes (virtual machines)

Introducing vSphere Scale-Out for Big Data and HPC Workloads
New package that provides all the core features required for scale-out workloads at an attractive price point
- Features: Hypervisor, vMotion, vShield Endpoint, Storage vMotion, Storage APIs, Distributed Switch, I/O Controls & SR-IOV, Host Profiles / Auto Deploy and more
- Packaging: sold in packs of 8 CPUs at a cost-effective price point
- Licensing: EULA enforced for use with Big Data/HPC workloads only

Conclusions
- New architectures for big data are emerging beyond the existing documented ones
- Spark changes the profile of I/O and persistence for the newer applications
- This lends itself well to virtualization and separation of compute from data
- Traditional values in vSphere can be used in a big data context
- We would like to explore how these new architectural ideas will fit in your environment

jmurray@vmware.com bigdata@vmware.com

BACKUP SLIDES NOT FOR PRESENTATION

Placeholder: Key Requirements for Big Data Architecture
- Performance
- Scaling to dozens or hundreds of nodes (VMs)
- Robustness: distributed file system, no one process is a single point of failure
- High availability
- Fault tolerance
- Capable of handling new workloads with new compute demands

Placeholder: Key Requirements for Big Data Architecture
- Can we use a distributed file system that is not HDFS?
- Use a lighter-weight framework than full Hadoop, e.g. Spark?
- Can we keep as much data in memory as possible and avoid I/O? Avoid spills
- Are shared file systems like vSAN useful?
- How to achieve the performance requirements without losing functionality?

vSAN Optimization

Hardware Configuration
All-Flash vSAN
- (4) Node Dell R730XD
- (2) E5-2699V4 22-core 2.2 GHz
- 1 TB memory
- (4) 10 Gb/s Ethernet connections
- PERC H730mini
- SD card system drive
- vSphere 6.5 Update 1
vSAN disk configuration
- (2) disk groups per node
- (1) 1.6 TB* Ultrastar SN100 cache drive
- (2) 3.84 TB Optimus MAX capacity drive
* 1TB=1,000GB, 1GB=1,000,000,000 bytes. Actual usable capacity less.

vSAN Disk Group Configuration

vSAN Network: Dual vSAN VMkernel Adapters
[Diagram: two port groups on one virtual switch.]
These are not necessarily for redundancy (like an air-gap network with redundant physical interfaces routed to multiple VMKs) but for performance, to pull from two physical interfaces at once.

vSAN VMK Configuration

vSAN Port Group Uplink Maps
- The vDS contained 4 uplinks: 2 dedicated to normal operation, 2 dedicated to vSAN communication
- vds-comp01-private: active uplink dvuplink3, standby uplink dvuplink4
- vds-comp01-private2: active uplink dvuplink4, standby uplink dvuplink3
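
The mirrored active/standby mapping can be written down as data to show what it buys. This is a simplified model for illustration, not vSphere configuration syntax: the dictionary layout and the helper `uplink_in_use` are assumptions, and only the port group and uplink names come from the slide.

```python
# Active/standby uplinks per vSAN port group, as listed on the slide.
PORT_GROUPS = {
    "vds-comp01-private":  {"active": "dvuplink3", "standby": "dvuplink4"},
    "vds-comp01-private2": {"active": "dvuplink4", "standby": "dvuplink3"},
}

def uplink_in_use(port_group, failed=frozenset()):
    """Return the uplink a port group would use, given a set of failed uplinks.

    Hypothetical helper: prefers the active uplink, falls back to the standby.
    """
    cfg = PORT_GROUPS[port_group]
    for role in ("active", "standby"):
        if cfg[role] not in failed:
            return cfg[role]
    return None

normal = {pg: uplink_in_use(pg) for pg in PORT_GROUPS}
degraded = {pg: uplink_in_use(pg, failed={"dvuplink3"}) for pg in PORT_GROUPS}
print(normal)
print(degraded)
```

In this model the two port groups use different physical uplinks in normal operation, and both share the surviving uplink if one fails.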

HCIBench Results: Network, 100% Read IOPS and Latency
[Chart: IOPS (left axis, 0 to 700,000) and latency in ms (right axis, 0 to 4) at block sizes 4K, 8K, 32K and 64K, for five configurations: vSAN 6.6.1 Baseline, Multiple vSAN VMK, 1500 MTU, 10Gb Ethernet, and 10Gb Eth Multiple vSAN VMK. The plotted values are not recoverable from this text.]

What Have We Seen So Far?
- We can use a file system other than HDFS for big data
- With the right storage, we can use the vMotion/DRS/HA/FT features of vSphere
- vSAN can provide the storage underpinning big data (particularly for newer workloads)
- A number of different workloads were exercised on this new architecture: analytical queries, batch jobs and machine learning
- Testing is still in progress on all of the above; more to come