CONTAINERIZED SPARK ON KUBERNETES. William Benton Red Hat,

Similar documents
Some things you learn running Apache Spark in production for three years. William Benton Red Hat, Inc.

Deploying Applications on DC/OS

COURSE 20487B: DEVELOPING WINDOWS AZURE AND WEB SERVICES

[MS20487]: Developing Windows Azure and Web Services

@unterstein #bedcon. Operating microservices with Apache Mesos and DC/OS

Processing of big data with Apache Spark

Oracle Big Data Fundamentals Ed 2

Cloud Computing & Visualization

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

Principal Software Engineer Red Hat Emerging Technology June 24, 2015

LOG AGGREGATION. To better manage your Red Hat footprint. Miguel Pérez Colino Strategic Design Team - ISBU

Mesosphere and Percona Server for MongoDB. Jeff Sandstrom, Product Manager (Percona) Ravi Yadav, Tech. Partnerships Lead (Mesosphere)

Mesosphere and Percona Server for MongoDB. Peter Schwaller, Senior Director Server Eng. (Percona) Taco Scargo, Senior Solution Engineer (Mesosphere)

Developing Windows Azure and Web Services

Achieving Horizontal Scalability. Alain Houf Sales Engineer

MESOS A State-Of-The-Art Container Orchestrator Mesosphere, Inc. All Rights Reserved. 1

YARN: A Resource Manager for Analytic Platform Tsuyoshi Ozawa

Building a Data-Friendly Platform for a Data- Driven Future

Impala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam

IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING MEDIUM DR. KONSTANTIN BOUDNIK DR. ALEXANDRE BOUDNIK

Developing Microsoft Azure and Web Services. Course Code: 20487C; Duration: 5 days; Instructor-led

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

IoT with Apache ActiveMQ, Camel and Spark

Distributed File Systems II

Modern Data Warehouse The New Approach to Azure BI

Containers, Serverless and Functions in a nutshell. Eugene Fedorenko

What s New in K8s 1.3

Networking & Security for Mesos

Employing HPC DEEP-EST for HEP Data Analysis. Viktor Khristenko (CERN, DEEP-EST), Maria Girone (CERN)

Hadoop. Introduction / Overview

Container Orchestration on Amazon Web Services. Arun

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018

An Introduction to Apache Spark Big Data Madison: 29 July William Red Hat, Inc.

VOLTDB + HP VERTICA. page

Microsoft Developing Windows Azure and Web Services

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

SCALING LIKE TWITTER WITH APACHE MESOS

How to Put Your AF Server into a Container

Hadoop, Yarn and Beyond

Container 2.0. Container: check! But what about persistent data, big data or fast data?!

Analytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation

Modernizing Business Intelligence and Analytics

Elastify Cloud-Native Spark Application with PMEM. Junping Du --- Chief Architect, Tencent Cloud Big Data Department Yue Li --- Cofounder, MemVerge

From Single Purpose to Multi Purpose Data Lakes. Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Lessons Learned: Deploying Microservices Software Product in Customer Environments Mark Galpin, Solution Architect, JFrog, Inc.

Storm. Distributed and fault-tolerant realtime computation. Nathan Marz Twitter

POWERING THE INTERNET WITH APACHE MESOS

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Service and Cloud Computing Lecture 10: DFS2 Prof. George Baciu PQ838

Data science for the datacenter: processing logs with Apache Spark. William Benton Red Hat Emerging Technology

Red Hat Roadmap for Containers and DevOps

Microservices with Red Hat. JBoss Fuse

Kontejneri u Azureu uz pomoć Kubernetesa što i kako? Tomislav Tipurić Partner Technology Strategist Microsoft

Making Data Integration Easy For Multiplatform Data Architectures With Diyotta 4.0. WEBINAR MAY 15 th, PM EST 10AM PST

Turning Relational Database Tables into Spark Data Sources

DATA SCIENCE USING SPARK: AN INTRODUCTION

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Agenda. Spark Platform Spark Core Spark Extensions Using Apache Spark

利用 Mesos 打造高延展性 Container 環境. Frank, Microsoft MTC

Spark Overview. Professor Sasu Tarkoma.

COSC 6339 Big Data Analytics. Introduction to Spark. Edgar Gabriel Fall What is SPARK?

Building a Microservices Platform with Kubernetes. Matthew Mark

Index. Raul Estrada and Isaac Ruiz 2016 R. Estrada and I. Ruiz, Big Data SMACK, DOI /

Przyspiesz tworzenie aplikacji przy pomocy Openshift Container Platform. Jarosław Stakuń Senior Solution Architect/Red Hat CEE

UNIFY DATA AT MEMORY SPEED. Haoyuan (HY) Li, Alluxio Inc. VAULT Conference 2017

Distributed CI: Scaling Jenkins on Mesos and Marathon. Roger Ignazio Puppet Labs, Inc. MesosCon 2015 Seattle, WA

Building/Running Distributed Systems with Apache Mesos

Cisco Container Platform

Page 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

Franck Mercier. Technical Solution Professional Data + AI Azure Databricks

SAMPLE CHAPTER IN ACTION. Roger Ignazio. FOREWORD BY Florian Leibert MANNING

TensorFlowOnSpark Scalable TensorFlow Learning on Spark Clusters Lee Yang, Andrew Feng Yahoo Big Data ML Platform Team

CLOUD-NATIVE APPLICATION DEVELOPMENT/ARCHITECTURE

MATLAB. Senior Application Engineer The MathWorks Korea The MathWorks, Inc. 2

Integrate MATLAB Analytics into Enterprise Applications

Big Data Security. Facing the challenge

YCSB++ benchmarking tool Performance debugging advanced features of scalable table stores

Earthdata Cloud Analytics Project

FIVE REASONS YOU SHOULD RUN CONTAINERS ON BARE METAL, NOT VMS

Overview of Container Management

Exploring Cloud Security, Operational Visibility & Elastic Datacenters. Kiran Mohandas Consulting Engineer

Shark: SQL and Rich Analytics at Scale. Reynold Xin UC Berkeley

cstore_fdw Columnar store for analytic workloads Hadi Moshayedi & Ben Redman

IBM Db2 Event Store Simplifying and Accelerating Storage and Analysis of Fast Data. IBM Db2 Event Store

Cloud-Native Applications. Copyright 2017 Pivotal Software, Inc. All rights Reserved. Version 1.0

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Best Practices for Deploying Hadoop Workloads on HCI Powered by vsan

Alexander Klein. #SQLSatDenmark. ETL meets Azure

Data Movement & Tiering with DMF 7

Certified Big Data Hadoop and Spark Scala Course Curriculum

What s New in K8s 1.3

Syncsort Incorporated, 2016

SOLUTION TRACK Finding the Needle in a Big Data Innovator & Problem Solver Cloudera

Data Architectures in Azure for Analytics & Big Data

Lambda Architecture for Batch and Stream Processing. October 2018

Capture Business Opportunities from Systems of Record and Systems of Innovation

Basic Concepts of the Energy Lab 2.0 Co-Simulation Platform

ovirt and Docker Integration

Transcription:

CONTAINERIZED SPARK ON KUBERNETES William Benton Red Hat, Inc. @willb willb@redhat.com

BACKGROUND

BACKGROUND

BACKGROUND

BACKGROUND

BACKGROUND

BACKGROUND

BACKGROUND

BACKGROUND

WHAT OUR SPARK CLUSTER LOOKED LIKE IN 2014

WHAT OUR SPARK CLUSTER LOOKED LIKE IN 2014 Spark executor Spark executor Spark executor Spark executor Spark executor Spark executor

WHAT OUR SPARK CLUSTER LOOKED LIKE IN 2014 Spark executor Spark executor Networked POSIX FS Spark executor Spark executor Spark executor Spark executor

WHAT OUR SPARK CLUSTER LOOKED LIKE IN 2014 Mesos Spark executor Spark executor Networked POSIX FS Spark executor Spark executor Spark executor Spark executor

WHAT OUR SPARK CLUSTER LOOKED LIKE IN 2014 Mesos 1 2 Spark executor Spark executor Spark executor Networked POSIX FS 3 Spark executor Spark executor 4 Spark executor

WHAT OUR SPARK CLUSTER LOOKED LIKE IN 2014 Mesos 1 1 Spark executor Networked 1 Spark executor POSIX FS 2 2 Spark executor 3 3 Spark executor 3 Spark executor 4 4 Spark executor

WHAT OUR SPARK CLUSTER LOOKED LIKE IN 2014 Mesos 1 1 Spark executor Networked 1 Spark executor POSIX FS 2 2 Spark executor 3 3 Spark executor 3 Spark executor 4 4 Spark executor

Analytics is no longer a separate workload.

Analytics is an essential component of modern datadriven applications.

OUR GOALS

OUR GOALS

OUR GOALS

OUR GOALS

OUR GOALS git

OUR GOALS git

OUR GOALS git

FORECAST Motivating containerized microservices Architectures for analytics and applications Spark clusters in containers: practicalities and pitfalls Play along at home Future work

MOTIVATING MICROSERVICES

A microservice architecture employs lightweight, modular, and typically stateless components with well-defined interfaces and contracts.

BENEFITS OF MICROSERVICE ARCHITECTURES

BENEFITS OF MICROSERVICE ARCHITECTURES

BENEFITS OF MICROSERVICE ARCHITECTURES

BENEFITS OF MICROSERVICE ARCHITECTURES

BENEFITS OF MICROSERVICE ARCHITECTURES

BENEFITS OF MICROSERVICE ARCHITECTURES

BENEFITS OF MICROSERVICE ARCHITECTURES

BENEFITS OF MICROSERVICE ARCHITECTURES 2 + 2

BENEFITS OF MICROSERVICE ARCHITECTURES 2 + 2 5

BENEFITS OF MICROSERVICE ARCHITECTURES 2 + 2 5

BENEFITS OF MICROSERVICE ARCHITECTURES

BENEFITS OF MICROSERVICE ARCHITECTURES

BENEFITS OF MICROSERVICE ARCHITECTURES?

BENEFITS OF MICROSERVICE ARCHITECTURES?

BENEFITS OF MICROSERVICE ARCHITECTURES?

MICROSERVICES AND SPARK 1 2 3 4 5 6 7 8 9 10 11 12 executor executor executor executor master

MICROSERVICES AND SPARK 1 2 3 4 5 6 7 8 9 10 11 12 λ x: x * 2 executor executor executor executor master

MICROSERVICES AND SPARK 12 24 36 48 10 5 12 6 14 7 16 8 18 9 10 20 11 22 12 24 λ x: x * 2 executor executor executor executor master λ x: x * 2 λ x: x * 2 λ x: x * 2 λ x: x * 2

MICROSERVICES AND SPARK 12 24 36 48 10 5 12 6 14 7 16 8 18 9 10 20 11 22 12 24 λ x: x * 2 executor executor executor executor master λ x: x * 2 λ x: x * 2 λ x: x * 2 λ x: x * 2

MICROSERVICES AND SPARK 12 24 36 48 10 5 12 6 14 7 16 8 18 9 10 20 11 22 12 24 λ x: x * 2 executor executor executor executor master λ x: x * 2 λ x: x * 2 λ x: x * 2 λ x: x * 2

ARCHITECTURES FOR ANALYTICS AND APPLICATIONS

APPLICATION RESPONSIBILITIES events transform databases transform file, object storage transform

APPLICATION RESPONSIBILITIES events transform databases transform aggregate file, object storage transform

APPLICATION RESPONSIBILITIES events transform databases transform aggregate file, object storage transform models train

APPLICATION RESPONSIBILITIES events transform databases transform aggregate file, object storage transform models train

APPLICATION RESPONSIBILITIES events transform databases transform aggregate archive file, object storage transform models train

APPLICATION RESPONSIBILITIES events transform databases transform aggregate archive file, object storage transform models train

APPLICATION RESPONSIBILITIES events transform developer UI web and mobile databases transform aggregate archive file, object storage transform reporting models train

APPLICATION RESPONSIBILITIES events transform developer UI web and mobile databases transform aggregate archive file, object storage transform reporting models train management

LEGACY ARCHITECTURES

CONVENTIONAL DATA WAREHOUSE events

CONVENTIONAL DATA WAREHOUSE events transform

CONVENTIONAL DATA WAREHOUSE events transform UI

CONVENTIONAL DATA WAREHOUSE events transform UI business logic

CONVENTIONAL DATA WAREHOUSE events transform UI business logic RDBMS

CONVENTIONAL DATA WAREHOUSE events transform UI business logic transaction RDBMS processing

CONVENTIONAL DATA WAREHOUSE events transform UI business logic transaction RDBMS processing

CONVENTIONAL DATA WAREHOUSE events transform UI business logic transaction processing RDBMS RDBMS

CONVENTIONAL DATA WAREHOUSE events transform UI business logic analysis transaction processing RDBMS RDBMS

CONVENTIONAL DATA WAREHOUSE events transform reporting UI business logic analysis transaction processing RDBMS RDBMS

CONVENTIONAL DATA WAREHOUSE events transform reporting UI business logic analysis transaction processing RDBMS RDBMS

CONVENTIONAL DATA WAREHOUSE events transform reporting interactive query business UI analysis logic transaction RDBMS RDBMS processing

CONVENTIONAL DATA WAREHOUSE events transform reporting interactive query business UI analysis logic transaction analytic RDBMS RDBMS processing processing

HADOOP-STYLE DATA LAKE HDFS HDFS HDFS

HADOOP-STYLE DATA LAKE HDFS HDFS HDFS HDFS HDFS

HADOOP-STYLE DATA LAKE events HDFS HDFS HDFS HDFS HDFS

HADOOP-STYLE DATA LAKE events HDFS HDFS HDFS HDFS HDFS

HADOOP-STYLE DATA LAKE events HDFS HDFS HDFS HDFS HDFS compute compute compute compute compute

HADOOP-STYLE DATA LAKE events HDFS HDFS compute compute

MODERN ARCHITECTURES

THE LAMBDA ARCHITECTURE transform (imprecise) analysis events

THE LAMBDA ARCHITECTURE transform (imprecise) analysis events DFS

THE LAMBDA ARCHITECTURE transform (imprecise) analysis speed layer events DFS

THE LAMBDA ARCHITECTURE transform (imprecise) analysis speed layer events transform (precise) analysis DFS

THE LAMBDA ARCHITECTURE transform (imprecise) analysis speed layer events transform (precise) analysis DFS batch layer

THE LAMBDA ARCHITECTURE transform (imprecise) analysis federate speed layer events transform (precise) analysis DFS batch layer

THE LAMBDA ARCHITECTURE transform (imprecise) analysis federate UI speed layer events transform (precise) analysis DFS batch layer

THE LAMBDA ARCHITECTURE transform (imprecise) analysis federate UI speed layer serving layer events transform (precise) analysis DFS batch layer

THE KAPPA ARCHITECTURE events

THE KAPPA ARCHITECTURE events message bus

THE KAPPA ARCHITECTURE events message bus queue for raw data topic

THE KAPPA ARCHITECTURE events message bus queue for raw data topic transform

THE KAPPA ARCHITECTURE events message bus queue for raw data topic queue for preprocessed data topic transform

THE KAPPA ARCHITECTURE events message bus queue for raw data topic queue for preprocessed data topic transform analysis

THE KAPPA ARCHITECTURE events message bus queue for raw data topic queue for preprocessed data topic queue for analysis results topic transform analysis

THE KAPPA ARCHITECTURE events message bus queue for raw data topic queue for preprocessed data topic queue for analysis results topic transform analysis reporting

THE KAPPA ARCHITECTURE events message bus queue for raw data topic queue for preprocessed data topic queue for analysis results topic transform analysis reporting end-user UI

DATA FEDERATION IN THE COMPUTE LAYER events transform developer UI web and mobile databases transform aggregate archive file, object storage transform reporting models train management

DATA FEDERATION IN THE COMPUTE LAYER events transform developer UI web and mobile databases transform aggregate archive file, object storage transform reporting models train management

DATA FEDERATION IN THE COMPUTE LAYER events transform developer UI web and mobile databases transform aggregate archive file, object storage transform reporting models train management

SIDEBAR: THE MONOLITHIC SPARK ANTIPATTERN Resource manager Cluster scheduler app 1 app 2 Spark executor Shared FS Spark executor Spark executor app 3 app 4 Spark executor Spark executor Spark executor

SIDEBAR: THE MONOLITHIC SPARK ANTIPATTERN Resource manager Cluster scheduler app 1 app 2 Spark executor Shared FS Spark executor Spark executor app 3 app 4 Spark executor Spark executor Spark executor

ONE CLUSTER PER APPLICATION Resource manager app 1 app 2 app 3 Object stores app 4 app 5 app 6 Databases

ONE CLUSTER PER APPLICATION Resource manager app 1 app 2 app 3 Object stores app 4 app 5 app 6 Databases

PRACTICALITIES AND POTENTIAL PITFALLS

SCHEDULING Kubernetes app 1 app 2 app 3 Object stores app 4 app 5 app 6 Databases

SCHEDULING Kubernetes app 1 app 2 app 1 app 3 app 4 app 5 app 6

SCHEDULING Kubernetes app 1 app 2 app 1 app 3 app 4 app 5 app 6

SCHEDULING Kubernetes app 1 app 2 app 1 app 3 app 4 app 5 app 6

SCHEDULING Kubernetes app 1 app 2 app 1 app 3 app 4 app 5 app 6

SCHEDULING Kubernetes app 1 app 2 app 1 app 3 app 4 app 5 app 6

STORAGE Kubernetes app 1 app 2 app 3 app 4 app 5 app 6

STORAGE Kubernetes app 1 app 2 app 3 POSIX filesystem app 4 app 5 app 6

STORAGE Kubernetes app 1 app 2 app 3 POSIX filesystem familiar interface app 4 app 5 app 6 interoperability with other programs

STORAGE Kubernetes app 1 app 2 app 3 POSIX filesystem familiar interface app 4 app 5 app 6 interoperability with other programs unnecessary semantic guarantees difficult to manage

STORAGE Kubernetes HDFS HDFS HDFS app 1 app 2 app 3 HDFS HDFS HDFS app 4 app 5 app 6

STORAGE Kubernetes HDFS HDFS HDFS app 1 app 2 app 3 HDFS HDFS HDFS support for legacy app 4 app 5 app 6 Hadoop installations

STORAGE Kubernetes HDFS HDFS HDFS app 1 app 2 app 3 HDFS HDFS HDFS support for legacy app 4 app 5 app 6 Hadoop installations inelastic stateful can t collocate compute and data

STORAGE Kubernetes app 1 app 2 app 3 object store app 4 app 5 app 6

STORAGE Kubernetes app 1 app 2 app 3 object store interoperability app 4 app 5 app 6 fine-grained AC many implementations

STORAGE Kubernetes app 1 app 2 app 3 object store interoperability app 4 app 5 app 6 fine-grained AC many implementations consistency model performance (?)

STORAGE Kubernetes app 1 app 2 app 3 object store interoperability app 4 app 5 app 6 fine-grained AC many implementations consistency model performance

in a cloud native architecture, the benefit of HDFS is actually very small and that is why many cloud-first organizations no longer run HDFS, or only run it as a caching layer for S3. Reynold Xin on Quora (http://qr.ae/taf4cn)

NETWORKING

NETWORKING

NETWORKING

NETWORKING http://app1:8080

NETWORKING http://app1:8080 can t access worker web UI (but wait for Spark 2.1!)

NETWORKING http://app1:8080 can t access worker web UI (but wait for Spark 2.1!)

NETWORKING http://app1:8080 can t access worker web UI (but wait for Spark 2.1!) http://app1:80

SECURITY

SECURITY

SECURITY

SECURITY $SPARK_HOME/bin/spark-class \ org.apache.spark.deploy.worker.worker \ master:7077 pid root net

SECURITY $SPARK_HOME/bin/spark-class \ org.apache.spark.deploy.worker.worker \ master:7077 pid root net

SECURITY $SPARK_HOME/bin/spark-class \ org.apache.spark.deploy.worker.worker \ master:7077 pid root /tmp/foo net

SECURITY $SPARK_HOME/bin/spark-class \ org.apache.spark.deploy.worker.worker \ master:7077 pid root /tmp/foo net

SECURITY

SECURITY k8s namespace

SECURITY k8s namespace *

NEXT STEPS: FUTURE WORK & PLAYING ALONG AT HOME

NEXT STEPS Further performance evaluation Better developer experience Improved scheduling of Spark tasks on Kubernetes

TRY IT OUT YOURSELF Kubernetes standalone Spark example: https://github.com/kubernetes/kubernetes/tree/master/examples/spark Enabling Spark on OpenShift: https://github.com/radanalyticsio Native Spark on Kubernetes proposal: https://github.com/kubernetes/kubernetes/issues/34377

THANKS! @willb willb@redhat.com https://chapeau.freevariable.com