CONTAINERIZED SPARK ON KUBERNETES. William Benton Red Hat,

CONTAINERIZED SPARK ON KUBERNETES William Benton Red Hat, Inc. @willb willb@redhat.com

BACKGROUND

WHAT OUR SPARK CLUSTER LOOKED LIKE IN 2014

WHAT OUR SPARK CLUSTER LOOKED LIKE IN 2014 Spark executor Spark executor Spark executor Spark executor Spark executor Spark executor

WHAT OUR SPARK CLUSTER LOOKED LIKE IN 2014 Spark executor Spark executor Networked POSIX FS Spark executor Spark executor Spark executor Spark executor

WHAT OUR SPARK CLUSTER LOOKED LIKE IN 2014 Mesos Spark executor Spark executor Networked POSIX FS Spark executor Spark executor Spark executor Spark executor

WHAT OUR SPARK CLUSTER LOOKED LIKE IN 2014 Mesos 1 2 Spark executor Spark executor Spark executor Networked POSIX FS 3 Spark executor Spark executor 4 Spark executor

WHAT OUR SPARK CLUSTER LOOKED LIKE IN 2014 Mesos 1 1 Spark executor Networked 1 Spark executor POSIX FS 2 2 Spark executor 3 3 Spark executor 3 Spark executor 4 4 Spark executor

Analytics is no longer a separate workload.

Analytics is an essential component of modern datadriven applications.

OUR GOALS

OUR GOALS git

FORECAST Motivating containerized microservices Architectures for analytics and applications Spark clusters in containers: practicalities and pitfalls Play along at home Future work

MOTIVATING MICROSERVICES

A microservice architecture employs lightweight, modular, and typically stateless components with well-defined interfaces and contracts.

BENEFITS OF MICROSERVICE ARCHITECTURES

BENEFITS OF MICROSERVICE ARCHITECTURES 2 + 2

BENEFITS OF MICROSERVICE ARCHITECTURES 2 + 2 5

BENEFITS OF MICROSERVICE ARCHITECTURES

BENEFITS OF MICROSERVICE ARCHITECTURES?

MICROSERVICES AND SPARK 1 2 3 4 5 6 7 8 9 10 11 12 executor executor executor executor master

MICROSERVICES AND SPARK 1 2 3 4 5 6 7 8 9 10 11 12 λ x: x * 2 executor executor executor executor master

MICROSERVICES AND SPARK 12 24 36 48 10 5 12 6 14 7 16 8 18 9 10 20 11 22 12 24 λ x: x * 2 executor executor executor executor master λ x: x * 2 λ x: x * 2 λ x: x * 2 λ x: x * 2

ARCHITECTURES FOR ANALYTICS AND APPLICATIONS

APPLICATION RESPONSIBILITIES events transform databases transform file, object storage transform

APPLICATION RESPONSIBILITIES events transform databases transform aggregate file, object storage transform

APPLICATION RESPONSIBILITIES events transform databases transform aggregate file, object storage transform models train

APPLICATION RESPONSIBILITIES events transform databases transform aggregate archive file, object storage transform models train

APPLICATION RESPONSIBILITIES events transform developer UI web and mobile databases transform aggregate archive file, object storage transform reporting models train

APPLICATION RESPONSIBILITIES events transform developer UI web and mobile databases transform aggregate archive file, object storage transform reporting models train management

LEGACY ARCHITECTURES

CONVENTIONAL DATA WAREHOUSE events

CONVENTIONAL DATA WAREHOUSE events transform

CONVENTIONAL DATA WAREHOUSE events transform UI

CONVENTIONAL DATA WAREHOUSE events transform UI business logic

CONVENTIONAL DATA WAREHOUSE events transform UI business logic RDBMS

CONVENTIONAL DATA WAREHOUSE events transform UI business logic transaction RDBMS processing

CONVENTIONAL DATA WAREHOUSE events transform UI business logic transaction processing RDBMS RDBMS

CONVENTIONAL DATA WAREHOUSE events transform UI business logic analysis transaction processing RDBMS RDBMS

CONVENTIONAL DATA WAREHOUSE events transform reporting UI business logic analysis transaction processing RDBMS RDBMS

CONVENTIONAL DATA WAREHOUSE events transform reporting interactive query business UI analysis logic transaction RDBMS RDBMS processing

CONVENTIONAL DATA WAREHOUSE events transform reporting interactive query business UI analysis logic transaction analytic RDBMS RDBMS processing processing

HADOOP-STYLE DATA LAKE HDFS HDFS HDFS

HADOOP-STYLE DATA LAKE HDFS HDFS HDFS HDFS HDFS

HADOOP-STYLE DATA LAKE events HDFS HDFS HDFS HDFS HDFS

HADOOP-STYLE DATA LAKE events HDFS HDFS HDFS HDFS HDFS compute compute compute compute compute

HADOOP-STYLE DATA LAKE events HDFS HDFS compute compute

MODERN ARCHITECTURES

THE LAMBDA ARCHITECTURE transform (imprecise) analysis events

THE LAMBDA ARCHITECTURE transform (imprecise) analysis events DFS

THE LAMBDA ARCHITECTURE transform (imprecise) analysis speed layer events DFS

THE LAMBDA ARCHITECTURE transform (imprecise) analysis speed layer events transform (precise) analysis DFS

THE LAMBDA ARCHITECTURE transform (imprecise) analysis speed layer events transform (precise) analysis DFS batch layer

THE LAMBDA ARCHITECTURE transform (imprecise) analysis federate speed layer events transform (precise) analysis DFS batch layer

THE LAMBDA ARCHITECTURE transform (imprecise) analysis federate UI speed layer events transform (precise) analysis DFS batch layer

THE LAMBDA ARCHITECTURE transform (imprecise) analysis federate UI speed layer serving layer events transform (precise) analysis DFS batch layer

THE KAPPA ARCHITECTURE events

THE KAPPA ARCHITECTURE events message bus

THE KAPPA ARCHITECTURE events message bus queue for raw data topic

THE KAPPA ARCHITECTURE events message bus queue for raw data topic transform

THE KAPPA ARCHITECTURE events message bus queue for raw data topic queue for preprocessed data topic transform

THE KAPPA ARCHITECTURE events message bus queue for raw data topic queue for preprocessed data topic transform analysis

THE KAPPA ARCHITECTURE events message bus queue for raw data topic queue for preprocessed data topic queue for analysis results topic transform analysis

THE KAPPA ARCHITECTURE events message bus queue for raw data topic queue for preprocessed data topic queue for analysis results topic transform analysis reporting

THE KAPPA ARCHITECTURE events message bus queue for raw data topic queue for preprocessed data topic queue for analysis results topic transform analysis reporting end-user UI

DATA FEDERATION IN THE COMPUTE LAYER events transform developer UI web and mobile databases transform aggregate archive file, object storage transform reporting models train management

SIDEBAR: THE MONOLITHIC SPARK ANTIPATTERN Resource manager Cluster scheduler app 1 app 2 Spark executor Shared FS Spark executor Spark executor app 3 app 4 Spark executor Spark executor Spark executor

ONE CLUSTER PER APPLICATION Resource manager app 1 app 2 app 3 Object stores app 4 app 5 app 6 Databases

PRACTICALITIES AND POTENTIAL PITFALLS

SCHEDULING Kubernetes app 1 app 2 app 3 Object stores app 4 app 5 app 6 Databases

SCHEDULING Kubernetes app 1 app 2 app 1 app 3 app 4 app 5 app 6

STORAGE Kubernetes app 1 app 2 app 3 app 4 app 5 app 6

STORAGE Kubernetes app 1 app 2 app 3 POSIX filesystem app 4 app 5 app 6

STORAGE Kubernetes app 1 app 2 app 3 POSIX filesystem familiar interface app 4 app 5 app 6 interoperability with other programs

STORAGE Kubernetes app 1 app 2 app 3 POSIX filesystem familiar interface app 4 app 5 app 6 interoperability with other programs unnecessary semantic guarantees difficult to manage

STORAGE Kubernetes HDFS HDFS HDFS app 1 app 2 app 3 HDFS HDFS HDFS app 4 app 5 app 6

STORAGE Kubernetes HDFS HDFS HDFS app 1 app 2 app 3 HDFS HDFS HDFS support for legacy app 4 app 5 app 6 Hadoop installations

STORAGE Kubernetes HDFS HDFS HDFS app 1 app 2 app 3 HDFS HDFS HDFS support for legacy app 4 app 5 app 6 Hadoop installations inelastic stateful can t collocate compute and data

STORAGE Kubernetes app 1 app 2 app 3 object store app 4 app 5 app 6

STORAGE Kubernetes app 1 app 2 app 3 object store interoperability app 4 app 5 app 6 fine-grained AC many implementations

STORAGE Kubernetes app 1 app 2 app 3 object store interoperability app 4 app 5 app 6 fine-grained AC many implementations consistency model performance (?)

STORAGE Kubernetes app 1 app 2 app 3 object store interoperability app 4 app 5 app 6 fine-grained AC many implementations consistency model performance

in a cloud native architecture, the benefit of HDFS is actually very small and that is why many cloud-first organizations no longer run HDFS, or only run it as a caching layer for S3. Reynold Xin on Quora (http://qr.ae/taf4cn)

NETWORKING

NETWORKING http://app1:8080

NETWORKING http://app1:8080 can t access worker web UI (but wait for Spark 2.1!)

NETWORKING http://app1:8080 can t access worker web UI (but wait for Spark 2.1!) http://app1:80

SECURITY

SECURITY $SPARK_HOME/bin/spark-class \ org.apache.spark.deploy.worker.worker \ master:7077 pid root net

SECURITY $SPARK_HOME/bin/spark-class \ org.apache.spark.deploy.worker.worker \ master:7077 pid root /tmp/foo net

SECURITY

SECURITY k8s namespace

SECURITY k8s namespace *

NEXT STEPS: FUTURE WORK & PLAYING ALONG AT HOME

NEXT STEPS Further performance evaluation Better developer experience Improved scheduling of Spark tasks on Kubernetes

TRY IT OUT YOURSELF Kubernetes standalone Spark example: https://github.com/kubernetes/kubernetes/tree/master/examples/spark Enabling Spark on OpenShift: https://github.com/radanalyticsio Native Spark on Kubernetes proposal: https://github.com/kubernetes/kubernetes/issues/34377

THANKS! @willb willb@redhat.com https://chapeau.freevariable.com