CONTAINERIZED SPARK ON KUBERNETES William Benton Red Hat, Inc. @willb willb@redhat.com
BACKGROUND
WHAT OUR SPARK CLUSTER LOOKED LIKE IN 2014
[Diagram: Mesos scheduling six Spark executors that share a networked POSIX filesystem; numbered jobs (1 to 4) are spread across the executors.]
Analytics is no longer a separate workload.
Analytics is an essential component of modern data-driven applications.
OUR GOALS
[Figure sequence; the only surviving label is "git".]
FORECAST
Motivating containerized microservices
Architectures for analytics and applications
Spark clusters in containers: practicalities and pitfalls
Play along at home
Future work
MOTIVATING MICROSERVICES
A microservice architecture employs lightweight, modular, and typically stateless components with well-defined interfaces and contracts.
BENEFITS OF MICROSERVICE ARCHITECTURES
[Figure sequence; the only surviving labels are "2 + 2" and "5".]
BENEFITS OF MICROSERVICE ARCHITECTURES?
MICROSERVICES AND SPARK
[Diagram: a master sends the function λ x: x * 2 to four executors, each holding a partition of the values 1 through 12; each executor applies the function to its own partition, yielding 2, 4, 6, ..., 24.]
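The slide's picture can be mimicked in plain Python. This is only a sketch of the idea, not Spark itself: the `partition` and `run_on_executors` helpers are hypothetical names invented for illustration, and the "executors" are simulated by ordinary list comprehensions in one process.

```python
from typing import Callable, List


def partition(data: List[int], n: int) -> List[List[int]]:
    """Split data into n contiguous chunks (a stand-in for Spark partitioning)."""
    size = (len(data) + n - 1) // n  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_on_executors(fn: Callable[[int], int],
                     partitions: List[List[int]]) -> List[List[int]]:
    """Each simulated executor is stateless: it receives the function
    and one partition, and returns the transformed partition."""
    return [[fn(x) for x in part] for part in partitions]


double = lambda x: x * 2                   # the function shown on the slide
parts = partition(list(range(1, 13)), 4)   # values 1..12 over four executors
results = run_on_executors(double, parts)
flat = [y for part in results for y in part]
print(flat)
```

Because each executor only needs the function and its own slice of the data, any executor can be replaced or restarted without coordinating with the others, which is the property the slide connects to microservices.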
ARCHITECTURES FOR ANALYTICS AND APPLICATIONS
APPLICATION RESPONSIBILITIES
[Diagram labels. Sources: events, databases, file and object storage, each followed by a transform step. Processing: aggregate, archive, train models. Consumers: developer UI, web and mobile, reporting, management.]
LEGACY ARCHITECTURES
CONVENTIONAL DATA WAREHOUSE
[Diagram labels: events; transform; UI; business logic; an RDBMS for transaction processing; a second RDBMS for analytic processing; analysis; reporting; interactive query.]
HADOOP-STYLE DATA LAKE
[Diagram: events are written to five HDFS nodes, each paired with a compute process; the final build shows only two HDFS and compute pairs.]
MODERN ARCHITECTURES
THE LAMBDA ARCHITECTURE
[Diagram: events feed two paths. Speed layer: transform, then (imprecise) analysis. Batch layer: DFS, transform, then (precise) analysis. Serving layer: federate the two results and present them in the UI.]
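A toy illustration of the three layers, assuming a simple event-counting workload. Everything here is hypothetical (the `Event` shape, the sampling trick used to make the speed layer "imprecise", and the function names); it sketches the division of labor rather than any particular system.

```python
from collections import Counter
from typing import Iterable, Tuple

Event = Tuple[int, str]  # (timestamp, key); hypothetical event shape


def batch_view(archived: Iterable[Event]) -> Counter:
    """Batch layer: exact counts recomputed over everything in the DFS."""
    return Counter(key for _, key in archived)


def speed_view(recent: Iterable[Event], sample_every: int = 2) -> Counter:
    """Speed layer: cheap estimate that counts every Nth event and scales up."""
    sampled = Counter(
        key for i, (_, key) in enumerate(recent) if i % sample_every == 0
    )
    return Counter({k: v * sample_every for k, v in sampled.items()})


def serve(batch: Counter, speed: Counter) -> Counter:
    """Serving layer: federate the precise and imprecise views."""
    return batch + speed


archived = [(1, "login"), (2, "click"), (3, "click")]
recent = [(4, "click"), (5, "click"), (6, "login"), (7, "click")]
merged = serve(batch_view(archived), speed_view(recent))
print(merged)
```

The estimate for recent events differs from the true counts; in this style of design the batch layer later recomputes exact results once those events have been archived.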
THE KAPPA ARCHITECTURE
[Diagram: events enter a message bus that holds three topics (queues): raw data, preprocessed data, and analysis results. A transform stage reads raw data and writes preprocessed data; an analysis stage reads preprocessed data and writes analysis results; reporting and the end-user UI read from the bus.]
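The topic-to-topic flow can be sketched with an in-memory stand-in for the message bus. The `MessageBus` class, the topic names, and `run_stage` are all hypothetical and exist only for this illustration; a real deployment would use an actual messaging system.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List


class MessageBus:
    """Toy in-memory bus: each topic is an append-only list of messages."""

    def __init__(self) -> None:
        self.topics: Dict[str, List[Any]] = defaultdict(list)

    def publish(self, topic: str, message: Any) -> None:
        self.topics[topic].append(message)

    def read(self, topic: str, offset: int = 0) -> List[Any]:
        # Return a copy so consumers cannot mutate the log.
        return list(self.topics[topic][offset:])


def run_stage(bus: MessageBus, source: str, sink: str,
              fn: Callable[[Any], Any]) -> None:
    """Consume every message on `source`, apply fn, publish to `sink`."""
    for msg in bus.read(source):
        bus.publish(sink, fn(msg))


bus = MessageBus()
for raw in [" 3", "4 ", "5"]:
    bus.publish("raw", raw)

# transform: clean up raw strings into integers
run_stage(bus, "raw", "preprocessed", lambda s: int(s.strip()))
# analysis: flag values above a (hypothetical) threshold
run_stage(bus, "preprocessed", "results",
          lambda x: {"value": x, "large": x > 3})

print(bus.read("results"))
```

Since every stage only reads from one topic and appends to another, a stage can be rerun from offset 0 to reprocess history, with no separate batch path.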
DATA FEDERATION IN THE COMPUTE LAYER
[Diagram: the application responsibilities picture again. Sources: events, databases, file and object storage, each followed by a transform step. Processing: aggregate, archive, train models. Consumers: developer UI, web and mobile, reporting, management.]
SIDEBAR: THE MONOLITHIC SPARK ANTIPATTERN
[Diagram: a resource manager and cluster scheduler place app 1 through app 4 onto one shared pool of six Spark executors backed by a shared filesystem.]
ONE CLUSTER PER APPLICATION
[Diagram: a resource manager runs app 1 through app 6, each with its own cluster; the applications read from object stores and databases.]
PRACTICALITIES AND POTENTIAL PITFALLS
SCHEDULING
[Diagram: Kubernetes schedules app 1 through app 6, which use object stores and databases; in later builds app 1 appears a second time among the scheduled applications.]
STORAGE: POSIX filesystem
Pros: familiar interface; interoperability with other programs
Cons: unnecessary semantic guarantees; difficult to manage
[Diagram: Kubernetes running app 1 through app 6 against a POSIX filesystem]
STORAGE: HDFS
Pros: support for legacy Hadoop installations
Cons: inelastic; stateful; can't collocate compute and data
[Diagram: Kubernetes running app 1 through app 6 alongside six HDFS nodes]
STORAGE: object store
Pros: interoperability; fine-grained access control (AC); many implementations
Cons: consistency model; performance (?)
[Diagram: Kubernetes running app 1 through app 6 against an object store]
"in a cloud native architecture, the benefit of HDFS is actually very small and that is why many cloud-first organizations no longer run HDFS, or only run it as a caching layer for S3." (Reynold Xin on Quora, http://qr.ae/taf4cn)
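For the object-store option, one common route is Hadoop's S3A connector. The sketch below only assembles configuration properties and turns them into `spark-submit --conf` arguments; it does not start Spark. The property names are the S3A ones from the hadoop-aws module, while the helper names, the endpoint URL, and the credentials are placeholders assumed for illustration.

```python
from typing import Dict, List


def s3a_conf(endpoint: str, access_key: str, secret_key: str) -> Dict[str, str]:
    """Spark properties that pass S3A settings through to Hadoop."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        # Path-style access is often needed for non-AWS object stores.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render properties as repeated `--conf key=value` arguments."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Placeholder values; a real deployment would pull these from a secret.
conf = s3a_conf("http://objectstore.example:9000", "ACCESS", "SECRET")
submit_args = to_submit_args(conf)
print(" ".join(submit_args))
```

With settings like these, an application would address data by `s3a://` URLs instead of paths on a shared filesystem.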
NETWORKING
[Diagram labels: http://app1:8080 and http://app1:80]
Can't access the worker web UI (but wait for Spark 2.1!)
SECURITY
$SPARK_HOME/bin/spark-class \
  org.apache.spark.deploy.worker.Worker \
  spark://master:7077
[Diagram labels around the worker process: pid, root, /tmp/foo, net]
SECURITY
[Diagram label: k8s namespace *]
NEXT STEPS: FUTURE WORK & PLAYING ALONG AT HOME
NEXT STEPS
Further performance evaluation
Better developer experience
Improved scheduling of Spark tasks on Kubernetes
TRY IT OUT YOURSELF
Kubernetes standalone Spark example: https://github.com/kubernetes/kubernetes/tree/master/examples/spark
Enabling Spark on OpenShift: https://github.com/radanalyticsio
Native Spark on Kubernetes proposal: https://github.com/kubernetes/kubernetes/issues/34377
THANKS! @willb willb@redhat.com https://chapeau.freevariable.com