BUILDING CLOUD NATIVE APACHE SPARK APPLICATIONS WITH OPENSHIFT
Michael McCune, 11 January 2017


INTRODUCTION
A little about me:
From embedded systems to orchestration
Red Hat emerging technologies
OpenStack Sahara
Oshinko project for OpenShift

FORECAST
Building application pipelines
Case study: Ophicleide
Demonstration
Lessons learned
Next steps

INSPIRATION
Larger themes:
Developer empowerment
Improved collaboration
Operational freedom

CLOUD APPLICATIONS
What are we talking about?
Multiple disparate components
Require deployment flexibility
Challenging to debug
[Diagram: example components include Ruby, Spark, MongoDB, HTTP, Node.js, MySQL, Kafka, HDFS, Python, ActiveMQ, PostgreSQL]

PLANNING
Before you begin engineering:
Identify the moving pieces
Storyboard the data flow
Visualize success and failure
[Diagram: Node.js, Python, Spark, MongoDB, HTTP]

PLANNING
Insightful analytics:
What dataset?
How to process?
Where are the results?
[Diagram: Node.js, Python, Spark, MongoDB, HTTP; Ingest → Process → Publish]

BUILDING
Decompose application components:
Natural breakpoints
Build for modularity
Stateless versus stateful
[Diagram: Node.js, Python, Spark, MongoDB, HTTP]

BUILDING
Focus on the communication:
Coordinate in the middle
Network resiliency
Kubernetes DNS
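Kubernetes DNS is what makes that coordination point stable: components locate each other by service name rather than by address, so pods can come and go without reconfiguration. A minimal sketch, assuming hypothetical service names mongodb and ophicleide-training in the same namespace:

import socket

# Kubernetes DNS resolves a Service name to its cluster IP, so a pod
# can reach its peers by name with no hard-coded addresses.
# "mongodb" and "ophicleide-training" are assumed service names.
for service in ("mongodb", "ophicleide-training"):
    try:
        print(service, "resolves to", socket.gethostbyname(service))
    except socket.gaierror:
        print(service, "is not resolvable; is the Service deployed?")

Because the names stay constant while the pods behind them change, simple retry logic around lookups like this is usually all the startup resiliency a component needs.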

COLLABORATING
Building as a team:
The right tools
Modular projects
Iterative improvements
Coordinating actions
[Diagram: Node.js, Python, Spark, MongoDB, HTTP]

CASE STUDY: OPHICLEIDE

CASE STUDY: OPHICLEIDE
What does it do?
Word2Vec models
HTTP-available data
Similarity queries
[Architecture diagram: Browser, Node.js frontend, Python training service, Spark workers, text data sources over HTTP, MongoDB, all running on Kubernetes]

CASE STUDY: OPHICLEIDE
Building blocks:
Apache Spark
Word2Vec
Kubernetes
OpenShift
Node.js
Flask
MongoDB
OpenAPI

DEEP DIVE
OpenAPI:
Schema for REST APIs
Wealth of tooling
Central discussion point

OPENAPI

paths:
  /:
    get:
      description: Returns information about the server version
      responses:
        "200":
          description: Valid server info response
          schema:

import connexion

app = connexion.App(
    __name__, specification_dir='./swagger/')
app.add_api('swagger.yaml',
            arguments={'title': 'The REST API for the Ophicleide '
                                'Word2Vec server'})
app.run(port=8080)
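The spec is the contract; connexion routes each operation in it to a Python function via an operationId. A hedged sketch of what a handler for the / endpoint above might look like (the function name and returned fields are illustrative assumptions, not the actual Ophicleide code):

# connexion would dispatch GET / to this function through an
# operationId declared in swagger.yaml; name and payload are assumed.
def server_info():
    return {"application": "ophicleide", "version": "0.1.0"}, 200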

DEEP DIVE
Configuration data:
What is needed? How to deliver?

MONGO=mongodb://admin:admin@mongodb
REST_ADDR=127.0.0.1
REST_PORT=8080

[Diagram: Python, Node.js, Kubernetes]
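On the consuming side, each service reads these values from its environment at startup. A minimal sketch of how the Python component might do it; the defaults are assumptions that simply mirror the sample values above:

import os

# Kubernetes injects these variables through the pod spec; the
# fallback defaults below are illustrative assumptions.
mongo_url = os.environ.get("MONGO", "mongodb://admin:admin@mongodb")
rest_addr = os.environ.get("REST_ADDR", "127.0.0.1")
rest_port = int(os.environ.get("REST_PORT", "8080"))

print("serving on %s:%d, storing results in %s"
      % (rest_addr, rest_port, mongo_url))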

CONFIGURATION DATA

spec:
  containers:
  - name: ${WEBNAME}
    image: ${WEBIMAGE}
    env:
    - name: OPH_TRAINING_ADDR
      value: ${OPH_ADDR}
    - name: OPH_TRAINING_PORT
      value: ${OPH_PORT}
    - name: OPH_WEB_PORT
      value: "8081"
    ports:
    - containerPort: 8081
      protocol: TCP

CONFIGURATION DATA

var express = require('express');
var request = require('request');
var app = express();

var training_addr = process.env.OPH_TRAINING_ADDR || '127.0.0.1';
var training_port = process.env.OPH_TRAINING_PORT || '8080';
var web_port = process.env.OPH_WEB_PORT || 8080;

app.get("/api/models", function(req, res) {
  var url = `http://${training_addr}:${training_port}/models`;
  request.get(url).pipe(res);
});

app.get("/api/queries", function(req, res) {
  var url = `http://${training_addr}:${training_port}/queries`;
  request.get(url).pipe(res);
});

app.listen(web_port, function() {
  console.log(`ophicleide-web listening on ${web_port}`);
});

SECRETS
Not used in Ophicleide, but worth mentioning

volumes:
- name: mongo-secret-volume
  secret:
    secretName: mongo-secret
containers:
- name: shiny-squirrel
  image: elmiko/shiny_squirrel
  args: ["mongodb"]
  volumeMounts:
  - name: mongo-secret-volume
    mountPath: /etc/mongo-secret
    readOnly: true

SECRETS
Each secret is exposed as a file in the container

MONGO_USER=$(cat /etc/mongo-secret/username)
MONGO_PASS=$(cat /etc/mongo-secret/password)
/usr/bin/python /opt/shiny_squirrel/shiny_squirrel.py \
    --mongo \
    mongodb://${MONGO_USER}:${MONGO_PASS}@${MONGO_HOST_PORT}
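The wrapper script above is one option; the application can also read the mounted files directly. A minimal Python sketch, assuming the same mountPath as in the pod spec:

# Each key in the Secret appears as one file under the volume's
# mountPath; the path below matches the pod spec above.
def read_secret(name, base="/etc/mongo-secret"):
    with open(base + "/" + name) as f:
        return f.read().strip()

mongo_user = read_secret("username")
mongo_pass = read_secret("password")

Reading from files rather than environment variables also lets a long-running process pick up a rotated credential, since Kubernetes updates mounted secret volumes in place.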

DEEP DIVE
Spark processing:
Read text from URL
Split words
Create vectors

SPARK PROCESSING

import pymongo
from pyspark import SparkConf, SparkContext

def workloop(master, inq, outq, dburl):
    sconf = SparkConf().setAppName(
        "ophicleide-worker").setMaster(master)
    sc = SparkContext(conf=sconf)
    if dburl is not None:
        db = pymongo.MongoClient(dburl).ophicleide
    outq.put("ready")
    while True:
        job = inq.get()
        urls = job["urls"]
        mid = job["_id"]
        model = train(sc, urls)
        items = model.getVectors().items()
        words, vecs = zip(*[(w, list(v)) for w, v in items])

SPARK PROCESSING

def train(sc, urls):
    w2v = Word2Vec()
    rdds = reduce(lambda a, b: a.union(b),
                  [url2rdd(sc, url) for url in urls])
    return w2v.fit(rdds)

def url2rdd(sc, url):
    response = urlopen(url)
    corpus_bytes = response.read()
    text = str(
        corpus_bytes).replace("\\r", "\r").replace("\\n", "\n")
    rdd = sc.parallelize(text.split("\r\n\r\n"))
    return rdd.map(lambda l: cleanstr(l).split(" "))
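For context, a hedged sketch of the imports these functions rely on and a driver that calls them, with a hypothetical corpus URL (cleanstr is a project helper not shown on the slide):

from functools import reduce
from urllib.request import urlopen

from pyspark import SparkConf, SparkContext
from pyspark.mllib.feature import Word2Vec

# Assumed entry point: a local Spark context and one hypothetical
# corpus URL; train() and url2rdd() are the functions defined above.
sconf = SparkConf().setAppName("ophicleide-example").setMaster("local[*]")
sc = SparkContext(conf=sconf)
model = train(sc, ["http://example.com/corpus.txt"])

# findSynonyms raises if the word never appeared in the corpus.
print(model.findSynonyms("spark", 5))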

SPARK PROCESSING

from uuid import uuid4

def create_query(newquery) -> str:
    mid = newquery["model"]
    word = newquery["word"]
    # model_cache_find, json_error, and query_collection are
    # project helpers defined elsewhere in Ophicleide.
    model = model_cache_find(mid)
    if model is None:
        msg = (("no trained model with ID %r available; " % mid) +
               "check /models to see when one is ready")
        return json_error("Not Found", 404, msg)
    else:
        # XXX
        w2v = model["w2v"]
        qid = uuid4()
        try:
            syns = w2v.findSynonyms(word, 5)
            q = {
                "_id": qid,
                "word": word,
                "results": syns,
                "modelname": model["name"],
                "model": mid
            }
            (query_collection()).insert_one(q)

DEMONSTRATION
See the demonstration at https://vimeo.com/189710503

LESSONS LEARNED
Things that went smoothly:
OpenAPI
Dockerfiles
Kubernetes templates

LESSONS LEARNED
Things that require greater coordination:
API construction
Compute resources
Persistent storage
Spark configurations

LESSONS LEARNED
Compute resources:
CPU and memory constraints
Label selectors
[Diagram: three nodes, each running a kubelet and several pods]

NEXT STEPS
Where to take this project?
More Spark!
Separate query service
Development versus production

PROJECT LINKS
Ophicleide - https://github.com/ophicleide
Apache Spark - https://spark.apache.org
Kubernetes - https://kubernetes.io
OpenShift - https://openshift.org
Oshinko - https://github.com/radanalyticsio
Radanalytics developer portal - http://radanalytics.io

THANKS!
elmiko
@FOSSjunkie
https://elmiko.github.io