Simplifying ML Workflows with Apache Beam & TensorFlow Extended

Size: px
Start display at page:

Download "Simplifying ML Workflows with Apache Beam & TensorFlow Extended"

Transcription

1 Simplifying ML Workflows with Apache Beam & TensorFlow Extended Tyler Software Engineer at Google Apache Beam PMC

2 Apache Beam Portable data-processing pipelines

3 Example pipelines Python Java

4 Cross-language Portability Framework Language A SDK Language B SDK Language C SDK The Beam Model Runner 1 Runner 2 Runner 3 The Beam Model Language A Language B Language C

5 Python compatible runners Direct runner (local machine): Now Google Cloud Dataflow: Now Apache Flink: Q2-Q3 Apache Spark: Q3-Q4

6 TensorFlow Extended End-to-end machine learning in production

7 Doing ML in production is hard. -Everyone who has ever tried

8 Because, in addition to the actual ML... ML Code

9 ...you have to worry about so much more. Data Verification Monitoring Configuration Data Collection ML Code Serving Infrastructure Analysis Tools Process Management Tools Feature Extraction Machine Resource Management Source: Sculley et al.: Hidden Technical Debt in Machine Learning Systems

10 In this talk, I will...

11 In this talk, I will... Show you how to apply transformations... TensorFlow Transform

12 In this talk, we will... Show you how to apply transformations consistently between Training and Serving TensorFlow Transform TensorFlow Estimators TensorFlow Serving

13 In this talk, we will... Introduce something new... TensorFlow Transform TensorFlow Estimators TensorFlow Model Analysis TensorFlow Serving

14 TensorFlow Transform Consistent In-Graph Transformations in Training and Serving

15 Typical ML Pipeline During training During serving data request batch processing live processing

16 Typical ML Pipeline During training During serving data request batch processing live processing

17 TensorFlow Transform During training During serving data request tf.transform batch processing transform as tf.graph

18 Defining a preprocessing function in TF Transform X def preprocessing_fn(inputs): x = inputs['x']... return { "A": tft.bucketize( tft.normalize(x) * y), "B": tensorflow_fn(y, z), "C": tft.ngrams(z) } Many operations available for dealing with text and numeric features, user can define their own. Y Z

19 Defining a preprocessing function in TF Transform X def preprocessing_fn(inputs): x = inputs['x']... return { "A": tft.bucketize( tft.normalize(x) * y), "B": tensorflow_fn(y, z), "C": tft.ngrams(z) } Many operations available for dealing with text and numeric features, user can define their own. mean stddev normalize Y Z

20 Defining a preprocessing function in TF Transform X def preprocessing_fn(inputs): x = inputs['x']... return { "A": tft.bucketize( tft.normalize(x) * y), "B": tensorflow_fn(y, z), "C": tft.ngrams(z) } Many operations available for dealing with text and numeric features, user can define their own. mean stddev normalize multiply Y Z

21 Defining a preprocessing function in TF Transform X def preprocessing_fn(inputs): x = inputs['x']... return { "A": tft.bucketize( tft.normalize(x) * y), "B": tensorflow_fn(y, z), "C": tft.ngrams(z) } mean stddev normalize multiply quantiles Many operations available for dealing with text and numeric features, user can define their own. bucketize A Y Z

22 Defining a preprocessing function in TF Transform X def preprocessing_fn(inputs): x = inputs['x']... return { "A": tft.bucketize( tft.normalize(x) * y), "B": tensorflow_fn(y, z), "C": tft.ngrams(z) } mean Z Y stddev normalize multiply quantiles Many operations available for dealing with text and numeric features, user can define their own. bucketize A B

23 Defining a preprocessing function in TF Transform X def preprocessing_fn(inputs): x = inputs['x']... return { "A": tft.bucketize( tft.normalize(x) * y), "B": tensorflow_fn(y, z), "C": tft.ngrams(z) } mean Y Z stddev normalize multiply quantiles Many operations available for dealing with text and numeric features, user can define their own. bucketize A B C

24 Analyzers mean stddev Transforms normalize Reduce (full pass) Implemented as a distributed data pipeline multiply quantiles bucketize Instance-to-instance (don t change batch dimension) Pure TensorFlow

25 data constant tensors mean stddev normalize multiply Analyze normalize multiply quantiles bucketize bucketize

26 What can be done with TF Transform? tf.transform batch processing Pretty much anything.

27 What can be done with TF Transform? tf.transform batch processing Pretty much anything. Serving Graph Anything that can be expressed as a TensorFlow Graph

28 Some common use-cases... Scale to... Bag of Words / N-Grams Bucketization Feature Crosses

29 Some common use-cases... Scale to... Bag of Words / N-Grams Bucketization Feature Crosses Apply another TensorFlow Model

30 github.com/tensorflow/transform

31 Introducing TensorFlow Model Analysis Scaleable, sliced, and full-pass metrics

32 Let s Talk about Metrics... How accurate? Converged model? What about my TB sized eval set? Slices / subsets? Across model versions?

33 ML Fairness: analyzing model mistakes by subgroup ROC Curve Sensitivity (True Positive Rate) All groups Specificity (False Positive Rate) Learn more at ml-fairness.com

34 ML Fairness: analyzing model mistakes by subgroup ROC Curve All groups Sensitivity (True Positive Rate) Group A Group B Specificity (False Positive Rate) Learn more at ml-fairness.com

35 ML Fairness: understand the failure modes of your models

36 ML Fairness: Learn More ml-fairness.com

37 How does it work? Inference Graph (SavedModel)... estimator = DNNLinearCombinedClassifier(...) estimator.train(...) estimator.export_savedmodel( serving_input_receiver_fn=serving_input_fn) tfma.export.export_eval_savedmodel ( estimator=estimator, eval_input_receiver_fn=eval_input_fn)... SignatureDef

38 How does it work? Inference Graph (SavedModel)... estimator = DNNLinearCombinedClassifier(...) estimator.train(...) SignatureDef estimator.export_savedmodel( serving_input_receiver_fn=serving_input_fn) tfma.export.export_eval_savedmodel ( estimator=estimator, eval_input_receiver_fn=eval_input_fn)... Eval Graph (SavedModel) Eval SignatureDef Metadata

39 github.com/tensorflow/model-analysis

40 Summary Apache Beam: Data-processing framework the runs locally and scales to massive data, in the Cloud (now) and soon on-premise via Flink (Q2-Q3) and Spark (Q3-Q4). Powers large-scale data processing in the TF libraries below. tf.transform: Consistent in-graph transformations in training and serving. tf.modelanalysis: Scalable, sliced, and full-pass metrics.

Fundamentals of Stream Processing with Apache Beam (incubating)

Fundamentals of Stream Processing with Apache Beam (incubating) Google Docs version of slides (including animations): https://goo.gl/yzvlxe Fundamentals of Stream Processing with Apache Beam (incubating) Frances Perry & Tyler Akidau @francesjperry, @takidau Apache

More information

TOWARDS PORTABILITY AND BEYOND. Maximilian maximilianmichels.com DATA PROCESSING WITH APACHE BEAM

TOWARDS PORTABILITY AND BEYOND. Maximilian maximilianmichels.com DATA PROCESSING WITH APACHE BEAM TOWARDS PORTABILITY AND BEYOND Maximilian Michels mxm@apache.org DATA PROCESSING WITH APACHE BEAM @stadtlegende maximilianmichels.com !2 BEAM VISION Write Pipeline Execute SDKs Runners Backends !3 THE

More information

Portable stateful big data processing in Apache Beam

Portable stateful big data processing in Apache Beam Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC Software Engineer @ Google klk@google.com / @KennKnowles https://s.apache.org/ffsf-2017-beam-state Flink Forward San

More information

Apache Beam: portable and evolutive data-intensive applications

Apache Beam: portable and evolutive data-intensive applications Apache Beam: portable and evolutive data-intensive applications Ismaël Mejía - @iemejia Talend Who am I? @iemejia Software Engineer Apache Beam PMC / Committer ASF member Integration Software Big Data

More information

Real-Time Decisions Using ML on the Google Cloud Platform. Przemysław Pastuszka & Carlos Garcia QCon London 7th March 2018

Real-Time Decisions Using ML on the Google Cloud Platform. Przemysław Pastuszka & Carlos Garcia QCon London 7th March 2018 Real-Time Decisions Using ML on the Google Cloud Platform Przemysław Pastuszka & Carlos Garcia QCon London 7th March 2018 How many of you are interested in machine learning? but how many of you are running

More information

Data Analysis and Validation for ML

Data Analysis and Validation for ML Analysis and for ML Neoklis (Alkis) Polyzotis, Google Research Collaborators: Eric Breck, Sudip Roy, Steven Whang, Martin Zinkevich Outline ML in production is hard, and a big part of hardness is related

More information

An Introduction to The Beam Model

An Introduction to The Beam Model An Introduction to The Beam Model Apache Beam (incubating) Slides by Tyler Akidau & Frances Perry, April 2016 Agenda 1 Infinite, Out-of-order Data Sets 2 The Evolution of the Beam Model 3 What, Where,

More information

Benchmarking Apache Flink and Apache Spark DataFlow Systems on Large-Scale Distributed Machine Learning Algorithms

Benchmarking Apache Flink and Apache Spark DataFlow Systems on Large-Scale Distributed Machine Learning Algorithms Benchmarking Apache Flink and Apache Spark DataFlow Systems on Large-Scale Distributed Machine Learning Algorithms Candidate Andrea Spina Advisor Prof. Sonia Bergamaschi Co-Advisor Dr. Tilmann Rabl Co-Advisor

More information

Flexible Network Analytics in the Cloud. Jon Dugan & Peter Murphy ESnet Software Engineering Group October 18, 2017 TechEx 2017, San Francisco

Flexible Network Analytics in the Cloud. Jon Dugan & Peter Murphy ESnet Software Engineering Group October 18, 2017 TechEx 2017, San Francisco Flexible Network Analytics in the Cloud Jon Dugan & Peter Murphy ESnet Software Engineering Group October 18, 2017 TechEx 2017, San Francisco Introduction Harsh realities of network analytics netbeam Demo

More information

Introduction to Apache Beam

Introduction to Apache Beam Introduction to Apache Beam Dan Halperin JB Onofré Google Beam podling PMC Talend Beam Champion & PMC Apache Member Apache Beam is a unified programming model designed to provide efficient and portable

More information

Apache Beam. Modèle de programmation unifié pour Big Data

Apache Beam. Modèle de programmation unifié pour Big Data Apache Beam Modèle de programmation unifié pour Big Data Who am I? Jean-Baptiste Onofre @jbonofre http://blog.nanthrax.net Member of the Apache Software Foundation

More information

A CD Framework For Data Pipelines. Yaniv

A CD Framework For Data Pipelines. Yaniv A CD Framework For Data Pipelines Yaniv Rodenski @YRodenski yaniv@apache.org Archetypes of Data Pipelines Builders Data People (Data Scientist/ Analysts/BI Devs) Exploratory workloads Code centric Software

More information

Data Processing with Apache Beam (incubating) and Google Cloud Dataflow

Data Processing with Apache Beam (incubating) and Google Cloud Dataflow Data Processing with Apache Beam (incubating) and Google Cloud Dataflow Jelena Pjesivac-Grbovic Staff software engineer Cloud Big Data In collaboration with Frances Perry, Tayler Akidau, and Dataflow team

More information

Luigi Build Data Pipelines of batch jobs. - Pramod Toraskar

Luigi Build Data Pipelines of batch jobs. - Pramod Toraskar Luigi Build Data Pipelines of batch jobs - Pramod Toraskar I am a Principal Solution Engineer & Pythonista with more than 8 years of work experience, Works for a Red Hat India an open source solutions

More information

FROM ZERO TO PORTABILITY

FROM ZERO TO PORTABILITY FROM ZERO TO PORTABILITY? Maximilian Michels mxm@apache.org APACHE BEAM S JOURNEY TO CROSS-LANGUAGE DATA PROCESSING @stadtlegende maximilianmichels.com FOSDEM 2019 What is Beam? What does portability mean?

More information

Personalizing Netflix with Streaming datasets

Personalizing Netflix with Streaming datasets Personalizing Netflix with Streaming datasets Shriya Arora Senior Data Engineer Personalization Analytics @shriyarora What is this talk about? Helping you decide if a streaming pipeline fits your ETL problem

More information

Using Apache Beam for Batch, Streaming, and Everything in Between. Dan Halperin Apache Beam PMC Senior Software Engineer, Google

Using Apache Beam for Batch, Streaming, and Everything in Between. Dan Halperin Apache Beam PMC Senior Software Engineer, Google Abstract Apache Beam is a unified programming model capable of expressing a wide variety of both traditional batch and complex streaming use cases. By neatly separating properties of the data from run-time

More information

Hadoop, Spark, Flink, and Beam Explained to Oracle DBAs: Why They Should Care

Hadoop, Spark, Flink, and Beam Explained to Oracle DBAs: Why They Should Care Hadoop, Spark, Flink, and Beam Explained to Oracle DBAs: Why They Should Care Kuassi Mensah Jean De Lavarene Director Product Mgmt Director Development Server Technologies October 04, 2017 Safe Harbor

More information

Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch

Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch Nick Pentreath Nov / 14 / 16 Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch About @MLnick Principal Engineer, IBM Apache Spark PMC Focused on machine learning

More information

IoT Sensor Analytics with Apache Kafka, KSQL and TensorFlow

IoT Sensor Analytics with Apache Kafka, KSQL and TensorFlow 1 IoT Sensor Analytics with Apache Kafka, KSQL and TensorFlow Kafka-Native End-to-End IoT Data Integration and Processing Kai Waehner - Technology Evangelist kontakt@kai-waehner.de - LinkedIn Twitter :

More information

Onto Petaflops with Kubernetes

Onto Petaflops with Kubernetes Onto Petaflops with Kubernetes Vishnu Kannan Google Inc. vishh@google.com Key Takeaways Kubernetes can manage hardware accelerators at Scale Kubernetes provides a playground for ML ML journey with Kubernetes

More information

TensorFlowOnSpark Scalable TensorFlow Learning on Spark Clusters Lee Yang, Andrew Feng Yahoo Big Data ML Platform Team

TensorFlowOnSpark Scalable TensorFlow Learning on Spark Clusters Lee Yang, Andrew Feng Yahoo Big Data ML Platform Team TensorFlowOnSpark Scalable TensorFlow Learning on Spark Clusters Lee Yang, Andrew Feng Yahoo Big Data ML Platform Team What is TensorFlowOnSpark Why TensorFlowOnSpark at Yahoo? Major contributor to open-source

More information

Deploying Machine Learning Models in Practice

Deploying Machine Learning Models in Practice Deploying Machine Learning Models in Practice Nick Pentreath Principal Engineer @MLnick About @MLnick on Twitter & Github Principal Engineer, IBM CODAIT - Center for Open-Source Data & AI Technologies

More information

Streaming Auto-Scaling in Google Cloud Dataflow

Streaming Auto-Scaling in Google Cloud Dataflow Streaming Auto-Scaling in Google Cloud Dataflow Manuel Fahndrich Software Engineer Google Addictive Mobile Game https://commons.wikimedia.org/wiki/file:globe_centered_in_the_atlantic_ocean_(green_and_grey_globe_scheme).svg

More information

Processing Data Like Google Using the Dataflow/Beam Model

Processing Data Like Google Using the Dataflow/Beam Model Todd Reedy Google for Work Sales Engineer Google Processing Data Like Google Using the Dataflow/Beam Model Goals: Write interesting computations Run in both batch & streaming Use custom timestamps Handle

More information

Deep Learning Frameworks with Spark and GPUs

Deep Learning Frameworks with Spark and GPUs Deep Learning Frameworks with Spark and GPUs Abstract Spark is a powerful, scalable, real-time data analytics engine that is fast becoming the de facto hub for data science and big data. However, in parallel,

More information

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda 1 Motivation And Intro Programming Model Spark Data Transformation Model Construction Model Training Model Inference Execution Model Data Parallel Training

More information

TensorFlow: A System for Learning-Scale Machine Learning. Google Brain

TensorFlow: A System for Learning-Scale Machine Learning. Google Brain TensorFlow: A System for Learning-Scale Machine Learning Google Brain The Problem Machine learning is everywhere This is in large part due to: 1. Invention of more sophisticated machine learning models

More information

Developing Machine Learning Models. Kevin Tsai

Developing Machine Learning Models. Kevin Tsai Developing Machine Learning Models Kevin Tsai GOTO Chicago 2018 Developing Machine Learning Models Kevin Tsai, Google Agenda Machine Learning with tf.estimator Performance pipelines with TFRecords and

More information

The SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Dublin Apache Kafka Meetup, 30 August 2017.

The SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Dublin Apache Kafka Meetup, 30 August 2017. Dublin Apache Kafka Meetup, 30 August 2017 The SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Joseph @pleia2 * ASF projects 1 Elizabeth K. Joseph, Developer Advocate Developer Advocate

More information

15.1 Data flow vs. traditional network programming

15.1 Data flow vs. traditional network programming CME 323: Distributed Algorithms and Optimization, Spring 2017 http://stanford.edu/~rezab/dao. Instructor: Reza Zadeh, Matroid and Stanford. Lecture 15, 5/22/2017. Scribed by D. Penner, A. Shoemaker, and

More information

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANT SCALE www.elephantscale.com sujee@elephantscale.com HI, I M SUJEE MANIYAM Founder / Principal @ ElephantScale

More information

Integrate MATLAB Analytics into Enterprise Applications

Integrate MATLAB Analytics into Enterprise Applications Integrate Analytics into Enterprise Applications Lyamine Hedjazi 2015 The MathWorks, Inc. 1 Data Analytics Workflow Preprocessing Data Business Systems Build Algorithms Smart Connected Systems Take Decisions

More information

Hadoop, Yarn and Beyond

Hadoop, Yarn and Beyond Hadoop, Yarn and Beyond 1 B. R A M A M U R T H Y Overview We learned about Hadoop1.x or the core. Just like Java evolved, Java core, Java 1.X, Java 2.. So on, software and systems evolve, naturally.. Lets

More information

NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS

NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS TECHNICAL OVERVIEW NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS A Guide to the Optimized Framework Containers on NVIDIA GPU Cloud Introduction Artificial intelligence is helping to solve some of the most

More information

OPERATIONALIZING MACHINE LEARNING USING GPU ACCELERATED, IN-DATABASE ANALYTICS

OPERATIONALIZING MACHINE LEARNING USING GPU ACCELERATED, IN-DATABASE ANALYTICS OPERATIONALIZING MACHINE LEARNING USING GPU ACCELERATED, IN-DATABASE ANALYTICS 1 Why GPUs? A Tale of Numbers 100x Performance Increase Infrastructure Cost Savings Performance 100x gains over traditional

More information

Rapid growth of massive datasets

Rapid growth of massive datasets Overview Rapid growth of massive datasets E.g., Online activity, Science, Sensor networks Data Distributed Clusters are Pervasive Data Distributed Computing Mature Methods for Common Problems e.g., classification,

More information

Azure Data Factory. Data Integration in the Cloud

Azure Data Factory. Data Integration in the Cloud Azure Data Factory Data Integration in the Cloud 2018 Microsoft Corporation. All rights reserved. This document is provided "as-is." Information and views expressed in this document, including URL and

More information

Handling Non-Relational Databases on Cloud using Scheduling Approach with Performance Analysis

Handling Non-Relational Databases on Cloud using Scheduling Approach with Performance Analysis Handling Non-Relational Databases on Cloud using Scheduling Approach with Performance Analysis *1 Bansri Kotecha, 2 Hetal Joshiyara 1 PG Scholar, 2 Assistant Professor 1, 2 Computer science Department,

More information

Introduction to TensorFlow. Mor Geva, Apr 2018

Introduction to TensorFlow. Mor Geva, Apr 2018 Introduction to TensorFlow Mor Geva, Apr 2018 Introduction to TensorFlow Mor Geva, Apr 2018 Plan Why TensorFlow Basic Code Structure Example: Learning Word Embeddings with Skip-gram Variable and Name Scopes

More information

Building a Data-Friendly Platform for a Data- Driven Future

Building a Data-Friendly Platform for a Data- Driven Future Building a Data-Friendly Platform for a Data- Driven Future Benjamin Hindman - @benh 2016 Mesosphere, Inc. All Rights Reserved. INTRO $ whoami BENJAMIN HINDMAN Co-founder and Chief Architect of Mesosphere,

More information

Grow your cloud knowledge.

Grow your cloud knowledge. Grow your cloud knowledge. Bootcamps. Dates: Monday, July 23 Friday, July 27. Price: $495 Full Day $249 Half Day. Build your practical skills and theoretical knowledge at in-person training courses that

More information

Research challenges in data-intensive computing The Stratosphere Project Apache Flink

Research challenges in data-intensive computing The Stratosphere Project Apache Flink Research challenges in data-intensive computing The Stratosphere Project Apache Flink Seif Haridi KTH/SICS haridi@kth.se e2e-clouds.org Presented by: Seif Haridi May 2014 Research Areas Data-intensive

More information

Migrating massive monitoring to Bigtable without downtime. Martin Parm, Infrastructure Engineer for Monitoring

Migrating massive monitoring to Bigtable without downtime. Martin Parm, Infrastructure Engineer for Monitoring Migrating massive monitoring to Bigtable without downtime Martin Parm, Infrastructure Engineer for Monitoring This is a big deal. -- Nicholas Harteau/VP, Engineering & Infrastructure https://news.spotify.com/dk/2016/02/23/announcing-spotify-infrastructures-googley-future/

More information

Processing Data of Any Size with Apache Beam

Processing Data of Any Size with Apache Beam Processing Data of Any Size with Apache Beam 1 / 19 Chapter 1 Introducing Apache Beam 2 / 19 Introducing Apache Beam What Is Beam? Why Use Beam? Using Beam 3 / 19 Apache Beam Apache Beam is a unified model

More information

Intro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect

Intro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect Intro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect Igor Roiter Big Data Cloud Solution Architect Working as a Data Specialist for the last 11 years 9 of them as a Consultant specializing

More information

Experiences with Apache Beam. Dan Debrunner Programming Model Architect IBM Streams STSM, IBM

Experiences with Apache Beam. Dan Debrunner Programming Model Architect IBM Streams STSM, IBM Experiences with Apache Beam Dan Debrunner Programming Model Architect IBM Streams STSM, IBM Background To define my point of view IBM Streams brief history 2002 IBM Research/DoD joint research project

More information

Python lab session 1

Python lab session 1 Python lab session 1 Dr Ben Dudson, Department of Physics, University of York 28th January 2011 Python labs Before we can start using Python, first make sure: ˆ You can log into a computer using your username

More information

Scaled Machine Learning at Matroid

Scaled Machine Learning at Matroid Scaled Machine Learning at Matroid Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Machine Learning Pipeline Learning Algorithm Replicate model Data Trained Model Serve Model Repeat entire pipeline Scaling

More information

Modern Data Warehouse The New Approach to Azure BI

Modern Data Warehouse The New Approach to Azure BI Modern Data Warehouse The New Approach to Azure BI History On-Premise SQL Server Big Data Solutions Technical Barriers Modern Analytics Platform On-Premise SQL Server Big Data Solutions Modern Analytics

More information

Hadoop. Introduction / Overview

Hadoop. Introduction / Overview Hadoop Introduction / Overview Preface We will use these PowerPoint slides to guide us through our topic. Expect 15 minute segments of lecture Expect 1-4 hour lab segments Expect minimal pretty pictures

More information

Pursuit of stability. Growing AWS ECS in production. Alexander Köhler Frankfurt, September 2018

Pursuit of stability. Growing AWS ECS in production. Alexander Köhler Frankfurt, September 2018 Pursuit of stability Growing AWS ECS in production Alexander Köhler Frankfurt, September 2018 Alexander Köhler DevOps Engineer Systems Engineer Big Data Engineer Application Developer 2 @la3mmchen inovex

More information

How Apache Beam Will Change Big Data

How Apache Beam Will Change Big Data How Apache Beam Will Change Big Data 1 / 21 About Big Data Institute Mentoring, training, and high-level consulting company focused on Big Data, NoSQL and The Cloud Founded in 2008 We help make companies

More information

Center for Information Services and High Performance Computing (ZIH) Current trends in big data analysis: second generation data processing

Center for Information Services and High Performance Computing (ZIH) Current trends in big data analysis: second generation data processing Center for Information Services and High Performance Computing (ZIH) Current trends in big data analysis: second generation data processing Course overview Part 1 Challenges Fundamentals and challenges

More information

How to go serverless with AWS Lambda

How to go serverless with AWS Lambda How to go serverless with AWS Lambda Roman Plessl, nine (AWS Partner) Zürich, AWSomeDay 12. September 2018 About myself and nine Roman Plessl Working for nine as a Solution Architect, Consultant and Leader.

More information

Gabriel Villa. Architecting an Analytics Solution on AWS

Gabriel Villa. Architecting an Analytics Solution on AWS Gabriel Villa Architecting an Analytics Solution on AWS Cloud and Data Architect Skilled leader, solution architect, and technical expert focusing primarily on Microsoft technologies and AWS. Passionate

More information

Activator Library. Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success.

Activator Library. Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success. Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success. ACTIVATORS Designed to give your team assistance when you need it most without

More information

Machine Learning. Bridging the OT IT Gap for Machine Learning with Ignition and AWS Greengrass

Machine Learning. Bridging the OT IT Gap for Machine Learning with Ignition and AWS Greengrass Machine Learning with Ignition and AWS Greengrass Bridging B the OT IT Gap for Machine Learning Simply Connect Ignition & AWS Greengrass for Machine Learning! Bridging the OT IT gaps is now easier using

More information

Database infrastructure for electronic structure calculations

Database infrastructure for electronic structure calculations Database infrastructure for electronic structure calculations Fawzi Mohamed fawzi.mohamed@fhi-berlin.mpg.de 22.7.2015 Why should you be interested in databases? Can you find a calculation that you did

More information

Enabling High-Quality Printing in Web Applications. Tanu Hoque & Craig Williams

Enabling High-Quality Printing in Web Applications. Tanu Hoque & Craig Williams Enabling High-Quality Printing in Web Applications Tanu Hoque & Craig Williams New Modern Print Service with ArcGIS Enterprise 10.6 Quality Improvements: Support for true color level transparency PDF produced

More information

Down the event-driven road: Experiences of integrating streaming into analytic data platforms

Down the event-driven road: Experiences of integrating streaming into analytic data platforms Down the event-driven road: Experiences of integrating streaming into analytic data platforms Dr. Dominik Benz, Head of Machine Learning Engineering, inovex GmbH Confluent Meetup Munich, 8.10.2018 Integrate

More information

Crossing the AI Chasm. What We Learned from building Apache PredictionIO (incubating)

Crossing the AI Chasm. What We Learned from building Apache PredictionIO (incubating) Crossing the AI Chasm What We Learned from building Apache PredictionIO (incubating) Simon Chan Sr. Director, Product Management, Salesforce Co-founder, PredictionIO PhD, University College London simon@salesforce.com

More information

Machine Learning in Python. Rohith Mohan GradQuant Spring 2018

Machine Learning in Python. Rohith Mohan GradQuant Spring 2018 Machine Learning in Python Rohith Mohan GradQuant Spring 2018 What is Machine Learning? https://twitter.com/myusuf3/status/995425049170489344 Traditional Programming Data Computer Program Output Getting

More information

@unterstein #bedcon. Operating microservices with Apache Mesos and DC/OS

@unterstein #bedcon. Operating microservices with Apache Mesos and DC/OS @unterstein @dcos @bedcon #bedcon Operating microservices with Apache Mesos and DC/OS 1 Johannes Unterstein Software Engineer @Mesosphere @unterstein @unterstein.mesosphere 2017 Mesosphere, Inc. All Rights

More information

The Future of Analytics or The New SQL

The Future of Analytics or The New SQL The Future of Analytics or The New SQL Gerhard Otterbach, Sales Manager Teradata Germany Hanau, Feb. 28th, 2018 Teradata At A Glance: 39 Years Ago Teradata was big data before there was big data Donald

More information

MLeap: Release Spark ML Pipelines

MLeap: Release Spark ML Pipelines MLeap: Release Spark ML Pipelines Hollin Wilkins & Mikhail Semeniuk SATURDAY Web Dev @ Cornell Studied some General Biology Rails Consulting for TrueCar and other companies Implement ML model for ClearBook

More information

Understanding the latent value in all content

Understanding the latent value in all content Understanding the latent value in all content John F. Kennedy (JFK) November 22, 1963 INGEST ENRICH EXPLORE Cognitive skills Data in any format, any Azure store Search Annotations Data Cloud Intelligence

More information

Big data systems 12/8/17

Big data systems 12/8/17 Big data systems 12/8/17 Today Basic architecture Two levels of scheduling Spark overview Basic architecture Cluster Manager Cluster Cluster Manager 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores

More information

Python based Data Science on Cray Platforms Rob Vesse, Alex Heye, Mike Ringenburg - Cray Inc C O M P U T E S T O R E A N A L Y Z E

Python based Data Science on Cray Platforms Rob Vesse, Alex Heye, Mike Ringenburg - Cray Inc C O M P U T E S T O R E A N A L Y Z E Python based Data Science on Cray Platforms Rob Vesse, Alex Heye, Mike Ringenburg - Cray Inc Overview Supported Technologies Cray PE Python Support Shifter Urika-XC Anaconda Python Spark Intel BigDL machine

More information

Processing of big data with Apache Spark

Processing of big data with Apache Spark Processing of big data with Apache Spark JavaSkop 18 Aleksandar Donevski AGENDA What is Apache Spark? Spark vs Hadoop MapReduce Application Requirements Example Architecture Application Challenges 2 WHAT

More information

An Introduction to Big Data Formats

An Introduction to Big Data Formats Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION

More information

WHY AND HOW TO LEVERAGE THE POWER AND SIMPLICITY OF SQL ON APACHE FLINK - FABIAN HUESKE, SOFTWARE ENGINEER

WHY AND HOW TO LEVERAGE THE POWER AND SIMPLICITY OF SQL ON APACHE FLINK - FABIAN HUESKE, SOFTWARE ENGINEER WHY AND HOW TO LEVERAGE THE POWER AND SIMPLICITY OF SQL ON APACHE FLINK - FABIAN HUESKE, SOFTWARE ENGINEER ABOUT ME Apache Flink PMC member & ASF member Contributing since day 1 at TU Berlin Focusing on

More information

Moven. Machine/Deep Learning Models Distribution Relying on the Maven Infrastructure. Sergio Fernández (Redlink GmbH) November 14th, Sevilla

Moven. Machine/Deep Learning Models Distribution Relying on the Maven Infrastructure. Sergio Fernández (Redlink GmbH) November 14th, Sevilla Moven Machine/Deep Learning Models Distribution Relying on the Maven Infrastructure Sergio Fernández (Redlink GmbH) November 14th, 2016 - Sevilla SSIX aims to exploit the predictive power of Social Media

More information

Relay: a high level differentiable IR. Jared Roesch TVMConf December 12th, 2018

Relay: a high level differentiable IR. Jared Roesch TVMConf December 12th, 2018 Relay: a high level differentiable IR Jared Roesch TVMConf December 12th, 2018!1 This represents months of joint work with lots of great folks:!2 TVM Stack Optimization Relay High-Level Differentiable

More information

Transitioning From SSIS to Azure Data Factory. Meagan Longoria, Solution Architect, BlueGranite

Transitioning From SSIS to Azure Data Factory. Meagan Longoria, Solution Architect, BlueGranite Transitioning From SSIS to Azure Data Factory Meagan Longoria, Solution Architect, BlueGranite Microsoft Data Platform MVP I enjoy contributing to and learning from the Microsoft data community. Blogger

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

Unifying Big Data Workloads in Apache Spark

Unifying Big Data Workloads in Apache Spark Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache

More information

DECISION SUPPORT SYSTEM USING TENSOR FLOW

DECISION SUPPORT SYSTEM USING TENSOR FLOW Volume 118 No. 24 2018 ISSN: 1314-3395 (on-line version) url: http://www.acadpubl.eu/hub/ http://www.acadpubl.eu/hub/ DECISION SUPPORT SYSTEM USING TENSOR FLOW D.Anji Reddy 1, G.Narasimha 2, K.Srinivas

More information

Intro. Scheme Basics. scm> 5 5. scm>

Intro. Scheme Basics. scm> 5 5. scm> Intro Let s take some time to talk about LISP. It stands for LISt Processing a way of coding using only lists! It sounds pretty radical, and it is. There are lots of cool things to know about LISP; if

More information

Container 2.0. Container: check! But what about persistent data, big data or fast data?!

Container 2.0. Container: check! But what about persistent data, big data or fast data?! @unterstein @joerg_schad @dcos @jaxdevops Container 2.0 Container: check! But what about persistent data, big data or fast data?! 1 Jörg Schad Distributed Systems Engineer @joerg_schad Johannes Unterstein

More information

Integrate MATLAB Analytics into Enterprise Applications

Integrate MATLAB Analytics into Enterprise Applications Integrate Analytics into Enterprise Applications Aurélie Urbain MathWorks Consulting Services 2015 The MathWorks, Inc. 1 Data Analytics Workflow Data Acquisition Data Analytics Analytics Integration Business

More information

Introduction to MATLAB application deployment

Introduction to MATLAB application deployment Introduction to application deployment Antti Löytynoja, Application Engineer 2015 The MathWorks, Inc. 1 Technical Computing with Products Access Explore & Create Share Options: Files Data Software Data

More information

BlackPearl Customer Created Clients Using Free & Open Source Tools

BlackPearl Customer Created Clients Using Free & Open Source Tools BlackPearl Customer Created Clients Using Free & Open Source Tools December 2017 Contents A B S T R A C T... 3 I N T R O D U C T I O N... 3 B U L D I N G A C U S T O M E R C R E A T E D C L I E N T...

More information

Designing for the Cloud

Designing for the Cloud Design Page 1 Designing for the Cloud 3:47 PM Designing for the Cloud So far, I've given you the "solution" and asked you to "program a piece of it". This is a short-term problem! In fact, 90% of the real

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu [Kumar et al. 99] 2/13/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

More information

Deep Learning Applications

Deep Learning Applications October 20, 2017 Overview Supervised Learning Feedforward neural network Convolution neural network Recurrent neural network Recursive neural network (Recursive neural tensor network) Unsupervised Learning

More information

Machine Learning In A Snap. Thomas Parnell Research Staff Member IBM Research - Zurich

Machine Learning In A Snap. Thomas Parnell Research Staff Member IBM Research - Zurich Machine Learning In A Snap Thomas Parnell Research Staff Member IBM Research - Zurich What are GLMs? Ridge Regression Support Vector Machines Regression Generalized Linear Models Classification Lasso Regression

More information

STATS Data Analysis using Python. Lecture 8: Hadoop and the mrjob package Some slides adapted from C. Budak

STATS Data Analysis using Python. Lecture 8: Hadoop and the mrjob package Some slides adapted from C. Budak STATS 700-002 Data Analysis using Python Lecture 8: Hadoop and the mrjob package Some slides adapted from C. Budak Recap Previous lecture: Hadoop/MapReduce framework in general Today s lecture: actually

More information

Modern Stream Processing with Apache Flink

Modern Stream Processing with Apache Flink 1 Modern Stream Processing with Apache Flink Till Rohrmann GOTO Berlin 2017 2 Original creators of Apache Flink da Platform 2 Open Source Apache Flink + da Application Manager 3 What changes faster? Data

More information

Distributed systems for stream processing

Distributed systems for stream processing Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall Alena Hall Large-scale data processing Distributed Systems Functional Programming Data Science & Machine

More information

Eagle Eye. Sommersemester 2017 Big Data Science Praktikum. Zhenyu Chen - Wentao Hua - Guoliang Xue - Bernhard Fabry - Daly

Eagle Eye. Sommersemester 2017 Big Data Science Praktikum. Zhenyu Chen - Wentao Hua - Guoliang Xue - Bernhard Fabry - Daly Eagle Eye Sommersemester 2017 Big Data Science Praktikum Zhenyu Chen - Wentao Hua - Guoliang Xue - Bernhard Fabry - Daly 1 Sommersemester Agenda 2009 Brief Introduction Pre-processiong of dataset Front-end

More information

WHITE PAPER: TOP 10 CAPABILITIES TO LOOK FOR IN A DATA CATALOG

WHITE PAPER: TOP 10 CAPABILITIES TO LOOK FOR IN A DATA CATALOG WHITE PAPER: TOP 10 CAPABILITIES TO LOOK FOR IN A DATA CATALOG The #1 Challenge in Successfully Deploying a Data Catalog The data cataloging space is relatively new. As a result, many organizations don

More information

Serverless Architecture Hochskalierbare Anwendungen ohne Server. Sascha Möllering, Solutions Architect

Serverless Architecture Hochskalierbare Anwendungen ohne Server. Sascha Möllering, Solutions Architect Serverless Architecture Hochskalierbare Anwendungen ohne Server Sascha Möllering, Solutions Architect Agenda Serverless Architecture AWS Lambda Amazon API Gateway Amazon DynamoDB Amazon S3 Serverless Framework

More information

CS535 Big Data Fall 2017 Colorado State University 10/10/2017 Sangmi Lee Pallickara Week 8- A.

CS535 Big Data Fall 2017 Colorado State University   10/10/2017 Sangmi Lee Pallickara Week 8- A. CS535 Big Data - Fall 2017 Week 8-A-1 CS535 BIG DATA FAQs Term project proposal New deadline: Tomorrow PA1 demo PART 1. BATCH COMPUTING MODELS FOR BIG DATA ANALYTICS 5. ADVANCED DATA ANALYTICS WITH APACHE

More information

Central platform for business-wide, standardized Conversion of documents to PDF and PDF/A

Central platform for business-wide, standardized Conversion of documents to PDF and PDF/A PDF Days Europe 2017 Central platform for business-wide, standardized Conversion of documents to PDF and PDF/A A PDF Association Presentation 2017 by PDF Association 1 Sources of PDF-files in larger organizations

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Software Quality Assurance. David Janzen

Software Quality Assurance. David Janzen Software Quality Assurance David Janzen What is quality? Crosby: Conformance to requirements Issues: who establishes requirements? implicit requirements Juran: Fitness for intended use Issues: Who defines

More information

CloudMan cloud clusters for everyone

CloudMan cloud clusters for everyone CloudMan cloud clusters for everyone Enis Afgan usecloudman.org This is accessibility! But only sometimes So, there are alternatives BUT WHAT IF YOU WANT YOUR OWN, QUICKLY The big picture A. Users in different

More information

For this class we are going to create a file in Microsoft Word. Open Word on the desktop.

For this class we are going to create a file in Microsoft Word. Open Word on the desktop. File Management Windows 10 What is File Management? As you use your computer and create files you may need some help in storing and retrieving those files. File management shows you how to create, move,

More information

Google Cloud Dataflow

Google Cloud Dataflow Google Cloud Dataflow A Unified Model for Batch and Streaming Data Processing Jelena Pjesivac-Grbovic STREAM 2015 Agenda 1 Data Shapes 2 Data Processing Tradeoffs 3 Google s Data Processing Story 4 Google

More information