
Faster Workflows, Faster
Ken Krugler, President, Scale Unlimited

The Twitter Pitch
- Cascading is a solid, established workflow API
- Good for complex custom ETL workflows
- Flink is a new streaming dataflow engine
- 50% better performance by leveraging memory

Perils of Comparisons
- Performance comparisons are clickbait for devs: "You won't believe the speed of Flink!"
- Really, really hard to do well
- I did moderate tuning of the existing code base
- Somewhat complex workflow in EMR
- Dataset bigger than memory (100M-1B records)

TL;DR
- Flink gets faster by minimizing disk I/O
- A MapReduce job always has a write/read at each job break
- It can also spill map output and do a reduce merge-sort
- Flink has no job boundaries, and no map-side spills
- So only the reduce merge-sort is extra I/O

TOC
- A very short intro to Cascading
- An even shorter intro to Flink
- An example of converting a workflow
- More in-depth results

In the beginning
- There was Hadoop, and it was good
- But life is too short to write M-R jobs with K-V data
- Then along came Cascading

Cascading

What is Cascading?
- A thin Java library on top of Hadoop
- An open source project (APL, 8 years old)
- An API for defining and running ETL workflows

30,000ft View
Records (Tuples) flow through Pipes, e.g. [clay, boxing, 5]

30,000ft View
Pipes connect Operations: [term, link, distance] tuples like [clay, boxing, 5] flow into a "Filter by Distance" operation

30,000ft View
You do Operations on Tuples: [term, link, distance] tuples [clay, boxing, 5] and [clay, olympics, 15] pass through "Filter by Distance" and come out as [term, link] tuples like [clay, boxing]

30,000ft View
Tuples flow into Pipes from Source Taps: an Hfs Tap emits [term, link, distance] tuples such as [clay, boxing, 5] and [clay, olympics, 15]

30,000ft View
Tuples flow from Pipes into Sink Taps: [term, link, weight] tuples such as [clay, boxing, 2.57] and [clay, pottery, 5.32] are written to an Hfs Tap

30,000ft View
Putting it together, this is a data processing workflow (a Flow): an Hfs Tap source emits [term, link, distance] tuples such as [clay, boxing, 5] and [clay, olympics, 15]; "Filter by Distance" trims them to [term, link] tuples like [clay, boxing]; "Group By Term/Link" plus "Count" produces [term, link, count] tuples like [clay, boxing, 237], which flow into an Hfs Tap sink

Java API to Define Flow

// logDataPipe (parsing "Log IP" and "Status" out of each access-log line) is defined the
// same way as ipDataPipe below; its definition is omitted on this slide.
Pipe ipDataPipe = new Pipe("ip data pipe");
RegexParser ipDataParser = new RegexParser(new Fields("Data IP", "Country"), "^([\\d\\.]+)\t(.*)");
ipDataPipe = new Each(ipDataPipe, new Fields("line"), ipDataParser);

Pipe logAnalysisPipe = new CoGroup(
    logDataPipe,              // left-side pipe
    new Fields("Log IP"),     // left-side field for joining
    ipDataPipe,               // right-side pipe
    new Fields("Data IP"),    // right-side field for joining
    new LeftJoin());          // type of join to do

logAnalysisPipe = new GroupBy(logAnalysisPipe, new Fields("Country", "Status"));
logAnalysisPipe = new Every(logAnalysisPipe, new Count(new Fields("Count")));
logAnalysisPipe = new Each(logAnalysisPipe, new Fields("Country"), new Not(new RegexFilter("null")));

Tap logDataTap = new Hfs(new TextLine(), "access.log");
Tap ipDataTap = new Hfs(new TextLine(), "ip-map.tsv");
Tap outputTap = new Hfs(new TextLine(), "results");

FlowDef flowDef = new FlowDef()
    .setName("log analysis flow")
    .addSource(logDataPipe, logDataTap)
    .addSource(ipDataPipe, ipDataTap)
    .addTailSink(logAnalysisPipe, outputTap);

Flow flow = new HadoopFlowConnector(properties).connect(flowDef);

Visualizing the DAG
[Slide shows the planned DAG for the log-analysis Flow: two Hfs TextLine source taps (access.log and ip-map.tsv) feed RegexParser Each operations, which meet in a CoGroup on the IP fields; a temporary SequenceFile is written, followed by a GroupBy on Country/Status, an Every(Count), an Each that filters null countries, and an Hfs sink. Legend: file read or write, Map operation, Reduce operation.]

Things I Like
- Stream shaping: easy to add and drop fields
- Field consistency checking in the DAG
- Building blocks: Operations, SubAssemblies (see the sketch below)
- Flexible planner: MapReduce, local, Tez
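As a rough illustration of the building-block idea (my own sketch, not from the deck), here is a minimal SubAssembly that packages the "drop null countries" filter from the earlier example into a reusable unit; the class name FilterNullCountries is hypothetical, and this assumes the Cascading 3 SubAssembly constructor that registers incoming pipes:

import cascading.operation.filter.Not;
import cascading.operation.regex.RegexFilter;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.pipe.SubAssembly;
import cascading.tuple.Fields;

public class FilterNullCountries extends SubAssembly {
    public FilterNullCountries(Pipe pipe) {
        super(pipe);  // register the incoming pipe with the assembly
        // Keep only tuples whose "Country" field is not the literal string "null".
        Pipe tail = new Each(pipe, new Fields("Country"), new Not(new RegexFilter("null")));
        setTails(tail);  // register the tail so this assembly can be wired into a larger Flow
    }
}

Usage would then be a one-liner in the earlier flow, e.g. logAnalysisPipe = new FilterNullCountries(logAnalysisPipe);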

Time for a Change
- I've used Cascading for 100s of projects
- And made a lot of money consulting on ETL
- But it was getting kind of boring

Flink

Elevator Pitch
- High throughput / low latency stream processing
- Also supports batch (bounded streams)
- Runs locally, stand-alone, or in YARN
- Super-awesome team

Versus Spark? Sigh... OK:
- Very similar in many ways
- Flink is natively streaming, vs. Spark's natively batch
- Not as mature, with a smaller community/ecosystem

Similar to Cascading
- You define a DAG with Java (or Scala) code
- You have data sources and sinks
- Data flows through streams to operators
- The planner turns this into a bunch of tasks (see the sketch below)
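To make the parallel concrete, here is a minimal Flink batch job in Java (an illustrative sketch I'm adding, not code from the deck): a file source, one flatMap operator, a grouped aggregation, and a sink, with the planner building the tasks when execute() is called. The input and output paths are placeholders.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class MinimalFlinkBatchJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Source: read lines of text (placeholder path).
        DataSet<String> lines = env.readTextFile("hdfs:///input/lines.txt");

        // Operator: split lines into (word, 1) pairs.
        DataSet<Tuple2<String, Integer>> counts = lines
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        if (!word.isEmpty()) {
                            out.collect(new Tuple2<>(word, 1));
                        }
                    }
                }
            })
            // Grouped aggregation: sum the counts per word.
            .groupBy(0)
            .sum(1);

        // Sink: write results (placeholder path), then let the planner build and run the tasks.
        counts.writeAsCsv("hdfs:///output/word-counts", "\n", "\t");
        env.execute("minimal flink batch job");
    }
}

The shape (sources, operators, sink, planner) maps closely onto the Cascading Flow shown earlier.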

It's Faster, But...
- I don't want to rewrite my code
- I use lots of custom Cascading Schemes
- I don't really know Scala
- And POJOs ad nauseam are no fun
- Same for Tuple21<Integer, String, String, ...>

Scala is the New APL

val input = env.readFileStream(fileName, 100)
  .flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
  .timeWindowAll(Time.of(60, TimeUnit.SECONDS))
  .trigger(ContinuousProcessingTimeTrigger.of(Time.seconds(5)))
  .fold(Set[String]()) { (r, i) => r + i }
  .map { x => (new Timestamp(System.currentTimeMillis()), x.size) }

Tuple21 Hell

public void reduce(Iterable<Tuple2<Tuple3<String, String, Integer>, Tuple2<String, Integer>>> tupleGroup,
                   Collector<Tuple3<String, String, Double>> out) {
    for (Tuple2<Tuple3<String, String, Integer>, Tuple2<String, Integer>> tuple : tupleGroup) {
        Tuple3<String, String, Integer> idWC = tuple.f0;
        Tuple2<String, Integer> idTW = tuple.f1;
        out.collect(new Tuple3<String, String, Double>(idWC.f0, idWC.f1, (double) idWC.f2 / idTW.f1));
    }
}
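For contrast (my own illustrative sketch, not from the deck), the same kind of record accessed positionally as a Flink tuple versus by name as a Cascading TupleEntry; the field names and values are made up:

import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;
import org.apache.flink.api.java.tuple.Tuple3;

public class FieldAccessContrast {
    public static void main(String[] args) {
        // Flink: positional access; nothing stops you from grabbing the wrong index.
        Tuple3<String, String, Double> flinkRecord = new Tuple3<>("clay", "boxing", 2.57);
        double flinkScore = flinkRecord.f2;

        // Cascading: fields are named, and the planner checks them across the whole DAG.
        TupleEntry cascadingRecord = new TupleEntry(
            new Fields("term", "link", "weight"),
            new Tuple("clay", "boxing", 2.57));
        double cascadingScore = cascadingRecord.getDouble("weight");

        System.out.println(flinkScore + " == " + cascadingScore);
    }
}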

Cascading-Flink

Cascading 3 Planner
- Converts the Cascading DAG into a Flink DAG
- Around 5K lines of code
- The DAG it plans looks like Cascading local mode

Boundaries for Data Sets
[Slide shows the Cascading-Flink plan for the Wikiwords workflow: Hfs SequenceFile sources for the terms and categories data feed a large DAG of Each, GroupBy, Every, and CoGroup operations, ending in TextLine Hfs sinks for term_scores and term_categories.]
- Speed == no spill to disk
- Task CPU is the same
- Other than serde time

How Painful?
- Use the FlinkFlowConnector
- The Flink Flow planner converts the DAG to a Flink job
- Uber jar vs. classic Hadoop jar
- Grungy details of submitting jobs to the EMR cluster
(A sketch of the connector swap follows below.)
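Assuming the connector class is named as on the slide (FlinkFlowConnector; I haven't verified the exact class name in the cascading-flink project), the switch from the earlier Hadoop example is essentially one line. A sketch, continuing the flowDef from the earlier slide:

// Earlier (Hadoop MapReduce planner):
// Flow flow = new HadoopFlowConnector(properties).connect(flowDef);

// With Cascading-Flink, the same FlowDef is handed to the Flink planner instead
// (connector class name as given on the slide).
Flow flow = new FlinkFlowConnector(properties).connect(flowDef);
flow.complete();  // run the workflow; Flink plans and executes the DAG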

Wikiwords Workflow
- Find associations between terms and categories
- For every page in Wikipedia, for every term:
- Find the distance from the term to intra-wiki links
- Then calculate statistics to find strong associations
- i.e. the probability is unusually high that the term is close to the link

Timing Test Details
- EMR cluster with 5 i2.xlarge slaves
- ~1 billion input records (term, article ref, distance)
- Hadoop MapReduce took 148 minutes
- Flink took 98 minutes
- So 1.5x faster: nice, but not great
- Mostly due to spillage in the many boundaries

Summary

If You're a Java ETL Dev
- And you have to deal with batch big data
- Then the Cascading API is a good fit
- And using Flink typically gives better performance
- While still using a standard Hadoop/YARN cluster

Status of Cascading-Flink
- Still young, but surprisingly robust
- Doesn't support Full or RightOuter HashJoins (see the workaround sketch below)
- Pending optimizations: Tuple serialization, Flink improvements
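A possible workaround I'd suggest (my own sketch, not from the deck): express the unsupported join as a CoGroup, which takes the grouped (reduce-side) join path rather than a map-side HashJoin. This reuses the pipes and fields from the earlier log-analysis example, with a right outer join swapped in:

// A HashJoin with new RightJoin() would not be supported by Cascading-Flink;
// a CoGroup with the same joiner expresses the same join semantics via grouping.
Pipe logAnalysisPipe = new CoGroup(
    logDataPipe,              // left-side pipe
    new Fields("Log IP"),     // left-side join field
    ipDataPipe,               // right-side pipe
    new Fields("Data IP"),    // right-side join field
    new RightJoin());         // right outer join, planned as a grouped join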

Better Planning
[Slide diagram: Data Source 1 flows through Operation 1, a Group, and Operation 2 before joining with Data Source 2.]
- Defer the late-stage join
- Avoid premature resource usage

More questions? Feel free to contact me:
http://www.scaleunlimited.com/contact/
ken@scaleunlimited.com

Check out Cascading, Flink, and Cascading-Flink:
http://www.cascading.org
http://flink.apache.org
http://github.com/dataartisans/cascading-flink