Faster Workflows, Faster. Ken Krugler President, Scale Unlimited

1 Faster Workflows, Faster Ken Krugler President, Scale Unlimited

2 The Twitter Pitch Cascading is a solid, established workflow API Good for complex custom ETL workflows Flink is a new streaming dataflow engine 50% better performance by leveraging memory

3 Perils of Comparisons Performance comparisons are clickbait for devs You won't believe the speed of Flink! Really, really hard to do well I did moderate tuning of the existing code base Somewhat complex workflow in EMR Dataset bigger than memory (100M to 1B records)

4 TL;DR Flink gets faster by minimizing disk I/O A Map-Reduce job always has a write/read at each job break Can also spill map output and reduce merge-sort Flink has no job boundaries And no map-side spills So only the reduce merge-sort is extra I/O

5 TOC A very short intro to Cascading An even shorter intro to Flink An example of converting a workflow More in-depth results

6 In the beginning There was Hadoop, and it was good But life is too short to write M-R jobs with K-V data Then along came Cascading

7 Cascading

8 What is Cascading? A thin Java library on top of Hadoop An open source project (APL, 8 years old) An API for defining and running ETL workflows

9 30,000ft View Records (Tuples) flow through Pipes [ clay, boxing, 5 ]

10 30,000ft View Pipes connect Operations [ term, link, distance ] [ clay, boxing, 5 ] Filter by Distance

11 30,000ft View You do Operations on Tuples [ term, link, distance ] [ clay, boxing, 5 ] [ clay, olympics, 15 ] Filter by Distance [ term, link ] [ clay, boxing ]
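
To make "Operations on Tuples" concrete, here is a minimal sketch of a custom Cascading Filter like the "Filter by Distance" step above. It assumes a tuple stream with an integer "distance" field; the class name and cutoff are made up for illustration.

import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Filter;
import cascading.operation.FilterCall;

// Drop any tuple whose 'distance' field exceeds the cutoff.
public class DistanceFilter extends BaseOperation<Void> implements Filter<Void> {
    private final int maxDistance;

    public DistanceFilter(int maxDistance) {
        this.maxDistance = maxDistance;
    }

    @Override
    public boolean isRemove(FlowProcess flowProcess, FilterCall<Void> filterCall) {
        // Returning true removes the tuple from the stream.
        return filterCall.getArguments().getInteger("distance") > maxDistance;
    }
}

Wired into a pipe with: pipe = new Each(pipe, new Fields("distance"), new DistanceFilter(10));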

12 30,000ft View Tuples flow into Pipes from Source Taps Hfs Tap [ term, link, distance ] [ clay, boxing, 5 ] [ clay, olympics, 15 ]

13 30,000ft View Tuples flow from Pipes into Sink Taps [ term, link, weight ] [ clay, boxing, 2.57 ] [ clay, pottery, 5.32 ] Hfs Tap

14 30,000ft View This is a data processing workflow (Flow) Hfs Tap [ term, link ] [ clay, boxing ] [ term, link, distance ] [ clay, boxing, 5 ] [ clay, olympics, 15 ] Group By Term/Link Count [ term, link, count ] Filter by Distance [ clay, boxing, 237 ] Hfs Tap

15 Java API to Define Flow

Pipe ipDataPipe = new Pipe("ip data pipe");
RegexParser ipDataParser = new RegexParser(new Fields("Data IP", "Country"), "^([\\d\\.]+)\t(.*)");
ipDataPipe = new Each(ipDataPipe, new Fields("line"), ipDataParser);

Pipe logAnalysisPipe = new CoGroup(
    logDataPipe,              // left-side pipe (its parsing is elided on the slide)
    new Fields("Log IP"),     // left-side field for joining
    ipDataPipe,               // right-side pipe
    new Fields("Data IP"),    // right-side field for joining
    new LeftJoin());          // type of join to do
logAnalysisPipe = new GroupBy(logAnalysisPipe, new Fields("Country", "Status"));
logAnalysisPipe = new Every(logAnalysisPipe, new Count(new Fields("Count")));
logAnalysisPipe = new Each(logAnalysisPipe, new Fields("Country"), new Not(new RegexFilter("null")));

Tap logDataTap = new Hfs(new TextLine(), "access.log");
Tap ipDataTap = new Hfs(new TextLine(), "ip-map.tsv");
Tap outputTap = new Hfs(new TextLine(), "results");

FlowDef flowDef = new FlowDef()
    .setName("log analysis flow")
    .addSource(logDataPipe, logDataTap)
    .addSource(ipDataPipe, ipDataTap)
    .addTailSink(logAnalysisPipe, outputTap);

Flow flow = new HadoopFlowConnector(properties).connect(flowDef);

16 Visualizing the DAG [Flow diagram: two Hfs TextLine sources (access.log and ip-map.tsv) feed Each(RegexParser) operations, which meet in a CoGroup, write to a TempHfs SequenceFile between jobs, then GroupBy on country/status, Every(Count), Each(filter null countries), and finally an Hfs TextLine sink. Legend: file read or write, map operation, reduce operation.]

17 Things I Like Stream Shaping - easy to add, drop fields Field consistency checking in DAG Building blocks - Operations, SubAssemblies Flexible planner - MR, local, Tez

18 Time for a Change I've used Cascading for 100s of projects And made a lot of money consulting on ETL But it was getting kind of boring

19 Flink

20 Elevator Pitch High throughput/low latency stream processing Also supports batch (bounded streams) Runs locally, stand-alone, or in YARN Super-awesome team

21 Versus Spark? Sigh... OK. Very similar in many ways Natively streaming, vs. natively batch Not as mature, smaller community/ecosystem

22 Similar to Cascading You define a DAG with Java (or Scala) code You have data sources and sinks Data flows through streams to operators The planner turns this into a bunch of tasks
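
A minimal sketch of that same shape in Flink's Java DataSet API (roughly the vintage discussed here): a source, a couple of operators, and a sink, which the planner turns into tasks. The file paths and job name are placeholders.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class FlinkDagSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.readTextFile("access.log")                           // data source
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        if (!word.isEmpty()) out.collect(new Tuple2<>(word, 1));
                    }
                }
            })
            .groupBy(0)                                          // operator: group by word
            .sum(1)                                              // operator: count occurrences
            .writeAsText("results");                             // data sink
        env.execute("word count DAG");                           // planner builds and runs the tasks
    }
}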

23 It's Faster, But I don't want to rewrite my code I use lots of custom Cascading schemes I don't really know Scala And POJOs ad nauseam are no fun Same for Tuple21&lt;Integer, String, String, ...&gt;

24 Scala is the New APL

val input = env.readFileStream(fileName, 100)
  .flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
  .timeWindowAll(Time.of(60, TimeUnit.SECONDS))
  .trigger(ContinuousProcessingTimeTrigger.of(Time.seconds(5)))
  .fold(Set[String]()) { (r, i) => r + i }
  .map { x => (new Timestamp(System.currentTimeMillis()), x.size) }

25 Tuple21 Hell

public void reduce(Iterable<Tuple2<Tuple3<String, String, Integer>, Tuple2<String, Integer>>> tupleGroup,
                   Collector<Tuple3<String, String, Double>> out) {
    for (Tuple2<Tuple3<String, String, Integer>, Tuple2<String, Integer>> tuple : tupleGroup) {
        Tuple3<String, String, Integer> idWC = tuple.f0;
        Tuple2<String, Integer> idTW = tuple.f1;
        out.collect(new Tuple3<String, String, Double>(idWC.f0, idWC.f1, (double) idWC.f2 / idTW.f1));
    }
}

26 Cascading-Flink

27 Cascading 3 Planner Converts the Cascading DAG into a Flink DAG Around 5K lines of code The DAG it plans looks like Cascading's local mode

28 Boundaries for Data Sets [DAG of the Wikiwords TF*IDF flow: Hfs SequenceFile sources for terms and categories feed long chains of Each, GroupBy, Every, and CoGroup operations, ending in Hfs TextLine sinks for term_scores and term_categories. Slide callouts: Speed == no spill to disk; Task CPU is the same, other than serde time.]

29 How Painful? Use the FlinkFlowConnector The Flink Flow planner converts the DAG to a job Uber jar vs. classic Hadoop jar Grungy details of submitting jobs to the EMR cluster
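
In code, the switch is small. A hedged sketch, reusing the log-analysis FlowDef from earlier and following the slide's naming for the connector class (check the cascading-flink project for the exact class and setup):

// Before, on Hadoop MapReduce:
// Flow flow = new HadoopFlowConnector(properties).connect(flowDef);

// After, on Flink -- same FlowDef, different planner:
Flow flow = new FlinkFlowConnector(properties).connect(flowDef);
flow.complete();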

30 Wikiwords Workflow Find associations between terms and categories For every page in Wikipedia, for every term Find the distance from the term to intra-wiki links Then calc statistics to find strong associations Probability is unusually high that the term is close to the link

31 Timing Test Details EMR cluster with 5 i2.xlarge slaves ~1 billion input records (term, article ref, distance) Hadoop MapReduce took 148 minutes Flink took 98 minutes So 1.5x faster - nice but not great Mostly due to spillage in many boundaries

32 Summary

33 If You're a Java ETL Dev And you have to deal with batch big data Then the Cascading API is a good fit And using Flink typically gives better performance While still using a standard Hadoop/YARN cluster

34 Status of Cascading-Flink Still young, but surprisingly robust Doesn't support Full or RightOuter HashJoins Pending optimizations: tuple serialization, Flink improvements
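
As a sketch of the HashJoin limitation called out above (pipe and field names are made up; the joiner classes are standard Cascading):

import cascading.pipe.HashJoin;
import cascading.pipe.Pipe;
import cascading.pipe.joiner.RightJoin;
import cascading.tuple.Fields;

Pipe lhs = new Pipe("lhs");
Pipe rhs = new Pipe("rhs");

// Per the slide, a HashJoin with a RightJoin (or OuterJoin) joiner isn't
// supported by the Flink planner yet; a CoGroup with the same joiner is the fallback.
Pipe joined = new HashJoin(lhs, new Fields("id"), rhs, new Fields("id"), new RightJoin());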

35 Better Planning [Diagram: Data Source 1 → Operation 1 → Group → Operation 2 → Join, with Data Source 2 feeding the Join directly. Defer the late-stage join to avoid premature resource usage.]

36 More questions? Feel free to contact me Check out Cascading, Flink, and Cascading-Flink
