Faster Workflows, Faster. Ken Krugler President, Scale Unlimited
1 Faster Workflows, Faster Ken Krugler President, Scale Unlimited
2 The Twitter Pitch Cascading is a solid, established workflow API Good for complex custom ETL workflows Flink is a new streaming dataflow engine 50% better performance by leveraging memory
3 Perils of Comparisons Performance comparisons are clickbait for devs "You won't believe the speed of Flink!" Really, really hard to do well. I did moderate tuning of the existing code base. Somewhat complex workflow in EMR. Dataset bigger than memory (100M to 1B records)
4 TL;DR Flink gets faster by minimizing disk I/O Map-Reduce job always has write/read at job break Can also spill map output, reduce merge-sort Flink has no job boundaries And no map-side spills So only reduce merge-sort is extra I/O
5 TOC A very short intro to Cascading An even shorter intro to Flink An example of converting a workflow More in-depth results
6 In the beginning There was Hadoop, and it was good But life is too short to write M-R jobs with K-V data Then along came Cascading
7 Cascading
8 What is Cascading? A thin Java library on top of Hadoop An open source project (APL, 8 years old) An API for defining and running ETL workflows
9 30,000ft View Records (Tuples) flow through Pipes [ clay, boxing, 5 ]
10 30,000ft View Pipes connect Operations [ term, link, distance ] [ clay, boxing, 5 ] Filter by Distance
11 30,000ft View You do Operations on Tuples [ term, link, distance ] [ clay, boxing, 5 ] [ clay, olympics, 15 ] Filter by Distance [ term, link ] [ clay, boxing ]
12 30,000ft View Tuples flow into Pipes from Source Taps Hfs Tap [ term, link, distance ] [ clay, boxing, 5 ] [ clay, olympics, 15 ]
13 30,000ft View Tuples flow from Pipes into Sink Taps [ term, link, weight ] [ clay, boxing, 2.57 ] [ clay, pottery, 5.32 ] Hfs Tap
14 30,000ft View This is a data processing workflow (Flow) Hfs Tap [ term, link ] [ clay, boxing ] [ term, link, distance ] [ clay, boxing, 5 ] [ clay, olympics, 15 ] Group By Term/Link Count [ term, link, count ] Filter by Distance [ clay, boxing, 237 ] Hfs Tap
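The Flow above is just filter, project, group, count over a stream of tuples. As a hedged, plain-Java analogy (JDK streams, not Cascading itself; names like termLinkCounts are hypothetical), the same shape looks like this:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// A plain-Java analogy of the Flow above: tuples of [term, link, distance]
// are filtered by distance, projected down to [term, link], then grouped
// and counted -- the same filter -> project -> group-by -> count shape.
public class FlowAnalogy {

    // Count (term, link) pairs whose distance is below maxDistance.
    static Map<String, Long> termLinkCounts(List<Object[]> tuples, int maxDistance) {
        return tuples.stream()
                // Filter by Distance
                .filter(t -> (int) t[2] < maxDistance)
                // Project down to [term, link]
                .map(t -> t[0] + "/" + t[1])
                // Group By term/link, then Count
                .collect(Collectors.groupingBy(k -> k, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Object[]> tuples = List.of(
                new Object[]{"clay", "boxing", 5},
                new Object[]{"clay", "boxing", 7},
                new Object[]{"clay", "olympics", 15});
        System.out.println(termLinkCounts(tuples, 10)); // {clay/boxing=2}
    }
}
```

In Cascading the same steps become Each, GroupBy, and Every operations on Pipes, with Taps at either end.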
15 Java API to Define Flow

Pipe ipDataPipe = new Pipe("ip data pipe");
RegexParser ipDataParser = new RegexParser(new Fields("Data IP", "Country"), "^([\\d\\.]+)\t(.*)");
ipDataPipe = new Each(ipDataPipe, new Fields("line"), ipDataParser);

Pipe logAnalysisPipe = new CoGroup(
    logDataPipe,            // left-side pipe
    new Fields("Log IP"),   // left-side field for joining
    ipDataPipe,             // right-side pipe
    new Fields("Data IP"),  // right-side field for joining
    new LeftJoin());        // type of join to do
logAnalysisPipe = new GroupBy(logAnalysisPipe, new Fields("Country", "Status"));
logAnalysisPipe = new Every(logAnalysisPipe, new Count(new Fields("Count")));
logAnalysisPipe = new Each(logAnalysisPipe, new Fields("country"), new Not(new RegexFilter("null")));

Tap logDataTap = new Hfs(new TextLine(), "access.log");
Tap ipDataTap = new Hfs(new TextLine(), "ip-map.tsv");
Tap outputTap = new Hfs(new TextLine(), "results");

FlowDef flowDef = new FlowDef()
    .setName("log analysis flow")
    .addSource(logDataPipe, logDataTap)
    .addSource(ipDataPipe, ipDataTap)
    .addTailSink(logAnalysisPipe, outputTap);

Flow flow = new HadoopFlowConnector(properties).connect(flowDef);
16 Visualizing the DAG [Planner diagram of the log-analysis flow: two Hfs TextLine sources (access.log, ip-map.tsv) feed RegexParser Each operations, a CoGroup joins on IP, a TempHfs SequenceFile marks the job boundary, then a GroupBy on country/status, an Every Count, an Each filter dropping null countries, and an Hfs TextLine sink. Legend: file read or write, map operation, reduce operation.]
17 Things I Like Stream Shaping - easy to add, drop fields Field consistency checking in DAG Building blocks - Operations, SubAssemblies Flexible planner - MR, local, Tez
18 Time for a Change I've used Cascading for 100s of projects. And made a lot of money consulting on ETL. But it was getting kind of boring.
19 Flink
20 Elevator Pitch High throughput/low latency stream processing Also supports batch (bounded streams) Runs locally, stand-alone, or in YARN Super-awesome team
21 Versus Spark? Sigh... OK. Very similar in many ways. Natively streaming, vs. natively batch. Not as mature; smaller community/ecosystem.
22 Similar to Cascading You define a DAG with Java (or Scala) code You have data sources and sinks Data flows through streams to operators The planner turns this into a bunch of tasks
23 It's Faster, But I don't want to rewrite my code. I use lots of custom Cascading schemes. I don't really know Scala. And POJOs ad nauseam are no fun. Same for Tuple21<Integer, String, String, ...>
24 Scala is the New APL

val input = env.readFileStream(fileName, 100)
  .flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
  .timeWindowAll(Time.of(60, TimeUnit.SECONDS))
  .trigger(ContinuousProcessingTimeTrigger.of(Time.seconds(5)))
  .fold(Set[String]()) { (r, i) => r + i }
  .map { x => (new Timestamp(System.currentTimeMillis()), x.size) }
25 Tuple21 Hell

public void reduce(Iterable<Tuple2<Tuple3<String, String, Integer>, Tuple2<String, Integer>>> tupleGroup,
                   Collector<Tuple3<String, String, Double>> out) {
    for (Tuple2<Tuple3<String, String, Integer>, Tuple2<String, Integer>> tuple : tupleGroup) {
        Tuple3<String, String, Integer> idWC = tuple.f0;
        Tuple2<String, Integer> idTW = tuple.f1;
        out.collect(new Tuple3<String, String, Double>(idWC.f0, idWC.f1, (double) idWC.f2 / idTW.f1));
    }
}
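The fix for TupleN hell is named fields. As a hedged sketch (class and field names here are hypothetical, not from the talk), the same term-frequency division reads much better against small POJOs than against nested Tuple2/Tuple3 types:

```java
// A sketch of the POJO alternative to nested TupleN types: the same
// count / total division as the reduce() above, but with named fields
// instead of f0/f1/f2 positional access.
public class TermFrequency {

    // Replaces Tuple3<String, String, Integer>
    static class TermCount {
        final String doc, term;
        final int count;
        TermCount(String doc, String term, int count) {
            this.doc = doc; this.term = term; this.count = count;
        }
    }

    // Replaces Tuple2<String, Integer>
    static class DocTotal {
        final String doc;
        final int totalWords;
        DocTotal(String doc, int totalWords) {
            this.doc = doc; this.totalWords = totalWords;
        }
    }

    // Relative frequency of a term within its document.
    static double frequency(TermCount tc, DocTotal dt) {
        return (double) tc.count / dt.totalWords;
    }

    public static void main(String[] args) {
        System.out.println(frequency(new TermCount("doc1", "clay", 5),
                                     new DocTotal("doc1", 20))); // 0.25
    }
}
```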
26 Cascading-Flink
27 Cascading 3 Planner Converts the Cascading DAG into a Flink DAG. Around 5K lines of code. The DAG it plans looks like Cascading local mode.
28 Boundaries for Data Sets [Planner diagram of the Wikiwords flow: Hfs SequenceFile sources for terms and categories feed a web of Each/GroupBy/Every/CoGroup operations computing term count per doc, doc count per term, total doc count, term frequency, TF-IDF, top terms, and term-to-category scores, ending in Hfs TextLine sinks for term_scores and term_categories. Annotations on the diagram: Speed == no spill to disk; Task CPU is the same; Other than serde time.]
29 How Painful? Use the FlinkFlowConnector Flink Flow planner to convert DAG to job Uber jar vs. classic Hadoop jar Grungy details of submitting jobs to EMR cluster
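The connector swap itself can be as small as one line. A hedged sketch, reusing the flowDef from the earlier log-analysis example (the class name is taken from the slide; the cascading-flink project's actual package and constructor details may differ):

```java
// Before: classic Hadoop MapReduce planner
Flow hadoopFlow = new HadoopFlowConnector(properties).connect(flowDef);

// After: Flink planner (usage sketch, per the slide's FlinkFlowConnector)
Flow flinkFlow = new FlinkFlowConnector(properties).connect(flowDef);
```

The remaining effort is packaging (uber jar) and job submission, not workflow code.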
30 Wikiwords Workflow Find associations between terms and categories. For every page in Wikipedia, for every term, find the distance from the term to intra-wiki links. Then calculate statistics to find strong associations: probability unusually high that the term is close to the link.
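As a hedged, much-simplified sketch of that idea (the names and scoring here are hypothetical, not the talk's actual statistics): a term is "associated" with a link when an unusually high fraction of its occurrences fall within some distance of that link.

```java
// Simplified association sketch: what fraction of observed term-to-link
// distances fall within a closeness threshold? A high fraction suggests
// the term and link are strongly associated.
public class TermLinkAssociation {

    // Fraction of distances that are at most maxDistance.
    static double closeFraction(int[] distances, int maxDistance) {
        int close = 0;
        for (int d : distances) {
            if (d <= maxDistance) close++;
        }
        return (double) close / distances.length;
    }

    public static void main(String[] args) {
        int[] distances = {3, 5, 40, 7};  // term-to-link distances in words
        System.out.println(closeFraction(distances, 10)); // 0.75
    }
}
```

The real workflow does this per term, per article reference, over ~1 billion records, which is what makes the spill behavior at job boundaries matter.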
31 Timing Test Details EMR cluster with 5 i2.xlarge slaves. ~1 billion input records (term, article ref, distance). Hadoop MapReduce took 148 minutes. Flink took 98 minutes. So 1.5x faster: nice, but not great. Mostly due to spillage at the many boundaries.
32 Summary
33 If You're a Java ETL Dev And you have to deal with batch big data, then the Cascading API is a good fit. And using Flink typically gives better performance, while still using a standard Hadoop/YARN cluster.
34 Status of Cascading-Flink Still young, but surprisingly robust. Doesn't support Full or RightOuter HashJoins. Pending optimizations: tuple serialization, Flink improvements.
35 Better Planning [Diagram: Data Source 1 -> Operation 1 -> Group -> Operation 2 -> Join, with Data Source 2 feeding the Join directly. Defer the late-stage join to avoid premature resource usage.]
36 More questions? Feel free to contact me Check out Cascading, Flink, and Cascading-Flink
More informationBig Data Analytics using Apache Hadoop and Spark with Scala
Big Data Analytics using Apache Hadoop and Spark with Scala Training Highlights : 80% of the training is with Practical Demo (On Custom Cloudera and Ubuntu Machines) 20% Theory Portion will be important
More informationTOWARDS PORTABILITY AND BEYOND. Maximilian maximilianmichels.com DATA PROCESSING WITH APACHE BEAM
TOWARDS PORTABILITY AND BEYOND Maximilian Michels mxm@apache.org DATA PROCESSING WITH APACHE BEAM @stadtlegende maximilianmichels.com !2 BEAM VISION Write Pipeline Execute SDKs Runners Backends !3 THE
More informationBig Data. Big Data Analyst. Big Data Engineer. Big Data Architect
Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION
More informationBatch & Stream Graph Processing with Apache Flink. Vasia
Batch & Stream Graph Processing with Apache Flink Vasia Kalavri vasia@apache.org @vkalavri Outline Distributed Graph Processing Gelly: Batch Graph Processing with Flink Gelly-Stream: Continuous Graph
More informationLambda Architecture with Apache Spark
Lambda Architecture with Apache Spark Michael Hausenblas, Chief Data Engineer MapR First Galway Data Meetup, 2015-02-03 2015 MapR Technologies 2015 MapR Technologies 1 Polyglot Processing 2015 2014 MapR
More informationApplied Spark. From Concepts to Bitcoin Analytics. Andrew F.
Applied Spark From Concepts to Bitcoin Analytics Andrew F. Hart ahart@apache.org @andrewfhart My Day Job CTO, Pogoseat Upgrade technology for live events 3/28/16 QCON-SP Andrew Hart 2 Additionally Member,
More informationImproving the MapReduce Big Data Processing Framework
Improving the MapReduce Big Data Processing Framework Gistau, Reza Akbarinia, Patrick Valduriez INRIA & LIRMM, Montpellier, France In collaboration with Divyakant Agrawal, UCSB Esther Pacitti, UM2, LIRMM
More informationCS 220: Introduction to Parallel Computing. Arrays. Lecture 4
CS 220: Introduction to Parallel Computing Arrays Lecture 4 Note: Windows I updated the VM image on the website It now includes: Sublime text Gitkraken (a nice git GUI) And the git command line tools 1/30/18
More informationHadoop Online Training
Hadoop Online Training IQ training facility offers Hadoop Online Training. Our Hadoop trainers come with vast work experience and teaching skills. Our Hadoop training online is regarded as the one of the
More informationClustering Documents. Document Retrieval. Case Study 2: Document Retrieval
Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April, 2017 Sham Kakade 2017 1 Document Retrieval n Goal: Retrieve
More informationDistributed Systems. 22. Spark. Paul Krzyzanowski. Rutgers University. Fall 2016
Distributed Systems 22. Spark Paul Krzyzanowski Rutgers University Fall 2016 November 26, 2016 2015-2016 Paul Krzyzanowski 1 Apache Spark Goal: generalize MapReduce Similar shard-and-gather approach to
More informationSpark, Shark and Spark Streaming Introduction
Spark, Shark and Spark Streaming Introduction Tushar Kale tusharkale@in.ibm.com June 2015 This Talk Introduction to Shark, Spark and Spark Streaming Architecture Deployment Methodology Performance References
More informationPrincipal Software Engineer Red Hat Emerging Technology June 24, 2015
USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton Principal Software Engineer Red Hat Emerging Technology June 24, 2015 ABOUT ME Distributed systems and data science in Red Hat's Emerging
More informationLogging on to the Hadoop Cluster Nodes. To login to the Hadoop cluster in ROGER, a user needs to login to ROGER first, for example:
Hadoop User Guide Logging on to the Hadoop Cluster Nodes To login to the Hadoop cluster in ROGER, a user needs to login to ROGER first, for example: ssh username@roger-login.ncsa. illinois.edu after entering
More informationBlended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)
Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance
More informationELTMaestro for Spark: Data integration on clusters
Introduction Spark represents an important milestone in the effort to make computing on clusters practical and generally available. Hadoop / MapReduce, introduced the early 2000s, allows clusters to be
More informationData-Intensive Distributed Computing
Data-Intensive Distributed Computing CS 451/651 431/631 (Winter 2018) Part 5: Analyzing Relational Data (1/3) February 8, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo
More informationPig on Spark project proposes to add Spark as an execution engine option for Pig, similar to current options of MapReduce and Tez.
Pig on Spark Mohit Sabharwal and Xuefu Zhang, 06/30/2015 Objective The initial patch of Pig on Spark feature was delivered by Sigmoid Analytics in September 2014. Since then, there has been effort by a
More informationFaster ETL Workflows using Apache Pig & Spark. - Praveen Rachabattuni,
Faster ETL Workflows using Apache Pig & Spark - Praveen Rachabattuni, Sigmoid @praveenr019 About me Apache Pig committer and Pig on Spark project lead. OUR CUSTOMERS Why pig on spark? Spark shell (scala),
More informationTopics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples
Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?
More informationParallel Nested Loops
Parallel Nested Loops For each tuple s i in S For each tuple t j in T If s i =t j, then add (s i,t j ) to output Create partitions S 1, S 2, T 1, and T 2 Have processors work on (S 1,T 1 ), (S 1,T 2 ),
More informationHadoop Execution Environment
Hadoop Execution Environment Hadoop Execution Environment Learn about execution environments in Hadoop. Limitations of classic MapReduce framework. New frameworks like YARN, Tez, Spark to compliment classic
More informationParallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011
Parallel Nested Loops Parallel Partition-Based For each tuple s i in S For each tuple t j in T If s i =t j, then add (s i,t j ) to output Create partitions S 1, S 2, T 1, and T 2 Have processors work on
More information