CS 378 Big Data Programming. Lecture 20 Spark Basics

Size: px
Start display at page:

Download "CS 378 Big Data Programming. Lecture 20 Spark Basics"

Transcription

1 CS 378 Big Data Programming Lecture 20 Spark Basics

2 Review Assignment 9 Download and run Spark Controlling logging output CS Fall 2018 Big Data Programming 2

3 Review - Spark RDDs Resilient Distributed Dataset One RDD has one or more pariions ParIIons are distributed across machines Rebuilt from base data on failure (versus replicaion) Lazy evaluaion created/computed on demand RDD types offer various funcions map, reduce groupby, reducebykey joins (inner, letouterjoin, rightouterjoin) filter, sample CS Fall 2018 Big Data Programming 3

4 Review - Spark Programs A Spark program defines: RDDs Input from external sources Produced by a transformaion TransformaIons Produce a new RDD from the input RDD(s) AcIons Compute something from the input RDD Return non-rdd objects (e.g., a number) Write an RDD to external storage CS Fall 2018 Big Data Programming 4

5 Spark AcIons TransformaIons define data flow (data lineage) AcIons on RDDs iniiate computaion CounIng rdd.count() Extract some elements rdd.take(10) Get all elements rdd.collect() Careful, as this assembles the enire RDD. Filter first. CS Fall 2018 Big Data Programming 5

6 Spark TransformaIons TransformaIons take funcions as arguments This funcion defines the details of the transform Filter Apply a funcion to each element of an RDD, return those elements that evaluate to true Source RDD element type: T Result RDD element type: T Java funcion (class) type: Function<T, Boolean> Java method: Boolean call(t t) CS Fall 2018 Big Data Programming 6

7 Spark TransformaIons Map Apply a funcion to each element of an RDD, return the result of applying the funcion Source RDD element type: T Result RDD element type: R Java funcion (class) type: Function<T, R> Java method: R call(t t) CS Fall 2018 Big Data Programming 7

8 Spark TransformaIons Flat Map Apply a funcion to each element of an RDD, return the result of applying the funcion (an Iterator) Source RDD element type: T Result RDD element type: R Java funcion (class) type: FlatMapFunction<T, R> Java method: Iterator<R> call(t t) Spark then iterates through each iterable returned, to create an RDD with element type R CS Fall 2018 Big Data Programming 8

9 Spark TransformaIons Map to Pair Apply a funcion to each element of an RDD, return the result of applying the funcion (a key and a value) Source RDD element type: T Result RDD element type: <K, V> Java funcion (class) type: PairFunction<T, K, V> Java method: Tuple2<K, V> call(t t) CS Fall 2018 Big Data Programming 9

10 Spark TransformaIons Reduce by Key Apply a funcion to pairs of values with the same key, return the result (key and a reduced value) Source RDD element type: <K, V> Result RDD (JavaPairRDD) element type: <K, V> Java funcion (class) type: Function2<V, V, V> Java method: V call(v v1, V v2) CS Fall 2018 Big Data Programming 10

11 Spark Code Example WordCount again AlternaIve implementaions of WordCount CS Fall 2018 Big Data Programming 11

12 Assignment 10 Inverted Index in Spark Data set example Genesis:46:14 And the sons of Zebulun; Sered, and Elon, and Jahleel. Output for this document (one verse) How many outputs for this verse? What should the keys be? What should the value(s) be? CS Fall 2018 Big Data Programming 12

13 Assignment 10 Input record parsing Document-ID: book:chapter:verse Text string that is the verse Verse text parsing: Remove punctuaion, lowercase all words Output: Word, list of book:chapter:verse (no duplicates) Sort the book:chapter:verse references Extra credit: output a file ordered by the number of verses referenced for a word (descending) CS Fall 2018 Big Data Programming 13

12/04/2018. Spark supports also RDDs of key-value pairs. Many applications are based on pair RDDs Pair RDDs allow

12/04/2018. Spark supports also RDDs of key-value pairs. Many applications are based on pair RDDs Pair RDDs allow Spark supports also RDDs of key-value pairs They are called pair RDDs Pair RDDs are characterized by specific operations reducebykey(), join, etc. Obviously, pair RDDs are characterized also by the operations

More information

Spark supports also RDDs of key-value pairs

Spark supports also RDDs of key-value pairs 1 Spark supports also RDDs of key-value pairs They are called pair RDDs Pair RDDs are characterized by specific operations reducebykey(), join, etc. Obviously, pair RDDs are characterized also by the operations

More information

Outline. CS-562 Introduction to data analysis using Apache Spark

Outline. CS-562 Introduction to data analysis using Apache Spark Outline Data flow vs. traditional network programming What is Apache Spark? Core things of Apache Spark RDD CS-562 Introduction to data analysis using Apache Spark Instructor: Vassilis Christophides T.A.:

More information

10/04/2019. Spark supports also RDDs of key-value pairs. Many applications are based on pair RDDs Pair RDDs allow

10/04/2019. Spark supports also RDDs of key-value pairs. Many applications are based on pair RDDs Pair RDDs allow Spark supports also RDDs of key-value pairs They are called pair RDDs Pair RDDs are characterized by specific operations reducebykey(), join(), etc. Obviously, pair RDDs are characterized also by the operations

More information

CS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University

CS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [SPARK] Frequently asked questions from the previous class survey 48-bit bookending in Bzip2: does the number have to be special? Spark seems to have too many

More information

Apache Spark. CS240A T Yang. Some of them are based on P. Wendell s Spark slides

Apache Spark. CS240A T Yang. Some of them are based on P. Wendell s Spark slides Apache Spark CS240A T Yang Some of them are based on P. Wendell s Spark slides Parallel Processing using Spark+Hadoop Hadoop: Distributed file system that connects machines. Mapreduce: parallel programming

More information

CSE 414: Section 7 Parallel Databases. November 8th, 2018

CSE 414: Section 7 Parallel Databases. November 8th, 2018 CSE 414: Section 7 Parallel Databases November 8th, 2018 Agenda for Today This section: Quick touch up on parallel databases Distributed Query Processing In this class, only shared-nothing architecture

More information

CSE 444: Database Internals. Lecture 23 Spark

CSE 444: Database Internals. Lecture 23 Spark CSE 444: Database Internals Lecture 23 Spark References Spark is an open source system from Berkeley Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei

More information

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems Announcements CompSci 516 Database Systems Lecture 12 - and Spark Practice midterm posted on sakai First prepare and then attempt! Midterm next Wednesday 10/11 in class Closed book/notes, no electronic

More information

Spark Overview. Professor Sasu Tarkoma.

Spark Overview. Professor Sasu Tarkoma. Spark Overview 2015 Professor Sasu Tarkoma www.cs.helsinki.fi Apache Spark Spark is a general-purpose computing framework for iterative tasks API is provided for Java, Scala and Python The model is based

More information

Data at the Speed of your Users

Data at the Speed of your Users Data at the Speed of your Users Apache Cassandra and Spark for simple, distributed, near real-time stream processing. GOTO Copenhagen 2014 Rustam Aliyev Solution Architect at. @rstml Big Data? Photo: Flickr

More information

CompSci 516: Database Systems

CompSci 516: Database Systems CompSci 516 Database Systems Lecture 12 Map-Reduce and Spark Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Announcements Practice midterm posted on sakai First prepare and

More information

Lecture 30: Distributed Map-Reduce using Hadoop and Spark Frameworks

Lecture 30: Distributed Map-Reduce using Hadoop and Spark Frameworks COMP 322: Fundamentals of Parallel Programming Lecture 30: Distributed Map-Reduce using Hadoop and Spark Frameworks Mack Joyner and Zoran Budimlić {mjoyner, zoran}@rice.edu http://comp322.rice.edu COMP

More information

CS 696 Intro to Big Data: Tools and Methods Fall Semester, 2016 Doc 25 Spark 2 Nov 29, 2016

CS 696 Intro to Big Data: Tools and Methods Fall Semester, 2016 Doc 25 Spark 2 Nov 29, 2016 CS 696 Intro to Big Data: Tools and Methods Fall Semester, 2016 Doc 25 Spark 2 Nov 29, 2016 Copyright, All rights reserved. 2016 SDSU & Roger Whitney, 5500 Campanile Drive, San Diego, CA 92182-7700 USA.

More information

MapReduce review. Spark and distributed data processing. Who am I? Today s Talk. Reynold Xin

MapReduce review. Spark and distributed data processing. Who am I? Today s Talk. Reynold Xin Who am I? Reynold Xin Stanford CS347 Guest Lecture Spark and distributed data processing PMC member, Apache Spark Cofounder & Chief Architect, Databricks PhD on leave (ABD), UC Berkeley AMPLab Reynold

More information

Chapter 4: Apache Spark

Chapter 4: Apache Spark Chapter 4: Apache Spark Lecture Notes Winter semester 2016 / 2017 Ludwig-Maximilians-University Munich PD Dr. Matthias Renz 2015, Based on lectures by Donald Kossmann (ETH Zürich), as well as Jure Leskovec,

More information

CSC 261/461 Database Systems Lecture 24. Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101

CSC 261/461 Database Systems Lecture 24. Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101 CSC 261/461 Database Systems Lecture 24 Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101 Announcements Term Paper due on April 20 April 23 Project 1 Milestone 4 is out Due on 05/03 But I would

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

A Tutorial on Apache Spark

A Tutorial on Apache Spark A Tutorial on Apache Spark A Practical Perspective By Harold Mitchell The Goal Learning Outcomes The Goal Learning Outcomes NOTE: The setup, installation, and examples assume Windows user Learn the following:

More information

Introduction to Spark

Introduction to Spark Introduction to Spark Outlines A brief history of Spark Programming with RDDs Transformations Actions A brief history Limitations of MapReduce MapReduce use cases showed two major limitations: Difficulty

More information

COMP 322: Fundamentals of Parallel Programming. Lecture 37: Distributed Computing, Apache Spark

COMP 322: Fundamentals of Parallel Programming. Lecture 37: Distributed Computing, Apache Spark COMP 322: Fundamentals of Parallel Programming Lecture 37: Distributed Computing, Apache Spark Vivek Sarkar, Shams Imam Department of Computer Science, Rice University vsarkar@rice.edu, shams@rice.edu

More information

CS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University

CS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [SPARK] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Return type for collect()? Can

More information

Introduction to Apache Spark. Patrick Wendell - Databricks

Introduction to Apache Spark. Patrick Wendell - Databricks Introduction to Apache Spark Patrick Wendell - Databricks What is Spark? Fast and Expressive Cluster Computing Engine Compatible with Apache Hadoop Efficient General execution graphs In-memory storage

More information

L3: Spark & RDD. CDS Department of Computational and Data Sciences. Department of Computational and Data Sciences

L3: Spark & RDD. CDS Department of Computational and Data Sciences. Department of Computational and Data Sciences Indian Institute of Science Bangalore, India भ रत य व ज ञ न स स थ न ब गल र, भ रत Department of Computational and Data Sciences L3: Spark & RDD Department of Computational and Data Science, IISc, 2016 This

More information

Apache Spark Internals

Apache Spark Internals Apache Spark Internals Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Apache Spark Internals 1 / 80 Acknowledgments & Sources Sources Research papers: https://spark.apache.org/research.html Presentations:

More information

Spark and distributed data processing

Spark and distributed data processing Stanford CS347 Guest Lecture Spark and distributed data processing Reynold Xin @rxin 2016-05-23 Who am I? Reynold Xin PMC member, Apache Spark Cofounder & Chief Architect, Databricks PhD on leave (ABD),

More information

MapReduce, Hadoop and Spark. Bompotas Agorakis

MapReduce, Hadoop and Spark. Bompotas Agorakis MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)

More information

COSC 6339 Big Data Analytics. Introduction to Spark. Edgar Gabriel Fall What is SPARK?

COSC 6339 Big Data Analytics. Introduction to Spark. Edgar Gabriel Fall What is SPARK? COSC 6339 Big Data Analytics Introduction to Spark Edgar Gabriel Fall 2018 What is SPARK? In-Memory Cluster Computing for Big Data Applications Fixes the weaknesses of MapReduce Iterative applications

More information

Big Data Infrastructures & Technologies

Big Data Infrastructures & Technologies Big Data Infrastructures & Technologies Spark and MLLIB OVERVIEW OF SPARK What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop Improves efficiency through: In-memory

More information

Data processing in Apache Spark

Data processing in Apache Spark Data processing in Apache Spark Pelle Jakovits 8 October, 2014, Tartu Outline Introduction to Spark Resilient Distributed Data (RDD) Available data operations Examples Advantages and Disadvantages Frameworks

More information

Spark Programming with RDD

Spark Programming with RDD Indian Institute of Science Bangalore, India भ रत य व ज ञ न स स थ न ब गल र, भ रत DS256:Jan17 (3:1) Department of Computational and Data Sciences Spark Programming with RDD Yogesh Simmhan 21 M a r, 2017

More information

Big data systems 12/8/17

Big data systems 12/8/17 Big data systems 12/8/17 Today Basic architecture Two levels of scheduling Spark overview Basic architecture Cluster Manager Cluster Cluster Manager 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores

More information

Analytics in Spark. Yanlei Diao Tim Hunter. Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig

Analytics in Spark. Yanlei Diao Tim Hunter. Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig Analytics in Spark Yanlei Diao Tim Hunter Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig Outline 1. A brief history of Big Data and Spark 2. Technical summary of Spark 3. Unified analytics

More information

CS435 Introduction to Big Data FALL 2018 Colorado State University. 10/22/2018 Week 10-A Sangmi Lee Pallickara. FAQs.

CS435 Introduction to Big Data FALL 2018 Colorado State University. 10/22/2018 Week 10-A Sangmi Lee Pallickara. FAQs. 10/22/2018 - FALL 2018 W10.A.0.0 10/22/2018 - FALL 2018 W10.A.1 FAQs Term project: Proposal 5:00PM October 23, 2018 PART 1. LARGE SCALE DATA ANALYTICS IN-MEMORY CLUSTER COMPUTING Computer Science, Colorado

More information

Data processing in Apache Spark

Data processing in Apache Spark Data processing in Apache Spark Pelle Jakovits 5 October, 2015, Tartu Outline Introduction to Spark Resilient Distributed Datasets (RDD) Data operations RDD transformations Examples Fault tolerance Frameworks

More information

Data-intensive computing systems

Data-intensive computing systems Data-intensive computing systems University of Verona Computer Science Department Damiano Carra Acknowledgements q Credits Part of the course material is based on slides provided by the following authors

More information

CS246: Mining Massive Datasets Crash Course in Spark

CS246: Mining Massive Datasets Crash Course in Spark CS246: Mining Massive Datasets Crash Course in Spark Daniel Templeton 1 The Plan Getting started with Spark RDDs Commonly useful operations Using python Using Java Using Scala Help session 2 Go download

More information

THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION

THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Apache Spark Lorenzo Di Gaetano THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION What is Apache Spark? A general purpose framework for big data processing It interfaces

More information

Distributed Systems. Distributed computation with Spark. Abraham Bernstein, Ph.D.

Distributed Systems. Distributed computation with Spark. Abraham Bernstein, Ph.D. Distributed Systems Distributed computation with Spark Abraham Bernstein, Ph.D. Course material based on: - Based on slides by Reynold Xin, Tudor Lapusan - Some additions by Johannes Schneider How good

More information

Introduction to MapReduce Algorithms and Analysis

Introduction to MapReduce Algorithms and Analysis Introduction to MapReduce Algorithms and Analysis Jeff M. Phillips October 25, 2013 Trade-Offs Massive parallelism that is very easy to program. Cheaper than HPC style (uses top of the line everything)

More information

Lambda Architecture with Apache Spark

Lambda Architecture with Apache Spark Lambda Architecture with Apache Spark Michael Hausenblas, Chief Data Engineer MapR First Galway Data Meetup, 2015-02-03 2015 MapR Technologies 2015 MapR Technologies 1 Polyglot Processing 2015 2014 MapR

More information

Interactive Graph Analytics with Spark

Interactive Graph Analytics with Spark Interactive Graph Analytics with Spark A talk by Daniel Darabos from Lynx Analytics about the design and implementation of the LynxKite analytics application About LynxKite Analytics web application AngularJS

More information

Data processing in Apache Spark

Data processing in Apache Spark Data processing in Apache Spark Pelle Jakovits 21 October, 2015, Tartu Outline Introduction to Spark Resilient Distributed Datasets (RDD) Data operations RDD transformations Examples Fault tolerance Streaming

More information

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel

More information

CarbonData: Spark Integration And Carbon Query Flow

CarbonData: Spark Integration And Carbon Query Flow CarbonData: Spark Integration And Carbon Query Flow SparkSQL + CarbonData: 2 Carbon-Spark Integration Built-in Spark integration Spark 1.5, 1.6, 2.1 Interface SQL DataFrame API Integration: Format Query

More information

15.1 Data flow vs. traditional network programming

15.1 Data flow vs. traditional network programming CME 323: Distributed Algorithms and Optimization, Spring 2017 http://stanford.edu/~rezab/dao. Instructor: Reza Zadeh, Matroid and Stanford. Lecture 15, 5/22/2017. Scribed by D. Penner, A. Shoemaker, and

More information

Spark.jl: Resilient Distributed Datasets in Julia

Spark.jl: Resilient Distributed Datasets in Julia Spark.jl: Resilient Distributed Datasets in Julia Dennis Wilson, Martín Martínez Rivera, Nicole Power, Tim Mickel [dennisw, martinmr, npower, tmickel]@mit.edu INTRODUCTION 1 Julia is a new high level,

More information

Apache Spark: Hands-on Session A.A. 2016/17

Apache Spark: Hands-on Session A.A. 2016/17 Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica Apache Spark: Hands-on Session A.A. 2016/17 Matteo Nardelli Laurea Magistrale in Ingegneria Informatica

More information

We consider the general additive objective function that we saw in previous lectures: n F (w; x i, y i ) i=1

We consider the general additive objective function that we saw in previous lectures: n F (w; x i, y i ) i=1 CME 323: Distributed Algorithms and Optimization, Spring 2015 http://stanford.edu/~rezab/dao. Instructor: Reza Zadeh, Matroid and Stanford. Lecture 13, 5/9/2016. Scribed by Alfredo Láinez, Luke de Oliveira.

More information

Apache Spark: Hands-on Session A.A. 2017/18

Apache Spark: Hands-on Session A.A. 2017/18 Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica Apache Spark: Hands-on Session A.A. 2017/18 Matteo Nardelli Laurea Magistrale in Ingegneria Informatica

More information

Database Systems CSE 414

Database Systems CSE 414 Database Systems CSE 414 Lecture 26: Spark CSE 414 - Spring 2017 1 HW8 due next Fri Announcements Extra office hours today: Rajiv @ 6pm in CSE 220 No lecture Monday (holiday) Guest lecture Wednesday Kris

More information

Distributed Systems. 22. Spark. Paul Krzyzanowski. Rutgers University. Fall 2016

Distributed Systems. 22. Spark. Paul Krzyzanowski. Rutgers University. Fall 2016 Distributed Systems 22. Spark Paul Krzyzanowski Rutgers University Fall 2016 November 26, 2016 2015-2016 Paul Krzyzanowski 1 Apache Spark Goal: generalize MapReduce Similar shard-and-gather approach to

More information

Evolution From Shark To Spark SQL:

Evolution From Shark To Spark SQL: Evolution From Shark To Spark SQL: Preliminary Analysis and Qualitative Evaluation Xinhui Tian and Xiexuan Zhou Institute of Computing Technology, Chinese Academy of Sciences and University of Chinese

More information

CME 323: Distributed Algorithms and Optimization Instructor: Reza Zadeh HW#3 - Due at the beginning of class May 18th.

CME 323: Distributed Algorithms and Optimization Instructor: Reza Zadeh HW#3 - Due at the beginning of class May 18th. CME 323: Distributed Algorithms and Optimization Instructor: Reza Zadeh (rezab@stanford.edu) HW#3 - Due at the beginning of class May 18th. 1. Download the following materials: Slides: http://stanford.edu/~rezab/dao/slides/itas_workshop.pdf

More information

What is sorting? Lecture 36: How can computation sort data in order for you? Why is sorting important? What is sorting? 11/30/10

What is sorting? Lecture 36: How can computation sort data in order for you? Why is sorting important? What is sorting? 11/30/10 // CS Introduction to Computation " UNIVERSITY of WISCONSIN-MADISON Computer Sciences Department Professor Andrea Arpaci-Dusseau Fall Lecture : How can computation sort data in order for you? What is sorting?

More information

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University CS 555: DISTRIBUTED SYSTEMS [SPARK] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey If an RDD is read once, transformed will Spark

More information

The detailed Spark programming guide is available at:

The detailed Spark programming guide is available at: Aims This exercise aims to get you to: Analyze data using Spark shell Monitor Spark tasks using Web UI Write self-contained Spark applications using Scala in Eclipse Background Spark is already installed

More information

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance

More information

CS 112 Introduction to Computing II. Wayne Snyder Computer Science Department Boston University

CS 112 Introduction to Computing II. Wayne Snyder Computer Science Department Boston University 9/5/6 CS Introduction to Computing II Wayne Snyder Department Boston University Today: Arrays (D and D) Methods Program structure Fields vs local variables Next time: Program structure continued: Classes

More information

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,

More information

Spark & Spark SQL. High- Speed In- Memory Analytics over Hadoop and Hive Data. Instructor: Duen Horng (Polo) Chau

Spark & Spark SQL. High- Speed In- Memory Analytics over Hadoop and Hive Data. Instructor: Duen Horng (Polo) Chau CSE 6242 / CX 4242 Data and Visual Analytics Georgia Tech Spark & Spark SQL High- Speed In- Memory Analytics over Hadoop and Hive Data Instructor: Duen Horng (Polo) Chau Slides adopted from Matei Zaharia

More information

An Introduction to Big Data Analysis using Spark

An Introduction to Big Data Analysis using Spark An Introduction to Big Data Analysis using Spark Mohamad Jaber American University of Beirut - Faculty of Arts & Sciences - Department of Computer Science May 17, 2017 Mohamad Jaber (AUB) Spark May 17,

More information

WHAT S NEW IN SPARK 2.0: STRUCTURED STREAMING AND DATASETS

WHAT S NEW IN SPARK 2.0: STRUCTURED STREAMING AND DATASETS WHAT S NEW IN SPARK 2.0: STRUCTURED STREAMING AND DATASETS Andrew Ray StampedeCon 2016 Silicon Valley Data Science is a boutique consulting firm focused on transforming your business through data science

More information

Processing of big data with Apache Spark

Processing of big data with Apache Spark Processing of big data with Apache Spark JavaSkop 18 Aleksandar Donevski AGENDA What is Apache Spark? Spark vs Hadoop MapReduce Application Requirements Example Architecture Application Challenges 2 WHAT

More information

A very short introduction

A very short introduction A very short introduction General purpose compute engine for clusters batch / interactive / streaming used by and many others History developed in 2009 @ UC Berkeley joined the Apache foundation in 2013

More information

WHITE PAPER. Apache Spark: RDD, DataFrame and Dataset. API comparison and Performance Benchmark in Spark 2.1 and Spark 1.6.3

WHITE PAPER. Apache Spark: RDD, DataFrame and Dataset. API comparison and Performance Benchmark in Spark 2.1 and Spark 1.6.3 WHITE PAPER Apache Spark: RDD, DataFrame and Dataset API comparison and Performance Benchmark in Spark 2.1 and Spark 1.6.3 Prepared by: Eyal Edelman, Big Data Practice Lead Michael Birch, Big Data and

More information

Big Data Infrastructures & Technologies Hadoop Streaming Revisit.

Big Data Infrastructures & Technologies Hadoop Streaming Revisit. Big Data Infrastructures & Technologies Hadoop Streaming Revisit ENRON Mapper ENRON Mapper Output (Excerpt) acomnes@enron.com blake.walker@enron.com edward.snowden@cia.gov alex.berenson@nyt.com ENRON Reducer

More information

Data Engineering. How MapReduce Works. Shivnath Babu

Data Engineering. How MapReduce Works. Shivnath Babu Data Engineering How MapReduce Works Shivnath Babu Lifecycle of a MapReduce Job Map function Reduce function Run this program as a MapReduce job Lifecycle of a MapReduce Job Map function Reduce function

More information

An Introduction to Apache Spark

An Introduction to Apache Spark An Introduction to Apache Spark Anastasios Skarlatidis @anskarl Software Engineer/Researcher IIT, NCSR "Demokritos" Outline Part I: Getting to know Spark Part II: Basic programming Part III: Spark under

More information

Big Data Analytics with Apache Spark. Nastaran Fatemi

Big Data Analytics with Apache Spark. Nastaran Fatemi Big Data Analytics with Apache Spark Nastaran Fatemi Apache Spark Throughout this part of the course we will use the Apache Spark framework for distributed data-parallel programming. Spark implements a

More information

6.830 Lecture Spark 11/15/2017

6.830 Lecture Spark 11/15/2017 6.830 Lecture 19 -- Spark 11/15/2017 Recap / finish dynamo Sloppy Quorum (healthy N) Dynamo authors don't think quorums are sufficient, for 2 reasons: - Decreased durability (want to write all data at

More information

Corpus methods in linguistics and NLP Lecture 7: Programming for large-scale data processing

Corpus methods in linguistics and NLP Lecture 7: Programming for large-scale data processing Corpus methods in linguistics and NLP Lecture 7: Programming for large-scale data processing Richard Johansson December 1, 2015 today's lecture as you've seen, processing large corpora can take time! for

More information

SPARK PAIR RDD JOINING ZIPPING, SORTING AND GROUPING

SPARK PAIR RDD JOINING ZIPPING, SORTING AND GROUPING SPARK PAIR RDD JOINING ZIPPING, SORTING AND GROUPING By www.hadoopexam.com Note: These instructions should be used with the HadoopExam Apache Spark: Professional Trainings. Where it is executed and you

More information

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture Hadoop 1.0 Architecture Introduction to Hadoop & Big Data Hadoop Evolution Hadoop Architecture Networking Concepts Use cases

More information

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION

More information

Fast, Interactive, Language-Integrated Cluster Computing

Fast, Interactive, Language-Integrated Cluster Computing Spark Fast, Interactive, Language-Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica www.spark-project.org

More information

In-Memory Processing with Apache Spark. Vincent Leroy

In-Memory Processing with Apache Spark. Vincent Leroy In-Memory Processing with Apache Spark Vincent Leroy Sources Resilient Distributed Datasets, Henggang Cui Coursera IntroducBon to Apache Spark, University of California, Databricks Datacenter OrganizaBon

More information

Parallelism with the PDI AEL Spark Engine. Jonathan Jarvis Pentaho Senior Solution Engineer, Hitachi Vantara

Parallelism with the PDI AEL Spark Engine. Jonathan Jarvis Pentaho Senior Solution Engineer, Hitachi Vantara Parallelism with the PDI AEL Spark Engine Jonathan Jarvis Pentaho Senior Solution Engineer, Hitachi Vantara Agenda Why AEL Spark? Kettle Engine Spark 101 PDI on Spark Takeaways Why AEL Spark? Spark has

More information

CS 378 Big Data Programming

CS 378 Big Data Programming CS 378 Big Data Programming Lecture 8 Complex Writable Types AVRO CS 378 - Fall 2017 Big Data Programming 1 Review Assignment 3 - InvertedIndex We ll look at implementapon details of: Mapper Combiner Reducer

More information

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second

More information

Project Design. Version May, Computer Science Department, Texas Christian University

Project Design. Version May, Computer Science Department, Texas Christian University Project Design Version 4.0 2 May, 2016 2015-2016 Computer Science Department, Texas Christian University Revision Signatures By signing the following document, the team member is acknowledging that he

More information

Programming Systems for Big Data

Programming Systems for Big Data Programming Systems for Big Data CS315B Lecture 17 Including material from Kunle Olukotun Prof. Aiken CS 315B Lecture 17 1 Big Data We ve focused on parallel programming for computational science There

More information

Databases and Big Data Today. CS634 Class 22

Databases and Big Data Today. CS634 Class 22 Databases and Big Data Today CS634 Class 22 Current types of Databases SQL using relational tables: still very important! NoSQL, i.e., not using relational tables: term NoSQL popular since about 2007.

More information

Structured Streaming. Big Data Analysis with Scala and Spark Heather Miller

Structured Streaming. Big Data Analysis with Scala and Spark Heather Miller Structured Streaming Big Data Analysis with Scala and Spark Heather Miller Why Structured Streaming? DStreams were nice, but in the last session, aggregation operations like a simple word count quickly

More information

CS111: PROGRAMMING LANGUAGE II

CS111: PROGRAMMING LANGUAGE II CS111: PROGRAMMING LANGUAGE II Computer Science Department Lecture 1(c): Java Basics (II) Lecture Contents Java basics (part II) Conditions Loops Methods Conditions & Branching Conditional Statements A

More information

New Developments in Spark

New Developments in Spark New Developments in Spark And Rethinking APIs for Big Data Matei Zaharia and many others What is Spark? Unified computing engine for big data apps > Batch, streaming and interactive Collection of high-level

More information

Distributed Computing with Spark

Distributed Computing with Spark Distributed Computing with Spark Reza Zadeh Thanks to Matei Zaharia Outline Data flow vs. traditional network programming Limitations of MapReduce Spark computing engine Numerical computing on Spark Ongoing

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Distributed Computation Models

Distributed Computation Models Distributed Computation Models SWE 622, Spring 2017 Distributed Software Engineering Some slides ack: Jeff Dean HW4 Recap https://b.socrative.com/ Class: SWE622 2 Review Replicating state machines Case

More information

Streaming vs. batch processing

Streaming vs. batch processing COSC 6339 Big Data Analytics Introduction to Spark (III) 2 nd homework assignment Edgar Gabriel Fall 2018 Streaming vs. batch processing Batch processing: Execution of a compute job without manual intervention

More information

Word Count. Read text into an RDD

Word Count. Read text into an RDD Word Count Counting the number of occurances of words in a text is one of the most popular first eercises when learning Map-Reduce Programming. It is the equivalent to Hello World! in regular programming.

More information

Prelim 2. CS 2110, November 20, 2014, 7:30 PM Extra Total Question True/False Short Answer

Prelim 2. CS 2110, November 20, 2014, 7:30 PM Extra Total Question True/False Short Answer Prelim 2 CS 2110, November 20, 2014, 7:30 PM 1 2 3 4 5 Extra Total Question True/False Short Answer Complexity Induction Trees Graphs Extra Credit Max 20 10 15 25 30 5 100 Score Grader The exam is closed

More information

Project Compiler. CS031 TA Help Session November 28, 2011

Project Compiler. CS031 TA Help Session November 28, 2011 Project Compiler CS031 TA Help Session November 28, 2011 Motivation Generally, it s easier to program in higher-level languages than in assembly. Our goal is to automate the conversion from a higher-level

More information

Lecture 09. Spark for batch and streaming processing FREDERICK AYALA-GÓMEZ. CS-E4610 Modern Database Systems

Lecture 09. Spark for batch and streaming processing FREDERICK AYALA-GÓMEZ. CS-E4610 Modern Database Systems CS-E4610 Modern Database Systems 05.01.2018-05.04.2018 Lecture 09 Spark for batch and streaming processing FREDERICK AYALA-GÓMEZ PHD STUDENT I N COMPUTER SCIENCE, ELT E UNIVERSITY VISITING R ESEA RCHER,

More information

An Introduction to Apache Spark Big Data Madison: 29 July William Red Hat, Inc.

An Introduction to Apache Spark Big Data Madison: 29 July William Red Hat, Inc. An Introduction to Apache Spark Big Data Madison: 29 July 2014 William Benton @willb Red Hat, Inc. About me At Red Hat for almost 6 years, working on distributed computing Currently contributing to Spark,

More information

Static Semantics. Lecture 15. (Notes by P. N. Hilfinger and R. Bodik) 2/29/08 Prof. Hilfinger, CS164 Lecture 15 1

Static Semantics. Lecture 15. (Notes by P. N. Hilfinger and R. Bodik) 2/29/08 Prof. Hilfinger, CS164 Lecture 15 1 Static Semantics Lecture 15 (Notes by P. N. Hilfinger and R. Bodik) 2/29/08 Prof. Hilfinger, CS164 Lecture 15 1 Current Status Lexical analysis Produces tokens Detects & eliminates illegal tokens Parsing

More information

Java 2. Course Outcome Summary. Western Technical College. Course Information. Course History. Course Competencies

Java 2. Course Outcome Summary. Western Technical College. Course Information. Course History. Course Competencies Western Technical College 10152155 Java 2 Course Outcome Summary Course Information Description Career Cluster Instructional Level Total Credits 4.00 Total Hours 90.00 The goal as programmers, is to create

More information

About Codefrux While the current trends around the world are based on the internet, mobile and its applications, we try to make the most out of it. As for us, we are a well established IT professionals

More information

Page 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24

Page 1. Goals for Today Background of Cloud Computing Sources Driving Big Data CS162 Operating Systems and Systems Programming Lecture 24 Goals for Today" CS162 Operating Systems and Systems Programming Lecture 24 Capstone: Cloud Computing" Distributed systems Cloud Computing programming paradigms Cloud Computing OS December 2, 2013 Anthony

More information

CS 61C: Great Ideas in Computer Architecture. Lecture 2: Numbers & C Language. Krste Asanović & Randy Katz

CS 61C: Great Ideas in Computer Architecture. Lecture 2: Numbers & C Language. Krste Asanović & Randy Katz CS 61C: Great Ideas in Computer Architecture Lecture 2: Numbers & C Language Krste Asanović & Randy Katz http://inst.eecs.berkeley.edu/~cs61c Numbers wrap-up This is not on the exam! Break C Primer Administrivia,

More information