Big Data Analytics with Apache Spark. Machiel Jansen. Matthijs Moed


1 Big Data Analytics with Apache Spark Machiel Jansen Matthijs Moed

2 Schedule
10:15-10:45 General introduction to Spark
10:45-11:15 Hands-on: Python notebook
11:15-11:30 General introduction to Spark (continued)
12:30-13:15 Spark DataFrames
13:15-14:00 Hands-on: Spark DataFrames notebook
14:00-15:00 Hadoop Map/Reduce
15:00-15:30 Spark RDD
15:30-16:15 Hands-on: Spark RDD notebook
16:15-17:00 Machine learning

3 What is Apache Spark? Spark is a software development framework, not an environment in which you can (easily) run binary (Unix) programs. It provides a simplified (and therefore limited, but easier) way of writing distributed data-intensive applications. Spark runs on commodity hardware or in cloud environments.

4 Scaling The need to react to increased load: keep the application/service running with similar performance under heavy load (many users, lots of data).

5 Scaling traditional applications Whoever runs the application now needs to: distribute and split data; handle the faults and errors inherent to scale; submit and track applications; use files or relational databases (fixed schemas).

6 An Example Suppose that from a tweet we are interested in finding: names of persons, names of organisations, locations and places. "I will be watching the election results from Trump Tower in Manhattan with my family and friends. Very exciting!"

7 Anatomy of a Tweet

8 A Straightforward Implementation Store tweets on disk. A small Python program uses NLTK and the Stanford NER to tag them. Write the output back to disk.

9 But

10 Scaling Bottlenecks Store tweets on disk: storage will eventually fill up. Small Python program: it can handle a tweet every few milliseconds to seconds, so we need to run separate processes. Write output back to disk: it will also eventually fill up. Run separate processes: they all need input.

11 Scaling up or out

12 State Remember things in between service requests. Example: the internet shopping basket. If different machines service the requests, they should share the state of the shopping basket.

13 Shared mutable state Example: multiple processes can change a file, a database, or a variable. When scaling out it becomes difficult to guard consistency and prevent unwanted results. Locking mechanisms must be put in place, which makes reasoning about the software complicated.

14 Try to avoid shared state "Working with distributed systems is fundamentally different from writing software on a single computer, and the main difference is that there are lots of new and exciting ways for things to go wrong." (Martin Kleppmann, Designing Data-Intensive Applications)

15 Parallel programming is hard

16 Latencies inside a machine Latencies (courtesy Ben Stopford)

17 Latency numbers (table of memory, disk and network latencies; image omitted)

18 Humanized latency numbers (the same latencies multiplied by 1 billion to bring them to a human scale; image omitted)

19 Spark - a general framework Spark aims to generalise MapReduce to support new applications with a more efficient engine that is simpler for end users. Write programs in terms of distributed datasets and operations on them. Accessible from multiple programming languages: Scala, Java, Python, and R (DataFrames only).

20

21 Jupyter

22 Hands-on: Notebooks Jupyter notebooks: 01-python.ipynb

23 Spark components

24

25 Two main APIs Low-level API: RDDs - Python, Scala, Java - no R support. Structured API: DataFrames - higher level - R support, SQL and ML.

26 Spark modes The SparkContext/SparkSession object contains information about the cluster and is the link between your code and the cluster. Local mode: a single machine, using multiple cores; for testing and training purposes. Cluster mode: a dedicated Spark cluster (also on clouds). Hadoop/cluster mode: use Hadoop YARN to deploy the cluster.
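A minimal sketch of creating a session in local mode from PySpark (the application name and core count are arbitrary choices for illustration):

    from pyspark.sql import SparkSession

    # Build (or reuse) a SparkSession running locally on 4 cores.
    spark = (SparkSession.builder
             .master("local[4]")
             .appName("spark-workshop")
             .getOrCreate())

    # The lower-level SparkContext is available through the session.
    sc = spark.sparkContext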

27 An executing application

28 Spark UI The Spark UI provides a visual way to understand running applications and metrics about your Spark workload. Available on port 4040

29 Spark UI

30 Languages and CLI Scala, Java, Python, R (DataFrames only). Command line: spark-submit.

31 Big Data = Big Data transfers Large, secure data transfer between institutes SURFnet network 10 Gbit/s Research Data Zone

32 Break

33 DataFrames A DataFrame is a distributed collection of data organized into named columns. Conceptually equivalent to a table in a relational database or a dataframe in R/Python. DataFrames can be constructed from a wide array of sources such as: structured data files, external databases, or existing RDDs.

34 DataFrames A collection of Row objects with a schema. DataFrames are immutable. Transformations and actions. Lazy, but the schema is checked eagerly. Distributed over the machines in the cluster.

35 Transformations and actions Transformation: transforms one or more DataFrames into another DataFrame (lazily computed). Action: sends data to the driver (your programming environment) or writes data to disk or a database (triggers computation).
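A small self-contained sketch of the distinction (the DataFrame and its columns are made up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.createDataFrame(
        [("acme", 650), ("nokia", 120), ("apple", 900)],
        ["brand", "price"])

    # Transformations: only recorded in the plan, nothing runs yet.
    expensive = df.filter(df["price"] > 500).select("brand", "price")

    # Action: triggers the computation and returns a value to the driver.
    print(expensive.count())  # 2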

36 Partitions A partition is a collection of rows that sit on one physical machine in our cluster. A DataFrame's partitions represent how the data is physically distributed across your cluster of machines during execution. With DataFrames we do not (for the most part) manipulate partitions individually; we simply specify high-level transformations of data in the physical partitions, and Spark determines how this work will actually execute on the cluster. (taken from Spark: The Definitive Guide, Chambers & Zaharia, O'Reilly 2018)

37 DataFrame API Relational in flavour: select, groupBy, orderBy, where/filter, join, limit, etc. Possibility to define User-Defined Functions (UDFs). Optional: use SQL directly and work with tables/views.
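As a sketch of a UDF (the DataFrame and the wrapped function are invented for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.createDataFrame([("Dog",), ("Woodpecker",)], ["word"])

    # Wrap a plain Python function as a UDF, declaring its return type.
    word_length = udf(lambda s: len(s), IntegerType())

    df.select("word", word_length("word").alias("length")).show()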

38

39 Transformations A DataFrame phone_df passes through a chain: select -> DataFrame -> filter -> DataFrame -> groupby -> GroupedData -> max -> DataFrame (diagram)

40 Actions The same chain with an action at the end: select -> filter -> groupby -> max -> show; show is an action that returns the result to Python (diagram)

41 An Executing Application

42 An Executing Application

43 Spark: What Runs Where? At first glance, Spark code and RDD variables look local. It is important to keep track of which variables are local and which are references to distributed data (variables of type RDD).

44 Get data to screen A transformation returns a DataFrame; actions transfer the data to the driver.

45 Show and take show() prints the first 20 rows by default; take(n) returns the first n rows to the driver.

46 toPandas() Converts a Spark DataFrame to a pandas data frame on the driver, so only use it for results that fit in driver memory.
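A one-line sketch, assuming a SparkSession spark and pandas installed on the driver; the limit(1000) guards against pulling a huge dataset onto a single machine:

    # Collect a (small!) Spark DataFrame to the driver as a pandas DataFrame.
    pdf = spark.range(10000).limit(1000).toPandas()
    print(pdf.head())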

47 Optimizations Spark will optimize computation by building plans, making use of schema information.

48 Logical plan

49 Physical plan
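You can inspect both plans yourself with explain(); a minimal sketch with a made-up DataFrame:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

    # extended=True prints the parsed, analyzed and optimized logical plans
    # followed by the physical plan.
    df.filter(df["id"] > 1).explain(True)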

50 Data sources CSV, JSON, Parquet, ORC, JDBC/ODBC connections, plain text files. Numerous community-created data sources, including Cassandra, HBase, MongoDB, AWS Redshift, and many others.

51 Reading DataFrameReader (accessed through the SparkSession's read attribute). We specify several values: the format (1), the schema (2), the read mode (3), a series of options (4), and the path (5).
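A sketch of the reader pattern, assuming a SparkSession spark; the file path and option values are hypothetical:

    df = (spark.read
          .format("csv")                   # 1. the format
          .option("inferSchema", "true")   # 2. the schema (here inferred)
          .option("mode", "PERMISSIVE")    # 3. the read mode
          .option("header", "true")        # 4. further options
          .load("../data/flights.csv"))    # 5. the path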

52 Read modes What to do with malformed records? permissive (the default): set corrupt fields to null and keep the record; dropMalformed: drop malformed records; failFast: fail immediately on malformed input.

53 Parquet Apache Parquet is an open-source column-oriented file format. Columnar compression; allows reading individual columns instead of entire files. Recommended for writing to long-term storage. More efficient than JSON or CSV. Another advantage of Parquet is that it supports complex types: if your column is an array (which would fail with a CSV file, for example), map, or struct, you'll still be able to read and write that file without issue.
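A minimal Parquet round trip (the path and data are made up); note that the array column would not survive a CSV round trip:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.createDataFrame([(1, ["a", "b"]), (2, ["c"])], ["id", "tags"])

    # Columnar, compressed, schema-preserving storage.
    df.write.mode("overwrite").parquet("/tmp/example.parquet")
    spark.read.parquet("/tmp/example.parquet").show()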

54 Row- & column-oriented (from "Spark + Parquet In Depth", Emily Curtin and Robbie Strickland, Spark Summit East)

55 Parquet metadata

56 Writing DataFrameWriter - accessed via a DataFrame's write method. We specify: the format (1), the save mode (2), a series of options (3), and finally the path (4).
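The writer mirrors the reader; a sketch assuming a DataFrame df (format, mode, option and path are illustrative choices):

    (df.write
       .format("json")                  # 1. the format
       .mode("overwrite")               # 2. the save mode
       .option("compression", "gzip")   # 3. options
       .save("/tmp/example-json"))      # 4. the path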

57 DataFrames: Pros and cons Pro: a good level of abstraction for tabular, relational data; the schema allows Spark to optimize queries. Con: less suited for unstructured data (text); User-Defined Functions can be clunky and unwieldy.

58 Spark SQL Tables are logically equivalent to a DataFrame. The core difference is that DataFrames are defined within the scope of a programming language, while tables are defined inside a database. This means that when you create a table (assuming you never changed the database), it will belong to the default database.

59 SparkSQL Use the SparkSession's sql method. Create an SQL view, which is tied to the SparkSession.
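A minimal sketch (the data is invented for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.createDataFrame([("Dog", 3), ("Rabbit", 6)], ["name", "length"])

    # Register the DataFrame as a temporary view tied to this SparkSession...
    df.createOrReplaceTempView("animals")

    # ...and query it with SQL; the result is again a DataFrame.
    spark.sql("SELECT name FROM animals WHERE length > 3").show()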

60 SparkSQL

61 Save DataFrame as a Table

62 How tables are stored

63 Grouped data

64 Statistics pyspark.sql.functions contains many functions: statistics, date, math, list manipulation
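A small sketch of applying such functions to grouped data (the data is invented):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.createDataFrame(
        [("a", 1.0), ("a", 3.0), ("b", 2.0)], ["key", "value"])

    # Statistics from pyspark.sql.functions applied per group.
    df.groupBy("key").agg(
        F.mean("value").alias("mean"),
        F.max("value").alias("max")).show()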

65 Hands-on: Notebooks Jupyter notebooks: 02-spark-dataframes.ipynb

66

67

68 What a programmer wants To work with datasets that don't fit on a single machine. Have the operations on them automatically parallelized. Be able to use the full expressiveness of your language. Have this work like the native collections API. Not have to worry about machine failure.

69 Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google File System." ACM SIGOPS Operating Systems Review, Vol. 37, No. 5, 2003. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." OSDI '04: Proceedings of the 6th Symposium on Operating Systems Design and Implementation, 2004.

70 The First Open Source Big Data Project Hadoop was started by Doug Cutting and Mike Cafarella. Based on Google's concepts, written in Java. Goal: to power the Nutch web crawler. Became a separate Apache project in 2006. Popular term for open-source big data projects: the Hadoop ecosystem.

71 GFS/HDFS - design overview Handles failure of individual nodes. Optimised for large (100+ MB) files. Optimised for sequential reads. Favours high throughput over low latency.

72 HDFS - architecture Files are split into 128 MB blocks. Blocks are stored on many machines (datanodes). Each block is stored three times (on three different nodes). A single namenode handles metadata (namespace, block locations). Clients connect directly to datanodes. No updates of files.

73 Writing to HDFS

74 Data locality In Hadoop the same machines are often used for both storage and compute. The scheduler takes data location into account: it tries to schedule tasks on the same machine as the data.

75

76 Functional programming Restrict the programming interface so that the system can do more automatically. Use ideas from functional programming: "Here is a function, apply it to all of the data." "I do not care where it runs (the system should handle that)." "Feel free to run it twice on different nodes (no side effects!)"

77 Functions instead of iterations

78

79

80

81 Hadoop MapReduce

82

83 MapReduce strengths The MapReduce framework handles a lot of work for its end user: splitting work into independent tasks; task scheduling, retrying on failure; data grouping/shuffling, in-memory or spilling to disk.

84 MapReduce limitations Very low level: decomposing problems into (multiple) MapReduce jobs is hard. Batch-oriented: unsuited for interactive use or real-time processing. Disk sync: performance issues when chaining jobs (iterative algorithms).

85 Higher Level Frameworks SQL on Hadoop; a dataflow API in Java; Pig - a dataflow DSL; graph processing. All are translated to MapReduce jobs.

86

87

88 Creation of an RDD By transforming an existing RDD, or through the SparkContext. From an internal data structure:
text = "This is a sample text"
text_rdd = sc.parallelize(text.split())
From reading in a file (HDFS or otherwise):
lines = sc.textFile('../data/links.tsv')

89 Example pyspark SparkContext RDD creation
wordslist = ['Dog', 'Cat', 'Rabbit', 'Hare', 'Deer', 'Gull', 'Woodpecker', 'Mole']
wordsrdd = sc.parallelize(wordslist)
filteredrdd = wordsrdd.filter(lambda x: len(x) > 3)
Laziness: Spark does nothing!

90 Partitions Like DataFrames, RDDs are physically divided into partitions. Each partition contains a number of records, determined by the data source. Partitions can be processed in parallel; processing within a partition is done sequentially.

91 Operations on RDDs Actions: return some value or side effect; trigger computation. Examples: count, saveAsTextFile. Transformations: create a new RDD; lazily computed. Examples: map, filter.

92 Transformations RDDs are created from other RDDs using transformations: map(f) => pass every element through function f; reduceByKey(f) => aggregate values with the same key using f.

93 Transformations (all lazy) RDDs are created from other RDDs using transformations: map(f) - apply function f to each element of the RDD; flatMap(f) - apply function f to each element of the RDD and unpack the resulting lists; filter(pred) - apply predicate pred to each element of the RDD and return those that pass; distinct() - remove duplicate entries from the RDD.

94 Example pyspark
wordslist = ['Dog', 'Cat', 'Rabbit', 'Hare', 'Deer', 'Gull', 'Woodpecker', 'Mole']
wordsrdd = sc.parallelize(wordslist)
filteredrdd = wordsrdd.filter(lambda x: len(x) > 3)
result = filteredrdd.count()  # 6
Action: Spark acts!

95 Map Pattern: given a Python function f(x) that works on a single element x, for example tokenize("I like traffic lights") returning ['I', 'like', 'traffic', 'lights'], apply f to all elements in the RDD: myrdd.map(tokenize)

96 map vs flatMap
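A small sketch of the difference, assuming a SparkContext sc as in the notebooks:

    lines = sc.parallelize(["the cat sat", "on the mat"])

    # map: one output element per input element -> an RDD of lists.
    lines.map(lambda s: s.split()).collect()
    # [['the', 'cat', 'sat'], ['on', 'the', 'mat']]

    # flatMap: the lists are unpacked into one flat RDD of words.
    lines.flatMap(lambda s: s.split()).collect()
    # ['the', 'cat', 'sat', 'on', 'the', 'mat']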

97

98

99 Example reduce Takes a function with two arguments.
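A minimal sketch, assuming a SparkContext sc:

    numbers = sc.parallelize([1, 2, 3, 4])

    # reduce is an action: it repeatedly combines pairs of elements
    # with a two-argument function until one value remains.
    total = numbers.reduce(lambda x, y: x + y)  # 10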

100 Pair RDDs The elements of a pair RDD are pairs (tuples) (x, y): x is interpreted as the key, y as the value. Very much like MapReduce. Pair RDDs have extra methods.

101 Pair RDD transformations groupByKey() - returns an RDD with elements (key, value-list); reduceByKey(f(x,y)) - applies f to all values of each key (similar to Hadoop MapReduce); join(rdd) - joins two RDDs on their keys; mapValues(f) - applies f to the values, not the keys, of the RDD.
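A sketch of these transformations, assuming a SparkContext sc (the data is invented):

    pairs = sc.parallelize([("cat", 1), ("dog", 2), ("cat", 3)])

    # reduceByKey: combine the values per key.
    pairs.reduceByKey(lambda x, y: x + y).collect()  # e.g. [('cat', 4), ('dog', 2)]

    # mapValues: transform only the values; keys stay untouched.
    pairs.mapValues(lambda v: v * 10).collect()

    # join: combine two pair RDDs on their keys.
    sounds = sc.parallelize([("cat", "meow"), ("dog", "woof")])
    pairs.join(sounds).collect()
    # e.g. [('cat', (1, 'meow')), ('cat', (3, 'meow')), ('dog', (2, 'woof'))]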

102 Creating a pair RDD Transform an RDD into a pair RDD using map, as in: pairrdd = textrdd.map(lambda x: (x, 1))

103 Word Count Input: "the cat sat on the mat / the aardvark sat on the sofa". Output (first lines): the 4, cat 1, sat 2, on 2, ...

104 Word Count Input: "the cat sat on the mat / the aardvark sat on the sofa". Output: aardvark 1, cat 1, mat 1, on 2, sat 2, sofa 1, the 4

105
lines = sc.textFile(file)
words = lines.flatMap(lambda s: s.split())
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda x, y: x + y)

106

107

108

109

110 Pseudo set operations

111 Efficient & fault-tolerant Intermediate RDDs are computed only when needed -> pipelining. RDD partitions that are never needed are not computed -> laziness. Lineage information is stored for every RDD partition -> reconstruction.
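You can inspect the stored lineage yourself; a minimal sketch assuming a SparkContext sc:

    rdd = sc.parallelize(range(100)).map(lambda x: x * 2).filter(lambda x: x > 10)

    # toDebugString shows the chain of parent RDDs that Spark would use
    # to recompute lost partitions.
    print(rdd.toDebugString().decode())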

112 DIY

113 What is Docker?

114 Hands-on: Notebooks Jupyter notebooks: 03-spark-rdds.ipynb

115 Machine Learning: MLlib Why another machine learning library?

116 Spark MLlib Scale: many data sets/models become too big for a single machine. Spark is good at training models in a distributed fashion. Not so good at predicting with very low latency (due to the overhead of starting Spark jobs).

117 Machine learning 1. Data exploration 2. Data preprocessing 3. Model training 4. Model evaluation 5. Model inspection

118 MLlib: Spark's Machine Learning library The Apache Spark core distribution has included a machine learning library, MLlib, since its inception. MLlib was based on the RDD API. Spark 1.2 introduced a new package called spark.ml, a high-level interface based on DataFrames. Since Spark 2.0 both are called MLlib. The DataFrames API is the primary API; the RDD API is in maintenance mode, expected to be deprecated in 2.3 and removed in 3.0.

119 MLlib data types MLlib (RDD API) uses numerical data types backed by Breeze: Local vector - dense and sparse vectors of doubles; Labeled point - a local vector plus a label, used by supervised learning algorithms; Local matrix - dense and sparse matrices stored on a single machine; Distributed matrix - row and column indices with double values, stored in one or more RDDs.

120 New in Spark 2.3: ImageSchema A representation for images based on OpenCV:
imageschema = StructType([
    StructField("mode", StringType(), False),
    StructField("origin", StringType(), True),
    StructField("height", IntegerType(), False),
    StructField("width", IntegerType(), False),
    StructField("nChannels", IntegerType(), False),
    StructField("data", BinaryType(), False)
])

121 MLlib Common machine learning algorithms on top of Spark: classification: SVM, Naive Bayes, random forests; regression: logistic regression, decision trees, isotonic regression; clustering: k-means, PIC, LDA; collaborative filtering: alternating least squares; dimensionality reduction: SVD, PCA.
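A minimal sketch of the DataFrame-based spark.ml API, with made-up training data (two features and a binary label):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    train = spark.createDataFrame(
        [(0.0, 1.1, 0.0), (2.0, 1.0, 1.0), (2.0, 1.3, 1.0), (0.0, 1.2, 0.0)],
        ["f1", "f2", "label"])

    # Assemble the feature columns into one vector column, then fit a model.
    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    lr = LogisticRegression(maxIter=10)
    model = Pipeline(stages=[assembler, lr]).fit(train)

    model.transform(train).select("label", "prediction").show()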

122 Alternatives to Spark MLlib These libraries can use Spark as a backend and have their own API: Sparkling Water (H2O), DL4J, Apache Mahout.

123 Hands-on: Notebooks Jupyter notebooks: 04-decision-trees.ipynb 05-random-forests.ipynb
