Scala and the JVM for Big Data: Lessons from Spark

1 Scala and the JVM for Big Data: Lessons from Spark. Dean Wampler, All Rights Reserved

2 Spark

3 A Distributed Computing Engine on the JVM

4 Resilient Distributed Datasets [Diagram: a cluster of nodes; each node holds a partition of the RDD.]

5 Productivity? Very concise, elegant, functional APIs. Scala, Java, Python, R, and SQL!

6 Productivity? Interactive shell (REPL): Scala, Python, R, and SQL.

7 Notebooks: Jupyter, Spark Notebook, Zeppelin, Beaker, Databricks.

9 Example: Inverted Index

10 Web Crawl [Diagram: a web crawl produces an index, stored in blocks, that maps each page to its text: wikipedia.org/hadoop -> "Hadoop provides MapReduce and HDFS", wikipedia.org/hbase -> "HBase stores data in HDFS".]

11 Compute Inverted Index [Diagram: the crawl index (wikipedia.org/hadoop -> "Hadoop provides ...", wikipedia.org/hbase -> "HBase stores ...", wikipedia.org/hive -> "Hive queries ...") is turned ("Miracle!!") into an inverse index, also stored in blocks, that maps each word to (page, count) pairs: hadoop -> (/hadoop,1); hbase -> (/hbase,1),(/hive,1); hdfs -> (/hadoop,1),(/hbase,1),(/hive,1); hive -> (/hive,1).]

13
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    val sparkContext = new SparkContext(master, "Inv. Index")
    sparkContext.textFile("/path/to/input").
      map { line =>
        val array = line.split(",", 2)
        (array(0), array(1))                          // (id, content)
      }.flatMap {
        case (id, content) =>
          toWords(content).map(word => ((word,id),1)) // toWords not shown
      }.reduceByKey(_ + _).
      map {
        case ((word,id),n) => (word,(id,n))
      }.groupByKey.
      mapValues {
        seq => sortByCount(seq)   // Sort the value seq by count, desc.
      }.saveAsTextFile("/path/to/output")

15
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    val sparkContext = new SparkContext(master, "Inv. Index")
    sparkContext.textFile("/path/to/input").
      // RDD[String]: /hadoop, Hadoop provides
      map { line =>
        val array = line.split(",", 2)
        (array(0), array(1))
      }.
      // RDD[(String,String)]: (/hadoop,Hadoop provides)
      flatMap {
        case (id, contents) =>
          toWords(contents).map(w => ((w,id),1))

16
        val array = line.split(",", 2)
        (array(0), array(1))
      }.flatMap {
        case (id, contents) =>
          toWords(contents).map(w => ((w,id),1))
      }.reduceByKey(_ + _).
      // RDD[((String,String),Int)]: ((Hadoop,/hadoop),20)
      map {
        case ((word,id),n) => (word,(id,n))
      }.groupByKey.
      mapValues {
        seq => sortByCount(seq)
      }.saveAsTextFile("/path/to/output")

17
        val array = line.split(",", 2)
        (array(0), array(1))
      }.flatMap {
        case (id, contents) =>
          toWords(contents).map(w => ((w,id),1))
      }.reduceByKey(_ + _).
      map {
        case ((word,id),n) => (word,(id,n))
      }.groupByKey.
      // RDD[(String,Iterable[(String,Int)])]: (Hadoop,Seq((/hadoop,20), ...))
      mapValues {
        seq => sortByCount(seq)
      }.saveAsTextFile("/path/to/output")

19 Productivity? Intuitive API: Dataflow of steps. Inspired by Scala collections and functional programming. [Diagram: textFile -> map -> flatMap -> reduceByKey -> map -> groupByKey -> map -> saveAsTextFile]
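
The same dataflow can be tried on one machine with plain Scala collections, which is a quick way to see what each step produces. This is a minimal sketch and not the talk's code: toWords and sortByCount are hypothetical stand-ins for the helpers the slides do not show, groupBy plus a sum stands in for reduceByKey, and the two input lines are made up for illustration.

```scala
object InvertedIndexSketch {
  // Hypothetical helper: lower-case the text and split on non-letter characters.
  def toWords(content: String): Seq[String] =
    content.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

  // Hypothetical helper: sort (id, count) pairs by count descending, then by id.
  def sortByCount(seq: Seq[(String, Int)]): Seq[(String, Int)] =
    seq.sortBy { case (id, n) => (-n, id) }

  def invertedIndex(lines: Seq[String]): Map[String, Seq[(String, Int)]] =
    lines
      .map { line =>
        val array = line.split(",", 2)
        (array(0), array(1))                          // (id, content)
      }
      .flatMap { case (id, content) =>
        toWords(content).map(word => ((word, id), 1))
      }
      .groupBy { case (key, _) => key }               // stands in for reduceByKey
      .map { case (key, ones) => (key, ones.map(_._2).sum) }
      .toSeq
      .map { case ((word, id), n) => (word, (id, n)) }
      .groupBy { case (word, _) => word }             // stands in for groupByKey
      .map { case (word, pairs) => (word, sortByCount(pairs.map(_._2))) }

  // Made-up sample input in the "id,content" form the slides assume.
  val sample: Seq[String] = Seq(
    "/hadoop,Hadoop provides MapReduce and HDFS",
    "/hbase,HBase stores data in HDFS"
  )
}
```

Calling InvertedIndexSketch.invertedIndex(InvertedIndexSketch.sample) returns a Map from each word to its (page, count) pairs, the same shape as the inverse index in the diagram.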

20 Performance? Lazy API: Combines steps into stages. Cache intermediate data in memory. [Diagram: textFile -> map -> flatMap -> reduceByKey -> map -> groupByKey -> map -> saveAsTextFile]
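
Laziness can be illustrated without Spark using Scala's Iterator, whose map and filter are also deferred until something consumes the result. This is only an analogy: Spark plans stages and schedules them across a cluster, which this single-threaded sketch does not attempt. The names LazyPipelineSketch, evaluated, pipeline, and firstTwo are illustrative.

```scala
object LazyPipelineSketch {
  // Counts how many elements the first step has actually processed.
  var evaluated = 0

  // Builds the pipeline but performs no work yet: Iterator.map and
  // Iterator.filter are lazy, loosely like Spark transformations.
  def pipeline(input: Seq[Int]): Iterator[Int] =
    input.iterator
      .map { x => evaluated += 1; x * 2 }
      .filter(_ % 3 == 0)

  // Consuming the iterator plays the role of an action: each element flows
  // through both steps in one pass, with no intermediate collection.
  def firstTwo(input: Seq[Int]): List[Int] = pipeline(input).take(2).toList
}
```

Because the steps are fused, asking for two results touches only the first few input elements rather than the whole input.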

22 Higher-Level APIs

23 SQL: Datasets/DataFrames

24 Example
    import org.apache.spark.sql.SparkSession
    val spark = SparkSession.builder().master("local").appName("Queries").getOrCreate()
    val flights = spark.read.parquet("/flights")
    val planes = spark.read.parquet("/planes")
    flights.createOrReplaceTempView("flights")
    planes.createOrReplaceTempView("planes")
    flights.cache(); planes.cache()
    val planes_for_flights1 = spark.sql("""
      SELECT * FROM flights f
      JOIN planes p ON f.tailnum = p.tailnum LIMIT 100""")
    val planes_for_flights2 =
      flights.join(planes,
        flights("tailnum") === planes("tailnum")).limit(100)

27
    planes.createOrReplaceTempView("planes")
    flights.cache(); planes.cache()
    val planes_for_flights1 = spark.sql("""
      SELECT * FROM flights f
      JOIN planes p ON f.tailnum = p.tailnum LIMIT 100""")
    val planes_for_flights2 =      // Returns another Dataset.
      flights.join(planes,
        flights("tailnum") === planes("tailnum")).limit(100)

29
    val planes_for_flights2 =
      flights.join(planes,
        flights("tailnum") === planes("tailnum")).limit(100)
  The join condition is not an arbitrary anonymous function, but a Column instance.
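
Why a Column rather than a function matters: an expression is data that an optimizer can inspect, while a compiled closure is opaque. The toy expression tree below illustrates the idea. Expr, Col, Lit, EqualTo, and referencedColumns are hypothetical names for this sketch and are not Spark's classes.

```scala
object ColumnSketch {
  sealed trait Expr {
    // Builds a comparison node instead of evaluating anything.
    def ===(other: Expr): Expr = EqualTo(this, other)
  }
  final case class Col(table: String, name: String) extends Expr
  final case class Lit(value: Any) extends Expr
  final case class EqualTo(left: Expr, right: Expr) extends Expr

  // Because the condition is a tree, a planner can ask which columns it reads.
  def referencedColumns(e: Expr): Set[Col] = e match {
    case c: Col        => Set(c)
    case _: Lit        => Set.empty[Col]
    case EqualTo(l, r) => referencedColumns(l) ++ referencedColumns(r)
  }

  // Mirrors the shape of flights("tailnum") === planes("tailnum").
  val joinCondition: Expr = Col("flights", "tailnum") === Col("planes", "tailnum")
}
```

With a plain function such as (f, p) => f.tailnum == p.tailnum, the only thing an engine could do is call it; with a tree it can analyze and rewrite the condition before running anything.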

30 Performance. The Dataset API has the same performance for all languages: Scala, Java, Python, R, and SQL!

31
    def join(right: Dataset[_], joinExprs: Column): DataFrame = {
    def groupBy(cols: Column*): RelationalGroupedDataset = {
    def orderBy(sortExprs: Column*): Dataset[T] = {
    def select(cols: Column*): Dataset[...] = {
    def where(condition: Column): Dataset[T] = {
    def limit(n: Int): Dataset[T] = {
    def intersect(other: Dataset[T]): Dataset[T] = {
    def sample(withReplacement: Boolean, fraction, seed) = {
    def drop(col: Column): DataFrame = {
    def map[U](f: T => U): Dataset[U] = {
    def flatMap[U](f: T => Traversable[U]): Dataset[U] = {
    def foreach(f: T => Unit): Unit = {
    def take(n: Int): Array[Row] = {
    def count(): Long = {
    def distinct(): Dataset[T] = {
    def agg(exprs: Map[String, String]): DataFrame = {

33 Structured Streaming

34 DStream (discretized stream) [Diagram: events arrive over time and are grouped into one RDD per batch interval (Time 1 RDD, Time 2 RDD, Time 3 RDD, Time 4 RDD). Window of 3 RDD Batches #1 and Window of 3 RDD Batches #2 each span three consecutive batches.]
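
The windowing in the diagram can be mimicked with sliding on an ordinary sequence of batches. A minimal sketch, assuming each batch is just a Seq[String] of events and a window covers 3 consecutive batches; the batch contents and the names WindowSketch, windows, and countsPerWindow are made up for illustration.

```scala
object WindowSketch {
  // One inner Seq per batch interval (Time 1, Time 2, ...).
  val batches: Seq[Seq[String]] = Seq(
    Seq("e1", "e2", "e3"),
    Seq("e4", "e5"),
    Seq("e6", "e7", "e8", "e9"),
    Seq("e10")
  )

  // Each window concatenates `size` consecutive batches and slides by 1 batch.
  def windows(size: Int): Seq[Seq[String]] =
    batches.sliding(size).map(_.flatten).toList

  // Example aggregate per window: the number of events it contains.
  def countsPerWindow(size: Int): Seq[Int] = windows(size).map(_.size)
}
```

With four batches and a window of 3, there are two windows, matching windows #1 and #2 in the diagram.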

35 ML/MLlib

36 K-Means. Machine Learning requires: Iterative training of models. Good linear algebra performance.

37 GraphX

38 PageRank. Graph algorithms require: Incremental traversal. Efficient edge and node representations.

39 Foundation: The JVM

40 20 Years of DevOps. Lots of Java Devs.

41 Tools and Libraries: Akka, Breeze, Algebird, Spire & Cats, Axle.

42 Big Data Ecosystem

43 But it's not perfect

44 Richer data libs. in Python & R

45 Garbage Collection

46 GC Challenges. Typical Spark heaps: 10s-100s GB. Uncommon for generic, non-data services.

47 GC Challenges. Too many cached RDDs lead to huge old generation garbage. Billions of objects => long GC pauses.

48 Tuning GC. Best for Spark: -XX:+UseG1GC -XX:-ResizePLAB -Xms -Xmx -XX:InitiatingHeapOccupancyPercent= -XX:ConcGCThreads= See databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html

49 JVM Object Model

50 Java Objects? "abcd": 4 bytes for raw UTF8, right? 48 bytes for the Java object: 12 byte header. 8 bytes for hash code. 20 bytes for array overhead. 8 bytes for UTF16 chars.

51 Arrays [Diagram: val myarray: Array[String] holds references to the separate strings "zeroth", "first", "second", "third".]

52 Class Instances [Diagram: val person: Person with fields name: String ("Buck Trends"), age: Int (29), addr: Address; name and addr are references to other objects.]

53 Hash Maps [Diagram: a hash map whose hash codes h/c1 ... h/c4 point to entries that reference separate key and value objects.]

54 Improving Performance. Why obsess about this? Spark jobs are CPU bound: Improve network I/O? ~2% better. Improve disk I/O? ~20% better.

55 What changed? Faster HW (compared to ~2000): 10Gb/s networks. SSDs.

56 What changed? Smarter use of I/O: Pruning unneeded data sooner. Caching more effectively. Efficient formats, like Parquet.

57 What changed? But more CPU use today: More serialization. More compression. More hashing (joins, group-bys).

58 Improving Performance. To improve performance, we need to focus on the CPU: Better algorithms, sure. And optimize use of memory.

59 Project Tungsten: Initiative to greatly improve Dataset/DataFrame performance.

60 Goals

61 Reduce References [Diagram: the Person instance (name: String "Buck Trends", age: Int 29, addr: Address), the Array[String] ("zeroth", "first", "second", "third"), and the hash map (h/c1 ... h/c4 with separate keys and values), each a web of references.]

62 Reduce References. Fewer, bigger objects to GC. Fewer cache misses. [Diagram: the same Person, Array[String], and hash map object graphs as on slide 61.]

63 Less Expression Overhead. sql("select a + b FROM table") Evaluating expressions billions of times: Virtual function calls. Boxing/unboxing. Branching (if statements, etc.)

64 Implementation

65 Object Encoding. New CompactRow type: [Layout: null bit set (1 bit/field) | values (8 bytes/field) | variable length data; a value slot can hold an offset to var. len. data.] Compute hashCode and equals on raw bytes.
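
The fixed-width part of such a layout can be sketched with java.nio.ByteBuffer. This illustrates the idea only and is not Spark's implementation: encode, isNull, getLong, and bitSetBytes are hypothetical helpers, only Long fields are handled, the variable-length section is omitted, and rounding the null bit set up to 8-byte words is an assumption of this sketch.

```scala
import java.nio.ByteBuffer

object CompactRowSketch {
  // Bytes used by the null bit set: 1 bit per field, rounded up to 8-byte words
  // (the rounding is an assumption made for this sketch).
  def bitSetBytes(numFields: Int): Int = ((numFields + 63) / 64) * 8

  // Encodes a row of optional Long fields as [null bit set][8 bytes per field].
  def encode(fields: Seq[Option[Long]]): Array[Byte] = {
    val header = bitSetBytes(fields.size)
    val buf = ByteBuffer.allocate(header + 8 * fields.size)
    fields.zipWithIndex.foreach { case (field, i) =>
      field match {
        case Some(v) => buf.putLong(header + 8 * i, v)
        case None =>
          // Set bit i in the null bit set; the value slot stays zeroed.
          val wordOffset = (i / 64) * 8
          buf.putLong(wordOffset, buf.getLong(wordOffset) | (1L << (i % 64)))
      }
    }
    buf.array()
  }

  def isNull(row: Array[Byte], i: Int): Boolean =
    (ByteBuffer.wrap(row).getLong((i / 64) * 8) & (1L << (i % 64))) != 0L

  def getLong(row: Array[Byte], numFields: Int, i: Int): Long =
    ByteBuffer.wrap(row).getLong(bitSetBytes(numFields) + 8 * i)
}
```

Since a row is a single byte array, equality and hashing can work on the raw bytes, for example with java.util.Arrays.equals and java.util.Arrays.hashCode, without touching per-field objects.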

66 Compare: [Diagram: val person: Person (name: String "Buck Trends", age: Int 29, addr: Address) as separate heap objects, versus one compact row: null bit set (1 bit/field) | values (8 bytes/field) | variable length data, with an offset to var. len. data.]

67 BytesToBytesMap: [Diagram: hash codes h/c1 ... h/c4 point into a Tungsten Memory Page that stores the pairs k1 v1, k2 v2, k3 v3, k4 v4 together.]

68 Hash Map Compare [Diagram: a conventional hash map, where h/c1 ... h/c4 reference separate key and value objects, next to a BytesToBytesMap, where h/c1 ... h/c4 point into a Tungsten Memory Page holding k1 v1, k2 v2, k3 v3, k4 v4.]

69 Memory Management. Some allocations off heap: sun.misc.Unsafe.

70 Less Expression Overhead. sql("select a + b FROM table") Solution: Generate custom byte code. Spark 1.X - for subexpressions.

71 Less Expression Overhead. sql("select a + b FROM table") Solution: Generate custom byte code. Spark 1.X - for subexpressions. Spark for whole queries.

73 No Value Types (Planned for Java 9 or 10)

74
    case class Timestamp(epochMillis: Long) {
      override def toString: String = { ... }
      def add(delta: TimeDelta): Timestamp = {
        /* return new shifted time */
      }
    }
  Don't allocate on the heap; just push the primitive long on the stack. (scalac does this now.)
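
What scalac offers here is the value class: a class with a single field that extends AnyVal. A minimal sketch, assuming TimeDelta is also a thin wrapper around a Long of milliseconds (the slide does not define it) and with hypothetical method bodies:

```scala
object ValueClassSketch {
  // Extending AnyVal makes these value classes: where possible, scalac
  // represents them as a bare Long instead of allocating a wrapper object.
  final case class TimeDelta(millis: Long) extends AnyVal

  final case class Timestamp(epochMillis: Long) extends AnyVal {
    // Returns the new, shifted time.
    def add(delta: TimeDelta): Timestamp = Timestamp(epochMillis + delta.millis)
    override def toString: String = s"Timestamp($epochMillis)"
  }
}
```

The allocation is not always avoided: a value class instance is still boxed when it is used as a generic type argument, stored in an array, or treated as Any.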

75 Long operations aren't atomic, according to the JVM spec.
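
Concretely, the Java Language Specification allows a JVM to treat a write to a non-volatile long or double as two separate 32-bit writes, and guarantees atomic reads and writes once the field is volatile. The sketch below shows the safe pattern with Scala's @volatile. It does not try to reproduce a torn read, which typical 64-bit JVMs do not exhibit; VolatileLongSketch and onlyWholeValuesSeen are illustrative names.

```scala
object VolatileLongSketch {
  // @volatile guarantees that each read or write of this Long is atomic.
  @volatile private var shared: Long = 0L

  // A writer thread alternates between all-zero and all-one bit patterns;
  // the reader checks that it never observes a mix of the two halves.
  def onlyWholeValuesSeen(iterations: Int): Boolean = {
    val writer = new Thread(new Runnable {
      def run(): Unit = {
        var i = 0
        while (i < iterations) {
          shared = if ((i & 1) == 0) 0L else -1L
          i += 1
        }
      }
    })
    writer.start()
    var ok = true
    var i = 0
    while (i < iterations) {
      val v = shared
      if (v != 0L && v != -1L) ok = false
      i += 1
    }
    writer.join()
    ok
  }
}
```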

76 No Unsigned Types. What's factorial(-1)?
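
Without an unsigned type, "n is never negative" cannot be stated in a signature, so it has to be checked at run time. A minimal sketch: factorial and asUnsigned are illustrative names, and java.lang.Integer.toUnsignedLong is one of the helper methods Java 8 added for reinterpreting signed bits as unsigned.

```scala
object UnsignedSketch {
  // Reinterprets the 32 bits of a signed Int as an unsigned value held in a Long.
  def asUnsigned(i: Int): Long = java.lang.Integer.toUnsignedLong(i)

  // factorial is only defined for n >= 0, but Int cannot express that,
  // so the precondition is enforced when the function is called.
  def factorial(n: Int): BigInt = {
    require(n >= 0, s"factorial is undefined for negative input: $n")
    (1 to n).foldLeft(BigInt(1))((acc, i) => acc * BigInt(i))
  }
}
```

Here factorial(-1) throws an IllegalArgumentException instead of being rejected by the compiler, which is the cost the slide is pointing at.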

77 Arrays Indexed with Ints. Byte Arrays limited to 2GB!

78
    scala> val N = 1100*1000*1000
    N: Int = ...                      // 1.1 billion
    scala> val array = Array.fill[Short](N)(0)
    array: Array[Short] = Array(0, 0, ...)
    scala> import org.apache.spark.util.SizeEstimator
    scala> SizeEstimator.estimate(array)
    res3: Long = ...                  // 2.2GB
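
The arithmetic behind this session can be checked without allocating anything. A small sketch with illustrative names: the element count fits in an Int, but the payload in bytes does not, so it can never be copied into a single byte array.

```scala
object ArrayLimitSketch {
  // 1.1 billion elements: below Int.MaxValue, so a legal array length.
  val n: Int = 1100 * 1000 * 1000

  // A Short takes 2 bytes, so the data alone is about 2.2 GB.
  val bytesPerShort: Long = 2L
  val payloadBytes: Long = n.toLong * bytesPerShort

  // Array lengths are Ints, so one Array[Byte] holds at most Int.MaxValue
  // bytes (just under 2 GiB). The payload does not fit.
  val fitsInOneByteArray: Boolean = payloadBytes <= Int.MaxValue.toLong
}
```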

79
    scala> val b = sc.broadcast(array)
    ...broadcast.Broadcast[Array[Short]] = ...
    scala> SizeEstimator.estimate(b)
    res0: Long = 2368
    scala> sc.parallelize(0 until ...).map(i => b.value(i))

80
    scala> SizeEstimator.estimate(b)
    res0: Long = 2368
    scala> sc.parallelize(0 until ...).map(i => b.value(i))
    java.lang.OutOfMemoryError: Requested array size exceeds VM limit
      at java.util.Arrays.copyOf()
  Boom!

81 But wait, I actually lied to you.

82 Spark handles large broadcast variables by breaking them into blocks.

83 Scala REPL

84
    java.lang.OutOfMemoryError: Requested array size exceeds VM limit
      at java.util.Arrays.copyOf()
      at java.io.ByteArrayOutputStream.write()
      at java.io.ObjectOutputStream.writeObject()
      at spark.serializer.JavaSerializationStream.writeObject()
      at spark.util.ClosureCleaner$.ensureSerializable(..)
      at org.apache.spark.rdd.RDD.map()

85 (Same stack trace as slide 84.) Pass this closure to RDD.map: i => b.value(i)

86 (Same stack trace as slide 84.) Verify that it's clean (serializable): i => b.value(i)

87 (Same stack trace as slide 84.) ...which it does by serializing to a byte array...

88 (Same stack trace as slide 84.) ...which requires copying an array. What array??? The closure is i => b.value(i), and the session defined: scala> val array = Array.fill[Short](N)(0)

89 Why did this happen?

90 You write:
    scala> val array = Array.fill[Short](N)(0)
    scala> val b = sc.broadcast(array)
    scala> sc.parallelize(0 until ...).map(i => b.value(i))

91 You write the three lines from slide 90. Scala compiles:
    class $iwC extends Serializable {
      val array = Array.fill[Short](N)(0)
      val b = sc.broadcast(array)
      class $iwC extends Serializable {
        sc.parallelize(...).map(i => b.value(i))
      }
    }

92 (Same code as slide 91.) So, this closure over b sucks in the whole object!

93 Lightbend is investigating re-engineering the REPL

94 Workarounds

95 Transient is often all you need:
    scala> @transient val array = Array.fill[Short](N)(0)

96
    object Data {   // Encapsulate in objects!
      val N = 1100*1000*1000
      val array = Array.fill[Short](N)(0)
      val getB = sc.broadcast(array)
    }
    object Work {
      def run(): Unit = {
        val b = Data.getB   // local ref!
        val rdd = sc.parallelize(...).
          map(i => b.value(i))   // only needs b
        rdd.take(10).foreach(println)
      }
    }
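
The capture problem and the "local ref" fix can be reproduced without Spark by measuring how many bytes Java serialization writes for a closure. This is a minimal sketch under stated assumptions: Holder stands in for the REPL's wrapper class, a 1,000,000-element array stands in for the 2.2GB one, and serializedSize is a hypothetical helper. It relies on Scala function literals being serializable, which holds for Scala 2.12 and later.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object ClosureCapture {
  // Hypothetical helper: number of bytes Java serialization writes for obj.
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.size()
  }

  // Stands in for the REPL wrapper class that holds every val you typed.
  class Holder extends Serializable {
    val big: Array[Short] = Array.fill[Short](1000000)(0)
    val small: Array[Short] = Array[Short](1, 2, 3)

    // Reads the field `small`, so the closure captures `this`,
    // and serializing it drags `big` along.
    def capturesThis: Int => Short = i => small(i)

    // Copies the field into a local val first, so the closure
    // captures only the small array.
    def capturesLocal: Int => Short = {
      val local = small
      i => local(i)
    }
  }
}
```

Comparing ClosureCapture.serializedSize(h.capturesThis) with ClosureCapture.serializedSize(h.capturesLocal) for a Holder h shows the first closure carrying the whole object, megabytes in this sketch, while the second stays small. That is the same effect val b = Data.getB achieves in Work.run.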

97 Why Scala? See the longer version of this talk at polyglotprogramming.com/talks

98 polyglotprogramming.com/talks

99 Questions?

100 Bonus Material. You can find an extended version of this talk with more details at polyglotprogramming.com/talks


CSC 261/461 Database Systems Lecture 24. Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101 CSC 261/461 Database Systems Lecture 24 Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101 Announcements Term Paper due on April 20 April 23 Project 1 Milestone 4 is out Due on 05/03 But I would

More information

Beyond MapReduce: Apache Spark Antonino Virgillito

Beyond MapReduce: Apache Spark Antonino Virgillito Beyond MapReduce: Apache Spark Antonino Virgillito 1 Why Spark? Most of Machine Learning Algorithms are iterative because each iteration can improve the results With Disk based approach each iteration

More information

빅데이터기술개요 2016/8/20 ~ 9/3. 윤형기

빅데이터기술개요 2016/8/20 ~ 9/3. 윤형기 빅데이터기술개요 2016/8/20 ~ 9/3 윤형기 (hky@openwith.net) D4 http://www.openwith.net 2 Hive http://www.openwith.net 3 What is Hive? 개념 a data warehouse infrastructure tool to process structured data in Hadoop. Hadoop

More information

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks Asanka Padmakumara ETL 2.0: Data Engineering with Azure Databricks Who am I? Asanka Padmakumara Business Intelligence Consultant, More than 8 years in BI and Data Warehousing A regular speaker in data

More information

SparkStreaming. Large scale near- realtime stream processing. Tathagata Das (TD) UC Berkeley UC BERKELEY

SparkStreaming. Large scale near- realtime stream processing. Tathagata Das (TD) UC Berkeley UC BERKELEY SparkStreaming Large scale near- realtime stream processing Tathagata Das (TD) UC Berkeley UC BERKELEY Motivation Many important applications must process large data streams at second- scale latencies

More information

Big Data Infrastructures & Technologies

Big Data Infrastructures & Technologies Big Data Infrastructures & Technologies Spark and MLLIB OVERVIEW OF SPARK What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop Improves efficiency through: In-memory

More information

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION

More information

Data processing in Apache Spark

Data processing in Apache Spark Data processing in Apache Spark Pelle Jakovits 5 October, 2015, Tartu Outline Introduction to Spark Resilient Distributed Datasets (RDD) Data operations RDD transformations Examples Fault tolerance Frameworks

More information

Lecture 4, 04/08/2015. Scribed by Eric Lax, Andreas Santucci, Charles Zheng.

Lecture 4, 04/08/2015. Scribed by Eric Lax, Andreas Santucci, Charles Zheng. CME 323: Distributed Algorithms and Optimization, Spring 2015 http://stanford.edu/~rezab/dao. Instructor: Reza Zadeh, Databricks and Stanford. Lecture 4, 04/08/2015. Scribed by Eric Lax, Andreas Santucci,

More information

Unifying Big Data Workloads in Apache Spark

Unifying Big Data Workloads in Apache Spark Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache

More information

Structured Streaming. Big Data Analysis with Scala and Spark Heather Miller

Structured Streaming. Big Data Analysis with Scala and Spark Heather Miller Structured Streaming Big Data Analysis with Scala and Spark Heather Miller Why Structured Streaming? DStreams were nice, but in the last session, aggregation operations like a simple word count quickly

More information

Practical Big Data Processing An Overview of Apache Flink

Practical Big Data Processing An Overview of Apache Flink Practical Big Data Processing An Overview of Apache Flink Tilmann Rabl Berlin Big Data Center www.dima.tu-berlin.de bbdc.berlin rabl@tu-berlin.de With slides from Volker Markl and data artisans 1 2013

More information

EPL660: Information Retrieval and Search Engines Lab 11

EPL660: Information Retrieval and Search Engines Lab 11 EPL660: Information Retrieval and Search Engines Lab 11 Παύλος Αντωνίου Γραφείο: B109, ΘΕΕ01 University of Cyprus Department of Computer Science Introduction to Apache Spark Fast and general engine for

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

Accelerating Spark Workloads using GPUs

Accelerating Spark Workloads using GPUs Accelerating Spark Workloads using GPUs Rajesh Bordawekar, Minsik Cho, Wei Tan, Benjamin Herta, Vladimir Zolotov, Alexei Lvov, Liana Fong, and David Kung IBM T. J. Watson Research Center 1 Outline Spark

More information

The detailed Spark programming guide is available at:

The detailed Spark programming guide is available at: Aims This exercise aims to get you to: Analyze data using Spark shell Monitor Spark tasks using Web UI Write self-contained Spark applications using Scala in Eclipse Background Spark is already installed

More information

Introduction to Apache Spark

Introduction to Apache Spark Introduction to Apache Spark Bu eğitim sunumları İstanbul Kalkınma Ajansı nın 2016 yılı Yenilikçi ve Yaratıcı İstanbul Mali Destek Programı kapsamında yürütülmekte olan TR10/16/YNY/0036 no lu İstanbul

More information

We consider the general additive objective function that we saw in previous lectures: n F (w; x i, y i ) i=1

We consider the general additive objective function that we saw in previous lectures: n F (w; x i, y i ) i=1 CME 323: Distributed Algorithms and Optimization, Spring 2015 http://stanford.edu/~rezab/dao. Instructor: Reza Zadeh, Matroid and Stanford. Lecture 13, 5/9/2016. Scribed by Alfredo Láinez, Luke de Oliveira.

More information

Data processing in Apache Spark

Data processing in Apache Spark Data processing in Apache Spark Pelle Jakovits 21 October, 2015, Tartu Outline Introduction to Spark Resilient Distributed Datasets (RDD) Data operations RDD transformations Examples Fault tolerance Streaming

More information

An Overview of Apache Spark

An Overview of Apache Spark An Overview of Apache Spark CIS 612 Sunnie Chung 2014 MapR Technologies 1 MapReduce Processing Model MapReduce, the parallel data processing paradigm, greatly simplified the analysis of big data using

More information

Lambda Architecture with Apache Spark

Lambda Architecture with Apache Spark Lambda Architecture with Apache Spark Michael Hausenblas, Chief Data Engineer MapR First Galway Data Meetup, 2015-02-03 2015 MapR Technologies 2015 MapR Technologies 1 Polyglot Processing 2015 2014 MapR

More information

Certified Big Data Hadoop and Spark Scala Course Curriculum

Certified Big Data Hadoop and Spark Scala Course Curriculum Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills

More information

Batch Processing Basic architecture

Batch Processing Basic architecture Batch Processing Basic architecture in big data systems COS 518: Distributed Systems Lecture 10 Andrew Or, Mike Freedman 2 1 2 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 3

More information

Resilient Distributed Datasets

Resilient Distributed Datasets Resilient Distributed Datasets A Fault- Tolerant Abstraction for In- Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin,

More information

WHITE PAPER. Apache Spark: RDD, DataFrame and Dataset. API comparison and Performance Benchmark in Spark 2.1 and Spark 1.6.3

WHITE PAPER. Apache Spark: RDD, DataFrame and Dataset. API comparison and Performance Benchmark in Spark 2.1 and Spark 1.6.3 WHITE PAPER Apache Spark: RDD, DataFrame and Dataset API comparison and Performance Benchmark in Spark 2.1 and Spark 1.6.3 Prepared by: Eyal Edelman, Big Data Practice Lead Michael Birch, Big Data and

More information

Big Data Architect.

Big Data Architect. Big Data Architect www.austech.edu.au WHAT IS BIG DATA ARCHITECT? A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional

More information

microsoft

microsoft 70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series

More information

Hadoop Development Introduction

Hadoop Development Introduction Hadoop Development Introduction What is Bigdata? Evolution of Bigdata Types of Data and their Significance Need for Bigdata Analytics Why Bigdata with Hadoop? History of Hadoop Why Hadoop is in demand

More information

Cloud, Big Data & Linear Algebra

Cloud, Big Data & Linear Algebra Cloud, Big Data & Linear Algebra Shelly Garion IBM Research -- Haifa 2014 IBM Corporation What is Big Data? 2 Global Data Volume in Exabytes What is Big Data? 2005 2012 2017 3 Global Data Volume in Exabytes

More information

/ Cloud Computing. Recitation 13 April 12 th 2016

/ Cloud Computing. Recitation 13 April 12 th 2016 15-319 / 15-619 Cloud Computing Recitation 13 April 12 th 2016 Overview Last week s reflection Project 4.1 Quiz 11 Budget issues Tagging, 15619Project This week s schedule Unit 5 - Modules 21 Project 4.2

More information

Lightning Fast Cluster Computing. Michael Armbrust Reflections Projections 2015 Michast

Lightning Fast Cluster Computing. Michael Armbrust Reflections Projections 2015 Michast Lightning Fast Cluster Computing Michael Armbrust - @michaelarmbrust Reflections Projections 2015 Michast What is Apache? 2 What is Apache? Fast and general computing engine for clusters created by students

More information

Apache Flink Big Data Stream Processing

Apache Flink Big Data Stream Processing Apache Flink Big Data Stream Processing Tilmann Rabl Berlin Big Data Center www.dima.tu-berlin.de bbdc.berlin rabl@tu-berlin.de XLDB 11.10.2017 1 2013 Berlin Big Data Center All Rights Reserved DIMA 2017

More information

Techno Expert Solutions An institute for specialized studies!

Techno Expert Solutions An institute for specialized studies! Course Content of Big Data Hadoop( Intermediate+ Advance) Pre-requistes: knowledge of Core Java/ Oracle: Basic of Unix S.no Topics Date Status Introduction to Big Data & Hadoop Importance of Data& Data

More information

Data processing in Apache Spark

Data processing in Apache Spark Data processing in Apache Spark Pelle Jakovits 8 October, 2014, Tartu Outline Introduction to Spark Resilient Distributed Data (RDD) Available data operations Examples Advantages and Disadvantages Frameworks

More information

RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING

RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin,

More information

Impala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam

Impala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam Impala A Modern, Open Source SQL Engine for Hadoop Yogesh Chockalingam Agenda Introduction Architecture Front End Back End Evaluation Comparison with Spark SQL Introduction Why not use Hive or HBase?

More information

Parallel Processing Spark and Spark SQL

Parallel Processing Spark and Spark SQL Parallel Processing Spark and Spark SQL Amir H. Payberah amir@sics.se KTH Royal Institute of Technology Amir H. Payberah (KTH) Spark and Spark SQL 2016/09/16 1 / 82 Motivation (1/4) Most current cluster

More information

CS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University

CS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [SPARK] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Return type for collect()? Can

More information

Workload Characterization and Optimization of TPC-H Queries on Apache Spark

Workload Characterization and Optimization of TPC-H Queries on Apache Spark Workload Characterization and Optimization of TPC-H Queries on Apache Spark Tatsuhiro Chiba and Tamiya Onodera IBM Research - Tokyo April. 17-19, 216 IEEE ISPASS 216 @ Uppsala, Sweden Overview IBM Research

More information

/ Cloud Computing. Recitation 13 April 14 th 2015

/ Cloud Computing. Recitation 13 April 14 th 2015 15-319 / 15-619 Cloud Computing Recitation 13 April 14 th 2015 Overview Last week s reflection Project 4.1 Budget issues Tagging, 15619Project This week s schedule Unit 5 - Modules 18 Project 4.2 Demo

More information

Higher level data processing in Apache Spark

Higher level data processing in Apache Spark Higher level data processing in Apache Spark Pelle Jakovits 12 October, 2016, Tartu Outline Recall Apache Spark Spark DataFrames Introduction Creating and storing DataFrames DataFrame API functions SQL

More information

Reactive App using Actor model & Apache Spark. Rahul Kumar Software

Reactive App using Actor model & Apache Spark. Rahul Kumar Software Reactive App using Actor model & Apache Spark Rahul Kumar Software Developer @rahul_kumar_aws About Sigmoid We build realtime & big data systems. OUR CUSTOMERS Agenda Big Data - Intro Distributed Application

More information

Backtesting with Spark

Backtesting with Spark Backtesting with Spark Patrick Angeles, Cloudera Sandy Ryza, Cloudera Rick Carlin, Intel Sheetal Parade, Intel 1 Traditional Grid Shared storage Storage and compute scale independently Bottleneck on I/O

More information

Microsoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo

Microsoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo Microsoft Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo NEW QUESTION 1 You have an Azure HDInsight cluster. You need to store data in a file format that

More information

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals

More information

CS246: Mining Massive Datasets Crash Course in Spark

CS246: Mining Massive Datasets Crash Course in Spark CS246: Mining Massive Datasets Crash Course in Spark Daniel Templeton 1 The Plan Getting started with Spark RDDs Commonly useful operations Using python Using Java Using Scala Help session 2 Go download

More information