Scala and the JVM for Big Data: Lessons from Spark
1 Scala and the JVM for Big Data: Lessons from Spark. Dean Wampler, All Rights Reserved.
2 Spark
3 A Distributed Computing Engine on the JVM
4 [Diagram: Resilient Distributed Datasets. A cluster of nodes; an RDD is split into partitions, one per node.]
5 Productivity? Very concise, elegant, functional APIs. Scala, Java, Python, R, and SQL!
6 Productivity? Interactive shell (REPL): Scala, Python, R, and SQL.
7 Notebooks: Jupyter, Spark Notebook, Zeppelin, Beaker, Databricks
9 Example: Inverted Index
10 [Diagram: Web Crawl. Crawling writes (URL, contents) records to index blocks, e.g. (wikipedia.org/hadoop, "Hadoop provides MapReduce and HDFS"), (wikipedia.org/hbase, "HBase stores data in HDFS").]
11 [Diagram: Compute Inverted Index ("Miracle!!"). Index blocks of (URL, contents) records, e.g. (wikipedia.org/hbase, "HBase stores..."), (wikipedia.org/hive, "Hive queries..."), become inverse index blocks mapping each word to its (URL, count) pairs.]
12 [Diagram detail: the inverse index block: hadoop -> (/hadoop,1); hbase -> (/hbase,1),(/hive,1); hdfs -> (/hadoop,1),(/hbase,1),(/hive,1); hive -> (/hive,1).]
13 import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sparkContext = new SparkContext(master, "Inv. Index")
sparkContext.textFile("/path/to/input").
  map { line =>
    val array = line.split(",", 2)
    (array(0), array(1))  // (id, content)
  }.flatMap { case (id, content) =>
    toWords(content).map(word => ((word,id),1))  // toWords not shown
  }.reduceByKey(_ + _).
  map { case ((word,id),n) => (word,(id,n)) }.
  groupByKey.
  mapValues { seq =>
    sortByCount(seq)  // Sort the value seq by count, desc.
  }.saveAsTextFile("/path/to/output")
14 import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sparkContext = new SparkContext(master, "Inv. Index")
sparkContext.textFile("/path/to/input").
  map { line =>
    val array = line.split(",", 2)
    (array(0), array(1))
  }.flatMap { case (id, contents) =>
    toWords(contents).map(w => ((w,id),1))
15 import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sparkContext = new SparkContext(master, "Inv. Index")
sparkContext.textFile("/path/to/input").    // RDD[String]: "/hadoop, Hadoop provides..."
  map { line =>
    val array = line.split(",", 2)
    (array(0), array(1))                    // RDD[(String,String)]: (/hadoop, "Hadoop provides...")
  }.flatMap { case (id, contents) =>
    toWords(contents).map(w => ((w,id),1))
16   val array = line.split(",", 2)
    (array(0), array(1))
  }.flatMap { case (id, contents) =>
    toWords(contents).map(w => ((w,id),1))
  }.reduceByKey(_ + _).    // RDD[((String,String),Int)]: ((Hadoop,/hadoop),20)
  map { case ((word,id),n) => (word,(id,n)) }.
  groupByKey.
  mapValues { seq =>
    sortByCount(seq)
  }.saveAsTextFile("/path/to/output")
17   val array = line.split(",", 2)
    (array(0), array(1))
  }.flatMap { case (id, contents) =>
    toWords(contents).map(w => ((w,id),1))
  }.reduceByKey(_ + _).
  map { case ((word,id),n) => (word,(id,n)) }.
  groupByKey.
  mapValues { seq =>    // RDD[(String,Iterable[(String,Int)])]: (Hadoop, Seq((/hadoop,20), ...))
    sortByCount(seq)
  }.saveAsTextFile("/path/to/output")
18   val array = line.split(",", 2)
    (array(0), array(1))
  }.flatMap { case (id, contents) =>
    toWords(contents).map(w => ((w,id),1))
  }.reduceByKey(_ + _).
  map { case ((word,id),n) => (word,(id,n)) }.
  groupByKey.    // RDD[(String,Iterable[(String,Int)])]: (Hadoop, Seq((/hadoop,20), ...))
  mapValues { seq =>
    sortByCount(seq)
  }.saveAsTextFile("/path/to/output")
19 Productivity? Intuitive API: a dataflow of steps, inspired by Scala collections and functional programming. [Diagram: textFile -> map -> flatMap -> reduceByKey -> map -> groupByKey -> map -> saveAsTextFile]
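The RDD pipeline above maps almost one-to-one onto plain Scala collections, which is the point of the "inspired by Scala collections" claim. A runnable sketch of the same inverted-index dataflow on local collections: toWords and sortByCount are hypothetical stand-ins for the helpers the slides elide, and groupBy-then-sum substitutes for reduceByKey/groupByKey.

```scala
// Hypothetical stand-ins for the helpers the slides elide.
def toWords(content: String): Seq[String] =
  content.toLowerCase.split("""\W+""").filter(_.nonEmpty).toSeq

def sortByCount(pairs: Iterable[(String, Int)]): Seq[(String, Int)] =
  pairs.toSeq.sortBy { case (_, n) => -n }  // descending by count

// Two (id, content) lines, as in the crawl example.
val lines = Seq(
  "/hadoop,Hadoop provides MapReduce and HDFS",
  "/hbase,HBase stores data in HDFS"
)

val invertedIndex: Map[String, Seq[(String, Int)]] = lines
  .map { line =>
    val array = line.split(",", 2)
    (array(0), array(1))                                 // (id, content)
  }
  .flatMap { case (id, content) =>
    toWords(content).map(word => ((word, id), 1))
  }
  .groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }  // ~ reduceByKey(_ + _)
  .toSeq
  .map { case ((word, id), n) => (word, (id, n)) }
  .groupBy(_._1).map { case (w, ps) => (w, sortByCount(ps.map(_._2))) }  // ~ groupByKey + mapValues
```

Each step has a direct RDD counterpart; Spark's version differs mainly in being lazy and distributed across partitions.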
20 Performance? Lazy API: combines steps into stages. Cache intermediate data in memory. [Same dataflow diagram: textFile -> map -> flatMap -> reduceByKey -> map -> groupByKey -> map -> saveAsTextFile]
22 Higher-Level APIs
23 SQL: Datasets/DataFrames
24 Example
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("Queries").getOrCreate()
val flights = spark.read.parquet("/flights")
val planes  = spark.read.parquet("/planes")
flights.createOrReplaceTempView("flights")
planes.createOrReplaceTempView("planes")
flights.cache(); planes.cache()

val planes_for_flights1 = spark.sql("""
  SELECT * FROM flights f
  JOIN planes p ON f.tailnum = p.tailnum
  LIMIT 100""")

val planes_for_flights2 =
  flights.join(planes, flights("tailnum") === planes("tailnum")).limit(100)
25 import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("Queries").getOrCreate()
val flights = spark.read.parquet("/flights")
val planes  = spark.read.parquet("/planes")
flights.createOrReplaceTempView("flights")
planes.createOrReplaceTempView("planes")
flights.cache(); planes.cache()
27 planes.createOrReplaceTempView("planes")
flights.cache(); planes.cache()

val planes_for_flights1 = spark.sql("""
  SELECT * FROM flights f
  JOIN planes p ON f.tailnum = p.tailnum
  LIMIT 100""")

val planes_for_flights2 =    // Returns another Dataset.
  flights.join(planes, flights("tailnum") === planes("tailnum")).limit(100)
29 val planes_for_flights2 =
  flights.join(planes, flights("tailnum") === planes("tailnum")).limit(100)
The join condition is not an arbitrary anonymous function, but a Column instance.
30 Performance. The Dataset API has the same performance for all languages: Scala, Java, Python, R, and SQL!
31 def join(right: Dataset[_], joinExprs: Column): DataFrame
def groupBy(cols: Column*): RelationalGroupedDataset
def orderBy(sortExprs: Column*): Dataset[T]
def select(cols: Column*): DataFrame
def where(condition: Column): Dataset[T]
def limit(n: Int): Dataset[T]
def intersect(other: Dataset[T]): Dataset[T]
def sample(withReplacement: Boolean, fraction: Double, seed: Long): Dataset[T]
def drop(col: Column): DataFrame
def map[U](f: T => U): Dataset[U]
def flatMap[U](f: T => TraversableOnce[U]): Dataset[U]
def foreach(f: T => Unit): Unit
def take(n: Int): Array[Row]
def count(): Long
def distinct(): Dataset[T]
def agg(exprs: Map[String, String]): DataFrame
33 Structured Streaming
34 [Diagram: a DStream (discretized stream) buckets incoming events into one RDD per time interval (Time 1 RDD, Time 2 RDD, Time 3 RDD, Time 4 RDD); sliding windows each span 3 RDD batches.]
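The batching-and-windowing idea in the diagram can be sketched with plain Scala collections: each element below stands in for the RDD of events captured in one time interval. This is a conceptual sketch only, not the DStream API.

```scala
// Each element stands in for the RDD produced in one time interval.
val batches = Seq(
  Seq("e1", "e2"),  // Time 1
  Seq("e3"),        // Time 2
  Seq("e4", "e5"),  // Time 3
  Seq("e6")         // Time 4
)

// Windows of 3 batches, sliding one interval at a time, as in the diagram.
val windows = batches.sliding(3).toSeq

val window1 = windows(0).flatten  // events from times 1-3
val window2 = windows(1).flatten  // events from times 2-4
```

A windowed DStream operation works the same way at cluster scale: each window is just the union of the last few interval RDDs.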
35 ML/MLlib
36 K-Means. Machine Learning requires: Iterative training of models. Good linear algebra performance.
37 GraphX
38 PageRank. Graph algorithms require: Incremental traversal. Efficient edge and node representations.
39 Foundation: The JVM
40 20 Years of DevOps. Lots of Java Devs.
41 Tools and Libraries: Akka, Breeze, Algebird, Spire & Cats, Axle
42 Big Data Ecosystem
43 But it's not perfect...
44 Richer data libraries in Python & R.
45 Garbage Collection
46 GC Challenges. Typical Spark heaps: 10s-100s of GB. Uncommon for generic, non-data services.
47 GC Challenges. Too many cached RDDs leads to huge old-generation garbage. Billions of objects => long GC pauses.
48 Tuning GC. Best for Spark: -XX:+UseG1GC -XX:-ResizePLAB -Xms... -Xmx... -XX:InitiatingHeapOccupancyPercent=... -XX:ConcGCThreads=... See databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
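A sketch of passing these flags when submitting a job. The numeric values below are illustrative placeholders, not the (elided) values from the slide, and my-job.jar is a hypothetical application jar; -Xms/-Xmx are set via spark-submit's own --driver-memory/--executor-memory options.

```shell
# Illustrative placeholder values; tune for your own heap size and workload.
GC_OPTS="-XX:+UseG1GC -XX:-ResizePLAB \
  -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=8"

spark-submit \
  --driver-memory 8g \
  --executor-memory 32g \
  --driver-java-options "$GC_OPTS" \
  --conf "spark.executor.extraJavaOptions=$GC_OPTS" \
  my-job.jar
```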
49 JVM Object Model
50 Java Objects? "abcd": 4 bytes for raw UTF-8, right? No: 48 bytes for the Java object: 12-byte header. 8 bytes for the hash code. 20 bytes for array overhead. 8 bytes for the UTF-16 chars.
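The arithmetic behind those numbers, for the string "abcd". The exact layout varies by JVM version and flags; the per-component figures here are the slide's.

```scala
// The slide's accounting for the String "abcd" on a typical 64-bit JVM:
val header        = 12  // object header
val hashCodeField = 8   // cached hash code field (plus padding)
val arrayOverhead = 20  // the backing char[] object's own overhead
val utf16Chars    = 8   // 4 chars x 2 bytes each (UTF-16)

val heapBytes = header + hashCodeField + arrayOverhead + utf16Chars  // 48 bytes on the heap
val rawUtf8   = "abcd".getBytes("UTF-8").length                      // vs. 4 bytes of raw UTF-8
```

A 12x overhead for a four-character string, before counting any references pointing at it.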
51 [Diagram: Arrays. val myArray: Array[String] holds references to separately allocated String objects: "zeroth", "first", "second", "third".]
52 [Diagram: Class Instances. val person: Person holds name: String ("Buck Trends"), age: Int (29), and addr: Address, with each reference field pointing to a separate heap object.]
53 [Diagram: Hash Maps. Each bucket holds a hash code (h/c1..h/c4) plus references to separately allocated key and value objects.]
54 Improving Performance. Why obsess about this? Spark jobs are CPU-bound: Improve network I/O? ~2% better. Improve disk I/O? ~20% better.
55 What changed? Faster hardware (compared to ~2000): 10 Gb/s networks. SSDs.
56 What changed? Smarter use of I/O: Pruning unneeded data sooner. Caching more effectively. Efficient formats, like Parquet.
57 What changed? But more CPU use today: More serialization. More compression. More hashing (joins, group-bys).
58 Improving Performance. To improve performance, we need to focus on the CPU: Better algorithms, sure. And optimized use of memory.
59 Project Tungsten: initiative to greatly improve Dataset/DataFrame performance.
60 Goals
61 Reduce References [Diagram: the Person, Array[String], and hash-map layouts from before: many small objects connected by references.]
62 Reduce References: fewer, bigger objects to GC. Fewer cache misses. [Same diagram.]
63 Less Expression Overhead sql("SELECT a + b FROM table") Evaluating expressions billions of times: Virtual function calls. Boxing/unboxing. Branching (if statements, etc.)
64 Implementation
65 Object Encoding. New CompactRow type: null bit set (1 bit/field), values (8 bytes/field), variable-length data; a fixed-width value slot holds an offset to its variable-length data. Compute hashCode and equals on the raw bytes.
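A toy sketch of the idea: pack fixed-width fields into one byte array and compare/hash the raw bytes. This is an illustration only, not Spark's actual row format, and it omits the null bit set and variable-length section the slide describes.

```scala
import java.nio.ByteBuffer
import java.util.Arrays

// Toy fixed-width encoder: each field gets an 8-byte slot, as in CompactRow.
def encode(age: Long, zip: Long): Array[Byte] = {
  val buf = ByteBuffer.allocate(16)  // 2 fields x 8 bytes
  buf.putLong(age)
  buf.putLong(zip)
  buf.array()
}

val r1 = encode(29L, 60601L)
val r2 = encode(29L, 60601L)

// equals and hashCode operate on the raw bytes, not on per-field objects.
val equalRows = Arrays.equals(r1, r2)
val rowHash   = Arrays.hashCode(r1)
```

One heap object per row instead of one per field: fewer references to chase, fewer objects for the GC to trace.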
66 Compare: [Diagram: the reference-heavy Person layout (name: String "Buck Trends", age: Int 29, addr: Address) vs. the CompactRow layout: null bit set (1 bit/field), values (8 bytes/field), variable-length data with offsets.]
67 BytesToBytesMap: [Diagram: hash codes h/c1..h/c4 point into a Tungsten memory page where the key/value pairs k1 v1, k2 v2, k3 v3, k4 v4 are laid out contiguously.]
68 Compare: [Diagram: a classic hash map, with separately allocated key and value objects per bucket, vs. the BytesToBytesMap layout with pairs packed into a Tungsten memory page.]
69 Memory Management. Some allocations off heap, via sun.misc.Unsafe.
70 Less Expression Overhead sql("SELECT a + b FROM table") Solution: generate custom byte code. Spark 1.X - for subexpressions.
71 Less Expression Overhead sql("SELECT a + b FROM table") Solution: generate custom byte code. Spark 1.X - for subexpressions. Spark 2.X - for whole queries.
73 No Value Types (Planned for Java 9 or 10)
74 case class Timestamp(epochMillis: Long) {
  override def toString: String = { /* ... */ }
  def add(delta: TimeDelta): Timestamp = { /* return new shifted time */ }
}
Don't allocate on the heap; just push the primitive long on the stack. (scalac does this now.)
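The "scalac does this now" aside refers to Scala's value classes: a single-field class extending AnyVal is usually compiled down to its underlying primitive, with no heap allocation. A runnable sketch; the slide's TimeDelta type is elided, so a plain Long millisecond delta stands in for it here.

```scala
// A value class: at runtime this is (usually) just a raw Long on the stack.
class Timestamp(val epochMillis: Long) extends AnyVal {
  def add(deltaMillis: Long): Timestamp = new Timestamp(epochMillis + deltaMillis)
  override def toString: String = s"Timestamp($epochMillis)"
}

val t = new Timestamp(1000L).add(500L)  // no Timestamp objects allocated here
```

Boxing still happens when a value class is used generically (e.g. stored in a collection), which is why proper JVM value types remain on the wish list.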
75 Long operations aren't atomic, according to the JVM spec.
76 No Unsigned Types. What's factorial(-1)?
77 Arrays Indexed with Ints: byte arrays limited to 2GB!
78 scala> val N = 1100*1000*1000
N: Int = 1100000000  // 1.1 billion
scala> val array = Array.fill[Short](N)(0)
array: Array[Short] = Array(0, 0, ...)
scala> import org.apache.spark.util.SizeEstimator
scala> SizeEstimator.estimate(array)
res3: Long = ...  // 2.2GB
79 scala> val b = sc.broadcast(array)
b: org.apache.spark.broadcast.Broadcast[Array[Short]] = ...
scala> SizeEstimator.estimate(b)
res0: Long = 2368
scala> sc.parallelize(0 until N).
  map(i => b.value(i))
80 scala> SizeEstimator.estimate(b)
res0: Long = 2368
scala> sc.parallelize(0 until N).
  map(i => b.value(i))

Boom!
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
  at java.util.Arrays.copyOf()
81 But wait... I actually lied to you.
82 Spark handles large broadcast variables by breaking them into blocks.
83 Scala REPL
84 java.lang.OutOfMemoryError: Requested array size exceeds VM limit
  at java.util.Arrays.copyOf()
  at java.io.ByteArrayOutputStream.write()
  at java.io.ObjectOutputStream.writeObject()
  at spark.serializer.JavaSerializationStream.writeObject()
  at spark.util.ClosureCleaner$.ensureSerializable(..)
  at org.apache.spark.rdd.RDD.map()
85 java.lang.OutOfMemoryError: Requested array size exceeds VM limit
  at java.util.Arrays.copyOf()
  at java.io.ByteArrayOutputStream.write()
  at java.io.ObjectOutputStream.writeObject()
  at spark.serializer.JavaSerializationStream.writeObject()
  at spark.util.ClosureCleaner$.ensureSerializable(..)
  at org.apache.spark.rdd.RDD.map()
Pass this closure to RDD.map: i => b.value(i)
86 java.lang.OutOfMemoryError: Requested array size exceeds VM limit
  at java.util.Arrays.copyOf()
  at java.io.ByteArrayOutputStream.write()
  at java.io.ObjectOutputStream.writeObject()
  at spark.serializer.JavaSerializationStream.writeObject()
  at spark.util.ClosureCleaner$.ensureSerializable(..)
  at org.apache.spark.rdd.RDD.map()
Verify that it's clean (serializable): i => b.value(i)
87 java.lang.OutOfMemoryError: Requested array size exceeds VM limit
  at java.util.Arrays.copyOf()
  at java.io.ByteArrayOutputStream.write()
  at java.io.ObjectOutputStream.writeObject()
  at spark.serializer.JavaSerializationStream.writeObject()
  at spark.util.ClosureCleaner$.ensureSerializable(..)
  at org.apache.spark.rdd.RDD.map()
...which it does by serializing to a byte array...
88 java.lang.OutOfMemoryError: Requested array size exceeds VM limit
  at java.util.Arrays.copyOf()
  at java.io.ByteArrayOutputStream.write()
  at java.io.ObjectOutputStream.writeObject()
  at spark.serializer.JavaSerializationStream.writeObject()
  at spark.util.ClosureCleaner$.ensureSerializable(..)
  at org.apache.spark.rdd.RDD.map()
...which requires copying an array. What array???
scala> val array = Array.fill[Short](N)(0)
89 Why did this happen?
90 You write:
scala> val array = Array.fill[Short](N)(0)
scala> val b = sc.broadcast(array)
scala> sc.parallelize(0 until N).
  map(i => b.value(i))
91 You write:
scala> val array = Array.fill[Short](N)(0)
scala> val b = sc.broadcast(array)
scala> sc.parallelize(0 until N).
  map(i => b.value(i))

Scala compiles:
class $iwC extends Serializable {
  val array = Array.fill[Short](N)(0)
  val b = sc.broadcast(array)
  class $iwC extends Serializable {
    sc.parallelize(0 until N).map(i => b.value(i))
  }
}
92 Scala compiles:
class $iwC extends Serializable {
  val array = Array.fill[Short](N)(0)
  val b = sc.broadcast(array)
  class $iwC extends Serializable {
    sc.parallelize(0 until N).map(i => b.value(i))
  }
}
So, this closure over b sucks in the whole object!
93 Lightbend is investigating re-engineering the REPL.
94 Workarounds
95 @transient is often all you need:
scala> @transient val array = Array.fill[Short](N)(0)
96 object Data {  // Encapsulate in objects!
  val N = 1100*1000*1000
  val array = Array.fill[Short](N)(0)
  val getB = sc.broadcast(array)
}

object Work {
  def run(): Unit = {
    val b = Data.getB  // local ref!
    val rdd = sc.parallelize(0 until Data.N).
      map(i => b.value(i))  // only needs b
    rdd.take(10).foreach(println)
  }
}
97 Why Scala? See the longer version of this talk at polyglotprogramming.com/talks
98 polyglotprogramming.com/talks
99 Questions?
100 Bonus Material. You can find an extended version of this talk with more details at polyglotprogramming.com/talks
Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop
More informationAccelerating Spark Workloads using GPUs
Accelerating Spark Workloads using GPUs Rajesh Bordawekar, Minsik Cho, Wei Tan, Benjamin Herta, Vladimir Zolotov, Alexei Lvov, Liana Fong, and David Kung IBM T. J. Watson Research Center 1 Outline Spark
More informationThe detailed Spark programming guide is available at:
Aims This exercise aims to get you to: Analyze data using Spark shell Monitor Spark tasks using Web UI Write self-contained Spark applications using Scala in Eclipse Background Spark is already installed
More informationIntroduction to Apache Spark
Introduction to Apache Spark Bu eğitim sunumları İstanbul Kalkınma Ajansı nın 2016 yılı Yenilikçi ve Yaratıcı İstanbul Mali Destek Programı kapsamında yürütülmekte olan TR10/16/YNY/0036 no lu İstanbul
More informationWe consider the general additive objective function that we saw in previous lectures: n F (w; x i, y i ) i=1
CME 323: Distributed Algorithms and Optimization, Spring 2015 http://stanford.edu/~rezab/dao. Instructor: Reza Zadeh, Matroid and Stanford. Lecture 13, 5/9/2016. Scribed by Alfredo Láinez, Luke de Oliveira.
More informationData processing in Apache Spark
Data processing in Apache Spark Pelle Jakovits 21 October, 2015, Tartu Outline Introduction to Spark Resilient Distributed Datasets (RDD) Data operations RDD transformations Examples Fault tolerance Streaming
More informationAn Overview of Apache Spark
An Overview of Apache Spark CIS 612 Sunnie Chung 2014 MapR Technologies 1 MapReduce Processing Model MapReduce, the parallel data processing paradigm, greatly simplified the analysis of big data using
More informationLambda Architecture with Apache Spark
Lambda Architecture with Apache Spark Michael Hausenblas, Chief Data Engineer MapR First Galway Data Meetup, 2015-02-03 2015 MapR Technologies 2015 MapR Technologies 1 Polyglot Processing 2015 2014 MapR
More informationCertified Big Data Hadoop and Spark Scala Course Curriculum
Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills
More informationBatch Processing Basic architecture
Batch Processing Basic architecture in big data systems COS 518: Distributed Systems Lecture 10 Andrew Or, Mike Freedman 2 1 2 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 3
More informationResilient Distributed Datasets
Resilient Distributed Datasets A Fault- Tolerant Abstraction for In- Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin,
More informationWHITE PAPER. Apache Spark: RDD, DataFrame and Dataset. API comparison and Performance Benchmark in Spark 2.1 and Spark 1.6.3
WHITE PAPER Apache Spark: RDD, DataFrame and Dataset API comparison and Performance Benchmark in Spark 2.1 and Spark 1.6.3 Prepared by: Eyal Edelman, Big Data Practice Lead Michael Birch, Big Data and
More informationBig Data Architect.
Big Data Architect www.austech.edu.au WHAT IS BIG DATA ARCHITECT? A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional
More informationmicrosoft
70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series
More informationHadoop Development Introduction
Hadoop Development Introduction What is Bigdata? Evolution of Bigdata Types of Data and their Significance Need for Bigdata Analytics Why Bigdata with Hadoop? History of Hadoop Why Hadoop is in demand
More informationCloud, Big Data & Linear Algebra
Cloud, Big Data & Linear Algebra Shelly Garion IBM Research -- Haifa 2014 IBM Corporation What is Big Data? 2 Global Data Volume in Exabytes What is Big Data? 2005 2012 2017 3 Global Data Volume in Exabytes
More information/ Cloud Computing. Recitation 13 April 12 th 2016
15-319 / 15-619 Cloud Computing Recitation 13 April 12 th 2016 Overview Last week s reflection Project 4.1 Quiz 11 Budget issues Tagging, 15619Project This week s schedule Unit 5 - Modules 21 Project 4.2
More informationLightning Fast Cluster Computing. Michael Armbrust Reflections Projections 2015 Michast
Lightning Fast Cluster Computing Michael Armbrust - @michaelarmbrust Reflections Projections 2015 Michast What is Apache? 2 What is Apache? Fast and general computing engine for clusters created by students
More informationApache Flink Big Data Stream Processing
Apache Flink Big Data Stream Processing Tilmann Rabl Berlin Big Data Center www.dima.tu-berlin.de bbdc.berlin rabl@tu-berlin.de XLDB 11.10.2017 1 2013 Berlin Big Data Center All Rights Reserved DIMA 2017
More informationTechno Expert Solutions An institute for specialized studies!
Course Content of Big Data Hadoop( Intermediate+ Advance) Pre-requistes: knowledge of Core Java/ Oracle: Basic of Unix S.no Topics Date Status Introduction to Big Data & Hadoop Importance of Data& Data
More informationData processing in Apache Spark
Data processing in Apache Spark Pelle Jakovits 8 October, 2014, Tartu Outline Introduction to Spark Resilient Distributed Data (RDD) Available data operations Examples Advantages and Disadvantages Frameworks
More informationRESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING
RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin,
More informationImpala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam
Impala A Modern, Open Source SQL Engine for Hadoop Yogesh Chockalingam Agenda Introduction Architecture Front End Back End Evaluation Comparison with Spark SQL Introduction Why not use Hive or HBase?
More informationParallel Processing Spark and Spark SQL
Parallel Processing Spark and Spark SQL Amir H. Payberah amir@sics.se KTH Royal Institute of Technology Amir H. Payberah (KTH) Spark and Spark SQL 2016/09/16 1 / 82 Motivation (1/4) Most current cluster
More informationCS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University
CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [SPARK] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Return type for collect()? Can
More informationWorkload Characterization and Optimization of TPC-H Queries on Apache Spark
Workload Characterization and Optimization of TPC-H Queries on Apache Spark Tatsuhiro Chiba and Tamiya Onodera IBM Research - Tokyo April. 17-19, 216 IEEE ISPASS 216 @ Uppsala, Sweden Overview IBM Research
More information/ Cloud Computing. Recitation 13 April 14 th 2015
15-319 / 15-619 Cloud Computing Recitation 13 April 14 th 2015 Overview Last week s reflection Project 4.1 Budget issues Tagging, 15619Project This week s schedule Unit 5 - Modules 18 Project 4.2 Demo
More informationHigher level data processing in Apache Spark
Higher level data processing in Apache Spark Pelle Jakovits 12 October, 2016, Tartu Outline Recall Apache Spark Spark DataFrames Introduction Creating and storing DataFrames DataFrame API functions SQL
More informationReactive App using Actor model & Apache Spark. Rahul Kumar Software
Reactive App using Actor model & Apache Spark Rahul Kumar Software Developer @rahul_kumar_aws About Sigmoid We build realtime & big data systems. OUR CUSTOMERS Agenda Big Data - Intro Distributed Application
More informationBacktesting with Spark
Backtesting with Spark Patrick Angeles, Cloudera Sandy Ryza, Cloudera Rick Carlin, Intel Sheetal Parade, Intel 1 Traditional Grid Shared storage Storage and compute scale independently Bottleneck on I/O
More informationMicrosoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo
Microsoft Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo NEW QUESTION 1 You have an Azure HDInsight cluster. You need to store data in a file format that
More informationBig Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours
Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals
More informationCS246: Mining Massive Datasets Crash Course in Spark
CS246: Mining Massive Datasets Crash Course in Spark Daniel Templeton 1 The Plan Getting started with Spark RDDs Commonly useful operations Using python Using Java Using Scala Help session 2 Go download
More information