Scala and the JVM for Big Data: Lessons from Spark
1 Scala and the JVM for Big Data: Lessons from Spark. Dean Wampler, All Rights Reserved.
2 Spark
3 A Distributed Computing Engine on the JVM
4 [Diagram: Resilient Distributed Datasets. A cluster of nodes; an RDD is split into partitions, one per node.]
5 Productivity? Very concise, elegant, functional APIs. Scala, Java, Python, R, and SQL!
6 Productivity? Interactive shell (REPL): Scala, Python, R, and SQL.
7 Notebooks: Jupyter, Spark Notebook, Zeppelin, Beaker, Databricks
9 Example: Inverted Index
10 [Diagram: Web Crawl. Crawling writes (URL, contents) records to index blocks, e.g. (wikipedia.org/hadoop, "Hadoop provides MapReduce and HDFS"), (wikipedia.org/hbase, "HBase stores data in HDFS").]
11 [Diagram: Compute Inverted Index ("Miracle!!"). Index blocks of (URL, contents) records, e.g. (wikipedia.org/hbase, "HBase stores..."), (wikipedia.org/hive, "Hive queries..."), become inverse index blocks mapping each word to its (URL, count) pairs.]
12 [Diagram detail: the inverse index block: hadoop -> (/hadoop,1); hbase -> (/hbase,1),(/hive,1); hdfs -> (/hadoop,1),(/hbase,1),(/hive,1); hive -> (/hive,1).]
13 import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sparkContext = new SparkContext(master, "Inv. Index")
sparkContext.textFile("/path/to/input").
  map { line =>
    val array = line.split(",", 2)
    (array(0), array(1))  // (id, content)
  }.flatMap { case (id, content) =>
    toWords(content).map(word => ((word,id),1))  // toWords not shown
  }.reduceByKey(_ + _).
  map { case ((word,id),n) => (word,(id,n)) }.
  groupByKey.
  mapValues { seq =>
    sortByCount(seq)  // Sort the value seq by count, desc.
  }.saveAsTextFile("/path/to/output")
14 import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sparkContext = new SparkContext(master, "Inv. Index")
sparkContext.textFile("/path/to/input").
  map { line =>
    val array = line.split(",", 2)
    (array(0), array(1))
  }.flatMap { case (id, contents) =>
    toWords(contents).map(w => ((w,id),1))
15 import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sparkContext = new SparkContext(master, "Inv. Index")
sparkContext.textFile("/path/to/input").    // RDD[String]: "/hadoop, Hadoop provides..."
  map { line =>
    val array = line.split(",", 2)
    (array(0), array(1))                    // RDD[(String,String)]: (/hadoop, "Hadoop provides...")
  }.flatMap { case (id, contents) =>
    toWords(contents).map(w => ((w,id),1))
16   val array = line.split(",", 2)
    (array(0), array(1))
  }.flatMap { case (id, contents) =>
    toWords(contents).map(w => ((w,id),1))
  }.reduceByKey(_ + _).    // RDD[((String,String),Int)]: ((Hadoop,/hadoop),20)
  map { case ((word,id),n) => (word,(id,n)) }.
  groupByKey.
  mapValues { seq =>
    sortByCount(seq)
  }.saveAsTextFile("/path/to/output")
17   val array = line.split(",", 2)
    (array(0), array(1))
  }.flatMap { case (id, contents) =>
    toWords(contents).map(w => ((w,id),1))
  }.reduceByKey(_ + _).
  map { case ((word,id),n) => (word,(id,n)) }.
  groupByKey.
  mapValues { seq =>    // RDD[(String,Iterable[(String,Int)])]: (Hadoop, Seq((/hadoop,20), ...))
    sortByCount(seq)
  }.saveAsTextFile("/path/to/output")
18   val array = line.split(",", 2)
    (array(0), array(1))
  }.flatMap { case (id, contents) =>
    toWords(contents).map(w => ((w,id),1))
  }.reduceByKey(_ + _).
  map { case ((word,id),n) => (word,(id,n)) }.
  groupByKey.    // RDD[(String,Iterable[(String,Int)])]: (Hadoop, Seq((/hadoop,20), ...))
  mapValues { seq =>
    sortByCount(seq)
  }.saveAsTextFile("/path/to/output")
19 Productivity? Intuitive API: a dataflow of steps, inspired by Scala collections and functional programming. [Diagram: textFile -> map -> flatMap -> reduceByKey -> map -> groupByKey -> map -> saveAsTextFile]
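The RDD pipeline above maps almost one-to-one onto plain Scala collections, which is the point of the "inspired by Scala collections" claim. A runnable sketch of the same inverted-index dataflow on local collections: toWords and sortByCount are hypothetical stand-ins for the helpers the slides elide, and groupBy-then-sum substitutes for reduceByKey/groupByKey.

```scala
// Hypothetical stand-ins for the helpers the slides elide.
def toWords(content: String): Seq[String] =
  content.toLowerCase.split("""\W+""").filter(_.nonEmpty).toSeq

def sortByCount(pairs: Iterable[(String, Int)]): Seq[(String, Int)] =
  pairs.toSeq.sortBy { case (_, n) => -n }  // descending by count

// Two (id, content) lines, as in the crawl example.
val lines = Seq(
  "/hadoop,Hadoop provides MapReduce and HDFS",
  "/hbase,HBase stores data in HDFS"
)

val invertedIndex: Map[String, Seq[(String, Int)]] = lines
  .map { line =>
    val array = line.split(",", 2)
    (array(0), array(1))                                 // (id, content)
  }
  .flatMap { case (id, content) =>
    toWords(content).map(word => ((word, id), 1))
  }
  .groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }  // ~ reduceByKey(_ + _)
  .toSeq
  .map { case ((word, id), n) => (word, (id, n)) }
  .groupBy(_._1).map { case (w, ps) => (w, sortByCount(ps.map(_._2))) }  // ~ groupByKey + mapValues
```

Each step has a direct RDD counterpart; Spark's version differs mainly in being lazy and distributed across partitions.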
20 Performance? Lazy API: combines steps into stages. Cache intermediate data in memory. [Same dataflow diagram: textFile -> map -> flatMap -> reduceByKey -> map -> groupByKey -> map -> saveAsTextFile]
22 Higher-Level APIs
23 SQL: Datasets/DataFrames
24 Example
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("Queries").getOrCreate()
val flights = spark.read.parquet("/flights")
val planes  = spark.read.parquet("/planes")
flights.createOrReplaceTempView("flights")
planes.createOrReplaceTempView("planes")
flights.cache(); planes.cache()

val planes_for_flights1 = spark.sql("""
  SELECT * FROM flights f
  JOIN planes p ON f.tailnum = p.tailnum
  LIMIT 100""")

val planes_for_flights2 =
  flights.join(planes, flights("tailnum") === planes("tailnum")).limit(100)
25 import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("Queries").getOrCreate()
val flights = spark.read.parquet("/flights")
val planes  = spark.read.parquet("/planes")
flights.createOrReplaceTempView("flights")
planes.createOrReplaceTempView("planes")
flights.cache(); planes.cache()
27 planes.createOrReplaceTempView("planes")
flights.cache(); planes.cache()

val planes_for_flights1 = spark.sql("""
  SELECT * FROM flights f
  JOIN planes p ON f.tailnum = p.tailnum
  LIMIT 100""")

val planes_for_flights2 =    // Returns another Dataset.
  flights.join(planes, flights("tailnum") === planes("tailnum")).limit(100)
29 val planes_for_flights2 =
  flights.join(planes, flights("tailnum") === planes("tailnum")).limit(100)
The join condition is not an arbitrary anonymous function, but a Column instance.
30 Performance. The Dataset API has the same performance for all languages: Scala, Java, Python, R, and SQL!
31 def join(right: Dataset[_], joinExprs: Column): DataFrame
def groupBy(cols: Column*): RelationalGroupedDataset
def orderBy(sortExprs: Column*): Dataset[T]
def select(cols: Column*): DataFrame
def where(condition: Column): Dataset[T]
def limit(n: Int): Dataset[T]
def intersect(other: Dataset[T]): Dataset[T]
def sample(withReplacement: Boolean, fraction: Double, seed: Long): Dataset[T]
def drop(col: Column): DataFrame
def map[U](f: T => U): Dataset[U]
def flatMap[U](f: T => TraversableOnce[U]): Dataset[U]
def foreach(f: T => Unit): Unit
def take(n: Int): Array[Row]
def count(): Long
def distinct(): Dataset[T]
def agg(exprs: Map[String, String]): DataFrame
33 Structured Streaming
34 [Diagram: a DStream (discretized stream) buckets incoming events into one RDD per time interval (Time 1 RDD, Time 2 RDD, Time 3 RDD, Time 4 RDD); sliding windows each span 3 RDD batches.]
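The batching-and-windowing idea in the diagram can be sketched with plain Scala collections: each element below stands in for the RDD of events captured in one time interval. This is a conceptual sketch only, not the DStream API.

```scala
// Each element stands in for the RDD produced in one time interval.
val batches = Seq(
  Seq("e1", "e2"),  // Time 1
  Seq("e3"),        // Time 2
  Seq("e4", "e5"),  // Time 3
  Seq("e6")         // Time 4
)

// Windows of 3 batches, sliding one interval at a time, as in the diagram.
val windows = batches.sliding(3).toSeq

val window1 = windows(0).flatten  // events from times 1-3
val window2 = windows(1).flatten  // events from times 2-4
```

A windowed DStream operation works the same way at cluster scale: each window is just the union of the last few interval RDDs.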
35 ML/MLlib
36 K-Means. Machine Learning requires: Iterative training of models. Good linear algebra performance.
37 GraphX
38 PageRank. Graph algorithms require: Incremental traversal. Efficient edge and node representations.
39 Foundation: The JVM
40 20 Years of DevOps. Lots of Java Devs.
41 Tools and Libraries: Akka, Breeze, Algebird, Spire & Cats, Axle
42 Big Data Ecosystem
43 But it's not perfect...
44 Richer data libraries in Python & R.
45 Garbage Collection
46 GC Challenges. Typical Spark heaps: 10s-100s of GB. Uncommon for generic, non-data services.
47 GC Challenges. Too many cached RDDs leads to huge old-generation garbage. Billions of objects => long GC pauses.
48 Tuning GC. Best for Spark: -XX:+UseG1GC -XX:-ResizePLAB -Xms... -Xmx... -XX:InitiatingHeapOccupancyPercent=... -XX:ConcGCThreads=... See databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
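A sketch of passing these flags when submitting a job. The numeric values below are illustrative placeholders, not the (elided) values from the slide, and my-job.jar is a hypothetical application jar; -Xms/-Xmx are set via spark-submit's own --driver-memory/--executor-memory options.

```shell
# Illustrative placeholder values; tune for your own heap size and workload.
GC_OPTS="-XX:+UseG1GC -XX:-ResizePLAB \
  -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=8"

spark-submit \
  --driver-memory 8g \
  --executor-memory 32g \
  --driver-java-options "$GC_OPTS" \
  --conf "spark.executor.extraJavaOptions=$GC_OPTS" \
  my-job.jar
```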
49 JVM Object Model
50 Java Objects? "abcd": 4 bytes for raw UTF-8, right? No: 48 bytes for the Java object: 12-byte header. 8 bytes for the hash code. 20 bytes for array overhead. 8 bytes for the UTF-16 chars.
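The arithmetic behind those numbers, for the string "abcd". The exact layout varies by JVM version and flags; the per-component figures here are the slide's.

```scala
// The slide's accounting for the String "abcd" on a typical 64-bit JVM:
val header        = 12  // object header
val hashCodeField = 8   // cached hash code field (plus padding)
val arrayOverhead = 20  // the backing char[] object's own overhead
val utf16Chars    = 8   // 4 chars x 2 bytes each (UTF-16)

val heapBytes = header + hashCodeField + arrayOverhead + utf16Chars  // 48 bytes on the heap
val rawUtf8   = "abcd".getBytes("UTF-8").length                      // vs. 4 bytes of raw UTF-8
```

A 12x overhead for a four-character string, before counting any references pointing at it.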
51 [Diagram: Arrays. val myArray: Array[String] holds references to separately allocated String objects: "zeroth", "first", "second", "third".]
52 [Diagram: Class Instances. val person: Person holds name: String ("Buck Trends"), age: Int (29), and addr: Address, with each reference field pointing to a separate heap object.]
53 [Diagram: Hash Maps. Each bucket holds a hash code (h/c1..h/c4) plus references to separately allocated key and value objects.]
54 Improving Performance. Why obsess about this? Spark jobs are CPU-bound: Improve network I/O? ~2% better. Improve disk I/O? ~20% better.
55 What changed? Faster hardware (compared to ~2000): 10 Gb/s networks. SSDs.
56 What changed? Smarter use of I/O: Pruning unneeded data sooner. Caching more effectively. Efficient formats, like Parquet.
57 What changed? But more CPU use today: More serialization. More compression. More hashing (joins, group-bys).
58 Improving Performance. To improve performance, we need to focus on the CPU: Better algorithms, sure. And optimized use of memory.
59 Project Tungsten: initiative to greatly improve Dataset/DataFrame performance.
60 Goals
61 Reduce References [Diagram: the Person, Array[String], and hash-map layouts from before: many small objects connected by references.]
62 Reduce References: fewer, bigger objects to GC. Fewer cache misses. [Same diagram.]
63 Less Expression Overhead sql("SELECT a + b FROM table") Evaluating expressions billions of times: Virtual function calls. Boxing/unboxing. Branching (if statements, etc.)
64 Implementation
65 Object Encoding. New CompactRow type: null bit set (1 bit/field), values (8 bytes/field), variable-length data; a fixed-width value slot holds an offset to its variable-length data. Compute hashCode and equals on the raw bytes.
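A toy sketch of the idea: pack fixed-width fields into one byte array and compare/hash the raw bytes. This is an illustration only, not Spark's actual row format, and it omits the null bit set and variable-length section the slide describes.

```scala
import java.nio.ByteBuffer
import java.util.Arrays

// Toy fixed-width encoder: each field gets an 8-byte slot, as in CompactRow.
def encode(age: Long, zip: Long): Array[Byte] = {
  val buf = ByteBuffer.allocate(16)  // 2 fields x 8 bytes
  buf.putLong(age)
  buf.putLong(zip)
  buf.array()
}

val r1 = encode(29L, 60601L)
val r2 = encode(29L, 60601L)

// equals and hashCode operate on the raw bytes, not on per-field objects.
val equalRows = Arrays.equals(r1, r2)
val rowHash   = Arrays.hashCode(r1)
```

One heap object per row instead of one per field: fewer references to chase, fewer objects for the GC to trace.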
66 Compare: [Diagram: the reference-heavy Person layout (name: String "Buck Trends", age: Int 29, addr: Address) vs. the CompactRow layout: null bit set (1 bit/field), values (8 bytes/field), variable-length data with offsets.]
67 BytesToBytesMap: [Diagram: hash codes h/c1..h/c4 point into a Tungsten memory page where the key/value pairs k1 v1, k2 v2, k3 v3, k4 v4 are laid out contiguously.]
68 Compare: [Diagram: a classic hash map, with separately allocated key and value objects per bucket, vs. the BytesToBytesMap layout with pairs packed into a Tungsten memory page.]
69 Memory Management. Some allocations off heap, via sun.misc.Unsafe.
70 Less Expression Overhead sql("SELECT a + b FROM table") Solution: generate custom byte code. Spark 1.X - for subexpressions.
71 Less Expression Overhead sql("SELECT a + b FROM table") Solution: generate custom byte code. Spark 1.X - for subexpressions. Spark 2.X - for whole queries.
73 No Value Types (Planned for Java 9 or 10)
74 case class Timestamp(epochMillis: Long) {
  override def toString: String = { /* ... */ }
  def add(delta: TimeDelta): Timestamp = { /* return new shifted time */ }
}
Don't allocate on the heap; just push the primitive long on the stack. (scalac does this now.)
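The "scalac does this now" aside refers to Scala's value classes: a single-field class extending AnyVal is usually compiled down to its underlying primitive, with no heap allocation. A runnable sketch; the slide's TimeDelta type is elided, so a plain Long millisecond delta stands in for it here.

```scala
// A value class: at runtime this is (usually) just a raw Long on the stack.
class Timestamp(val epochMillis: Long) extends AnyVal {
  def add(deltaMillis: Long): Timestamp = new Timestamp(epochMillis + deltaMillis)
  override def toString: String = s"Timestamp($epochMillis)"
}

val t = new Timestamp(1000L).add(500L)  // no Timestamp objects allocated here
```

Boxing still happens when a value class is used generically (e.g. stored in a collection), which is why proper JVM value types remain on the wish list.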
75 Long operations aren't atomic, according to the JVM spec.
76 No Unsigned Types. What's factorial(-1)?
77 Arrays Indexed with Ints: byte arrays limited to 2GB!
78 scala> val N = 1100*1000*1000
N: Int = 1100000000  // 1.1 billion
scala> val array = Array.fill[Short](N)(0)
array: Array[Short] = Array(0, 0, ...)
scala> import org.apache.spark.util.SizeEstimator
scala> SizeEstimator.estimate(array)
res3: Long = ...  // 2.2GB
79 scala> val b = sc.broadcast(array)
b: org.apache.spark.broadcast.Broadcast[Array[Short]] = ...
scala> SizeEstimator.estimate(b)
res0: Long = 2368
scala> sc.parallelize(0 until N).
  map(i => b.value(i))
80 scala> SizeEstimator.estimate(b)
res0: Long = 2368
scala> sc.parallelize(0 until N).
  map(i => b.value(i))

Boom!
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
  at java.util.Arrays.copyOf()
81 But wait... I actually lied to you.
82 Spark handles large broadcast variables by breaking them into blocks.
83 Scala REPL
84 java.lang.OutOfMemoryError: Requested array size exceeds VM limit
  at java.util.Arrays.copyOf()
  at java.io.ByteArrayOutputStream.write()
  at java.io.ObjectOutputStream.writeObject()
  at spark.serializer.JavaSerializationStream.writeObject()
  at spark.util.ClosureCleaner$.ensureSerializable(..)
  at org.apache.spark.rdd.RDD.map()
85 java.lang.OutOfMemoryError: Requested array size exceeds VM limit
  at java.util.Arrays.copyOf()
  at java.io.ByteArrayOutputStream.write()
  at java.io.ObjectOutputStream.writeObject()
  at spark.serializer.JavaSerializationStream.writeObject()
  at spark.util.ClosureCleaner$.ensureSerializable(..)
  at org.apache.spark.rdd.RDD.map()
Pass this closure to RDD.map: i => b.value(i)
86 java.lang.OutOfMemoryError: Requested array size exceeds VM limit
  at java.util.Arrays.copyOf()
  at java.io.ByteArrayOutputStream.write()
  at java.io.ObjectOutputStream.writeObject()
  at spark.serializer.JavaSerializationStream.writeObject()
  at spark.util.ClosureCleaner$.ensureSerializable(..)
  at org.apache.spark.rdd.RDD.map()
Verify that it's clean (serializable): i => b.value(i)
87 java.lang.OutOfMemoryError: Requested array size exceeds VM limit
  at java.util.Arrays.copyOf()
  at java.io.ByteArrayOutputStream.write()
  at java.io.ObjectOutputStream.writeObject()
  at spark.serializer.JavaSerializationStream.writeObject()
  at spark.util.ClosureCleaner$.ensureSerializable(..)
  at org.apache.spark.rdd.RDD.map()
...which it does by serializing to a byte array...
88 java.lang.OutOfMemoryError: Requested array size exceeds VM limit
  at java.util.Arrays.copyOf()
  at java.io.ByteArrayOutputStream.write()
  at java.io.ObjectOutputStream.writeObject()
  at spark.serializer.JavaSerializationStream.writeObject()
  at spark.util.ClosureCleaner$.ensureSerializable(..)
  at org.apache.spark.rdd.RDD.map()
...which requires copying an array. What array???
scala> val array = Array.fill[Short](N)(0)
89 Why did this happen?
90 You write:
scala> val array = Array.fill[Short](N)(0)
scala> val b = sc.broadcast(array)
scala> sc.parallelize(0 until N).
  map(i => b.value(i))
91 You write:
scala> val array = Array.fill[Short](N)(0)
scala> val b = sc.broadcast(array)
scala> sc.parallelize(0 until N).
  map(i => b.value(i))

Scala compiles:
class $iwC extends Serializable {
  val array = Array.fill[Short](N)(0)
  val b = sc.broadcast(array)
  class $iwC extends Serializable {
    sc.parallelize(0 until N).map(i => b.value(i))
  }
}
92 Scala compiles:
class $iwC extends Serializable {
  val array = Array.fill[Short](N)(0)
  val b = sc.broadcast(array)
  class $iwC extends Serializable {
    sc.parallelize(0 until N).map(i => b.value(i))
  }
}
So, this closure over b sucks in the whole object!
93 Lightbend is investigating re-engineering the REPL.
94 Workarounds
95 @transient is often all you need:
scala> @transient val array = Array.fill[Short](N)(0)
96 object Data {  // Encapsulate in objects!
  val N = 1100*1000*1000
  val array = Array.fill[Short](N)(0)
  val getB = sc.broadcast(array)
}

object Work {
  def run(): Unit = {
    val b = Data.getB  // local ref!
    val rdd = sc.parallelize(0 until Data.N).
      map(i => b.value(i))  // only needs b
    rdd.take(10).foreach(println)
  }
}
97 Why Scala? See the longer version of this talk at polyglotprogramming.com/talks
98 polyglotprogramming.com/talks
99 Questions?
100 Bonus Material. You can find an extended version of this talk with more details at polyglotprogramming.com/talks
Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop
More informationAccelerating Spark Workloads using GPUs
Accelerating Spark Workloads using GPUs Rajesh Bordawekar, Minsik Cho, Wei Tan, Benjamin Herta, Vladimir Zolotov, Alexei Lvov, Liana Fong, and David Kung IBM T. J. Watson Research Center 1 Outline Spark
More informationThe detailed Spark programming guide is available at:
Aims This exercise aims to get you to: Analyze data using Spark shell Monitor Spark tasks using Web UI Write self-contained Spark applications using Scala in Eclipse Background Spark is already installed
More informationIntroduction to Apache Spark
Introduction to Apache Spark Bu eğitim sunumları İstanbul Kalkınma Ajansı nın 2016 yılı Yenilikçi ve Yaratıcı İstanbul Mali Destek Programı kapsamında yürütülmekte olan TR10/16/YNY/0036 no lu İstanbul
More informationWe consider the general additive objective function that we saw in previous lectures: n F (w; x i, y i ) i=1
CME 323: Distributed Algorithms and Optimization, Spring 2015 http://stanford.edu/~rezab/dao. Instructor: Reza Zadeh, Matroid and Stanford. Lecture 13, 5/9/2016. Scribed by Alfredo Láinez, Luke de Oliveira.
More informationData processing in Apache Spark
Data processing in Apache Spark Pelle Jakovits 21 October, 2015, Tartu Outline Introduction to Spark Resilient Distributed Datasets (RDD) Data operations RDD transformations Examples Fault tolerance Streaming
More informationAn Overview of Apache Spark
An Overview of Apache Spark CIS 612 Sunnie Chung 2014 MapR Technologies 1 MapReduce Processing Model MapReduce, the parallel data processing paradigm, greatly simplified the analysis of big data using
More informationLambda Architecture with Apache Spark
Lambda Architecture with Apache Spark Michael Hausenblas, Chief Data Engineer MapR First Galway Data Meetup, 2015-02-03 2015 MapR Technologies 2015 MapR Technologies 1 Polyglot Processing 2015 2014 MapR
More informationCertified Big Data Hadoop and Spark Scala Course Curriculum
Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills
More informationBatch Processing Basic architecture
Batch Processing Basic architecture in big data systems COS 518: Distributed Systems Lecture 10 Andrew Or, Mike Freedman 2 1 2 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 3
More informationResilient Distributed Datasets
Resilient Distributed Datasets A Fault- Tolerant Abstraction for In- Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin,
More informationWHITE PAPER. Apache Spark: RDD, DataFrame and Dataset. API comparison and Performance Benchmark in Spark 2.1 and Spark 1.6.3
WHITE PAPER Apache Spark: RDD, DataFrame and Dataset API comparison and Performance Benchmark in Spark 2.1 and Spark 1.6.3 Prepared by: Eyal Edelman, Big Data Practice Lead Michael Birch, Big Data and
More informationBig Data Architect.
Big Data Architect www.austech.edu.au WHAT IS BIG DATA ARCHITECT? A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional
More informationmicrosoft
70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series
More informationHadoop Development Introduction
Hadoop Development Introduction What is Bigdata? Evolution of Bigdata Types of Data and their Significance Need for Bigdata Analytics Why Bigdata with Hadoop? History of Hadoop Why Hadoop is in demand
More informationCloud, Big Data & Linear Algebra
Cloud, Big Data & Linear Algebra Shelly Garion IBM Research -- Haifa 2014 IBM Corporation What is Big Data? 2 Global Data Volume in Exabytes What is Big Data? 2005 2012 2017 3 Global Data Volume in Exabytes
More information/ Cloud Computing. Recitation 13 April 12 th 2016
15-319 / 15-619 Cloud Computing Recitation 13 April 12 th 2016 Overview Last week s reflection Project 4.1 Quiz 11 Budget issues Tagging, 15619Project This week s schedule Unit 5 - Modules 21 Project 4.2
More informationLightning Fast Cluster Computing. Michael Armbrust Reflections Projections 2015 Michast
Lightning Fast Cluster Computing Michael Armbrust - @michaelarmbrust Reflections Projections 2015 Michast What is Apache? 2 What is Apache? Fast and general computing engine for clusters created by students
More informationApache Flink Big Data Stream Processing
Apache Flink Big Data Stream Processing Tilmann Rabl Berlin Big Data Center www.dima.tu-berlin.de bbdc.berlin rabl@tu-berlin.de XLDB 11.10.2017 1 2013 Berlin Big Data Center All Rights Reserved DIMA 2017
More informationTechno Expert Solutions An institute for specialized studies!
Course Content of Big Data Hadoop( Intermediate+ Advance) Pre-requistes: knowledge of Core Java/ Oracle: Basic of Unix S.no Topics Date Status Introduction to Big Data & Hadoop Importance of Data& Data
More informationData processing in Apache Spark
Data processing in Apache Spark Pelle Jakovits 8 October, 2014, Tartu Outline Introduction to Spark Resilient Distributed Data (RDD) Available data operations Examples Advantages and Disadvantages Frameworks
More informationRESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING
RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin,
More informationImpala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam
Impala A Modern, Open Source SQL Engine for Hadoop Yogesh Chockalingam Agenda Introduction Architecture Front End Back End Evaluation Comparison with Spark SQL Introduction Why not use Hive or HBase?
More informationParallel Processing Spark and Spark SQL
Parallel Processing Spark and Spark SQL Amir H. Payberah amir@sics.se KTH Royal Institute of Technology Amir H. Payberah (KTH) Spark and Spark SQL 2016/09/16 1 / 82 Motivation (1/4) Most current cluster
More informationCS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University
CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [SPARK] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Return type for collect()? Can
More informationWorkload Characterization and Optimization of TPC-H Queries on Apache Spark
Workload Characterization and Optimization of TPC-H Queries on Apache Spark Tatsuhiro Chiba and Tamiya Onodera IBM Research - Tokyo April. 17-19, 216 IEEE ISPASS 216 @ Uppsala, Sweden Overview IBM Research
More information/ Cloud Computing. Recitation 13 April 14 th 2015
15-319 / 15-619 Cloud Computing Recitation 13 April 14 th 2015 Overview Last week s reflection Project 4.1 Budget issues Tagging, 15619Project This week s schedule Unit 5 - Modules 18 Project 4.2 Demo
More informationHigher level data processing in Apache Spark
Higher level data processing in Apache Spark Pelle Jakovits 12 October, 2016, Tartu Outline Recall Apache Spark Spark DataFrames Introduction Creating and storing DataFrames DataFrame API functions SQL
More informationReactive App using Actor model & Apache Spark. Rahul Kumar Software
Reactive App using Actor model & Apache Spark Rahul Kumar Software Developer @rahul_kumar_aws About Sigmoid We build realtime & big data systems. OUR CUSTOMERS Agenda Big Data - Intro Distributed Application
More informationBacktesting with Spark
Backtesting with Spark Patrick Angeles, Cloudera Sandy Ryza, Cloudera Rick Carlin, Intel Sheetal Parade, Intel 1 Traditional Grid Shared storage Storage and compute scale independently Bottleneck on I/O
More informationMicrosoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo
Microsoft Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo NEW QUESTION 1 You have an Azure HDInsight cluster. You need to store data in a file format that
More informationBig Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours
Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals
More informationCS246: Mining Massive Datasets Crash Course in Spark
CS246: Mining Massive Datasets Crash Course in Spark Daniel Templeton 1 The Plan Getting started with Spark RDDs Commonly useful operations Using python Using Java Using Scala Help session 2 Go download
More information