Spark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM

Size: px

Start display at page:

Download "Spark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM"

Shauna Esther Adams
6 years ago
Views:

1 Spark 2 Alexey Zinovyev, Java/BigData Trainer in EPAM

2 With IT since 2007 With Java since 2009 With Hadoop since 2012 With EPAM since 2015 About

3 Secret Word from EPAM itsubbotnik Big Data Training 3

4 Contacts Alexey_Zinovyev@epam.com Facebook: vk.com/big_data_russia Big Data Russia vk.com/java_jvm Java & JVM langs Big Data Training 4

5 Sprk Dvlprs! Let s start! 5

6 < SPARK 2.0 6

7 Modern Java in 2016 Big Data in 2014 Big Data Training 7

8 Big Data in 2017 Big Data Training 8

9 Machine Learning EVERYWHERE Big Data Training 9

10 Machine Learning vs Traditional Programming Big Data Training 10

11 Big Data Training 11

12 Something wrong with HADOOP Big Data Training 12

13 Hadoop is not SEXY 13

14 Whaaaat? 14

15 Map Reduce Job Writing 15

16 MR code 16

17 Hadoop Developers Right Now 17

18 Iterative Calculations 10x 100x 18

19 MapReduce vs Spark 19

20 MapReduce vs Spark 20

21 MapReduce vs Spark 21

22 SPARK 2.0 DISCUSSION 22

23 Spark Family 23

24 Spark Family 24

25 Case #0 : How to avoid DStreams with RDD-like API? 25

26 Continuous Applications 26

27 Continuous Applications cases Updating data that will be served in real time Extract, transform and load (ETL) Creating a real-time version of an existing batch job Online machine learning 27

28 Write Batches 28

29 Catch Streaming 29

30 The main concept of Structured Streaming You can express your streaming computation the same way you would express a batch computation on static data. 30

31 // Read JSON once from S3 logsdf = spark.read.json("s3://logs") Batch // Transform with DataFrame API and save logsdf.select("user", "url", "date").write.parquet("s3://out") 31

32 // Read JSON continuously from S3 logsdf = spark.readstream.json("s3://logs") Real Time // Transform with DataFrame API and save logsdf.select("user", "url", "date").writestream.parquet("s3://out").start() 32

33 Unlimited Table 33

34 val lines = spark.readstream.format("socket").option("host", "localhost") WordCount from Socket.option("port", 9999).load() val words = lines.as[string].flatmap(_.split(" ")) val wordcounts = words.groupby("value").count() 34

35 val lines = spark.readstream.format("socket").option("host", "localhost") WordCount from Socket.option("port", 9999).load() val words = lines.as[string].flatmap(_.split(" ")) val wordcounts = words.groupby("value").count() 35

36 val lines = spark.readstream.format("socket").option("host", "localhost") WordCount from Socket.option("port", 9999).load() val words = lines.as[string].flatmap(_.split(" ")) val wordcounts = words.groupby("value").count() 36

37 val lines = spark.readstream.format("socket").option("host", "localhost") WordCount from Socket.option("port", 9999).load() Don t forget to start Streaming val words = lines.as[string].flatmap(_.split(" ")) val wordcounts = words.groupby("value").count() 37

38 WordCount with Structured Streaming 38

39 Structured Streaming provides fast & scalable fault-tolerant end-to-end with exactly-once semantic stream processing ability to use DataFrame/DataSet API for streaming 39

40 Structured Streaming provides (in dreams) fast & scalable fault-tolerant end-to-end with exactly-once semantic stream processing ability to use DataFrame/DataSet API for streaming 40

41 Let s UNION streaming and static DataSets 41

42 Let s UNION streaming and static DataSets org.apache.spark.sql. AnalysisException: Union between streaming and batch DataFrames/Datasets is not supported; 42

43 Let s UNION streaming and static DataSets Go to UnsupportedOperationChecker.scala and check your operation 43

44 Case #1 : We should think about optimization in RDD terms 44

45 Single Thread collection 45

46 No perf issues, right? 46

47 The main concept more partitions = more parallelism 47

48 Do it parallel 48

49 I d like NARROW 49

50 Map, filter, filter 50

51 GroupByKey, join 51

52 Case #2 : DataFrames suggest mix SQL and Scala functions 52

53 History of Spark APIs 53

54 rdd.filter(_.age > 21) // RDD RDD df.filter("age > 21") // DataFrame SQL-style df.filter(df.col("age").gt(21)) // Expression style dataset.filter(_.age < 21); // Dataset API 54

55 History of Spark APIs 55

56 rdd.filter(_.age > 21) // RDD SQL df.filter("age > 21") // DataFrame SQL-style df.filter(df.col("age").gt(21)) // Expression style dataset.filter(_.age < 21); // Dataset API 56

57 rdd.filter(_.age > 21) // RDD Expression df.filter("age > 21") // DataFrame SQL-style df.filter(df.col("age").gt(21)) // Expression style dataset.filter(_.age < 21); // Dataset API 57

58 History of Spark APIs 58

59 rdd.filter(_.age > 21) // RDD DataSet df.filter("age > 21") // DataFrame SQL-style df.filter(df.col("age").gt(21)) // Expression style dataset.filter(_.age < 21); // Dataset API 59

60 Case #2 : DataFrame is referring to data attributes by name 60

61 DataSet = RDD s types + DataFrame s Catalyst RDD API compile-time type-safety off-heap storage mechanism performance benefits of the Catalyst query optimizer Tungsten 61

62 DataSet = RDD s types + DataFrame s Catalyst RDD API compile-time type-safety off-heap storage mechanism performance benefits of the Catalyst query optimizer Tungsten 62

63 Structured APIs in SPARK 63

64 Unified API in Spark 2.0 DataFrame = Dataset[Row] Dataframe is a schemaless (untyped) Dataset now 64

65 case class User( String, footsize: Long, name: String) // DataFrame -> DataSet with Users val userds = Define case class spark.read.json("/home/tmp/datasets/users.json").as[user] userds.map(_.name).collect() userds.filter(_.footsize > 38).collect() ds.rdd // IF YOU REALLY WANT 65

66 case class User( String, footsize: Long, name: String) // DataFrame -> DataSet with Users val userds = Read JSON spark.read.json("/home/tmp/datasets/users.json").as[user] userds.map(_.name).collect() userds.filter(_.footsize > 38).collect() ds.rdd // IF YOU REALLY WANT 66

67 case class User( String, footsize: Long, name: String) // DataFrame -> DataSet with Users val userds = Filter by Field spark.read.json("/home/tmp/datasets/users.json").as[user] userds.map(_.name).collect() userds.filter(_.footsize > 38).collect() ds.rdd // IF YOU REALLY WANT 67

68 Case #3 : Spark has many contexts 68

69 Spark Session New entry point in spark for creating datasets Replaces SQLContext, HiveContext and StreamingContext Move from SparkContext to SparkSession signifies move away from RDD 69

70 val sparksession = SparkSession.builder.master("local").appName("spark session example").getorcreate() Spark Session val df = sparksession.read.option("header","true").csv("src/main/resources/names.csv") df.show() 70

71 No, I want to create my lovely RDDs 71

72 Where s parallelize() method? 72

73 case class User( String, footsize: Long, name: String) // DataFrame -> DataSet with Users val userds = spark.read.json("/home/tmp/datasets/users.json").as[user] RDD? userds.map(_.name).collect() userds.filter(_.footsize > 38).collect() ds.rdd // IF YOU REALLY WANT 73

74 Case #4 : Spark uses Java serialization A LOT 74

75 Two choices to distribute data across cluster Java serialization By default with ObjectOutputStream Kryo serialization Should register classes (no support of Serialazible) 75

76 The main problem: overhead of serializing Each serialized object contains the class structure as well as the values 76

77 The main problem: overhead of serializing Each serialized object contains the class structure as well as the values Don t forget about GC 77

78 Tungsten Compact Encoding 78

79 Encoder s concept Generate bytecode to interact with off-heap & Give access to attributes without ser/deser 79

80 Encoders 80

81 No custom encoders 81

82 Case #5 : Not enough storage levels 82

83 Caching in Spark Frequently used RDD can be stored in memory One method, one short-cut: persist(), cache() SparkContext keeps track of cached RDD Serialized or deserialized Java objects 83

84 Full list of options MEMORY_ONLY MEMORY_AND_DISK MEMORY_ONLY_SER MEMORY_AND_DISK_SER DISK_ONLY MEMORY_ONLY_2, MEMORY_AND_DISK_2 84

85 Spark Core Storage Level MEMORY_ONLY (default for Spark Core) MEMORY_AND_DISK MEMORY_ONLY_SER MEMORY_AND_DISK_SER DISK_ONLY MEMORY_ONLY_2, MEMORY_AND_DISK_2 85

86 Spark Streaming Storage Level MEMORY_ONLY (default for Spark Core) MEMORY_AND_DISK MEMORY_ONLY_SER (default for Spark Streaming) MEMORY_AND_DISK_SER DISK_ONLY MEMORY_ONLY_2, MEMORY_AND_DISK_2 86

87 Developer API to make new Storage Levels 87

88 What s the most popular file format in BigData? 88

89 Case #6 : External libraries to read CSV 89

90 data = sqlcontext.read.format("csv").option("header", "true").option("inferschema", "true") Easy to read CSV.load("/datasets/samples/users.csv") data.cache() data.createorreplacetempview( users") display(data) 90

91 Case #7 : How to measure Spark performance? 91

92 You'd measure performance! 92

93 TPCDS 99 Queries 93

94 94

95 How to benchmark Spark 95

96 Special Tool from Databricks Benchmark Tool for SparkSQL 96

97 Spark 2 vs Spark

98 Case #8 : What s faster: SQL or DataSet API? 98

99 Job Stages in old Spark 99

100 Scheduler Optimizations 100

101 Catalyst Optimizer for DataFrames 101

102 Unified Logical Plan DataFrame = Dataset[Row] 102

103 Bytecode 103

104 DataSet.explain() == Physical Plan == Project [avg(price)#43,carat#45] +- SortMergeJoin [color#21], [color#47] :- Sort [color#21 ASC], false, 0 : +- TungstenExchange hashpartitioning(color#21,200), None : +- Project [avg(price)#43,color#21] : +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=final,isdistinct=false)], output=[color#21,avg(price)#43]) : +- TungstenExchange hashpartitioning(cut#20,color#21,200), None : +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=partial,isdistinct=false)], output=[cut#20,color#21,sum#58,count#59l]) : +- Scan CsvRelation(-----) +- Sort [color#47 ASC], false, 0 +- TungstenExchange hashpartitioning(color#47,200), None +- ConvertToUnsafe +- Scan CsvRelation(----) 104

105 Case #9 : Why does explain() show so many Tungsten things? 105

106 How to be effective with CPU Runtime code generation Exploiting cache locality Off-heap memory management 106

107 Tungsten s goal Push performance closer to the limits of modern hardware 107

108 Maybe something UNSAFE? 108

109 UnsafeRowFormat Bit set for tracking null values Small values are inlined For variable-length values are stored relative offset into the variablelength data section Rows are always 8-byte word aligned Equality comparison and hashing can be performed on raw bytes without requiring additional interpretation 109

110 Case #10 : Can I influence on Memory Management in Spark? 110

111 Case #11 : Should I tune generation s stuff? 111

112 Cached Data 112

113 During operations 113

114 For your needs 114

115 For Dark Lord 115

116 IN CONCLUSION 116

117 We have no ability join structured streaming and other sources to handle it one unified ML API GraphX rethinking and redesign Custom encoders Datasets everywhere integrate with something important 117

118 Roadmap Support other data sources (not only S3 + HDFS) Transactional updates Dataset is one DSL for all operations GraphFrames + Structured MLLib Tungsten: custom encoders The RDD-based API is expected to be removed in Spark

119 And we can DO IT! Push Events & Alarms ( , SNMP etc.) Real-Time ETL & CEP Search Index Social Real-Time Data-Marts Real-time Data Ingestion Batch Data-Marts External Batch Data Ingestion Batch ETL & Raw Area HDFS CFS as an option Titan & KairosDB store data in Cassandra Relations Graph Ontology Metadata Internal Scheduler Time-Series Data Real-time Dashboarding All Raw Data backup is stored here Events & Alarms Events & Alarms 119

120 First Part 120

121 Second Part 121

122 Contacts Alexey_Zinovyev@epam.com Facebook: vk.com/big_data_russia Big Data Russia vk.com/java_jvm Java & JVM langs 122

123 Any questions? 123

Kafka pours and Spark resolves! Alexey Zinovyev, Java/BigData Trainer in EPAM

Kafka pours and Spark resolves! Alexey Zinovyev, Java/BigData Trainer in EPAM With IT since 2007 With Java since 2009 With Hadoop since 2012 With Spark since 2014 With EPAM since 2015 About Contacts E-mail