Kafka pours and Spark resolves! Alexey Zinovyev, Java/BigData Trainer in EPAM

Size: px

Start display at page:

Download "Kafka pours and Spark resolves! Alexey Zinovyev, Java/BigData Trainer in EPAM"

Chastity Green
6 years ago
Views:

1 Kafka pours and Spark resolves! Alexey Zinovyev, Java/BigData Trainer in EPAM

2 With IT since 2007 With Java since 2009 With Hadoop since 2012 With Spark since 2014 With EPAM since 2015 About

3 Contacts Alexey_Zinovyev@epam.com Facebook: vk.com/big_data_russia Big Data Russia vk.com/java_jvm Java & JVM langs 3

4 Spark Family 4

5 Spark Family 5

6 Pre-summary Before RealTime Spark + Cassandra Sending messages with Kafka DStream Kafka Consumer Structured Streaming in Spark 2.1 Kafka Writer in Spark 2.2 6

7 < REAL-TIME 7

8 Modern Java in 2016 Big Data in

9 Batch jobs produce reports. More and more.. 9

10 But customer can wait forever (ok, 1h) 10

11 Big Data in

12 Machine Learning EVERYWHERE 12

13 Data Lake in promotional brochure 13

14 Data Lake in production 14

15 Simple Flow in Reporting/BI systems 15

16 Let s use Spark. It s fast! 16

17 MapReduce vs Spark 17

18 MapReduce vs Spark 18

19 Simple Flow in Reporting/BI systems with Spark 19

20 Spark handles last year logs with ease 20

21 Where can we store events? 21

22 Let s use Cassandra to store events! 22

23 Let s use Cassandra to read events! 23

24 CREATE KEYSPACE myspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 }; USE test; Cassandra CREATE TABLE logs ( application TEXT, time TIMESTAMP, message TEXT, PRIMARY KEY (application, time)); 24

25 val dataset = sqlcontext.read.format("org.apache.spark.sql.cassandra") Cassandra to Spark.options(Map( "table" -> "logs", "keyspace" -> "myspace" )).load() dataset.filter("message = 'Log message'").show() 25

26 Simple Flow in Pre-Real-Time systems 26

27 Spark cluster over Cassandra Cluster 27

28 More events every second! 28

29 SENDING MESSAGES 29

30 Your Father s Messaging System 30

31 Your Father s Messaging System 31

32 Your Father s Messaging System InitialContext ctx = new InitialContext(); QueueConnectionFactory f = (QueueConnectionFactory)ctx.lookup( qcfactory"); QueueConnection con = f.createqueueconnection(); con.start(); 32

33 KAFKA 33

34 Kafka messaging system 34

35 Kafka messaging system distributed 35

36 Kafka messaging system distributed supports Publish-Subscribe model 36

37 Kafka messaging system distributed supports Publish-Subscribe model persists messages on disk 37

38 Kafka messaging system distributed supports Publish-Subscribe model persists messages on disk replicates within the cluster (integrated with Zookeeper) 38

39 The main benefits of Kafka Scalability with zero down time Zero data loss due to replication 39

40 Kafka Cluster consists of brokers (leader or follower) topics ( >= 1 partition) partitions partition offsets replicas of partition producers/consumers 40

41 Kafka Components with topic messages #1 Producer Thread #1 Producer Thread #2 Producer Thread #3 Data Data Data Topic: Messages Zookeeper 41

42 Kafka Components with topic messages #2 Producer Thread #1 Producer Thread #2 Producer Thread #3 Data Data Data Leader Part #1 Topic: Messages Leader Part #2 Part #1 Part #2 Broker #1 Zookeeper 42

43 Kafka Components with topic messages #3 Part #1 Producer Thread #1 Producer Thread #2 Data Data Part #1 Topic: Messages Part #2 Broker #1 Producer Thread #3 Data Part #2 Part #1 Broker #2 Part #2 Zookeeper 43

44 Why do we need Zookeeper? 44

45 Kafka Demo 45

46 REAL TIME WITH DSTREAMS 46

47 RDD Factory 47

48 From socket to console with DStreams 48

49 val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.sockettextstream("localhost", 9999) DStream val words = lines.flatmap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordcounts = pairs.reducebykey(_ + _) wordcounts.print() ssc.start() ssc.awaittermination() 49

50 val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.sockettextstream("localhost", 9999) DStream val words = lines.flatmap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordcounts = pairs.reducebykey(_ + _) wordcounts.print() ssc.start() ssc.awaittermination() 50

51 val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.sockettextstream("localhost", 9999) DStream val words = lines.flatmap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordcounts = pairs.reducebykey(_ + _) wordcounts.print() ssc.start() ssc.awaittermination() 51

52 val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.sockettextstream("localhost", 9999) DStream val words = lines.flatmap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordcounts = pairs.reducebykey(_ + _) wordcounts.print() ssc.start() ssc.awaittermination() 52

53 val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.sockettextstream("localhost", 9999) DStream val words = lines.flatmap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordcounts = pairs.reducebykey(_ + _) wordcounts.print() ssc.start() ssc.awaittermination() 53

54 Kafka as a main entry point for Spark 54

55 DStreams Demo 55

56 How to avoid DStreams with RDD-like API? 56

57 SPARK 2.2 DISCUSSION 57

58 Continuous Applications 58

59 Continuous Applications cases Updating data that will be served in real time Extract, transform and load (ETL) Creating a real-time version of an existing batch job Online machine learning 59

60 The main concept of Structured Streaming You can express your streaming computation the same way you would express a batch computation on static data. 60

61 // Read JSON once from S3 Batch Spark 2.2 logsdf = spark.read.json("s3://logs") // Transform with DataFrame API and save logsdf.select("user", "url", "date").write.parquet("s3://out") 61

62 // Read JSON continuously from S3 Real Time Spark 2.2 logsdf = spark.readstream.json("s3://logs") // Transform with DataFrame API and save logsdf.select("user", "url", "date").writestream.parquet("s3://out").start() 62

63 val lines = spark.readstream.format("socket").option("host", "localhost") WordCount from Socket.option("port", 9999).load() val words = lines.as[string].flatmap(_.split(" ")) val wordcounts = words.groupby("value").count() 63

64 val lines = spark.readstream.format("socket").option("host", "localhost") WordCount from Socket.option("port", 9999).load() val words = lines.as[string].flatmap(_.split(" ")) val wordcounts = words.groupby("value").count() 64

65 val lines = spark.readstream.format("socket").option("host", "localhost") WordCount from Socket.option("port", 9999).load() val words = lines.as[string].flatmap(_.split(" ")) val wordcounts = words.groupby("value").count() 65

66 val lines = spark.readstream.format("socket").option("host", "localhost") WordCount from Socket.option("port", 9999).load() Don t forget to start Streaming val words = lines.as[string].flatmap(_.split(" ")) val wordcounts = words.groupby("value").count() 66

67 Unlimited Table 67

68 WordCount with Structured Streaming [Complete Mode] 68

69 Kafka -> Structured Streaming -> Console 69

70 Kafka To Console Demo 70

71 OPERATIONS 71

72 You can filter sort aggregate join foreach explain 72

73 Operators Demo 73

74 How it works? 74

75 Deep Diving in Spark Internals Dataset 75

76 Deep Diving in Spark Internals Dataset Logical Plan 76

77 Deep Diving in Spark Internals Dataset Logical Plan Optimized Logical Plan 77

78 Deep Diving in Spark Internals Dataset Logical Plan Optimized Logical Plan Logical Plan Logical Plan Logical Physical Plan Plan 78

79 Deep Diving in Spark Internals Dataset Logical Plan Optimized Logical Plan Planner Logical Plan Logical Plan Logical Physical Plan Plan Selected Physical Plan 79

80 Deep Diving in Spark Internals Dataset Logical Plan Optimized Logical Plan Logical Plan Logical Plan Logical Physical Plan Plan Selected Physical Plan 80

81 Deep Diving in Spark Internals Dataset Logical Plan Optimized Logical Plan Incremental #1 Logical Plan Logical Plan Logical Physical Plan Plan Incremental #2 Selected Physical Plan Incremental #3 81

82 Incremental Execution: Planner polls Planner 82

83 Incremental Execution: Planner runs Planner R U N Offset: 1-5 Incremental #1 Count: 5 83

84 Incremental Execution: Planner runs #2 Planner R U N Offset: 1-5 Incremental #1 Count: 5 Offset: 6-9 Incremental #2 Count: 4 84

85 Aggregation with State Planner R U N Offset: 1-5 Incremental #1 Count: 5 5 Offset: 6-9 Incremental #2 Count: 5+4=9 85

86 DataSet.explain() == Physical Plan == Project [avg(price)#43,carat#45] +- SortMergeJoin [color#21], [color#47] :- Sort [color#21 ASC], false, 0 : +- TungstenExchange hashpartitioning(color#21,200), None : +- Project [avg(price)#43,color#21] : +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=final,isdistinct=false)], output=[color#21,avg(price)#43]) : +- TungstenExchange hashpartitioning(cut#20,color#21,200), None : +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=partial,isdistinct=false)], output=[cut#20,color#21,sum#58,count#59l]) : +- Scan CsvRelation(-----) +- Sort [color#47 ASC], false, 0 +- TungstenExchange hashpartitioning(color#47,200), None +- ConvertToUnsafe +- Scan CsvRelation(----) 86

87 What s the difference between Complete and Append output modes? 87

88 COMPLETE, APPEND & UPDATE 88

89 There are two main modes and one in future append (default) 89

90 There are two main modes and one in future append (default) complete 90

91 There are two main modes and one in future append (default) complete update [in dreams] 91

92 Aggregation with watermarks 92

93 SOURCES & SINKS 93

94 Spark Streaming is a brick in the Big Data Wall 94

95 Let's save to Parquet files 95

96 Let's save to Parquet files 96

97 Let's save to Parquet files 97

98 File to Memory 98

99 Can we write to Kafka? 99

100 Nightly Build 100

101 Kafka-to-Kafka 101

102 J-K-S-K-S-C 102

103 Pipeline Demo 103

104 I didn t find sink/source for XXX 104

105 import org.apache.spark.sql.foreachwriter Console Foreach Sink val customwriter = new ForeachWriter[String] { override def open(partitionid: Long, version: Long) = true override def process(value: String) = println(value) override def close(errorornull: Throwable) = {} } stream.writestream.queryname( ForeachOnConsole").foreach(customWriter).start 105

106 Pinch of wisdom check checkpointlocation don t use MemoryStream think about GC pauses be careful about nighty builds use.groupby.count() instead count() use console sink instead.show() function 106

107 We have no ability join two streams work with update mode make full outer join take first N rows sort without pre-aggregation 107

108 Roadmap 2.2 Support other data sources (not only S3 + HDFS) Transactional updates Dataset is one DSL for all operations GraphFrames + Structured MLLib KafkaWriter TensorFrames 108

109 IN CONCLUSION 109

110 Final point Scalable Fault-Tolerant Real-Time Pipeline with Spark & Kafka is ready for usage 110

111 A few papers about Spark Streaming and Kafka Introduction in Spark + Kafka 111

112 Contacts Alexey_Zinovyev@epam.com Facebook: vk.com/big_data_russia Big Data Russia vk.com/java_jvm Java & JVM langs 112

113 Any questions? 113

Spark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM

Spark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM Spark 2 Alexey Zinovyev, Java/BigData Trainer in EPAM With IT since 2007 With Java since 2009 With Hadoop since 2012 With EPAM since 2015 About Secret Word from EPAM itsubbotnik Big Data Training 3 Contacts