Real-time data processing with Apache Flink

1 Real-time data processing with Apache Flink
Gyula Fóra, Flink committer, Swedish ICT

2 Stream processing
Data stream: an infinite sequence of data arriving in a continuous fashion.
Stream processing: analyzing and acting on real-time streaming data, using continuous queries.

3 3 Parts of a Streaming Infrastructure
Data sources: server logs, sensors, transaction logs
Parts: Gathering, Broker, Analysis

4 Streaming landscape
Apache Storm
  True streaming over distributed dataflow
  Low-level API (Bolts, Spouts) + Trident
Spark Streaming
  Stream processing emulated on top of a batch system (non-native)
  Functional API (DStreams), restricted by the batch runtime
Apache Samza
  True streaming built on top of Apache Kafka; state is a first-class citizen
  Slightly different stream notion, low-level API
Apache Flink
  True streaming over stateful distributed dataflow
  Rich functional API exploiting the streaming runtime, e.g. rich windowing semantics

5 What is Flink
[Stack diagram]
Libraries on the DataSet API (Java/Scala/Python): Hadoop M/R, Table, Gelly, ML, Dataflow, MRQL, Cascading (WiP)
Libraries on the DataStream API (Java/Scala): Table, SAMOA, Dataflow
Core: streaming dataflow runtime
Deployment: Local, Remote, Yarn, Tez, Embedded

6 Program compilation
    case class Path(from: Long, to: Long)
    val tc = edges.iterate(10) { paths: DataSet[Path] =>
      val next = paths
        .join(edges).where("to").equalTo("from") {
          (path, edge) => Path(path.from, edge.to)
        }
        .union(paths)
        .distinct()
      next
    }
[Diagram: in the pre-flight phase on the client, the program passes through the type extraction stack and the optimizer, which produce a dataflow graph. The master handles task scheduling, dataflow metadata, operator deployment, and tracking of intermediate results; the workers run the operators. The example plan reads DataSource orders.tbl and DataSource lineitem.tbl and contains Filter, Map, a hybrid hash Join (build HT / probe, hash-part [0]), sort, forward, and GroupRed.]

7 Flink Streaming

8 What is Flink Streaming
Native stream processor (low latency)
Expressive functional API
Flexible operator state, stream windows
Exactly-once processing guarantees

9 Native vs non-native streaming
Non-native streaming: a stream discretizer issues one batch job after another
    while (true) {
      // get next few records
      // issue batch computation
    }
Native streaming: long-standing operators
    while (true) {
      // process next record
    }
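The two loops above can be contrasted in a few lines of plain Scala. This is only an illustrative sketch with no Flink dependency; the object and method names (`StreamingModels`, `microBatch`, `perRecord`) are hypothetical. Both variants keep a running sum, but the micro-batch version only produces a result at each batch boundary, while the native version produces one per record.

```scala
object StreamingModels {
  // Non-native (assumed model): buffer records into small batches and run
  // one "job" per batch. Emits one running total per completed batch.
  def microBatch(records: Iterator[Int], batchSize: Int): List[Int] = {
    var total = 0
    records.grouped(batchSize).map { batch =>
      total += batch.sum // one batch computation
      total
    }.toList
  }

  // Native (assumed model): a long-running operator updates its state on
  // every record and emits a running total per record.
  def perRecord(records: Iterator[Int]): List[Int] =
    records.scanLeft(0)(_ + _).drop(1).toList
}
```

With the same input, `perRecord` yields a result for every element, which is where the latency advantage of native streaming comes from.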

10 Pipelined stream processor
Streaming shuffle!

11 Defining windows in Flink
Trigger policy
  When to trigger the computation on the current window
Eviction policy
  When data points should leave the window
  Defines the window width/size
E.g., count-based policy
  Evict when #elements > n
  Start a new window every n-th element
Built-in: Count, Time, Delta policies
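A count-based trigger and eviction policy can be sketched in plain Scala as follows. This is not Flink's implementation; `CountWindow.windows` and its parameters `size` and `slide` are hypothetical names chosen for illustration. Eviction keeps at most `size` elements in the buffer, and the trigger emits the buffer after every `slide`-th element.

```scala
object CountWindow {
  // Eviction policy: keep at most `size` elements in the buffer.
  // Trigger policy: emit the current buffer after every `slide`-th element.
  def windows[T](input: Seq[T], size: Int, slide: Int): List[List[T]] = {
    var buffer = Vector.empty[T]
    val fired = List.newBuilder[List[T]]
    input.zipWithIndex.foreach { case (elem, i) =>
      buffer = (buffer :+ elem).takeRight(size) // evict the oldest when over capacity
      if ((i + 1) % slide == 0) fired += buffer.toList // trigger
    }
    fired.result()
  }
}
```

For example, `CountWindow.windows(Seq(1, 2, 3, 4, 5, 6), size = 3, slide = 2)` fires three times, each time with the most recent elements still in the buffer.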

12 Expressive APIs
    case class Word(word: String, frequency: Int)
DataSet API (batch):
    val lines: DataSet[String] = env.readTextFile(...)
    lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
      .groupBy("word").sum("frequency")
      .print()
DataStream API (streaming):
    val lines: DataStream[String] = env.fromSocketStream(...)
    lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
      .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
      .groupBy("word").sum("frequency")
      .print()

13 DataStream API

14 Overview of the API
Data stream sources
  File system
  Message queue connectors
  Arbitrary source functionality
Stream transformations
  Basic transformations: Map, Reduce, Filter, Aggregations
  Binary stream transformations: CoMap, CoReduce
  Windowing semantics: policy-based flexible windowing (Time, Count, Delta)
  Temporal binary stream operators: Joins, Crosses
  Native support for iterations
Data stream outputs
For the details please refer to the programming guide:
[Diagram: example dataflow built from Src, Map, Reduce, Filter, Merge, Sum, and Sink operators]

15 Use-case: Financial analytics
Reading from multiple inputs
  Merge stock data from various sources
Window aggregations
  Compute simple statistics over windows
Data-driven windows
  Define arbitrary windowing semantics
Combine with sentiment analysis
  Enrich your analytics with social media feeds (Twitter)
Streaming joins
  Join multiple data streams
Detailed explanation and source code on our blog

16 Reading from multiple inputs
[Diagram: the socket lines "HDP, 23.8" and "HDP, 26.6" are parsed into StockPrice(HDP, 23.8) and StockPrice(HDP, 26.6) and merged with the generated StockPrice(SPX, 2113.9) and StockPrice(FTSE, 6931.7). The markers (1) to (4) link the diagram stages to the code.]
    case class StockPrice(symbol: String, price: Double)
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Read text lines from a socket and map them to StockPrice objects
    val socketStockStream = env.socketTextStream("localhost", 9999).map(x => {
      val split = x.split(",")
      StockPrice(split(0), split(1).toDouble)
    })

    // Generate other stock streams
    val SPX_Stream = env.addSource(generateStock("SPX")(10) _)
    val FTSE_Stream = env.addSource(generateStock("FTSE")(20) _)

    // Merge all stock streams together
    val stockStream = socketStockStream.merge(SPX_Stream, FTSE_Stream)
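The map function on this slide assumes every socket line is well formed. A slightly more defensive parser could look like the sketch below, written in plain Scala without Flink; `StockParsing` and `parseStockPrice` are hypothetical names, and returning `Option` for malformed input is a design choice of this sketch, not something the slides prescribe.

```scala
import scala.util.Try

object StockParsing {
  case class StockPrice(symbol: String, price: Double)

  // Parses a line such as "HDP, 23.8"; returns None for malformed input
  // (wrong number of fields, empty symbol, or a non-numeric price).
  def parseStockPrice(line: String): Option[StockPrice] =
    line.split(",").map(_.trim) match {
      case Array(symbol, price) if symbol.nonEmpty =>
        Try(price.toDouble).toOption.map(StockPrice(symbol, _))
      case _ => None
    }
}
```

In a streaming job, such a function would typically be applied with a flat-map style operator so that malformed lines are dropped instead of failing the job.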

17 Window aggregations
[Diagram:
  (1) window contents: StockPrice(SPX, 2113.9), StockPrice(FTSE, 6931.7), StockPrice(HDP, 23.8), StockPrice(HDP, 26.6)
  (2) lowest price: StockPrice(HDP, 23.8)
  (3) maximum per stock: StockPrice(SPX, 2113.9), StockPrice(FTSE, 6931.7), StockPrice(HDP, 26.6)
  (4) rolling mean per stock: StockPrice(SPX, ), StockPrice(FTSE, ), StockPrice(HDP, 25.2)]
    // (1) 10-second windows, evaluated every 5 seconds
    val windowedStream = stockStream.window(Time.of(10, SECONDS)).every(Time.of(5, SECONDS))
    // (2) lowest price in the window
    val lowest = windowedStream.minBy("price")
    // (3) maximum price per stock
    val maxByStock = windowedStream.groupBy("symbol").maxBy("price")
    // (4) rolling mean per stock
    val rollingMean = windowedStream.groupBy("symbol").mapWindow(mean _)
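The slide passes a `mean` function to `mapWindow` without showing it. A minimal sketch of the computation, assuming a simplified signature that takes the window contents and returns an `Option` (the function used with the actual API may have a different shape), could be:

```scala
object WindowMean {
  case class StockPrice(symbol: String, price: Double)

  // Mean price of one window's contents; None for an empty window.
  // Assumes all elements share a symbol, as they would after grouping by "symbol".
  def mean(window: Iterable[StockPrice]): Option[StockPrice] =
    window.headOption.map { first =>
      StockPrice(first.symbol, window.map(_.price).sum / window.size)
    }
}
```

Applied to the two HDP prices from the diagram (23.8 and 26.6), this yields a mean of about 25.2, matching the rolling-mean output shown on the slide.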

18 Data-driven windows
[Diagram: from the input stream StockPrice(SPX, 2113.9), StockPrice(FTSE, 6931.7), StockPrice(HDP, 23.8), StockPrice(HDP, 26.6), the HDP prices 23.8 and 26.6 lead to Count(HDP, 1). The markers (1) to (4) link the diagram stages to the code.]
    case class Count(symbol: String, count: Int)

    // Emit a warning whenever a stock's price changes by more than 5%
    val priceWarnings = stockStream.groupBy("symbol")
      .window(Delta.of(0.05, priceChange, defaultPrice))
      .mapWindow(sendWarning _)

    // Count the warnings per stock over 30-second windows
    val warningsPerStock = priceWarnings.map(Count(_, 1))
      .groupBy("symbol")
      .window(Time.of(30, SECONDS))
      .sum("count")
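The idea behind a delta policy can be sketched in plain Scala: compare each arriving element with the element that triggered last, and fire when the difference exceeds a threshold. The code below is an assumption-laden illustration, not Flink's implementation; `DeltaTrigger`, `triggers`, and the relative-change definition of `priceChange` are hypothetical.

```scala
object DeltaTrigger {
  case class StockPrice(symbol: String, price: Double)

  // Relative price change between two data points (hypothetical helper).
  def priceChange(p1: StockPrice, p2: StockPrice): Double =
    math.abs(p1.price / p2.price - 1)

  // Returns the elements at which a delta trigger would fire: those that differ
  // from the last triggering element (initially `init`) by more than `threshold`.
  def triggers(prices: Seq[StockPrice], threshold: Double, init: StockPrice): List[StockPrice] = {
    var last = init
    prices.filter { p =>
      val fire = priceChange(last, p) > threshold
      if (fire) last = p // the reference point moves only when the trigger fires
      fire
    }.toList
  }
}
```

Because the reference point only moves when the trigger fires, a slow drift made of many small changes still fires once the accumulated change passes the threshold.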

19 Combining with a Twitter stream
[Diagram: the tweets "hdp is on the rise!" and "I wish I bought more YHOO and HDP stocks" yield Count(HDP, 2) and Count(YHOO, 1).]
    // (1) read the tweets
    val tweetStream = env.addSource(generateTweets _)

    // (2) (3) split into words, convert to upper case, keep only known stock symbols
    val mentionedSymbols = tweetStream.flatMap(tweet => tweet.split(" "))
      .map(_.toUpperCase())
      .filter(symbols.contains(_))

    // (4) count the mentions per stock over 30-second windows
    val tweetsPerStock = mentionedSymbols.map(Count(_, 1))
      .groupBy("symbol")
      .window(Time.of(30, SECONDS))
      .sum("count")

20 Streaming joins
[Diagram: (1) joining Count(HDP, 1) with Count(HDP, 2) and Count(YHOO, 1) on the symbol gives the pair (1,2); (2) the rolling correlation computed from such pairs is 0.5.]
    // (1) join warnings and tweet counts per symbol over 30-second windows
    val tweetsAndWarning = warningsPerStock.join(tweetsPerStock)
      .onWindow(30, SECONDS)
      .where("symbol")
      .equalTo("symbol") { (c1, c2) => (c1.count, c2.count) }

    // (2) compute the rolling correlation of the two counts
    val rollingCorrelation = tweetsAndWarning
      .window(Time.of(30, SECONDS))
      .mapWindow(computeCorrelation _)
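The slide does not show `computeCorrelation`. One plausible reading is the Pearson correlation coefficient of the paired counts in a window; the sketch below implements that in plain Scala under this assumption. The names `Correlation` and `pearson` are hypothetical, and the function returns `Option` rather than using whatever output mechanism the real window function would use.

```scala
object Correlation {
  private def average(xs: Seq[Double]): Double = xs.sum / xs.size

  // Pearson correlation of paired counts; None when it is undefined
  // (empty input or zero variance in either series).
  def pearson(pairs: Seq[(Int, Int)]): Option[Double] = {
    if (pairs.isEmpty) None
    else {
      val xs = pairs.map(_._1.toDouble)
      val ys = pairs.map(_._2.toDouble)
      val (mx, my) = (average(xs), average(ys))
      val cov = average(xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) })
      val sx = math.sqrt(average(xs.map(x => math.pow(x - mx, 2))))
      val sy = math.sqrt(average(ys.map(y => math.pow(y - my, 2))))
      if (sx == 0.0 || sy == 0.0) None else Some(cov / (sx * sy))
    }
  }
}
```

A window holding a single pair, such as the `(1,2)` in the diagram, has zero variance, so this sketch returns `None` for it; a meaningful coefficient needs several pairs with varying counts.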

21 Performance
Performance optimizations
  Effective serialization due to strongly typed topologies
  Operator chaining (thread sharing/no serialization)
  Different automatic query optimizations
Competitive performance
  ~1.5m events/sec/core
  As a comparison, Storm promises ~1m tuples/sec/node

22 Fault tolerance

23 Overview
Fault tolerance in other systems
  Message tracking/acks (Apache Storm)
  RDD lineage tracking/recomputation
Fault tolerance in Apache Flink
  Based on consistent global snapshots
  Algorithm inspired by Chandy-Lamport
  Low runtime overhead, stateful exactly-once semantics

24 Checkpointing / Recovery
Pushes checkpoint barriers through the data flow
[Diagram: a barrier travels with the data stream through the operators. As it passes an operator, the states are: operator checkpoint starting, checkpoint in progress, checkpoint done.]
  Before barrier = part of the snapshot
  After barrier = not in snapshot (backup till next snapshot)
Asynchronous Barrier Snapshotting for globally consistent checkpoints
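The barrier idea can be illustrated with a deliberately simplified, single-input model in plain Scala. This is a sketch under strong assumptions, not Flink's algorithm: a real operator may have several inputs and must handle barriers arriving on each of them, and snapshots are written to a state backend. All names here (`BarrierSnapshot`, `Record`, `Barrier`, `run`) are hypothetical.

```scala
object BarrierSnapshot {
  sealed trait Event
  case class Record(value: Int) extends Event
  case class Barrier(checkpointId: Long) extends Event

  // Runs a summing operator over the stream, starting from `initial` state.
  // Returns the final state and the state snapshotted at each barrier.
  // Records before a barrier are reflected in its snapshot; records after it are not.
  def run(events: Seq[Event], initial: Int = 0): (Int, Map[Long, Int]) =
    events.foldLeft((initial, Map.empty[Long, Int])) {
      case ((sum, snaps), Record(v))   => (sum + v, snaps)
      case ((sum, snaps), Barrier(id)) => (sum, snaps + (id -> sum))
    }
}
```

Recovery in this model means restoring the operator from a snapshot and replaying only the records that came after the corresponding barrier, which gives the same final state as an uninterrupted run.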

25 State management
State declared in the operators is managed and checkpointed by Flink
Pluggable backends for storing persistent snapshots
  Currently: JobManager, FileSystem (HDFS, Tachyon)
State partitioning and flexible scaling in the future

26 Closing

27 Streaming roadmap for 2015
State management
  New backends for state snapshotting
  Support for state partitioning and incremental snapshots
Master failover
Improved monitoring
Integration with other Apache projects
  SAMOA, Zeppelin, Ignite
Streaming machine learning and other new libraries

Architecture of Flink's Streaming Runtime
Robert Metzger, @rmetzger_, rmetzger@apache.org
What is stream processing
Real-world data is unbounded and is pushed to systems
Right now: people are using the batch