Real-time data processing with Apache Flink


Gyula Fóra (gyfora@apache.org), Flink committer, Swedish ICT

Stream processing

Data stream: an infinite sequence of data arriving in a continuous fashion.
Stream processing: analyzing and acting on real-time streaming data, using continuous queries.
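A continuous query can be pictured as a stateful function applied to every arriving record. A minimal plain-Scala sketch (no Flink; the object and method names here are illustrative) of a running word count over an unbounded stream:

```scala
object ContinuousQuery {
  // Running word count: the state (a count map) is updated on every
  // arriving record, and the updated state is emitted continuously.
  def runningCounts(stream: Iterator[String]): Iterator[Map[String, Int]] =
    stream.scanLeft(Map.empty[String, Int]) { (counts, word) =>
      counts.updated(word, counts.getOrElse(word, 0) + 1)
    }.drop(1) // drop the initial empty state
}
```

Because the input is an Iterator, the query never needs the whole stream in memory; it produces one updated result per record, which is the essence of a continuous query.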

Parts of a streaming infrastructure

A streaming infrastructure has three parts: gathering (server logs, sensors, transaction logs), a broker, and analysis.

Streaming landscape

Apache Storm: true streaming over a distributed dataflow; low-level API (bolts, spouts) plus Trident.
Spark Streaming: stream processing emulated on top of a batch system (non-native); functional API (DStreams), restricted by the batch runtime.
Apache Samza: true streaming built on top of Apache Kafka; state is a first-class citizen; slightly different stream notion, low-level API.
Apache Flink: true streaming over a stateful distributed dataflow; rich functional API exploiting the streaming runtime, e.g. rich windowing semantics.

What is Flink

[Flink stack diagram] Two APIs, DataSet (Java/Scala/Python) and DataStream (Java/Scala), share a common streaming dataflow runtime. Libraries sit on top: Hadoop M/R compatibility, Table, Gelly, ML, Dataflow, MRQL, Cascading (WiP), SAMOA. Deployment modes: local, remote, YARN, Tez, embedded.

Program compilation

Example: transitive closure with an iteration:

case class Path(from: Long, to: Long)
val tc = edges.iterate(10) { paths: DataSet[Path] =>
  val next = paths
    .join(edges).where("to").equalTo("from") { (path, edge) => Path(path.from, edge.to) }
    .union(paths)
    .distinct()
  next
}

Pre-flight (client): the program is turned into a dataflow graph via the type extraction stack and the optimizer, which picks execution strategies (e.g. a hybrid hash join with build and probe sides, hash partitioning, sorting for the group-reduce). Master: receives the dataflow graph and metadata, schedules tasks, deploys operators, and tracks intermediate results. Workers execute the operators.

Flink Streaming

What is Flink Streaming

A native stream processor (low latency) with an expressive functional API, flexible operator state, stream windows, and exactly-once processing guarantees.

Native vs. non-native streaming

Non-native streaming: a stream discretizer slices the stream into batches and issues one batch job per slice:

while (true) {
  // get next few records
  // issue batch computation
}

Native streaming: long-standing operators process each record as it arrives:

while (true) {
  // process next record
}
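The contrast above can be made concrete in plain Scala (illustrative names, not Flink API): micro-batching groups records and runs a batch job per group, while the native model applies one long-standing operator per record.

```scala
object StreamingModels {
  // Non-native: discretize the stream into fixed-size batches and run
  // a batch job on each batch.
  def microBatch[A, B](records: Seq[A], batchSize: Int)(job: Seq[A] => B): Seq[B] =
    records.grouped(batchSize).map(job).toSeq

  // Native: one long-standing operator, invoked once per record.
  def native[A, B](records: Seq[A])(op: A => B): Seq[B] =
    records.map(op)
}
```

The visible difference is granularity: `microBatch` emits one result per batch (adding batch-boundary latency), `native` emits one result per record.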

Pipelined stream processor

[Diagram] Records flow continuously between operators through a pipelined streaming shuffle.

Defining windows in Flink

Trigger policy: when to trigger the computation on the current window.
Eviction policy: when data points should leave the window; defines the window width/size. E.g., a count-based policy: evict when #elements > n, start a new window every n-th element.
Built-in: Count, Time, and Delta policies.
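The two policies can be sketched in plain Scala (an illustration of the idea, not Flink's implementation): a count trigger fires every `slide` elements, and a count eviction keeps at most `size` elements in the window.

```scala
object CountWindows {
  // Count-based sliding windows: eviction caps the window at `size`
  // elements, the trigger emits the window every `slide` elements.
  def windows[A](stream: Seq[A], size: Int, slide: Int): Seq[Seq[A]] = {
    val buf = scala.collection.mutable.ArrayBuffer.empty[A]
    val out = scala.collection.mutable.ArrayBuffer.empty[Seq[A]]
    for ((elem, i) <- stream.zipWithIndex) {
      buf += elem
      if (buf.length > size) buf.remove(0, buf.length - size) // eviction policy
      if ((i + 1) % slide == 0) out += buf.toSeq              // trigger policy
    }
    out.toSeq
  }
}
```

With size = slide = n this degenerates into tumbling count windows, matching the slide's example of starting a new window every n-th element.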

Expressive APIs

case class Word(word: String, frequency: Int)

DataSet API (batch):

val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()

DataStream API (streaming):

val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
  .groupBy("word").sum("frequency")
  .print()

DataStream API

Overview of the API

Data stream sources: file system, message queue connectors, arbitrary source functionality.
Stream transformations: basic transformations (Map, Reduce, Filter, Aggregations), binary stream transformations (CoMap, CoReduce), windowing semantics (policy-based flexible windowing: Time, Count, Delta), temporal binary stream operators (Joins, Crosses), native support for iterations.
Data stream outputs.
For details, see the programming guide: http://flink.apache.org/docs/latest/streaming_guide.html

Use case: financial analytics

Reading from multiple inputs: merge stock data from various sources.
Window aggregations: compute simple statistics over windows.
Data-driven windows: define arbitrary windowing semantics.
Combining with sentiment analysis: enrich the analytics with social media feeds (Twitter).
Streaming joins: join multiple data streams.
Detailed explanation and source code on our blog: http://flink.apache.org/news/2015/02/09/streaming-example.html

Reading from multiple inputs

case class StockPrice(symbol: String, price: Double)

val env = StreamExecutionEnvironment.getExecutionEnvironment

val socketStockStream = env.socketTextStream("localhost", 9999).map { x =>
  val split = x.split(",")
  StockPrice(split(0), split(1).toDouble)
}

val SPX_Stream = env.addSource(generateStock("SPX")(10) _)
val FTSE_Stream = env.addSource(generateStock("FTSE")(20) _)

val stockStream = socketStockStream.merge(SPX_Stream, FTSE_Stream)

E.g., socket lines "HDP, 23.8" and "HDP, 26.6" are parsed into StockPrice(HDP, 23.8) and StockPrice(HDP, 26.6) and merged with the generated SPX and FTSE streams.

Window aggregations

val windowedStream = stockStream
  .window(Time.of(10, SECONDS)).every(Time.of(5, SECONDS))

val lowest = windowedStream.minBy("price")
val maxByStock = windowedStream.groupBy("symbol").maxBy("price")
val rollingMean = windowedStream.groupBy("symbol").mapWindow(mean _)

E.g., from StockPrice(HDP, 23.8) and StockPrice(HDP, 26.6) in one window, the rolling mean produces StockPrice(HDP, 25.2).

Data-driven windows

case class Count(symbol: String, count: Int)

val priceWarnings = stockStream.groupBy("symbol")
  .window(Delta.of(0.05, priceChange, defaultPrice))
  .mapWindow(sendWarning _)

val warningsPerStock = priceWarnings.map(Count(_, 1))
  .groupBy("symbol")
  .window(Time.of(30, SECONDS))
  .sum("count")
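The idea behind a delta policy can be sketched in plain Scala (an illustration, not Flink's Delta policy itself): close the current window as soon as the value drifts more than a threshold away from the window's first element.

```scala
object DeltaWindows {
  // Data-driven windowing: window boundaries are determined by the data
  // (price movement), not by time or count.
  def windows(prices: Seq[Double], threshold: Double): Seq[Seq[Double]] = {
    val out = scala.collection.mutable.ArrayBuffer.empty[Seq[Double]]
    var current = Vector.empty[Double]
    for (p <- prices) {
      if (current.nonEmpty && math.abs(p - current.head) > threshold) {
        out += current          // delta exceeded: emit and start a new window
        current = Vector.empty
      }
      current = current :+ p
    }
    if (current.nonEmpty) out += current
    out.toSeq
  }
}
```

A warning operator like sendWarning above would fire once per emitted window, i.e. once per significant price move.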

Combining with a Twitter stream

Tweets such as "HDP is on the rise!" and "I wish I bought more YHOO and HDP stocks" yield counts like Count(HDP, 2) and Count(YHOO, 1).

val tweetStream = env.addSource(generateTweets _)

val mentionedSymbols = tweetStream
  .flatMap(tweet => tweet.split(" "))
  .map(_.toUpperCase())
  .filter(symbols.contains(_))

val tweetsPerStock = mentionedSymbols.map(Count(_, 1))
  .groupBy("symbol")
  .window(Time.of(30, SECONDS))
  .sum("count")

Streaming joins

val tweetsAndWarning = warningsPerStock.join(tweetsPerStock)
  .onWindow(30, SECONDS)
  .where("symbol").equalTo("symbol") { (c1, c2) => (c1.count, c2.count) }

val rollingCorrelation = tweetsAndWarning
  .window(Time.of(30, SECONDS))
  .mapWindow(computeCorrelation _)

Performance

Performance optimizations: effective serialization due to strongly typed topologies, operator chaining (thread sharing, no serialization), and various automatic query optimizations.
Competitive performance: roughly 1.5 million events/sec/core. As a comparison, Storm promises about 1 million tuples/sec/node.

Fault tolerance

Overview

Fault tolerance in other systems: message tracking/acks (Apache Storm); RDD lineage tracking/recomputation (Spark).
Fault tolerance in Apache Flink: based on consistent global snapshots; algorithm inspired by Chandy-Lamport; low runtime overhead, stateful exactly-once semantics.

Checkpointing / Recovery

Flink pushes checkpoint barriers through the dataflow. When a barrier reaches an operator, the operator checkpoints its state: records before the barrier are part of the snapshot, records after it are not (they are backed up until the next snapshot). Once all operators have checkpointed, the checkpoint is done. This asynchronous barrier snapshotting yields globally consistent checkpoints.
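A minimal single-operator sketch of barrier snapshotting in plain Scala (illustrative only; Flink's actual algorithm aligns barriers across multiple inputs and snapshots asynchronously): the operator keeps a running aggregate and copies its state whenever a checkpoint barrier flows past.

```scala
object BarrierSnapshots {
  sealed trait Event
  case class Record(value: Int) extends Event
  case class Barrier(checkpointId: Int) extends Event

  // Returns the final state and the state captured at each barrier.
  // Everything processed before a barrier is reflected in its snapshot;
  // everything after it is not.
  def run(events: Seq[Event]): (Int, Map[Int, Int]) = {
    var state = 0
    var snapshots = Map.empty[Int, Int]
    for (e <- events) e match {
      case Record(v)   => state += v                 // normal processing
      case Barrier(id) => snapshots += (id -> state) // snapshot the state
    }
    (state, snapshots)
  }
}
```

On failure, recovery restarts the operator from the latest snapshot and replays the records after that barrier, which is what gives stateful exactly-once semantics.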

State management

State declared in the operators is managed and checkpointed by Flink. Backends for storing persistent snapshots are pluggable; currently JobManager and FileSystem (HDFS, Tachyon). State partitioning and flexible scaling are planned.

Closing

Streaming roadmap for 2015

State management: new backends for state snapshotting, support for state partitioning and incremental snapshots.
Master failover.
Improved monitoring.
Integration with other Apache projects: SAMOA, Zeppelin, Ignite.
Streaming machine learning and other new libraries.

flink.apache.org @ApacheFlink