Modern Stream Processing with Apache Flink

Similar documents
The Power of Snapshots Stateful Stream Processing with Apache Flink

WHY AND HOW TO LEVERAGE THE POWER AND SIMPLICITY OF SQL ON APACHE FLINK - FABIAN HUESKE, SOFTWARE ENGINEER

Apache Flink Big Data Stream Processing

The Stream Processor as a Database. Ufuk

Streaming Analytics with Apache Flink. Stephan

The Future of Real-Time in Spark

Container 2.0. Container: check! But what about persistent data, big data or fast data?!

Streaming analytics better than batch - when and why? _Adam Kawa - Dawid Wysakowicz_

Real-time data processing with Apache Flink

The SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Dublin Apache Kafka Meetup, 30 August 2017.

@unterstein #bedcon. Operating microservices with Apache Mesos and DC/OS

Using the SDACK Architecture to Build a Big Data Product. Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver

Spark, Shark and Spark Streaming Introduction

Real-time Streaming Applications on AWS Patterns and Use Cases

Personalizing Netflix with Streaming datasets

Intro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect

Reactive Microservices Architecture on AWS

Advanced Continuous Delivery Strategies for Containerized Applications Using DC/OS

A Distributed System Case Study: Apache Kafka. High throughput messaging for diverse consumers

Cloud Analytics and Business Intelligence on AWS

Qualys Cloud Platform

Apache Flink. Alessandro Margara

Microsoft Azure Databricks for data engineering. Building production data pipelines with Apache Spark in the cloud

How Microsoft Built MySQL, PostgreSQL and MariaDB for the Cloud. Santa Clara, California April 23th 25th, 2018

Spark and Flink running scalable in Kubernetes Frank Conrad

Towards a Real- time Processing Pipeline: Running Apache Flink on AWS

Apache Flink Streaming Done Right. Till

Deploying Applications on DC/OS

Deep Dive into Concepts and Tools for Analyzing Streaming Data

WHITEPAPER. MemSQL Enterprise Feature List

Deep Dive Amazon Kinesis. Ian Meyers, Principal Solution Architect - Amazon Web Services

Architecting Microsoft Azure Solutions (proposed exam 535)

DRIZZLE: FAST AND Adaptable STREAM PROCESSING AT SCALE

Big Streaming Data Processing. How to Process Big Streaming Data 2016/10/11. Fraud detection in bank transactions. Anomalies in sensor data

70-532: Developing Microsoft Azure Solutions

Sunil Shah SECURE, FLEXIBLE CONTINUOUS DELIVERY PIPELINES WITH GITLAB AND DC/OS Mesosphere, Inc. All Rights Reserved.

Windows Azure Services - At Different Levels

OpenShift Roadmap Enterprise Kubernetes for Developers. Clayton Coleman, Architect, OpenShift

IoT Sensor Analytics with Apache Kafka, KSQL and TensorFlow

Esper EQC. Horizontal Scale-Out for Complex Event Processing

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Practical Big Data Processing An Overview of Apache Flink

Developing Microsoft Azure Solutions (70-532) Syllabus

Overview of Data Services and Streaming Data Solution with Azure

Architecture of Flink's Streaming Runtime. Robert

MarkLogic Technology Briefing

Redesigning Apache Flink s Distributed Architecture. Till

Building a Data-Friendly Platform for a Data- Driven Future

Apache Ignite TM - In- Memory Data Fabric Fast Data Meets Open Source

Managing IoT and Time Series Data with Amazon ElastiCache for Redis

Data Acquisition. The reference Big Data stack

Exam : Implementing Microsoft Azure Infrastructure Solutions

Big Data on AWS. Big Data Agility and Performance Delivered in the Cloud. 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Containers or Serverless? Mike Gillespie Solutions Architect, AWS Solutions Architecture

Apache Ignite and Apache Spark Where Fast Data Meets the IoT

Migration. 22 AUG 2017 VMware Validated Design 4.1 VMware Validated Design for Software-Defined Data Center 4.1

How can you implement this through a script that a scheduling daemon runs daily on the application servers?

Streaming data Model is opposite Queries are usually fixed and data are flows through the system.

Flash Storage Complementing a Data Lake for Real-Time Insight

Focus On: Oracle Database 11g Release 2

Extending Flink s Streaming APIs

Course Outline: Designing, Optimizing, and Maintaining a Database Administrative Solution for Microsoft SQL Server 2008

Data Protection Modernization: Meeting the Challenges of a Changing IT Landscape

Backup strategies for Stateful Containers in OpenShift Using Gluster based Container-Native Storage

Microsoft Azure Course Content

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Putting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21

Microsoft SQL Server

Activator Library. Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success.

data Artisans Streaming Ledger

Qunar Performs Real-Time Data Analytics up to 300x Faster with Alluxio

SparkStreaming. Large scale near- realtime stream processing. Tathagata Das (TD) UC Berkeley UC BERKELEY

70-532: Developing Microsoft Azure Solutions

Using DC/OS for Continuous Delivery

Spark Streaming. Guido Salvaneschi

Introduction to Apache Apex

Intra-cluster Replication for Apache Kafka. Jun Rao

CSE 444: Database Internals. Lecture 23 Spark

HBase Solutions at Facebook

AWS Lambda: Event-driven Code in the Cloud

IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING MEDIUM DR. KONSTANTIN BOUDNIK DR. ALEXANDRE BOUDNIK

Nevin Dong 董乃文 Principle Technical Evangelist Microsoft Cooperation

Cisco Cloud Strategy. Uwe Müller. Leader PreSales Cloud & Datacenter Germany

Data Acquisition. The reference Big Data stack

Developing Microsoft Azure Solutions (70-532) Syllabus

Down the event-driven road: Experiences of integrating streaming into analytic data platforms

Event Streams using Apache Kafka

Best practices for OpenShift high-availability deployment field experience

Continuous Integration and Delivery with Spinnaker

An Information Asset Hub. How to Effectively Share Your Data

AWS Serverless Architecture Think Big

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS

VOLTDB + HP VERTICA. page

Distributed Data on Distributed Infrastructure. Claudius Weinberger & Kunal Kusoorkar, ArangoDB Jörg Schad, Mesosphere

Oracle 12c Dataguard Administration (32 Hours)

Lambda Architecture for Batch and Stream Processing. October 2018

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Processing of big data with Apache Spark

MEAP Edition Manning Early Access Program Flink in Action Version 2

HYBRID TRANSACTION/ANALYTICAL PROCESSING COLIN MACNAUGHTON

Transcription:

1

Modern Stream Processing with Apache Flink Till Rohrmann GOTO Berlin 2017 2

Original creators of Apache Flink da Platform 2 Open Source Apache Flink + da Application Manager 3

What changes faster? Data or Query? Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter tuning Batch Processing Use Case Data changes fast application logic is long-lived continuous applications, data pipelines, standing queries, anomaly detection, ML evaluation, Stream Processing Use Case 4

Batch Processing 5

Stream Processing 6

Apache Flink in a Nutshell 7

Apache Flink in a Nutshell Stateful computations over streams real-time and historic fast, scalable, fault tolerant, in-memory, event time, large state, exactly-once Queries Application Applications Devices etc. Streams Historic Data Database Stream File / Object Storage 8

The Core Building Blocks Event Streams State (Event) Time Snapshots real-time and hindsight complex business logic consistency with out-of-order data and late data forking / versioning / time-travel 9

Powerful Abstractions Layered abstractions to navigate simple to complex use cases High-level Analytics API Stream- & Batch Data Processing Stream SQL / Tables (dynamic tables) DataStream API (streams, windows) val stats = stream.keyby("sensor").timewindow(time.seconds(5)).sum((a, b) -> a.add(b)) Stateful Event- Driven Applications Process Function (events, state, time) def processelement(event: MyEvent, ctx: Context, out: Collector[Result]) = { // work with event and state (event, state.value) match { } out.collect( ) // emit events state.update( ) // modify state // schedule a timer callback ctx.timerservice.registereventtimetimer(event.timestamp + 500) } 10

Hardened at scale Athena X Streaming SQL Platform Service Streaming Platform as a Service 100s jobs, 1000s nodes, TBs state metrics, analytics, real time ML Streaming SQL as a platform Fraud detection Streaming Analytics Platform 11

Distributed application infrastructure 12

Good old centralized architecture Application Application Application Application $$$ The big mean central database The grumpy DBA 13

Modern distributed app. architecture Application Application Sensor Application APIs Application The limit to what you can do is how sophisticated can you compute over the stream! Boils down to: How well do you handle state and time 14

A Flink-favored approach 15

Event Sourcing + Memory Image event / command periodically snapshot the memory event log persists events (temporarily) main memory update local variables/structures Process 16

Event Sourcing + Memory Image Recovery: Restore snapshot and replay events since snapshot event log persists events (temporarily) Process 17

Distributed Memory Image Distributed application, many memory images. Snapshots are all consistent together. 18

Stateful Event & Stream Processing Scalable embedded state Access at memory speed & scales with parallel operators 19

Compute, State, and Storage Classic tiered architecture compute layer Streaming architecture compute + application state database layer application state + backup stream storage and snapshot storage (backup) 20

Performance Classic tiered architecture Streaming architecture synchronous reads/writes across tier boundary all modifications are local asynchronous writes of large blobs 21

Consistency Classic tiered architecture Streaming architecture exactly once per state =1 =1 distributed transactions at scale typically at-most / at-least once 22

Scaling a Service Classic tiered architecture Streaming architecture provision compute provision compute and state together separately provision additional database capacity 23

Rolling out a new Service Classic tiered architecture Streaming architecture provision compute and state together provision a new database (or add capacity to an existing one) simply occupies some additional backup space 24

What users built on checkpoints Upgrades and Rollbacks Cross Datacenter Failover State Archiving Application Migration Spot Instance Region Chasing A/B testing 25

What is the next wave of stream processing applications? 26

What changes faster? Data or Query? Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and tuning Batch Processing Use Case Data changes fast application logic is long-lived continuous applications, data pipelines, standing queries, anomaly detection, ML evaluation, Stream Processing Use Case 27

Analytics & Business Logic Source events analytics metrics Business Logic Stream Stream Processor Database Application 28

Blending Analytics & Business Logic Source events Stream Streaming Application Stream Processor RPCs (Actor) Messages 29

AthenaX by Uber 30

Can one build an entire sophisticated web application (say a social network) on a stream processor? (Yes, we can! ) 31

@ Social network implemented using event sourcing and CQRS (Command Query Responsibility Segregation) on Kafka/Flink/Elasticsearch/Redis More: https://data-artisans.com/blog/drivetribe-cqrs-apache-flink 32

The next wave of stream processing applications is all types of stateful applications that react to data and time! 33

Stateful stream applications Continuous applications versioning, upgrading, rollback, duplicating, migrating, 34

The da Platform Architecture Real-time Analytics Anomaly- & Fraud Detection Real-time Data Integration Reactive Microservices (and more) Streams from Kafka, Kinesis, S3, HDFS, Databases,... da Application Manager Application lifecycle management da Platform 2 Apache Flink Stateful stream processing Logging Metrics CI/CD Kubernetes Container platform 35

Versioned Applications, not Jobs/Jars Stream Processing Application New Application Version 3 Version 3a upgrade upgrade Version 2 Version 1 fork / duplicate Version 2a Code and Application Snapshot 36

Deployments, not Flink Clusters Threat Metrics App. Testing Fraud Detection App. Application Testing Activity Monitor Application Testing / QA Kubernetes Cluster Production Kubernetes Cluster 37

Hooks for CI/CD pipelines stateful upgrade trigger CI upgrade API da Application Manager Application Version 12 push update CI Service Kubernetes Cluster 38

Thank you! @stsffap @ApacheFlink @dataartisans 39

We are hiring! data-artisans.com/careers 40

41