Apache Flink. Fuchkina Ekaterina with Material from Andreas Kunft -TU Berlin / DIMA; dataartisans slides

Size: px

Start display at page:

Download "Apache Flink. Fuchkina Ekaterina with Material from Andreas Kunft -TU Berlin / DIMA; dataartisans slides"

Ursula O’Connor’
5 years ago
Views:

1 Apache Flink Fuchkina Ekaterina with Material from Andreas Kunft -TU Berlin / DIMA; dataartisans slides

2 What is Apache Flink Massive parallel data flow engine with unified batch-and streamprocessing

3 CEP Event Processing Table Relational FlinkML Machine Learning Gelly Graph Provessing Table Relational System Stack APIs & Libraries DataStream API DataSet API Core Runtime Distributed streaming Dataflow Deploy Local Single JVM Cluster YARN Cloud

4 The case for Flink Performance and ease of use Exploits in-memory processing and pipelining, languageembedded logical APIs Unified batch and real streaming Batch and Stream APIs on top of a streaming engine A runtime that "just works" without tuning Custom memory management inside the JVM Predictable and dependable execution Bird s-eye view of what runs and how, and what failed and why

5 Built-in(native) vs. driver-based looping Non-native iterations Non-native streaming

6 Built-in(native) vs. driver-based looping

7 Applications Stream Batch DataStr eam processing processing DataSet FlinkML Machine Learning Graph Analysis Gelly

8 Basic API Concept Source Data Stream Operation Data Stream Sink Source Data Set Operation Data Set Sink Flink program writing: 1) Bootstrap sources 2) Apply operations 3) Output to sink

9 Machine learning library: FlinkML Recently started effort Currently available algorithms Classification Logistic Regression Clustering Recommendation (ALS) Machine Learning K-Means

10 Graph Analysis library: Gelly Large-scale graph processing API Iterative Graph Processing Currently available algorithms Single Source Shortest Paths Weakly Connected Components Community Detection Page Rank Label Propagation SSSP Graph Partitioning

11 Single Source Shortest Paths VertexUpdateFunction Messaging Function Vertex-centric

12 Single Source Shortest Paths 0 S Messaging A 3 B A 3 Vertex Update 3 A A 5 B 5 5 B B C C D D D

13 Single Source Shortest Paths 3 A 5 B Messaging B 1 С 5 C 2 D 5 B C Vertex Update B C D 3 D 8 8 D

14 Single Source Shortest Paths 4 B Messaging C 2 D 1 7 C 6 9 Vertex Update 6 C D C D 7 D

15 Single Source Shortest Paths

16 Single Source Shortest Paths VertexUpdateFunction: defines how a vertex will update its value based on the received messages Vertex-centric

17 Single Source Shortest Paths Messaging Function: defines what messages a vertex sends out for the next superstep Vertex-centric

18 Architecture Overview Client Task Manager Job Manager Task Manager Task Manager 1) Client Optimize Construct job graph Pass job graph to manager Retrieve job results 2) Master (Job Manager) Parallelization: Create Execution Graph Scheduling: Assign tasks to task managers State: Supervise the execution 3) Worker (Task Manager) Operations are split up into tasks Each parallel instance of an operation runs in a separate tasks slot

19 Installation Flink-1.0.2, Hadoop 2.7 wget tar xf flink bin-hadoop27-scala_2.11.tgz

20 CEP Event Processing Table Relational FlinkML Machine Learning Gelly Graph Provessing Table Relational Ways to Run a Flink Program APIs & Libraries DataStream API DataStream API Core Runtime Distributed streaming Dataflow Deploy Local Single JVM Cluster YARN Cloud

21 Local Execution Starts local Flink cluster All processes run in the same JVM Behaves just like a regular Cluster Very useful for developing and debugging Job Manager Task Manager Task Manager Task Manager Task Manager JVM

22 Wordcount: Program

23 Source env.readcsvfile(...) env.readfile(...) env.readhadoopfile(...) env.readsequencefile(...) env.readtextfile(...) Collection-based *JSONParser Sink datastream.print() datastream.writeastext(...) datastream.writeascsv(...) Others: SocketInputFormat KafkaInputFormat Databases

24 Creation of Flink Project Create empty Maven project (IntelliJ IDEA) Add Flink dependencies Change Manifest to specify Main Class Build to JAR

25 Local Execution: WordCount bin/start-local.sh bin/flink run examples/batch/wordcount.jar Running FlatMap x3 Reduce x3 FlatMap DataSink x3 Reduce -> DataSink -> Finish

26 Local Execution: Result She sells sea shells on the sea shore; The shells that she sells are sea shells I'm sure. So if she sells sea shells on the sea shore, I'm sure that the shells are sea shore shells are 2 i 2 if 1 m 2 on 2 sea 6 sells 3 she 3 shells 6 shore 3 so 1 sure 2 that 2 the 4

Job Manager Web interface http://master:8081 Shows overall system status Job

27 Job Manager Web interface Shows overall system status Job execution details Task Manager resource utilization Allow submit new job by form

28 Cluster. YARN Execution Multi-user scenario Resource sharing Uses YARN containers to run a Flink cluster Easy to setup Client Resource Manager Node Manager Job Manager Node Manager Node Manager Task Manager Node Manager Task Manager Other Application YARN Cluster

29 Cluster. YARN Execution Configure flink nano conf/flink-conf.yaml edit line >> jobmanager.rpc.address: Copy configuration on workers scp -r master:/home/hadoop-admin/flink Setup path to Hadoop export HADOOP_CONF_DIR='/opt/hadoop/etc/hadoop' Run Yarn session with 2 TaskManagers with 1GB of memory each bin/yarn-session.sh -n 2 -tm 1024

30 Cluster. YARN Execution Run example bin/flink run p 2 -m yarn-cluster -yn 2 -yjm ytm 1024 examples/batch/wordcount.jar

31 Cluster. YARN Execution Assigning jobs to Task Managers

32 Cluster. YARN Execution Running FlatMap x3 Reduce x6 FlatMap DataSink x6 Reduce x2 DataSink x2 -> Finish

33 Cluster. YARN Execution

34 Flink vs. Spark: Batch

35 Flink vs. Spark: Streaming

36 Flink vs. Spark: Batch

37 Thank you

Apache Flink. Alessandro Margara

Apache Flink. Alessandro Margara Apache Flink Alessandro Margara alessandro.margara@polimi.it http://home.deib.polimi.it/margara Recap: scenario Big Data Volume and velocity Process large volumes of data possibly produced at high rate