Towards a Real-time Processing Pipeline: Running Apache Flink on AWS

1 Towards a Real-time Processing Pipeline: Running Apache Flink on AWS
Dr. Steffen Hausmann, Solutions Architect
Michael Hanisch, Manager Solutions Architecture
November 18th, 2016

2 Stream Processing Challenges
- Event time and out-of-order events
- Consistency, fault tolerance, and high availability
- Rich forms of window queries
- Low latency and high throughput

3 Analyzing NYC Taxi Rides in Real Time

5 Event Processing Architecture
- Replayable Log: Amazon Kinesis
- Processing: Apache Flink
- Visualization: Amazon Elasticsearch

6 Apache Flink
Apache Flink is an open source platform for distributed stream and batch data processing.
artisans.com/why-apache-flink/

7 Apache Flink

8 Amazon Elastic MapReduce (EMR)
- Easily provision & manage clusters for your big data needs
- Hadoop, Spark, Presto, HBase, Tez, Hive, Pig
- Apache Flink support added in EMR 5.1
- Dynamically scalable, persistent or transient clusters
- Provides access control, firewalls, encryption

9 Amazon Kinesis
- Managed Service for Real-Time Big Data Processing
- Create Streams to Produce & Consume Data
- Elastically Add and Remove Shards for Throughput
- Secured via AWS IAM
- Durable storage of data streams
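
A minimal sketch of how records could be spread across the shards mentioned on this slide. Kinesis maps each record's partition key to a 128-bit hash key with MD5 and routes the record to the shard whose hash key range contains that value. The class and method names below are hypothetical, and the sketch assumes UTF-8 key encoding and shards that split the hash key space evenly, which a real stream need not do after shards have been split or merged.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of any AWS SDK.
class ShardRoutingSketch {
    // Hash keys are unsigned 128-bit integers: 0 .. 2^128 - 1.
    static final BigInteger HASH_SPACE = BigInteger.ONE.shiftLeft(128);

    /** MD5 of the partition key, read as an unsigned 128-bit integer. */
    static BigInteger hashKey(String partitionKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(partitionKey.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /** Index of the shard that owns the key, ASSUMING equal-width hash key ranges. */
    static int shardIndex(String partitionKey, int shardCount) {
        BigInteger width = HASH_SPACE.divide(BigInteger.valueOf(shardCount));
        int index = hashKey(partitionKey).divide(width).intValue();
        // The last shard also absorbs the remainder of the integer division.
        return Math.min(index, shardCount - 1);
    }

    public static void main(String[] args) {
        for (String key : new String[] {"taxi-1", "taxi-2", "taxi-3"}) {
            System.out.println(key + " -> shard " + shardIndex(key, 4));
        }
    }
}
```

Because the shard is a pure function of the partition key, all records with the same key land on the same shard and keep their relative order there. Adding shards raises throughput only if the keys are spread widely enough to reach the new hash key ranges.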

10 Amazon Kinesis
[Diagram: data sources; AWS endpoint; a stream of Shard 1, Shard 2, ..., Shard N spanning three Availability Zones; consuming applications App.1 (Aggregate & De-Duplicate), App.2 (Metric Extraction), App.3 (Sliding Window Analysis), App.4 (Machine Learning); destinations S3, Redshift, DynamoDB]

11 Amazon Kinesis
- Central bus for all event data
- Decoupling of multiple producers and consumers
- Keeps a replayable log of your events
- Many options to consume events with Apache Flink (new), Spark Streaming, Presto, Hive, Pig, Storm (or custom KCL apps)

12 Amazon Elasticsearch Service
- Provisions and maintains an Elasticsearch cluster
- Complete ELK stack, including Kibana
- Scalable
- Secured via AWS IAM

13 Architecture
[Diagram: EC2 instance (bastion host), Amazon Kinesis, Amazon EMR, Amazon Elasticsearch Service]

14 Demo

15 Lessons Learned

16 Building the Flink Kinesis Connector
- The Flink Kinesis connector artifact is not available from Maven Central
- Build the connector with Maven
    mvn clean install -Pinclude-kinesis -DskipTests -Dhadoop-two.version=2.7.2
- For future projects, add the dependency to your local Maven repository
    mvn install:install-file -Dfile=flink-connector-kinesis_[...].jar

17 Approximate Event Time
- Each Amazon Kinesis record includes an ApproximateArrivalTimestamp
- The timestamp is set when an Amazon Kinesis stream successfully receives and stores a record
- By default, Flink uses this timestamp as the event time when reading from a Kinesis stream
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

18 Event Time and Watermarks
- With event time, the time of an event is determined by the producer
- Flink measures progress in event time by means of watermarks
- Watermarks must be ingested into each individual Kinesis shard
    DataStream<Event> kinesis = env
        .addSource(new FlinkKinesisConsumer<>(...))
        .assignTimestampsAndWatermarks(new PunctuatedAssigner());
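
Why every shard needs its own watermarks can be sketched in plain Java, without the Flink API. In Flink, an operator fed by several parallel inputs advances its event time to the minimum of the watermarks received on those inputs, so a single input that never delivers a watermark holds everything back. The sketch assumes one input per shard, which is a simplification of how shards are assigned to source subtasks, and `ShardWatermarkSketch` and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of watermark merging; not a Flink class.
class ShardWatermarkSketch {
    // Latest watermark seen per shard; Long.MIN_VALUE means "none yet".
    private final Map<String, Long> watermarks = new HashMap<>();

    ShardWatermarkSketch(String... shardIds) {
        for (String id : shardIds) {
            watermarks.put(id, Long.MIN_VALUE);
        }
    }

    /** Record a watermark read from one shard; watermarks never move backwards. */
    void onWatermark(String shardId, long timestamp) {
        long previous = watermarks.getOrDefault(shardId, Long.MIN_VALUE);
        watermarks.put(shardId, Math.max(previous, timestamp));
    }

    /** Event time as seen downstream: the minimum over all shards. */
    long currentEventTime() {
        long min = Long.MAX_VALUE;
        for (long watermark : watermarks.values()) {
            min = Math.min(min, watermark);
        }
        return min;
    }

    public static void main(String[] args) {
        ShardWatermarkSketch tracker = new ShardWatermarkSketch("shard-1", "shard-2");
        tracker.onWatermark("shard-1", 1_000L);
        // shard-2 has not delivered a watermark, so event time has not advanced.
        System.out.println(tracker.currentEventTime() == Long.MIN_VALUE);
        tracker.onWatermark("shard-2", 400L);
        System.out.println(tracker.currentEventTime());
    }
}
```

In this model, windows downstream can only fire once the slowest shard has reported progress, which is why a producer that writes watermark events to only some shards would stall the pipeline.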

19 Data Encryption with Amazon EMR and Flink
- Security configuration supports encryption for data stored within the file system
- Hadoop Distributed File System (HDFS) block-transfer and RPC
- S3 data (SSE-S3, SSE-KMS, CSE-KMS, CSE-Custom)
- Local disk (except boot volumes)
- In-transit data (no Flink support yet)
    env.readTextFile("s3://...")
    env.setStateBackend(new FsStateBackend("hdfs://..."))

20 Connecting to the Flink Dashboard
- Use dynamic port forwarding to the master node
    ssh -D 8157 hadoop@...
- Use FoxyProxy to redirect URLs to localhost
    *ec2*.amazonaws.com*
    *.compute.internal*
- Navigate to the YARN Resource Manager and select the Tracking UI

21 Starting Flink and Submitting Jobs
- Use steps to interact with Flink through the AWS API

22 Extending Flink Functionality
- The Flink Elasticsearch sink supports only TCP transport
- A custom Elasticsearch sink with HTTP support requires only a few dozen lines of code using
    Jest (io.searchbox)
    aws-signing-request-interceptor (vc.inreach.aws)
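
The core of such a sink can be sketched without any client library: buffer documents and flush them as one Elasticsearch bulk request body, in which each document is preceded by an action line and every line ends with a newline. This is not the presenters' implementation and it does not use Jest. `BulkBufferSketch`, its parameters, and the index and type names are hypothetical, and the `Consumer<String>` stands in for the signed HTTP POST to the `_bulk` endpoint.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical buffering logic for an HTTP-based Elasticsearch sink.
class BulkBufferSketch {
    private final String index;
    private final String type;
    private final int batchSize;
    private final Consumer<String> sender; // placeholder for the signed HTTP call
    private final List<String> buffer = new ArrayList<>();

    BulkBufferSketch(String index, String type, int batchSize, Consumer<String> sender) {
        this.index = index;
        this.type = type;
        this.batchSize = batchSize;
        this.sender = sender;
    }

    /** Buffer one document, given as single-line JSON; flush when the batch is full. */
    void invoke(String jsonDocument) {
        buffer.add(jsonDocument);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Send all buffered documents as one newline-delimited bulk body. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        StringBuilder body = new StringBuilder();
        for (String doc : buffer) {
            // Action line: index the following document into index/type.
            body.append("{\"index\":{\"_index\":\"").append(index)
                .append("\",\"_type\":\"").append(type).append("\"}}\n");
            body.append(doc).append('\n');
        }
        buffer.clear();
        sender.accept(body.toString());
    }

    public static void main(String[] args) {
        BulkBufferSketch sink = new BulkBufferSketch("rides", "ride", 2, System.out::print);
        sink.invoke("{\"passengers\":1}");
        sink.invoke("{\"passengers\":3}");
        sink.flush(); // nothing left to send after the automatic flush
    }
}
```

In a real Flink sink, `invoke` would be called per record and `flush` would also need to run on close and on checkpoints so that buffered documents are not lost; that wiring is omitted here.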

23 Questions?