Introduction to Apache Kafka


Introduction to Apache Kafka. Chris Curtin, Head of Technical Research. Atlanta Java Users Group, March 2013.

About Me: 20+ years in technology. Head of Technical Research at Silverpop (12+ years at Silverpop). Built a SaaS platform before the term SaaS was being used. Prior to Silverpop: real-time control systems, factory automation, and warehouse management. Always looking for technologies and algorithms to help with our challenges. Car nut.

Silverpop Open Positions: Senior Software Engineer (Java, Oracle, Spring, Hibernate, MongoDB); Senior Software Engineer, MIS (.NET stack); Software Engineer; Software Engineer, Integration Services (PHP, MySQL); Delivery Manager, Engineering; Technical Lead, Engineering; Technical Project Manager, Integration Services. http://www.silverpop.com (go to Careers under About).

Caveats: We don't use Kafka in production. I don't have any experience with Kafka in operations. I am not an expert on messaging systems (JMS, MQSeries, etc.).

Apache Kafka, from Apache: Apache Kafka is a distributed publish-subscribe messaging system. It is designed to support the following: persistent messaging with O(1) disk structures that provide constant-time performance even with many TB of stored messages; high throughput, where even with very modest hardware Kafka can support hundreds of thousands of messages per second; explicit support for partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics; support for parallel data load into Hadoop.

Background: A LinkedIn product donated to Apache. Most core developers are from LinkedIn. Pretty good pickup outside of LinkedIn: Airbnb and Urban Airship, for example. Fun fact: no logo yet.

Why? Data integration.

Point-to-point integration (thanks to LinkedIn for slide).


(Thanks to http://linkstate.wordpress.com/2011/04/19/recabling-project/)

What we'd really like (thanks to LinkedIn for slide).

Looks Familiar: JMS to the rescue!

Okay: Data warehouse to the rescue!

Okay: CICS to the rescue!

Kafka changes the paradigm: Kafka doesn't keep track of who consumed which message.

Consumption Management: Kafka leaves management of what was consumed up to the business logic. Each message has a unique identifier (within the topic and partition). Consumers can ask for a message by identifier, even if it is days old. Identifiers are sequential within a topic and partition.
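For example, with the 0.8 SimpleConsumer API a client can ask the broker for the oldest identifier (offset) it still has and start reading from there. A minimal sketch, assuming the 0.8 Java API and the hypothetical host and topic names used later in this deck:

// Minimal sketch: find the earliest offset still retained for topic "test1",
// partition 0; a fetch could then start from that offset.
// Imports assumed: kafka.javaapi.consumer.SimpleConsumer,
// kafka.common.TopicAndPartition, kafka.api.PartitionOffsetRequestInfo, java.util.*
SimpleConsumer consumer = new SimpleConsumer("vrd01.atlnp1", 9092, 100000, 64 * 1024, "offsetLookup");
Map<TopicAndPartition, PartitionOffsetRequestInfo> requestInfo =
        new HashMap<TopicAndPartition, PartitionOffsetRequestInfo>();
requestInfo.put(new TopicAndPartition("test1", 0),
        new PartitionOffsetRequestInfo(kafka.api.OffsetRequest.EarliestTime(), 1));
kafka.javaapi.OffsetResponse response = consumer.getOffsetsBefore(
        new kafka.javaapi.OffsetRequest(requestInfo, kafka.api.OffsetRequest.CurrentVersion(), "offsetLookup"));
long earliestOffset = response.offsets("test1", 0)[0];  // oldest message still on disk
consumer.close();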

Why is Kafka Interesting? A horizontally scalable messaging system.

Terminology: Topics are the main grouping mechanism for messages. Brokers store the messages and take care of redundancy issues. Producers write messages to a broker for a specific topic. Consumers read from brokers for a specific topic. Topics can be further segmented into partitions, and consumers can read a specific partition from a topic.

API (1 of 25) - Basics. Producer: send(String topic, String key, Message message). Consumer: Iterator&lt;Message&gt; fetch(...).

API: Just kidding, that's pretty much it for the API. There is a minor variation on the consumer for SimpleConsumers, but that's really it. Under the covers there are functions to get current offsets or to implement nontrivial consumers.

Architecture (thanks to LinkedIn for slide).

Producers: Pretty basic API. Partitioning is a little odd; it requires producers to know about the partitioning scheme. Producers DO NOT know about consumers.
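To make "producers know about the partitioning scheme" concrete, here is a minimal sketch of a custom partitioner against the 0.8-era producer API. The class name echoes the OrganizationPartitioner referenced in the producer code later in this deck, but the body is an assumption, not Silverpop's actual implementation (and the exact interface differs slightly across 0.8.x releases):

import kafka.producer.Partitioner;
import kafka.utils.VerifiableProperties;

// Hypothetical partitioner: hash the org-id key so all messages for one
// organization land in the same partition (preserving per-org ordering).
public class OrganizationPartitioner implements Partitioner {
    public OrganizationPartitioner(VerifiableProperties props) { }

    public int partition(Object key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}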

Consumers: Consumer Groups. The easiest way to get started. Kafka makes sure only one thread in the group sees a given message for a topic (or a message within a partition). Uses ZooKeeper to keep track of what messages were consumed in which topics/partitions. No once-and-only-once delivery semantics here; a rebalance may mean a message gets replayed.

Consumers: SimpleConsumer. The consumer subscribes to a specific topic and partition. The consumer has to keep track of which message offset was last consumed. A lot more error handling is required if brokers have issues, but there is a lot more control over which messages are read. Does allow for exactly-once messaging.

Consumer Model Design: Partition design impacts overall throughput. Producers know the partitioning class. Producers write to the single broker that leads a partition. Offsets as the only transaction identifier complicate the consumer; "throw more hardware at the backlog" is complicated. Consumer groups == 1 thread per partition, so if operations are expensive you can't just throw more threads at them. Not a lot of real-world examples on balancing the number of topics vs. the number of partitions.

Why is Kafka Interesting? Memory-mapped files and kernel-space processing.
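"Kernel-space processing" refers to zero-copy transfer: when serving consumers, the broker hands log-file bytes to the socket via sendfile(2) instead of copying them through user space. In Java this maps to FileChannel.transferTo; a minimal standalone sketch of the idea (not Kafka's actual code; the file path and port are hypothetical):

import java.io.FileInputStream;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopySend {
    public static void main(String[] args) throws Exception {
        // Stream a log segment to a socket entirely in kernel space.
        FileChannel file = new FileInputStream("/tmp/kafka-logs/test1-0/00000000000000000000.log").getChannel();
        SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9999));
        long sent = 0, size = file.size();
        while (sent < size) {
            sent += file.transferTo(sent, size - sent, socket);  // no user-space copy
        }
        file.close();
        socket.close();
    }
}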

What is a commit log? (thanks to LinkedIn for slide).

Brokers: Lightweight, very fast message storing. Writes messages to disk using kernel space, NOT the JVM; uses the OS pagecache. Data is stored in flat files on disk, one directory per topic and partition. Handles the replication.
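As an illustration of that flat-file layout, a broker's data directory looks roughly like this; the paths and offsets are hypothetical, and segment files are named after the first offset they contain:

# Hypothetical listing for log.dirs=/tmp/kafka-logs
/tmp/kafka-logs/test1-0/00000000000000000000.log     # topic "test1", partition 0
/tmp/kafka-logs/test1-0/00000000000000000000.index
/tmp/kafka-logs/test1-1/00000000000000368769.log     # partition 1, segment starting at offset 368769
/tmp/kafka-logs/test1-1/00000000000000368769.index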

Brokers, continued: Very low memory utilization; almost nothing is held in memory. (Remember, the broker doesn't keep track of who has consumed a message.) Handles TTL operations on data: drops a file when the data is too old.
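Retention is configured on the broker; a minimal sketch of the relevant server.properties settings with illustrative values (names per the 0.8-era configuration):

# Keep data for 7 days, then drop whole segment files; no per-message bookkeeping.
log.dirs=/tmp/kafka-logs
log.retention.hours=168
log.segment.bytes=536870912   # roll to a new segment file every 512 MB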

Why is Kafka Interesting? Stuff just works; producers and consumers are about business logic.

Consumer Use Case: batch loading. Consumers don't have to be online all the time. Wake up every hour, ask Kafka for the events since the last request. Load them into a database, push them to external systems, etc. Load into Hadoop (stream if using MapR).
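A minimal sketch of that hourly pattern, assuming the 0.8 SimpleConsumer API used elsewhere in this deck; the checkpoint and load helpers are hypothetical and error handling is omitted:

// Resume from the last checkpointed offset, drain whatever has arrived since,
// then checkpoint and exit until the next run.
long lastOffset = readCheckpoint();                  // hypothetical helper
SimpleConsumer consumer = new SimpleConsumer("vrd01.atlnp1", 9092, 100000, 64 * 1024, "batchLoader");
FetchRequest req = new FetchRequestBuilder()
        .clientId("batchLoader")
        .addFetch("test1", 0, lastOffset + 1, 1000000)
        .build();
FetchResponse response = consumer.fetch(req);
for (MessageAndOffset mo : response.messageSet("test1", 0)) {
    loadIntoDatabase(mo.message());                  // hypothetical helper
    lastOffset = mo.offset();
}
writeCheckpoint(lastOffset);                         // hypothetical helper
consumer.close();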

Consumer Use Case: complex event processing. Feed to Storm or a similar CEP system. Partition on user id, subsystem, product, etc., independent of Kafka's partitioning. Execute rules on the data. Made a mistake? Replay the events and fix it.

Consumer Use Case: operations logs. Load old operational messages to debug problems. Do it without impacting production systems (remember, consumers can start at any offset!). Have the business logic write to a different output store than production, but drive it off production data.

Adding New Business Logic (thanks to LinkedIn for slide).

Adding Producers: Define topics and the number of partitions via the Kafka tools (and possibly tell Kafka to balance leaders across machines). Start producing.
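With a 0.8-era distribution the tooling looks roughly like this; script names and flags shifted between releases, so treat these as a sketch rather than the exact commands from the talk:

# Create a topic with 4 partitions, replicated to 2 brokers (illustrative values):
bin/kafka-topics.sh --create --zookeeper vrd01.atlnp1:2181 \
    --topic test1 --partitions 4 --replication-factor 2

# Move partition leadership back to the preferred brokers:
bin/kafka-preferred-replica-election.sh --zookeeper vrd01.atlnp1:2181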

Adding Consumers: With Kafka, adding consumers doesn't impact producers. Minor impact on brokers (they are just keeping track of connections).

Producer Code:

public class TestProducer {
    public static void main(String[] args) {
        long events = Long.parseLong(args[0]);
        long blocks = Long.parseLong(args[1]);
        Random rnd = new Random();

        Properties props = new Properties();
        props.put("broker.list", "vrd01.atlnp1:9092,vrd02.atlnp1:9092,vrd03.atlnp1:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("partitioner.class", "com.silverpop.kafka.playproducer.OrganizationPartitioner");
        ProducerConfig config = new ProducerConfig(props);
        Producer<String, String> producer = new Producer<String, String>(config);

        for (long nBlocks = 0; nBlocks < blocks; nBlocks++) {
            for (long nEvents = 0; nEvents < events; nEvents++) {
                long runtime = new Date().getTime();
                long orgId = 50 + nBlocks;  // reconstructed: the slide used the block number as an org id
                String msg = runtime + "," + orgId + "," + nEvents + "," + rnd.nextInt(1000);
                String key = String.valueOf(orgId);
                KeyedMessage<String, String> data = new KeyedMessage<String, String>("test1", key, msg);
                producer.send(data);
            }
        }
        producer.close();
    }
}

Simple Consumer Code:

String topic = "test1";
int partition = 0;
SimpleConsumer simpleConsumer = new SimpleConsumer("vrd01.atlnp1", 9092, 100000, 64 * 1024, "test");
boolean loop = true;
long maxOffset = -1;
while (loop) {
    FetchRequest req = new FetchRequestBuilder()
            .clientId("randomClient")
            .addFetch(topic, partition, maxOffset + 1, 100000)
            .build();
    FetchResponse fetchResponse = simpleConsumer.fetch(req);
    loop = false;
    for (MessageAndOffset messageAndOffset : fetchResponse.messageSet(topic, partition)) {
        loop = true;
        ByteBuffer payload = messageAndOffset.message().payload();
        maxOffset = messageAndOffset.offset();
        byte[] bytes = new byte[payload.limit()];
        payload.get(bytes);
        System.out.println(String.valueOf(maxOffset) + ": " + new String(bytes, "UTF-8"));
    }
}
simpleConsumer.close();

Consumer Groups Code:

// create 4 partitions of the stream for topic "test", to allow 4 threads to consume
Map<String, List<KafkaStream<Message>>> topicMessageStreams =
        consumerConnector.createMessageStreams(ImmutableMap.of("test", 4));
List<KafkaStream<Message>> streams = topicMessageStreams.get("test");

// create a pool of 4 threads to consume from each of the partitions
ExecutorService executor = Executors.newFixedThreadPool(4);

// consume the messages in the threads
for (final KafkaStream<Message> stream : streams) {
    executor.submit(new Runnable() {
        public void run() {
            for (MessageAndMetadata msgAndMetadata : stream) {
                // process message (msgAndMetadata.message())
            }
        }
    });
}
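The consumerConnector above has to be created first; a minimal sketch, assuming the 0.8-era high-level consumer API (property names differ slightly between 0.7 and 0.8) and hypothetical connection values:

Properties props = new Properties();
props.put("zookeeper.connect", "vrd01.atlnp1:2181");  // "zk.connect" in 0.7
props.put("group.id", "testGroup");                   // all four threads share one group
ConsumerConnector consumerConnector =
        kafka.consumer.Consumer.createJavaConsumerConnector(new ConsumerConfig(props));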

Demo: 4-node Kafka cluster, 4-node Storm cluster, 4-node MongoDB cluster. A test producer in IntelliJ creates website events into Kafka. The Storm-Kafka spout reads from Kafka into a Storm topology. Trident groups by organization and counts visits by day. A Trident endpoint writes to MongoDB. A MongoDB shell query shows the counts change.

LinkedIn Clusters (2012 presentation): 8 nodes per datacenter. ~20 GB RAM available for Kafka. 6 TB storage, RAID 10, basic SATA drives. 10,000 connections into the cluster for both production and consumption.

Performance (LinkedIn 2012 presentation): 10 billion messages/day. Sustained peak: 172,000 messages/second written, 950,000 messages/second read. 367 topics. 40 real-time consumers. Many ad hoc consumers. 10k connections per colo. 9.5 TB of log retained. End-to-end delivery time: 10 seconds (average).

Questions so far?

Something Completely Different: Nathan Marz (Twitter, BackType), creator of Storm.

Immutable Applications: No updates to data; either insert or delete. Functional applications. http://manning.com/marz/bd_meap_ch01.pdf

(thanks to LinkedIn for slide)

Information: Apache Kafka site: http://kafka.apache.org/. List of presentations: https://cwiki.apache.org/confluence/display/KAFKA/Kafka+papers+and+presentations. Kafka wiki: https://cwiki.apache.org/confluence/display/kafka/index. Paper: http://sites.computer.org/debull/a12june/pipeline.pdf. Slides: http://www.slideshare.net/chriscurtin. Me: ccurtin@silverpop.com, @ChrisCurtin on Twitter.

Silverpop Open Positions: Senior Software Engineer (Java, Oracle, Spring, Hibernate, MongoDB); Senior Software Engineer, MIS (.NET stack); Software Engineer; Software Engineer, Integration Services (PHP, MySQL); Delivery Manager, Engineering; Technical Lead, Engineering; Technical Project Manager, Integration Services. http://www.silverpop.com/marketing-company/careers/openpositions.html