Introduc)on to Apache Ka1a. Jun Rao Co- founder of Confluent

Similar documents
Intra-cluster Replication for Apache Kafka. Jun Rao

Tools for Social Networking Infrastructures

Flash Storage Complementing a Data Lake for Real-Time Insight

Data Acquisition. The reference Big Data stack

WHITE PAPER. Reference Guide for Deploying and Configuring Apache Kafka

Apache BookKeeper. A High Performance and Low Latency Storage Service

Auto Management for Apache Kafka and Distributed Stateful System in General

Transformation-free Data Pipelines by combining the Power of Apache Kafka and the Flexibility of the ESB's

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Building LinkedIn s Real-time Data Pipeline. Jay Kreps

Fluentd + MongoDB + Spark = Awesome Sauce

Data Acquisition. The reference Big Data stack

PaaS SAE Top3 SuperAPP

Data Infrastructure at LinkedIn. Shirshanka Das XLDB 2011

Introduction to Apache Kafka

Building Durable Real-time Data Pipeline

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam

10 Things to Consider When Using Apache Ka7a: U"liza"on Points of Apache Ka4a Obtained From IoT Use Case

Big Data Architect.

The SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Dublin Apache Kafka Meetup, 30 August 2017.

Data pipelines with PostgreSQL & Kafka

A Distributed System Case Study: Apache Kafka. High throughput messaging for diverse consumers

Hortonworks and The Internet of Things

Making the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack. Chief Architect RainStor

BIG DATA COURSE CONTENT

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

HDInsight > Hadoop. October 12, 2017

Cloudline Autonomous Driving Solutions. Accelerating insights through a new generation of Data and Analytics October, 2018

Lecture 11 Hadoop & Spark

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS

@ COUCHBASE CONNECT. Using Couchbase. By: Carleton Miyamoto, Michael Kehoe Version: 1.1w LinkedIn Corpora3on

1 Big Data Hadoop. 1. Introduction About this Course About Big Data Course Logistics Introductions

10 Million Smart Meter Data with Apache HBase

The NoSQL Landscape. Frank Weigel VP, Field Technical Opera;ons

Installing and configuring Apache Kafka

Big Data Analytics using Apache Hadoop and Spark with Scala

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop. Introduction / Overview

Outline. Spanner Mo/va/on. Tom Anderson

Processing of big data with Apache Spark

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

VOLTDB + HP VERTICA. page

Streaming Log Analytics with Kafka

Time Series Storage with Apache Kudu (incubating)

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

Source, Sink, and Processor Configuration Values

MapR Enterprise Hadoop

OLTP on Hadoop: Reviewing the first Hadoop- based TPC- C benchmarks

Oracle Database 12c: JMS Sharded Queues

Integrating Splunk with AWS services:

Apache Kafka a system optimized for writing. Bernhard Hopfenmüller. 23. Oktober 2018

microsoft

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018

Let the data flow! Data Streaming & Messaging with Apache Kafka Frank Pientka. Materna GmbH

Introducing Kafka Connect. Large-scale streaming data import/export for

Down the event-driven road: Experiences of integrating streaming into analytic data platforms

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Increase Value from Big Data with Real-Time Data Integration and Streaming Analytics

Towards a Real- time Processing Pipeline: Running Apache Flink on AWS

Kafka Connect the Dots

Microsoft Perform Data Engineering on Microsoft Azure HDInsight.

Voldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

A BigData Tour HDFS, Ceph and MapReduce

April Copyright 2013 Cloudera Inc. All rights reserved.

Scaling the Yelp s logging pipeline with Apache Kafka. Enrico

Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk

Apache Hadoop Goes Realtime at Facebook. Himanshu Sharma

Talend Big Data Sandbox. Big Data Insights Cookbook

8/24/2017 Week 1-B Instructor: Sangmi Lee Pallickara

Distributed Systems 16. Distributed File Systems II

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Storm. Distributed and fault-tolerant realtime computation. Nathan Marz Twitter

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,

Lenses 2.1 Enterprise Features PRODUCT DATA SHEET

Big Data with Hadoop Ecosystem

Apache Kafka Your Event Stream Processing Solution

CIB Session 12th NoSQL Databases Structures

Western Michigan University

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

HBASE + HUE THE UI FOR APACHE HADOOP

Managing IoT and Time Series Data with Amazon ElastiCache for Redis

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM

Accelerate MySQL for Demanding OLAP and OLTP Use Case with Apache Ignite December 7, 2016

New Data Architectures For Netflow Analytics NANOG 74. Fangjin Yang - Imply

Logisland Event mining at scale. Thomas [ ]

Big Data Infrastructure at Spotify

Oracle Data Integrator 12c: Integration and Administration

Virtual IMS user group: Newsletter 57

FROM LEGACY, TO BATCH, TO NEAR REAL-TIME. Marc Sturlese, Dani Solà

Configuring and Deploying Hadoop Cluster Deployment Templates

Lecture 21 11/27/2017 Next Lecture: Quiz review & project meetings Streaming & Apache Kafka

Data Platforms and Pattern Mining

Performance and Scalability with Griddable.io

Pulsar. Realtime Analytics At Scale. Wang Xinglang

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

IBM Data Replication for Big Data

Introduction to Kafka (and why you care)

August 23, 2017 Revision 0.3. Building IoT Applications with GridDB

Transcription:

Introduc)on to Apache Ka1a Jun Rao Co- founder of Confluent

Agenda Why people use Ka1a Technical overview of Ka1a What s coming

What s Apache Ka1a Distributed, high throughput pub/sub system

Ka1a Usage

Common PaIern in Data- driven Companies User Value Product Virality Insights Data Signals Science 5

GeLng Data Is Hard Data scien)sts spent 70% )me on gelng and cleaning data Why?

Variety in Data Sources Database records Users, products, orders Business metrics Clicks, impressions, pageviews Opera)onal metrics CPU usage, requests/sec Applica)on logs Service calls, errors IOT

Variety in New Specialized Systems Batch oriented Hadoop ecosystem (Pig, Hive, Spark, ) Real )me Key/value store (Cassandra, MongoDB, HBase, ) Search (Elas)cSearch, Solr, ) Stream processing (Storm, Spark streaming, Samza) Graph (GraphLab, FlockDB, ) Time series DB (open TSDB, )

Danger of Point- to- point Pipelines Espresso Espresso Espresso Voldemort Voldemort Voldemort Oracle Oracle Oracle User Tracking Operational Logs Operational Metrics Hadoop Log Search Monitoring Data Warehouse Social Graph Rec. Engine Search Security... Email Production Services

Ideal Architecture: Stream Data Plaborm Espresso Espresso Espresso Voldemort Voldemort Voldemort Oracle Oracle Oracle User Tracking Operational Logs Operational Metrics Log Hadoop Log Search Monitoring Data Warehouse Social Graph Rec Engine Search Security... Email Production Services Ka1a is the center of stream data plaborm!

Ka1a at LinkedIn 800 billion messages, 175 TB data per day 13 million messages wriien/sec 65 million messages read/sec Tens of thousands of producers Thousands of consumers

Agenda Why people use Ka1a Technical overview of Ka1a Throughput and scalability Real )me and batch consump)on Durability and availability What s coming

Producer (Java) Concepts and API Topic defines a message queue byte[] key = "key".getbytes(); byte[] value = "value".getbytes(); record = new ProducerRecord("my-topic", key, value); producer.send(record); Consumer (Java) streams[] = Consumer.createMessageStreams("topic1", 1); for(message: streams[0]) { // do something with message }

Distributed Architecture producer producer producer producer producer producer producer producer producer kafka cluster consumer consumer consumer consumer consumer consumer consumer consumer consumer Topics are par))oned for parallelism 14

Built- in Cluster Management Automated failure detec)on and failover Leverage Apache Zookeeper Online data movement

Simple Efficient Storage with a Log Data Source writes Log 0 1 2 3 4 5 6 7 8 9 10 11 12 reads reads Destination System A (time = 7) Destination System B (time = 11) 16

Batching and Compression Batching Producer, broker, consumer Compression Gzip, snappy, lz4 End- to- end

Real Time and Batch Consump)on Mul)- subscrip)on model Data persisted on disk, NOT cached in JVM Rely on pagecache Zero- copy transfer (broker - > consumer) Ordered consump)on per topic par))on Significantly less bookkeeping

Durability and Availability Built- in replica)on Configurable replica)on factor Tolera)ng f 1 failures with f replicas Automated failover

Replicas and Layout Topic par))on has replicas Replicas spread evenly among brokers logs logs logs logs topic1-part1 topic1-part2 topic2-part1 topic2-part2 topic2-part2 topic1-part1 topic1-part2 topic2-part1 topic2-part1 topic2-part2 topic1-part1 topic1-part2 broker 1 broker 2 broker 3 broker 4

Data Flow in Replica)on producer ack 3 1 leader 2 follower 2 follower 4 consumer commit topic1-part1 topic1-part1 topic1-part1 broker 1 broker 2 broker 3 When producer receives ack Latency Durability on failures no ack no network delay some data loss wait for leader 1 network roundtrip a few data loss wait for commiied 2 network roundtrips no data loss

Extend to Mul)ple Par))ons producer leader topic1-part1 producer follower topic1-part1 follower topic1-part1 leader topic2-part1 follower topic2-part1 follower topic2-part1 producer follower follower leader topic3-part1 topic3-part1 topic3-part1 broker 1 broker 2 broker 3 broker 4 Leaders are evenly spread among brokers

Agenda Why people use Ka1a Technical overview of Ka1a What s coming

Stream Data Plaborm

Future Releases of Apache Ka1a New java consumer client BeIer performance Easier protocol for non- java client Security Authen)ca)on: SSL and Kerberos Authoriza)on BeIer cluster management More automated tools Quotas Transac)onal support Exactly once delivery

Confluent Mission: Make stream data plaborm a reality Ka1a development and support Product: Metadata management (released) Rest endpoint (released) Connectors for common systems Monitor data flow end- to- end Stream processing integra)on

Q&A More info on Apache Ka1a hip://ka1a.apache.org/ Confluent hip://confluent.io hip://confluent.io/careers Ka1a meetup tonight at 6:30 (Texas V)