Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lessons Learned
Yaroslav Tkachenko, Senior Data Engineer at Activision


1+ PB Data lake size (AWS S3)

500+ topics in the biggest cluster (Apache Kafka)

10k - 100k+ Messages per second (Apache Kafka)

Scaling the data pipeline even further, in increasing order of complexity:
- Volume: industry best practices
- Games: using previous experience
- Use-cases: completely unpredictable

Kafka topics are partitioned and replicated. [Diagram: a Kafka topic split into Partition 1, Partition 2 and Partition 3, each an ordered sequence of messages numbered by offset (0, 1, 2, ...), with a producer or consumer attached.]
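A minimal sketch of creating such a topic with the Java AdminClient; the broker address, topic name, partition count and replication factor are illustrative assumptions:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions for parallel consumption, replicated 3x for durability
            NewTopic topic = new NewTopic("example-topic", 3, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}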

Scaling the pipeline in terms of Volume

[Diagram: producers writing into Kafka on one side, consumers reading from it on the other.]

Scaling producers:
- Asynchronous / non-blocking writes (the default)
- Compression and batching
- Sampling
- Throttling
- Acks? 0, 1 or -1
- Standard Kafka producer tuning: batch.size, linger.ms, buffer.memory, etc. (see the sketch below)
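A minimal sketch of those knobs on a Java producer; all values are illustrative assumptions, not the settings used at Activision:

import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TunedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put(ProducerConfig.ACKS_CONFIG, "1");                  // 0, 1 or -1 (all)
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");    // compression
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);          // batch.size
        props.put(ProducerConfig.LINGER_MS_CONFIG, 50);              // linger.ms
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 64 * 1024 * 1024);

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            byte[] payload = "example".getBytes(StandardCharsets.UTF_8);
            // send() is asynchronous / non-blocking; the callback fires on ack
            producer.send(new ProducerRecord<>("example-topic", payload),
                    (metadata, exception) -> {
                        if (exception != null) exception.printStackTrace();
                    });
        }
    }
}

The acks trade-off: 0 maximizes throughput but risks silent loss, -1 (all) waits for the full set of in-sync replicas, and 1 is the middle ground.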

Producing directly vs. producing through a proxy: each approach has pros and cons.
- Direct: simple, with a low-latency connection; but the number of TCP connections per broker starts to look scary, and it's really hard to do maintenance on the Kafka clusters.
- Proxy: flexible, makes basic enrichment possible, and makes the Kafka clusters easier to manage.

A simple rule for high-performance producers? Just write to Kafka, nothing else. (Not even auth?)

Scaling Kafka clusters:
- Just add more nodes!
- Disk IO is extremely important
- Tuning num.io.threads and num.network.threads (sketched below)
- Retention
- For more, see the "Optimizing Your Apache Kafka Deployment" whitepaper from Confluent
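The matching broker-side knobs live in server.properties; a sketch with illustrative values (assumptions, not recommendations):

# server.properties (illustrative values)
# threads that handle network requests
num.network.threads=8
# threads that perform disk I/O
num.io.threads=16
# retention: keep only as much data as consumers actually need
log.retention.hours=48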

It's not always about tuning. Sometimes we need more than one cluster. Different workloads require different topologies.

Different clusters for different workloads:
- Stream processing: short retention, more partitions
- Ingestion (HTTP proxy): long retention, high SLA
- Lots of consumers: medium retention, ACLs

Scaling consumers is usually pretty trivial: just increase the number of partitions. Unless you can't. What then?

[Diagram: an Archiver writes microbatches from the message queue to block storage, while a Populator publishes the batch metadata to a work queue that downstream consumers pull from.]

Even if you can add more partitions:
- You can still have bottlenecks within a partition (large messages)
- In case of reprocessing, it's really hard to quickly add A LOT of new partitions AND remove them afterwards
- Also, the number of partitions is not infinite

You can't be sure about any improvements without load testing. Not only for the cluster, but for producers and consumers too.

Scaling and extending the pipeline in terms of Games and Use-cases

We need to keep the number of topics and partitions low:
- More topics means more operational burden
- The number of partitions in a fixed cluster is not infinite
- Autoscaling Kafka is impossible, and scaling it is hard

Topic naming convention: $env.$source.$title.$category-$version

Example: prod.glutton.1234.telemetry_match_event-v1, where "glutton" is the producer and 1234 is the unique game id (CoD WW2 on PSN). A helper like the sketch below can assemble such names.
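For illustration, a tiny helper following this convention; the class and method names are hypothetical, not from the talk:

public final class TopicName {
    // $env.$source.$title.$category-$version
    public static String of(String env, String source, long titleId,
                            String category, int version) {
        return String.format("%s.%s.%d.%s-v%d",
                env, source, titleId, category, version);
    }
}

// TopicName.of("prod", "glutton", 1234, "telemetry_match_event", 1)
// returns "prod.glutton.1234.telemetry_match_event-v1"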

A proper solution was invented decades ago. Think about databases.

A messaging system IS a form of database. Data topic = database + table. Data topic = namespace + data type.

Compare these:

prod.glutton.1234.telemetry_match_event-v1  ->  telemetry.matches
dev.user_login_records.4321.all-v1          ->  user.logins
prod.marketplace.5678.purchase_event-v1     ->  marketplace.purchases

Each approach has pros and cons:
- Metadata-based names: topics are obviously easier to track and monitor (and even consume). As a consumer, I can consume exactly what I want, instead of consuming a single large topic and extracting the required values.
- But those dynamic fields can and will change; producers (sources) and consumers will change. And it's impossible to enforce any constraints with a topic name, so you can always end up with dev data in a prod topic and vice versa.
- Data-type names: very efficient utilization of topics and partitions.

After removing that metadata from the topic names, stream processing becomes mandatory.

Stream processing becomes mandatory for:
- Measuring
- Validating
- Enriching
- Filtering & routing (see the sketch below)
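A minimal Kafka Streams sketch of the filtering & routing step; the topic names and the routeFor logic are illustrative assumptions:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class RoutingProcessor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "operational-router");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG,
                Serdes.ByteArray().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG,
                Serdes.ByteArray().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<byte[], byte[]> input = builder.stream("ingestion");

        input
            .filter((key, value) -> value != null && value.length > 0) // validating
            .to((key, value, ctx) -> routeFor(value));                 // routing

        new KafkaStreams(builder.build(), props).start();
    }

    // Hypothetical: inspect the envelope and pick a destination topic
    private static String routeFor(byte[] value) {
        return "telemetry.matches"; // e.g. derived from the envelope's message_name
    }
}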

Having a single message schema for a topic is more than just a nice-to-have.

Number of supported message formats: 8

What does the stream processor actually receive: JSON? Protobuf? Something custom? Avro?

Custom deserialization (the ??? placeholders are the point: with eight formats, which type do you deserialize to?):

// Application.java
props.put("value.deserializer", "com.example.CustomDeserializer");

// CustomDeserializer.java
public class CustomDeserializer implements Deserializer<???> {
    @Override
    public ??? deserialize(String topic, byte[] data) {
        ???
    }
}

Message envelope anatomy:
- Header / metadata: ID, env, timestamp, source, game, ...
- Body / payload: the event message itself

Unified message envelope:

syntax = "proto2";

message MessageEnvelope {
    optional bytes message_id = 1;
    optional uint64 created_at = 2;
    optional uint64 ingested_at = 3;
    optional string source = 4;
    optional uint64 title_id = 5;
    optional string env = 6;
    optional UserInfo resource_owner = 7;
    optional SchemaInfo schema_info = 8;
    optional string message_name = 9;
    optional bytes message = 100;
}
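On the consuming side, the envelope finally gives the deserializer a concrete type. A sketch, assuming MessageEnvelope is the Java class protobuf generates from the definition above:

import com.google.protobuf.InvalidProtocolBufferException;
import org.apache.kafka.common.serialization.Deserializer;

public class EnvelopeDeserializer implements Deserializer<MessageEnvelope> {
    @Override
    public MessageEnvelope deserialize(String topic, byte[] data) {
        try {
            // The envelope is always protobuf; the inner `message` bytes are
            // decoded separately, according to schema_info / message_name
            return MessageEnvelope.parseFrom(data);
        } catch (InvalidProtocolBufferException e) {
            throw new RuntimeException("Malformed envelope on topic " + topic, e);
        }
    }
}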

Schema Registry:
- An API to manage message schemas (a hypothetical client interface is sketched below)
- Single source of truth for all producers and consumers
- It should be impossible to send a message to the pipeline without registering its schema in the Schema Registry!
- A good Schema Registry supports immutability, versioning and basic validation
- Activision uses a custom Schema Registry implemented with Python and Cassandra
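Since Activision's registry is custom, its real API is not shown in the talk; this hypothetical interface only illustrates the contract described above:

public interface SchemaRegistryClient {
    // Re-registering an existing (name, version) pair must fail: schemas are immutable
    void register(String messageName, int version, byte[] schemaDefinition);

    // Consumers resolve the schema referenced by the envelope's schema_info
    byte[] fetch(String messageName, int version);
}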

Summary:
- Kafka tuning and best practices matter
- Invest in good SDKs for producing and consuming data
- A unified message envelope and topic names make adding a new game almost effortless; operational stream processing makes it possible
- Make sure you can support ad-hoc filtering and routing of data
- Topic names should express data types, not producer or consumer metadata
- A Schema Registry is a must-have

Thanks! @sap1ens