Evolution of an Apache Spark Architecture for Processing Game Data

Evolution of an Apache Spark Architecture for Processing Game Data. Nick Afshartous, WB Analytics Platform. May 17th, 2017

About Me: nafshartous@wbgames.com. WB Analytics Core Platform Lead. Contributor to Reactive Kafka. Based at Turbine Game Studio (Needham, MA). Hobbies: sailing, chess.

Some of our games

Agenda: Intro, Ingestion pipeline, Redesigned ingestion pipeline, Summary and lessons learned

Problem Statement: How to evolve a Spark Streaming architecture to address challenges in streaming data into Amazon Redshift.

Tech Stack

Tech Stack Data Warehouse

Kafka: Topics and Partitions; each message is stored as a (Key, Value) at a (Partition, Offset). Producers: a message is a (key, value) pair; the optional key is used to assign the message to a partition. Consumers: can start processing from the earliest offset, the latest offset, or from specific offsets.
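To make the producer side concrete, here is a minimal sketch using the plain Kafka Java producer from Scala. The broker address, topic name, and payload are placeholder assumptions; the point is that the key determines the partition assignment.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProducerSketch extends App {
  // Placeholder broker address; serializers send keys and values as strings.
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)
  // The optional key ("player-123") assigns the message to a partition: same key, same partition.
  producer.send(new ProducerRecord[String, String]("events", "player-123", """{"event":"session_start"}"""))
  producer.close()
}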

Game Data: Games (mobile and console) are instrumented to send event data. Volume varies, up to 100,000 events per second per game; games have up to ~70 event types. Data use cases: development, reporting, decreasing player churn, increasing revenue.

Ingestion Pipelines. Batch Pipeline: Input: JSON; Processing: Hadoop MapReduce / Spark; Storage: Vertica / Redshift. Real-time Pipeline: Input: Avro; Processing: Spark Streaming; Storage: Redshift.

Spark Versions: Upgraded to Spark 2.0.2 from 1.5.2. Load tested Spark 2.1, but blocked by deadlock issue SPARK-19300 ("Executor is waiting for lock").

Intro, Ingestion pipeline, Redesigned ingestion pipeline, Summary and lessons learned

Process for a Game Sending Events: The game registers its Avro schema with the Schema Registry, which returns a schema hash based on the schema's fields and types. Registration triggers Redshift table create/alter statements. The game then sends Avro data together with the schema hash to Event Ingestion.
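As an illustration of hashing a schema by its fields and types, here is a sketch using Avro's SchemaNormalization fingerprint; the actual registry and hashing scheme in the pipeline are internal and may differ, and the event schema shown is hypothetical.

import org.apache.avro.{Schema, SchemaNormalization}

object SchemaHashSketch extends App {
  // Hypothetical event schema for illustration only.
  val schemaJson = """{"type":"record","name":"SessionStart","fields":[{"name":"playerId","type":"string"},{"name":"timestamp","type":"long"}]}"""
  val schema = new Schema.Parser().parse(schemaJson)

  // 64-bit fingerprint computed over the schema's canonical form (fields and types).
  val hash = SchemaNormalization.parsingFingerprint64(schema)
  println(s"schema hash: $hash")
}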

Ingestion Pipeline: Events (Avro plus schema hash) are sent over HTTPS to the Event Ingestion Service, which writes them to the Kafka data topic. Spark Streaming processes micro-batches, writes to S3, and runs the Redshift COPY. (Diagram legend: data flow, invocation.)

Redshift COPY Command: Redshift is optimized for loading from S3; COPY is a SQL statement executed by Redshift. Example table: create table if not exists public.person ( id integer, name varchar ). Example file person.txt: "1 john doe", "2 sarah smith". Example COPY: copy public.person from 's3://mybucket/person.txt'
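A minimal sketch of issuing the COPY from Scala over JDBC; the cluster endpoint, credentials, bucket, and IAM role are placeholder assumptions, not values from the talk.

import java.sql.DriverManager

object CopySketch extends App {
  // Placeholder JDBC URL and credentials.
  val conn = DriverManager.getConnection(
    "jdbc:redshift://example.cluster.redshift.amazonaws.com:5439/dev", "user", "password")
  try {
    val stmt = conn.createStatement()
    // COPY is executed by Redshift itself; it reads the file directly from S3.
    stmt.execute(
      """copy public.person from 's3://mybucket/person.txt'
        |iam_role 'arn:aws:iam::123456789012:role/RedshiftCopyRole'""".stripMargin)
    stmt.close()
  } finally {
    conn.close()
  }
}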

Ingestion Pipeline: Events (Avro plus schema hash) flow over HTTPS to the Event Ingestion Service, into the Kafka data topic, and through Spark Streaming micro-batches to S3. (Diagram legend: data flow, invocation.)

Challenges: Redshift is designed for loading large data files, not for highly concurrent workloads (single-threaded commit queue). Redshift latency can destabilize Spark Streaming. Data loading competes with user queries and reporting workloads. Weekly maintenance.

Intro, Ingestion pipeline, Redesigned ingestion pipeline, Summary and lessons learned

Redesign the Pipeline. Goals: de-couple the Spark Streaming job from Redshift; tolerate Redshift unavailability. High-level solution: Spark only writes to S3; Spark sends copy tasks to a Kafka topic consumed by a (new) Redshift loader; design the Redshift loader to be fault-tolerant with respect to Redshift.

Technical Design Options: Options considered for building the Redshift loader: a second Spark Streaming job, or building a lightweight consumer.

Redshift Loader: Built using Reactive Kafka (APIs for Scala and Java). Reactive Kafka is a high-level Kafka API that leverages Akka Streams and Akka. (Stack: Reactive Kafka on Akka Streams on Akka.)

Akka: Akka is an implementation of Actors ("Actors: A Model of Concurrent Computation in Distributed Systems", Gul Agha, 1986). Actors are single-threaded entities with an asynchronous message queue (mailbox); no shared memory. Features: location transparency (actors can be distributed over a cluster); fault tolerance (actors are restarted on failure). http://akka.io
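A minimal classic Akka actor sketch (classic actors, as used at the time of the talk); the Greeter actor and its message are illustrative only.

import akka.actor.{Actor, ActorSystem, Props}

// A minimal actor: messages arrive in its mailbox and are processed one at a time.
class Greeter extends Actor {
  def receive: Receive = {
    case name: String => println(s"hello, $name")
  }
}

object ActorSketch extends App {
  val system = ActorSystem("Example")
  val greeter = system.actorOf(Props[Greeter], "greeter")
  greeter ! "world"   // asynchronous send; no shared memory between sender and actor
}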

Akka Streams: Stream processing is hard to implement by hand when you must handle back pressure (slowing the rate down to that of the slowest part of the stream) without dropping messages. Akka Streams is a domain-specific language for stream processing; the stream is executed by Akka.

Akka Streams DSL: A Source generates stream elements; a Flow is a transformer (has both input and output); a Sink is the stream endpoint. Pipeline: Source, Flow, Sink.
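A small sketch composing the three stages; the numbers and the doubling Flow are arbitrary examples.

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Sink, Source}

object SourceFlowSinkSketch extends App {
  implicit val system = ActorSystem("Example")
  implicit val materializer = ActorMaterializer()   // executes the stream on Akka

  val source = Source(1 to 3)                    // generates stream elements
  val double = Flow[Int].map(_ * 2)              // transformer: input and output
  val sink   = Sink.foreach[Int](x => println(x)) // stream endpoint

  source.via(double).runWith(sink)               // Source -> Flow -> Sink
}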

Akka Streams Example: Run a stream to process two elements:
val s = Source(1 to 2)
s.map(x => println("hello: " + x)).runWith(Sink.ignore)
The stream is not executed by the calling thread; nothing happens until a run method is invoked. Output:
hello: 1
hello: 2

Reactive Kafka: A Reactive Kafka stream is a type of Akka Stream. The supported version targets Kafka 0.10+; the 0.8 branch is unsupported and less stable. https://github.com/akka/reactive-kafka

Reactive Kafka Example: Create the consumer config (deserializers for key and value, Kafka endpoint, consumer group):
implicit val system = ActorSystem("Example")
val consumerSettings = ConsumerSettings(system, new ByteArrayDeserializer, new StringDeserializer)
  .withBootstrapServers("localhost:9092")
  .withGroupId("group1")
Create and run the stream. Consumer.plainSource creates a Source that streams elements from Kafka; each message has type ConsumerRecord (Kafka API):
Consumer.plainSource(consumerSettings, Subscriptions.topics("topic.name"))
  .map { message => println("message: " + message.value()) }
  .runWith(Sink.ignore)

Backpressure: Slows consumption when the rate is too fast for some part of the stream. Asynchronous operations inside map bypass the backpressure mechanism; use mapAsync instead of map for asynchronous operations (Futures).
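A sketch of the recommended pattern: the asynchronous work is wrapped in a Future and run through mapAsync so the stream's backpressure still applies. The process function and the parallelism of 4 are assumptions for illustration.

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object MapAsyncSketch extends App {
  implicit val system = ActorSystem("Example")
  implicit val materializer = ActorMaterializer()

  // Hypothetical asynchronous operation (e.g. an HTTP call or a database write).
  def process(x: Int): Future[Int] = Future { x * 2 }

  // mapAsync keeps backpressure intact: at most 4 futures are in flight at once,
  // whereas launching futures inside map would escape the stream's rate control.
  Source(1 to 10)
    .mapAsync(parallelism = 4)(process)
    .runWith(Sink.foreach[Int](x => println(x)))
}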

Revised Architecture: Game clients send events (Avro) over HTTPS to the Event Ingestion Service, which writes to the Kafka data topic. Spark Streaming consumes the data topic, writes the data to S3, and publishes copy tasks to the Kafka COPY topic. The Redshift Loader consumes the copy tasks and loads the data into Redshift. (Diagram legend: data flow, invocation.)

Goals: De-couple the Spark Streaming job from Redshift; tolerate Redshift unavailability.

Redshift Cluster Status: Cluster status is displayed on the AWS console and can be obtained programmatically via the AWS SDK.
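A sketch of the programmatic check using the AWS SDK for Java (v1); the cluster identifier is a placeholder, and region/credentials come from the default provider chain.

import com.amazonaws.services.redshift.AmazonRedshiftClientBuilder
import com.amazonaws.services.redshift.model.DescribeClustersRequest

object ClusterStatusSketch extends App {
  val redshift = AmazonRedshiftClientBuilder.defaultClient()
  // "analytics-cluster" is a placeholder cluster identifier.
  val result = redshift.describeClusters(
    new DescribeClustersRequest().withClusterIdentifier("analytics-cluster"))
  val status = result.getClusters.get(0).getClusterStatus   // e.g. "available"
  println(s"cluster status: $status")
}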

Redshift Fault Tolerance: The loader checks the health of Redshift using the AWS SDK. It starts consuming when Redshift is available and shuts down the consumer (Consumer.Control.shutdown()) when Redshift is not available. Run a test query to validate database connections; don't rely on the JDBC driver's Connection.isClosed() method.
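A sketch of validating a connection with a test query instead of Connection.isClosed(); the helper name is illustrative, not the loader's actual code.

import java.sql.Connection

object ConnectionCheckSketch {
  // Returns true only if a trivial query actually succeeds; Connection.isClosed()
  // can report a stale connection as still open, so we don't rely on it.
  def isHealthy(conn: Connection): Boolean =
    try {
      val stmt = conn.createStatement()
      try { stmt.executeQuery("select 1"); true }
      finally { stmt.close() }
    } catch {
      case _: Exception => false
    }
}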

Transactions: With auto-commit enabled, each COPY is its own transaction, and the commit queue limits throughput. Better throughput comes from executing multiple COPYs in a single transaction and running several concurrent transactions per job.
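A sketch of batching several COPY statements into one transaction over JDBC; the helper and its error handling are illustrative assumptions, not the loader's actual code.

import java.sql.Connection

object BatchedCopySketch {
  // Executes several COPY statements in a single transaction with one commit,
  // instead of paying the single-threaded commit-queue cost once per COPY.
  def runBatch(conn: Connection, copyStatements: Seq[String]): Unit = {
    conn.setAutoCommit(false)
    val stmt = conn.createStatement()
    try {
      copyStatements.foreach(sql => stmt.execute(sql))
      conn.commit()
    } catch {
      case e: Exception =>
        conn.rollback()
        throw e
    } finally {
      stmt.close()
    }
  }
}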

Deadlock: Concurrent transactions create the potential for deadlock, since COPY statements lock tables. Redshift will detect the deadlock and return a deadlock exception.

Deadlock timeline: Transaction 1 copies table A while Transaction 2 copies table B (A and B are now locked). Transaction 1 then tries to copy table B and Transaction 2 tries to copy table A; each waits for the other's lock, producing a deadlock.

Deadlock Avoidance: The master hashes to a worker based on the Redshift table name, ensuring that all COPYs for the same table are dispatched to the same worker. Alternatively, COPYs could be ordered within a transaction to avoid deadlock.
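A sketch of the routing idea: derive the worker index from the table name so the same table always maps to the same worker. The function name and worker count are assumptions.

object TableRoutingSketch {
  // Routes a COPY task to a worker index derived from its Redshift table name,
  // so all COPYs for the same table land on the same worker and no two
  // concurrent transactions ever contend for the same table's lock.
  def workerFor(tableName: String, numWorkers: Int): Int =
    math.abs(tableName.hashCode % numWorkers)

  // Example: with 4 workers, every "public.person" COPY goes to the same worker.
  def example(): Int = workerFor("public.person", 4)
}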

Intro, Ingestion pipeline, Redesigned ingestion pipeline, Summary and lessons learned

Lessons (Re)learned: Distinguish post-processing from data processing; use Spark for data processing. Assume dependencies will fail. Load testing: don't focus exclusively on load volume; also consider the number of event types and the distribution of event volumes.

Monitoring: Monitor individual components; the Redshift loader sends a heartbeat via the CloudWatch metric API. Also monitor flow through the entire pipeline by sending test data and verifying successful processing, which catches all failures.
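A sketch of such a heartbeat using the CloudWatch API from the AWS SDK for Java (v1); the namespace and metric name are placeholders, and the alarming setup is not shown.

import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder
import com.amazonaws.services.cloudwatch.model.{MetricDatum, PutMetricDataRequest}

object HeartbeatSketch extends App {
  // Publishes a single heartbeat data point; an alarm on missing data points
  // could then flag a loader that has stopped sending heartbeats.
  val cloudWatch = AmazonCloudWatchClientBuilder.defaultClient()
  cloudWatch.putMetricData(
    new PutMetricDataRequest()
      .withNamespace("RedshiftLoader")
      .withMetricData(new MetricDatum().withMetricName("Heartbeat").withValue(1.0)))
}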

Possible Future Directions: Use Akka Persistence. Implement using more of Reactive Kafka; use Source.groupedWithin for batching COPYs instead of Akka.
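A sketch of what groupedWithin-based batching could look like; the batch size, time window, and the integer stand-ins for copy tasks are assumptions.

import scala.concurrent.duration._
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}

object GroupedWithinSketch extends App {
  implicit val system = ActorSystem("Example")
  implicit val materializer = ActorMaterializer()

  // Emits a batch when 100 copy tasks have accumulated or 30 seconds have passed,
  // whichever comes first; each batch could then be issued as one transaction.
  Source(1 to 1000)                        // stand-in for a stream of copy tasks
    .groupedWithin(100, 30.seconds)
    .map(batch => println(s"batch of ${batch.size}"))
    .runWith(Sink.ignore)
}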

Related Note: Kafka Streams was released as part of Kafka 0.10+. It provides a streams API similar to Reactive Kafka; Reactive Kafka has API integration with Akka.

Imports for Code Examples:
import org.apache.kafka.clients.consumer.ConsumerRecord
import akka.actor.{ActorRef, ActorSystem}
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Keep, Sink, Source}
import scala.util.{Success, Failure}
import scala.concurrent.ExecutionContext.Implicits.global
import akka.kafka.ConsumerSettings
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.{StringDeserializer, ByteArrayDeserializer}
import akka.kafka.Subscriptions
import akka.kafka.ConsumerMessage.{CommittableOffsetBatch, CommittableMessage}
import akka.kafka.scaladsl.Consumer

Questions?