Aaron Sun, in collaboration with Taehoon Kang, William Greene, Ben Speakmon and Chris Mills

Similar documents
Hadoop & Big Data Analytics Complete Practical & Real-time Training

Apache Hive for Oracle DBAs. Luís Marques

Introduction to BigData, Hadoop:-

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

Evolution of Big Data Facebook. Architecture Summit, Shenzhen, August 2012 Ashish Thusoo

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Hadoop Online Training

Innovatus Technologies

BIG DATA COURSE CONTENT

Big Data Analytics using Apache Hadoop and Spark with Scala

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture

Data-Intensive Distributed Computing

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

Big Data Infrastructure at Spotify

Configuring and Deploying Hadoop Cluster Deployment Templates

microsoft

Data Storage Infrastructure at Facebook

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)

Big Data Facebook

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data

HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation)

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Big Data Hadoop Stack

Prototyping Data Intensive Apps: TrendingTopics.org

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI)

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

Microsoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo

Cloudera Introduction

Hadoop Overview. Lars George Director EMEA Services

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

Big Data Hadoop Course Content

Increase Value from Big Data with Real-Time Data Integration and Streaming Analytics

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

Certified Big Data and Hadoop Course Curriculum

Exam Questions

Introduction to Hive Cloudera, Inc.

Apache Drill. Interactive Analysis of Large-Scale Datasets. Tomer Shiran

Hadoop. Introduction to BIGDATA and HADOOP

Hadoop course content

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

Hadoop Development Introduction

Scalable Web Programming. CS193S - Jan Jannink - 2/25/10

Techno Expert Solutions An institute for specialized studies!

exam. Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0

Microsoft Big Data and Hadoop

April Copyright 2013 Cloudera Inc. All rights reserved.

Shark: SQL and Rich Analytics at Scale. Michael Xueyuan Han Ronny Hajoon Ko

HCatalog. Table Management for Hadoop. Alan F. Page 1

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Introduction to Hadoop and MapReduce

Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...

Integrating Hive and Kafka

Cloudera Introduction

Technical Sheet NITRODB Time-Series Database

Tuning the Hive Engine for Big Data Management

Comparing SQL and NOSQL databases

SCALABLE DISTRIBUTED DEEP LEARNING

Stages of Data Processing

Oracle Data Integrator 12c: Integration and Administration

Processing 11 billions events a day with Spark. Alexander Krasheninnikov

ExamTorrent. Best exam torrent, excellent test torrent, valid exam dumps are here waiting for you

This is a brief tutorial that explains how to make use of Sqoop in Hadoop ecosystem.

Tutorial Outline. Map/Reduce vs. DBMS. MR vs. DBMS [DeWitt and Stonebraker 2008] Acknowledgements. MR is a step backwards in database access

A Distributed System Case Study: Apache Kafka. High throughput messaging for diverse consumers

Oracle Data Integrator 12c: Integration and Administration

PROFESSIONAL. NoSQL. Shashank Tiwari WILEY. John Wiley & Sons, Inc.

Oracle Big Data Connectors

50 Must Read Hadoop Interview Questions & Answers

Cloud Analytics and Business Intelligence on AWS

Down the event-driven road: Experiences of integrating streaming into analytic data platforms

Time Series Live 2017

Axibase Time-Series Database. Non-relational database for storing and analyzing large volumes of metrics collected at high-frequency

Data Access 3. Migrating data. Date of Publish:

Project Genesis. Cafepress.com Product Catalog Hundreds of Millions of Products Millions of new products every week Accelerating growth

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

QMF Analytics v11: Not Your Green Screen QMF

Data Acquisition. The reference Big Data stack

Evolution of an Apache Spark Architecture for Processing Game Data

Chapter 5. The MapReduce Programming Model and Implementation

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Certified Big Data Hadoop and Spark Scala Course Curriculum

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

Understanding NoSQL Database Implementations

Flash Storage Complementing a Data Lake for Real-Time Insight

Autonomous Database Level 100

Gain Insights From Unstructured Data Using Pivotal HD. Copyright 2013 EMC Corporation. All rights reserved.

Big Data Architect.

Big Data Development HADOOP Training - Workshop. FEB 12 to (5 days) 9 am to 5 pm HOTEL DUBAI GRAND DUBAI

Data Lake Based Systems that Work

We are ready to serve Latest Testing Trends, Are you ready to learn.?? New Batches Info

Hadoop File Formats and Data Ingestion. Prasanth Kothuri, CERN

To Shard or Not to Shard That is the question! Peter Zaitsev April 21, 2016

Hadoop An Overview. - Socrates CCDH

Transcription:

Aaron Sun, in collaboration with Taehoon Kang, William Greene, Ben Speakmon and Chris Mills

INTRO About KIXEYE An online gaming company focused on mid- core and hard- core games Founded in 00 Over 00 employees by Feb 0 times longer retention and 0 times higher ARPU Analytics Engineering Team Part of the Business Intelligence team team members

INTRO 0 M click events M active users DAILY STATS 00G logs data Triple the number in 0 Q

Requirements Requirements A fault- tolerant and scalable system Support standard reports Support ad- hoc, exploratory data queries Easy to use and manage Real- time is nice- to- have, but not necessary

Architecture Log Architecture Dashboard Custom ETL + Sqoop Oozie HIVE async REST API MySQL

Log Collection and Processing Log Architecture ~ GB uncompressed logs (player clicks) / hour Logs collected by Apache Chukwa - Choose over Flume and Scribe for Chukwa s easy configuration Log data cleaned and parsed as Snappy compressed JSON (staging) - Choose over Protocol Buffer and Avro for JSON s simplicity

ETL Component Architecture Custom ETL + Sqoop Oozie Hadoop cluster size - - core node * 0-0 mappers and 80 reducers Run ETL every 0 min - Populate RCFile into Hive tables Sqoop is used to for collecting data from legacy ETL system All ETL tasks are managed by Oozie

HIVE Tables Architecture HIVE 0+ click types (e.g. install, attack ) are loaded into corresponding tables Insertion is done by enabling dynamic partitions - FROM SELECT *** INSERT INTO is very useful Tables are usually partitioned by game and day - Some are further partitioned by hour

HIVE Tables (Cont d ) Architecture HIVE RCFile with Snappy compression is the data format - Excellent performance but expensive alter table - Evaluated jsonserde and protobuf format with custom serde, slow in querying Clustered by and Tablesample - A useful feature for analysts

HIVE Tables (Cont d ) Architecture HIVE Small files from hourly loading - Weekly merge operations alter table TBL partition (PART) concatenate; Evaluated Hive index on certain fields (e.g. level) - Improvement is not significant

Data Access Pull Data Access Two data access patterns Pull RESTful service built on top of Beeswax async REST API Asynchronous and concurrent requests compared to HiveServer query/status/fetch 00+ queries from the analysts every day Fixing bugs and adding features: To support multi- hivedb To support caching, load- balancing, and fail- over

Data Access Push Data Access A wrapper library for hive f command Data load Data merge Data migration Metric generation Used by ETL engineers

A Case Study Using Hive UDTF to Generate Session Session definition Stats Two consecutive user activities are separated as different sessions if the time interval between them exceeds a time- out threshold (e.g. 0 min) Requirements: Compute incrementally Provide as a Hive function

Using Hive UDTF to Generate Sessions Hourly Partition 0 Hourly Partition 0 Hourly Partition 0 Hourly Partition 0 view: collect_set(ts) group by uid 00 [ts, ts, ts, ] 00 [ts, ts, ts, ] 999 [ts, ts, ts, ] A Case Study Redis Intermediate data UDTF session_label 999 session_ ts 999 session_ ts 999 session_ ts

Lessons Learned Final notes Analysts are greedy Scan full set of data and ignore partitions Non- optimized joins RCFile is a double- edged sword Sqoop does not support RCFile Inflexible schema Automate, automate, and automate Constantly- changing ETL requirements New metrics on new features

Future Work Visualization layer Integration with Hbase Richer UDFs Final notes

We are hiring! Questions Our audacious goals: Build a world- class data and analytics team Deliver high- quality, real- time player behavior intelligence Join us to build the game- changing analytics system http://www.kixeye.com/#/en/jobs

Q & A asun@kixeye.com Final Notes