Evolution of Big Data Facebook. Architecture Summit, Shenzhen, August 2012 Ashish Thusoo

Similar documents
Big Data Facebook

Data Storage Infrastructure at Facebook

Apache Hadoop Goes Realtime at Facebook. Himanshu Sharma

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

VOLTDB + HP VERTICA. page

HCatalog. Table Management for Hadoop. Alan F. Page 1

April Copyright 2013 Cloudera Inc. All rights reserved.

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS

Making the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack. Chief Architect RainStor

Data Warehousing and Analytics Infrastructure at Facebook

5 Fundamental Strategies for Building a Data-centered Data Center

Big Data with Hadoop Ecosystem

Modern Data Warehouse The New Approach to Azure BI

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Microsoft Big Data and Hadoop

Data-Intensive Distributed Computing

HBase Solutions at Facebook

Increase Value from Big Data with Real-Time Data Integration and Streaming Analytics

Big Data Infrastructure at Spotify

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,

Big Data Hadoop Stack

An Introduction to Big Data Formats

Was ist dran an einer spezialisierten Data Warehousing platform?

Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk

Unifying Big Data Workloads in Apache Spark

What is Gluent? The Gluent Data Platform

CloudExpo November 2017 Tomer Levi

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here

Aaron Sun, in collaboration with Taehoon Kang, William Greene, Ben Speakmon and Chris Mills

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Prototyping Data Intensive Apps: TrendingTopics.org

SOLUTION TRACK Finding the Needle in a Big Data Innovator & Problem Solver Cloudera

Security and Performance advances with Oracle Big Data SQL

Big Trend in Business Intelligence: Data Mining over Big Data Web Transaction Data. Fall 2012

Architectural challenges for building a low latency, scalable multi-tenant data warehouse

Accelerate MySQL for Demanding OLAP and OLTP Use Cases with Apache Ignite. Peter Zaitsev, Denis Magda Santa Clara, California April 25th, 2017

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Introduction to BigData, Hadoop:-

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud

Apache Hive for Oracle DBAs. Luís Marques

BI ENVIRONMENT PLANNING GUIDE

Evolving To The Big Data Warehouse

Acquiring Big Data to Realize Business Value

Hacking PostgreSQL Internals to Solve Data Access Problems

This is a brief tutorial that explains how to make use of Sqoop in Hadoop ecosystem.

microsoft

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

Agenda. AWS Database Services Traditional vs AWS Data services model Amazon RDS Redshift DynamoDB ElastiCache

Importing and Exporting Data Between Hadoop and MySQL

2013 AWS Worldwide Public Sector Summit Washington, D.C.

Cloud Computing & Visualization

Question: 1 You need to place the results of a PigLatin script into an HDFS output directory. What is the correct syntax in Apache Pig?

Processing 11 billions events a day with Spark. Alexander Krasheninnikov

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Chase Wu New Jersey Institute of Technology

BIG DATA COURSE CONTENT

Certified Big Data and Hadoop Course Curriculum

Databricks, an Introduction

Hadoop & Big Data Analytics Complete Practical & Real-time Training

@Pentaho #BigDataWebSeries

CISC 7610 Lecture 2b The beginnings of NoSQL

CIS 612 Advanced Topics in Database Big Data Project Lawrence Ni, Priya Patil, James Tench

Top Five Reasons for Data Warehouse Modernization Philip Russom

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

MixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp

CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM

Sempala. Interactive SPARQL Query Processing on Hadoop

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

Understanding NoSQL Database Implementations

Big Data Analytics using Apache Hadoop and Spark with Scala

A Review Paper on Big data & Hadoop

Data Analytics at Logitech Snowflake + Tableau = #Winning

Introduction to Hive Cloudera, Inc.

Big Data on AWS. Peter-Mark Verwoerd Solutions Architect

Microsoft Exam

New Data Architectures For Netflow Analytics NANOG 74. Fangjin Yang - Imply

THE COMPLETE GUIDE HADOOP BACKUP & RECOVERY

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

Advanced Database Technologies NoSQL: Not only SQL

Pagely.com implements log analytics with AWS Glue and Amazon Athena using Beyondsoft s ConvergDB

Processing Unstructured Data. Dinesh Priyankara Founder/Principal Architect dinesql Pvt Ltd.

Activator Library. Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success.

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

exam. Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

Hive SQL over Hadoop

<Insert Picture Here> Introduction to Big Data Technology

TESTING BIG DATA WORLD RIGA. by Konstantin Pletenev OCTOBER, 2017, TAPOST GROW CONFIDENTLY

EsgynDB Enterprise 2.0 Platform Reference Architecture

Modernizing Business Intelligence and Analytics

Percona Live September 21-23, 2015 Mövenpick Hotel Amsterdam

BIG DATA ANALYTICS USING HADOOP TOOLS APACHE HIVE VS APACHE PIG

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI)

2014 年 3 月 13 日星期四. From Big Data to Big Value Infrastructure Needs and Huawei Best Practice

MapReduce Algorithms

Data Architectures in Azure for Analytics & Big Data

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Automating Information Lifecycle Management with

Transcription:

Evolution of Big Data Architectures@ Facebook Architecture Summit, Shenzhen, August 2012 Ashish Thusoo

About Me Currently Co-founder/CEO of Qubole Ran the Data Infrastructure Team at Facebook till 2011 Co-founded Apache Hive @ Facebook

Outline Big Data @ Facebook - Scope & Scale Evolution of Big Data Architectures @ FB Qubole

Big Data @ FB(2011): Scale 25 PB of compressed data ~ 150 PB of uncompressed data 400 TB/day (uncompressed) of new data 1 new job every second

Big Data @ FB: Scope Simple reporting Model generation Adhoc analysis + data science Index generation Many many others...

A/B Testing Email #1

A/B Testing Email #2

A/B Testing Email #2 is 3x Better

Evolution: 2007-2011 30000 22500 DW Size in TB 25000 15000 7500 8000 15 250 800 0 2007 2008 2009 2010 2011

2007: Traditional EDW Web Clusters Scribe Mid-Tier Summarization Cluster NAS Filers MySQL Clusters RDBMS Data Warehouse

2007: Pain Points - compute close to storage (early map/reduce) Scribe Mid-Tier Web Clusters Summarization Cluster NAS Filers MySQL Clusters - daily ETL > 24 hours - Lots of tuning/indexes etc. - Lots of hardware planning RDBMS Data Warehouse

2007: Limitations Most use cases were in business metrics - data science, model building etc. not possible Only summary data was stored online - details archived away

2008: Move to Hadoop Web Clusters Scribe Mid-Tier Summarization Cluster NAS Filers MySQL Clusters RDBMS Data Warehouse

2008: Move to Hadoop Web Clusters Scribe Mid-Tier Batch copier/ loaders NAS Filers Hadoop/Hive Data Warehouse MySQL Clusters RDBMS Data Mart

2008: Immediate Pros Data science at scale became possible For the first time all of the instrumented data could be held online Use cases expanded

2009: Democratizing Data Web Clusters Scribe Mid-Tier NAS Filers Hadoop/Hive Data Warehouse MySQL Clusters RDBMS Data Mart

2009: Democratizing Databee & Chronos: Data Pipeline Framework Data Nectar: instrumentation & schema aware data collection HiPal: Adhoc Queries + Data Discovery Hadoop/Hive Data Warehouse Scrapes: Configuration Driven

2009: Democratizing Data(Nectar) Typical Nectar Pipeline Simple schema evolution built in json encoded short term data decomposing json for long term storage

2009: Democratizing Data (Tools) HiPal - data discovery and query authoring Charting and dashboard generation tools

2009: Democratizing Data (Tools) Databee: Workflow language Chronos: Scheduling tool

2009: Cons of Democratization Isolation to protect against Bad Jobs Fair sharing of the cluster - what is a high priority job and how to enforce it

2010: Controlling Chaos Isolation Reducing operational overhead Better resource utilization Measurement, ownership, accountability

2010: Isolation Web Clusters Scribe Mid-Tier Hadoop/Hive Data Warehouse NAS Filers MySQL Clusters

2010: Isolation Web Clusters Scribe Mid-Tier Platinum Warehouse NAS Filers Hive Replication MySQL Clusters Silver Warehouse

2010: Ops Efficiency Web Clusters Scribe HDFS ptail: parallel tail on hdfs near real time data consumers Platinum Warehouse Hive Replication MySQL Clusters Silver Warehouse

2010: Resource Utilization (Disk) HDFS-RAID: from 3 replicas to 2.2 replicas RCFile: Row columnar format for compressing Hive tables

2010: Resource Utilization (CPU) Continuous copier/ loaders Incremental scrapes Hive optimizations to save CPU

2010: Monitoring(SLAs) Per job statistics rolled up to owner/group/team Expected time of arrival vs Actual time of arrival of data Simple data quality metrics

2011: New Requirements More real time requirements for aggregations Optimizing resource utilization

2011: Beyond Hadoop Puma for real time analytics Peregrine for simple and fast queries

2011: Puma Web Clusters Scribe HDFS ptail: parallel tail on hdfs near real time data consumers Platinum Warehouse Hive Replication MySQL Clusters Silver Warehouse

2011: Puma Scribe HDFS ptail: parallel tail on hdfs Puma Clusters Hbase Cluster

Some takeaways Operating and optimizing Data Infrastructure is a hard problem Lots of components from log collection, storage, compute, query processing, tools and interfaces Lots of choices within each part of the stack

Qubole Mission: Data Infrastructure in the Cloud made Easy, Fast and Reliable We take care of operating and optimizing this infrastructure so that you can focus on your data, analysis, algorithms and building your data apps

Qubole - Information Early Trial(by invitation): www.qubole.com Come talk to us to join a small and passionate team jobs@qubole.com Follow us on twitter/facebook/linkedin