Big Data Facebook

Similar documents
Evolution of Big Data Facebook. Architecture Summit, Shenzhen, August 2012 Ashish Thusoo

Data Storage Infrastructure at Facebook

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Apache Hadoop Goes Realtime at Facebook. Himanshu Sharma

VOLTDB + HP VERTICA. page

5 Fundamental Strategies for Building a Data-centered Data Center

Making the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack. Chief Architect RainStor

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS

HBase Solutions at Facebook

Data Warehousing and Analytics Infrastructure at Facebook

What is Gluent? The Gluent Data Platform

HCatalog. Table Management for Hadoop. Alan F. Page 1

Modern Data Warehouse The New Approach to Azure BI

April Copyright 2013 Cloudera Inc. All rights reserved.

Big Data Hadoop Stack

Microsoft Big Data and Hadoop

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

@Pentaho #BigDataWebSeries

Big Data Infrastructure at Spotify

Data-Intensive Distributed Computing

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

Big Data with Hadoop Ecosystem

An Introduction to Big Data Formats

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here

Top Five Reasons for Data Warehouse Modernization Philip Russom

Evolving To The Big Data Warehouse

Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,

Acquiring Big Data to Realize Business Value

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Aaron Sun, in collaboration with Taehoon Kang, William Greene, Ben Speakmon and Chris Mills

MixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp

This is a brief tutorial that explains how to make use of Sqoop in Hadoop ecosystem.

Advanced Database Technologies NoSQL: Not only SQL

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud

TESTING BIG DATA WORLD RIGA. by Konstantin Pletenev OCTOBER, 2017, TAPOST GROW CONFIDENTLY

Increase Value from Big Data with Real-Time Data Integration and Streaming Analytics

Capture Business Opportunities from Systems of Record and Systems of Innovation

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Introduction to BigData, Hadoop:-

A Review Paper on Big data & Hadoop

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

Processing 11 billions events a day with Spark. Alexander Krasheninnikov

Processing big data with modern applications: Hadoop as DWH backend at Pro7. Dr. Kathrin Spreyer Big data engineer

Information Management course

Stages of Data Processing

Modernizing Business Intelligence and Analytics

Security and Performance advances with Oracle Big Data SQL

<Insert Picture Here> Introduction to Big Data Technology

Accelerate MySQL for Demanding OLAP and OLTP Use Cases with Apache Ignite. Peter Zaitsev, Denis Magda Santa Clara, California April 25th, 2017

Microsoft Exam

#mstrworld. Analyzing Multiple Data Sources with Multisource Data Federation and In-Memory Data Blending. Presented by: Trishla Maru.

Best practices for building a Hadoop Data Lake Solution CHARLOTTE HADOOP USER GROUP

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

ADVANCED HBASE. Architecture and Schema Design GeeCON, May Lars George Director EMEA Services

Was ist dran an einer spezialisierten Data Warehousing platform?

Chase Wu New Jersey Institute of Technology

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Migrate from Netezza Workload Migration

Apache Hive for Oracle DBAs. Luís Marques

MapReduce Algorithms

microsoft

Copyright 2016 Datalynx Pty Ltd. All rights reserved. Datalynx Enterprise Data Management Solution Catalogue

Importing and Exporting Data Between Hadoop and MySQL

2013 AWS Worldwide Public Sector Summit Washington, D.C.

Understanding NoSQL Database Implementations

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

BI ENVIRONMENT PLANNING GUIDE

Big Data Analytics using Apache Hadoop and Spark with Scala

Big Trend in Business Intelligence: Data Mining over Big Data Web Transaction Data. Fall 2012

BIG DATA COURSE CONTENT

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI)

Automating Information Lifecycle Management with

Streamlining CASTOR to manage the LHC data torrent

BigInsights and Cognos Stefan Hubertus, Principal Solution Specialist Cognos Wilfried Hoge, IT Architect Big Data IBM Corporation

Hadoop Overview. Lars George Director EMEA Services

Introduction to Hive Cloudera, Inc.

Certified Big Data and Hadoop Course Curriculum

Question: 1 You need to place the results of a PigLatin script into an HDFS output directory. What is the correct syntax in Apache Pig?

Building an Integrated Big Data & Analytics Infrastructure September 25, 2012 Robert Stackowiak, Vice President Data Systems Architecture Oracle

BIG DATA ANALYTICS USING HADOOP TOOLS APACHE HIVE VS APACHE PIG

Best Practices and Performance Tuning on Amazon Elastic MapReduce

Hadoop & Big Data Analytics Complete Practical & Real-time Training

exam. Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0

Prototyping Data Intensive Apps: TrendingTopics.org

Interactive SQL-on-Hadoop from Impala to Hive/Tez to Spark SQL to JethroData

The Reality of Qlik and Big Data. Chris Larsen Q3 2016

Building Next- GeneraAon Data IntegraAon Pla1orm. George Xiong ebay Data Pla1orm Architect April 21, 2013

Přehled novinek v SQL Server 2016

IoT Impact On Storage Architecture

SOLUTION TRACK Finding the Needle in a Big Data Innovator & Problem Solver Cloudera

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data

Copyright 2012, Oracle and/or its affiliates. All rights reserved.

BIG DATA TESTING: A UNIFIED VIEW

Voldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Lambda Architecture for Batch and Stream Processing. October 2018

Transcription:

Big Data Architectures@ Facebook QCon London 2012 Ashish Thusoo

Outline Big Data @ Facebook - Scope & Scale Evolution of Big Data Architectures @ FB Past, Present and Future Questions

Big Data @ FB: Scale 25 PB of compressed data equivalent to 300 years of HD-TV video

Big Data @ FB: Scale 150 PB of uncompressed data equivalent to 3 x the entire written works of mankind from the beginning of recorded history in all languages

Big Data @ FB: Scale 400 TB/day (uncompressed) of new data That is a lot of disks

Big Data @ FB: Scope Simple reporting Model generation Adhoc analysis + data science Index generation Many many others...

A/B Testing Email #1

A/B Testing Email #2

A/B Testing Email #2 is 3x Better

Friend Map By Paul Butler - https://www.facebook.com/notes/facebook-engineering/visualizing-friendships/ 469716398919

Big Data @ FB: Scope one new job every second ~ 15% of the company uses the clusters

Evolution: 2007-2011 DW Size in TB 30000 25000 22500 15000 7500 8000 15 250 800 0 2007 2008 2009 2010 2011

2007: Traditional EDW

2007: Traditional EDW Web Clusters MySQL Clusters

2007: Traditional EDW Web Clusters MySQL Clusters RDBMS Data Warehouse

2007: Traditional EDW Web Clusters Scribe Mid-Tier MySQL Clusters RDBMS Data Warehouse

2007: Traditional EDW Web Clusters Scribe Mid-Tier NAS Filers MySQL Clusters RDBMS Data Warehouse

2007: Traditional EDW Web Clusters Scribe Mid-Tier Summarization Cluster NAS Filers MySQL Clusters RDBMS Data Warehouse

2007: Pain Points Scribe Mid-Tier Web Clusters Summarization Cluster NAS Filers MySQL Clusters RDBMS Data Warehouse

2007: Pain Points Scribe Mid-Tier Web Clusters Summarization Cluster NAS Filers MySQL Clusters - daily ETL > 24 hours RDBMS Data Warehouse

2007: Pain Points Scribe Mid-Tier Web Clusters Summarization Cluster NAS Filers MySQL Clusters - daily ETL > 24 hours - Lots of tuning/indexes etc. RDBMS Data Warehouse

2007: Pain Points Scribe Mid-Tier Web Clusters Summarization Cluster NAS Filers MySQL Clusters - daily ETL > 24 hours - Lots of tuning/indexes etc. - Lots of hardware planning RDBMS Data Warehouse

2007: Pain Points - compute close to storage (early map/reduce) Scribe Mid-Tier Web Clusters Summarization Cluster NAS Filers MySQL Clusters - daily ETL > 24 hours - Lots of tuning/indexes etc. - Lots of hardware planning RDBMS Data Warehouse

2007: Limitations Most use cases were in business metrics - data science, model building etc. not possible Only summary data was stored online - details archived away

2008: Move to Hadoop Web Clusters Scribe Mid-Tier Summarization Cluster NAS Filers MySQL Clusters RDBMS Data Warehouse

2008: Move to Hadoop Web Clusters Scribe Mid-Tier Batch copier/ loaders NAS Filers Hadoop/Hive Data Warehouse MySQL Clusters RDBMS Data Mart

2008: Immediate Pros Data science at scale became possible For the first time all of the instrumented data could be held online Use cases expanded

2009: Democratizing Data Web Clusters Scribe Mid-Tier NAS Filers Hadoop/Hive Data Warehouse MySQL Clusters RDBMS Data Mart

2009: Democratizing Databee & Chronos: Data Pipeline Framework Data Nectar: instrumentation & schema aware data collection HiPal: Adhoc Queries + Data Discovery Hadoop/Hive Data Warehouse Scrapes: Configuration Driven

2009: Democratizing Data(Nectar) Typical Nectar Pipeline Simple schema evolution built in json encoded short term data decomposing json for long term storage

2009: Democratizing Data (Tools) HiPal - data discovery and query authoring Charting and dashboard generation tools

2009: Democratizing Data (Tools) Databee: Workflow language Chronos: Scheduling tool

2009: Cons of Democratization Isolation to protect against Bad Jobs Fair sharing of the cluster - what is a high priority job and how to enforce it

2010: Controlling Chaos Isolation Reducing operational overhead Better resource utilization Measurement, ownership, accountability

2010: Isolation Web Clusters Scribe Mid-Tier Hadoop/Hive Data Warehouse NAS Filers MySQL Clusters

2010: Isolation Web Clusters Scribe Mid-Tier Platinum Warehouse NAS Filers Hive Replication MySQL Clusters Silver Warehouse

2010: Ops Efficiency Web Clusters Scribe HDFS Platinum Warehouse Hive Replication MySQL Clusters Silver Warehouse

2010: Ops Efficiency Web Clusters Scribe HDFS Platinum Warehouse near real time data consumers Hive Replication MySQL Clusters Silver Warehouse

2010: Ops Efficiency Web Clusters Scribe HDFS ptail: parallel tail on hdfs near real time data consumers Platinum Warehouse Hive Replication MySQL Clusters Silver Warehouse

2010: Resource Utilization (Disk) HDFS-RAID: from 3 replicas to 2.2 replicas RCFile: Row columnar format for compressing Hive tables

2010: Resource Utilization (CPU) Continuous copier/ loaders Incremental scrapes Hive optimizations to save CPU

2010: Monitoring(SLAs) Per job statistics rolled up to owner/group/team Expected time of arrival vs Actual time of arrival of data Simple data quality metrics

2011: New Requirements More real time requirements for aggregations Optimizing resource utilization

2011: Beyond Hadoop Puma for real time analytics Peregrine for simple and fast queries

2010: Puma Web Clusters Scribe HDFS ptail: parallel tail on hdfs near real time data consumers Platinum Warehouse Hive Replication MySQL Clusters Silver Warehouse

2010: Puma Web Clusters Scribe HDFS ptail: parallel tail on hdfs near real time data consumers Platinum Warehouse Hive Replication MySQL Clusters Silver Warehouse

2010: Puma

2010: Puma Scribe HDFS ptail: parallel tail on hdfs

2010: Puma Scribe HDFS ptail: parallel tail on hdfs Puma Clusters

2010: Puma Scribe HDFS ptail: parallel tail on hdfs Puma Clusters Hbase Cluster

Other Challenges Of HyperGrowth Moving data centers Moving sustainably fast

HyperGrowth - Moving Data Centers DW Size in TB 30000 25000 22500 15000 7500 8000 15 250 800 0 2007 2008 2009 2010 2011

HyperGrowth - Moving Data Centers Moved 20 PB of data Leverage replication with fast switch 2-3 months to accomplish the entire move Blog Post on FB by Paul Yang: http://www.facebook.com/notes/paul-yang/moving-an-elephant-largescale-hadoop-data-migration-at-facebook/10150246275318920

Questions Contact Information: ashish.thusoo@gmail.com http://www.linkedin.com/pub/ashish-thusoo/0/5a8/50 https://www.facebook.com/athusoo https://twitter.com/ashishthusoo