Data Storage Infrastructure at Facebook


CIS 601 Presentation, Spring 2018, Cleveland State University
Yi Dong
Instructor: Dr. Chung

Outline
- Strategy for data storage, processing, and log collection
- Data flow from the source to the data warehouse
- Storage systems and optimization
- Data discovery and analysis
- Challenges in resource sharing

Facebook's Architecture
- Components: Hadoop, HBase, Haystack, Hive, MySQL, Memcached, PHP, HipHop compiler, Scribe, Thrift

Part 1: Strategy for Data Storage, Processing, and Log Collection
- Apache Hadoop
- Apache Hive
- Scribe

Why Hadoop?
- Scalability
  - Able to process multi-petabyte datasets
- Fault tolerance
  - Node failures are expected every day
  - The number of nodes is not constant
- High availability
  - Users can access data from the nearest node
- Cost efficiency
  - Open source
  - Uses commodity hardware for the nodes of a Hadoop cluster
  - Eliminates dependence on any particular technology

Hadoop Architecture
- HDFS (Hadoop Distributed File System)
- MapReduce infrastructure

Hive
- SQL-like analysis tool (HiveQL) on top of Hadoop
- Dramatically improves the productivity and usability of Hadoop
- With Hive, users without programming experience can use Hadoop for their work
- Without Hive, a basic Hadoop data manipulation such as GROUP BY can take more than 100 lines of Java or Python code
- Worse, a programmer without database knowledge is likely to pick a sub-optimal algorithm, often a badly sub-optimal one
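To illustrate the gap described above, the following is a minimal sketch of a GROUP BY count written by hand in map/shuffle/reduce style. It runs in plain Python with no Hadoop involved. The record layout, the function names, and the HiveQL statement in the comment are illustrative assumptions, not material from the presentation.

```python
from collections import defaultdict

# Roughly the same intent in HiveQL (hypothetical table and column names):
#   SELECT country, COUNT(*) FROM page_views GROUP BY country;

def map_phase(records):
    """Emit a (key, 1) pair for every input record."""
    for record in records:
        yield record["country"], 1

def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values of each key to get the per-group count."""
    return {key: sum(values) for key, values in groups.items()}

def group_by_count(records):
    return reduce_phase(shuffle_phase(map_phase(records)))

# Hypothetical sample data.
page_views = [
    {"user": 1, "country": "US"},
    {"user": 2, "country": "CN"},
    {"user": 3, "country": "US"},
]
counts = group_by_count(page_views)
```

Even this toy version needs three explicit phases for what is a single line of HiveQL. A real MapReduce job would add input parsing, serialization, and job configuration on top.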

Hive Architecture

Scribe: Scalable Logging System
- Distributed and scalable logging system
- Combined with HDFS
- Aggregates logs from thousands of web servers

Part 2: Data Flow Architecture
- Two sources of data
  - Web servers: log data, copied every 5-15 minutes
  - Federated MySQL: information data, copied daily
- Two different clusters
  - Production Hive-Hadoop cluster
  - Ad hoc Hive-Hadoop cluster

Dealing with Data Delivery Latency
- Even though log data is copied at 5-15 minute intervals, the loader only loads it into the Hive native table at the end of the day
- Solution at Facebook:
  - Use Hive's external table feature to create table metadata on the raw HDFS files
  - After the data is loaded into the Hive native table at the end of the day, remove the raw HDFS files from the external table
- New solutions are needed to enable continuous loading of log data
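The external-table approach can be sketched as a small Python helper that generates the HiveQL statements involved. The statement forms (`CREATE EXTERNAL TABLE ... LOCATION`, `ALTER TABLE ... ADD PARTITION ... LOCATION`, `ALTER TABLE ... DROP PARTITION`) are standard HiveQL. The table name, columns, partition keys, and HDFS paths are hypothetical, and this is not Facebook's actual loader.

```python
def create_external_table(table, columns, location):
    """DDL that layers table metadata over raw files already sitting in HDFS."""
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols}) "
        f"PARTITIONED BY (ds STRING, batch STRING) "
        f"LOCATION '{location}'"
    )

def add_partition(table, ds, batch, location):
    """Expose one freshly copied 5-15 minute batch to queries."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (ds='{ds}', batch='{batch}') LOCATION '{location}'"
    )

def drop_partition(table, ds, batch):
    """Detach a batch once the end-of-day load has moved it to the native table.
    For an external table this removes metadata only; deleting the raw files
    is a separate step."""
    return (
        f"ALTER TABLE {table} DROP IF EXISTS "
        f"PARTITION (ds='{ds}', batch='{batch}')"
    )

# Hypothetical names and paths, for illustration only.
ddl = create_external_table(
    "raw_click_logs",
    [("user_id", "BIGINT"), ("url", "STRING")],
    "/warehouse/raw/click_logs",
)
attach = add_partition(
    "raw_click_logs", "2018-03-01", "0915",
    "/warehouse/raw/click_logs/2018-03-01/0915",
)
detach = drop_partition("raw_click_logs", "2018-03-01", "0915")
```

Queries that need same-day data would read the external table. After the nightly load, they would switch to the native table.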

Part 3: Storage Optimization
- All data needs to be compressed to save space
  - Hadoop allows user-specified codecs; Facebook uses the gzip codec to get a compression factor of 6-7
- HDFS by default keeps 3 copies of the data to prevent data loss
  - Using erasure codes, with 2 copies of the data and 2 copies of the error correction code, this multiple can be brought down to 2.2
  - Hadoop RAID is used on older data sets, while newer data sets stay replicated 3 ways
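The 2.2 figure can be reproduced with a short calculation. The sketch below assumes a stripe of 10 data blocks protected by 1 parity block. That stripe layout is an assumption chosen because it yields 2.2; the slides only state "2 copies of data and 2 copies of error correction code".

```python
def effective_replication(data_blocks, parity_blocks, data_copies, parity_copies):
    """Physical blocks stored per logical data block."""
    physical = data_blocks * data_copies + parity_blocks * parity_copies
    return physical / data_blocks

def bytes_on_disk(raw_bytes, compression_factor, replication):
    """Disk footprint after compression and replication."""
    return raw_bytes / compression_factor * replication

# Default HDFS: every block stored 3 times, no parity.
plain = effective_replication(10, 0, 3, 0)    # 3.0

# Assumed stripe: 10 data blocks + 1 parity block, each kept twice.
raided = effective_replication(10, 1, 2, 2)   # (20 + 2) / 10 = 2.2

# Footprint of 1 TB of raw logs at a compression factor of 6, in TB.
newer_data_tb = bytes_on_disk(1.0, 6, plain)
older_data_tb = bytes_on_disk(1.0, 6, raided)
```

Under these assumptions, moving older data from 3-way replication to the erasure-coded layout saves about 27% of its disk space (2.2 / 3 ≈ 0.73).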

Part 3: Storage Optimization (continued)
- Reduce the memory usage of the HDFS NameNode
  - Trade off latency to reduce memory pressure
  - Implement a file format that reduces the number of map tasks
- Data federation
  - Distribute data based on time: data that crosses a time boundary will need more joins
  - Distribute data based on application: some common data has to be replicated

Part 4: Data Discovery and Analysis
- Hive
  - Provides immense scalability to non-engineering users, such as business analysts and product managers
- Data discovery
  - Internal tool that enables a wiki approach to metadata creation
  - Tools to extract lineage information from query logs
- Periodic batch jobs
  - For such jobs, inter-job dependencies and the ability to schedule jobs are critical
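The role of inter-job dependencies in periodic batch work can be sketched with a topological sort. The example uses Python's standard `graphlib` module (Python 3.9+). The job names are hypothetical, and this is not the scheduling tool used at Facebook.

```python
from graphlib import TopologicalSorter

# Hypothetical daily pipeline: each job maps to the set of jobs it depends on.
dependencies = {
    "load_logs": set(),
    "load_mysql_dump": set(),
    "sessionize": {"load_logs"},
    "daily_report": {"sessionize", "load_mysql_dump"},
}

def execution_order(deps):
    """Return one valid run order in which every job follows its prerequisites.
    Raises graphlib.CycleError if the dependencies are circular."""
    return list(TopologicalSorter(deps).static_order())

order = execution_order(dependencies)
```

A scheduler built on this idea would launch a job only after everything before it in the order has finished, and would reject a pipeline whose dependencies form a cycle.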

Part 5: Resource Sharing
- Support the coexistence of interactive jobs and batch jobs on the same Hadoop cluster
  - Implement the Hadoop Fair Share Scheduler
- Isolate ad hoc queries from periodic batch queries
- Make the scheduler more aware of system resource usage caused by poorly written ad hoc queries
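The idea behind fair sharing can be sketched as a max-min fair allocation of task slots across pools. This is a simplified illustration of the principle, not the Hadoop Fair Share Scheduler's actual implementation. The pool names and slot counts are hypothetical.

```python
def fair_share(total_slots, demands):
    """Max-min fair allocation: pools that want less than an equal share get
    their full demand, and the leftover is split evenly among the rest."""
    allocation = {pool: 0 for pool in demands}
    remaining = total_slots
    unsatisfied = {pool for pool, demand in demands.items() if demand > 0}

    while remaining > 0 and unsatisfied:
        share = remaining / len(unsatisfied)
        # Pools whose whole demand fits within the current equal share.
        fits = {pool for pool in unsatisfied if demands[pool] <= share}
        if not fits:
            # Everyone wants more than an equal share: split evenly and stop.
            for pool in unsatisfied:
                allocation[pool] = share
            break
        for pool in fits:
            allocation[pool] = demands[pool]
            remaining -= demands[pool]
        unsatisfied -= fits
    return allocation

# Hypothetical pools competing for 100 slots.
slots = fair_share(100, {"adhoc": 20, "reports": 50, "batch": 200})
```

Here the small ad hoc pool receives its full 20 slots, and the remaining 80 are divided equally between the two pools that asked for more. Short interactive queries are therefore not starved by a large batch job.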

Take-Home Message
For a data warehouse design, consider:
- What kinds of data sources and flow architecture
- What kind of storage architecture
- What kinds of users and tasks
- How to make usage easier
- How to share resources between jobs

End Thank you