Data Storage Infrastructure at Facebook


1 Data Storage Infrastructure at Facebook
CIS 601 Presentation, Spring 2018, Cleveland State University
Yi Dong
Instructor: Dr. Chung

2 Outline
- Strategy for data storage, processing, and log collection
- Data flow from the source to the data warehouse
- Storage systems and optimization
- Data discovery and analysis
- Challenges in resource sharing

3 Facebook's Architecture

4 Facebook's Architecture
- Hadoop
- HBase
- Haystack
- Hive
- MySQL
- Memcached
- PHP
- HipHop compiler
- Scribe
- Thrift

5 Part 1: Strategy for Data Storage, Processing, and Log Collection
- Apache Hadoop
- Apache Hive
- Scribe

6 Hadoop, Why?
- Scalability
  - Able to process multi-petabyte datasets
- Fault tolerance
  - Node failures are expected every day
  - The number of nodes is not constant
- High availability
  - Users can access data from the nearest node
- Cost efficiency
  - Open source
  - Uses commodity hardware as nodes in Hadoop clusters
  - Eliminates dependency on any particular technology

7 Hadoop Architecture
- HDFS (Hadoop Distributed File System)
- MapReduce infrastructure

8 Hive
- SQL-like analysis tool (HiveQL) on top of Hadoop
- Dramatically improves productivity and increases Hadoop usage
- With Hive, users without programming experience can use Hadoop for their work
- Without Hive, a basic data manipulation such as GROUP BY can take more than 100 lines of Java or Python code
- Worse, a programmer without database knowledge is likely to choose a sub-optimal algorithm, often a badly sub-optimal one
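To make the GROUP BY comparison concrete, here is a minimal single-process Python sketch of the map, shuffle, and reduce phases behind a query such as SELECT page, COUNT(*) FROM views GROUP BY page. The table name, column name, and sample rows are hypothetical. A real Hadoop job would also need job configuration, input and output formats, and serialization code, which is where much of the extra length comes from.

```python
from collections import defaultdict


def map_phase(rows):
    """Emit one (key, 1) pair per input row, keyed by the grouping column."""
    for row in rows:
        yield row["page"], 1


def shuffle_phase(pairs):
    """Collect values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values of each key, which yields COUNT(*) per group."""
    return {key: sum(values) for key, values in groups.items()}


def group_by_count(rows):
    """Run the three phases in sequence on an in-memory list of rows."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


# Hypothetical sample rows standing in for a page-view log.
views = [{"page": "home"}, {"page": "profile"}, {"page": "home"}]
counts = group_by_count(views)
print(counts)
```

In HiveQL the same result is a one-line query, and Hive plans the map and reduce stages itself.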

9 Hive Architecture

10 Scribe: Scalable Logging System
- Distributed and scalable logging system
- Combined with HDFS
- Aggregates logs from thousands of web servers

11 Part 2: Data Flow Architecture
- Two sources of data
  - Web servers: log data, copied every 5-15 minutes
  - Federated MySQL: information data, copied daily
- Two different clusters
  - Production Hive-Hadoop cluster
  - Ad-hoc Hive-Hadoop cluster

12 Dealing with Data Delivery Latency
- Even though log data is copied at 5-15 minute intervals, the loader only loads it into Hive native tables at the end of the day
- Solution at Facebook
  - Use Hive's external table feature to create table metadata on the raw HDFS files
  - After the data is loaded into the Hive native table at the end of the day, remove the raw HDFS files from the external table
- New solutions are needed to enable continuous loading of log data
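The external-table workaround can be sketched as a small Python helper that builds the HiveQL statements involved. This is an illustration under assumptions, not Facebook's loader: the table name, columns, HDFS path, and the ds/hr partition scheme are all hypothetical. The statements use standard Hive DDL (CREATE EXTERNAL TABLE, ALTER TABLE ... ADD PARTITION, ALTER TABLE ... DROP PARTITION).

```python
def create_external_table(table, columns, location):
    """Build DDL that lays table metadata over raw files already in HDFS."""
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols}) "
        f"PARTITIONED BY (ds STRING, hr STRING) "
        f"LOCATION '{location}'"
    )


def add_raw_partition(table, ds, hr, location):
    """Expose one freshly copied batch of raw log files to queries."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (ds='{ds}', hr='{hr}') "
        f"LOCATION '{location}/{ds}/{hr}'"
    )


def drop_raw_partition(table, ds, hr):
    """Detach a batch once the end-of-day load has filled the native table."""
    return f"ALTER TABLE {table} DROP IF EXISTS PARTITION (ds='{ds}', hr='{hr}')"


# Hypothetical names and paths, for illustration only.
RAW = "/warehouse/raw/click_log"
statements = [
    create_external_table("click_log_raw", [("user_id", "BIGINT"), ("url", "STRING")], RAW),
    add_raw_partition("click_log_raw", "2018-03-01", "05", RAW),
    drop_raw_partition("click_log_raw", "2018-03-01", "05"),
]
for statement in statements:
    print(statement)
```

Dropping a partition of an external table removes only the metadata, so in this sketch the raw files themselves would still have to be deleted from HDFS in a separate step.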

13 Part 3: Storage Optimization
- All data needs to be compressed to save space
  - Hadoop allows user-specified codecs; Facebook uses the gzip codec and gets a compression factor of 6-7
- HDFS by default keeps 3 copies of the data to prevent data loss
  - Using erasure codes (2 copies of the data and 2 copies of the error correction code), this multiple can be brought down to 2.2
  - Hadoop RAID is used on older data sets, while newer data sets stay replicated 3 ways
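The storage multiples above can be checked with a little arithmetic. The sketch below assumes a stripe of 10 data blocks protected by 1 parity block; the slide does not state the stripe length, and this value is chosen here only because it reproduces the quoted 2.2 figure. The function names are hypothetical.

```python
def effective_replication(stripe_blocks, parity_blocks, data_copies, parity_copies):
    """Blocks stored per logical data block for one erasure-coded stripe."""
    stored = stripe_blocks * data_copies + parity_blocks * parity_copies
    return stored / stripe_blocks


def physical_size(logical_size, compression_factor, replication):
    """Space used on disk after compression and replication are applied."""
    return logical_size / compression_factor * replication


# Assumed stripe layout: 10 data blocks and 1 parity block, 2 copies of each.
raid_multiple = effective_replication(10, 1, 2, 2)   # (10*2 + 1*2) / 10
default_multiple = 3                                 # plain HDFS replication

# 600 units of uncompressed data at a compression factor of 6.
new_data = physical_size(600, 6, default_multiple)
old_data = physical_size(600, 6, raid_multiple)
print(raid_multiple, new_data, old_data)
```

Under these assumptions, moving an older data set from 3-way replication to the erasure-coded layout cuts its footprint by roughly a quarter, on top of the savings from compression.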

14 Part 3: Storage Optimization
- Reduce the memory usage of the HDFS NameNode
  - Trade off latency to reduce memory pressure
  - Implement a file format that reduces the number of map tasks
- Data federation
  - Distribute data based on time: queries that cross a time boundary need more joins
  - Distribute data based on application: some of the common data has to be replicated
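One way to reduce the number of map tasks is to let a single task read several small files. The slide does not describe the file format Facebook implemented, so the following is only a sketch of the general idea: a greedy packer that groups files into splits up to an assumed target size. The function name, file names, and sizes are hypothetical.

```python
def pack_into_splits(files, target_split_size):
    """Greedily group (name, size) pairs so each split stays near the target.

    A file larger than the target still gets a split of its own.
    """
    splits, current, current_size = [], [], 0
    for name, size in files:
        # Close the current split if adding this file would overshoot.
        if current and current_size + size > target_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits


# Hypothetical file sizes in MB, with an assumed 128 MB target split.
files = [("a", 40), ("b", 50), ("c", 30), ("d", 200), ("e", 10)]
splits = pack_into_splits(files, 128)
print(len(files), "files ->", len(splits), "map tasks:", splits)
```

With one map task per split, the five files here need three tasks instead of five. Storing fewer, larger files also means fewer file entries for the NameNode to hold in memory.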

15 Part 4: Data Discovery and Analysis
- Hive
  - Provides immense scalability to non-engineering users, such as business analysts and product managers
- Data discovery
  - Internal tool that enables a wiki approach to metadata creation
  - Tools to extract lineage information from query logs
- Periodic batch jobs
  - For such jobs, inter-job dependencies and the ability to schedule the jobs are critical
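Job dependencies of this kind form a directed acyclic graph, and a valid run order is a topological sort of that graph. The sketch below uses Python's standard graphlib module (Python 3.9 or later). The job names are hypothetical, and this is not a description of Facebook's scheduler.

```python
from graphlib import TopologicalSorter


def schedule_order(dependencies):
    """Return the jobs in an order where every job follows its prerequisites.

    `dependencies` maps each job to the set of jobs it must wait for.
    graphlib raises CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical nightly pipeline.
jobs = {
    "load_logs": set(),
    "load_mysql_dump": set(),
    "daily_summary": {"load_logs", "load_mysql_dump"},
    "weekly_report": {"daily_summary"},
}
order = schedule_order(jobs)
print(order)
```

The two load jobs have no prerequisites, so their relative order is not fixed, but both always precede daily_summary, which in turn precedes weekly_report.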

16 Part 5: Resource Sharing
- Support the coexistence of interactive jobs and batch jobs on the same Hadoop cluster
  - Implement the Hadoop Fair Share Scheduler
- Isolate ad-hoc queries from periodic batch queries
  - Make the scheduler more aware of the system resource usage caused by poorly written ad-hoc queries
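The idea behind fair sharing can be illustrated with a simplified max-min allocation: slots are divided equally among pools, and whatever a small pool does not need is redistributed to the pools that still want more. This is a sketch of the concept only. The real Hadoop Fair Scheduler also supports weights, minimum shares, and preemption, none of which are modeled here, and the pool names and slot counts are hypothetical.

```python
def fair_share(total_slots, demands):
    """Split slots among pools by repeated equal division (max-min fairness).

    `demands` maps pool name to the number of slots it asks for.
    Slots that cannot be divided evenly in the last round stay unassigned.
    """
    allocation = {pool: 0 for pool in demands}
    remaining = total_slots
    unsatisfied = {pool for pool, demand in demands.items() if demand > 0}
    while remaining > 0 and unsatisfied:
        share = remaining // len(unsatisfied)
        if share == 0:
            break
        for pool in sorted(unsatisfied):
            # Never give a pool more than it still needs.
            grant = min(share, demands[pool] - allocation[pool])
            allocation[pool] += grant
            remaining -= grant
        unsatisfied = {p for p in unsatisfied if allocation[p] < demands[p]}
    return allocation


# Hypothetical cluster: 100 slots shared by three pools.
result = fair_share(100, {"adhoc": 20, "batch": 200, "reports": 50})
print(result)
```

Here the ad-hoc pool gets everything it asked for, and the slots it leaves unused are split between the two larger pools, so a heavy batch workload cannot starve interactive queries.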

17 Take-Home Message
For a data warehouse design, consider:
- What kinds of data sources and flow architecture
- What kind of storage architecture
- What kinds of users and tasks
- How to make usage easier
- How to share resources between jobs

18 End
Thank you
