Flat Datacenter Storage. Edmund B. Nightingale, Jeremy Elson, et al. 6.S897


Motivation Imagine a world with flat data storage: simple, centralized, and easy to program. Unfortunately, datacenter networks were historically oversubscribed, so bandwidth was scarce and systems instead moved computation to the data, via programming models like MapReduce, Dryad, etc. These locality constraints hindered efficient resource utilization!

Motivation Data Center Networks are getting faster! New topologies mean networks can support Full Bisection Bandwidth.

Motivation Idea: Design with full bisection bandwidth in mind. All compute nodes can access all storage with equal throughput! Consequence: no need to worry about data locality. FDS read/write performance exceeds 2 GB/s per client, it can recover 92 GB of lost data in 6.2 seconds, and it set a world record for disk-to-disk sorting in 2012.

FDS Design

Design: Blobs and Tracts Data is stored in logical blobs: byte sequences named by a 128-bit globally unique identifier (GUID). Each blob is divided into constant-sized units called tracts, sized so that random and sequential accesses achieve the same throughput. Both tracts and blobs are mutable. Every disk is managed by a tractserver process, which reads and writes the disk directly without a filesystem and caches tract data in memory.
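As a concrete illustration of this addressing scheme, here is a minimal Python sketch that maps a byte offset in a blob to a (tract index, offset-within-tract) pair; the 8 MB tract size and the function name are assumptions for illustration only.

TRACT_SIZE = 8 * 1024 * 1024          # assumed tract size for illustration (8 MB)

def locate_in_blob(byte_offset):
    # A blob is just a sequence of fixed-size tracts, so addressing is pure arithmetic.
    tract_index = byte_offset // TRACT_SIZE
    offset_in_tract = byte_offset % TRACT_SIZE
    return tract_index, offset_in_tract

print(locate_in_blob(100 * 1024 * 1024))   # byte 100 MB falls in tract 12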

Design: System API Reads and writes are not guaranteed to complete in the order they are issued, but each read and write is atomic. The API is non-blocking and responds to the application through a callback. The non-blocking API helps performance: many requests can be issued in parallel, and FDS can pipeline disk reads with network transfers.
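To make the callback style concrete, here is a minimal Python sketch of issuing many tract reads in parallel and handling each completion in a callback; read_tract and on_read_done are hypothetical stand-ins, not the real FDS client library.

import concurrent.futures

def read_tract(blob_guid, tract_index):
    # Stand-in for the real network/disk read; returns the tract's bytes.
    return ("tract %d of blob %x" % (tract_index, blob_guid)).encode()

def on_read_done(future):
    # Callback invoked when a single tract read completes.
    data = future.result()
    print("completed read:", data)

# Issue many reads without blocking, so disk I/O and network transfer can overlap.
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    for i in range(10):
        fut = pool.submit(read_tract, 0x0123456789ABCDEF, i)
        fut.add_done_callback(on_read_done)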

Design: Locating a Tract Tractservers are found deterministically using a Tract Locator Table (TLT), which a centralized metadata server distributes to clients. To read or write tract i in the blob with GUID g: Tract_Locator = (Hash(g) + i) mod TLT_Length. This is deterministic and produces uniform disk utilization. The tract index i is not hashed, so a single blob's tracts use the TLT entries uniformly.
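A minimal sketch of the lookup, assuming a SHA-1 hash over the GUID bytes (the slides do not specify the hash function) and a TLT represented as a list of replica lists:

import hashlib

def tract_locator(tlt, blob_guid, tract_index):
    # Hash only the blob GUID, then add the raw tract index, so one blob's
    # tracts walk the TLT rows uniformly.
    guid_hash = int.from_bytes(hashlib.sha1(blob_guid.to_bytes(16, "big")).digest(), "big")
    row = (guid_hash + tract_index) % len(tlt)
    return row, tlt[row]                 # row index and its replica tractservers

tlt = [["A", "F", "B"], ["B", "C", "L"], ["E", "D", "G"], ["T", "A", "H"]]
print(tract_locator(tlt, blob_guid=0x0123456789ABCDEF0123456789ABCDEF, tract_index=7))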

TLT Example
Row  Version  Replica 1  Replica 2  Replica 3
1    234      A          F          B
2    235      B          C          L
3    567      E          D          G
4    13       T          A          H
5    67       F          B          G
6    123      G          E          B
7    86       D          V          C
8    23       H          E          F

Replication Each TLT entry lists k replicas (k-way replication). Writes go to all k replicas; reads pick one replica at random. Blob metadata updates are serialized by a primary replica and propagated to the secondaries using a two-phase commit protocol.
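A sketch of how a client might use one TLT row under this policy: writes fan out to every replica, reads go to a single randomly chosen replica. The send_write/send_read helpers are hypothetical stand-ins for the client library's RPCs.

import random

def write_tract(replicas, data, send_write):
    # Writes are sent to every replica listed in the TLT row.
    for server in replicas:
        send_write(server, data)

def read_tract(replicas, send_read):
    # Reads are load-balanced by picking a single replica at random.
    return send_read(random.choice(replicas))

# Example with trivial in-memory "servers":
store = {}
write_tract(["A", "F", "B"], b"hello", lambda s, d: store.__setitem__(s, d))
print(read_tract(["A", "F", "B"], lambda s: store[s]))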

Replicated Data Layout With O(n²) TLT entries and two replicas, a single failure recovers quickly: roughly 1/n of the lost data lands on each remaining disk, so recovery is highly parallel. Problem: two failures guarantee data loss! Since every pair of disks appears together in some TLT entry, losing two disks means both replicas are gone for at least one entry. Simple solution: keep O(n²) entries (every possible pair) but use k-way replication with k > 2; the remaining k - 2 replicas are chosen at random.
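A sketch of building such a table for n disks, assuming one row per ordered pair of disks for the first two replicas and k - 2 additional replicas chosen at random; details such as row ordering are simplifications, not the paper's exact construction.

import itertools, random

def build_tlt(disks, k):
    tlt = []
    for a, b in itertools.permutations(disks, 2):    # O(n^2) rows: every pair of disks
        rest = random.sample([d for d in disks if d not in (a, b)], k - 2)
        tlt.append([a, b] + rest)                    # replicas 3..k chosen at random
    return tlt

disks = ["disk%d" % i for i in range(8)]
tlt = build_tlt(disks, k=3)
print(len(tlt), "rows; first row:", tlt[0])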

Design: Metadata Per-blob metadata is stored in a special metadata tract, located through the TLT like any other tract. Blobs are extended through API calls that update the metadata tract; an extend operation plays the same role as record append in GFS.

Design: Dynamic Work Allocation Since data and compute no longer need to be co-located, work can be assigned dynamically and at finer granularity. With FDS, a cluster can centrally schedule work, handing each worker a new unit as it nears completion of its previous one (unlike MapReduce and similar systems, which must account for where data resides when assigning work). This has a significant impact on performance; see the sketch below.
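A minimal sketch of such centralized, fine-grained work allocation: a shared queue of units that any worker pulls from as soon as it is free. Names and structure are illustrative, not the FDS sort application's actual code.

import queue, threading

work = queue.Queue()
for unit in range(100):
    work.put(unit)                       # e.g. one tract or range of tracts per unit

def process(unit):
    pass                                 # placeholder for the real per-unit computation

def worker():
    while True:
        try:
            unit = work.get_nowait()     # grab the next unit only when free
        except queue.Empty:
            return
        process(unit)                    # any worker can process any unit: no locality constraint

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()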

Replication and Failure Recovery

Failure Recovery The TLT carries a version number for each row. On failure:
1. The metadata server detects the failure when a heartbeat message times out.
2. The affected TLT rows are invalidated by incrementing their version numbers.
3. Random tractservers are picked to fill the gaps left in the TLT.
4. The chosen tractservers ACK their new assignments and replicate the lost data.
Clients with stale TLTs request new ones from the metadata server. Writers do not need to wait for re-replication to finish, only for the TLT update.
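A sketch of the metadata server's side of steps 2 and 3, assuming the TLT is held in memory as a list of rows with a version and a replica list; the structure and names are illustrative.

import random

def handle_disk_failure(tlt, failed, all_disks):
    # tlt: list of rows {"version": int, "replicas": [disk, ...]}
    for row in tlt:
        if failed in row["replicas"]:
            row["version"] += 1                               # invalidate the stale row
            candidates = [d for d in all_disks if d not in row["replicas"]]
            replacement = random.choice(candidates)           # random tractserver fills the gap
            row["replicas"][row["replicas"].index(failed)] = replacement
            # The replacement then ACKs the assignment and copies data from surviving replicas.
    return tlt

tlt = [{"version": 234, "replicas": ["A", "F", "B"]},
       {"version": 235, "replicas": ["B", "C", "L"]}]
print(handle_disk_failure(tlt, failed="F", all_disks=list("ABCDEFGHL")))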

Failure Recovery

Fault Recovery Guarantees Weak consistency: similar to GFS, tractservers may be inconsistent during a failure, or if a client fails after writing to only a subset of replicas. Availability: clients only need to wait for an updated TLT. Partition tolerance? Only one metadata server is active at a time to prevent corrupted state, but a partitioned network may mean that clients cannot write to all replicas.

Results: Read + Write Performance

Comparison with GFS The metadata server is simple: it stores only TLTs, not information about the tracts themselves, so tracts can be arbitrarily small. (Google has noted that 64 MB is too large a chunk size for some of its workloads.) The GFS master is also a potential bottleneck as scale increases. Reads of a single file (blob) can be issued with very high throughput: since tracts are spread across many disks, reads proceed in parallel. Anything else?

Takeaways A storage system that takes advantage of new datacenter properties, namely full network bisection bandwidth. Each compute node has equal throughput to each storage node, so there is no need to design for locality. Removing locality concerns yields a simple design and simple failure recovery, plus the ability to schedule jobs at fine granularity without wasting resources. Results show that the system recovers quickly and provides efficient I/O.

Bonus Slides

Cluster Growth Tractservers can be added at runtime:
1. Increment the version number of the affected TLT rows and begin copying data to the new tractserver (the pending phase).
2. When copying finishes, increment the TLT entry's version again, committing the new disk.
While a row is in the pending state, new writes go to the new tractserver as well. If the new server fails, it is expunged and the version is incremented; if an existing tractserver fails, the normal recovery protocol runs.
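A sketch of the pending-then-commit sequence for a single TLT row, with illustrative names; the real protocol also handles concurrent writes and the failure cases noted above.

def copy_tracts(from_replicas, to_server):
    pass                                  # placeholder for the background data transfer

def add_tractserver(row, old_server, new_server):
    # Phase 1 ("pending"): bump the version, record the pending assignment, and copy
    # the row's data to the new server; new writes are mirrored to it as well.
    row["version"] += 1
    row["pending"] = new_server
    copy_tracts(row["replicas"], new_server)
    # Phase 2: when copying finishes, bump the version again to commit the new disk.
    row["version"] += 1
    row["replicas"][row["replicas"].index(old_server)] = new_server
    del row["pending"]
    return row

print(add_tractserver({"version": 10, "replicas": ["A", "B"]}, old_server="B", new_server="Z"))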

Network FDS uses a full bisection bandwidth network, with ECMP to load-balance flows (which stochastically provides bisection bandwidth). Storage nodes are provisioned with network bandwidth equal to their disk bandwidth, and compute nodes with bandwidth equal to their I/O demand. RSS and zero-copy buffers are used to saturate 10 Gbps and 20 Gbps links, respectively; TCP alone is not enough!

Experiments and Results

Testbed Setup 14 racks; full bisection bandwidth network with 5.5 Tbps aggregate bandwidth; BGP for route selection, with an IP subnet per top-of-rack (TOR) switch. Operating cost: $250,000. Heterogeneous environment with up to 256 servers, 2 to 24 cores, 12 to 96 GB RAM, 300 GB 10,000 RPM SAS drives, and 500 GB and 1 TB 7,200 RPM SATA drives.

Results: Recovery