FLAT DATACENTER STORAGE. Paper 3. Presenter: Pratik Bhatt (fx6568)


FDS: Main discussion points. FDS is a cluster storage system that stores giant "blobs" (128-bit ID, multi-megabyte content), with clients and servers connected by a network with high bisection bandwidth. It is built for big-data processing (like MapReduce) on clusters of thousands of computers working in parallel, with compute and storage logically separate. The discussion covers metadata management, networking, read and write performance, failure recovery, and application performance.

FDS

FDS, the storage system, is pretty simple. It's a straight-up blob store, where blobs are arbitrarily large but are read and written in 8MB chunks called tracts. Blobs are identified by a 128-bit GUID, and tracts are 8MB offsets into the blob. Choosing 8MB means the disks get essentially full sequential bandwidth, since seek times are amortized over these long transfers. FDS provides asynchronous APIs for read and write operations, which is pretty much necessary for high-performance parallel I/O. Data is placed on storage nodes (aka tractservers) through a simple hashing scheme. There is no NameNode as in Hadoop. Instead, clients cache a tract locator table (TLT) that lists the tractservers. To look up a tract in a blob, the client hashes the blob GUID, adds the tract number, and takes the result modulo the TLT length to index into the table. Because the GUID is hashed but the tract number is simply added, clients scanning a blob sequentially would otherwise walk through consecutive TLT entries in order and could fall into lockstep with one another. This is countered by taking 20 copies of the list of tractservers, shuffling each copy, and concatenating them. This longer, shuffled table prevents clients from synchronizing into convoys and also does a good job of balancing data distribution across tractservers.

High-level design: lots of clients; lots of storage servers ("tractservers"); the data is partitioned across them; a master ("metadata server") controls the partitioning; replica groups provide reliability.

Why is this high-level design useful? Thousands of disks of space can store one giant blob or many big blobs. Thousands of servers/disks/arms deliver parallel throughput. The cluster can expand over time through reconfiguration. A large pool of storage servers allows instant replacement after a failure, recreating lost copies from replicas.

Data Placement. How do they place blobs on machines? Each blob is broken into 8 MB tracts. A tract locator table ("TLT") is maintained by the metadata server. For blob g, tract i, and TLT size n: slot = (hash(g) + i) mod n. TLT[slot] contains the list of tractservers holding copies of that tract. Clients and servers all have copies of the latest TLT.
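
To make the placement scheme concrete, here is a minimal Python sketch of building the TLT from shuffled copies of the tractserver list and of the (hash(g) + i) mod n lookup described above. The function names, the choice of MD5 as the hash, and the example values are illustrative assumptions, not the paper's implementation.

    import hashlib, random

    def build_tlt(tractservers, copies=20):
        # Concatenate several shuffled copies of the tractserver list so that clients
        # scanning a blob sequentially do not march through the same servers in lockstep.
        tlt = []
        for _ in range(copies):
            perm = list(tractservers)
            random.shuffle(perm)
            tlt.extend(perm)
        return tlt

    def locate(tlt, blob_guid, tract):
        # slot = (hash(g) + i) mod n: the blob GUID is hashed, the tract number is added.
        h = int(hashlib.md5(blob_guid).hexdigest(), 16)
        return tlt[(h + tract) % len(tlt)]

    # Example: the TLT entries used by the first four tracts of one blob.
    tlt = build_tlt(["ts%d" % i for i in range(8)])
    print([locate(tlt, b"\x01" * 16, i) for i in range(4)])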

Metadata Properties. The metadata server is in the critical path only when a client process starts, so its load stays small. The TLT can be cached long-term since it changes only on cluster reconfiguration (not on each read and write), eliminating essentially all traffic to the metadata server under normal conditions. The metadata server stores metadata only about the hardware configuration, not about blobs; because this traffic is low, its implementation is simple and lightweight. Since the TLT contains random permutations of the list of tractservers, sequential reads and writes by independent clients are highly likely to utilize all tractservers uniformly and are unlikely to organize into synchronized convoys.

Per-blob metadata. Each blob has metadata such as its length, stored in the blob's special metadata tract (tract -1). Clients find a blob's metadata on a tractserver using the same TLT used to find regular data. When a blob is created, the tractserver responsible for its metadata tract creates that tract on disk and initializes the blob's size to 0. When a blob is deleted, that tractserver deletes the metadata. Newly created blobs have a length of 0 tracts, so applications must extend a blob before writing past its end. The extend operation is atomic, is safe to execute concurrently with other clients, and returns the new size of the blob as a result of the client's call. A separate API tells the client the blob's current size. Extend operations for a blob are sent to the tractserver that owns that blob's metadata tract. The tractserver serializes them, atomically updates the metadata, and returns the new size to each caller. If all writers follow this pattern, each extend operation hands the caller a range of tracts it may write without risk of conflict.
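
The conflict-free append pattern above can be sketched in a few lines of Python. The blob object and its extend_size/write_tract methods are hypothetical wrappers around the paper's ExtendBlobSize and WriteTract calls, shown only to illustrate the protocol.

    def append_tracts(blob, buffers):
        # Reserve space atomically; the owning metadata tractserver serializes concurrent extends.
        new_size = blob.extend_size(len(buffers))   # hypothetical wrapper over ExtendBlobSize
        first = new_size - len(buffers)             # this caller now owns tracts [first, new_size)
        for k, data in enumerate(buffers):
            blob.write_tract(first + k, data)       # hypothetical wrapper over WriteTract
        return first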

Dynamic work allocation. In many systems, if a node falls behind, the only recovery option is to restart its computation elsewhere, so the straggler period can represent a great loss in efficiency: most resources sit idle while waiting for a slow task to complete. In FDS, since storage and compute are no longer co-located, the assignment of work to workers can be done dynamically, at fine granularity, during task execution. The best practice for FDS applications is to centrally give a small unit of work to each worker as it nears completion of its previous unit. This self-clocking system ensures that the maximum dispersion in completion times across the cluster is only the time required for the slowest worker to complete a single unit.
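
A minimal Python sketch of this self-clocking, centrally dispatched work queue (my own illustration, not FDS code; the unit count, worker count, and process() body are placeholders):

    import queue, threading

    work = queue.Queue()
    for unit in range(10000):                 # many small work units, not one big slice per worker
        work.put(unit)

    def process(unit):
        pass                                  # placeholder for the real per-unit computation

    def worker():
        while True:
            try:
                unit = work.get_nowait()      # pull the next unit as soon as the previous one finishes
            except queue.Empty:
                return                        # the whole job ends within ~one unit of the slowest worker
            process(unit)

    threads = [threading.Thread(target=worker) for _ in range(64)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()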

REPLICATION IN FDS. To improve durability and availability, FDS supports higher levels of replication. When a disk fails, redundant copies of the lost data are used to restore the data to full replication. What happens to replicas when an application writes and reads? When an application writes a tract, the client library finds the appropriate row of the TLT and sends the write to every tractserver it contains; applications are notified that their writes have completed only after the client library receives write acknowledgments from all replicas. Reads select a single tractserver from the row at random. Replication also requires changes to CreateBlob, ExtendBlobSize, and DeleteBlob: when a tractserver receives one of these operations, it executes a two-phase commit with the other replicas, and the primary replica does not commit the change until all other replicas have completed successfully. FDS also supports per-blob variable replication.
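
A minimal Python sketch of replicated tract I/O as described above. The locator is the same as in the data-placement sketch, each TLT row is assumed to be a list of k replica tractservers, and the per-tractserver write/read calls are hypothetical RPC stubs with error handling omitted.

    import hashlib, random

    def tlt_row(tlt, blob_guid, tract):
        # Same locator as before; the row lists the k tractservers holding this tract.
        h = int(hashlib.md5(blob_guid).hexdigest(), 16)
        return tlt[(h + tract) % len(tlt)]

    def write_tract(tlt, blob_guid, tract, data):
        row = tlt_row(tlt, blob_guid, tract)
        acks = [ts.write(blob_guid, tract, data) for ts in row]   # send the write to every replica
        return all(acks)                                          # complete only after all replicas ack

    def read_tract(tlt, blob_guid, tract):
        row = tlt_row(tlt, blob_guid, tract)
        return random.choice(row).read(blob_guid, tract)          # reads pick one replica at random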

Comparison with Hadoop. In FDS, data is read from and written to disk only once. Hadoop's intermediate data is sorted and spilled to disk by the mapper, while FDS keeps it entirely in memory. FDS also streams and sorts data during the read phase, while Hadoop has a barrier between its map and reduce phases. Hadoop can do some pre-aggregation in the mapper with a combiner, but not as flexibly as in FDS, not on the reduce side, and not without occupying an entire task slot. FDS has better task scheduling, with a dynamic work queue and dynamic work-item sizes. FDS uses a single process for both the read and write phases, so the JVM startup cost is gone and there is no unnecessary movement of data between address spaces.

Questions and Answers 1. Consider a centralized file server in a small computer science department. Data stored by any computer can be retrieved by any other. This conceptual simplicity makes it easy to use: computation can happen on any computer, even in parallel, without regard to first putting data in the right place. Is GFS a centralized file system? To achieve good performance for an I/O-intensive program running on GFS, how does data placement affect its performance? [Hint: consider the case where processes of the program run on the chunk servers.] Ans. GFS has centralized management: to simplify the design and increase its reliability and flexibility, it uses a single master. But in general GFS is a distributed file system: the data itself is distributed as chunks, and the master does not store all the data on a single chunkserver; rather it spreads it out to balance load and disk space across chunkservers. So in the scenario of the question GFS can be viewed as having a centralized control plane, while the file data itself is distributed. In particular, a centralized master makes it much easier to implement sophisticated chunk placement and replication policies, since the master already has most of the relevant information and controls how it changes. Fault tolerance is achieved by keeping the master state small and fully replicated on other machines, and scalability and high availability (for reads) are provided by the shadow-master mechanism. For an I/O-intensive program, data placement matters a great deal: if the program's processes run on the chunkservers that hold their input chunks, reads are local and avoid network transfers, so performance depends on co-locating computation with data.

2. The root of this cascade of consequences was the locality constraint, itself rooted in the datacenter bandwidth shortage. Show example consequences of relying on locality in a program's execution. Why does sufficient I/O bandwidth help remove the constraint? Ans. Stragglers are one example. Locality constraints can sometimes even hinder efficient resource utilization: if data is singly replicated, a single unexpectedly slow machine can preclude an entire job's timely completion even while most of the resources are idle. The preference for local disks also serves as a barrier to quickly re-tasking nodes: since a CPU is only useful for processing the data resident there, re-tasking requires expensive data movement. If the network is not oversubscribed, it can provide full bisection bandwidth between the disks, so there is no distinction between local and remote disks; constraints like re-tasking then no longer require expensive data movement.

3. In FDS, data is logically stored in blobs.... Reads from and writes to a blob are done in units called tracts. What are blobs and tracts? Are they of constant sizes? Ans. Data is stored in logical blobs: byte sequences named by a 128-bit globally unique identifier (GUID). A blob can be any length up to the system's storage capacity, and both blobs and tracts are mutable, so blobs are not of constant size. Each blob is divided into constant-sized units called tracts, which are sized so that random and sequential accesses achieve roughly the same throughput. Every disk is managed by a tractserver process, which reads and writes the disk directly without a filesystem and caches tract data in memory.

4. In our cluster, tracts are 8MB. Why is a tract in FDS sized this large? Ans. Tracts are sized such that random and sequential access achieve nearly the same throughput; in their cluster, tracts are 8MB. The tract size is set when the cluster is created, based on the cluster hardware. For example, if flash were used instead of disks, the tract size could be made far smaller (e.g., 64kB). Every disk is managed by a process called a tractserver. 1) If the tract size were small, the client would need to issue many more writes for a large blob and per-operation seek overhead would dominate, making the process slow; a larger tract size reduces the number of writes and amortizes the seek. 2) The researchers report experiments on 10,000 rpm disks showing that disk throughput is good when the tract size is 8MB.
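
A back-of-the-envelope check of why 8MB amortizes the seek, using illustrative numbers (about 10 ms of seek plus rotational delay and about 100 MB/s of sequential transfer; these figures are assumptions, not from the paper):

    transfer time for 8 MB   ~ 8 MB / 100 MB/s = 80 ms
    efficiency at 8 MB       ~ 80 / (80 + 10)  ~ 89% of sequential bandwidth
    efficiency at 64 kB      ~ 0.64 / (0.64 + 10) ~ 6% of sequential bandwidth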

5. Tractservers do not use a file system. Explain this design choice. Ans. FDS is a storage system, not a file system. Tractservers manage raw disks with no file system on them: blob data is laid out on the disk as tracts, and tracts are addressed by number. This keeps file-system overhead off the data path. Clients communicate with FDS through an API that abstracts away the complexity of the messaging layer; tractservers and their network protocol are not exposed directly to FDS applications. Instead, these details are hidden in a client library with a narrow and straightforward interface.

6. Figure 1: FDS API. Read the API. Does FDS support file operations? Is a blob similar to a file in a (distributed) file system? Ans. A blob is not quite a file. Blobs typically hold arbitrary binary data, such as images, audio, or other large objects, and in FDS a blob is simply a mutable byte sequence named by a GUID, with no directory hierarchy or file names. FDS does support file-like operations such as creating, deleting, opening, and closing blobs, as well as reading and writing individual tracts, so a blob plays a role similar to a file even though FDS is not a file system.

7. FDS uses a metadata server, but its role during normal operations is simple and limited: What are the drawbacks of using a centralized metadata server? How does FDS address the issue? Ans. A centralized metadata server can become a bottleneck and a single point of failure if every read and write must consult it. In FDS, the metadata server's role during normal operations is simple and limited: it collects a list of the system's active tractservers and distributes it to clients as the TLT; with k-way replication, each TLT entry lists k tractservers. To read or write tract number i from a blob with GUID g, clients compute Tract_Locator = (Hash(g) + i) mod TLT_Length, which replaces a large per-tract mapping table. When the system is initialized, tractservers locally store their positions in the TLT, so the metadata server does not need to store durable state, which simplifies its implementation. In case of a metadata server failure, the TLT is reconstructed by collecting the table assignments from each tractserver, so no data is lost.
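
A minimal sketch, under my own assumptions, of how the TLT could be reconstructed after a metadata-server failure by collecting each tractserver's locally stored slot assignments; assigned_slots() is a hypothetical RPC, not part of the paper's API.

    def rebuild_tlt(tractservers):
        slots = {}
        for ts in tractservers:                    # ask every live tractserver
            for slot in ts.assigned_slots():       # the TLT positions it recorded at initialization
                slots.setdefault(slot, []).append(ts.name)
        n = max(slots) + 1 if slots else 0
        return [sorted(slots.get(i, [])) for i in range(n)]   # row i = replica list for locator i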

8. How does FDS locate the tractserver that stores a particular tract of a given blob? Why does FDS first identify a tract locator (an index into an entry of the tract locator table) and then use the entry to find the tractserver, rather than directly identifying a tractserver with a hash function and no table? Ans. Tract_Locator = (Hash(g) + i) mod TLT_Length, and the TLT entry at that index lists the tractservers holding the tract. If the hash mapped (g, i) directly to a single tractserver, there would be no place to record the replicas: with only one (original) copy of the data, anything going wrong with that data or disk would make the lost data unrecoverable. The level of indirection through the table lets each locator entry name several tractservers and be updated when servers fail or are added.

9. To be clear, the TLT does not contain complete information about the location of individual tracts in the system. And in the GFS paper: The master maintains less than 64 bytes of metadata for each 64 MB chunk. Compare the TLT with GFS's full chunk-to-chunkserver mapping table in the context of efficiency, scalability, and flexibility. [Hint: It is not modified by tract reads and writes. Its size in a singly-replicated system is proportional to the number of tractservers in the system.] Ans. In GFS, chunks are large, so once a client has asked the master for a chunk's location it does not need to ask again for further accesses to that chunk; the explicit mapping also makes data access and placement very flexible, since the master can place each chunk wherever it likes. But the single master maintains per-chunk state and can still become a potential bottleneck. In FDS, the metadata server's role during normal operations is simple and limited: it collects a list of the system's active tractservers and distributes it to clients as the tract locator table (TLT). The TLT is not modified by tract reads and writes, and its size in a singly-replicated system is proportional to the number of tractservers rather than the number of tracts, so it scales well and produces essentially no metadata traffic in normal operation. When the system is initialized, tractservers locally store their positions in the TLT, so in case of a metadata server failure the TLT is reconstructed by collecting the table assignments from each tractserver. So scalability and efficiency are very good, at the cost of some placement flexibility, since placement follows the hash-based locator rather than being chosen per tract.

10. In our 1,000 disk cluster, FDS recovers 92GB lost from a failed disk in 6.2 seconds. What is the normal throughput of a hard disk? What is the throughput of this recovery? How can this be possible? [Hint: Describe the procedure of recovering from a dead tractserver to answer this question. See Figure 2 and read Section 3.3] Ans. The normal throughput of a hard disk is roughly 80-100 MB/s. The recovery throughput is approximately 15 GB/s, far beyond any single disk. This is possible because of replication and massive parallelism: the failed disk's tracts have replicas scattered across essentially all the other disks, so once the metadata server detects the failure and updates the TLT, every remaining tractserver re-replicates a small portion of the lost data, and hundreds of disks read and write recovery data at the same time.
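
A quick check of that figure, assuming roughly 100 MB/s of sequential throughput per disk (an illustrative number, consistent with the answer above):

    recovery throughput  ~ 92 GB / 6.2 s ~ 15 GB/s
    disks needed         ~ 15 GB/s / 0.1 GB/s ~ 150 disks writing (and about as many reading) in parallel

which is well within a 1,000-disk cluster.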