Data Distribution and Management in Distributed File System (11/16/2010)


Erin Brady and Shantonu Hossain

Outline
- What are the challenges of a Distributed File System (DFS)?
- Ceph: a scalable, high-performance DFS
- Data distribution and placement problem
  - How the resource distribution problem is handled in Ceph: CRUSH (Controlled Replication Under Scalable Hashing)
  - Contrast with other DFSs
- Data replication problem
  - How replication is handled in Ceph
  - Contrast with other DFSs
- Summary

Challenges of DFS
- Transparency: the user has the impression of a single, global file system
- Scalable performance: no degradation of performance as the number of users and the volume of data increase
- Reliability and consistency: users can access the same file system from different locations at the same time
- Availability: users can access the file system at any time
- Fault tolerance: the system can identify and recover from failures
- Data replication

CEPH: A SCALABLE, HIGH-PERFORMANCE DFS

Overview
- Open-source, petabyte-scale distributed file system
- The name Ceph is derived from cephalopods, the class that includes the octopus, which aptly captures the parallel behavior of a DFS
- Initially proposed by Sage Weil in his PhD dissertation at the University of California, Santa Cruz
- Merged into the Linux kernel (since 2.6.34), March 2010
- Designed to provide seamless scaling and massive amounts of storage while still ensuring:
  - strong reliability
  - excellent I/O performance
  - scalable metadata management, supporting more than 250,000 metadata operations/sec under a variety of workloads

Architecture
- Clients: users of the data
- Metadata Server Cluster (MDS): namespace management, metadata operations (open, rename, etc.), security enforcement
- Object Storage Cluster (OSD): stores all data and metadata, organized into flexible-sized containers called objects

Key Design Goals
- The system is inherently dynamic:
  - Decouples data and metadata
  - Eliminates object lists for naming and lookup via a hash-like distribution function, CRUSH (Controlled Replication Under Scalable Hashing)
  - Delegates responsibility for data migration, replication, failure detection, and recovery to the OSD cluster
- Node failures are the norm, rather than an exception:
  - Stores data on up to 10,000 nodes
  - Changes in the storage cluster size trigger automatic (and fast) failure recovery and rebalancing of data with no interruption of service
- The character of workloads constantly shifts over time:
  - As the size and popularity of parts of the file system hierarchy change, the hierarchy is dynamically redistributed over hundreds of MDSs by Dynamic Subtree Partitioning, with near-linear scalability
- The system is inevitably built incrementally:
  - The file system can be seamlessly expanded by simply adding storage nodes (OSDs)
  - Data is proactively migrated onto new devices to maintain a balanced distribution
  - Utilizes all available disk bandwidth and avoids data hot spots

Client Operations: File I/O
- The client sends an 'open for read' request to the MDS cluster
- The MDS translates the file name to a file inode and returns it with other metadata information
- The client then calculates the names and locations of the objects and reads the data from the corresponding OSDs

DATA PLACEMENT AND DISTRIBUTION

Data Placement and Distribution
- Files are striped across many objects, grouped into placement groups (PGs), and distributed to OSDs
- First, objects are mapped into placement groups (PGs) with a hash function (on the order of 100 PGs per OSD)
- Then object replicas are assigned to OSDs using CRUSH, a globally known mapping function

What is CRUSH?
- A pseudo-random data distribution function that maps each PG to an ordered list of OSDs:
  f(pgid) = list of OSDs  ->  location transparent
- Anyone (client, OSD, MDS) can calculate the location of any object; no per-file or per-object directory is needed
- Small changes in the storage cluster have little impact on the existing PG mapping -> minimizes data migration
  (a code sketch of this two-step mapping follows below)
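Below is a minimal Python sketch of the two-step mapping described above. It assumes a plain SHA-1 hash in place of Ceph's real functions; the names (object_to_pg, crush) and constants (NUM_OSDS, NUM_PGS) are illustrative only, and the crush() placeholder shows just the contract of the function (a deterministic pgid -> ordered OSD list), not the actual CRUSH algorithm.

```python
import hashlib

NUM_OSDS = 1000                     # hypothetical cluster size
NUM_PGS = 100 * NUM_OSDS            # on the order of 100 PGs per OSD

def object_to_pg(object_name: str) -> int:
    """Step 1: hash the object name into a placement group (PG)."""
    digest = hashlib.sha1(object_name.encode()).hexdigest()
    return int(digest, 16) % NUM_PGS

def crush(pgid: int, replicas: int = 3) -> list:
    """Step 2 (stand-in for CRUSH): deterministically map a PG to an ordered
    list of distinct OSDs.  Real CRUSH walks a weighted, hierarchical cluster
    map; this placeholder only shows the contract f(pgid) -> list of OSDs,
    computable by any client, OSD, or MDS without a directory lookup."""
    osds, r = [], 0
    while len(osds) < replicas:
        h = hashlib.sha1(f"{pgid}:{r}".encode()).hexdigest()
        candidate = int(h, 16) % NUM_OSDS
        if candidate not in osds:
            osds.append(candidate)
        r += 1
    return osds

pgid = object_to_pg("inode1234.00000003")   # object name = inode + stripe index
print(pgid, crush(pgid))                    # primary OSD is the first in the list
```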

How Does CRUSH Work?
- Relies on three elements:
  - Placement Group ID (PGID)
  - Cluster map: a hierarchical description of the devices comprising the storage cluster (devices and buckets)
  - Placement rules: how many replica targets are chosen and what restrictions are imposed

Hierarchical Cluster Map
- Storage devices are assigned weights to control the amount of data they are responsible for storing
- Data is distributed uniformly among weighted devices
- Buckets can be composed arbitrarily to construct a hierarchy of available storage
- Data is placed in the hierarchy by recursively selecting nested bucket items via a pseudo-random, hash-like function (a sketch of this recursive selection follows at the end of this section)

Replica Placement
- A rule consists of a sequence of operations applied to the hierarchy
- Separates object replicas across different failure domains while still maintaining the desired distribution:
  - physical proximity
  - shared power source
  - shared network
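The sketch below illustrates, under simplifying assumptions, the idea of recursively descending a weighted bucket hierarchy with a pseudo-random hash. The cluster map, bucket names, and the select() function are hypothetical; they do not reproduce CRUSH's actual bucket algorithms or its collision and failure handling.

```python
import hashlib

def hash_weight(key: str) -> float:
    """Deterministic pseudo-random value in [0, 1) derived from key."""
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return (h % 10**8) / 10**8

# Hypothetical cluster map: rows contain cabinets, cabinets contain devices.
cluster_map = {
    "root": {"row1": 8.0, "row2": 8.0},
    "row1": {"cab11": 4.0, "cab12": 4.0},
    "row2": {"cab21": 4.0, "cab22": 4.0},
    "cab11": {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 2.0},
    "cab12": {"osd.3": 1.0, "osd.4": 1.0},
    "cab21": {"osd.5": 1.0, "osd.6": 1.0},
    "cab22": {"osd.7": 1.0, "osd.8": 1.0},
}

def select(bucket: str, pgid: int, replica: int) -> str:
    """Recursively descend the hierarchy, choosing one child per level in
    proportion to its weight, seeded by (pgid, replica) so the choice is
    deterministic and different replicas take different paths."""
    children = cluster_map.get(bucket)
    if not children:                      # reached a leaf device
        return bucket
    total = sum(children.values())
    point = hash_weight(f"{bucket}:{pgid}:{replica}") * total
    for child, weight in children.items():
        if point < weight:
            return select(child, pgid, replica)
        point -= weight
    return select(next(iter(children)), pgid, replica)

replicas = [select("root", pgid=42, replica=r) for r in range(3)]
print(replicas)   # real CRUSH also rejects collisions, failed, and overloaded devices
```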

COMPARISON WITH OTHER DFS

Metadata Management: Ceph vs. GFS
- Ceph: metadata is separated from data and dynamically distributed
- GFS: centralized metadata server; chunk servers are distributed

Access to Shared Devices: Ceph vs. GPFS
- Ceph: asymmetric access to shared devices; object based
- GPFS: symmetric access to shared block-level devices; block based

Data Distribution: Ceph vs. GPFS
- Ceph: a deterministic, pseudo-random, hash-like function distributes data uniformly among OSDs; relies on a compact cluster description to place data on new storage targets without consulting a central allocator
- GPFS: large files are divided into equal-sized blocks, and consecutive blocks are placed on different disks in round-robin fashion; lookup is performed via a metadata directory
  (a toy contrast of the two placement styles follows below)
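As a toy contrast of the two placement styles, the sketch below uses made-up constants and function names; real GPFS consults block allocation metadata rather than a pure function, and real Ceph goes object -> PG -> CRUSH -> OSDs as sketched earlier.

```python
import hashlib

NUM_DISKS = 8   # hypothetical disk count for the GPFS-style example

def gpfs_block_to_disk(block_index: int) -> int:
    """GPFS-style striping (simplified): consecutive equal-sized blocks of a
    large file land on different disks in round-robin order."""
    return block_index % NUM_DISKS

def ceph_object_to_osd(object_name: str, num_osds: int = 8) -> int:
    """Ceph-style placement (collapsed to one step): a deterministic hash of
    the object name picks the target, so no central allocator is consulted."""
    return int(hashlib.sha1(object_name.encode()).hexdigest(), 16) % num_osds

print([gpfs_block_to_disk(i) for i in range(10)])            # 0,1,2,...,7,0,1
print([ceph_object_to_osd(f"file.{i}") for i in range(10)])  # pseudo-random spread
```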

DATA REPLICATION

What is Replication?
- Replicating data in a distributed system means hosting copies of a subset of the data in multiple locations
- Data can be replicated to multiple sites and accessed directly from a replica site

Benefits of Replication
- Data is available and reliable:
  - Can access data even if one site is unavailable
  - Can access data that is in use at another location
- Data download speeds can be improved:
  - Wide-area latency is improved with replica sites
  - Maximizes the bandwidth between replica sites

Drawbacks of Replication
- Large-scale data transfer: time and bandwidth requirements are potentially high
- Data must be validated once it is replicated
- Replica management

Replica Management
- Replicas can be stored on different storage devices
- Need to determine the location of the metadata: will there be one universal structure, or will the metadata be individualized for each replica?
- Updates or deletions in the original data set must be propagated. When the original data is changed, we can use:
  - Push notifications: explicitly contact each site holding a replica and send it the updated data set
  - Pull notifications: each site subscribes to the data sets it is hosting and is notified when changes are registered
- Versioning is possible if a site chooses not to update; version numbers must be recorded so that the most current data can be found if desired
  (a toy sketch of push vs. pull propagation with versioning follows below)
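As a toy illustration of the push vs. pull propagation and versioning described above (not any particular system's mechanism), here is a short Python sketch; all class and method names are hypothetical.

```python
class ReplicaSite:
    """A site hosting a copy of the data set."""
    def __init__(self, name):
        self.name = name
        self.data, self.version = None, 0
        self.pending = False            # set when a pull-style change notice arrives

    def apply(self, data, version):
        if version > self.version:      # versioning: never overwrite newer data
            self.data, self.version = data, version

class Origin:
    """The site holding the original data set."""
    def __init__(self):
        self.data, self.version = None, 0
        self.push_replicas = []         # sites contacted explicitly (push)
        self.subscribers = []           # sites that subscribed (pull)

    def update(self, data):
        self.version += 1
        self.data = data
        for site in self.push_replicas:  # push: send the updated data set now
            site.apply(data, self.version)
        for site in self.subscribers:    # pull: only flag that a change exists
            site.pending = True

    def sync(self, site):
        """Called when a subscribed site decides to fetch the latest version."""
        site.apply(self.data, self.version)
        site.pending = False

# Usage: one pushed replica, one pull subscriber that may lag behind.
origin, a, b = Origin(), ReplicaSite("site-a"), ReplicaSite("site-b")
origin.push_replicas.append(a)
origin.subscribers.append(b)
origin.update("dataset v1")
print(a.version, b.version, b.pending)  # 1 0 True  (b has not pulled yet)
origin.sync(b)
print(b.version, b.pending)             # 1 False
```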

Ceph Data Placement/Replication
- Assume failures will be likely: petabyte or exabyte scale requires many OSDs to be in use
- Data is replicated by placement groups (PGs), each mapped to an ordered list of n OSDs (n-way replication) by the placement rules
- New or updated data is written to the first non-failed OSD in the list, called the "primary" OSD
- Read requests are sent to the primary OSD

Ceph Replication
- When new data is written to the primary OSD of a list (see the sketch after this section):
  - Assign a version number
  - Forward the write operation to the replicas and wait for responses
  - Each replica acknowledges receiving the update
  - The primary applies the write operation and sends an ack to the client
  - Once all data has been written to disk, the primary sends a commit to the client
- All the bandwidth required for replication is on the network between OSDs
- The replication structure separates data safety from synchronization:
  - Synchronization updates are low latency: once the update has been applied to all the replicas, the client is notified
  - Data safety semantics are well defined: once the updates have been written to disk, the client is notified of the commit

Ceph Failure Detection
- OSDs can sometimes report on their own that they have failed
- For OSDs that cannot notify others, replication traffic serves to monitor the state of all OSDs in a list:
  - Occasionally ping neighbors to check their availability
  - If there is no response, mark the OSD as "down" and skip it in the list
  - After some time, mark the OSD as "out" and replace it with another OSD, re-replicating all of its data
- If an OSD is marked as down or out, all of its primary responsibilities pass to the next OSD in the list
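The following Python sketch walks through the write path just described: the first non-failed OSD in the PG's list acts as primary, a version number is assigned, the write is forwarded to the replicas, the client is acked once all replicas have applied the update, and a commit follows once the data reaches disk. It is a simplified model with synchronous calls standing in for messages; the class and method names are illustrative, not Ceph's code.

```python
class OSD:
    def __init__(self, osd_id):
        self.osd_id = osd_id
        self.memory = {}      # updates applied (visible) but not yet durable
        self.disk = {}        # updates flushed to stable storage
        self.failed = False

    def apply(self, obj, data, version):
        self.memory[obj] = (data, version)     # replica acks by returning

    def flush(self, obj):
        self.disk[obj] = self.memory[obj]

class Client:
    def notify(self, kind, obj, version):
        print(f"{kind}: {obj} v{version}")

def write(client, obj, data, pg_osds):
    """Primary-copy write path from the slide (failed OSDs are skipped,
    not recovered, in this sketch)."""
    live = [o for o in pg_osds if not o.failed]
    primary, replicas = live[0], live[1:]      # first non-failed OSD = primary

    version = max((v for _, v in primary.memory.values()), default=0) + 1
    for r in replicas:                         # forward the write, wait for acks
        r.apply(obj, data, version)
    primary.apply(obj, data, version)          # primary applies after replica acks
    client.notify("ack", obj, version)         # sync point: update is visible

    for osd in [primary] + replicas:           # later, once data reaches disk...
        osd.flush(obj)
    client.notify("commit", obj, version)      # safety point: update is durable

pg = [OSD(i) for i in (5, 9, 12)]              # ordered list from CRUSH (hypothetical)
pg[0].failed = True                            # primary duties pass to the next OSD
write(Client(), "obj.42", "hello", pg)
```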

GFS Replication
- Replication is done per chunk; chunk replicas are distributed across chunkservers and racks
- Replication across machines:
  - Handles disk/machine failures
  - Maximizes network bandwidth utilization
- Replication across racks:
  - Handles rack damage
  - Exploits the aggregate bandwidth of all the racks
  - Writes must cross racks, but the system is mostly read, so this is still beneficial
- Similar to Ceph:
  - Separation of control flow and data flow
  - The primary replica forwards write requests to the other replicas

GFS Replication Policies
- Chunk creation (see the sketch after this section):
  - Put replicas on chunkservers with below-average disk space utilization
  - Ensure each chunkserver has only a few new chunks at a time, to prevent excessive simultaneous write operations
  - Choose chunkservers that are spread across racks
- Chunk re-replication:
  - Handles corruption, loss of a disk, or an increase in the replication goal
  - Same placement policies, but cloning bandwidth is throttled
- Chunk rebalancing:
  - Move replicas to balance disk space and load
  - Allows a new chunkserver to be filled gradually
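Below is a hedged Python sketch of the chunk-creation heuristics listed above: prefer chunkservers with below-average disk utilization, cap the number of recent creations per server, and spread replicas across racks. The data structures and thresholds (e.g. max_recent) are assumptions for illustration, not GFS's implementation.

```python
from dataclasses import dataclass

@dataclass
class Chunkserver:
    name: str
    rack: str
    disk_utilization: float      # fraction of disk space in use (0.0 - 1.0)
    recent_creations: int = 0    # new chunks placed here recently

def place_new_chunk(servers, replicas=3, max_recent=2):
    """Choose chunkservers for a new chunk, following the slide's heuristics."""
    avg_util = sum(s.disk_utilization for s in servers) / len(servers)
    candidates = sorted(
        (s for s in servers
         if s.disk_utilization <= avg_util        # below-average utilization
         and s.recent_creations < max_recent),    # avoid write hot spots
        key=lambda s: s.disk_utilization,
    )
    chosen, racks_used = [], set()
    for s in candidates:                          # first pass: one replica per rack
        if s.rack not in racks_used:
            chosen.append(s)
            racks_used.add(s.rack)
        if len(chosen) == replicas:
            break
    for s in candidates:                          # second pass: fill up if needed
        if len(chosen) == replicas:
            break
        if s not in chosen:
            chosen.append(s)
    for s in chosen:
        s.recent_creations += 1
    return chosen

servers = [
    Chunkserver("cs1", "rackA", 0.40), Chunkserver("cs2", "rackA", 0.55),
    Chunkserver("cs3", "rackB", 0.35), Chunkserver("cs4", "rackC", 0.70),
    Chunkserver("cs5", "rackC", 0.45),
]
print([s.name for s in place_new_chunk(servers)])   # ['cs3', 'cs1', 'cs5']
```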

NFS Replication
- NFS version 3 has no replication, resulting in a single point of failure

AFS Replication
- Pessimistic replication: read-only replication allows higher availability and more load balancing
- Replicates executables/system files from the upper levels of Vice
- Read-only replication helps administrators manage the system:
  - For a collection of servers hosting the same read-only volumes, any one can be added to or removed from service without affecting the others
  - Increased availability/serviceability
- Replication is performed at the volume level
- Volume cloning is atomic, so consistency of files within a read-only volume is guaranteed
- Updates are directed to a main server and then asynchronously propagated to the read-only replicas; consistency of replica volumes is not guaranteed

CODA Replication
- Optimistic replication: all replicas are read-write
- Conflicts: diverging copies of the same files/directories
  - Local/global conflict: local updates can clash with other local updates while they are being uploaded, preventing reintegration; solved by application-specific policies
  - Server/server conflict: updates might not reach all servers simultaneously, leaving some servers with different versions and preventing replication; solved by versioning/resolution

Review: Comparison of Policies
- Ceph: write one, read one; the client is notified when all replicas have been written
- AFS: pessimistic, write one, read all; asynchronous propagation, so consistency of replicas is not guaranteed
- CODA: optimistic, write all, read all; consistency is not guaranteed between replicas
- GFS: write one, read all; the client is notified when all replicas have been written

GPFS
- Parallel, shared-disk file system
- Centralized management: conflicting operations are forwarded to a designated node
- Distributed locking: a read/write lock is acquired to synchronize access

GPFS: Parallelism in DFS
- The first writer to access a file is given a byte-range token for the whole file (offsets 0 to infinity)
- A second writer sends a revoke request to the first writer with its desired write range:
  - If the file has been closed by the first writer, the second is given the full token
  - If the file is still open on the first writer, part of the byte-range token is handed off (where o1 and o2 are the offsets at which the first and second writers are writing):
    - o2 > o1: the second writer gets the token for o2 to infinity
    - o2 < o1: the second writer gets the token for 0 to o1

References
All photos included in this presentation are taken from the papers cited below. This material is intended for the sole purpose of instruction in operating systems at the University of Rochester. All copyrighted materials belong to their original owner(s).

- Ceph website, http://ceph.newdream.net
- Sage Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, Carlos Maltzahn, "Ceph: A Scalable, High-Performance Distributed File System," Proceedings of the 7th Conference on Operating Systems Design and Implementation (OSDI '06), November 2006.
- Sage Weil, Scott A. Brandt, Ethan L. Miller, Carlos Maltzahn, "CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data," Proceedings of SC '06, November 2006.
- M. Tim Jones, "Ceph: A Linux petabyte-scale distributed file system," IBM developerWorks Linux Technical Library, http://www.ibm.com/developerworks/linux/library/l-ceph/
- A. Chervenak, R. Schuler, C. Kesselman, S. Koranda, B. Moe, "Wide Area Data Replication for Scientific Collaborations," Proceedings of the 6th IEEE/ACM International Workshop on Grid Computing (Grid 2005), November 2005.
- Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, "The Google File System," Proceedings of the 19th ACM Symposium on Operating Systems Principles, October 19-22, 2003, Bolton Landing, NY, USA.
- Dennis Geels, "Data Replication in OceanStore," U.C. Berkeley Master's Report and Technical Report UCB//CSD-02-1217, November 2002.
- J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satyanarayanan, R. Sidebotham, M. West, "Scale and Performance in a Distributed File System," ACM Transactions on Computer Systems, 6(1):51-81, February 1988.
- Mahadev Satyanarayanan, James J. Kistler, Puneet Kumar, Maria E. Okasaki, Ellen H. Siegel, David C. Steere, "Coda: A Highly Available File System for a Distributed Workstation Environment," IEEE Transactions on Computers, 39(4):447-459, April 1990.