Data Distribution and Management in Distributed File System (11/16/2010)


Erin Brady and Shantonu Hossain

Outline
- What are the challenges of a Distributed File System (DFS)?
- Ceph: a scalable, high-performance DFS
- Data distribution and placement problem
  - How the resource distribution problem is handled in Ceph: CRUSH (Controlled Replication Under Scalable Hashing)
  - Contrast with other DFSs
- Data replication problem
  - How replication is handled in Ceph
  - Contrast with other DFSs
- Summary

Challenges of DFS
- Transparency: the user has the impression of a single, global file system
- Scalable performance: no degradation of performance as the number of users and the volume of data increase
- Reliability and consistency: users can access the same file system from different locations at the same time
- Availability: users can access the file system at any time
- Fault tolerance: the system can identify and recover from failures
- Data replication

CEPH: A SCALABLE, HIGH-PERFORMANCE DFS

Overview
- Open-source, petabyte-scale distributed file system
- The name Ceph is derived from cephalopods, the class that includes the octopus, which aptly captures the parallel behavior of a DFS
- Initially proposed by Sage Weil in his PhD dissertation at the University of California, Santa Cruz
- Merged into the Linux kernel (since 2.6.34), March 2010
- Designed to provide seamless scaling and massive amounts of storage while still ensuring:
  - strong reliability
  - excellent I/O performance
  - scalable metadata management, supporting more than 250,000 metadata operations/sec under a variety of workloads

Architecture
- Clients: users of the data
- Metadata Server Cluster (MDS): namespace management, metadata operations (open, rename, etc.), security enforcement
- Object Storage Cluster (OSD): stores all data and metadata, organized into flexible-sized containers called objects

Key Design Goals
- The system is inherently dynamic:
  - Decouples data and metadata
  - Eliminates object lists for naming and lookup via a hash-like distribution function, CRUSH (Controlled Replication Under Scalable Hashing)
  - Delegates responsibility for data migration, replication, failure detection, and recovery to the OSD cluster
- Node failures are the norm, rather than an exception:
  - Stores data on up to 10,000 nodes
  - Changes in the storage cluster size trigger automatic (and fast) failure recovery and rebalancing of data with no interruption of service
- The character of workloads constantly shifts over time:
  - As the size and popularity of parts of the file system hierarchy change, the hierarchy is dynamically redistributed over hundreds of MDSs by Dynamic Subtree Partitioning, with near-linear scalability
- The system is inevitably built incrementally:
  - The file system can be seamlessly expanded by simply adding storage nodes (OSDs)
  - Data is proactively migrated onto new devices to maintain a balanced distribution
  - Utilizes all available disk bandwidth and avoids data hot spots

Client Operations: File I/O
- The client sends an 'open for read' request to the MDS cluster
- The MDS translates the file name to a file inode and returns it with other metadata information
- The client then calculates the names and locations of the objects and reads the data from the corresponding OSDs

DATA PLACEMENT AND DISTRIBUTION

Data Placement and Distribution
- Files are striped across many objects, grouped into placement groups (PGs), and distributed to OSDs
- First, objects are mapped into placement groups (PGs) with a hash function (on the order of 100 PGs per OSD)
- Then object replicas are assigned to OSDs using CRUSH, a globally known mapping function

What is CRUSH?
- A pseudo-random data distribution function that maps each PG to an ordered list of OSDs:
  f(pgid) = list of OSDs  ->  location transparent
- Anyone (client, OSD, MDS) can calculate the location of any object; no per-file or per-object directory is needed
- Small changes in the storage cluster have little impact on the existing PG mapping -> minimizes data migration
  (a code sketch of this two-step mapping follows below)
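Below is a minimal Python sketch of the two-step mapping described above. It assumes a plain SHA-1 hash in place of Ceph's real functions; the names (object_to_pg, crush) and constants (NUM_OSDS, NUM_PGS) are illustrative only, and the crush() placeholder shows just the contract of the function (a deterministic pgid -> ordered OSD list), not the actual CRUSH algorithm.

```python
import hashlib

NUM_OSDS = 1000                     # hypothetical cluster size
NUM_PGS = 100 * NUM_OSDS            # on the order of 100 PGs per OSD

def object_to_pg(object_name: str) -> int:
    """Step 1: hash the object name into a placement group (PG)."""
    digest = hashlib.sha1(object_name.encode()).hexdigest()
    return int(digest, 16) % NUM_PGS

def crush(pgid: int, replicas: int = 3) -> list:
    """Step 2 (stand-in for CRUSH): deterministically map a PG to an ordered
    list of distinct OSDs.  Real CRUSH walks a weighted, hierarchical cluster
    map; this placeholder only shows the contract f(pgid) -> list of OSDs,
    computable by any client, OSD, or MDS without a directory lookup."""
    osds, r = [], 0
    while len(osds) < replicas:
        h = hashlib.sha1(f"{pgid}:{r}".encode()).hexdigest()
        candidate = int(h, 16) % NUM_OSDS
        if candidate not in osds:
            osds.append(candidate)
        r += 1
    return osds

pgid = object_to_pg("inode1234.00000003")   # object name = inode + stripe index
print(pgid, crush(pgid))                    # primary OSD is the first in the list
```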

How Does CRUSH Work?
- Relies on three elements:
  - Placement Group ID (PGID)
  - Cluster map: a hierarchical description of the devices comprising the storage cluster (devices and buckets)
  - Placement rules: how many replica targets are chosen and what restrictions are imposed

Hierarchical Cluster Map
- Storage devices are assigned weights to control the amount of data they are responsible for storing
- Data is distributed uniformly among weighted devices
- Buckets can be composed arbitrarily to construct a hierarchy of available storage
- Data is placed in the hierarchy by recursively selecting nested bucket items via a pseudo-random, hash-like function (a sketch of this recursive selection follows at the end of this section)

Replica Placement
- A rule consists of a sequence of operations applied to the hierarchy
- Separates object replicas across different failure domains while still maintaining the desired distribution:
  - physical proximity
  - shared power source
  - shared network
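The sketch below illustrates, under simplifying assumptions, the idea of recursively descending a weighted bucket hierarchy with a pseudo-random hash. The cluster map, bucket names, and the select() function are hypothetical; they do not reproduce CRUSH's actual bucket algorithms or its collision and failure handling.

```python
import hashlib

def hash_weight(key: str) -> float:
    """Deterministic pseudo-random value in [0, 1) derived from key."""
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return (h % 10**8) / 10**8

# Hypothetical cluster map: rows contain cabinets, cabinets contain devices.
cluster_map = {
    "root": {"row1": 8.0, "row2": 8.0},
    "row1": {"cab11": 4.0, "cab12": 4.0},
    "row2": {"cab21": 4.0, "cab22": 4.0},
    "cab11": {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 2.0},
    "cab12": {"osd.3": 1.0, "osd.4": 1.0},
    "cab21": {"osd.5": 1.0, "osd.6": 1.0},
    "cab22": {"osd.7": 1.0, "osd.8": 1.0},
}

def select(bucket: str, pgid: int, replica: int) -> str:
    """Recursively descend the hierarchy, choosing one child per level in
    proportion to its weight, seeded by (pgid, replica) so the choice is
    deterministic and different replicas take different paths."""
    children = cluster_map.get(bucket)
    if not children:                      # reached a leaf device
        return bucket
    total = sum(children.values())
    point = hash_weight(f"{bucket}:{pgid}:{replica}") * total
    for child, weight in children.items():
        if point < weight:
            return select(child, pgid, replica)
        point -= weight
    return select(next(iter(children)), pgid, replica)

replicas = [select("root", pgid=42, replica=r) for r in range(3)]
print(replicas)   # real CRUSH also rejects collisions, failed, and overloaded devices
```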

COMPARISON WITH OTHER DFS

Metadata Management: Ceph vs. GFS
- Ceph: metadata is separated from data and dynamically distributed
- GFS: centralized metadata server; chunk servers are distributed

Access to Shared Devices: Ceph vs. GPFS
- Ceph: asymmetric access to shared devices; object based
- GPFS: symmetric access to shared block-level devices; block based

Data Distribution: Ceph vs. GPFS
- Ceph: a deterministic, pseudo-random, hash-like function distributes data uniformly among OSDs; relies on a compact cluster description to place data on new storage targets without consulting a central allocator
- GPFS: large files are divided into equal-sized blocks, and consecutive blocks are placed on different disks in round-robin fashion; lookup is performed via a metadata directory
  (a toy contrast of the two placement styles follows below)
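As a toy contrast of the two placement styles, the sketch below uses made-up constants and function names; real GPFS consults block allocation metadata rather than a pure function, and real Ceph goes object -> PG -> CRUSH -> OSDs as sketched earlier.

```python
import hashlib

NUM_DISKS = 8   # hypothetical disk count for the GPFS-style example

def gpfs_block_to_disk(block_index: int) -> int:
    """GPFS-style striping (simplified): consecutive equal-sized blocks of a
    large file land on different disks in round-robin order."""
    return block_index % NUM_DISKS

def ceph_object_to_osd(object_name: str, num_osds: int = 8) -> int:
    """Ceph-style placement (collapsed to one step): a deterministic hash of
    the object name picks the target, so no central allocator is consulted."""
    return int(hashlib.sha1(object_name.encode()).hexdigest(), 16) % num_osds

print([gpfs_block_to_disk(i) for i in range(10)])            # 0,1,2,...,7,0,1
print([ceph_object_to_osd(f"file.{i}") for i in range(10)])  # pseudo-random spread
```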

DATA REPLICATION

What is Replication?
- Replicating data in a distributed system means hosting copies of a subset of the data in multiple locations
- Data can be replicated to multiple sites and accessed directly from a replica site

Benefits of Replication
- Data is available and reliable:
  - Can access data even if one site is unavailable
  - Can access data that is in use at another location
- Data download speeds can be improved:
  - Wide-area latency is improved with replica sites
  - Maximizes the bandwidth between replica sites

Drawbacks of Replication
- Large-scale data transfer: time and bandwidth requirements are potentially high
- Data must be validated once it is replicated
- Replica management

Replica Management
- Replicas can be stored on different storage devices
- Need to determine the location of the metadata: will there be one universal structure, or will the metadata be individualized for each replica?
- Updates or deletions in the original data set must be propagated. When the original data is changed, we can use:
  - Push notifications: explicitly contact each site holding a replica and send it the updated data set
  - Pull notifications: each site subscribes to the data sets it is hosting and is notified when changes are registered
- Versioning is possible if a site chooses not to update; version numbers must be recorded so that the most current data can be found if desired
  (a toy sketch of push vs. pull propagation with versioning follows below)
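As a toy illustration of the push vs. pull propagation and versioning described above (not any particular system's mechanism), here is a short Python sketch; all class and method names are hypothetical.

```python
class ReplicaSite:
    """A site hosting a copy of the data set."""
    def __init__(self, name):
        self.name = name
        self.data, self.version = None, 0
        self.pending = False            # set when a pull-style change notice arrives

    def apply(self, data, version):
        if version > self.version:      # versioning: never overwrite newer data
            self.data, self.version = data, version

class Origin:
    """The site holding the original data set."""
    def __init__(self):
        self.data, self.version = None, 0
        self.push_replicas = []         # sites contacted explicitly (push)
        self.subscribers = []           # sites that subscribed (pull)

    def update(self, data):
        self.version += 1
        self.data = data
        for site in self.push_replicas:  # push: send the updated data set now
            site.apply(data, self.version)
        for site in self.subscribers:    # pull: only flag that a change exists
            site.pending = True

    def sync(self, site):
        """Called when a subscribed site decides to fetch the latest version."""
        site.apply(self.data, self.version)
        site.pending = False

# Usage: one pushed replica, one pull subscriber that may lag behind.
origin, a, b = Origin(), ReplicaSite("site-a"), ReplicaSite("site-b")
origin.push_replicas.append(a)
origin.subscribers.append(b)
origin.update("dataset v1")
print(a.version, b.version, b.pending)  # 1 0 True  (b has not pulled yet)
origin.sync(b)
print(b.version, b.pending)             # 1 False
```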

Ceph Data Placement/Replication
- Assume failures will be likely: petabyte or exabyte scale requires many OSDs to be in use
- Data is replicated by placement groups (PGs), each mapped to an ordered list of n OSDs (n-way replication) by the placement rules
- New or updated data is written to the first non-failed OSD in the list, called the "primary" OSD
- Read requests are sent to the primary OSD

Ceph Replication
- When new data is written to the primary OSD of a list (see the sketch after this section):
  - Assign a version number
  - Forward the write operation to the replicas and wait for responses
  - Each replica acknowledges receiving the update
  - The primary applies the write operation and sends an ack to the client
  - Once all data has been written to disk, the primary sends a commit to the client
- All the bandwidth required for replication is on the network between OSDs
- The replication structure separates data safety from synchronization:
  - Synchronization updates are low latency: once the update has been applied to all the replicas, the client is notified
  - Data safety semantics are well defined: once the updates have been written to disk, the client is notified of the commit

Ceph Failure Detection
- OSDs can sometimes report on their own that they have failed
- For OSDs that cannot notify others, replication traffic serves to monitor the state of all OSDs in a list:
  - Occasionally ping neighbors to check their availability
  - If there is no response, mark the OSD as "down" and skip it in the list
  - After some time, mark the OSD as "out" and replace it with another OSD, re-replicating all of its data
- If an OSD is marked as down or out, all of its primary responsibilities pass to the next OSD in the list
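The following Python sketch walks through the write path just described: the first non-failed OSD in the PG's list acts as primary, a version number is assigned, the write is forwarded to the replicas, the client is acked once all replicas have applied the update, and a commit follows once the data reaches disk. It is a simplified model with synchronous calls standing in for messages; the class and method names are illustrative, not Ceph's code.

```python
class OSD:
    def __init__(self, osd_id):
        self.osd_id = osd_id
        self.memory = {}      # updates applied (visible) but not yet durable
        self.disk = {}        # updates flushed to stable storage
        self.failed = False

    def apply(self, obj, data, version):
        self.memory[obj] = (data, version)     # replica acks by returning

    def flush(self, obj):
        self.disk[obj] = self.memory[obj]

class Client:
    def notify(self, kind, obj, version):
        print(f"{kind}: {obj} v{version}")

def write(client, obj, data, pg_osds):
    """Primary-copy write path from the slide (failed OSDs are skipped,
    not recovered, in this sketch)."""
    live = [o for o in pg_osds if not o.failed]
    primary, replicas = live[0], live[1:]      # first non-failed OSD = primary

    version = max((v for _, v in primary.memory.values()), default=0) + 1
    for r in replicas:                         # forward the write, wait for acks
        r.apply(obj, data, version)
    primary.apply(obj, data, version)          # primary applies after replica acks
    client.notify("ack", obj, version)         # sync point: update is visible

    for osd in [primary] + replicas:           # later, once data reaches disk...
        osd.flush(obj)
    client.notify("commit", obj, version)      # safety point: update is durable

pg = [OSD(i) for i in (5, 9, 12)]              # ordered list from CRUSH (hypothetical)
pg[0].failed = True                            # primary duties pass to the next OSD
write(Client(), "obj.42", "hello", pg)
```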

GFS Replication
- Replication is done per chunk; chunk replicas are distributed across chunkservers and racks
- Replication across machines:
  - Handles disk/machine failures
  - Maximizes network bandwidth utilization
- Replication across racks:
  - Handles rack damage
  - Exploits the aggregate bandwidth of all the racks
  - Writes must cross racks, but the system is mostly read, so this is still beneficial
- Similar to Ceph:
  - Separation of control flow and data flow
  - The primary replica forwards write requests to the other replicas

GFS Replication Policies
- Chunk creation (see the sketch after this section):
  - Put replicas on chunkservers with below-average disk space utilization
  - Ensure each chunkserver has only a few new chunks at a time, to prevent excessive simultaneous write operations
  - Choose chunkservers that are spread across racks
- Chunk re-replication:
  - Handles corruption, loss of a disk, or an increase in the replication goal
  - Same placement policies, but cloning bandwidth is throttled
- Chunk rebalancing:
  - Move replicas to balance disk space and load
  - Allows a new chunkserver to be filled gradually
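Below is a hedged Python sketch of the chunk-creation heuristics listed above: prefer chunkservers with below-average disk utilization, cap the number of recent creations per server, and spread replicas across racks. The data structures and thresholds (e.g. max_recent) are assumptions for illustration, not GFS's implementation.

```python
from dataclasses import dataclass

@dataclass
class Chunkserver:
    name: str
    rack: str
    disk_utilization: float      # fraction of disk space in use (0.0 - 1.0)
    recent_creations: int = 0    # new chunks placed here recently

def place_new_chunk(servers, replicas=3, max_recent=2):
    """Choose chunkservers for a new chunk, following the slide's heuristics."""
    avg_util = sum(s.disk_utilization for s in servers) / len(servers)
    candidates = sorted(
        (s for s in servers
         if s.disk_utilization <= avg_util        # below-average utilization
         and s.recent_creations < max_recent),    # avoid write hot spots
        key=lambda s: s.disk_utilization,
    )
    chosen, racks_used = [], set()
    for s in candidates:                          # first pass: one replica per rack
        if s.rack not in racks_used:
            chosen.append(s)
            racks_used.add(s.rack)
        if len(chosen) == replicas:
            break
    for s in candidates:                          # second pass: fill up if needed
        if len(chosen) == replicas:
            break
        if s not in chosen:
            chosen.append(s)
    for s in chosen:
        s.recent_creations += 1
    return chosen

servers = [
    Chunkserver("cs1", "rackA", 0.40), Chunkserver("cs2", "rackA", 0.55),
    Chunkserver("cs3", "rackB", 0.35), Chunkserver("cs4", "rackC", 0.70),
    Chunkserver("cs5", "rackC", 0.45),
]
print([s.name for s in place_new_chunk(servers)])   # ['cs3', 'cs1', 'cs5']
```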

NFS Replication
- NFS version 3 has no replication, resulting in a single point of failure

AFS Replication
- Pessimistic replication: read-only replication allows higher availability and more load balancing
- Replicates executables/system files from the upper levels of Vice
- Read-only replication helps administrators manage the system:
  - For a collection of servers hosting the same read-only volumes, any one can be added to or removed from service without affecting the others
  - Increased availability/serviceability
- Replication is performed at the volume level
- Volume cloning is atomic, so consistency of files within a read-only volume is guaranteed
- Updates are directed to a main server and then asynchronously propagated to the read-only replicas; consistency of replica volumes is not guaranteed

CODA Replication
- Optimistic replication: all replicas are read-write
- Conflicts: diverging copies of the same files/directories
  - Local/global conflict: local updates can clash with other local updates while they are being uploaded, preventing reintegration; solved by application-specific policies
  - Server/server conflict: updates might not reach all servers simultaneously, leaving some servers with different versions and preventing replication; solved by versioning/resolution

Review: Comparison of Policies
- Ceph: write one, read one; the client is notified when all replicas have been written
- AFS: pessimistic, write one, read all; asynchronous propagation, so consistency of replicas is not guaranteed
- CODA: optimistic, write all, read all; consistency is not guaranteed between replicas
- GFS: write one, read all; the client is notified when all replicas have been written

GPFS
- Parallel, shared-disk file system
- Centralized management: conflicting operations are forwarded to a designated node
- Distributed locking: a read/write lock is acquired to synchronize access

GPFS: Parallelism in DFS
- The first writer to access a file is given a byte-range token for the whole file (offsets 0 to infinity)
- A second writer sends a revoke request to the first writer with its desired write range:
  - If the file has been closed by the first writer, the second is given the full token
  - If the file is still open on the first writer, part of the byte-range token is handed off (where o1 and o2 are the offsets at which the first and second writers are writing):
    - o2 > o1: the second writer gets the token for o2 to infinity
    - o2 < o1: the second writer gets the token for 0 to o1

References
All photos included in this presentation are taken from the papers cited below. This material is intended for the sole purpose of instruction in operating systems at the University of Rochester. All copyrighted materials belong to their original owner(s).

- Ceph website, http://ceph.newdream.net
- Sage Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, Carlos Maltzahn, "Ceph: A Scalable, High-Performance Distributed File System," Proceedings of the 7th Conference on Operating Systems Design and Implementation (OSDI '06), November 2006.
- Sage Weil, Scott A. Brandt, Ethan L. Miller, Carlos Maltzahn, "CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data," Proceedings of SC '06, November 2006.
- M. Tim Jones, "Ceph: A Linux petabyte-scale distributed file system," IBM developerWorks Linux Technical Library, http://www.ibm.com/developerworks/linux/library/l-ceph/
- A. Chervenak, R. Schuler, C. Kesselman, S. Koranda, B. Moe, "Wide Area Data Replication for Scientific Collaborations," Proceedings of the 6th IEEE/ACM International Workshop on Grid Computing (Grid 2005), November 2005.
- Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, "The Google File System," Proceedings of the 19th ACM Symposium on Operating Systems Principles, October 19-22, 2003, Bolton Landing, NY, USA.
- Dennis Geels, "Data Replication in OceanStore," U.C. Berkeley Master's Report and Technical Report UCB//CSD-02-1217, November 2002.
- J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satyanarayanan, R. Sidebotham, M. West, "Scale and Performance in a Distributed File System," ACM Transactions on Computer Systems, 6(1):51-81, February 1988.
- Mahadev Satyanarayanan, James J. Kistler, Puneet Kumar, Maria E. Okasaki, Ellen H. Siegel, David C. Steere, "Coda: A Highly Available File System for a Distributed Workstation Environment," IEEE Transactions on Computers, 39(4):447-459, April 1990.