Ceph: A Scalable, High-Performance Distributed File System
Presented by Nithin Nagaraj Kashyap
Outline Introduction. System Overview. Distributed Object Storage. Problem Statements.
What is Ceph?
Unified Distributed Storage System. Objects. Blocks. Files. Fault Tolerant. Self-Managing & Self-Healing.
Ceph Object Model Pools: independent object namespaces or collections. Objects: blobs of data (bytes to gigabytes).
How do we design a storage system that scales?
Key Problem: How are we going to distribute the data?
Distributed Object Storage
Data Distribution All objects are replicated n times. Objects are automatically placed, balanced, and migrated in a dynamic cluster. We must take the physical infrastructure into account. We consider three approaches: Pick a spot; remember where you put it. Pick a spot; write down where you put it. Calculate where to put it and where to find it.
CRUSH Pseudo-random placement algorithm. Fast calculation, no lookup. Statistically uniform distribution. Stable mapping. Rule-based configuration.
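CRUSH itself is hierarchical and rule-driven, but the "calculate, don't look up" idea can be illustrated with rendezvous (highest-random-weight) hashing, which shares the properties listed above: deterministic, lookup-free, statistically uniform, and stable under device changes. This is a sketch with hypothetical names, not Ceph's algorithm.

```python
import hashlib

def _score(pgid: int, osd_id: int) -> int:
    # Deterministic pseudo-random score for a (placement group, device) pair.
    h = hashlib.sha256(f"{pgid}:{osd_id}".encode()).digest()
    return int.from_bytes(h[:8], "big")

def place(pgid: int, osds: list[int], n_replicas: int) -> list[int]:
    # Rendezvous hashing: every party computes the same replica list from
    # the same inputs -- no central table, no lookup.
    # Removing one OSD only remaps the PGs that scored it highest, which is
    # the "stable mapping" property CRUSH also provides.
    ranked = sorted(osds, key=lambda osd: _score(pgid, osd), reverse=True)
    return ranked[:n_replicas]

print(place(pgid=42, osds=[0, 1, 2, 3, 4], n_replicas=3))
```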
Problem Statements:
(1) Figure 3: Files are striped across many objects, grouped into placement groups (PGs), and distributed to OSDs via CRUSH, a specialized replica placement function. Describe how to find the data associated with an inode and an in-file object number (ino, ono).
A file is assigned an inode number (INO) by the metadata server; the INO is a unique identifier for the file. The file is then carved into some number of objects, based on the size of the file. From the INO and the object number (ONO), each object is assigned an object ID (OID). A simple hash over the OID assigns each object to a placement group (PGID). The mapping from placement group to object storage devices is a pseudo-random mapping computed by an algorithm called Controlled Replication Under Scalable Hashing (CRUSH). The final input is the cluster map, a compact representation of the devices that make up the storage cluster. With the PGID and the cluster map, any party can locate any object.
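Putting those steps together, here is a minimal sketch of the lookup path. The names (`object_id`, `pg_for`, `osds_for`) and the modulus-based placement are illustrative stand-ins; real CRUSH is hierarchical and rule-driven.

```python
import hashlib

PG_NUM = 256  # placement groups in the pool (assumed power of two)

def object_id(ino: int, ono: int) -> str:
    # Each object's ID combines the file's inode number and the
    # object's index within the file.
    return f"{ino:x}.{ono:08x}"

def pg_for(oid: str) -> int:
    # A simple hash over the OID selects a placement group.
    h = int.from_bytes(hashlib.sha256(oid.encode()).digest()[:4], "big")
    return h % PG_NUM  # equivalent to a bitmask when PG_NUM is a power of two

def osds_for(pgid: int, cluster_map: list[int], n: int = 3) -> list[int]:
    # Stand-in for CRUSH: deterministically derive n distinct OSDs
    # from the PG ID and the current cluster map.
    start = pgid % len(cluster_map)
    return [cluster_map[(start + i) % len(cluster_map)] for i in range(n)]

# Locate the second object (ono=1) of the file with inode 0x1234:
oid = object_id(ino=0x1234, ono=1)
pgid = pg_for(oid)
print(oid, pgid, osds_for(pgid, cluster_map=[0, 1, 2, 3, 4]))
```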
(2) Does a mapping method (from an object number to its hosting storage server) relying on block or object list metadata (a table listing all object-server mappings) work as well? What are its drawbacks? Such a mapping can work functionally, but it has serious limitations. The allocation table grows with the number of objects, must be stored, replicated, and kept consistent, and every I/O requires a lookup that lies on the critical path, so it quickly becomes a performance and scaling bottleneck. Metadata operations often make up as much as half of file system workloads, which makes metadata management a critical scaling challenge in distributed file systems; metadata operations also involve a greater degree of interdependence, making scalable consistency and coherence management more difficult. Ceph sidesteps the problem by calculating placement with CRUSH rather than looking it up: file and directory metadata is very small, consisting almost entirely of directory entries (file names) and inodes, with no block or object allocation lists at all.
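The contrast can be made concrete. Below, a table-based mapper keeps one entry per object, while a calculated mapper needs only the (small) cluster map; both are illustrative sketches, not Ceph code.

```python
import hashlib

# Table-based mapping: one entry per object; the table must be stored,
# replicated, kept consistent, and consulted on every I/O.
allocation_table: dict[str, list[int]] = {}

def lookup(oid: str) -> list[int]:
    return allocation_table[oid]  # grows without bound; lies on the critical path

# Calculated mapping: no per-object state; any party holding the cluster
# map derives the same answer independently.
def calculate(oid: str, cluster_map: list[int], n: int = 3) -> list[int]:
    h = int.from_bytes(hashlib.sha256(oid.encode()).digest()[:4], "big")
    start = h % len(cluster_map)
    return [cluster_map[(start + i) % len(cluster_map)] for i in range(n)]
```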
(3) Why are placement groups (PGs) introduced? Can we construct a hash function mapping an object (oid) directly to a list of OSDs? The system hashes each object's name into a placement group; each PG is a logical subset of the pool's objects. Directly hashing an oid to a list of OSDs is not workable: without the PG indirection, each OSD would have to track replication and recovery state for every individual object, and any change to the cluster would reshuffle data in an uncontrolled way. PGs bound the amount of per-OSD state and give CRUSH coarse, stable units to place, balance, and migrate, as the sketch below illustrates.
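As a rough illustration of why the indirection helps, this sketch (reusing the stand-in placement from above, with hypothetical names) shows that an OSD's bookkeeping is bounded by the number of PGs it hosts, independent of how many objects exist.

```python
# An OSD tracks replication/recovery state per PG, not per object, so its
# bookkeeping is bounded by the PG count even as objects grow without limit.

PG_NUM = 256

def pgs_on_osd(osd: int, cluster_map: list[int], n: int = 3) -> set[int]:
    # Using the same stand-in placement as above: which PGs land on this OSD?
    hosted = set()
    for pgid in range(PG_NUM):
        start = pgid % len(cluster_map)
        replicas = [cluster_map[(start + i) % len(cluster_map)] for i in range(n)]
        if osd in replicas:
            hosted.add(pgid)
    return hosted

# Whether the pool holds a thousand objects or a billion, this OSD's
# recovery state covers at most PG_NUM units:
print(len(pgs_on_osd(2, cluster_map=[0, 1, 2, 3, 4])))
```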
(4) What are the inputs of the CRUSH function? What can be included in an OSD cluster map? CRUSH is implemented as a pseudo-random, deterministic function that maps an input value, typically an object or object group identifier, to a list of devices on which to store object replicas; its other inputs are the OSD cluster map and the placement rules. Besides the devices themselves, the cluster map includes a list of down or inactive devices and an epoch number, which is incremented each time the map changes. All OSD requests are tagged with the client's map epoch, so that all parties can agree on the current distribution of data. Incremental map updates are shared between cooperating OSDs, and piggyback on OSD replies if the client's map is out of date.
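A minimal model of such a cluster map and the epoch-tagging protocol might look like this (field and function names are assumptions for illustration, not Ceph's actual structures):

```python
from dataclasses import dataclass, field

@dataclass
class ClusterMap:
    epoch: int                                    # incremented on every map change
    osds: list[int]                               # devices in the cluster (with weights, in practice)
    down: set[int] = field(default_factory=set)   # devices marked down or inactive

def handle_request(client_epoch: int, current: ClusterMap) -> dict:
    # All OSD requests carry the client's map epoch; if the client is
    # stale, the reply piggybacks the incremental updates it is missing.
    if client_epoch < current.epoch:
        return {"status": "ok",
                "map_updates": f"epochs {client_epoch + 1}..{current.epoch}"}
    return {"status": "ok"}

cmap = ClusterMap(epoch=7, osds=[0, 1, 2, 3, 4], down={3})
print(handle_request(client_epoch=5, current=cmap))
```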
Replication & Data Safety
(5) Figure 4: RADOS responds with an acknowledgement after the write has been applied to the buffer caches on all OSDs replicating the object. Reads are directed at the primary. Is it possible for different clients to see different values of an object at the same time?
Yes, it is possible for different clients to see different values of an object at the same time. Clients care about two distinct things: making their updates visible to other clients, and knowing definitively that the data they've written is safely replicated, on disk, and will survive power or other failures. RADOS disassociates synchronization from safety when acknowledging updates: an update is acknowledged, and becomes visible, once it has been applied to the buffer caches on all replicas, but it is only committed once it is safely on disk. This lets Ceph realize both low-latency updates for efficient application synchronization and well-defined data safety semantics.
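The two notifications in Figure 4 can be sketched as distinct events on the write path; the class and function names below are illustrative, not RADOS's API:

```python
class OSD:
    def __init__(self, osd_id: int):
        self.osd_id = osd_id
        self.buffer: list[bytes] = []  # volatile buffer cache
        self.disk: list[bytes] = []    # durable storage

    def apply_to_buffer(self, data: bytes) -> None:
        self.buffer.append(data)   # visible to readers via the primary

    def flush_to_disk(self, data: bytes) -> None:
        self.disk.append(data)     # survives power or other failures

def replicated_write(data: bytes, replicas: list[OSD], on_ack, on_commit) -> None:
    # Phase 1 -- synchronization: ack once the update has been applied to
    # the buffer caches on all replicas. Visible, but not yet crash-safe.
    for osd in replicas:
        osd.apply_to_buffer(data)
    on_ack()
    # Phase 2 -- safety: commit only after every replica has flushed the
    # update to disk; only now is it guaranteed to survive a failure.
    for osd in replicas:
        osd.flush_to_disk(data)
    on_commit()

osds = [OSD(i) for i in range(3)]
replicated_write(b"v2", osds,
                 on_ack=lambda: print("ack: visible"),
                 on_commit=lambda: print("commit: durable"))
```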
References
Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn. Ceph: A Scalable, High-Performance Distributed File System. University of California, Santa Cruz.
http://www.inktank.com/resource/managing-a-distributedstorage-system-at-scale-sage-weil/
Wikipedia.
Thank You!