CS-580K/480K Advanced Topics in Cloud Computing: Object Storage
When do we use object storage?
- Checking Facebook, Twitter, or Gmail
- Keeping docs on Dropbox
- Checking SharePoint
- Taking pictures with Instagram
Object storage is a good fit for:
- Unstructured data workloads
- Large capacity requirements (e.g., hundreds of terabytes and beyond)
- Data archiving: documents, emails, and backups
- Storage for photos, videos, and virtual machine images
- Granular security and multi-tenancy
- Automation, management, monitoring, and reporting tools
- Workloads that do not demand high performance
Object Use Cases: an object storage overview, with architectural examples from Cloudian
Block vs. Object
- Block: faster, for hot data; flash-optimized, IOPS-centric, VM-optimized
- Object: bigger, for cool/cloud data; object-based, scale-out (multi-PB), software-centric
Block vs. Object
- Block
  - Data is stored without any concept of data format or type; it is simply a series of 0s and 1s
  - Higher-level applications or file systems must keep track of data location, context, and meaning
- Object
  - An object consists of an object identifier (OID), data, and metadata
  - No object organization system (flat namespace)
  - Direct access to individual objects; no need to traverse directories
How to Build an Object Storage System (Case 1: Swift)
Swift: Storing & Retrieving Data
- Flat namespace of accounts, containers, and objects; no nested directories
- Account: a collection of containers
  - List containers: GET /v1/accountname/
  - Create container: PUT /v1/accountname/containername/
- Container: a collection of objects
  - List objects: GET /v1/accountname/containername/
  - Upload object: PUT /v1/accountname/containername/objectname
  - Retrieve object: GET /v1/accountname/containername/objectname
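A minimal sketch of how these flat-namespace URLs are composed (the account, container, and object names below are placeholders; no real cluster is contacted):

```python
BASE = "/v1"

def account_url(account):
    # GET here lists the account's containers
    return f"{BASE}/{account}/"

def container_url(account, container):
    # PUT here creates the container; GET lists its objects
    return f"{BASE}/{account}/{container}/"

def object_url(account, container, obj):
    # PUT here uploads the object; GET retrieves it
    return f"{BASE}/{account}/{container}/{obj}"

print(object_url("AUTH_demo", "photos", "cat.jpg"))
# -> /v1/AUTH_demo/photos/cat.jpg
```

Note that the path encodes the entire hierarchy: there are no directories to traverse, only the three-level account/container/object name.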
Basically two parts:
- Proxy server: exposes the Swift public (REST) API to users and streams data to and from the client upon request
- Storage nodes: handle storage, replication, and management of objects, containers, and accounts
Architecture Overview
(Diagram: a client's PUT /v1/account/container/object reaches a proxy, which consults the rings and forwards the request to object, container, and account servers on the storage nodes, each backed by local disks.)
Proxy Server
- Shared-nothing architecture; can be scaled out as needed
- A load balancer can be placed ahead of the proxy servers
- Objects are streamed directly between the proxy server and the client; there is no cache in between
Object Server
- A very simple blob (i.e., binary large object) storage server that can store, retrieve, and delete objects stored on local devices
- Objects are stored as binary files on the filesystem
- Each object is stored under a path derived from the hash of the object name and the operation's timestamp
- Last write always wins, which ensures that the latest object version will be served
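The hash-plus-timestamp naming can be sketched as follows; the directory layout and device root below are illustrative simplifications, not Swift's exact on-disk format:

```python
import hashlib

def object_path(device_root, name, timestamp):
    # Directory derived from the object name's hash; file named by
    # the write timestamp. (Illustrative layout, not Swift's real one.)
    h = hashlib.md5(name.encode()).hexdigest()
    return f"{device_root}/objects/{h[:3]}/{h}/{timestamp:.5f}.data"

def latest(paths):
    # Last write wins: the file with the newest timestamp is served.
    # Fixed-width timestamps make lexicographic max equal to newest.
    return max(paths)

p1 = object_path("/srv/node/sda", "photos/cat.jpg", 1700000000.0)
p2 = object_path("/srv/node/sda", "photos/cat.jpg", 1700000100.0)
print(latest([p1, p2]) == p2)  # True: the newer write wins
```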
Container Server
- Its primary job is to handle listings of objects; it doesn't know where those objects are, just which objects are in a specific container
- The listings are stored as SQLite database files and replicated across the cluster, similar to how objects are
- Statistics are also tracked, including the total number of objects and the total storage used by the container
Account Server
- Very similar to the container server, except that it is responsible for listings of containers rather than objects
The Rings
- The rings map data to physical locations in the cluster
- There are three rings for three kinds of things (accounts, containers, and objects); each ring works the same way
- For a given account, container, or object name, the ring returns its physical location within the storage nodes
  - Device look-up table: finds which device contains the target object
  - Device list: finds which storage node that device belongs to
Mapping Using Basic Hash Functions

Mapping of objects to different drives:

OBJECT    HASH VALUE (HEXADECIMAL)             MAPPING VALUE           DRIVE MAPPED TO
Image 1   b5e7d988cfdb78bc3be1a9c221a8f744     hash(Image 1) % 4 = 2   Drive 2
Image 2   943359f44dc87f6a16973c79827a038c     hash(Image 2) % 4 = 3   Drive 3
Image 3   1213f717f7f754f050d0246fb7d6c43b     hash(Image 3) % 4 = 3   Drive 3
Music 1   4b46f1381a53605fc0f93a93d55bf8be     hash(Music 1) % 4 = 1   Drive 1
Music 2   ecb27b466c32a56730298e55bcace257     hash(Music 2) % 4 = 0   Drive 0
Music 3   508259dfec6b1544f4ad6e4d52964f59     hash(Music 3) % 4 = 0   Drive 0
Movie 1   69db47ace5f026310ab170b02ac8bc58     hash(Movie 1) % 4 = 2   Drive 2
Movie 2   c4abbd49974ba44c169c220dadbdac71     hash(Movie 2) % 4 = 1   Drive 1

Note: MD5 is a widely used hash function producing a 128-bit value. Although it was designed as a cryptographic hash function, it has been found to suffer from extensive vulnerabilities; here it is used only for placement, not for security.
Problem?
But what if we have to add or remove drives? The hash values of all objects stay the same, but we must recompute the mapping value for every object and then re-map most of them to different drives.
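The scale of the problem is easy to demonstrate (the object names and counts below are arbitrary): with hash-mod-N placement, going from four drives to five forces roughly four out of five objects to move.

```python
import hashlib

def drive_for(name, num_drives):
    # The "basic" scheme: MD5 the object name, then take the hash
    # modulo the number of drives.
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return h % num_drives

# Simulate adding a fifth drive to a four-drive cluster.
objects = [f"object-{i}" for i in range(1000)]
before = {o: drive_for(o, 4) for o in objects}
after = {o: drive_for(o, 5) for o in objects}
moved = sum(1 for o in objects if before[o] != after[o])
print(f"{moved} of {len(objects)} objects must move")
```

An object keeps its drive only when its hash gives the same remainder mod 4 and mod 5, which happens for about one object in five, so roughly 80% of the data would have to be shuffled.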
Swift: the Consistent Hashing Algorithm
The consistent hashing algorithm achieves a similar goal but does things differently. Instead of computing a mapping value for each object, each drive is assigned a range of hash values to store.

Range of hash values for each drive (first four hex digits):

DRIVE     RANGE OF HASH VALUES
Drive 0   0000 ~ 3fff
Drive 1   3fff ~ 7ffe
Drive 2   7fff ~ bffd
Drive 3   bffd ~ efff
Mapping of objects to different drives:

OBJECT    HASH VALUE (HEXADECIMAL)             DRIVE MAPPED TO
Image 1   b5e7d988cfdb78bc3be1a9c221a8f744     Drive 2
Image 2   943359f44dc87f6a16973c79827a038c     Drive 2
Image 3   1213f717f7f754f050d0246fb7d6c43b     Drive 0
Music 1   4b46f1381a53605fc0f93a93d55bf8be     Drive 1
Music 2   ecb27b466c32a56730298e55bcace257     Drive 3
Music 3   508259dfec6b1544f4ad6e4d52964f59     Drive 1
Movie 1   69db47ace5f026310ab170b02ac8bc58     Drive 1
Movie 2   c4abbd49974ba44c169c220dadbdac71     Drive 3
With a New Device
- Each drive gets a new range of hash values to store; each object's hash value remains the same
- Objects whose hash values still fall within the range of their current drive stay where they are
- Objects whose hash values no longer fall within the range of their current drive are re-mapped to another drive
- With consistent hashing, the number of objects that must move is small compared with the basic hash function

Range of hash values for each drive:

DRIVE     RANGE OF HASH VALUES
Drive 0   0000 ~ 3fff
Drive 1   3fff ~ 7ffe
Drive 2   7fff ~ bffd
Drive 3   bffd ~ ffff
Problem?
- Each drive has one large range of hash values
- Many objects may map to one (or a few) drives
- Result: load imbalance
Multiple Markers in the Consistent Hashing Algorithm
- Instead of one big hash range per drive, multiple markers split those large ranges into smaller chunks
- Multiple markers help distribute objects evenly across drives, improving load balancing
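A toy version of a ring with multiple markers per drive can be sketched as follows (drive names, marker counts, and object names are all made up for illustration; real Swift rings use fixed partitions and builder files):

```python
import bisect
import hashlib

def md5_int(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(drives, markers_per_drive=100):
    # Each drive places many markers on the hash ring; an object
    # belongs to the first marker at or after its own hash value
    # (wrapping around at the end of the ring).
    return sorted((md5_int(f"{d}-marker-{m}"), d)
                  for d in drives for m in range(markers_per_drive))

def lookup(ring, name):
    h = md5_int(name)
    i = bisect.bisect(ring, (h,))
    return ring[i % len(ring)][1]

ring4 = build_ring(["d0", "d1", "d2", "d3"])
ring5 = build_ring(["d0", "d1", "d2", "d3", "d4"])
objects = [f"object-{i}" for i in range(1000)]
moved = sum(1 for o in objects if lookup(ring4, o) != lookup(ring5, o))
print(f"{moved} of {len(objects)} objects move")
```

Adding a fifth drive now moves only the objects that land in the slices the new drive's markers claim, roughly one fifth of the data rather than the four fifths the basic hash-mod-N scheme would move.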
In Summary: What is the Ring Doing?
- Evenly mapping data to physical locations in the cluster
- Building (and rebuilding) the look-up table from object hash value to device
- Maintaining the device list, which identifies each device's location (storage node)
Data Durability
Ensuring your data stays the same for ages.
- Replicated or erasure coded? Depends on your use case
- The proxy returns data only if the content matches the stored checksum
- Continuously running background processes:
  - Auditors: ensure there is no bit-rot, quarantining replicas on checksum mismatch
  - Replicators: ensure all replicas are stored on multiple remote nodes (for replication)
  - Reconstructors: recompute missing erasure-coding fragments (for erasure coding)
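The auditor's core check amounts to recomputing the checksum of the stored bytes and comparing it with the recorded one; the function name and sample data below are illustrative:

```python
import hashlib

def audit_object(blob, stored_etag):
    # Recompute the checksum of the on-disk bytes. A mismatch means
    # bit-rot: the copy should be quarantined so a replicator can
    # restore it from a healthy replica elsewhere in the cluster.
    return hashlib.md5(blob).hexdigest() == stored_etag

good = b"hello object"
etag = hashlib.md5(good).hexdigest()
print(audit_object(good, etag))             # healthy copy: True
print(audit_object(b"hellX object", etag))  # flipped byte: False
```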
Failure Domains
Ensuring high availability and durability with three replicas.
(Diagram: 18 disks spread across six storage nodes, each node also running a proxy.)
Failure Domains
(Diagram: the same storage nodes grouped into Zone 1 and Zone 2.)
Failure Domains
(Diagram: Zones 1-3 spread across Region 1 and Region 2.)
Re-Balancing
To ensure a third replica.
(Diagram: the same cluster of zones and regions, with data re-balanced across the failure domains.)
Explore More: https://docs.openstack.org/swift/latest/
How to Build an Object Storage System (Case 2: Ceph)
System Overview
Key Features
- Decoupled data and metadata: CRUSH
  - Files are striped onto predictably named objects
  - CRUSH maps objects to storage devices
- Dynamic distributed metadata management
  - Dynamic subtree partitioning distributes metadata among MDSs
- Object-based storage
  - OSDs handle migration, replication, failure detection, and recovery
Client Operation
- Ceph interface: nearly POSIX
- Decoupled data and metadata operation
- User-space implementation: FUSE or directly linked
  - Filesystem in Userspace (FUSE) is a software interface for Unix-like operating systems that lets non-privileged users create their own file systems without editing kernel code
Client Access Example
1. Client sends an open request to the MDS
2. MDS returns the capability, file inode, file size, and stripe information
3. Client reads/writes directly from/to the OSDs
4. Client sends a close request and provides details back to the MDS
Distributed Metadata
- Metadata operations often make up as much as half of file system workloads
- Effective metadata management is therefore critical to overall system performance
Dynamic Subtree Partitioning
- Lets Ceph dynamically share the metadata workload among tens or hundreds of metadata servers (MDSs)
- Sharing is dynamic and based on current access patterns
- Results in near-linear performance scaling in the number of MDSs
Distributed Object Storage
- Files are split across objects
- Objects are members of placement groups
- Placement groups are distributed across OSDs
- Ceph first maps objects into placement groups (PGs) using a hash function
- Placement groups are then assigned to OSDs using a pseudo-random function (CRUSH)
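The two-step mapping can be sketched as below. The PG count, OSD ids, and object name are made up, and the second step uses a plain seeded pseudo-random pick as a stand-in for CRUSH, which additionally honors the cluster map's failure-domain hierarchy and device weights:

```python
import hashlib
import random

NUM_PGS = 128  # illustrative placement-group count

def pg_for(object_name):
    # Step 1: hash the object name into a placement group
    h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    return h % NUM_PGS

def osds_for(pg, osds, replicas=3):
    # Step 2: stand-in for CRUSH. Seeding the generator with the PG
    # id makes the choice deterministic: any client computes the same
    # OSD set without consulting a central lookup table.
    rng = random.Random(pg)
    return rng.sample(osds, replicas)  # distinct OSDs for the replicas

pg = pg_for("movie-1")
print(pg, osds_for(pg, list(range(12))))
```

Because both steps are pure functions of the name and the cluster description, clients locate data by computation rather than by asking a metadata server where each object lives.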
CRUSH
S. A. Weil, S. A. Brandt, E. L. Miller, and C. Maltzahn. CRUSH: Controlled, scalable, decentralized placement of replicated data. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC '06), Tampa, FL, Nov. 2006. ACM.
Replication
- Objects are replicated on OSDs within the same PG
- The primary forwards updates to the other replicas
- It sends an ACK to the client once all replicas have received the update: slow but safe
- Replicas send a final commit once they have committed the update to disk
Failure Detection and Recovery
- OSD states: "down" and "out"
- Monitors check for intermittent problems
- New or recovered OSDs peer with the other OSDs within their PG
Conclusion
Ceph and Swift share similar concepts, though implemented differently:
- How to locate an object (rings vs. CRUSH)
- How to distribute objects evenly (rings vs. CRUSH)
- How to provide reliability (replication)
Erasure Code
- Replication: full copies of stored objects
- Erasure coding: one copy plus parity
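The durability-versus-capacity trade-off comes down to simple arithmetic; the 8-data/4-parity scheme below is just an example configuration:

```python
def overhead(data_fragments, parity_fragments):
    # Raw bytes stored per byte of user data
    return (data_fragments + parity_fragments) / data_fragments

print(overhead(1, 2))  # 3-way replication (1 copy + 2 more): 3.0x
print(overhead(8, 4))  # 8 data + 4 parity erasure coding:    1.5x
```

Both examples survive the loss of any two fragments, but erasure coding does so at half the raw storage cost, at the price of extra CPU work to reconstruct missing fragments.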
Sources
1. Christian Schwede, "Forget everything you knew about Swift Rings", https://www.openstack.org/assets/presentation-media/rings201.pdf
2. Swift 101, https://www.youtube.com/watch?v=vaeu0ld-GIU&feature=youtu.be
3. Ceph 101, https://www.youtube.com/watch?v=oyh1c0c4hzm