CS-580K/480K Advanced Topics in Cloud Computing. Object Storage


When do we use object storage?
- Checking Facebook or Twitter
- Gmail
- Docs on Dropbox
- Checking SharePoint
- Taking pictures with Instagram

Object storage is good for:
- Unstructured data workloads
- Large capacity requirements (e.g., > 100s of terabytes)
- Data archiving: documents, emails, and backups
- Storage for photos, videos, and virtual machine images
- Needs for granular security and multi-tenancy
- Needs for automation, management, monitoring, and reporting tools
- Workloads that do not require high performance

Object Use Cases
(Figure: object storage overview, with architectural examples from Cloudian)

Block vs. Object
Block (faster, for hot data):
- Flash-optimized
- IOPS-centric
- VM-optimized
Object (bigger, for cool/cloud data):
- Object-based
- Scale-out (multi-PB)
- Software-centric

Block vs. Object
Block:
- Data is stored without any concept of data format or type; the data is simply a series of 0s and 1s
- Higher-level applications or file systems must keep track of data location, context, and meaning
Object:
- An object consists of an object identifier (OID), data, and metadata
- No object organization hierarchy (flat organization)
- Direct access to individual objects; no need to traverse directories

How to build an object storage system. Case 1: Swift


Swift: Storing & Retrieving Data
Flat namespace: accounts, containers, and objects (no nested directories)
- Account: a collection of containers
  - List containers: GET /v1/accountname/
  - Create container: PUT /v1/accountname/containername/
- Container: a collection of objects
  - List objects: GET /v1/accountname/containername/
  - Upload object: PUT /v1/accountname/containername/objectname
  - Retrieve object: GET /v1/accountname/containername/objectname
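To make those REST calls concrete, here is a minimal sketch using Python's requests library. The storage URL, token, container, and object names are illustrative assumptions; in practice the URL and X-Auth-Token come from the authentication service (e.g., Keystone).

```python
import requests

# Assumptions: STORAGE_URL and the token were obtained from the auth service;
# the container and object names below are only examples.
STORAGE_URL = "https://swift.example.com/v1/AUTH_demo"
HEADERS = {"X-Auth-Token": "example-token"}

# Create a container
requests.put(f"{STORAGE_URL}/photos", headers=HEADERS)

# Upload an object
with open("cat.jpg", "rb") as f:
    requests.put(f"{STORAGE_URL}/photos/cat.jpg", headers=HEADERS, data=f)

# List objects in the container
print(requests.get(f"{STORAGE_URL}/photos", headers=HEADERS).text)

# Retrieve the object
obj = requests.get(f"{STORAGE_URL}/photos/cat.jpg", headers=HEADERS)
open("cat_copy.jpg", "wb").write(obj.content)
```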

Basically two parts:
- Proxy server: exposes the Swift public (REST) API to users and streams data to and from the client upon request
- Storage nodes: handle storage, replication, and management of objects, containers, and accounts

Architecture Overview
(Diagram: a client request, e.g. PUT /v1/account/container/object, reaches a proxy, which consults the rings and forwards the request to the object, container, and account servers on the storage nodes, each backed by local disks.)

Proxy Server
- Shared-nothing architecture; can be scaled out as needed
- A load balancer can be placed in front of the proxy servers
- Objects are streamed directly between the proxy server and the client; there is no cache in between

Object Server
- A very simple blob (binary large object) storage server that can store, retrieve, and delete objects stored on local devices
- Objects are stored as binary files on the filesystem
- Each object is stored using a path derived from the object name's hash and the operation's timestamp
- Last write always wins, which ensures that the latest object version will be served
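As a rough illustration of that naming scheme (a simplified sketch, not Swift's exact on-disk layout; the device name and mount point are assumptions), the storage path can be derived from a hash of the object's full name, with the file named after the request timestamp so the newest .data file wins:

```python
import hashlib
import time

# Simplified sketch: derive a storage path from the hash of the object's
# full name, and name the file after the request timestamp so that the
# newest .data file is always the one served.
def object_path(account, container, obj, ts=None):
    name = f"/{account}/{container}/{obj}"
    digest = hashlib.md5(name.encode()).hexdigest()
    ts = ts if ts is not None else time.time()
    return f"/srv/node/sdb1/objects/{digest}/{ts:.5f}.data"

print(object_path("AUTH_demo", "photos", "cat.jpg"))
```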

Container Server
- The container server's primary job is to handle listings of objects. It doesn't know where those objects are, just which objects are in a specific container
- The listings are stored as SQLite database files and replicated across the cluster, similar to how objects are
- Statistics are also tracked, including the total number of objects and the total storage usage for that container

Account Server
- The account server is very similar to the container server, except that it is responsible for listings of containers rather than objects

The Rings
- The rings map data to physical locations in the cluster
- Three rings store three kinds of things (accounts, containers, and objects), and each ring works in the same way
- For a given account, container, or object name, the ring returns information on its physical location within the storage nodes
  - Device look-up table: to find out which device holds the target object
  - Device list: to find out which storage node that device belongs to
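The sketch below illustrates the idea behind those two structures (a toy look-up table and device list, not Swift's real ring file format; the partition count, table contents, and device addresses are made up): hash the object path, take a few leading bits as a partition number, and map each partition to devices through the table.

```python
import hashlib

# Illustrative ring sketch (not Swift's actual ring format).
PART_POWER = 4                      # 2**4 = 16 partitions

# Look-up table: one row per replica, mapping partition -> device id.
replica2part2dev = [
    [p % 4 for p in range(2 ** PART_POWER)],          # replica 0
    [(p + 1) % 4 for p in range(2 ** PART_POWER)],    # replica 1
    [(p + 2) % 4 for p in range(2 ** PART_POWER)],    # replica 2
]

# Device list: device id -> storage node and disk.
devices = [
    {"id": 0, "ip": "10.0.0.1", "device": "sdb"},
    {"id": 1, "ip": "10.0.0.2", "device": "sdb"},
    {"id": 2, "ip": "10.0.0.3", "device": "sdb"},
    {"id": 3, "ip": "10.0.0.4", "device": "sdb"},
]

def get_nodes(account, container, obj):
    digest = hashlib.md5(f"/{account}/{container}/{obj}".encode()).digest()
    part = int.from_bytes(digest[:4], "big") >> (32 - PART_POWER)
    return [devices[row[part]] for row in replica2part2dev]

print(get_nodes("AUTH_demo", "photos", "cat.jpg"))
```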

Mapping Using Basic Hash Functions
MAPPING OF OBJECTS TO DIFFERENT DRIVES

Object   Hash value (hexadecimal)            Mapping value           Drive mapped to
Image 1  b5e7d988cfdb78bc3be1a9c221a8f744    hash(Image 1) % 4 = 2   Drive 2
Image 2  943359f44dc87f6a16973c79827a038c    hash(Image 2) % 4 = 3   Drive 3
Image 3  1213f717f7f754f050d0246fb7d6c43b    hash(Image 3) % 4 = 3   Drive 3
Music 1  4b46f1381a53605fc0f93a93d55bf8be    hash(Music 1) % 4 = 1   Drive 1
Music 2  ecb27b466c32a56730298e55bcace257    hash(Music 2) % 4 = 0   Drive 0
Music 3  508259dfec6b1544f4ad6e4d52964f59    hash(Music 3) % 4 = 0   Drive 0
Movie 1  69db47ace5f026310ab170b02ac8bc58    hash(Movie 1) % 4 = 2   Drive 2
Movie 2  c4abbd49974ba44c169c220dadbdac71    hash(Movie 2) % 4 = 1   Drive 1

Problem?
Note: MD5 is a widely used hash function producing a 128-bit hash value. Although MD5 was initially designed as a cryptographic hash function, it has been found to suffer from extensive vulnerabilities.

Problem?
But what if we have to add or remove drives? The hash values of all objects stay the same, but we must recompute the mapping value for every object and then re-map most of them to different drives.
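A quick sketch (with hypothetical object names) makes the cost concrete: with modulo placement, adding a single drive changes the mapping for most objects.

```python
import hashlib

# Sketch: with modulo placement, changing the drive count re-maps most objects.
objects = [f"object-{i}" for i in range(1000)]

def drive(name, num_drives):
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return h % num_drives

moved = sum(drive(o, 4) != drive(o, 5) for o in objects)
print(f"{moved}/{len(objects)} objects move when going from 4 to 5 drives")
# Typically around 80% of the objects move, even though only one drive was added.
```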

Swift -- Consistent Hashing Algorithm
The consistent hashing algorithm achieves a similar goal but does things differently. Instead of computing a mapping value for each object, each drive is assigned a range of hash values to store.

RANGE OF HASH VALUES FOR EACH DRIVE
Drive    Range of hash values
Drive 0  0000 ~ 3fff
Drive 1  3fff ~ 7ffe
Drive 2  7fff ~ bffd
Drive 3  bffd ~ efff

MAPPING OF OBJECTS TO DIFFERENT DRIVES
Object   Hash value (hexadecimal)            Drive mapped to
Image 1  b5e7d988cfdb78bc3be1a9c221a8f744    Drive 2
Image 2  943359f44dc87f6a16973c79827a038c    Drive 2
Image 3  1213f717f7f754f050d0246fb7d6c43b    Drive 0
Music 1  4b46f1381a53605fc0f93a93d55bf8be    Drive 1
Music 2  ecb27b466c32a56730298e55bcace257    Drive 3
Music 3  508259dfec6b1544f4ad6e4d52964f59    Drive 1
Movie 1  69db47ace5f026310ab170b02ac8bc58    Drive 1
Movie 2  c4abbd49974ba44c169c220dadbdac71    Drive 3

With a New Device
- Each drive gets a new range of hash values to store; each object's hash value remains the same
- Any object whose hash value is still within the range of its current drive stays where it is
- Any object whose hash value is no longer within the range of its current drive is mapped to another drive
- With consistent hashing, the number of objects that move is very small compared to the basic hash function

RANGE OF HASH VALUES FOR EACH DRIVE
Drive    Range of hash values
Drive 0  0000 ~ 3fff
Drive 1  3fff ~ 7ffe
Drive 2  7fff ~ bffd
Drive 3  bffd ~ ffff
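Here is a minimal consistent-hashing sketch (one marker per drive, illustrative rather than Swift's implementation; the drive and object names are made up) showing that only the objects falling into the new drive's range move.

```python
import bisect
import hashlib

# Minimal consistent-hashing ring: each drive owns the arc up to its marker.
def h(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, drives):
        self.points = sorted((h(d), d) for d in drives)
        self.keys = [p for p, _ in self.points]

    def lookup(self, obj):
        i = bisect.bisect(self.keys, h(obj)) % len(self.points)
        return self.points[i][1]

objects = [f"object-{i}" for i in range(1000)]
before = Ring([f"drive-{i}" for i in range(4)])
after = Ring([f"drive-{i}" for i in range(5)])   # add one drive

moved = sum(before.lookup(o) != after.lookup(o) for o in objects)
print(f"{moved}/{len(objects)} objects move after adding a drive")
# Only the objects that fall into the new drive's range move -- roughly
# one fifth of them on average here -- versus ~80% with modulo placement.
```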

With a New Device (figure)

Problem?
- Each drive covers one large range of hash values
- Many objects may map to one (or a few) drives
- Result: load imbalance

Multiple Markers in the Consistent Hashing Algorithm
- Instead of giving each drive one big hash range, multiple markers split the large ranges into smaller chunks
- Multiple markers help distribute objects evenly across drives, which improves load balancing
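Extending the earlier sketch with many markers per drive (again illustrative, not Swift's code; the marker count is an arbitrary assumption) shows how the load evens out.

```python
import bisect
import hashlib

# Sketch: consistent hashing with multiple markers per drive.
def h(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, drives, markers_per_drive=100):
        # Each drive places many markers on the ring, splitting its range.
        self.points = sorted(
            (h(f"{d}#{m}"), d) for d in drives for m in range(markers_per_drive)
        )
        self.keys = [p for p, _ in self.points]

    def lookup(self, obj):
        i = bisect.bisect(self.keys, h(obj)) % len(self.points)
        return self.points[i][1]

ring = Ring([f"drive-{i}" for i in range(4)])
counts = {}
for o in (f"object-{i}" for i in range(10000)):
    d = ring.lookup(o)
    counts[d] = counts.get(d, 0) + 1
print(counts)   # with many markers, the per-drive counts are close to equal
```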

In Summary: What Does the Ring Do?
- Evenly maps data to physical locations in the cluster
- Builds (and rebuilds) the look-up table (from object hash value to device)
- Maintains the device list (to identify which storage node a device lives on)


Data Durability
Ensuring your data stays the same for ages
- Replicated or erasure coded? Depends on your use case
- The proxy returns data only if the content matches the stored checksum
- Continuously running background processes:
  - Auditors: ensure there is no bit-rot, quarantining replicas on a checksum mismatch
  - Replicators: ensure all replicas are stored on multiple remote nodes (for replication)
  - Reconstructors: recompute missing erasure-coding fragments (for erasure coding)
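The auditor idea can be sketched in a few lines (illustrative only; the directory layout and the checksums mapping are assumptions, not Swift's auditor code): re-hash each stored object and quarantine it if the content no longer matches the recorded checksum.

```python
import hashlib
import os

# Sketch of an auditor: `checksums` maps stored file names to the MD5
# digest recorded when each object was written.
def audit(data_dir, checksums, quarantine_dir="quarantine"):
    os.makedirs(quarantine_dir, exist_ok=True)
    for name, expected in checksums.items():
        path = os.path.join(data_dir, name)
        with open(path, "rb") as f:
            actual = hashlib.md5(f.read()).hexdigest()
        if actual != expected:
            # Bit-rot detected: move the bad copy aside so the replicator
            # can restore a good replica from another node.
            os.rename(path, os.path.join(quarantine_dir, name))
            print(f"quarantined {name}: checksum mismatch")
```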

Failure Domains
Ensuring high availability and durability with three replicas
(Diagram: disks 0-17 spread across storage nodes, each node holding several disks.)

Failure Domains
Ensuring high availability and durability with three replicas
(Diagram: the same storage nodes grouped into failure zones, Zone 1 and Zone 2.)

Failure Domains
Ensuring high availability and durability with three replicas
(Diagram: zones grouped into regions; replicas are spread across Zone 1, Zone 2, and Zone 3 within Region 1 and Region 2.)


Re-Balancing
To ensure a third replica
(Diagram: data is moved across the zones and regions so that a third replica is maintained.)

Explore More: https://docs.openstack.org/swift/latest/

How to build an object storage system. Case 2: Ceph

System Overview (figure)

Key Features
- Decoupled data and metadata
  - Files are striped onto predictably named objects
  - CRUSH maps objects to storage devices
- Dynamic distributed metadata management
  - Dynamic subtree partitioning distributes metadata among MDSs
- Object-based storage
  - OSDs handle migration, replication, failure detection, and recovery

Client Operation
- Ceph interface: nearly POSIX
- Decoupled data and metadata operations
- User-space implementation: FUSE or directly linked
  - Filesystem in Userspace (FUSE) is a software interface for Unix-like operating systems that lets non-privileged users create their own file systems without editing kernel code

Client Access Example
1. The client sends an open request to the MDS
2. The MDS returns a capability, the file inode, the file size, and striping information
3. The client reads/writes directly from/to the OSDs
4. The client sends a close request and provides details to the MDS

Distributed Metadata
- Metadata operations often make up as much as half of file system workloads
- Effective metadata management is critical to overall system performance

Dynamic Subtree Partitioning
- Lets Ceph dynamically share the metadata workload among tens or hundreds of metadata servers (MDSs)
- Sharing is dynamic and based on current access patterns
- Results in near-linear performance scaling with the number of MDSs

Distributed Object Storage
- Files are split across objects
- Objects are members of placement groups (PGs)
- Placement groups are distributed across OSDs

- Ceph first maps objects into placement groups (PGs) using a hash function
- Placement groups are then assigned to OSDs using a pseudo-random function (CRUSH)
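A conceptual sketch of this two-step placement (not the real CRUSH algorithm; the PG count, OSD names, and seeded random choice are stand-ins for illustration):

```python
import hashlib
import random

# Conceptual sketch of Ceph's two-step placement:
# object name -> placement group (hash), placement group -> OSDs (pseudo-random).
PG_NUM = 128
OSDS = [f"osd.{i}" for i in range(12)]
REPLICAS = 3

def object_to_pg(obj_name):
    h = int(hashlib.md5(obj_name.encode()).hexdigest(), 16)
    return h % PG_NUM

def pg_to_osds(pg):
    # Stand-in for CRUSH: a deterministic pseudo-random choice seeded by the
    # PG id, so every client computes the same mapping without a central
    # lookup table.
    rng = random.Random(pg)
    return rng.sample(OSDS, REPLICAS)

pg = object_to_pg("my-object-0001")
print(pg, pg_to_osds(pg))
```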

CRUSH
S. A. Weil, S. A. Brandt, E. L. Miller, and C. Maltzahn. CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC '06), Tampa, FL, Nov. 2006. ACM.

Replication
- Objects are replicated on the OSDs within the same PG
- The primary forwards updates to the other replicas
- It sends an ACK to the client once all replicas have received the update (slow but safe)
- Replicas send a final commit once they have committed the update to disk
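The flow can be sketched as follows (a toy model of primary-copy replication, not Ceph's OSD code; class and variable names are illustrative):

```python
# Conceptual sketch of primary-copy replication within a PG.
class OSD:
    def __init__(self, name):
        self.name, self.memory, self.disk = name, {}, {}

    def apply(self, key, value):          # replica has received the update
        self.memory[key] = value
        return "ack"

    def commit(self, key):                # replica has flushed it to disk
        self.disk[key] = self.memory[key]
        return "commit"

def client_write(primary, replicas, key, value):
    primary.apply(key, value)
    # The primary forwards the update; the client is ACKed only after every
    # replica has received it ("slow but safe").
    acks = [r.apply(key, value) for r in replicas]
    assert all(a == "ack" for a in acks)
    print("ACK sent to client")
    # Later, a final commit is reported once the data is on disk everywhere.
    commits = [r.commit(key) for r in replicas] + [primary.commit(key)]
    print("final commit after", len(commits), "copies are on disk")

osds = [OSD(f"osd.{i}") for i in range(3)]
client_write(osds[0], osds[1:], "obj1", b"data")
```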

Failure Detection and Recovery
- Failed OSDs are marked "down", and eventually "out"
- Monitors check for intermittent problems
- New or recovered OSDs peer with the other OSDs within their PGs

Conclusion
Ceph and Swift share similar concepts, though implemented differently:
- How to locate an object (rings vs. CRUSH)
- How to distribute objects evenly (rings vs. CRUSH)
- How to provide reliability (replication)

Erasure Code
- Replication: full copies of stored objects
- Erasure coding: one copy plus parity
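A minimal sketch of the erasure-coding idea with two data fragments plus one XOR parity fragment (a toy RAID-5-style code, not the Reed-Solomon codes production systems use): any single lost fragment can be rebuilt from the other two, storing 1.5x the data instead of 3x for three full replicas.

```python
# Toy erasure code: 2 data fragments + 1 XOR parity fragment.
def encode(data: bytes):
    half = (len(data) + 1) // 2
    d1, d2 = data[:half], data[half:].ljust(half, b"\0")
    parity = bytes(a ^ b for a, b in zip(d1, d2))
    return d1, d2, parity

def recover_d2(d1, parity):
    # XOR of the surviving fragments rebuilds the missing one.
    return bytes(a ^ b for a, b in zip(d1, parity))

d1, d2, p = encode(b"hello object storage")
assert recover_d2(d1, p) == d2
print("lost fragment rebuilt from the surviving ones")
```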

Sources
1. Christian Schwede, "Forget everything you knew about Swift Rings", https://www.openstack.org/assets/presentation-media/rings201.pdf
2. Swift 101, https://www.youtube.com/watch?v=vaeu0ld-GIU&feature=youtu.be
3. Ceph 101, https://www.youtube.com/watch?v=oyh1c0c4hzm