Balancing storage utilization across a global namespace
Manish Motwani, Cleversafe, Inc.

Agenda
- Introduction
- What are namespaces, why we need them
- Compare different types of namespaces
- Why we need to rebalance data
- Rebalancing scenarios, requirements, design
- Our observations
- Future work
- Conclusion

Details not covered
- Data format on disk
- Network protocols
- Namespace persistence
- Security

Introduction

Name lookup
Requirements:
- Scalable
- Highly available
- Highly reliable
- High performance
- Fault tolerant
Some solutions:
- HA clusters
- More memory
- Synchronization

Achieving unlimited scale
- Static name lookup
- Large enough namespace to store practically unlimited number of objects
- Uniform HW utilization
- Independent scaling of storage capacity and performance
- Hardware virtualization
- Tolerate disk/node failures
- Replace disks/nodes with newer, larger ones

Namespaces

Namespaces
A namespace is the set of all possible names that can uniquely identify objects. It is used by clients and servers to determine where each piece of data is written to and read from.
- Hierarchical namespaces (e.g., file systems)
- Flat namespaces
  - Dense namespace (e.g., a block device)
  - Sparse namespace
    - Infinite (integers, whole numbers, prime numbers, etc.)
    - Finite (a range of an infinite namespace, e.g., 0 to 2^500)

Contiguous vs. fragmented
Fragmented (non-contiguous) namespaces require central metadata management for name mapping. At petabyte scale, central metadata management is extremely difficult; at exabyte scale, it's impossible.

Global namespace
A global namespace is a very sparse namespace, such that we can create billions of random names in it every day and still never see a conflict or run out of names.
Cleversafe namespace:
- 2^384 names, broken into storage and routable names
- Routable name: vault identifier, storage sub type, generation id, slice index
- Storage name: 2^192 names (range: 0 to 2^192 - 1)
  - Generally expressed in hex
  - Stored in 24 bytes
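To put that sparseness in perspective, here is a minimal Python sketch (my illustration, not Cleversafe code) that draws a uniformly random 192-bit storage name and renders it as 24 bytes of hex; the collision remark in the comment is a rough birthday-bound estimate, not a product figure.

```python
import secrets

STORAGE_NAME_BITS = 192                        # names span 0 .. 2**192 - 1
STORAGE_NAME_BYTES = STORAGE_NAME_BITS // 8    # 24 bytes per name

def random_storage_name() -> str:
    """Draw a uniformly random 192-bit storage name and render it in hex."""
    name = secrets.randbits(STORAGE_NAME_BITS)
    return format(name, "0{}x".format(STORAGE_NAME_BYTES * 2))  # 48 hex digits

# Billions of random draws per day still make a collision astronomically
# unlikely: the birthday bound for n names is on the order of n**2 / 2**193.
print(random_storage_name())
```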

Namespace mapping

Data Rebalancing

What is data rebalancing?
Rebalancing is the changing of the entity responsible for storing some portion of the namespace.
- Usually requires data migration
- Performed on a live system
- Done to achieve more uniform utilization for higher write availability
- E.g., changing ownership of a namespace range from one disk to another, or one storage node to another

Why do we need to rebalance data?
- Hardware failures
- Disk/node removal
- Disk/node addition/replacement
- Non-random name generation
- Data needs to live forever; HW doesn't

Disk usage example: a chart of two disks owning namespace ranges 1-5 and 6-10

Simple rebalancing: the split 1-5 / 6-10 becomes 1-3 / 4-10, i.e., names 4-5 change owner
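Because the namespace stays contiguous, this kind of rebalancing amounts to shifting the boundary between two adjacent ranges. A minimal sketch of that idea, assuming integer name ranges and a simple list of (range, owner) assignments; the data layout and function name are illustrative, not the actual implementation.

```python
# Namespace ranges stay contiguous: simple rebalancing only shifts the split
# point between two adjacent ranges, so no per-object metadata is needed.
assignments = [((1, 5), "disk-1"), ((6, 10), "disk-2")]   # before: 1-5 / 6-10

def shift_boundary(assignments, new_boundary):
    """Move the split point between two adjacent ranges (e.g., 5 -> 3)."""
    (lo1, _hi1), owner1 = assignments[0]
    (_lo2, hi2), owner2 = assignments[1]
    return [((lo1, new_boundary), owner1), ((new_boundary + 1, hi2), owner2)]

print(shift_boundary(assignments, 3))   # after: 1-3 on disk-1, 4-10 on disk-2
```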

Rebalancing components
Rebalance policy:
- Look at global view to determine general direction
- Automatically decide on source and destination based on urgency
Data movement:
- Move data from source to destination
- Guarantee internal consistency

Rebalance Policy

Rebalance policy
Goals:
- Writes should not fail due to full disks if there is space on other disks
- Optimize for the least amount of data movement
- Do it only when necessary
- Fully automated: zero administrative supervision
Design considerations:
- Ongoing writes keep changing the picture
- Different-sized disks
- Disk failures
- Data rebuilding
- Impact on system performance

Usage example: charts of per-disk usage in ascending name order (shown over several steps)

Rebalance policy algorithm
- Determine the highest-utilized disk (% usage)
- Determine the direction of data movement from the greater of the average utilization to the left of the disk or to the right of it
- Move a small chunk of data (~1 GB)
- Repeat until the imbalance ratio is below a pre-configured threshold
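A minimal sketch of this loop in Python. The disk list is assumed to be ordered by the namespace ranges the disks own, move_chunk is an assumed helper that migrates roughly 1 GB across the shared boundary, the threshold and the max-minus-min imbalance check are illustrative, and the direction rule follows one reading of the slide (push data toward the less-utilized side).

```python
CHUNK_BYTES = 1 << 30            # move ~1 GB at a time
IMBALANCE_THRESHOLD = 0.05       # pre-configured stop condition (illustrative)

def utilization(disk):
    return disk["used"] / disk["capacity"]

def rebalance_step(disks, move_chunk):
    """One policy iteration; returns False once the disks are balanced enough."""
    utils = [utilization(d) for d in disks]
    if max(utils) - min(utils) < IMBALANCE_THRESHOLD:
        return False

    i = utils.index(max(utils))                       # highest-utilized disk
    left, right = utils[:i], utils[i + 1:]
    left_avg = sum(left) / len(left) if left else float("inf")
    right_avg = sum(right) / len(right) if right else float("inf")

    # Send data toward the side with the lower average utilization; only an
    # adjacent disk can receive it, which keeps namespace ranges contiguous.
    dst = i - 1 if left_avg < right_avg else i + 1
    move_chunk(disks[i], disks[dst], CHUNK_BYTES)
    return True

# The policy simply calls rebalance_step in a loop until it returns False.
```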

Simulation 1: disk replacement
Chart of per-disk utilization (0-100%) over time for Disks 1-5.
Report interval: 1 hour; write rate: 80 MB/s; rebalance rate: 80 MB/s; each disk: 3 TB

Simulation 2
Chart of per-disk utilization (0-100%) over time for Disks 1-6.
Report interval: 4 hours; write rate: 5 MB/s; rebalance rate: 80 MB/s; 3 TB disks

Making smarter policy decisions
Some of the problems encountered:
- Smaller/empty vaults move too much range
- Disk utilization is not always indicative
- Rebuild hasn't occurred after an outage or a disk replacement
- The SS is running slow, so the client is dropping writes
- Too much load on the system
Some solutions:
- Move the highest-utilized vaults first
- Take remote usage into account
- Limit the number of rebalancing threads in our scheduler

Rebalancing Design

Implementation requirements
- All data "available" during rebalancing
- No operations should be disrupted; however, performance might be affected
- Work with the existing storage interface
- Independent of the underlying storage mechanism
- Maintain namespace contiguity
- Rebalancing performance is not critical: it is usually a background process
- No additional metadata should be added to the namespace mapping (but existing metadata can change)

Storage interface
- Read(name): read all revisions of name from media
- Write(name, revision, txid, data): write a transactional piece of data to media
- Commit(txid): commit a transaction
- Rollback(txid): roll back a transaction
- List(range)
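For reference in the data-move sketch that follows, here is the interface as a Python abstract class; the class name, parameter types, and the shape of the List range are assumptions on my part, only the five operations and their arguments come from the slide.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List, Tuple

class SliceStore(ABC):                     # class name is illustrative
    @abstractmethod
    def read(self, name: bytes) -> List[Tuple[int, bytes]]:
        """Return all (revision, data) pairs stored under name."""

    @abstractmethod
    def write(self, name: bytes, revision: int, txid: int, data: bytes) -> None:
        """Write one transactional piece of data under transaction txid."""

    @abstractmethod
    def commit(self, txid: int) -> None:
        """Commit every write made under txid."""

    @abstractmethod
    def rollback(self, txid: int) -> None:
        """Discard every uncommitted write made under txid."""

    @abstractmethod
    def list(self, start: bytes, end: bytes) -> Iterable[bytes]:
        """Yield the names currently stored in the range [start, end]."""
```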

Data move design
- Copy, Sync, Delete "chunks"
- Asynchronous and lockless; can also be used for inter-node rebalancing
- The source store (local store) is the main actor
- The policy decides the range to move
- Non-interruptible: the entire range is moved; if a failure occurs, it's rolled back
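A rough sketch of that copy/sync/delete flow on top of the SliceStore sketch above, with the source store driving the move. The new_txid callable, the retry and ownership hand-off details, and the second listing pass used for "Sync" are all assumptions about how the step could be realized, not a description of the shipping code.

```python
def move_range(src, dst, name_range, new_txid):
    """Source-driven move of one namespace range from src to dst (sketch)."""
    start, end = name_range
    txid = new_txid()
    try:
        copied = set()
        # Copy: stream every name in the range to the destination store.
        for name in src.list(start, end):
            for revision, data in src.read(name):
                dst.write(name, revision, txid, data)
            copied.add(name)
        # Sync: pick up names written to the source while the copy was running.
        for name in src.list(start, end):
            if name not in copied:
                for revision, data in src.read(name):
                    dst.write(name, revision, txid, data)
        dst.commit(txid)
    except Exception:
        dst.rollback(txid)        # the entire range is rolled back on failure
        raise
    # Delete: only after the destination commits does the source drop its
    # copies and hand ownership of the range over (not shown here).
```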

Data move design (cont.)

Future work
- Inter-node rebalancing
- Temporary node namespace fragmentation

Conclusions
- Global contiguous namespaces combined with no central metadata management can allow storage systems to scale to exabytes and beyond
- Data rebalancing is required to compensate for media failures or in heterogeneous environments
- Automated rebalancing decision algorithms allow for minimal administrative supervision
- Data movement is faster and more scalable when not dependent on client transactions
- There's always room for improvement

Questions?