Pelican: A building block for exascale cold data storage

Pelican: A building block for exascale cold data storage
Shobana Balakrishnan, Richard Black, Austin Donnelly, Paul England, Adam Glass, Dave Harper, Sergey Legtchenko, Aaron Ogus, Eric Peterson, Antony Rowstron
Microsoft Research, Microsoft
Przemysław Przybyszewski 1

Cold data 2

Problem
A significant portion of the data stored in the cloud is rarely accessed; such data is known as cold data. Pelican addresses the need for cost-effective storage of cold data and serves as a basic building block for exabyte-scale cold storage. 3

Cold data in the cloud 4

Right-provisioning
Provision resources just for the cold data workload:
- Disks: archival and SMR drives instead of commodity drives
- Power, cooling, bandwidth: enough for the required workload, not enough to keep all disks spinning
- Servers: enough for data management instead of 1 server per 40 disks 5

Pelican: rack-scale appliance for cold data
- Mechanical, hardware and storage software stack co-designed.
- Right-provisioned for the cold data workload: only 8% of the disks are spinning concurrently.
- Designed to store blobs which are infrequently accessed. 6

Pelican hardware
- 52U rack with 1,152 archival-class 3.5" SATA disks
- Disks are placed in trays of 16 disks
- Rack consists of six 8U chassis; each chassis contains 12 trays
- SATA disks in a tray connect to an HBA on the tray
- Tray HBAs connect to a PCIe switch on the backplane; PCIe supports virtual switching
- Each chassis is connected to both servers
- 4.55 TB per disk (roughly 5 PB per rack)
- Goodput of 100 MB/s per disk (50 active disks required for 40 Gbps) 7
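The headline capacity and bandwidth figures follow directly from the per-disk numbers on slide 7; a minimal back-of-the-envelope sketch in Python (decimal units, ignoring formatting overheads):

    # Back-of-the-envelope check of the hardware figures on slide 7.
    DISKS_PER_RACK = 1152
    TB_PER_DISK = 4.55          # TB per archival disk
    GOODPUT_MB_S = 100          # sustained MB/s per active disk

    raw_capacity_pb = DISKS_PER_RACK * TB_PER_DISK / 1000
    print(f"raw rack capacity ~ {raw_capacity_pb:.2f} PB")          # ~5.24 PB

    # Active disks needed to fill a 40 Gbps rack uplink.
    uplink_mb_s = 40e9 / 8 / 1e6                                    # 40 Gbps -> 5000 MB/s
    print(f"disks for 40 Gbps ~ {uplink_mb_s / GOODPUT_MB_S:.0f}")  # ~50 disks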

Resource domains
- Each domain is provisioned to supply its resource to only a subset of the disks.
- Each disk uses resources from a set of resource domains.
- Domain-conflicting: disks that are in the same resource domain.
- Domain-disjoint: disks that share no common resource domain.
- Pelican domains (hard and soft limits): cooling, power, vibration, bandwidth. 8

Impact of right-provisioning on resources
Systems provisioned for peak performance:
- Any disk can be active at any time.
Right-provisioned system:
- Each disk is part of a domain for each resource.
- A domain supplies limited resources.
- A disk can be active only if there are enough resources in all of its domains.
Pelican resource limitations:
- 2 active out of 16 disks per power domain
- 1 active out of 12 disks per cooling domain
- 1 active out of 2 disks per vibration domain 9
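One way to picture these limits is as a feasibility check over per-domain counters; a minimal sketch, assuming a made-up disk-to-domain mapping purely for illustration (the limits are the slide's numbers, the mapping is not Pelican's real layout):

    from collections import Counter

    # Per-domain limits from slide 9: at most 2 of 16 disks active per power
    # domain, 1 of 12 per cooling domain, 1 of 2 per vibration domain.
    LIMITS = {"power": 2, "cooling": 1, "vibration": 1}

    def domains_of(disk_id):
        """Hypothetical disk -> resource-domain mapping, for illustration only."""
        return {
            "power": disk_id // 16,      # trays of 16 disks share a power domain
            "cooling": disk_id // 12,    # 12 disks share a cooling domain
            "vibration": disk_id // 2,   # pairs of disks share a vibration domain
        }

    def can_activate(disks):
        """True if spinning up this set of disks violates no per-domain limit."""
        for resource, limit in LIMITS.items():
            counts = Counter(domains_of(d)[resource] for d in disks)
            if any(c > limit for c in counts.values()):
                return False
        return True

    print(can_activate({0, 1}))    # False: disks 0 and 1 share a vibration domain
    print(can_activate({0, 17}))   # True under this toy mapping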

Data placement: maximizing request concurrency
In fully-provisioned systems:
- Any two sets of disks can be active.
- Placement has no impact on concurrency.
In right-provisioned systems:
- Sets of disks can conflict in their resource requirements.
- Conflicting sets cannot be concurrently active.
- Need to form sets that minimize the risk of a conflict. 10

Data placement: maximizing request concurrency
- Blobs are erasure-coded (Cauchy Reed-Solomon code) over a set of concurrently active disks.
- Blob sizes range from 200 MB to 1 TB, stored as key-value pairs (the key is a 20-byte identifier).
- A blob is split into 128 kB data fragments.
- The K+R fragments of a stripe are placed on independent disks.
- Off-rack metadata, called the catalog, holds information about blob placement.
- Pelican uses K=15, R=3:
  o Good data durability (tolerates three concurrent disk failures without data loss)
  o Saturates a 10 Gbps network link during a single blob read
  o 18 domain-disjoint disks are available 11
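A minimal sketch of how a blob might be chopped into 128 kB fragments and grouped into K-wide stripes (the Cauchy Reed-Solomon encode step that produces the R redundancy fragments is elided; a real implementation would use an erasure-coding library):

    FRAGMENT_SIZE = 128 * 1024   # 128 kB data fragments
    K, R = 15, 3                 # 15 data + 3 redundancy fragments per stripe

    def make_stripes(blob: bytes):
        """Split a blob into K-fragment stripes; each stripe would then be
        erasure-coded to add R redundancy fragments, and its K+R fragments
        written to independent (domain-disjoint) disks of one group."""
        fragments = [blob[i:i + FRAGMENT_SIZE]
                     for i in range(0, len(blob), FRAGMENT_SIZE)]
        # Pad so every stripe is exactly K fragments wide.
        while len(fragments) % K != 0:
            fragments.append(b"\x00" * FRAGMENT_SIZE)
        return [fragments[i:i + K] for i in range(0, len(fragments), K)]

    stripes = make_stripes(b"x" * (10 * 1024 * 1024))   # a 10 MB toy blob
    print(len(stripes), "stripes of", K, "data fragments each")   # 6 stripes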

Data placement: maximizing request concurrency
Idea: concentrate all conflicts over a few sets of disks.
- Statically partition the disks into groups in which all disks can be concurrently active.
- Property: two groups are either fully conflicting or fully independent.
- A blob is stored entirely within one group.
- The probability of a conflict is proportional to the size of the group. 12
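A minimal sketch of the resulting structure, assuming the 48-groups-of-24 layout from slide 17 and 4 classes of 12 mutually conflicting groups; the disk-to-group mapping here is a toy round-robin and ignores the domain-disjointness the real layout enforces:

    NUM_GROUPS, GROUP_SIZE = 48, 24   # 48 groups of 24 disks (slide 17)
    GROUPS_PER_CLASS = 12             # 4 classes of 12 groups each

    def group_of(disk_id):
        """Toy static disk -> group assignment; the real layout also makes the
        24 disks of a group mutually domain-disjoint."""
        return disk_id // GROUP_SIZE

    def fully_conflicting(group_a, group_b):
        """Groups in the same class share resource domains and cannot be active
        together; groups in different classes are fully independent."""
        return group_a // GROUPS_PER_CLASS == group_b // GROUPS_PER_CLASS

    print(group_of(100))              # disk 100 -> group 4
    print(fully_conflicting(0, 5))    # True: same class
    print(fully_conflicting(0, 13))   # False: different classes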

Data placement: maximizing request concurrency 13

Groups: pros
- All constraints are encapsulated within a group.
- A group defines the concurrency that can be achieved while serving requests.
- Groups span multiple failure domains.
- Groups reduce the time required to recover a failed disk. 14

IO scheduling: spin-up is the new seek
Four independent schedulers; each scheduler manages 12 groups, only one of which can be active at a time.
Naive scheduler (FIFO):
- Group activation takes 10-14 seconds.
- High probability of a spin-up after each request (92%), so time is spent doing spin-ups.
Pelican scheduler:
- Reordering and rate limiting, with a limit on the maximum reordering.
- Trade-off between throughput and fairness.
- Weighted fair-sharing between client and rebuild traffic. 15
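A minimal sketch of the reordering idea, preferring requests for the already-spinning group while bounding how long any request can be bypassed; MAX_REORDER is a made-up parameter name, and rate limiting and the rebuild/client fair-share are omitted:

    from collections import deque

    MAX_REORDER = 500   # made-up bound on how many times a request may be bypassed

    class GroupScheduler:
        """Toy scheduler: prefer requests for the already-active group to avoid
        a 10-14 s group spin-up, but never bypass the head request more than
        MAX_REORDER times (the throughput/fairness trade-off on slide 15)."""

        def __init__(self):
            self.queue = deque()          # items: {"group", "req", "skipped"}
            self.active_group = None

        def submit(self, group_id, request):
            self.queue.append({"group": group_id, "req": request, "skipped": 0})

        def next_request(self):
            if not self.queue:
                return None
            head = self.queue[0]
            if head["skipped"] < MAX_REORDER:
                # Serve a queued request that hits the currently active group.
                for i, item in enumerate(self.queue):
                    if item["group"] == self.active_group:
                        head["skipped"] += 1
                        del self.queue[i]
                        return item["req"]
            # Otherwise serve the head and pay the spin-up cost for its group.
            item = self.queue.popleft()
            self.active_group = item["group"]
            return item["req"]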

Implementation
Problems:
1. Ensuring that the cooling and power constraints are not violated before the OS has finished booting and the Pelican service is managing the disks.
2. Unexpected disk spin-ups: disks would be spun up without a request from Pelican, e.g. by services such as the SMART disk-failure predictor polling the disk.
3. At boot, once the Pelican storage stack controls the drives, it needs to mount and check every drive.
Solutions:
1. Use Power Up In Standby (PUIS): drives establish a PHY link on power-up but require an explicit SATA command to spin up.
2. Add a special "No Access" flag to the Windows Server driver stack that causes all IOs issued to a marked disk to return a "media not ready" error code.
3. Generate an initialization request for each group and schedule it as the first operation performed on that group. 16

Pelican simulator
- 48 groups of 24 disks, divided into 4 classes.
- Blobs stored using 15+3 erasure coding.
- Maximum queue depth is 1,000 requests.
- Each server runs 2 schedulers and is configured with 20 Gbps of network bandwidth. 17

Parameters used in simulation 18

Pelican: tests
- Pelican vs. FP (a fully-provisioned system with all disks active).
- Workload: Poisson process with lambda from 0.125 to 16.
- Metrics: throughput, service time, time to first byte, etc. 19
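A minimal sketch of such a workload generator, assuming exponentially distributed inter-arrival times; the per-request group choice is illustrative:

    import random

    def poisson_workload(lam, duration_s, num_groups=48, seed=0):
        """Yield (timestamp, group) read requests arriving as a Poisson process
        with rate lam, as used to drive the simulated rack."""
        rng = random.Random(seed)
        t = 0.0
        while True:
            t += rng.expovariate(lam)              # exponential inter-arrival times
            if t > duration_s:
                break
            yield t, rng.randrange(num_groups)     # request lands on a random group

    requests = list(poisson_workload(lam=8, duration_s=60))
    print(len(requests), "requests in 60 s at lambda = 8")   # roughly 480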

Throughput 20

Service Time 21

Time to first byte 22

Reject rates 23

Cost of fairness - throughput 24

Cost of fairness - rejected requests 25

Scheduler performance Rebuild time 26

Scheduler performance impact on client requests 27

Disk statistics 28

Rack power consumption
Disk states:
- Standby: 1.56 W
- Spin-up: 21.53 W
- Active: 9.42 W 29
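With only about 8% of the disks spinning (slide 6), the per-state figures above give a rough disk-power estimate; a back-of-the-envelope sketch, where the 96-active-disk split is an assumption derived from the 8% figure and spin-up transients and non-disk power are ignored:

    # Rough disk-power estimate for one rack, using the figures on slide 29.
    TOTAL_DISKS = 1152
    ACTIVE_DISKS = 96                 # ~8% of disks spinning (slide 6), assumed
    STANDBY_W, ACTIVE_W = 1.56, 9.42

    pelican_w = ACTIVE_DISKS * ACTIVE_W + (TOTAL_DISKS - ACTIVE_DISKS) * STANDBY_W
    all_spinning_w = TOTAL_DISKS * ACTIVE_W
    print(f"Pelican disks:      ~{pelican_w / 1000:.1f} kW")        # ~2.6 kW
    print(f"all disks spinning: ~{all_spinning_w / 1000:.1f} kW")   # ~10.9 kW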

Disk utilization 30

Conclusions: pros and cons
+ Reduces capital and operating cost
+ Meets the requirements of the cold data workload
+ Erasure coding provides fault tolerance
- Sensitive to hardware changes
- Tight constraints make it less flexible to change 31

Q&A 32