Cluster-Level Storage @ Google: How we use Colossus to improve storage efficiency. Denis Serenyi, Senior Staff Software Engineer, dserenyi@google.com. November 13, 2017. Keynote at the 2nd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems.

What do you call a few PB of free space?

What do you call a few PB of free space? An emergency low disk space condition

Typical cluster: 10s of thousands of machines; PBs of distributed HDD; optional multi-TB local SSD; 10 GB/s bisection bandwidth.

Part 1: Transition From GFS to Colossus

GFS architectural problems. The GFS master: one machine, not large enough for a large FS; a single bottleneck for metadata operations; fault tolerant, but not HA. Predictable performance: no guarantees of latency.

Some obvious GFSv2 goals: Bigger! Faster! More predictable tail latency. The GFS master is replaced by Colossus; the GFS chunkserver is replaced by D.

Solve an easier problem: a file system for Bigtable. Append-only; single-writer (multi-reader); no snapshot / rename; directories unnecessary. Where to put the metadata?

Storage options back then: GFS; sharded MySQL with local disk & replication (the Ads databases); a local key-value store with Paxos replication (Chubby); Bigtable (a sorted key-value store on GFS).

Storage options back then: GFS; sharded MySQL (lacks useful database features, poor load balancing, complicated); a local key-value store (doesn't scale); Bigtable (hmmmmmm...).

Why Bigtable? Bigtable solves many of the hard problems: it automatically shards data across tablets; locates tablets via metadata lookups; has easy-to-use semantics; and offers efficient point lookups and scans. File system metadata is kept in an in-memory locality group.

Metadata in Bigtable (!?!?) [Architecture diagram: applications use a Bigtable (XX,XXX tabletservers) holding METADATA and user tablets (user1 tablets, user2 tablets, ...); its data is stored in CFS, i.e. a metadata Bigtable (XXX tabletservers) plus XX,XXX D chunkservers; that smaller Bigtable's file system metadata is in turn stored in GFS (GFS master, FS META, and XXX GFS chunkservers).] Note: GFS is still present, storing file system metadata.

GFS master -> CFS. CFS curators run in Bigtable tablet servers. [Diagram: a Bigtable row corresponds to a single file, e.g. /cfs/ex-d/home/denis/myfile, with columns for file attributes (is-finalized?, mtime, ctime, ..., encoding r=3.2) and one group of columns per stripe (stripe 0: checksum, length, chunk0, chunk1, chunk2; stripe 1: checksum, length, chunk0, chunk1, chunk2; stripe 2: OPEN, chunk0, chunk1, chunk2). Stripes are replication groups and can be open, closed, or finalized.]
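
To make the row layout concrete, here is a minimal sketch in Python of how such a file record might be modeled; the class and field names are illustrative assumptions, not the actual CFS schema.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative model of one CFS file as a single Bigtable row.
# Class and field names are assumptions for exposition, not the real schema.

@dataclass
class Chunk:
    d_server: str        # which D server holds this replica (assumed field)

@dataclass
class Stripe:
    state: str           # "OPEN", "CLOSED", or "FINALIZED" -- stripes are replication groups
    checksum: int
    length: int
    chunks: List[Chunk] = field(default_factory=list)   # chunk0, chunk1, chunk2, ...

@dataclass
class FileRow:
    path: str            # row key, e.g. "/cfs/ex-d/home/denis/myfile"
    is_finalized: bool
    mtime: float
    ctime: float
    encoding: str        # e.g. "r=3.2", as on the slide
    stripes: List[Stripe] = field(default_factory=list)

# Appends go to the last, open stripe; once closed/finalized, a new open
# stripe is added to the same row.
row = FileRow(path="/cfs/ex-d/home/denis/myfile", is_finalized=False,
              mtime=0.0, ctime=0.0, encoding="r=3.2",
              stripes=[Stripe(state="OPEN", checksum=0, length=0)])
```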

Colossus for metadata? Metadata is ~1/10,000 the size of the data. So if we host a Colossus on Colossus: 100 PB of data needs 10 TB of metadata; 10 TB of metadata needs 1 GB of metametadata; 1 GB of metametadata needs 100 KB of meta... And now we can put it into Chubby!
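
The arithmetic behind that recursion, using the ~1/10,000 ratio from the slide (a back-of-the-envelope sketch, not a measurement):

```python
# Back-of-the-envelope: metadata is ~1/10,000 the size of the data it
# describes, so metadata-of-metadata shrinks to Chubby scale in a few hops.

RATIO = 10_000
PB, TB, GB, KB = 1e15, 1e12, 1e9, 1e3

size = 100 * PB                     # data in a big cluster
for level in ["metadata", "metametadata", "meta-metametadata"]:
    size /= RATIO
    print(f"{level}: {size:,.0f} bytes")
# metadata:          10,000,000,000,000 bytes  (10 TB)
# metametadata:      1,000,000,000 bytes       (1 GB)
# meta-metametadata: 100,000 bytes             (100 KB -> small enough for Chubby)
```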

Part 2: Colossus and Efficient Storage

Themes: Colossus enables scale and declustering; complementary applications make storage cheaper; placement of data and IO balance is hard.

What does a cluster look like? [Diagram: Machine 1 through Machine XX000, each running a mix of jobs (YouTube Serving, Ads MapReduce, YouTube MapReduce, Gmail, Bigtable, CFS Bigtable) alongside a D Server on every machine.]

Let's talk about money: Total Cost of Ownership. TCO encompasses much more than the retail price of a disk. A denser disk might sell at a premium $/GB but still be cheaper to deploy (power, connection overhead, repairs).
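
A toy illustration of that point; every number below is invented for the example, not a real vendor or Google figure.

```python
# Toy TCO math: raw $/TB vs. deployed $/TB once fixed per-spindle costs
# (slot, power, connection overhead, repairs) are included.
# All prices here are invented for illustration.

def deployed_cost_per_tb(capacity_tb, drive_price, fixed_per_spindle):
    """Total cost of owning one spindle, amortized per TB of capacity."""
    return (drive_price + fixed_per_spindle) / capacity_tb

small = deployed_cost_per_tb(capacity_tb=8,  drive_price=160, fixed_per_spindle=200)  # $20/TB raw
dense = deployed_cost_per_tb(capacity_tb=16, drive_price=400, fixed_per_spindle=200)  # $25/TB raw

print(f"8 TB drive:  ${small:.2f}/TB deployed")   # $45.00/TB
print(f"16 TB drive: ${dense:.2f}/TB deployed")   # $37.50/TB: denser disk wins despite premium $/TB
```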

The ingredients of storage TCO. Most importantly, we care about storage TCO, not disk TCO. Storage TCO is the cost of keeping data durable and available, plus the cost of serving it. We minimize total storage TCO if we keep the disks full and busy.

Which disks should I buy? We'll have a mix because we're growing. We have an overall goal for IOPS and capacity, and we select disks to bring the cluster and the fleet closer to our overall mix.
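
A minimal sketch of that selection rule, assuming the target is expressed as an overall IOPS-per-TB ratio; the disk models and all numbers are made up.

```python
# Sketch: pick the next disk model so the fleet's IOPS/TB moves toward the
# overall target mix. Disk models and numbers are invented for illustration.

DISKS = {
    "small_hdd": {"tb": 4,  "iops": 120},   # more IOPS per TB
    "dense_hdd": {"tb": 16, "iops": 120},   # more capacity per spindle
}

def next_purchase(fleet_tb, fleet_iops, target_iops_per_tb):
    """Return the model that brings the fleet ratio closest to the target."""
    def ratio_after(model):
        d = DISKS[model]
        return (fleet_iops + d["iops"]) / (fleet_tb + d["tb"])
    return min(DISKS, key=lambda m: abs(ratio_after(m) - target_iops_per_tb))

# The fleet is IOPS-rich relative to the target, so capacity is what we need:
print(next_purchase(fleet_tb=1_000, fleet_iops=40_000, target_iops_per_tb=20))  # dense_hdd
```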

What we want. [Diagram: a small disk and a big disk, each holding the same amount of hot data, with the remainder of each disk filled with cold data.] Equal amounts of hot data per disk (every spindle is busy); the rest of each disk filled with cold data (disks are full).

How we get it: Colossus rebalances old, cold data... and distributes newly written data evenly across disks.
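
A toy version of that policy; the helper functions and the balancing heuristic are illustrative only, not Colossus's actual rebalancer.

```python
# Toy placement policy in the spirit of the slides: spread newly written (hot)
# data evenly across spindles so every disk is roughly equally busy, and
# backfill cold data wherever free capacity remains.

from dataclasses import dataclass

@dataclass
class Disk:
    capacity_gb: float
    hot_gb: float = 0.0
    cold_gb: float = 0.0

    @property
    def free_gb(self) -> float:
        return self.capacity_gb - self.hot_gb - self.cold_gb

def place_new_write(disks, size_gb):
    """New data is hot: put it on the disk with the least hot data (even IO load)."""
    target = min(disks, key=lambda d: d.hot_gb)
    target.hot_gb += size_gb

def rebalance_cold(disks, size_gb):
    """Old, cold data goes wherever the most free capacity is (fill disks up)."""
    target = max(disks, key=lambda d: d.free_gb)
    target.cold_gb += size_gb

disks = [Disk(4000), Disk(16000)]        # a small disk and a big disk
for _ in range(100):
    place_new_write(disks, 10)           # hot data ends up split evenly
for _ in range(100):
    rebalance_cold(disks, 100)           # cold data fills up the big disk first
print([(d.hot_gb, d.cold_gb) for d in disks])   # [(500.0, 0.0), (500.0, 10000.0)]
```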

When stuff works well. [Heat-map figure: each box is a D server, sized by disk capacity and colored by spindle utilization.]

Rough scheme: Buy flash for caching to bring IOPS/GB into disk range. Buy disks for capacity and fill them up. Hope that the disks are busy (otherwise we bought too much flash), but not too busy. If we buy disks for IOPS, byte improvements don't help. If cold bytes grow infinitely, we have lots of IO capacity.
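
A back-of-the-envelope version of the first two bullets, with invented numbers: buy disks for capacity, then size the flash cache to absorb whatever IOPS the HDD fleet can't deliver.

```python
# Back-of-the-envelope flash sizing in the spirit of the slide. All numbers
# are invented for illustration, not real workload or hardware figures.

workload_iops     = 2_000_000   # total read IOPS the applications need
workload_bytes_tb = 100_000     # total bytes to store, in TB
hdd_iops_per_tb   = 10          # what a capacity-driven HDD purchase delivers

hdd_iops = workload_bytes_tb * hdd_iops_per_tb          # IOPS the disks can serve
flash_iops_needed = max(0, workload_iops - hdd_iops)    # the cache must absorb the rest

# If the flash cache absorbs IOPS roughly in proportion to its hit rate,
# it needs to serve this fraction of all accesses:
required_hit_rate = flash_iops_needed / workload_iops
print(f"HDDs supply {hdd_iops:,} IOPS; flash must absorb {flash_iops_needed:,} "
      f"({required_hit_rate:.0%} hit rate)")   # 1,000,000 each -> 50% hit rate
```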

Filling up disks is hard: the filesystem doesn't work well when 100% full; we can't remove capacity for upgrades and repairs without empty space; individual groups don't want to run near 100% of quota; administrators are uncomfortable with statistical overcommit; and there is supply chain uncertainty.

Applications must change. Unlike almost anything else in our datacenters, disk I/O cost is going up. Applications that want more accesses than HDDs offer probably need to think about making their hot data hotter (so flash works well) and their cold data colder. An application written X years ago might cause us to buy smaller disks, increasing storage costs.

Conclusion. Colossus has been extremely useful for optimizing our storage efficiency. Metadata scaling enables declustering of resources. The ability to combine disks of various sizes and workloads of varying types is very powerful. Looking forward, I/O cost trends will require both applications and storage systems to evolve.

Thank you! Denis Serenyi, dserenyi@google.com