
RAMCloud and the Low-Latency Datacenter
John Ousterhout, Stanford University

Introduction
Most important driver for innovation in computer systems: the rise of the datacenter.
Phase 1: large scale. Phase 2: low latency.
RAMCloud: a new class of storage for low-latency datacenters.
Potential impact: enable a new class of applications.

How Many Datacenters?
Suppose we capitalize IT at the same level as other infrastructure (power, water, highways, telecom): $1-10K per person, i.e. 1-10 datacenter servers per person?

             U.S.           World
Servers      0.3-3B         7-70B
Datacenters  3,000-30,000   70,000-700,000
(assumes 100,000 servers/datacenter)

Computing in 10 years: most non-mobile computing (i.e. Intel processors) will be in datacenters; devices provide user interfaces.
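(The arithmetic behind the table: roughly 300M people in the U.S. and 7B worldwide, times 1-10 servers per person, gives the server counts; dividing by 100,000 servers per datacenter gives the datacenter counts.)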

Evolution of Datacenters
Phase 1: manage scale.
10,000-100,000 servers within a 50m radius; 1 PB of DRAM; 100 PB of disk storage.
Challenge: how can one application harness thousands of servers? Answer: MapReduce, etc.
But communication latency is high: 300-500µs round-trip times. Data must be processed sequentially to hide latency (e.g. MapReduce), and interactive applications are limited in functionality.

Evolution of Datacenters, cont'd
Phase 2: low latency.
Speed-of-light limit: 1µs. Round-trip time achievable today: 5-10µs. Practical limit (5-10 years): 2-3µs.

Why Does Latency Matter?
[Diagram: a traditional application keeps its UI, application logic, and data structures on a single machine, with << 1µs latency to data; a web application splits the UI and application logic across application servers, which reach storage servers over the datacenter network with 0.5-10ms latency.]
Large-scale apps struggle with high latency: the random-access data rate has not scaled! Facebook: can only make 100-150 internal requests per page.

MapReduce
[Diagram: computation sweeps sequentially over the data.]
Sequential data access yields a high data access rate, but not all applications fit this model, and it is offline.

Goal: Scale and Latency
[Diagram: the same traditional vs. web application split as before, but with 5-10µs latency between application servers and storage servers instead of 0.5-10ms.]
Enable a new class of applications: large-scale graph algorithms (machine learning?); collaboration at scale?

Large-Scale Collaboration
"Region of consciousness" (the data for one user):
Gmail: email for one user.
Facebook: 50-500 friends.
Morning commute: 10,000-100,000 cars.

Getting To Low Latency
Network switches (currently 10-30µs per switch): 10Gbit switches bring this to 500ns per switch; a radical redesign could reach 30ns per switch, but buffering must be eliminated.
Software (currently 60µs total): kernel bypass (direct NIC access from applications) brings this to 2µs; new protocols and threading architectures, to 1µs. A sketch of the kernel-bypass idea follows below.
NIC (currently 2-32µs per transit): optimizing current architectures brings this to 0.75µs per transit; new NIC-CPU integration, to 50ns per transit.
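To make kernel bypass concrete, here is a minimal C++ sketch of the polling fast path, assuming a hypothetical user-space receive ring (RxRing, Packet, and handlePacket are illustrative stand-ins; real stacks expose the equivalent through interfaces such as Infiniband verbs or DPDK):

    #include <cstddef>
    #include <cstdint>

    struct Packet { const uint8_t* data = nullptr; size_t length = 0; };

    // Hypothetical stand-in for a memory-mapped NIC receive ring; a real
    // kernel-bypass stack lets the application do the equivalent of poll()
    // and release() without any system calls on the fast path.
    class RxRing {
    public:
        bool poll(Packet* p) { (void)p; return false; }  // check descriptor ring
        void release(const Packet&) {}                   // return buffer to NIC
    };

    void handlePacket(const Packet&) { /* application-level dispatch */ }

    // Busy-poll instead of blocking in the kernel: dedicating a core to this
    // loop trades CPU for latency, avoiding the interrupt and syscall costs
    // that dominate the ~60µs of conventional networking software.
    void pollLoop(RxRing& ring) {
        Packet p;
        for (;;) {
            if (ring.poll(&p)) {
                handlePacket(p);
                ring.release(p);
            }
        }
    }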

Achievable Round-Trip Latency

Component           Actual Today   Possible Today   5-10 Years
Switching fabric    100-300µs      5µs              0.2µs
Software            60µs           2µs              1µs
NIC                 8-128µs        3µs              0.2µs
Propagation delay   1µs            1µs              1µs
Total               200-400µs      11µs             2.4µs

RAMCloud
A storage system for low-latency datacenters:
General-purpose.
All data always in DRAM (not a cache).
Durable and available.
Scale: 1000+ servers, 100+ TB.
Low latency: 5-10µs remote access.

RAMCloud Architecture
[Diagram: 1000-100,000 application servers, each pairing an application with the RAMCloud library, connect through the datacenter network (high-speed networking: 5µs round-trip, full bisection bandwidth) to 1000-10,000 commodity storage servers, each running a master and a backup with 64-256 GB per server; a coordinator manages the cluster.]

Example Configurations

                    2012     2015-2020
# servers           2000     2000
GB/server           128 GB   512 GB
Total capacity      256 TB   1 PB
Total server cost   $7M      $7M
$/GB                $27      $7

For $100K today: one year of Amazon customer orders (5 TB?); one year of United flight reservations (2 TB?).

Data Model: Key-Value Store
Tables hold objects; each object has a key (up to 64KB), a 64-bit version, and a blob (up to 1MB).
Basic operations:
read(tableid, key) => blob, version
write(tableid, key, blob) => version
delete(tableid, key)
Other operations:
cwrite(tableid, key, blob, version) => version (conditional write: only overwrites if the version matches)
Enumerate objects in a table.
Efficient multi-read, multi-write.
Atomic increment.
Not currently available: atomic updates of multiple objects; secondary indexes.
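As a concrete illustration of this data model, here is a self-contained C++ sketch: KVClient is a hypothetical in-memory stand-in for the operations above (not RAMCloud's actual client API), and increment shows the optimistic read-modify-write pattern that the conditional write enables:

    #include <cstdint>
    #include <map>
    #include <stdexcept>
    #include <string>
    #include <utility>

    struct ReadResult { std::string blob; uint64_t version = 0; };

    // Hypothetical in-memory stand-in for the key-value operations above.
    class KVClient {
        std::map<std::pair<uint64_t, std::string>, ReadResult> store;
    public:
        ReadResult read(uint64_t tableid, const std::string& key) {
            return store.at({tableid, key});       // throws if key is absent
        }
        uint64_t write(uint64_t tableid, const std::string& key,
                       const std::string& blob) {
            ReadResult& r = store[{tableid, key}];
            r.blob = blob;
            return ++r.version;
        }
        // cwrite: only overwrite if the caller's version still matches.
        uint64_t cwrite(uint64_t tableid, const std::string& key,
                        const std::string& blob, uint64_t version) {
            ReadResult& r = store.at({tableid, key});
            if (r.version != version)
                throw std::runtime_error("version mismatch");
            r.blob = blob;
            return ++r.version;
        }
    };

    // Optimistic read-modify-write built from read + cwrite: if another
    // client updated the object (bumping its version) in the meantime, the
    // conditional write fails and we retry. Assumes the object holds a
    // decimal counter, e.g. initialized with write(tableid, key, "0").
    void increment(KVClient& kv, uint64_t tableid, const std::string& key) {
        for (;;) {
            ReadResult r = kv.read(tableid, key);
            std::string next = std::to_string(std::stoull(r.blob) + 1);
            try {
                kv.cwrite(tableid, key, next, r.version);
                return;
            } catch (const std::runtime_error&) { /* lost the race; retry */ }
        }
    }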

RAMCloud Performance
Using Infiniband networking (24 Gb/s, kernel bypass); other networking also supported, but slower.
Reads: 100B objects in 5µs; 10KB objects in 10µs. Single-server throughput (100B objects): 700 Kops/sec. Small-object multi-reads: 1-2M objects/sec.
Durable writes: 100B objects in 16µs; 10KB objects in 40µs. Small-object multi-writes: 400-500K objects/sec.

Data Durability
Objects are (eventually) backed up on disk or flash, using a logging approach:
Each master stores its objects in a log.
The log is divided into segments.
Segments are replicated on multiple backups, and the replicas are scattered across the entire cluster.
For efficiency, updates are buffered on the backups; the buffers are assumed nonvolatile (flushed during power failures).
[Diagram: segments 1-3 of one master's log, each replicated on three backups, e.g. B3/B24/B19, B45/B7/B11, and B12/B3/B28.]
A sketch of this write path follows below.
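Here is a minimal C++ sketch of that write path, under stated assumptions: the segment size and the Backup type are illustrative, not RAMCloud's actual internals, and the backups simply buffer bytes in memory where the real system uses nonvolatile buffers and later flushes whole segments to disk or flash:

    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    constexpr size_t SEGMENT_SIZE = 8 << 20;  // assume 8MB segments

    struct Backup {
        std::vector<uint8_t> buffered;        // stand-in for an NV buffer
        void write(const uint8_t* data, size_t len) {
            buffered.insert(buffered.end(), data, data + len);
        }
    };

    class MasterLog {
        std::vector<uint8_t> head;            // current head segment, in DRAM
        std::vector<Backup*> replicas;        // backups for the head segment
    public:
        explicit MasterLog(std::vector<Backup*> r) : replicas(std::move(r)) {}

        // Append one object: copy it into the in-memory segment, then send
        // the same bytes to every replica before acknowledging the write.
        void append(const uint8_t* data, size_t len) {
            if (head.size() + len > SEGMENT_SIZE)
                rollToNewSegment();
            head.insert(head.end(), data, data + len);
            for (Backup* b : replicas)
                b->write(data, len);
        }
    private:
        void rollToNewSegment() {
            head.clear();  // the real system seals the old segment and picks
                           // fresh, randomly scattered backups for the new one
        }
    };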

1-2 Second Crash Recovery
Each master scatters its segment replicas across the entire cluster. On a crash:
The coordinator partitions the dead master's tablets, and the partitions are assigned to different recovery masters.
Backups read their disks in parallel.
Log data is shuffled from the backups to the recovery masters.
Recovery masters replay the log entries, incorporating the objects into their own logs.
[Diagram: one crashed master, ~1000 backups, ~100 recovery masters.]
Fast recovery: 300 MB/s per recovery master; 40 GB recovered in 1.8 seconds (80 nodes, 160 SSDs). A sketch of the replay step follows below.
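A toy C++ sketch of the fan-out in that replay step; the types are illustrative stand-ins, and partitioning is reduced here to a simple key-hash modulus where the real coordinator partitions by tablet (table plus key-hash range) to balance the recovery work:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct LogEntry {
        uint64_t tableid;
        uint64_t keyHash;   // hash of the object's key
        // ... object bytes would follow in a real log entry
    };

    struct RecoveryMaster {
        std::vector<LogEntry> log;  // entries incorporated into its own log
        void replay(const LogEntry& e) { log.push_back(e); }
    };

    // Spread the dead master's log across many recovery masters so replay
    // runs in parallel: each entry goes to exactly one new owner.
    void recover(const std::vector<LogEntry>& shuffledLogData,
                 std::vector<RecoveryMaster>& masters) {
        for (const LogEntry& e : shuffledLogData) {
            size_t owner = e.keyHash % masters.size();
            masters[owner].replay(e);
        }
    }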

Log-Structured Memory
Don't use malloc for memory management: it wastes 50% of memory. Instead, structure memory as a log:
Allocate by appending.
Use log cleaning to reclaim free space.
Control over pointers allows incremental cleaning.
The log can run efficiently at 80-90% memory utilization. A toy cleaner is sketched below.
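A toy C++ sketch of the cleaning idea, with liveness tracking and victim selection reduced to their simplest form (a real cleaner also updates the hash table to the survivors' new locations, which is exactly why full control over pointers matters):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Entry { uint64_t key; bool live; };           // toy log entry
    struct Segment { std::vector<Entry> entries; };

    class Log {
        std::vector<Segment> segments;                   // back() is the head
        static constexpr size_t ENTRIES_PER_SEGMENT = 1024;
    public:
        // Allocation is just appending to the head segment.
        void append(const Entry& e) {
            if (segments.empty() ||
                segments.back().entries.size() >= ENTRIES_PER_SEGMENT)
                segments.emplace_back();
            segments.back().entries.push_back(e);
        }

        // Clean one segment: pick the one with the fewest live entries,
        // since cleaning cost is proportional to the live data copied
        // forward; relocate survivors to the head and reclaim the victim.
        void cleanOneSegment() {
            if (segments.size() < 2) return;             // only the head exists
            size_t victim = 0, bestLive = SIZE_MAX;
            for (size_t i = 0; i + 1 < segments.size(); i++) {  // skip head
                size_t liveCount = 0;
                for (const Entry& e : segments[i].entries)
                    liveCount += e.live;
                if (liveCount < bestLive) { bestLive = liveCount; victim = i; }
            }
            std::vector<Entry> survivors;
            for (const Entry& e : segments[victim].entries)
                if (e.live) survivors.push_back(e);
            segments.erase(segments.begin() + victim);   // reclaim the space
            for (const Entry& e : survivors)
                append(e);                               // incremental relocation
        }
    };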

New Projects
Data model: can RAMCloud support higher-level features?
Secondary indexes, multi-object transactions, graph operations. What is the impact on scale/latency?
System architecture for datacenter RPC:
Goals: large scale, low latency. The current implementation works, but it is too slow (2µs software overhead per RPC), it sacrifices throughput for latency, and it keeps too much state per connection (1M connections/server in the future?).

Threats to Latency
Layering: great for software structuring, bad for latency. E.g. RAMCloud's threading structure costs 200-300ns/RPC. Virtualization is a potential problem.
Buffering: network buffers are the enemy of latency, and TCP will fill them, no matter how large. Facebook measured 10s of ms of RPC delay because of buffering. New networking architectures with no buffers are needed: substitute switching bandwidth for buffers.

Conclusion
The datacenter revolution is less than half over: scale is here; low latency is coming.
Next steps: new networking architectures; new storage systems.
Ultimate result: amazing new applications.

Extra Slides

The Datacenter Opportunity
An exciting combination of features:
Concentrated compute power (~100,000 machines).
Large amounts of storage: 1 PB of DRAM, 100 PB of disk.
Potential for fast communication: low latency (speed-of-light delay < 1µs), high bandwidth.
Homogeneous.
A controlled environment enables experimentation, e.g. with new network protocols.
A huge Petri dish for innovation over the next decade.