Tailwind: Fast and Atomic RDMA-based Replication. Yacine Taleb, Ryan Stutsman, Gabriel Antoniu, Toni Cortes
1 Tailwind: Fast and Atomic RDMA-based Replication Yacine Taleb, Ryan Stutsman, Gabriel Antoniu, Toni Cortes
2-4 In-Memory Key-Value Stores General-purpose in-memory key-value stores are widely used nowadays. Recent systems can process millions of requests/second/machine, e.g., RAMCloud, FaRM, MICA. Key enablers: eliminating network overheads (e.g., kernel bypass) and leveraging multicore architectures.
5-8 In-Memory Key-Value Stores CPU is becoming a bottleneck in modern kv-stores -> random memory access, key-value GET/PUT processing. Persistent in-memory kv-stores have to replicate data to survive failures -> replication contends with normal request processing.
9-13 In-Memory Key-Value Stores [Figure: a client's PUT(Y) is handled by a primary that is simultaneously serving Replicate(X) traffic from other primaries; every object (X, Y) is kept in multiple copies across machines.]
14-17 Replication in In-Memory Key-Value Stores Replication impedes modern kv-stores and collocated applications (block storage, graph processing, stream processing). How to mitigate replication overhead? Techniques like Remote Direct Memory Access (RDMA) seem promising.
18 Replication in In-Memory Key-Value Stores Replicate X NIC PUT Y Client Replicate X GET Y Client Replicate X X X X Y Y PUT Y Client copies copies 18
19 RDMA-Based Replication Replicate X NIC PUT Y Client Replicate X DMA GET Y Client Replicate X X X X Y Y PUT Y Client copies copies 19
20 RDMA-Based Replication Replicate X NIC Replicate X DMA Replicate X X X X Y Y copies copies 20
21 Existing Protocols Many systems use one-sided RDMA for replication (FaRM NSDI'14, DARE HPDC'15, HydraDB SC'16). The receiver polls a ring buffer (MSG1, MSG2, MSG3), performs integrity checks, and verifies that each message was fully received. This still involves the backup's CPU, which defeats RDMA's purpose! 21
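The receiver-side cost these protocols pay can be sketched as a polling loop. This is an illustrative sketch, not any of those systems' actual code: the sender RDMA-writes records of the form [length | payload | CRC32] into a ring buffer, and the backup's CPU must still poll, validate, and consume them.

```python
import struct
import zlib

RING_SIZE = 1 << 20  # assumed ring-buffer size for the sketch

def poll_ring(ring: bytearray, head: int):
    """Spin until a complete, intact message appears at `head`.

    Returns (payload, new_head). The spinning itself is the CPU cost
    that one-sided replication is supposed to eliminate.
    """
    while True:
        (length,) = struct.unpack_from("<I", ring, head)
        if length == 0:
            continue  # nothing yet: keep burning CPU cycles polling
        payload = bytes(ring[head + 4 : head + 4 + length])
        (crc,) = struct.unpack_from("<I", ring, head + 4 + length)
        if zlib.crc32(payload) == crc:  # integrity check: message fully received
            return payload, (head + 8 + length) % RING_SIZE
        # CRC mismatch: the NIC has not finished placing the bytes; poll again
```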
22 Outline Context Existing RDMA-based replication protocols Tailwind's design Evaluation Conclusion 22
23-27 RDMA-based Replication Why not just perform RDMA and leave the target idle? On PUT(A), the primary replicates A with a one-sided write. But when the backup must recover A, it cannot guess the length of the data: it needs some metadata.
28-30 RDMA-based Replication Metadata A backup needs to ensure the integrity of the backup-log data, plus the length of each entry, e.g., { A@0x1, len(A)=1kB, CRC=0x1 }. The last checksum can cover all previous log entries: { B@0x2, len(B)=2kB, CRC=0x2 } after appending B, then { C@0x4, len(C)=1kB, CRC=0x3 } after appending C.
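The "last checksum covers all previous entries" idea can be sketched by chaining CRC32 values: each appended entry's checksum is seeded with the previous entry's checksum, so verifying only the most recent one validates every byte written so far. Names below are illustrative, not Tailwind's actual layout.

```python
import zlib

def append(log: list, prev_crc: int, data: bytes) -> int:
    """Append `data` with a cumulative CRC32 seeded by the previous CRC."""
    crc = zlib.crc32(data, prev_crc)  # covers all earlier entries too
    log.append((data, crc))
    return crc

def verify(log: list) -> bool:
    """Recompute the chain from the start; only the last CRC is needed."""
    check = 0
    for data, _ in log:
        check = zlib.crc32(data, check)
    return check == log[-1][1]
```

A one-byte corruption anywhere in the log makes the recomputed chain disagree with the final stored checksum.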
31-37 RDMA-based Replication Second Try The primary writes the metadata to a separate remote buffer: on PUT(A), replicate A, then write { A@0x1, len(A)=1kB, CRC=0x1 }. Current RDMA implementations can only target a single, contiguous memory region, so this costs 2x messages; RPCs would be more efficient. Worse: on PUT(B) the primary replicates B, then fails in the midst of updating the metadata to B@0x2, leaving garbage ($##!!!) behind. On recovery, all the backup's data is unusable!
38-40 RDMA-based Replication Third Try The primary replicates each object together with its corresponding metadata in a single RDMA write: first A, then B right after. But if the primary fully replicates C while only partially writing the metadata, the corrupt metadata invalidates the entire backup log.
41-44 RDMA-based Replication Fourth Try The primary replicates A and its corresponding metadata in a single RDMA write. It then partially replicates B and fails. During recovery, the backup checks whether each object was fully received: only fully-received and correct objects are recovered!
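The fourth try's recovery-time scan can be sketched as follows. The record format and names here are assumptions for illustration, not RAMCloud's actual log layout: each record is [4-byte length | payload | 4-byte CRC32], and the scan stops at the first record that is missing bytes or fails its checksum, such as the partially replicated B above.

```python
import struct
import zlib

def recover(buf: bytes) -> list:
    """Walk the backup buffer and keep only fully-received, correct records."""
    recovered, off = [], 0
    while off + 4 <= len(buf):
        (length,) = struct.unpack_from("<I", buf, off)
        end = off + 4 + length + 4
        if length == 0 or end > len(buf):
            break  # record partially written: stop, discard the tail
        payload = buf[off + 4 : off + 4 + length]
        (crc,) = struct.unpack_from("<I", buf, off + 4 + length)
        if crc == 0 or zlib.crc32(payload) != crc:
            break  # corrupt or zeroed checksum: not a valid end of record
        recovered.append(payload)
        off = end
    return recovered
```

If B's bytes were cut short mid-transfer, only A survives the scan; everything before the torn record is still recovered intact.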
45 Tailwind Keeps the same client-facing interface (RPCs). Targets strongly-consistent primary-backup systems. Appends only a 4-byte CRC32 checksum after each record. Relies on Reliable-Connected queue pairs: messages are delivered at most once, in order, and without corruption. Assumes stop failures. 45
46-49 RDMA Buffer Allocation A primary chooses a backup and requests an RDMA buffer from a pool of pre-registered buffers; the first replication request rides on this RPC (Replicate(A) + Get Buffer -> Ack + Buffer). All subsequent replication requests are performed with one-sided RDMA WRITEs (Replicate(B)). When the primary fills a buffer, it chooses a backup and repeats all the steps (RPC: Replicate(C) + Get Buffer -> Ack + Buffer). Full buffers are asynchronously flushed to secondary storage, after which they can be reused.
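The primary-side control flow on these slides can be sketched as a tiny state machine. `rpc_get_buffer` and `rdma_write` below are stand-ins for the real RPC and RDMA verbs, and the buffer size is an assumption; this is a hedged sketch, not Tailwind's implementation.

```python
BUF_SIZE = 1 << 10  # assumed size of a pre-registered backup buffer

class Primary:
    def __init__(self, rpc_get_buffer, rdma_write):
        # RPC: choose a backup and obtain a pre-registered buffer,
        # carrying the first record along with the request.
        self.get_buffer = rpc_get_buffer
        # One-sided RDMA WRITE: backup CPU stays idle.
        self.rdma_write = rdma_write
        self.buf = None
        self.offset = 0

    def replicate(self, record: bytes) -> str:
        if self.buf is None or self.offset + len(record) > BUF_SIZE:
            # First record, or buffer full: fall back to an RPC that also
            # returns a fresh buffer (full buffers are flushed asynchronously).
            self.buf = self.get_buffer(record)
            self.offset = len(record)
            return "rpc"
        self.rdma_write(self.buf, self.offset, record)
        self.offset += len(record)
        return "rdma"
```

Only the first record per buffer pays the RPC cost; everything until the buffer fills goes out as one-sided writes.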
50 Tailwind: Failures Failures can happen at any moment RDMA complicates primary replica failures Secondary replica failures are naturally dealt with in storage systems 50
51-54 Failure Scenarios: Fully Replicated Objects Case 1: the object and its metadata are correctly transferred. Two invariants: the last entry in the buffer must always be a checksum, and checksums are not allowed to be zeroed. 54
55-57 Failure Scenarios: Partially Written Checksum Case 2: partially transferred checksum. Backups re-compute the checksum during recovery and compare it with the stored one. 57
58 Failure Scenarios: Partially Written Object Case 3: partially transferred object. Metadata acts as an end-of-transmission marker. 58
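Case 2 can be made concrete with a minimal check (names assumed, not Tailwind's API): the checksum itself may be torn by an interrupted RDMA write, so the backup recomputes the CRC over the payload and keeps the record only if it matches a complete, non-zero stored value.

```python
import struct
import zlib

def checksum_intact(payload: bytes, stored_crc_bytes: bytes) -> bool:
    """Accept a record only if its stored checksum is complete and matches."""
    if len(stored_crc_bytes) < 4:
        return False  # checksum bytes never arrived: partially written
    (stored,) = struct.unpack("<I", stored_crc_bytes[:4])
    # Zeroed checksums are disallowed by the buffer invariants, so a zero
    # value can only mean unwritten (or torn) metadata.
    return stored != 0 and zlib.crc32(payload) == stored
```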
59 Outline Context Existing RDMA-based replication protocols Tailwind's design Evaluation Conclusion 59
60 Implementation Implemented on the RAMCloud in-memory kv-store Low latency, large scale, strongly consistent RPCs leverage fast networking and kernel bypass Keeps all data in memory, durable storage for backups 60
61 RAMCloud Threading Architecture [Figure: (1) a client sends PUT(B) to primary M1; (2) a worker core appends B to a non-volatile buffer; (3) the dispatch core issues Replicate(B) through the NIC; (4) on backup M3, the dispatch core polls the NIC and a worker core handles the replication, while another client's GET(C) is served concurrently.] 61
62 Evaluation Configuration Yahoo! Cloud Serving Benchmark (workloads A (50% PUT), B (5% PUT), and WRITE-ONLY). 20 million objects + 30-byte keys. Requests generated with a Zipfian distribution. RAMCloud replication vs. Tailwind (3-way replication). 62
63 Evaluation Goals How many CPU cycles can Tailwind save? Does it improve performance? Is there any overhead? 63
64 Evaluation: CPU Savings [Charts: CPU core utilization (dispatch and worker) vs. throughput (KOp/s) for Tailwind and RAMCloud; values aggregated over a 4-node cluster, WRITE-ONLY workload; the gap reaches 3x.] 64
65 Evaluation: Latency Improvement [Charts: median and 99th-percentile latency (µs) vs. throughput (KOp/s) for Tailwind and RAMCloud; median latency improves by 2x, and durable writes take 16 µs under heavy load.] 65
66 Evaluation: Throughput [Chart: throughput per server (KOp/s) in a 4-node cluster vs. number of clients, for YCSB-B (1.3x), YCSB-A, and WRITE-ONLY (1.7x).] Even with only 5% of PUTs, Tailwind increases throughput by 30%. 66
67 Evaluation: Recovery Overhead [Chart: recovery time (s) for Tailwind and RAMCloud, for recoveries of up to 10 million objects (1M, 10M recovery sizes).] 67
68 Related Work One-sided RDMA systems: Pilaf ATC'13, HERD SIGCOMM'14, FaRM NSDI'14, DrTM SOSP'15, DrTM+R EuroSys'16. Mitigating replication overheads / tuning consistency: RedBlue OSDI'12, Correctables OSDI'16. Tailwind reduces the replication CPU footprint and improves performance without sacrificing durability, availability, or consistency. 68
69 Conclusion Tailwind leverages one-sided RDMA to perform replication and leaves backups completely idle. It provides backups with a protocol to protect against failure scenarios, and reduces replication-induced CPU usage while improving throughput and latency. Tailwind preserves the client-facing RPC interface. Thank you! Questions? 69
This presentation covers Gen Z Buffer operations. Buffer operations are used to move large quantities of data between components. 1 2 Buffer operations are used to get data put or push up to 4 GiB of data
More informationTolerating Faults in Disaggregated Datacenters
Tolerating Faults in Disaggregated Datacenters Amanda Carbonari, Ivan Beschastnikh University of British Columbia To appear at HotNets17 Current Datacenters 2 The future: Disaggregation 3 The future: Disaggregation
More informationDistributed File Systems. CS432: Distributed Systems Spring 2017
Distributed File Systems Reading Chapter 12 (12.1-12.4) [Coulouris 11] Chapter 11 [Tanenbaum 06] Section 4.3, Modern Operating Systems, Fourth Ed., Andrew S. Tanenbaum Section 11.4, Operating Systems Concept,
More informationRecovering Disk Storage Metrics from low level Trace events
Recovering Disk Storage Metrics from low level Trace events Progress Report Meeting May 05, 2016 Houssem Daoud Michel Dagenais École Polytechnique de Montréal Laboratoire DORSAL Agenda Introduction and
More informationby I.-C. Lin, Dept. CS, NCTU. Textbook: Operating System Concepts 8ed CHAPTER 13: I/O SYSTEMS
by I.-C. Lin, Dept. CS, NCTU. Textbook: Operating System Concepts 8ed CHAPTER 13: I/O SYSTEMS Chapter 13: I/O Systems I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests
More informationCurrent Topics in OS Research. So, what s hot?
Current Topics in OS Research COMP7840 OSDI Current OS Research 0 So, what s hot? Operating systems have been around for a long time in many forms for different types of devices It is normally general
More informationModule 12: I/O Systems
Module 12: I/O Systems I/O hardwared Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations Performance 12.1 I/O Hardware Incredible variety of I/O devices Common
More informationAlgorithms and Data Structures for Efficient Free Space Reclamation in WAFL
Algorithms and Data Structures for Efficient Free Space Reclamation in WAFL Ram Kesavan Technical Director, WAFL NetApp, Inc. SDC 2017 1 Outline Garbage collection in WAFL Usenix FAST 2017 ACM Transactions
More informationLog-structured Memory for DRAM-based Storage
Abstract Log-structured Memory for DRAM-based Storage Stephen M. Rumble, Ankita Kejriwal, and John Ousterhout Stanford University (Draft Sept. 24, 213) Traditional memory allocation mechanisms are not
More informationArachne. Core Aware Thread Management Henry Qin Jacqueline Speiser John Ousterhout
Arachne Core Aware Thread Management Henry Qin Jacqueline Speiser John Ousterhout Granular Computing Platform Zaharia Winstein Levis Applications Kozyrakis Cluster Scheduling Ousterhout Low-Latency RPC
More informationMultifunction Networking Adapters
Ethernet s Extreme Makeover: Multifunction Networking Adapters Chuck Hudson Manager, ProLiant Networking Technology Hewlett-Packard 2004 Hewlett-Packard Development Company, L.P. The information contained
More informationDB2 purescale: High Performance with High-Speed Fabrics. Author: Steve Rees Date: April 5, 2011
DB2 purescale: High Performance with High-Speed Fabrics Author: Steve Rees Date: April 5, 2011 www.openfabrics.org IBM 2011 Copyright 1 Agenda Quick DB2 purescale recap DB2 purescale comes to Linux DB2
More informationBuffer Management for XFS in Linux. William J. Earl SGI
Buffer Management for XFS in Linux William J. Earl SGI XFS Requirements for a Buffer Cache Delayed allocation of disk space for cached writes supports high write performance Delayed allocation main memory
More informationRAMCloud: Scalable High-Performance Storage Entirely in DRAM John Ousterhout Stanford University
RAMCloud: Scalable High-Performance Storage Entirely in DRAM John Ousterhout Stanford University (with Nandu Jayakumar, Diego Ongaro, Mendel Rosenblum, Stephen Rumble, and Ryan Stutsman) DRAM in Storage
More informationREMOTE PERSISTENT MEMORY ACCESS WORKLOAD SCENARIOS AND RDMA SEMANTICS
13th ANNUAL WORKSHOP 2017 REMOTE PERSISTENT MEMORY ACCESS WORKLOAD SCENARIOS AND RDMA SEMANTICS Tom Talpey Microsoft [ March 31, 2017 ] OUTLINE Windows Persistent Memory Support A brief summary, for better
More informationAt-most-once Algorithm for Linearizable RPC in Distributed Systems
At-most-once Algorithm for Linearizable RPC in Distributed Systems Seo Jin Park 1 1 Stanford University Abstract In a distributed system, like RAM- Cloud, supporting automatic retry of RPC, it is tricky
More informationBig data, little time. Scale-out data serving. Scale-out data serving. Highly skewed key popularity
/7/6 Big data, little time Goal is to keep (hot) data in memory Requires scale-out approach Each server responsible for one chunk Fast access to local data The Case for RackOut Scalable Data Serving Using
More informationFLAT DATACENTER STORAGE CHANDNI MODI (FN8692)
FLAT DATACENTER STORAGE CHANDNI MODI (FN8692) OUTLINE Flat datacenter storage Deterministic data placement in fds Metadata properties of fds Per-blob metadata in fds Dynamic Work Allocation in fds Replication
More informationGoogle File System, Replication. Amin Vahdat CSE 123b May 23, 2006
Google File System, Replication Amin Vahdat CSE 123b May 23, 2006 Annoucements Third assignment available today Due date June 9, 5 pm Final exam, June 14, 11:30-2:30 Google File System (thanks to Mahesh
More information