Cluster Consensus When Aeron Met Raft. Martin Thompson


1 Cluster Consensus When Aeron Met Raft. Martin Thompson


3 What does Consensus mean?

4-5 con·sen·sus, noun \ kən-ˈsen(t)-səs \
: general agreement : unanimity
: the judgment arrived at by most of those concerned
Source:

6 Consensus on what?


9 Raft in a Nutshell

10 Roles Follower Candidate Leader

11 RPCs
1. RequestVote RPC: invoked by candidates to gather votes
2. AppendEntries RPC: invoked by the leader to replicate and heartbeat

12 Safety Guarantees
- Election Safety
- Leader Append-Only
- Log Matching
- Leader Completeness
- State Machine Safety
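The RequestVote rule behind Election Safety can be sketched in a few lines. This is a minimal illustration of the rule as stated in the Raft paper, not code from Aeron Cluster; the class name `VoteDecision` and its fields are hypothetical.

```java
/** Sketch of how a Raft node decides whether to grant a vote (names are illustrative). */
final class VoteDecision
{
    long currentTerm;
    String votedFor;     // null when no vote has been cast in currentTerm
    long lastLogTerm;    // term of the last entry in this node's log
    long lastLogIndex;   // index of the last entry in this node's log

    boolean onRequestVote(
        final long term, final String candidateId, final long candidateLastLogIndex, final long candidateLastLogTerm)
    {
        if (term < currentTerm)
        {
            return false; // candidate is behind: reject
        }

        if (term > currentTerm)
        {
            currentTerm = term; // newer term observed: any earlier vote no longer applies
            votedFor = null;
        }

        // The candidate's log must be at least as up to date as ours.
        final boolean logUpToDate = candidateLastLogTerm > lastLogTerm ||
            (candidateLastLogTerm == lastLogTerm && candidateLastLogIndex >= lastLogIndex);
        final boolean canVote = votedFor == null || votedFor.equals(candidateId);

        if (canVote && logUpToDate)
        {
            votedFor = candidateId; // at most one vote per term
            return true;
        }

        return false;
    }
}
```

Because a node grants at most one vote per term and a candidate needs a majority, two leaders cannot be elected in the same term.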

13 Monotonic Functions

14 Version all the things!

15 Clustering Aeron

16 Is it Guaranteed Delivery???

17 What is the Architect really looking for?

18 Need to know...

19 Guaranteed Processing

20-23 [Diagram build: five Client nodes]

24 NIO Pain!

25 Do servers crash?

26-28
    FileChannel channel = null;
    try
    {
        channel = FileChannel.open(directory.toPath());
    }
    catch (final IOException ignore)
    {
    }

    if (null != channel)
    {
        channel.force(true);
    }

29 Directory Sync Files.force(directory.toPath(), true);
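The directory sync on the slides can be wrapped in a small helper. This is a sketch under assumptions: `DirectorySync` and `syncDirectory` are hypothetical names, and I am not aware of a `force` method on `java.nio.file.Files`, so the helper goes through `FileChannel`. Opening a directory as a channel is platform dependent (it is expected to fail on Windows), which is why the helper reports failure instead of throwing.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class DirectorySync
{
    /**
     * Try to flush directory metadata (for example a newly created file entry) to storage.
     * Returns false when the directory cannot be opened or forced on this platform.
     */
    static boolean syncDirectory(final Path directory)
    {
        // try-with-resources closes the channel, which the slide version never does
        try (FileChannel channel = FileChannel.open(directory, StandardOpenOption.READ))
        {
            channel.force(true);
            return true;
        }
        catch (final IOException ex)
        {
            return false;
        }
    }
}
```

A caller would typically invoke `DirectorySync.syncDirectory(file.getParent())` after creating or renaming a file whose existence must survive a crash.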

30 Performance

31 Let's consider an RPC design approach

32-40 [Diagram build: five Client nodes]

41 Concurrency and parallelism with Replicated State Machines?

42
1. Parallel is the opposite of Serial
2. Concurrent is the opposite of Sequential
3. Vector is the opposite of Scalar
- John Gustafson

43-49 Instruction Pipelining
Time
Fetch     Decode    Execute   Retire
          Fetch     Decode    Execute   Retire
                    Fetch     Decode    Execute   Retire
                              Fetch     Decode    Execute   Retire

50-56 Pipeline
Time
Order     Log       Transmit  Commit    Execute
          Order     Log       Transmit  Commit    Execute
                    Order     Log       Transmit  Commit    Execute
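The gain from overlapping the Order, Log, Transmit, Commit and Execute stages can be put in numbers. The sketch below assumes an idealised pipeline in which every stage takes the same time and never stalls; `PipelineTiming` and its method names are hypothetical.

```java
/** Idealised timing model: equal-length stages, no stalls (an assumption, not a measurement). */
final class PipelineTiming
{
    /** Each item passes through every stage before the next item starts. */
    static long serialTime(final int stages, final int items, final long stageTime)
    {
        return (long)stages * items * stageTime;
    }

    /** A new item enters the first stage as soon as the previous item leaves it. */
    static long pipelinedTime(final int stages, final int items, final long stageTime)
    {
        return items == 0 ? 0 : (long)(stages + items - 1) * stageTime;
    }
}
```

For the three messages shown on the slide, with five stages of one time unit each, the serial approach takes 15 units and the pipelined approach takes 7.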

57-65 [Diagram build: five Client nodes]

66 NIO Pain!

67 ByteBuffer byte[] copies
    ByteBuffer byteBuffer = ByteBuffer.allocate(64 * 1024);
    byteBuffer.putInt(index, value);

68-69 ByteBuffer byte[] copies
    ByteBuffer byteBuffer = ByteBuffer.allocate(64 * 1024);
    byteBuffer.putBytes(index, bytes);
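`ByteBuffer` offers an absolute `putInt(index, value)` but, on older JDKs, no absolute bulk put for a `byte[]`. One common workaround is sketched below; `AbsolutePut.putBytes` is a hypothetical helper, not a JDK or Aeron method. As far as I know, JDK 13 and later provide `ByteBuffer.put(int, byte[])`, which makes the helper unnecessary there.

```java
import java.nio.ByteBuffer;

final class AbsolutePut
{
    /** Copy bytes into the buffer at an absolute index without moving the buffer's own position. */
    static void putBytes(final ByteBuffer buffer, final int index, final byte[] bytes)
    {
        // duplicate() shares the content but has an independent position and limit
        final ByteBuffer view = buffer.duplicate();
        view.position(index);
        view.put(bytes);
    }
}
```

The cost is a short-lived `ByteBuffer` allocation per call.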

70 How can Aeron help?

71 Message Index => Byte Index

72 Multicast, MDC, and Spy based Messaging

73 Counters => Bounded Consumption

74-75 Batching: Amortising Costs
[Chart: average overhead per item or operation in batch, y-axis from 100% down to 0%]
- System calls
- Network round trips
- Disk writes
- Expensive computations
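The shape of the batching curve follows from simple arithmetic. The sketch below assumes the simplest possible model, in which one fixed cost (a system call, a round trip, a disk write) is shared evenly by every item in the batch; `BatchAmortisation` is a hypothetical name and the model is an assumption, not data from the talk.

```java
final class BatchAmortisation
{
    /** Share of the fixed cost carried by each item, as a percentage of the unbatched cost. */
    static double overheadPercentPerItem(final int batchSize)
    {
        if (batchSize < 1)
        {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }

        return 100.0 / batchSize;
    }
}
```

Under this model a batch of 4 leaves each item carrying 25% of the fixed cost, and a batch of 10 leaves 10%.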

76 Interesting Features

77 Timers

78 All state must enter the system as a message!

79-81 Timers
    public void foo()
    {
        // Decide to schedule a timer
        cluster.scheduleTimer(correlationId, cluster.timeMs() + TimeUnit.SECONDS.toMillis(5));
    }

    public void onTimerEvent(final long correlationId, final long timestampMs)
    {
        // Look up the correlationId associated with the timer
    }
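If all state must enter the system as a message, timer expiry has to be driven by time carried in the log rather than by each node's wall clock. The sketch below illustrates that idea only; `DeterministicTimers` and its methods are hypothetical and are not Aeron Cluster's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class DeterministicTimers
{
    // deadline in milliseconds -> correlation ids scheduled for that deadline
    private final TreeMap<Long, List<Long>> byDeadline = new TreeMap<>();

    void schedule(final long correlationId, final long deadlineMs)
    {
        byDeadline.computeIfAbsent(deadlineMs, (k) -> new ArrayList<>()).add(correlationId);
    }

    /** Fire timers using the timestamp of a logged message, never the local wall clock. */
    List<Long> expire(final long logTimestampMs)
    {
        final List<Long> fired = new ArrayList<>();
        final Map<Long, List<Long>> due = byDeadline.headMap(logTimestampMs, true);
        for (final List<Long> ids : due.values())
        {
            fired.addAll(ids);
        }
        due.clear(); // clearing the view removes the fired entries from the backing map

        return fired;
    }
}
```

Replaying the same log through this structure fires the same timers in the same order on every node.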

82 Back Pressure and Stashed Work

83-86 Back Pressure
    public ControlledFragmentAssembler.Action onSessionMessage(
        final DirectBuffer buffer,
        final int offset,
        final int length,
        final long clusterSessionId,
        final long correlationId)
    {
        final ClusterSession session = sessionByIdMap.get(clusterSessionId);
        if (null == session || session.state() == CLOSED)
        {
            // Unknown or closed session: drop the message and move on
            return ControlledFragmentHandler.Action.CONTINUE;
        }

        final long nowMs = cachedEpochClock.time();
        if (session.state() == OPEN && logPublisher.appendMessage(buffer, offset, length, nowMs))
        {
            session.lastActivity(nowMs, correlationId);
            return ControlledFragmentHandler.Action.CONTINUE;
        }

        // The log could not accept the message: abort so it is offered again on the next poll
        return ControlledFragmentHandler.Action.ABORT;
    }
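The effect of returning ABORT can be shown without any Aeron types. The sketch below is a stand-in with hypothetical names (`BackPressureSketch`, a bounded in-memory log): when the log is full, the handler aborts and the message stays at the head of the input to be offered again later.

```java
import java.util.ArrayDeque;

final class BackPressureSketch
{
    enum Action { CONTINUE, ABORT }

    private final ArrayDeque<String> log = new ArrayDeque<>();
    private final int capacity;

    BackPressureSketch(final int capacity)
    {
        this.capacity = capacity;
    }

    Action onMessage(final String message)
    {
        if (log.size() < capacity)
        {
            log.addLast(message);
            return Action.CONTINUE;
        }

        return Action.ABORT; // downstream is full: do not consume the message
    }

    /** Poll the input until it is empty or the handler aborts; returns the number consumed. */
    int poll(final ArrayDeque<String> input)
    {
        int consumed = 0;
        while (!input.isEmpty())
        {
            if (onMessage(input.peekFirst()) == Action.ABORT)
            {
                break; // leave the message in place for the next poll
            }

            input.removeFirst();
            consumed++;
        }

        return consumed;
    }

    /** Simulate downstream progress that frees one slot in the log. */
    void drainOne()
    {
        log.pollFirst();
    }
}
```

No message is lost or reordered; the sender simply sees its input drain more slowly while the log is full.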

87-89 Log Replay and Snapshots
Distributed File System?
Aeron Archive Recorded Streams

90 Multiple s on the same stream

91-92 [Diagram build: five Client nodes]

93 NIO Pain!

94-95 [Diagram: items 1 and 2 labelled MappedByteBuffer and DirectByteBuffer; the second build adds DirectByteBuffer and MappedByteBuffer in the reverse order]

96 In Closing

97 What's the Roadmap?

99 Questions?
"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." - Leslie Lamport

Extend PB for high availability. PB high availability via 2PC. Recall: Primary-Backup. Putting it all together for SMR:

Extend PB for high availability. PB high availability via 2PC. Recall: Primary-Backup. Putting it all together for SMR: Putting it all together for SMR: Two-Phase Commit, Leader Election RAFT COS 8: Distributed Systems Lecture Recall: Primary-Backup Mechanism: Replicate and separate servers Goal #: Provide a highly reliable

More information

Paxos and Raft (Lecture 21, cs262a) Ion Stoica, UC Berkeley November 7, 2016

Paxos and Raft (Lecture 21, cs262a) Ion Stoica, UC Berkeley November 7, 2016 Paxos and Raft (Lecture 21, cs262a) Ion Stoica, UC Berkeley November 7, 2016 Bezos mandate for service-oriented-architecture (~2002) 1. All teams will henceforth expose their data and functionality through

More information

Recall: Primary-Backup. State machine replication. Extend PB for high availability. Consensus 2. Mechanism: Replicate and separate servers

Recall: Primary-Backup. State machine replication. Extend PB for high availability. Consensus 2. Mechanism: Replicate and separate servers Replicated s, RAFT COS 8: Distributed Systems Lecture 8 Recall: Primary-Backup Mechanism: Replicate and separate servers Goal #: Provide a highly reliable service Goal #: Servers should behave just like

More information

Designing for Understandability: the Raft Consensus Algorithm. Diego Ongaro John Ousterhout Stanford University

Designing for Understandability: the Raft Consensus Algorithm. Diego Ongaro John Ousterhout Stanford University Designing for Understandability: the Raft Consensus Algorithm Diego Ongaro John Ousterhout Stanford University Algorithms Should Be Designed For... Correctness? Efficiency? Conciseness? Understandability!

More information

Failures, Elections, and Raft

Failures, Elections, and Raft Failures, Elections, and Raft CS 8 XI Copyright 06 Thomas W. Doeppner, Rodrigo Fonseca. All rights reserved. Distributed Banking SFO add interest based on current balance PVD deposit $000 CS 8 XI Copyright

More information

P2 Recitation. Raft: A Consensus Algorithm for Replicated Logs

P2 Recitation. Raft: A Consensus Algorithm for Replicated Logs P2 Recitation Raft: A Consensus Algorithm for Replicated Logs Presented by Zeleena Kearney and Tushar Agarwal Diego Ongaro and John Ousterhout Stanford University Presentation adapted from the original

More information

Byzantine Fault Tolerant Raft

Byzantine Fault Tolerant Raft Abstract Byzantine Fault Tolerant Raft Dennis Wang, Nina Tai, Yicheng An {dwang22, ninatai, yicheng} @stanford.edu https://github.com/g60726/zatt For this project, we modified the original Raft design

More information

Two phase commit protocol. Two phase commit protocol. Recall: Linearizability (Strong Consistency) Consensus

Two phase commit protocol. Two phase commit protocol. Recall: Linearizability (Strong Consistency) Consensus Recall: Linearizability (Strong Consistency) Consensus COS 518: Advanced Computer Systems Lecture 4 Provide behavior of a single copy of object: Read should urn the most recent write Subsequent reads should

More information

High Performance Managed Languages. Martin Thompson

High Performance Managed Languages. Martin Thompson High Performance Managed Languages Martin Thompson - @mjpt777 Really, what s your preferred platform for building HFT applications? Why would you build low-latency applications on a GC ed platform? Some

More information

To do. Consensus and related problems. q Failure. q Raft

To do. Consensus and related problems. q Failure. q Raft Consensus and related problems To do q Failure q Consensus and related problems q Raft Consensus We have seen protocols tailored for individual types of consensus/agreements Which process can enter the

More information

Raft and Paxos Exam Rubric

Raft and Paxos Exam Rubric 1 of 10 03/28/2013 04:27 PM Raft and Paxos Exam Rubric Grading Where points are taken away for incorrect information, every section still has a minimum of 0 points. Raft Exam 1. (4 points, easy) Each figure

More information

A Quest for Predictable Latency Adventures in Java Concurrency. Martin Thompson

A Quest for Predictable Latency Adventures in Java Concurrency. Martin Thompson A Quest for Predictable Latency Adventures in Java Concurrency Martin Thompson - @mjpt777 If a system does not respond in a timely manner then it is effectively unavailable 1. It s all about the Blocking

More information

hraft: An Implementation of Raft in Haskell

hraft: An Implementation of Raft in Haskell hraft: An Implementation of Raft in Haskell Shantanu Joshi joshi4@cs.stanford.edu June 11, 2014 1 Introduction For my Project, I decided to write an implementation of Raft in Haskell. Raft[1] is a consensus

More information

Distributed ETL. A lightweight, pluggable, and scalable ingestion service for real-time data. Joe Wang

Distributed ETL. A lightweight, pluggable, and scalable ingestion service for real-time data. Joe Wang A lightweight, pluggable, and scalable ingestion service for real-time data ABSTRACT This paper provides the motivation, implementation details, and evaluation of a lightweight distributed extract-transform-load

More information

Distributed Consensus Protocols

Distributed Consensus Protocols Distributed Consensus Protocols ABSTRACT In this paper, I compare Paxos, the most popular and influential of distributed consensus protocols, and Raft, a fairly new protocol that is considered to be a

More information

How to make MySQL work with Raft. Diancheng Wang & Guangchao Bai Staff Database Alibaba Cloud

How to make MySQL work with Raft. Diancheng Wang & Guangchao Bai Staff Database Alibaba Cloud How to make MySQL work with Raft Diancheng Wang & Guangchao Bai Staff Database Engineer @ Alibaba Cloud About me Name: Guangchao Bai Location: Beijing, China Occupation: Staff Database Engineer @ Alibaba

More information

Distributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf

Distributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf Distributed systems Lecture 6: distributed transactions, elections, consensus and replication Malte Schwarzkopf Last time Saw how we can build ordered multicast Messages between processes in a group Need

More information

Percona Live September 21-23, 2015 Mövenpick Hotel Amsterdam

Percona Live September 21-23, 2015 Mövenpick Hotel Amsterdam Percona Live 2015 September 21-23, 2015 Mövenpick Hotel Amsterdam TokuDB internals Percona team, Vlad Lesin, Sveta Smirnova Slides plan Introduction in Fractal Trees and TokuDB Files Block files Fractal

More information

High Performance Managed Languages. Martin Thompson

High Performance Managed Languages. Martin Thompson High Performance Managed Languages Martin Thompson - @mjpt777 Really, what is your preferred platform for building HFT applications? Why do you build low-latency applications on a GC ed platform? Agenda

More information

Intra-cluster Replication for Apache Kafka. Jun Rao

Intra-cluster Replication for Apache Kafka. Jun Rao Intra-cluster Replication for Apache Kafka Jun Rao About myself Engineer at LinkedIn since 2010 Worked on Apache Kafka and Cassandra Database researcher at IBM Outline Overview of Kafka Kafka architecture

More information

REPLICATED STATE MACHINES

REPLICATED STATE MACHINES 5/6/208 REPLICATED STATE MACHINES George Porter May 6, 208 ATTRIBUTION These slides are released under an Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) Creative Commons license These

More information

Designing for Performance. Martin Thompson

Designing for Performance. Martin Thompson Designing for Performance Martin Thompson - @mjpt777 Feynman is becoming a real pain. He has the greatest scientific honesty of anyone I ve ever meet - William P Rogers The impact of QED cannot be overestimated.

More information

CSE 124: REPLICATED STATE MACHINES. George Porter November 8 and 10, 2017

CSE 124: REPLICATED STATE MACHINES. George Porter November 8 and 10, 2017 CSE 24: REPLICATED STATE MACHINES George Porter November 8 and 0, 207 ATTRIBUTION These slides are released under an Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) Creative Commons

More information

Comparative Analysis of Big Data Stream Processing Systems

Comparative Analysis of Big Data Stream Processing Systems Comparative Analysis of Big Data Stream Processing Systems Farouk Salem School of Science Thesis submitted for examination for the degree of Master of Science in Technology. Espoo 22 June, 2016 Thesis

More information

A Reliable Broadcast System

A Reliable Broadcast System A Reliable Broadcast System Yuchen Dai, Xiayi Huang, Diansan Zhou Department of Computer Sciences and Engineering Santa Clara University December 10 2013 Table of Contents 2 Introduction......3 2.1 Objective...3

More information

Apache ZooKeeper and orchestration in distributed systems. Andrew Kondratovich

Apache ZooKeeper and orchestration in distributed systems. Andrew Kondratovich Apache ZooKeeper and orchestration in distributed systems Andrew Kondratovich andrew.kondratovich@gmail.com «A distributed system is one in which the failure of a computer you didn't even know existed

More information

Last time. Distributed systems Lecture 6: Elections, distributed transactions, and replication. DrRobert N. M. Watson

Last time. Distributed systems Lecture 6: Elections, distributed transactions, and replication. DrRobert N. M. Watson Distributed systems Lecture 6: Elections, distributed transactions, and replication DrRobert N. M. Watson 1 Last time Saw how we can build ordered multicast Messages between processes in a group Need to

More information

The Google File System

The Google File System The Google File System By Ghemawat, Gobioff and Leung Outline Overview Assumption Design of GFS System Interactions Master Operations Fault Tolerance Measurements Overview GFS: Scalable distributed file

More information

Improving RAFT. Semester Thesis. Christian Fluri. Distributed Computing Group Computer Engineering and Networks Laboratory ETH Zürich

Improving RAFT. Semester Thesis. Christian Fluri. Distributed Computing Group Computer Engineering and Networks Laboratory ETH Zürich Distributed Computing Improving RAFT Semester Thesis Christian Fluri fluric@ethz.ch Distributed Computing Group Computer Engineering and Networks Laboratory ETH Zürich Supervisors: Georg Bachmeier, Darya

More information

File Systems: Consistency Issues

File Systems: Consistency Issues File Systems: Consistency Issues File systems maintain many data structures Free list/bit vector Directories File headers and inode structures res Data blocks File Systems: Consistency Issues All data

More information

Rectangles All The Way Down. Martin Thompson

Rectangles All The Way Down. Martin Thompson Rectangles All The Way Down Martin Thompson - @mjpt777 The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer

More information

CS 138: Google. CS 138 XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.

CS 138: Google. CS 138 XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. CS 138: Google CS 138 XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. Google Environment Lots (tens of thousands) of computers all more-or-less equal - processor, disk, memory, network interface

More information

The Google File System

The Google File System October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single

More information

Group Replication: A Journey to the Group Communication Core. Alfranio Correia Principal Software Engineer

Group Replication: A Journey to the Group Communication Core. Alfranio Correia Principal Software Engineer Group Replication: A Journey to the Group Communication Core Alfranio Correia (alfranio.correia@oracle.com) Principal Software Engineer 4th of February Copyright 7, Oracle and/or its affiliates. All rights

More information

Designing for Performance. Martin Thompson

Designing for Performance. Martin Thompson Designing for Performance Martin Thompson - @mjpt777 Is it difficult writing software that has good performance? RDD (Resume Driven Development) http://www.semiconductors.org/main/2015_international_technology_roadmap_for_semiconductors_itrs/

More information

GFS: The Google File System. Dr. Yingwu Zhu

GFS: The Google File System. Dr. Yingwu Zhu GFS: The Google File System Dr. Yingwu Zhu Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one big CPU More storage, CPU required than one PC can

More information

Data Consistency and Blockchain. Bei Chun Zhou (BlockChainZ)

Data Consistency and Blockchain. Bei Chun Zhou (BlockChainZ) Data Consistency and Blockchain Bei Chun Zhou (BlockChainZ) beichunz@cn.ibm.com 1 Data Consistency Point-in-time consistency Transaction consistency Application consistency 2 Strong Consistency ACID Atomicity.

More information

Distributed Systems. Day 11: Replication [Part 3 Raft] To survive failures you need a raft

Distributed Systems. Day 11: Replication [Part 3 Raft] To survive failures you need a raft Distributed Systems Day : Replication [Part Raft] To survive failures you need a raft Consensus Consensus: A majority of nodes agree on a value Variations on the problem, depending on assumptions Synchronous

More information

Distributed File Systems

Distributed File Systems Distributed File Systems Today l Basic distributed file systems l Two classical examples Next time l Naming things xkdc Distributed File Systems " A DFS supports network-wide sharing of files and devices

More information

CS 138: Google. CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved.

CS 138: Google. CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved. CS 138: Google CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved. Google Environment Lots (tens of thousands) of computers all more-or-less equal - processor, disk, memory, network interface

More information

File-System Interface

File-System Interface File-System Interface Chapter 10: File-System Interface File Concept Access Methods Directory Structure File-System Mounting File Sharing Protection Objectives To explain the function of file systems To

More information

Building Durable Real-time Data Pipeline

Building Durable Real-time Data Pipeline Building Durable Real-time Data Pipeline Apache BookKeeper at Twitter @sijieg Twitter Background Layered Architecture Agenda Design Details Performance Scale @Twitter Q & A Publish-Subscribe Online services

More information

Recovering from a Crash. Three-Phase Commit

Recovering from a Crash. Three-Phase Commit Recovering from a Crash If INIT : abort locally and inform coordinator If Ready, contact another process Q and examine Q s state Lecture 18, page 23 Three-Phase Commit Two phase commit: problem if coordinator

More information

Introduction to Apache Kafka

Introduction to Apache Kafka Introduction to Apache Kafka Chris Curtin Head of Technical Research Atlanta Java Users Group March 2013 About Me 20+ years in technology Head of Technical Research at Silverpop (12 + years at Silverpop)

More information

In Search of an Understandable Consensus Algorithm

In Search of an Understandable Consensus Algorithm In Search of an Understandable Consensus Algorithm Abstract Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to Paxos, and it is as efficient as Paxos, but its

More information

Applications of Paxos Algorithm

Applications of Paxos Algorithm Applications of Paxos Algorithm Gurkan Solmaz COP 6938 - Cloud Computing - Fall 2012 Department of Electrical Engineering and Computer Science University of Central Florida - Orlando, FL Oct 15, 2012 1

More information

Distributed Systems. replication Johan Montelius ID2201. Distributed Systems ID2201

Distributed Systems. replication Johan Montelius ID2201. Distributed Systems ID2201 Distributed Systems ID2201 replication Johan Montelius 1 The problem The problem we have: servers might be unavailable The solution: keep duplicates at different servers 2 Building a fault-tolerant service

More information

GFS: The Google File System

GFS: The Google File System GFS: The Google File System Brad Karp UCL Computer Science CS GZ03 / M030 24 th October 2014 Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

Exploiting Commutativity For Practical Fast Replication. Seo Jin Park and John Ousterhout

Exploiting Commutativity For Practical Fast Replication. Seo Jin Park and John Ousterhout Exploiting Commutativity For Practical Fast Replication Seo Jin Park and John Ousterhout Overview Problem: replication adds latency and throughput overheads CURP: Consistent Unordered Replication Protocol

More information

Exploiting Commutativity For Practical Fast Replication. Seo Jin Park and John Ousterhout

Exploiting Commutativity For Practical Fast Replication. Seo Jin Park and John Ousterhout Exploiting Commutativity For Practical Fast Replication Seo Jin Park and John Ousterhout Overview Problem: consistent replication adds latency and throughput overheads Why? Replication happens after ordering

More information

CS220 Database Systems. File Organization

CS220 Database Systems. File Organization CS220 Database Systems File Organization Slides from G. Kollios Boston University and UC Berkeley 1.1 Context Database app Query Optimization and Execution Relational Operators Access Methods Buffer Management

More information

Today CSCI Coda. Naming: Volumes. Coda GFS PAST. Instructor: Abhishek Chandra. Main Goals: Volume is a subtree in the naming space

Today CSCI Coda. Naming: Volumes. Coda GFS PAST. Instructor: Abhishek Chandra. Main Goals: Volume is a subtree in the naming space Today CSCI 5105 Coda GFS PAST Instructor: Abhishek Chandra 2 Coda Main Goals: Availability: Work in the presence of disconnection Scalability: Support large number of users Successor of Andrew File System

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung December 2003 ACM symposium on Operating systems principles Publisher: ACM Nov. 26, 2008 OUTLINE INTRODUCTION DESIGN OVERVIEW

More information

CGAR: Strong Consistency without Synchronous Replication. Seo Jin Park Advised by: John Ousterhout

CGAR: Strong Consistency without Synchronous Replication. Seo Jin Park Advised by: John Ousterhout CGAR: Strong Consistency without Synchronous Replication Seo Jin Park Advised by: John Ousterhout Improved update performance of storage systems with master-back replication Fast: updates complete before

More information

Case study: ext2 FS 1

Case study: ext2 FS 1 Case study: ext2 FS 1 The ext2 file system Second Extended Filesystem The main Linux FS before ext3 Evolved from Minix filesystem (via Extended Filesystem ) Features Block size (1024, 2048, and 4096) configured

More information

18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E.

18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E. 18-hdfs-gfs.txt Thu Oct 27 10:05:07 2011 1 Notes on Parallel File Systems: HDFS & GFS 15-440, Fall 2011 Carnegie Mellon University Randal E. Bryant References: Ghemawat, Gobioff, Leung, "The Google File

More information

Case study: ext2 FS 1

Case study: ext2 FS 1 Case study: ext2 FS 1 The ext2 file system Second Extended Filesystem The main Linux FS before ext3 Evolved from Minix filesystem (via Extended Filesystem ) Features Block size (1024, 2048, and 4096) configured

More information

Implementing a NTP-Based Time Service within a Distributed Middleware System

Implementing a NTP-Based Time Service within a Distributed Middleware System Implementing a NTP-Based Time Service within a Distributed Middleware System ACM International Conference on the Principles and Practice of Programming in Java (PPPJ `04) Hasan Bulut 1 Motivation Collaboration

More information

The HAMMER Filesystem DragonFlyBSD Project Matthew Dillon 11 October 2008

The HAMMER Filesystem DragonFlyBSD Project Matthew Dillon 11 October 2008 The HAMMER Filesystem DragonFlyBSD Project Matthew Dillon 11 October 2008 HAMMER Quick Feature List 1 Exabyte capacity (2^60 = 1 million terrabytes). Fine-grained, live-view history retention for snapshots

More information

Three modifications for the Raft consensus algorithm
