The OceanStore Write Path


1 The OceanStore Write Path
Sean C. Rhea, John Kubiatowicz
University of California, Berkeley
June 11, 2002

8 Introduction: the OceanStore Write Path

The Inner Ring
- Acts as the single point of consistency for a file
- Performs write access control and serialization
- Creates archival fragments of new data and disperses them
- Certifies the results of its actions with cryptography

The Second Tier
- Caches certificates and data produced at the inner ring
- Self-organizes into a dissemination tree to share results

The Archival Storage Servers
- Store archival fragments generated by the Inner Ring

The Client Machines
- Create updates and send them to the inner ring
- Wait for responses to come down the dissemination tree

11 Introduction: the OceanStore Write Path (cont'd)

[Figure: timeline of an update flowing from an application replica to the Inner Ring and on to the Archive and other replicas, marking T_req, T_agree, and T_disseminate]

1. A client sends an update to the inner ring
2. The inner ring performs a Byzantine agreement, applying the update
3. The results are sent down the dissemination tree and into the archive (sketched below)
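As a concrete picture of these three steps, here is a heavily simplified sketch in Python. Every class and method name is hypothetical, the agreement and threshold signature are stubs, and the self-organizing dissemination tree is reduced to a flat list; it only shows how the pieces hand an update along.

```python
# A minimal sketch of the three-step write path above. All names are
# hypothetical; agreement and signing are stubbed out.
class InnerRing:
    def __init__(self, tree):
        self.tree = tree

    def handle_update(self, update):            # step 2
        result = self.byzantine_agree(update)   # correct members agree on a result
        cert = self.threshold_sign(result)      # certify it with f+1 shares
        self.tree.disseminate(result, cert)     # step 3

    def byzantine_agree(self, update):
        return ("applied", update)              # stub: serialized result

    def threshold_sign(self, result):
        return ("signature over", result)       # stub: see slide 14

class DisseminationTree:
    def __init__(self, replicas):
        self.replicas = replicas

    def disseminate(self, result, cert):
        for replica in self.replicas:           # the real tree is self-organizing
            replica.apply(result, cert)

class Replica:
    def apply(self, result, cert):
        print("replica got", result, "certified by", cert)

tree = DisseminationTree([Replica(), Replica()])
InnerRing(tree).handle_update("write block 7")  # step 1: a client sends an update
```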

14 Write Path Details

Inner Ring uses Byzantine agreement for fault tolerance
- Up to f of 3f + 1 servers can fail
- We use a modified version of the Castro-Liskov protocol

Inner Ring certifies decisions with proactive threshold signatures (a toy sketch follows below)
- Single public (verification) key
- Each member has a key share which lets it generate signature shares
- Need f + 1 signature shares to generate the full signature
- Independent sets of key shares can be used to control membership

Second Tier and Archive are ignorant of the composition of the Inner Ring
- They know only the single public key
- Allows simple replacement of faulty Inner Ring servers
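To make the mechanics concrete, the toy below splits an RSA private exponent into additive shares, so each member exponentiates by its own share and any combiner multiplies the shares back together. This is a simplification of my own, not OceanStore's scheme: it is n-of-n rather than (f+1)-of-n, has no proactive refresh or share verifiability, uses a trusted dealer, and the key size is absurdly small. It does show the property the slides rely on: no single member ever holds the signing key, and verifiers need only the one public key (N, e). (Requires Python 3.9+ for math.lcm and modular-inverse pow.)

```python
# Toy additive (n-of-n) threshold RSA -- an illustration only, not the
# (f+1)-of-n proactive scheme OceanStore actually uses.
import hashlib
import math
import random

# Tiny RSA parameters for illustration; never use keys this small.
p, q = 1000003, 1000033
N = p * q
lam = math.lcm(p - 1, q - 1)        # Carmichael lambda(N)
e = 65537
d = pow(e, -1, lam)                 # private exponent (held only by the dealer)

def deal_shares(d, n, modulus):
    """Trusted dealer splits d into n additive shares: d = sum(shares) mod modulus."""
    shares = [random.randrange(modulus) for _ in range(n - 1)]
    shares.append((d - sum(shares)) % modulus)
    return shares

def signature_share(msg_rep, share):
    """Each Inner Ring member exponentiates by its own share only."""
    return pow(msg_rep, share, N)

def combine(sig_shares):
    """Combining shares is cheap: just a modular product."""
    sig = 1
    for s in sig_shares:
        sig = (sig * s) % N
    return sig

msg = b"result of agreement #42"
m = int.from_bytes(hashlib.sha1(msg).digest(), "big") % N
key_shares = deal_shares(d, 4, lam)                   # a 4-member inner ring
sig = combine(signature_share(m, ks) for ks in key_shares)
assert sig == pow(m, d, N)                            # same as an ordinary signature
assert pow(sig, e, N) == m                            # verifiers need only (N, e)
```

In the real protocol the shares are also individually verifiable and periodically refreshed (the "proactive" part); the latency remarks on slide 18 below attribute much of the measured cost to forgoing CRT and to that verifiability.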

15 Micro Benchmarks: Update Latency vs. Update Size

[Figure: update latency (ms) vs. update size (kB) for 512-bit and 1024-bit keys; both curves have slope = 0.6 s/MB]

- Use two key sizes to show the effects of Moore's Law on latency
- 512-bit keys are not secure, but are 4x faster
- Gives an upper bound on latency three years from now

18 Micro Benchmarks: Update Latency Remarks

Threshold signatures are expensive
- Takes 6.3 ms to generate a regular 1024-bit signature
- But takes 73.9 ms to generate a 1024-bit threshold signature share
- (Combining shares takes less than 1 ms)

Unfortunately, this is a mathematical fact of life
- Cannot use the Chinese Remainder Theorem in computing shares (4x; see below)
- Making individual shares verifiable is expensive

Almost no research into the performance of threshold cryptography
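The "(4x)" is the standard Chinese Remainder Theorem speedup, which threshold signers must forgo because no single member may know the factorization of the modulus. A rough way to see where the factor comes from (a textbook estimate, not a measurement from the talk): schoolbook modular exponentiation with a k-bit modulus costs about c k^3 bit operations, and CRT trades one k-bit exponentiation for two (k/2)-bit ones.

```latex
% Why forgoing CRT costs about 4x (rough schoolbook estimate):
\[
\underbrace{c\,k^{3}}_{\text{one } k\text{-bit exponentiation}}
\quad\text{vs.}\quad
\underbrace{2\,c\,(k/2)^{3} \;=\; \tfrac{1}{4}\,c\,k^{3}}_{\text{two } (k/2)\text{-bit exponentiations via CRT}}
\]
```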

20 Micro Benchmarks: Throughput vs. Update Size

[Figure: total update operations per second and total bandwidth (MB/s) vs. size of update (kB)]

- Using 1024-bit keys and 60 synchronous clients
- Max throughput is a respectable 5 MB/s
  (Berkeley DB through Java can only do about 7.5 MB/s)
- But we have a problem with small updates: 13 ops/s is atrocious!

25 Batching: A Solution to the Small Update Problem

What if we could combine many small updates into a single batch?

Each Inner Ring member
- Decides the result of each update individually
- Generates a signature share over the results of all of the updates

Saves CPU time
- Generating signature shares is expensive (see the rough cost model below)

Saves network bandwidth
- Each Byzantine agreement requires O(ringsize^2) messages

But makes signatures unwieldy
- Each signature is now O(batchsize) long
- For high throughput, we want batch sizes in the hundreds or thousands
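A back-of-the-envelope model shows why batching helps small updates. The 73.9 ms share-generation cost is the measurement from slide 18; the per-agreement overhead is an assumption made up purely for illustration.

```python
# Rough amortization model for batching small updates.
SHARE_MS = 73.9    # measured: one 1024-bit threshold signature share (slide 18)
AGREE_MS = 25.0    # assumed: fixed Byzantine-agreement overhead per batch

def small_update_ops_per_sec(batch_size):
    # One agreement and one signature share are amortized over the whole batch.
    per_update_ms = (AGREE_MS + SHARE_MS) / batch_size
    return 1000.0 / per_update_ms

for b in (1, 8, 64):
    print(b, round(small_update_ops_per_sec(b), 1))
```

With a batch size of 1 this predicts about 10 ops/s, in the ballpark of the measured 13 ops/s; throughput then grows linearly with batch size until some other resource saturates.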

29 Merkle Trees: Making Batching Efficient

[Figure: a Merkle tree with nodes H_1 through H_15 built over Result 1, Result 2, ..., Result 15; key: H_i = SHA-1(H_2i, H_2i+1); only (n = 15, H_1) is signed, and the verification path for Result 2 is highlighted]

- Build a Merkle tree over the results
- Each node is a hash of its two children
- Sign only the tree size and the top hash
- To verify Result 2, need only the signature plus H_2, H_4, ...
- The signature over any one result is only O(log batchsize) (sketched below)
- Provably secure
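Here is a runnable sketch of the scheme, written by analogy to the figure; the helper names are mine, and the slides' index-based H_i numbering is flattened into per-level lists. It builds the tree with SHA-1 as on the slide, treats (n, root) as the only signed quantity, and checks that one result verifies with O(log n) sibling hashes.

```python
# Minimal Merkle-tree batching sketch: sign only (n, root), verify any one
# result with O(log n) sibling hashes.
import hashlib

def h(*parts: bytes) -> bytes:
    return hashlib.sha1(b"".join(parts)).digest()

def build_tree(results):
    """Return the tree as a list of levels, leaves first, root last."""
    level = [h(r) for r in results]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last node if odd
            level = level + [level[-1]]
        level = [h(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def proof(levels, index):
    """Sibling hashes from leaf `index` up to (but not including) the root."""
    path = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]    # pad the same way build_tree did
        sibling = index ^ 1                # the other child of our parent
        path.append((index % 2, level[sibling]))
        index //= 2
    return path

def verify(result, path, root):
    node = h(result)
    for is_right, sibling in path:
        node = h(sibling, node) if is_right else h(node, sibling)
    return node == root

batch = [b"result %d" % i for i in range(1, 16)]   # 15 results, as in the figure
levels = build_tree(batch)
root = levels[-1][0]                               # only (len(batch), root) is signed
p = proof(levels, 1)                               # O(log n) hashes for Result 2
assert verify(batch[1], p, root)
assert not verify(b"forged result", p, root)
```

The proof for one result carries only ceil(log2 n) hashes (four for n = 15), which is what keeps per-result certificates small at batch sizes in the hundreds or thousands.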

32 Micro Benchmarks: Throughput vs. Update Size (with Batching)

[Figure: total update operations per second and total bandwidth (MB/s) vs. size of update (kB), with and without naive batching]

Batching works great
- Amortizes expensive agreements over many updates
- For small updates, we go from 13.5 ops/s to 76 ops/s
- Introspecting on batch size should further improve small-update throughput

33 Macro Benchmarks: The Andrew Benchmark

[Figure: architecture of the client interface; applications call fopen, fread, fwrite, etc., the Linux kernel turns these into NFS READ, WRITE, GETATTR, etc. for a user-level NFS daemon in a JVM, which issues OSRead, OSUpdate, and OSCreate requests over Tapestry to a replica]

- Built a UNIX file system on top of OceanStore
- Runs as a user-level NFS daemon on Linux
- Applications use the familiar fopen, fwrite, etc.; no recompilation needed
- The kernel translates calls into NFS requests and sends them to the local daemon
- The daemon translates these into OceanStore requests and sends them out on the network (sketched below)
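The daemon's core job is the translation step in the figure. Below is a schematic sketch of that dispatch, with hypothetical names throughout; the real daemon speaks the NFS RPC protocol and the OceanStore client API, neither of which is modeled here.

```python
# Schematic sketch of the loopback daemon's NFS-to-OceanStore dispatch.
class OceanStoreReplica:
    def os_read(self, handle, offset, length): ...
    def os_update(self, handle, offset, data): ...
    def os_create(self, parent_handle, name): ...

def dispatch(replica, nfs_op, args):
    """Map a kernel-issued NFS operation onto an OceanStore request."""
    table = {
        "READ":   replica.os_read,     # NFS READ   -> OSRead
        "WRITE":  replica.os_update,   # NFS WRITE  -> OSUpdate
        "CREATE": replica.os_create,   # NFS CREATE -> OSCreate
    }
    return table[nfs_op](*args)

dispatch(OceanStoreReplica(), "WRITE", ("file-handle", 0, b"new bytes"))
```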

35 Macro Benchmarks: The Andrew Benchmark

Inter-host ping times in milliseconds, source by destination (standard deviations in parentheses):

Source     U. TX         GA Tech       Rice          UW
UCB        45.3 (0.75)   56.5 (0.14)   49.6 (3.1)    20.0 (0.11)
U. TX                    24.1 (0.49)   8.45 (1.5)    61.7 (0.22)
GA Tech                                27.7 (2.2)    59.0 (0.20)
Rice                                                 61.5 (0.69)

- For more realism, we used a nationwide network
  - To find out whether Byzantine agreement is practical in the wide area
- Ran the Andrew Benchmark
  - Simulates a software-development workload
- As controls, we used several competitors
  - Linux user-level NFS daemon: real NFS, ships with Debian GNU/Linux
  - Java-based user-level NFS daemon: uses the disk (not OceanStore)

37 Macro Benchmarks: Local Andrew

[Figure: benchmark time (s) for Linux NFS, Java NFS, 512-bit simple, 1024-bit simple, and 512-bit batching + tentative configurations, broken into Phase 1: Create Directories; Phase 2: Copy Source Tree; Phase 3: Stat All Files; Phase 4: Read All Files; Phase 5: Compile Source Tree]

Simple OceanStore performance is not so hot
- In the local area, NFS is in its element; OceanStore isn't

But with tentative-update support and batching, OceanStore is pretty good
- Tentative updates let the client go on while waiting for agreements
- Batching allows the inner ring to keep up
- Within a factor of two of the Java-based NFS

38 Macro Benchmarks: Nationwide Andrew

[Figure: benchmark time (s) for simple OceanStore (512- and 1024-bit keys) and Linux NFS, broken into the same five phases]

- In the wide area, OceanStore is in its element; NFS isn't
- Even simple OceanStore is nearly within a factor of two
- Numbers with batching and tentative updates are forthcoming
  - They should outperform NFS

43 Conclusion

All the basics of the OceanStore write path are implemented and working
- Not doing full recovery yet

Performance is good
- Single update time is under 100 ms and improves directly with Moore's Law
- Throughput is great for large updates

Batching allows the inner ring to amortize signatures over many updates
- Get large-update throughput with small updates
- Secure and space-efficient

Provides a lot more functionality than the competition
- Higher durability and availability than NFS
- Cryptographic data integrity
- Versioning allows logical undo
