The OceanStore Write Path

The OceanStore Write Path
Sean C. Rhea and John Kubiatowicz
University of California, Berkeley
June 11, 2002

Introduction: the OceanStore Write Path

The Inner Ring
- Acts as the single point of consistency for a file
- Performs write access control and serialization
- Creates archival fragments of new data and disperses them
- Certifies the results of its actions with cryptography

The Second Tier
- Caches certificates and data produced at the inner ring
- Self-organizes into a dissemination tree to share results

The Archival Storage Servers
- Store archival fragments generated by the Inner Ring

The Client Machines
- Create updates and send them to the inner ring
- Wait for responses to come down the dissemination tree

Introduction: the OceanStore Write Path (cont'd)

[Timeline diagram: the client, inner ring, archive, and application replicas, with T_req, T_agree, and T_disseminate marked along the time axis]

1. A client sends an update to the inner ring
2. The inner ring performs a Byzantine agreement, applying the update
3. The results are sent down the dissemination tree and into the archive
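
To make the three steps concrete, here is a minimal client-side sketch in Java. Every type here (InnerRingStub, DisseminationTree, Update, SignedResult, Result) is an illustrative placeholder, not the actual Pond/OceanStore API; the sketch only mirrors the flow described on this slide.

```java
import java.util.concurrent.CompletableFuture;

// Illustrative interfaces only -- not the real Pond API.
interface Update { long id(); }
interface Result {}
interface SignedResult { boolean verify(byte[] ringPublicKey); Result result(); }
interface InnerRingStub { void submit(Update u); }                                           // step 1
interface DisseminationTree { CompletableFuture<SignedResult> awaitResult(long updateId); }  // step 3

public final class WritePathClient {
    private final InnerRingStub innerRing;
    private final DisseminationTree tree;
    private final byte[] ringPublicKey;   // the inner ring's single threshold verification key

    public WritePathClient(InnerRingStub ring, DisseminationTree tree, byte[] key) {
        this.innerRing = ring; this.tree = tree; this.ringPublicKey = key;
    }

    // 1. Send the update to the inner ring (T_req).
    // 2. The ring runs Byzantine agreement and applies the update (T_agree, server side).
    // 3. Accept the result once a certificate from the ring arrives down the
    //    dissemination tree (T_disseminate).
    public CompletableFuture<Result> write(Update update) {
        innerRing.submit(update);
        return tree.awaitResult(update.id()).thenApply(signed -> {
            if (!signed.verify(ringPublicKey))
                throw new SecurityException("result not certified by the inner ring");
            return signed.result();
        });
    }
}
```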

Write Path Details

The Inner Ring uses Byzantine agreement for fault tolerance
- Up to f of 3f + 1 servers can fail
- We use a modified version of the Castro-Liskov protocol

The Inner Ring certifies decisions with proactive threshold signatures
- Single public (verification) key
- Each member has a key share that lets it generate signature shares
- Need f + 1 signature shares to generate a full signature
- Independent sets of key shares can be used to control membership

The Second Tier and Archive are ignorant of the composition of the Inner Ring
- They know only the single public key
- Allows simple replacement of faulty Inner Ring servers
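
An interface-level sketch of the (f + 1)-of-n threshold-signing step. KeyShare, SignatureShare, and ThresholdVerifier are hypothetical names; a real scheme (e.g., Shoup-style threshold RSA) involves substantially more machinery, such as share verification and proactive refresh.

```java
import java.util.List;

// Hypothetical interfaces for threshold signing -- illustrative only.
interface SignatureShare {}
interface KeyShare {
    SignatureShare sign(byte[] digest);   // each inner-ring member holds one key share
}
interface ThresholdVerifier {
    byte[] combine(byte[] digest, List<SignatureShare> shares);  // any f+1 valid shares suffice
    boolean verify(byte[] digest, byte[] signature);             // uses the single public key
}

final class ThresholdCertify {
    // Each inner-ring member runs this after agreement: one share per decision.
    static SignatureShare memberSign(byte[] agreedDigest, KeyShare myShare) {
        return myShare.sign(agreedDigest);
    }

    // Any party holding f+1 valid shares can assemble the ordinary signature,
    // which the second tier and archive check with the public key alone.
    static byte[] combine(byte[] agreedDigest, List<SignatureShare> shares,
                          ThresholdVerifier verifier, int f) {
        if (shares.size() < f + 1)
            throw new IllegalArgumentException("need at least f+1 signature shares");
        byte[] certificate = verifier.combine(agreedDigest, shares);
        if (!verifier.verify(agreedDigest, certificate))
            throw new IllegalStateException("combined signature did not verify");
        return certificate;
    }
}
```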

Micro Benchmarks: Update Latency vs. Update Size

[Graph: update latency (ms) vs. update size (0-32 kB) for 1024-bit and 512-bit keys; both curves grow linearly with slope = 0.6 s/MB]

- Use two key sizes to show effects of Moore's Law on latency
- 512-bit keys are not secure, but are 4x faster
- Gives an upper bound on latency three years from now

Micro Benchmarks: Update Latency Remarks

Threshold signatures are expensive
- Takes 6.3 ms to generate a regular 1024-bit signature
- But takes 73.9 ms to generate a 1024-bit threshold signature share
- (Combining shares takes less than 1 ms)

Unfortunately, this is a mathematical fact of life
- Cannot use the Chinese Remainder Theorem in computing shares (roughly a 4x cost)
- Making individual shares verifiable is expensive

Almost no research into the performance of threshold cryptography
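
For intuition on the 4x figure, a back-of-the-envelope derivation (standard RSA-CRT reasoning, not taken from the talk):

```latex
% Rough cost model: modular exponentiation with a k-bit modulus and a ~k-bit
% exponent takes on the order of k^3 bit operations.
\[
  T_{\text{no CRT}} \;\approx\; c\,k^{3},
  \qquad
  T_{\text{CRT}} \;\approx\; 2\,c\!\left(\tfrac{k}{2}\right)^{3} \;=\; \tfrac{c\,k^{3}}{4},
\]
% so forgoing CRT costs roughly 4x per signature.  Threshold signers hold only
% shares of the private exponent and must never learn p and q, so they cannot
% split the exponentiation modulo p and q.
```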

Micro Benchmarks: Throughput vs. Update Size

[Graph: total update operations per second (0-80) and total bandwidth (0-7 MB/s) vs. update size (2-2048 kB)]

- Using 1024-bit keys, 60 synchronous clients
- Max throughput is a respectable 5 MB/s (Berkeley DB through Java can only do about 7.5 MB/s)
- But we have a problem with small updates: 13 ops/s is atrocious!

Batching: A Solution to the Small Update Problem

What if we could combine many small updates into a single batch?

Each Inner Ring member
- Decides the result of each update individually
- Generates a signature share over the results of all of the updates

Saves CPU time
- Generating signature shares is expensive

Saves network bandwidth
- Each Byzantine agreement requires O(ringsize^2) messages

But makes signatures unwieldy
- Each signature is now O(batchsize) long
- For high throughput, we want batch sizes in the hundreds or thousands
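
A sketch of the naive approach, assuming each result has already been decided and serialized to bytes (names are illustrative, not the Pond code). One expensive signature share covers the whole batch, but a verifier of any single result still needs all O(batchsize) per-result hashes.

```java
import java.security.MessageDigest;
import java.util.List;

// Naive batching sketch: each inner-ring member decides every update in the batch,
// then generates ONE signature share over a digest of all the results.
final class NaiveBatching {
    // Returns the digest each member would feed to its threshold key share.
    static byte[] batchDigest(List<byte[]> decidedResults) throws Exception {
        MessageDigest outer = MessageDigest.getInstance("SHA-1");
        for (byte[] result : decidedResults) {
            byte[] h = MessageDigest.getInstance("SHA-1").digest(result);
            outer.update(h);   // the concatenation of per-result hashes is what gets signed
        }
        return outer.digest();
    }
}
```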

Merkle Trees: Making Batching Efficient

[Diagram: a Merkle tree with root H_1, internal nodes H_2 through H_7, and leaves H_8 through H_15 hashing the individual results; the path for Result 2 is highlighted. Key: H_i = SHA1(H_2i, H_2i+1); sign (n = 15, H_1)]

- Build a Merkle tree over the results
- Each node is a hash of its two children
- Sign only the tree size and the top hash
- To verify Result 2, need only the signature plus H_2, H_4, ...
- Signature over any one result is only O(log batchsize)
- Provably secure
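
A minimal sketch of this construction, assuming a power-of-two batch whose results are already serialized to bytes (class and method names are illustrative, not the Pond code):

```java
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

// Merkle-tree batching sketch. Nodes are stored heap-style: node i has children
// 2i and 2i+1, node 1 is the root, and the leaves hash the individual results.
// Only (n, H_1) need be covered by the threshold signature; each result then
// ships with an O(log batchsize) list of sibling hashes.
final class MerkleBatch {
    private final byte[][] node;   // node[1..2n-1]; node[0] unused
    private final int firstLeaf;

    // Assumes the batch size is a power of two (a real implementation would pad).
    MerkleBatch(List<byte[]> results) throws Exception {
        int leaves = results.size();
        firstLeaf = leaves;                        // leaves occupy indices n..2n-1
        node = new byte[2 * leaves][];
        for (int i = 0; i < leaves; i++)
            node[firstLeaf + i] = sha1(results.get(i));
        for (int i = firstLeaf - 1; i >= 1; i--)   // H_i = SHA1(H_2i || H_2i+1)
            node[i] = sha1(concat(node[2 * i], node[2 * i + 1]));
    }

    byte[] rootHash() { return node[1]; }          // H_1: sign (n, H_1) with the threshold key

    // Sibling hashes along the path from result i (0-based) to the root.
    List<byte[]> proof(int i) {
        List<byte[]> siblings = new ArrayList<>();
        for (int idx = firstLeaf + i; idx > 1; idx /= 2)
            siblings.add(node[idx ^ 1]);           // idx^1 flips the low bit: the sibling
        return siblings;
    }

    // A verifier holding only the signed root recomputes the path bottom-up.
    static boolean verify(byte[] result, int i, int batchSize,
                          List<byte[]> siblings, byte[] signedRoot) throws Exception {
        byte[] h = sha1(result);
        int idx = batchSize + i;
        for (byte[] sib : siblings) {
            h = (idx % 2 == 0) ? sha1(concat(h, sib)) : sha1(concat(sib, h));
            idx /= 2;
        }
        return MessageDigest.isEqual(h, signedRoot);
    }

    private static byte[] sha1(byte[] b) throws Exception {
        return MessageDigest.getInstance("SHA-1").digest(b);
    }
    private static byte[] concat(byte[] a, byte[] b) {
        byte[] out = new byte[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }
}
```

A replica would then ship a result together with its sibling list and the signed (n, H_1) certificate; the verifier recomputes the root from O(log batchsize) hashes.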

Micro Benchmarks: Throughput vs. Update Size (w/ Batching)

[Graph: operations per second (0-80) and total bandwidth (0-7 MB/s) vs. update size (2-2048 kB), with and without naive batching]

Batching works great
- Amortizes expensive agreements over many updates
- For small updates, go from 13.5 ops/s to 76 ops/s
- Introspecting on batch size should further improve small-update throughput

Macro Benchmarks: The Andrew Benchmark

[Diagram: the Andrew Benchmark drives a user-level NFS daemon (JVM) through the client interface (fopen, fread, fwrite, etc.); the Linux kernel issues NFS READ/WRITE/GETATTR calls, which the daemon turns into OceanStore read/update/create requests sent over Tapestry to a replica]

Built a UNIX file system on top of OceanStore
- Runs as a user-level NFS daemon on Linux
- Applications use familiar fopen, fwrite, etc.; no recompilation
- The kernel translates calls to NFS requests and sends them to the local daemon
- The daemon translates them to OceanStore requests and sends them out on the network
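
A heavily simplified sketch of the daemon's dispatch, with entirely hypothetical types (NfsRequest, NfsReply, OceanStoreHandle); the real daemon speaks the NFS RPC protocol over the loopback interface and handles far more procedures and state.

```java
// Illustrative only -- not the actual Pond NFS daemon's API.
interface NfsRequest { String procedure(); byte[] payload(); }
interface NfsReply {}
interface OceanStoreHandle {
    NfsReply applyUpdate(byte[] update);   // routed to the inner ring, waits for a certificate
    NfsReply readLocal(byte[] request);    // served from the locally cached replica
}

final class NfsLoopbackDaemon {
    private final OceanStoreHandle store;
    NfsLoopbackDaemon(OceanStoreHandle store) { this.store = store; }

    // The kernel's NFS client sends each call here over the loopback interface.
    NfsReply handle(NfsRequest req) {
        switch (req.procedure()) {
            case "WRITE":
            case "CREATE":
                return store.applyUpdate(req.payload());   // becomes an OceanStore update
            case "READ":
            case "GETATTR":
            default:
                return store.readLocal(req.payload());     // served from the replica
        }
    }
}
```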

Macro Benchmarks: The Andrew Benchmark (Nationwide Setup)

Inter-host ping times in milliseconds:

                        Destination
    Source     U. TX         GA Tech       Rice          UW
    UCB        45.3 (0.75)   56.5 (0.14)   49.6 (3.1)    20.0 (0.11)
    UTA                      24.1 (0.49)   8.45 (1.5)    61.7 (0.22)
    GA Tech                                27.7 (2.2)    59.0 (0.20)
    Rice                                                 61.5 (0.69)

For more realism, we used a nationwide network
- Find out whether Byzantine agreement is practical in the wide area

Ran the Andrew Benchmark
- Simulates a software development workload

For control, used several competitors
- Linux user-level NFS daemon: real NFS, ships with Debian GNU/Linux
- Java-based user-level NFS daemon: uses disk (not OceanStore)

Macro Benchmarks: Local Andrew

[Bar chart: time (s) for benchmark phases 1-5 (create directories, copy source tree, stat all files, read all files, compile source tree) on Linux NFS, the Java NFS daemon, and OceanStore configurations (512 Simple, 1024 Simple, 512 Batching + Tentative)]

Simple OceanStore performance is not so hot
- In the local area, NFS is in its element; OceanStore isn't

But with tentative update support and batching, OceanStore is pretty good
- Tentative updates let the client go on while waiting for agreements
- Batching allows the inner ring to keep up
- Within a factor of two of Java-based NFS

Macro Benchmarks: Nationwide Andrew

[Bar chart: time (s) for benchmark phases 1-5 on OceanStore (1024 Simple, 512 Simple) and Linux NFS]

- In the wide area, OceanStore is in its element; NFS isn't
- Even simple OceanStore is nearly within a factor of two
- Numbers with batching and tentative updates forthcoming; should outperform NFS

Conclusion

All the basics of the OceanStore write path are implemented and working
- Not doing full recovery yet

Performance is good
- Single update time under 100 ms, improves directly with Moore's Law
- Throughput great for large updates

Batching allows the inner ring to amortize signatures over many updates
- Get large-update throughput with small updates
- Secure and space-efficient

Provides a lot more functionality than the competition
- Higher durability and availability than NFS
- Cryptographic data integrity
- Versioning allows logical undo