The OceanStore Write Path


1 The OceanStore Write Path
Sean C. Rhea, John Kubiatowicz
University of California, Berkeley
June 11, 2002

8 Introduction: the OceanStore Write Path

The Inner Ring
- Acts as the single point of consistency for a file
- Performs write access control and serialization
- Creates archival fragments of new data and disperses them
- Certifies the results of its actions with cryptography

The Second Tier
- Caches certificates and data produced at the inner ring
- Self-organizes into a dissemination tree to share results

The Archival Storage Servers
- Store archival fragments generated by the Inner Ring

The Client Machines
- Create updates and send them to the inner ring
- Wait for responses to come down the dissemination tree

11 Introduction: the OceanStore Write Path (cont'd)

[Figure: timeline of an update flowing from an application replica to the Inner Ring and on to the Archive and other replicas, marking T_req, T_agree, and T_disseminate]

1. A client sends an update to the inner ring
2. The inner ring performs a Byzantine agreement, applying the update
3. The results are sent down the dissemination tree and into the archive (sketched below)
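As a concrete picture of these three steps, here is a heavily simplified sketch in Python. Every class and method name is hypothetical, the agreement and threshold signature are stubs, and the self-organizing dissemination tree is reduced to a flat list; it only shows how the pieces hand an update along.

```python
# A minimal sketch of the three-step write path above. All names are
# hypothetical; agreement and signing are stubbed out.
class InnerRing:
    def __init__(self, tree):
        self.tree = tree

    def handle_update(self, update):            # step 2
        result = self.byzantine_agree(update)   # correct members agree on a result
        cert = self.threshold_sign(result)      # certify it with f+1 shares
        self.tree.disseminate(result, cert)     # step 3

    def byzantine_agree(self, update):
        return ("applied", update)              # stub: serialized result

    def threshold_sign(self, result):
        return ("signature over", result)       # stub: see slide 14

class DisseminationTree:
    def __init__(self, replicas):
        self.replicas = replicas

    def disseminate(self, result, cert):
        for replica in self.replicas:           # the real tree is self-organizing
            replica.apply(result, cert)

class Replica:
    def apply(self, result, cert):
        print("replica got", result, "certified by", cert)

tree = DisseminationTree([Replica(), Replica()])
InnerRing(tree).handle_update("write block 7")  # step 1: a client sends an update
```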

14 Write Path Details

Inner Ring uses Byzantine agreement for fault tolerance
- Up to f of 3f + 1 servers can fail
- We use a modified version of the Castro-Liskov protocol

Inner Ring certifies decisions with proactive threshold signatures (a toy sketch follows below)
- Single public (verification) key
- Each member has a key share which lets it generate signature shares
- Need f + 1 signature shares to generate the full signature
- Independent sets of key shares can be used to control membership

Second Tier and Archive are ignorant of the composition of the Inner Ring
- They know only the single public key
- Allows simple replacement of faulty Inner Ring servers
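To make the mechanics concrete, the toy below splits an RSA private exponent into additive shares, so each member exponentiates by its own share and any combiner multiplies the shares back together. This is a simplification of my own, not OceanStore's scheme: it is n-of-n rather than (f+1)-of-n, has no proactive refresh or share verifiability, uses a trusted dealer, and the key size is absurdly small. It does show the property the slides rely on: no single member ever holds the signing key, and verifiers need only the one public key (N, e). (Requires Python 3.9+ for math.lcm and modular-inverse pow.)

```python
# Toy additive (n-of-n) threshold RSA -- an illustration only, not the
# (f+1)-of-n proactive scheme OceanStore actually uses.
import hashlib
import math
import random

# Tiny RSA parameters for illustration; never use keys this small.
p, q = 1000003, 1000033
N = p * q
lam = math.lcm(p - 1, q - 1)        # Carmichael lambda(N)
e = 65537
d = pow(e, -1, lam)                 # private exponent (held only by the dealer)

def deal_shares(d, n, modulus):
    """Trusted dealer splits d into n additive shares: d = sum(shares) mod modulus."""
    shares = [random.randrange(modulus) for _ in range(n - 1)]
    shares.append((d - sum(shares)) % modulus)
    return shares

def signature_share(msg_rep, share):
    """Each Inner Ring member exponentiates by its own share only."""
    return pow(msg_rep, share, N)

def combine(sig_shares):
    """Combining shares is cheap: just a modular product."""
    sig = 1
    for s in sig_shares:
        sig = (sig * s) % N
    return sig

msg = b"result of agreement #42"
m = int.from_bytes(hashlib.sha1(msg).digest(), "big") % N
key_shares = deal_shares(d, 4, lam)                   # a 4-member inner ring
sig = combine(signature_share(m, ks) for ks in key_shares)
assert sig == pow(m, d, N)                            # same as an ordinary signature
assert pow(sig, e, N) == m                            # verifiers need only (N, e)
```

In the real protocol the shares are also individually verifiable and periodically refreshed (the "proactive" part); the latency remarks on slide 18 below attribute much of the measured cost to forgoing CRT and to that verifiability.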

15 Micro Benchmarks: Update Latency vs. Update Size

[Figure: update latency (ms) vs. update size (kB) for 512-bit and 1024-bit keys; both curves have slope = 0.6 s/MB]

- Use two key sizes to show the effects of Moore's Law on latency
- 512-bit keys are not secure, but are 4x faster
- Gives an upper bound on latency three years from now

18 Micro Benchmarks: Update Latency Remarks

Threshold signatures are expensive
- Takes 6.3 ms to generate a regular 1024-bit signature
- But takes 73.9 ms to generate a 1024-bit threshold signature share
- (Combining shares takes less than 1 ms)

Unfortunately, this is a mathematical fact of life
- Cannot use the Chinese Remainder Theorem in computing shares (4x; see below)
- Making individual shares verifiable is expensive

Almost no research into the performance of threshold cryptography
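The "(4x)" is the standard Chinese Remainder Theorem speedup, which threshold signers must forgo because no single member may know the factorization of the modulus. A rough way to see where the factor comes from (a textbook estimate, not a measurement from the talk): schoolbook modular exponentiation with a k-bit modulus costs about c k^3 bit operations, and CRT trades one k-bit exponentiation for two (k/2)-bit ones.

```latex
% Why forgoing CRT costs about 4x (rough schoolbook estimate):
\[
\underbrace{c\,k^{3}}_{\text{one } k\text{-bit exponentiation}}
\quad\text{vs.}\quad
\underbrace{2\,c\,(k/2)^{3} \;=\; \tfrac{1}{4}\,c\,k^{3}}_{\text{two } (k/2)\text{-bit exponentiations via CRT}}
\]
```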

20 Micro Benchmarks: Throughput vs. Update Size

[Figure: total update operations per second and total bandwidth (MB/s) vs. size of update (kB)]

- Using 1024-bit keys and 60 synchronous clients
- Max throughput is a respectable 5 MB/s
  (Berkeley DB through Java can only do about 7.5 MB/s)
- But we have a problem with small updates: 13 ops/s is atrocious!

25 Batching: A Solution to the Small Update Problem

What if we could combine many small updates into a single batch?

Each Inner Ring member
- Decides the result of each update individually
- Generates a signature share over the results of all of the updates

Saves CPU time
- Generating signature shares is expensive (see the rough cost model below)

Saves network bandwidth
- Each Byzantine agreement requires O(ringsize^2) messages

But makes signatures unwieldy
- Each signature is now O(batchsize) long
- For high throughput, we want batch sizes in the hundreds or thousands
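A back-of-the-envelope model shows why batching helps small updates. The 73.9 ms share-generation cost is the measurement from slide 18; the per-agreement overhead is an assumption made up purely for illustration.

```python
# Rough amortization model for batching small updates.
SHARE_MS = 73.9    # measured: one 1024-bit threshold signature share (slide 18)
AGREE_MS = 25.0    # assumed: fixed Byzantine-agreement overhead per batch

def small_update_ops_per_sec(batch_size):
    # One agreement and one signature share are amortized over the whole batch.
    per_update_ms = (AGREE_MS + SHARE_MS) / batch_size
    return 1000.0 / per_update_ms

for b in (1, 8, 64):
    print(b, round(small_update_ops_per_sec(b), 1))
```

With a batch size of 1 this predicts about 10 ops/s, in the ballpark of the measured 13 ops/s; throughput then grows linearly with batch size until some other resource saturates.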

29 Merkle Trees: Making Batching Efficient

[Figure: a Merkle tree with nodes H_1 through H_15 built over Result 1, Result 2, ..., Result 15; key: H_i = SHA-1(H_2i, H_2i+1); only (n = 15, H_1) is signed, and the verification path for Result 2 is highlighted]

- Build a Merkle tree over the results
- Each node is a hash of its two children
- Sign only the tree size and the top hash
- To verify Result 2, need only the signature plus H_2, H_4, ...
- The signature over any one result is only O(log batchsize) (sketched below)
- Provably secure
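Here is a runnable sketch of the scheme, written by analogy to the figure; the helper names are mine, and the slides' index-based H_i numbering is flattened into per-level lists. It builds the tree with SHA-1 as on the slide, treats (n, root) as the only signed quantity, and checks that one result verifies with O(log n) sibling hashes.

```python
# Minimal Merkle-tree batching sketch: sign only (n, root), verify any one
# result with O(log n) sibling hashes.
import hashlib

def h(*parts: bytes) -> bytes:
    return hashlib.sha1(b"".join(parts)).digest()

def build_tree(results):
    """Return the tree as a list of levels, leaves first, root last."""
    level = [h(r) for r in results]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last node if odd
            level = level + [level[-1]]
        level = [h(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def proof(levels, index):
    """Sibling hashes from leaf `index` up to (but not including) the root."""
    path = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]    # pad the same way build_tree did
        sibling = index ^ 1                # the other child of our parent
        path.append((index % 2, level[sibling]))
        index //= 2
    return path

def verify(result, path, root):
    node = h(result)
    for is_right, sibling in path:
        node = h(sibling, node) if is_right else h(node, sibling)
    return node == root

batch = [b"result %d" % i for i in range(1, 16)]   # 15 results, as in the figure
levels = build_tree(batch)
root = levels[-1][0]                               # only (len(batch), root) is signed
p = proof(levels, 1)                               # O(log n) hashes for Result 2
assert verify(batch[1], p, root)
assert not verify(b"forged result", p, root)
```

The proof for one result carries only ceil(log2 n) hashes (four for n = 15), which is what keeps per-result certificates small at batch sizes in the hundreds or thousands.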

32 Micro Benchmarks: Throughput vs. Update Size (with Batching)

[Figure: total update operations per second and total bandwidth (MB/s) vs. size of update (kB), with and without naive batching]

Batching works great
- Amortizes expensive agreements over many updates
- For small updates, we go from 13.5 ops/s to 76 ops/s
- Introspecting on batch size should further improve small-update throughput

33 Macro Benchmarks: The Andrew Benchmark

[Figure: architecture of the client interface; applications call fopen, fread, fwrite, etc., the Linux kernel turns these into NFS READ, WRITE, GETATTR, etc. for a user-level NFS daemon in a JVM, which issues OSRead, OSUpdate, and OSCreate requests over Tapestry to a replica]

- Built a UNIX file system on top of OceanStore
- Runs as a user-level NFS daemon on Linux
- Applications use the familiar fopen, fwrite, etc.; no recompilation needed
- The kernel translates calls into NFS requests and sends them to the local daemon
- The daemon translates these into OceanStore requests and sends them out on the network (sketched below)
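The daemon's core job is the translation step in the figure. Below is a schematic sketch of that dispatch, with hypothetical names throughout; the real daemon speaks the NFS RPC protocol and the OceanStore client API, neither of which is modeled here.

```python
# Schematic sketch of the loopback daemon's NFS-to-OceanStore dispatch.
class OceanStoreReplica:
    def os_read(self, handle, offset, length): ...
    def os_update(self, handle, offset, data): ...
    def os_create(self, parent_handle, name): ...

def dispatch(replica, nfs_op, args):
    """Map a kernel-issued NFS operation onto an OceanStore request."""
    table = {
        "READ":   replica.os_read,     # NFS READ   -> OSRead
        "WRITE":  replica.os_update,   # NFS WRITE  -> OSUpdate
        "CREATE": replica.os_create,   # NFS CREATE -> OSCreate
    }
    return table[nfs_op](*args)

dispatch(OceanStoreReplica(), "WRITE", ("file-handle", 0, b"new bytes"))
```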

35 Macro Benchmarks: The Andrew Benchmark

Inter-host ping times in milliseconds, source by destination (standard deviations in parentheses):

Source     U. TX         GA Tech       Rice          UW
UCB        45.3 (0.75)   56.5 (0.14)   49.6 (3.1)    20.0 (0.11)
U. TX                    24.1 (0.49)   8.45 (1.5)    61.7 (0.22)
GA Tech                                27.7 (2.2)    59.0 (0.20)
Rice                                                 61.5 (0.69)

- For more realism, we used a nationwide network
  - To find out whether Byzantine agreement is practical in the wide area
- Ran the Andrew Benchmark
  - Simulates a software-development workload
- As controls, we used several competitors
  - Linux user-level NFS daemon: real NFS, ships with Debian GNU/Linux
  - Java-based user-level NFS daemon: uses the disk (not OceanStore)

37 Macro Benchmarks: Local Andrew

[Figure: benchmark time (s) for Linux NFS, Java NFS, 512-bit simple, 1024-bit simple, and 512-bit batching + tentative configurations, broken into Phase 1: Create Directories; Phase 2: Copy Source Tree; Phase 3: Stat All Files; Phase 4: Read All Files; Phase 5: Compile Source Tree]

Simple OceanStore performance is not so hot
- In the local area, NFS is in its element; OceanStore isn't

But with tentative-update support and batching, OceanStore is pretty good
- Tentative updates let the client go on while waiting for agreements
- Batching allows the inner ring to keep up
- Within a factor of two of the Java-based NFS

38 Macro Benchmarks: Nationwide Andrew

[Figure: benchmark time (s) for simple OceanStore (512- and 1024-bit keys) and Linux NFS, broken into the same five phases]

- In the wide area, OceanStore is in its element; NFS isn't
- Even simple OceanStore is nearly within a factor of two
- Numbers with batching and tentative updates are forthcoming
  - They should outperform NFS

43 Conclusion

All the basics of the OceanStore write path are implemented and working
- Not doing full recovery yet

Performance is good
- Single update time is under 100 ms and improves directly with Moore's Law
- Throughput is great for large updates

Batching allows the inner ring to amortize signatures over many updates
- Get large-update throughput with small updates
- Secure and space-efficient

Provides a lot more functionality than the competition
- Higher durability and availability than NFS
- Cryptographic data integrity
- Versioning allows logical undo
