Opportunistic Use of Content Addressable Storage for Distributed File Systems

Niraj Tolia*, Michael Kozuch, M. Satyanarayanan*, Brad Karp, Thomas Bressoud, and Adrian Perrig*
*Carnegie Mellon University; Intel Research Pittsburgh; Denison University
Introduction
- Using a distributed file system on a Wide Area Network is slow!
- However, the number of providers of Content Addressable Storage (CAS) is growing.
- Can we therefore make opportunistic use of these CAS providers to benefit client-server file systems like NFS, AFS, and Coda?
Content Addressable Storage
- Content Addressable Storage is data that is identified by its contents instead of a name.
- [Diagram: a file (Foo.txt) is run through a cryptographic hash to produce its content addressable name, e.g. 0xd58e23b71b1b...]
- An example of a CAS provider is a Distributed Hash Table (DHT).
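A minimal sketch of content addressable naming, following the diagram; Foo.txt is the slide's example file, and SHA-1 is an assumption (it is the hash type named in the recipe example later):

```python
import hashlib

def cas_name(data: bytes) -> str:
    # The content addressable name is simply a cryptographic hash of
    # the data itself; SHA-1 is assumed here, not mandated by the slide.
    return "0x" + hashlib.sha1(data).hexdigest()

with open("Foo.txt", "rb") as f:   # Foo.txt is the slide's example file
    print(cas_name(f.read()))      # e.g. 0xd58e23b71b1b...
```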
Motivation
- Use CAS as a performance enhancement when the file server is remote.
- Convert data transfers from WAN to LAN.
- [Diagram: the client reaches nearby CAS providers over the LAN while the file server sits across the WAN]
Talk Outline
- Introduction
- The CASPER File System
  - Building Blocks: Recipes, Jukeboxes, Recipe Servers
  - Architecture
- Benchmarks and Performance
- Fuzzy Matching
- Conclusions
The CASPER File System
- Can make use of any available CAS provider to improve read performance.
- However, it does not depend on the CAS providers: in the absence of a useful CAS provider, you are no worse off than you originally were.
- Writes are sent directly to the file server, not to the CAS provider.
Recipes
[Diagram: a file's data is divided into blocks; each block's cryptographic hash is its content addressable name, and the list of these names (0x330c7eb274a4..., 0xf13758906c8d..., 0xe13b918d6a50..., 0xf9d09794b6d7..., 0x1deb72e98470...) forms the file's recipe]
Building Blocks: Recipes
- Description of objects in a content addressable way.
- First-class entity in the file system; can be cached.
- Uses XML for data representation; compression used over the network.
- Can be maintained lazily, as they contain version information.
Recipe Example (XML)

  <recipe type="file">
    <metadata>
      <version>00 00 01 04 01</version>
    </metadata>
    <recipe_choice>
      <hash_list hash_type="sha-1" block_type="variable" number="5">
        <hash size="4189">330c7eb274a4...</hash>
        <!-- remaining hash elements elided on the slide -->
      </hash_list>
    </recipe_choice>
  </recipe>
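A sketch of how a recipe server might emit this representation; the element and attribute names are taken from the example above, while the fixed 4 KB block size and the build_recipe helper are illustrative assumptions (the example itself uses variable-size blocks):

```python
import hashlib
from xml.etree import ElementTree as ET

def build_recipe(path: str, block_size: int = 4096) -> str:
    # Build the <recipe> document shown above. Fixed-size blocks are an
    # assumption; the slide's example uses block_type="variable".
    recipe = ET.Element("recipe", type="file")
    metadata = ET.SubElement(recipe, "metadata")
    ET.SubElement(metadata, "version").text = "00 00 01 04 01"  # placeholder
    choice = ET.SubElement(recipe, "recipe_choice")
    blocks = []
    with open(path, "rb") as f:
        while block := f.read(block_size):
            blocks.append(block)
    hash_list = ET.SubElement(choice, "hash_list", hash_type="sha-1",
                              block_type="fixed", number=str(len(blocks)))
    for block in blocks:
        elem = ET.SubElement(hash_list, "hash", size=str(len(block)))
        elem.text = hashlib.sha1(block).hexdigest()
    return ET.tostring(recipe, encoding="unicode")
```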
CASPER Architecture
[Diagram: the client's DFS client reaches the DFS file server and recipe server over a WAN connection, while a CAS provider (jukebox) is reachable over the LAN]
Building Blocks: Jukeboxes
- Jukeboxes are abstractions of a Content Addressable Storage provider.
- Provide access to data based on its hash value.
- Provide no guarantee to consumers about persistence or reliability.
- Support a Query() and Fetch() interface; MultiQuery() and MultiFetch() are also available (see the sketch below).
- Examples include your desktop, a departmental jukebox, P2P systems, etc.
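A sketch of that interface in Python; the slide names only the four operations, so the class, the signatures, and the naive batched defaults are assumptions:

```python
from abc import ABC, abstractmethod
from typing import List, Optional

class Jukebox(ABC):
    # The slide promises no persistence or reliability, so every result
    # is best-effort and must be verified by the caller.

    @abstractmethod
    def query(self, hash_: str) -> bool:
        """Does the jukebox currently hold a block with this hash?"""

    @abstractmethod
    def fetch(self, hash_: str) -> Optional[bytes]:
        """Return the block if present, else None."""

    # Batched forms; a real provider would answer these in one round trip.
    def multi_query(self, hashes: List[str]) -> List[bool]:
        return [self.query(h) for h in hashes]

    def multi_fetch(self, hashes: List[str]) -> List[Optional[bytes]]:
        return [self.fetch(h) for h in hashes]
```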
Building Blocks: Recipe Server
- This module generates the recipe representation of files present in the underlying file system.
- Can be placed either on the distributed file system server or on any other machine well connected to it.
- Helps maintain consistency by informing the client of changes in files that it has reconstructed.
CASPER details
- The file system is based on Coda: whole-file caching, open-close consistency.
- A proxy-based layering approach is used.
- Coda takes care of consistency, conflict detection, resolution, etc.
- The file server is the final authoritative source.
- CASPER lets us service cache misses faster than would usually be possible.
CASPER Architecture
[Diagram: on the client, a DFS proxy is interposed between the Coda client and the WAN connection; the Coda file server and recipe server sit across the WAN, while the CAS provider (jukebox) is on the LAN]
CASPER Implementation
[Diagram: message flow between the Coda client, DFS proxy, recipe server, jukebox, and Coda file server]
1. File Request (Coda client to DFS proxy)
2. Recipe Request (proxy to recipe server, over the WAN)
3. Recipe Response
4. CAS Request (proxy to jukebox, over the LAN)
5. CAS Response
6. Missed Block Request (proxy to Coda file server, over the WAN)
7. Missed Block Response
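A sketch of that flow from the proxy's point of view, reusing the Jukebox sketch above; recipe_server and file_server and their methods are hypothetical stand-ins for the real RPCs:

```python
import hashlib

def service_cache_miss(path, recipe_server, jukebox, file_server):
    # Steps 2-3: fetch the file's recipe over the WAN (hypothetical RPC).
    recipe = recipe_server.get_recipe(path)
    blocks = {}
    # Step 4: one batched query to learn what the jukebox holds.
    for h, hit in zip(recipe.block_hashes,
                      jukebox.multi_query(recipe.block_hashes)):
        if hit:
            data = jukebox.fetch(h)  # step 5: cheap LAN fetch
            # Verify against the recipe; the jukebox promises nothing.
            if data is not None and hashlib.sha1(data).hexdigest() == h:
                blocks[h] = data
    # Steps 6-7: fall back to the file server for missed blocks (WAN).
    missing = [h for h in recipe.block_hashes if h not in blocks]
    blocks.update(file_server.fetch_blocks(path, missing))  # hypothetical RPC
    return b"".join(blocks[h] for h in recipe.block_hashes)
```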
Talk Outline
- Introduction
- The CASPER File System
  - Building Blocks: Recipes, Jukeboxes, Recipe Servers
  - Architecture
- Benchmarks and Performance
- Fuzzy Matching
- Conclusions
Experimental Setup
[Diagram: client and jukebox connect through a NIST Net router to the file server + recipe server]
- WAN bandwidth limitations between the server and client were controlled using NIST Net: 10 Mb/s, 1 Mb/s + 10 ms, and 100 Kb/s + 100 ms.
- The hit-ratio on the jukebox was set to 100%, 66%, 33%, and 0%.
- Clients began with a cold cache (no files or recipes).
Benchmarks
- Binary Install Benchmark
- Virtual Machine Migration
- Modified Andrew Benchmark
Benchmark Description
- Binary Install (RPM-based)
  - Installed RPMs for the Mozilla 1.1 browser: 6 RPMs, total size of 13.5 MB.
- Virtual Machine Migration
  - Time taken to resume a migrated virtual machine and execute an MS Office-based benchmark.
  - Trace accesses ~1000 files @ 256 KB each; no think time modeled.
Benchmark Description (II)
- Modified Andrew Benchmark
  - Phases: Create Directory, Copy, Scan Directory, Read All, and Make. However, only the Copy phase will exhibit an improvement.
  - Uses Apache 1.3.27: 11.36 MB source tree, 977 files.
  - 53% of files are less than 4 KB and 71% are less than 8 KB in size.
Mozilla (RPM) Install
[Chart: runtime normalized against vanilla Coda at 100 Kb/s, 1 Mb/s, and 10 Mb/s for jukebox hit-ratios of 100%, 66%, 33%, and 0%; baselines: 1238 sec, 150 sec, 44 sec]
- Gain most pronounced at lower bandwidth.
- Very low overhead seen for these experiments (between 1-5%).
Virtual Machine Migration
[Chart: runtime normalized against vanilla Coda at 100 Kb/s, 1 Mb/s, and 10 Mb/s for jukebox hit-ratios of 100%, 66%, 33%, and 0%; baselines: 21523 sec, 2046 sec, 203 sec]
- Large amounts of data show benefit even at higher bandwidths.
- High overhead seen at 10 Mb/s is an artifact of data buffering.
Andrew Benchmark
[Chart: runtime normalized against vanilla Coda at 100 Kb/s, 1 Mb/s, and 10 Mb/s for jukebox hit-ratios of 100%, 66%, 33%, and 0%; off-scale bars labeled 1.5, 1.6, and 1.8; baselines: 1150 sec, 103 sec, 13 sec]
- Only the Copy phase shows benefits.
- More than half of the files are fetched over the WAN without talking to the jukebox because of their small size.
Commonality
- The question of where and how much commonality can be found is still open.
- However, a number of applications will benefit from this approach, including:
  - Virtual machine migration
  - Binary installs and upgrades
  - Software development
Mozilla binary commonality
[Chart: pairwise commonality (0-100%) across Mozilla builds mozilla-16 through mozilla-25]
Linux kernel commonality
[Chart: pairwise commonality (0-100%) across Linux kernel versions 2.2.0 through 2.2.24]
Related Work
- Delta encoding: rsync, HTTP, etc.
- Distributed file systems: NFS, AFS, Coda, etc.
- P2P content addressable networks: Chord, Pastry, Freenet, CAN, etc.
- Hash-based storage and file systems: Venti, LBFS, Ivy, EMC's Centera, Farsite, etc.
Conclusions
- Introduced the concept of recipes.
- Demonstrated the benefits of opportunistic use of content providers by traditional distributed file systems on WANs.
- Introduced Fuzzy Matching.
Backup Slides
Where did the time go?
- For the Andrew benchmark, reconstruction of a large number of small files takes 4 round trips.
- There is also the overhead of compression, verification, etc.
- Part of the system (CAS requests) can be optimized by performing work in parallel, as sketched below.
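A sketch of that parallelization, reusing the hypothetical Jukebox interface from earlier; the thread pool and worker count are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_fetch(jukebox, hashes, workers: int = 8):
    # Issue CAS fetches concurrently rather than paying one LAN round
    # trip per block in sequence; `workers` is an illustrative choice.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(hashes, pool.map(jukebox.fetch, hashes)))
```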
Number of round trips
[Diagram: the same message flow as before, grouped by round trip]
1. Recipe Request / Recipe Response (proxy to recipe server, WAN)
2-3. CAS Request / CAS Response (proxy to jukebox, LAN)
4. Missed Block Request / Missed Block Response (proxy to Coda file server, WAN)
Absolute Andrew

Andrew Benchmark: Copy Performance in seconds (standard deviation in parentheses)

  Jukebox Hit-Ratio    Network Bandwidth
                       100 Kb/s        1 Mb/s         10 Mb/s
  100%                  261.3 (0.5)     40.3 (0.9)    17.3 (1.2)
  66%                   520.7 (0.5)     64.0 (0.8)    20.0 (1.6)
  33%                   762.7 (0.5)     85.0 (0.8)    21.3 (0.5)
  0%                   1069.3 (1.7)    108.7 (0.5)    23.7 (0.5)
  Baseline             1150.7 (0.5)    103.3 (0.5)    13.3 (0.5)
NFS Implementation?
- The benchmark results would not change significantly (with the possible exception of the Virtual Machine migration benchmark).
- It is definitely possible to adopt a similar approach; in fact, an NFS proxy (without CASPER) exists.
- However, the semantics of such a system are still unclear.
Fuzzy Matching
- Question: can we convert an incorrect block into what we need?
- If there is a block that is close to what is needed, treat the difference as a transmission error and fix it by applying an error-correcting code.
- Fuzzy Matching needs three components:
  - Exact hash
  - Fuzzy hash
  - ECC information
Fuzzy Hashes
[Diagram: file data yields shingleprints (4 7 8 9 2 0 3), which are sorted (0 2 3 4 7 8 9) to form the fuzzy hash]
Fuzzy Hashing and ECCs
- A fuzzy hash could simply be a shingle:
  - Hash a number of features with a sliding window.
  - Take the first m hashes after sorting to be representative of the data.
  - These m shingles are used to find near blocks.
- After finding a similar block, an ECC that tolerates a number of changes could be applied to recover the original block.
- There is a definite tradeoff between recipe size and Fuzzy Matching, but this approach is promising (see the sketch below).
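A minimal sketch of such a shingleprint fuzzy hash; the 16-byte window, the choice of m = 8, and the similarity helper are illustrative assumptions, not values from the talk:

```python
import hashlib

def fuzzy_hash(data: bytes, window: int = 16, m: int = 8) -> tuple:
    # Hash every `window`-byte sliding window (a "feature"), sort the
    # resulting shingleprints, and keep the first m as the fuzzy hash.
    shingles = set()
    for i in range(len(data) - window + 1):
        digest = hashlib.sha1(data[i:i + window]).digest()
        shingles.add(int.from_bytes(digest[:8], "big"))
    return tuple(sorted(shingles)[:m])

def similarity(a: tuple, b: tuple) -> float:
    # Overlap between two fuzzy hashes -- a rough proxy for "nearness"
    # when hunting for a block to error-correct (hypothetical helper).
    return len(set(a) & set(b)) / max(len(set(a) | set(b)), 1)
```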
Linux kernel commonality - 2.4
[Chart: pairwise commonality (0-100%) across Linux kernel versions 2.4.0 through 2.4.20]
Linux kernel 2.2 (B/W)
[Chart: black-and-white rendering of the Linux kernel 2.2 commonality data]