Reclaiming Space from Duplicate Files in a Serverless Distributed File System

Size: px

Start display at page:

Download "Reclaiming Space from Duplicate Files in a Serverless Distributed File System"

Andrea Reeves
6 years ago
Views:

1 Reclaiming Space rom Duplicate Files in a Serverless Distributed File System John R. Douceur Atul Adya William J. Bolosky Dan Simon Marvin Theimer July 22 Technical Report MSR-TR-22-3 Microsot Research Microsot Corporation One Microsot Way Redmond, WA 9852

2 Reclaiming Space rom Duplicate Files in a Serverless Distributed File System* John R. Douceur, Atul Adya, William J. Bolosky, Dan Simon, Marvin Theimer Microsot Research {johndo, adya, bolosky, dansimon, theimer}@microsot.com Abstract The Farsite distributed ile system provides availability by replicating each ile onto multiple desktop computers. Since this replication consumes signiicant storage space, it is important to reclaim used space where possible. Measurement o over 5 desktop ile systems shows that nearly hal o all consumed space is occupied by duplicate iles. We present a mechanism to reclaim space rom this incidental duplication to make it available or controlled ile replication. Our mechanism includes ) convergent encryption, which enables duplicate iles to coalesced into the space o a single ile, even i the iles are encrypted with dierent users keys, and 2) SALAD, a Sel- Arranging, Lossy, Associative Database or aggregating ile content and location inormation in a decentralized, scalable, ault-tolerant manner. Large-scale simulation experiments show that the duplicate-ile coalescing system is scalable, highly eective, and ault-tolerant.. Introduction This paper addresses the problems o identiying and coalescing identical iles in the Farsite [8] distributed ile system, or the purpose o reclaiming storage space consumed by incidentally redundant content. Farsite is a secure, scalable, serverless ile system that logically unctions as a centralized ile server but that is physically distributed among a networked collection o desktop workstations. Since desktop machines are not always on, not centrally managed, and not physically secured, the space reclamation process must tolerate a high rate o system ailure, operate without central coordination, and unction in tandem with cryptographic security. Farsite s intended purpose is to provide the advantages o a central ile server (a global name space, locationtransparency, reliability, availability, and security) without the attendant disadvantages (additional expense, physical plant, administration, and vulnerability to geographically localized aults). It provides high availability and reliability while executing on a substrate o inherently unreliable machines primarily through a high degree o replication o both ile content and directory inrastructure. Since this deliberate and controlled replication causes a dramatic increase in the space consumed by the ile system, it is advantageous to reclaim storage space due to incidental and erratic duplication o ile content. Since the disk space o desktop computers is mostly unused [3] and becoming less used over time [4], reclaiming disk space might not seem to be an important issue. However, Farsite (like most peer-to-peer systems [32]) relies on voluntary participation o the client machines owners, who may be reluctant to let their machines participate in a distributed ile system that substantially reduces the disk space available or their use. Measurements [8] o 55 desktop ile systems at Microsot show that almost hal o all occupied disk space can be reclaimed by eliminating duplicate iles rom the aggregated set o multiple users systems. Perorming this reclamation in Farsite requires solutions to our problems: ) Enabling the identiication and coalescing o identical iles when these iles are (or security reasons) encrypted with the keys o dierent users. 2) Identiying, in a decentralized, scalable, aulttolerant manner, iles that have identical content. 3) Relocating the replicas o iles with identical content to a common set o storage machines. 4) Coalescing the identical iles to reclaim storage space, while maintaining the semantics o separate iles. The latter two o these problems are addressed by a ile-replica-relocation system [4] and the Windows 2 [34] Single-Instance Store (SIS) [7], which are described in other publications. In the present paper, we describe the irst two problems solutions: ) convergent encryption, a cryptosystem that produces identical ciphertext iles rom identical plaintext iles, irrespective o their encryption keys and 2) SALAD, a Sel-Arranging, Lossy, Associative Database or aggregating and analyzing ile content inormation. Collectively, these components are called the Duplicate-File Coalescing (DFC) subsystem o Farsite. The ollowing section briely describes the Farsite system. Section 3 explains convergent encryption and ormally proves its security. Section 4 describes both the steady-state operation o SALAD and the maintenance o SALAD as machines join and leave the system. Section 5 shows results rom large-scale simulation experiments using ile content data collected rom a set o 585 desktop ile systems. Section 6 discusses related work, and Section 7 concludes. *This paper is an extended version o a paper that appeared in the 22nd IEEE International Conerence on Distributed Computing Systems [5].

3 2. Background the Farsite ile system Farsite [8] is a scalable, serverless, distributed ile system under development at Microsot Research. It provides logically centralized ile storage that is secure, reliable, and highly available, by ederating the distributed storage and communication resources o a set o not-ullytrusted client computers, such as the desktop machines o a large corporation. These machines voluntarily contribute resources to the system in exchange or the ability to store iles in the collective ile store. Every participating machine unctions not only as a client device or its local user but also both as a ile host storing replicas o encrypted ile content on behal o the system and as a member o a directory group storing metadata or a portion o the ile-system namespace. Data privacy in Farsite is rooted in symmetric-key and public-key cryptography [27], and data reliability is rooted in replication. When a client writes a ile, it encrypts the data using the public keys o all authorized readers o that ile, and the encrypted ile is replicated and distributed to a set o untrusted ile hosts. The encryption prevents ile hosts rom unauthorized viewing o the ile contents, and the replication prevents any single ile host rom deliberately (or accidentally) destroying a ile. Typical replication actors are three or our replicas per ile [8, 4]. The integrity o ile content and o the system namespace is rooted in replicated state machines that communicate via a Byzantine-ault-tolerant protocol []. Directories are apportioned among groups o machines. The machines in each directory group jointly manage a region o the ile-system namespace, and the Byzantine protocol guarantees that the directory group operates correctly as long as ewer than one third o its constituent machines ail in any arbitrary or malicious manner. In addition to maintaining directory data and ile metadata, each directory group also determines which ile groups store replicas o the iles contained in its directories, using a distributed ile-replica-placement algorithm [4]. For security reasons, machines communicate with each other over cryptographically authenticated and secured channels, which are established using public-key cryptography. Thereore, each machine has its own public/private key pair (separate rom the key pairs held by users), and each machine computes a large (2-byte) unique identiier or itsel rom a cryptographically strong hash o its public key. Since the corresponding private key is known only by that machine, it is the only machine that can sign a certiicate that validates its own identiier, making machine identiiers veriiable and unorgeable. Each directory group needs to determine which o the iles it manages have contents that are identical to other iles that may be managed by another directory group. Each ile host needs to be able to coalesce identical iles that it stores, even i they have been encrypted separately. 3. Convergent encryption I Farsite were to use a conventional cryptosystem to encrypt its iles, then two identical iles encrypted with dierent users keys would have dierent encrypted representations, and the DFC subsystem could neither recognize that the iles are identical nor coalesce the encrypted iles into the space o a single ile, unless it had access to the users private keys, which would be a signiicant security violation. Thereore, we have developed a cryptosystem called convergent encryption that produces identical ciphertext iles rom identical plaintext iles, irrespective o their encryption keys. To encrypt a ile using convergent encryption, a client irst computes a cryptographically strong hash o the ile content. The ile is then encrypted using this hash value as a key. The hash value is then encrypted using the public keys o all authorized readers o the ile, and these encrypted values are attached to the ile as metadata. Formally, given a symmetric-key encryption unction E, a public-key encryption unction F, a cryptographic hash unction H, and a public/private key pair (K u, K u) or each user u in a set o users U o ile, convergently encrypted ile ciphertext C is a data, metadataò tuple given by unction Χ applied to ile plaintext P : C = ΧK ( P ) = c u, Μ () Where the ile data ciphertext c is the encryption o the ile data plaintext, using the plaintext hash as a key: c = E H ( P )( P ) (2) And the ciphertext metadata Μ is a set o encryptions o the plaintext hash, using the users public keys: { = ( H ( P ) u U } Μ = µ µ (3) u u F Ku Any authorized reader u can decrypt the ile by decrypting the hash value with the reader s private key K u and then decrypting the ile content using the hash as the key: ( ) P = Χ K C E F ( )( c ) u = K u µ u (4) Because the ile is encrypted using its own hash as a key, the ile data ciphertext c is ully determined by the ile data plaintext P. Thereore, the DFC subsystem, without knowledge o users keys, can ) determine that two iles are identical and 2) store them in the space o a single ile (plus a small amount o space per user s key). 3.. Convergent encryption proo o security Convergent encryption deliberately leaks a controlled amount o inormation, namely whether or not the plaintexts o two encrypted messages are identical. In this subsection, we prove a theorem stating that we are not accidentally leaking more inormation than we intend. To obtain a general result regarding our construction, not a result speciic to any realization using particular encryption or hash unctions, our proo is based on the random oracle model [4] o cryptographic primitives. 2

4 We are given a security parameter (key length) n and a plaintext string P o length m n, where P is selected uniormly at random rom a superpolynomial (in n) subset S o all possible strings o length m: P Œ r S ' "s [s Œ S s Œ {,} m ] Ÿ S = n ω(). This assumption is stronger than that allowed in a semantic security proo, since convergent encryption explicitly allows the contents o an encrypted message to be veriied without access to the encryption key. According to the random oracle model, the hash unction H is selected uniormly at random rom the set o all mappings o strings o length m to strings o length n: H Œ r {h ' h: {,} m {,} n }. The encryption unction E is selected uniormly at random rom the set o all permutations that map keys o length n and plaintext strings o length m to ciphertext strings o length m: E Œ r {e ' e: {,} n â {,} m {,} m Ÿ "k,p,p 2 [p p 2 e k (p ) e k (p 2 )] }. Furthermore, since the encryption unction is symmetric, there exists a unction E that is the inverse o unction E given k: "k,p [E k(e k (p)) = p]. (This is not an inverse unction in the traditional sense, since it yields only the plaintext and not the encryption key.) We assume that H, E, and E are all available to an attacker. However, since these unctions are random oracles, the only method by which the attacker can extract inormation rom these unctions is by querying their oracles. In other words, we assume that the attacker cannot break the encryption or hash primitives in any way; i.e., they are ideal cryptographic unctions. We compute c according to the deinition o convergent encryption: c = E H(P) (P). We allow an attacker to execute a program Σ containing a sequence o queries to the random oracles H, E, and E, interspersed with arbitrary amounts o computation between each pair o adjacent queries. We reer to the count o queries in the program as the length o the program. Theorem: Given ciphertext c, there exists no program Σ o length O(n ε ) that can output plaintext P with probability Ω(/n ε ) or any ixed ε, unless the attacker can a priori output P with probability Ω(/n 2ε ). Proo: Assume such a program exists. Let Σ be the shortest such program. Σ either does or does not include a query to oracle H or the value o H(P). I it does not, then the O(n ε ) queries to oracle H in program Σ can, by eliminating alternatives to H(P), increase the probability mass associated with H(P) by a actor o at most O(n ε ), so the probability or H(P) is o(/n ε ). Thus the probability or P = E H(P)(c) is also o(/n ε ) < Ω(/n ε ). Alternatively, i Σ does include a query or H(P), we can construct a new program Σ whose queries are a preix o Σ but which halts ater a random query q < Σ and outputs the value o P queried on step q + o program Σ. With probability Ω(/n ε ), Σ will output P, contradicting the assumption that Σ is the shortest such program. 4. Identiying duplicate iles SALAD Convergent encryption enables identical encrypted iles to be recognized as identical, but there remains the problem o perorming this identiication across a large number o machines in a robust and decentralized manner. We solve this problem by storing ile location and content inormation in a distributed data structure called a SALAD: a Sel-Arranging, Lossy, Associative Database. For scalability, the ile inormation is partitioned and dispersed among all machines in the system; and or ault-tolerance, each item o inormation is stored redundantly on multiple machines. Rather than using central coordination to orchestrate this partitioning, dispersal, and redundancy, SALAD employs simple statistical techniques, which have the unintended eect o making the database lossy. In our application, a small degree o lossiness is acceptable, so we have chosen to retain the (relative) simplicity o the system rather than to include additional machinery to rectiy this lossiness. 4.. SALAD record storage overview Logically, a SALAD appears to be a centralized database. Each record in the database contains inormation about the location and content o one ile. To add a new record to the database, a machine irst computes a ingerprint o a ile by hashing the ile s (convergently encrypted) content and prepending the ile size to the hash value. It then constructs a key, valueò record in which the key is the ile s ingerprint and the value is the identiier o the machine where the ile resides, and it inserts this record into the database. The database is indexed by ingerprint keys, so it can be associatively searched or records with matching ingerprints, thereby identiying and locating iles with (probably) identical contents. (With 2-byte hash values, the probability that a set o F iles contains even one pair o same-sized non-identical iles with the same hash value is F / 2 2 â 8 / 2 F â 24.) Physically, the database is partitioned among all machines in the system. Within the context o SALAD, each machine is called a lea (akin to a lea in a tree data structure). Each record is stored in a set o local databases on zero or more leaves. Leaves are grouped into cells, and all leaves within any given cell are responsible or storing the same set o records. Records are sorted into buckets according to the value o the ingerprint in each record. Each bucket o records is assigned to a particular cell, and all records in that bucket are stored redundantly on all leaves within the cell, as illustrated in Fig.. The number o cells grows linearly with the system size, and since the number o iles also grows linearly with the system size, the expected number o records stored by each lea is constant. 3

5 cell-id = cell-id = cell-id = cell-id = lea ile record cell bucket (machine) Fig. : Buckets o records in cells o leaves A SALAD has two coniguration parameters: its target redundancy actor Λ and its dimensionality D. Since each record is stored redundantly on all leaves in a particular cell, the degree o storage redundancy is equal to the mean number o leaves per cell. This value is known as the actual redundancy actor λ, and it is bounded (via the process described in Subsection 4.2) by the inequality: Λ λ < 2Λ (5) For large systems, it is ineicient or each lea to maintain a direct connection to every other lea, so the leaves are organized (via the process described in Subsection 4.3) into a D-diameter directed graph. Each record is passed rom lea to lea along the edges o the graph until, ater at most D hops, it reaches the appropriate storage leaves SALAD partitioning and redundancy The target redundancy actor Λ is combined with the lea count L (the count o leaves in the SALAD, also called the system size) to determine a cell-id width W, as ollows (where the notation lg means binary logarithm): L W = lg (6) Λ As described in Section 2 above, each lea has a large (2-byte), unique identiier i. The least-signiicant W bits o a lea s identiier or a record s ingerprint orm a value called the cell-id o that lea or record. (For convenience, we sometimes use the term identiier to mean either a lea s identiier or a record s ingerprint.) Formally, the cell-id o identiier i is given by: W c( i) = i mod 2 (7) Two identiiers are cell-aligned i their cell-id values are equal. Cell-aligned leaves share the same cell, and records are stored on leaves with which they are cell-aligned, as illustrated in Fig.. Beore introducing the dimensionality parameter D, we describe the simplest SALAD coniguration, in which D =, as in the example o Fig.. Each lea in the SALAD maintains an estimate o the system size L. From this, it calculates W according to Eq. 6, and it computes cell-id values or each lea in the system (including itsel) according to Eq. 7. Then, or each o its iles, it hashes the ile s content, creates a ingerprint record, and computes a cell-id or the record. The lea then sends each record to all leaves that it believes to be cell-aligned with the record. When each lea receives a record, it stores the record i it considers itsel to be cell-aligned with the record. This example illustrates the statistical partitioning, redundancy, and lossiness o record storage. With no central coordination, records are distributed among all leaves, and records with matching ingerprints end up on the same set o leaves, so their identicality can be detected with a purely local search. Since machine identiiers and ile content ingerprints are cryptographic hash values, they are evenly distributed, so the number o leaves on which each record is stored is governed by a Poisson distribution [2] with a mean o λ. Thereore, with probability e λ, a record will not be stored on any lea. Note that i two leaves have dierent estimates o the system size L, they may disagree about whether they are cell-aligned. However, this disagreement does not cause the SALAD to malunction, only to be less eicient. I a lea underestimates the system size, it may calculate an undersized cell-id width W. With ewer bits in each cell- ID, cell-ids are more likely to match each other, so the lea may store more records than it needs to, and it may send records to leaves that don t need to receive them. I a lea overestimates the system size, it may calculate an oversized cell-id width W, which causes cell-ids to be less likely to match each other, so the lea may store ewer records than it needs to, and it may not send records to leaves that should receive them. Thus, an underestimate o L increases a lea s workload, and an overestimate o L increases a lea s lossiness. Given F iles in the system, the mean count o records stored by each lea is R, calculated as ollows: F R = λ (8) L Since F µ L and λ < 2 Λ, R remains constant as the system size grows SALAD multi-hop inormation dispersal Cells in a SALAD are organized into a D-dimensional hypercube. (Technically, it is a rectilinear hypersolid, since its dimensions are not always equal, but this term is cumbersome.) Coordinates in D-dimensional hyperspace are given with respect to D Cartesian axes. In two- or three-dimensional spaces, it is common to reer to these axes as the x-axis, y-axis, and z-axis, but or arbitrary dimensions, it is more convenient to use numbers instead o letters, so we reer to the -axis, the -axis, and so orth. 4

6 W W identiier or ingerprint: identiier or ingerprint: coordinates: c : c : coordinates: c 2 : c : c : (a) ( W ) 2 ( W ) 2 ( W 2) 3 ( W ) 3 ( W ) 3 Fig. 2: Example extraction o cell-id and coordinates rom an identiier when (a) D = 2, (b) D = 3 Each cell-id is decomposed into D coordinates, as illustrated in Fig. 2. Successive bits o each coordinate are taken rom non-adjacent bits in the cell-id so that when the system size L grows and the width o each coordinate consequently increases, the value o each coordinate undergoes minimal change. A cell s location within the hypercube is determined by the coordinates o its cell-id. For example, in Fig. 2a, a lea with the shown identiier has cell coordinates c = 6 ( 2 ) and c = ( 2 ). Formally, or d < D, the bit width W d o the d-axis coordinate o an identiier is given by: W d W d = (9) D The d-axis coordinate o identiier i is deined by the ollowing ormula (where the notation b n (i) indicates the value o bit n in identiier i, and bit is the LSB), which merely ormalizes the procedure illustrated in Fig. 2: d W d k = k ( i) = 2 b ( i) c () D k + d Fig. 3 shows an example two-dimensional SALAD rom the perspective o the black lea. (Communication paths not relevant to the black lea are omitted rom this igure.) We reer to a row o cells that is parallel to any one o the Cartesian axes as a vector o cells. Two -axis -axis xz = xz = xz = xz = wy = wy = wy = wy = C Fig. 3: SALAD rom black lea s perspective (D=2) D A B E (b) identiiers are d-vector-aligned i they are both in a vector o cells that runs parallel to the d-axis. This means that at least D o their coordinates match, but their d-axis coordinates might not. Formally: ad ( i, j) k [ k d ck ( i) = ck ( j) ] () Identiiers are vector-aligned i they share any vector o cells. Thus, they are d-vector-aligned or some d, like leaves A and C in Fig. 3, but unlike A and E. Formally: a( i, j) d [ ad ( i, j) ] (2) Each lea maintains a lea table o all leaves that are vector-aligned with it, and these are the only leaves with which it communicates. The expected count o leaves in each vector is λ (L/λ) /D, so the mean lea table size is T: D D D T = Dλ ( Lλ) Dλ + λ Dλ L (3) This is not very large. With L =,, λ = 3, and D = 2, the mean lea table size is about 35 entries. Ater a new ile record is generated, it makes its way through the salad by moving in one Cartesian direction per step, until ater a maximum o D hops, it reaches leaves with which it is cell-aligned. A lea perorms the same set o steps either to insert a new record o its own into the SALAD or to deal with a record it receives rom another lea: In outline, each lea determines the highest dimension d in which all o the (less-than-d)-axis coordinates o its own identiier equal those o the record s ingerprint. I d < D, then it orwards the ingerprint record along its d-axis vector to those leaves whose d-axis coordinates equal that o the record s ingerprint. Ater a maximum o D such hops, the ingerprint record will reach leaves that are cell-aligned with the ingerprint. When a lea receives a cell-aligned record, the lea stores the record in its local database, searches or records whose ingerprints match the new record s ingerprint, and notiies the appropriate machines i any matches are ound. For example, when D = 2, cells are organized into a square (a two-dimensional hypercube), and each lea has entries in its lea table or other leaves that are either in its horizontal vector or in its vertical vector. In Fig. 3, the black lea has (binary) cell-id wxyz =, and its coordinates are c = xz = and c = wy =. Thus, it knows other leaves with cell-ids wy or xz, or any values o w, x, y, and z. (The igure shows directed connections to these known leaves via black arrows.) 5

7 receive record,lò // current lea s identiier is I d = while d < D // ind highest dimension d or which all (<d)-coordinates match i c d () c d (I) send record,lò to all leaves j or which a d (I,j) Ÿ c d () = c d (j) exit procedure // orward message along d axis and exit d = d + // this lea is cell-aligned with record s ingerprint i l = I // special case: this lea generated record send record,lò to all leaves j or which c() = c(j) search local database or records matching ingerprint or each matching record,kò notiy machine l that machine k has ile with ingerprint notiy machine k that machine l has ile with ingerprint store record,lò in local database Fig. 4: Procedure or lea I to insert record (,l), locally initiated or upon receipt rom another lea When the black lea generates a record or one i its iles, there are three cases: ) I the record s cell-id equals, the lea stores the record in its own database and sends it to the one other lea in its cell, lea A. 2) I the ingerprint cell-id is wy or wy, then the -axis coordinates are equal but the -axis coordinates are not, so the black lea sends the record along its -axis (horizontal) vector to leaves whose cell-id equals wy. For example, i the ingerprint cell-id is, it is sent directly to lea B. 3) I the ingerprint cell-id is wxyz or xz, then the -axis coordinates are not equal, so the black lea sends the record along its -axis (vertical) vector to leaves whose cell-id equals xz. In this third case, i the ingerprint s -axis coordinate wy does not equal, then the recipient leaves will orward the record (horizontally, via the gray paths in the igure) to the appropriate leaves. For example, i the ingerprint cell-id is, it is sent to leaves C and D, who each orward it to lea E. Formally, the pseudo-code o Fig. 4 illustrates the procedure by which a lea with identiier I inserts a new record, lò indicating that lea l contains a ile with ingerprint into the SALAD. Adding hops to the propagation o ingerprint records increases the system s lossiness. For a two-dimensional SALAD, a record will not be stored i it is not sent to any lea on either the irst or the second hop. When the system size L is very large, nearly all records require two hops, so the loss probability approaches ( e λ ) 2 2 e λ. In general, the loss probability or a D-dimensional salad is: λ D λ Ploss = ( e ) D e (4) For example, with λ = 3 and D = 2, P loss % SALAD maintenance adding leaves There are three aspects to maintaining a SALAD: Adding new leaves, removing dead leaves, and maintaining each lea s estimate o the lea count L. This subsection describes the irst o these operations. Each lea is supposed to know all other leaves with which it is vector-aligned. Thus, when a machine is added to a SALAD as a new lea, it needs to learn o all leaves that are vector-aligned with its identiier so it can add them to its lea table, and these leaves need to add the new lea to their lea tables. The machine irst discovers one or more arbitrary leaves in the SALAD by some out-o-band means (e.g. piggybacking on DHCP []). I the machine cannot ind any extant leaves, it starts a new SALAD with itsel as a singleton lea. I the machine does ind one or more leaves in a SALAD, it sends each o them a join message, and each o these messages is orwarded along D independent pathways o the hypercube until it reaches leaves that are vector-aligned with the new lea s identiier. These vector-aligned leaves send welcome messages to the new lea, which replies with welcomeacknowledge messages. These two types o messages cause the recipient to add an entry to its lea table and to update its estimate o the system size L. To ormalize the join-message orwarding process, we generalize the concepts o cell-alignment and vectoralignment introduced in Subsections 4.2 and 4.3, respectively. Two identiiers are δ-dimensionally-aligned i they are both in the same δ-dimensional hypersquare o a SALAD. (For example, i there is any square that includes them both, they are two-dimensionally aligned, and i there is any vector that includes them both, they are one-dimensionally aligned.) This means that at least D δ o their coordinates match. Formally: ( δ a ) ( i, j) [ ( ) ( )] = δ k k c i = k ck j (5) Note that a(i,j) a () (i,j), so the term vector-aligned is a synonym or one-dimensionally aligned ; similarly, the term cell-aligned is a synonym or zerodimensionally aligned. In Fig. 3, leaves C and D are - dimensionally aligned, leaves C and E are -dimensionally aligned, and leaves B and E are 2-dimensionally aligned. 6

8 We need to deine one more term, namely the eective dimensionality ð o a SALAD. A very small SALAD will not have enough leaves or even a minimal D-dimensional hypercube to contain a mean o Λ leaves per cell. For example, when W =, there are only two cells, irrespective o the value o D, so the salad is eectively one-dimensional. In general, when W < D, W d = or W d < D. Thus, a SALAD s eective dimensionality is: ð = min(w, D) (6) The cell or the new lea is located at the intersection o ð orthogonal vectors; the goal is or all leaves in these vectors to receive the new lea s join message. The general strategy is or one lea to send out ð batches o messages, each o which is routed (by a series o hops through cells with progressively lower dimensional alignment) to one o the vectors that will contain the new lea. When a lea in one o the inal vectors receives the join, it orwards it to all leaves in the vector. In Fig. 3, i a new lea (shown as a gray circle) joins with cell-id =, and i it sends a join message to the black lea, the black lea will send a batch o one message to lea B and a batch o two messages to leaves C and D. Lea B will orward the join to all leaves in its column, and leaves C and D will each orward the join to all leaves in their row. For this to work in a stateless ashion, the lea that initiates the process must not be δ-dimensionally aligned with the new lea or any δ < ð. In Fig. 3, i lea C initially received the new lea s join message, it orwards the message to the leaves in cell ; however, these leaves will not orward the message to the other leaves in the column, because cell-aligned leaves are expecting to be at the end o the chain o orwarded join messages. Thereore, i the initially contacted lea is δ-dimensionally aligned with the new lea or some δ < ð, it orwards a batch o join messages to a cell o leaves that are (δ+)- dimensionally aligned, and this orwarding continues until the join message reaches leaves whose minimal dimensional alignment is ð, which initiates the process described in the previous paragraph. Formally, when lea I receives a join message or a new lea n rom sender s, it perorms the procedure shown in the pseudo-code o Fig. 5. In outline, the lea determines the lowest dimensionality δ or which it is δ- dimensionally-aligned with the new lea, and it determines the lowest dimensionality δ or which the sender is δ dimensionally-aligned with the new lea. (We recalculate this value at each lea rather than passing it along with the message, because dierent leaves may have dierent receive join s,nò // my identiier is I δ = // δ is my lowest dimensional alignment with new lea δ = // δ is sender s lowest dimensional alignment with new lea = // is set o all dimensions in which I and n have non-matching coordinates or all d ' d < ð // loop to calculate above values i c d (n) c d (I) δ = δ + = «{d} i (c d (n) c d (s)) δ = δ + i (s = n) // i join received directly rom new lea... δ = - // sender s dimensional alignment is considered lower than all others i δ > δ // sender has higher dimensional alignment i δ > // i I am not vector-aligned, orward down one degree o alignment or all d Œ {d ' d Œ Ÿ (d + ) mod ð œ } send join I,nÒ to all known leaves j or which c d (n) = c d (j) else i δ = // i I am vector-aligned, orward to leaves in my vector or all d Œ // = δ =, so this loop only executes once send join I,nÒ to all known leaves j or which a d (I,j) else i δ < δ // sender has lower dimensional alignment i δ < ð // i my dimensional alignment < ð, orward up one degree o alignment or one d Œ r {d ' d < ð Ÿ d œ } or one c Œ r {c ' c < 2 Wd Ÿ c c d (n)} send join I,nÒ to all known leaves j or which c d (j) = c else i δ > // unless I m vector-aligned, initiate ð batches o messages or all d Œ send join I,nÒ to all known leaves j or which c d (n) = c d (j) else // I m vector-aligned, and ð =... send join I,nÒ to all known leaves // so orward join to everyone I know i δ < 2 // I m vector aligned with new lea... send welcome message to new lea n // so welcome new lea to SALAD Fig. 5: Procedure or lea I when receiving join message rom sending lea s to insert new lea n 7

9 estimates o L.) The lea then takes steps as described above to orward the join message to leaves o greater, lesser, or identical dimensional alignment, as appropriate. There are several annoying corner cases, which are handled appositely by the pseudo-code in Fig. 5. As L, each join message sent by the new lea results in D λ messages orwarded rom the initially contacted D-dimensionally-aligned lea to the (D )- dimensionally-aligned leaves. Each o these messages spawns λ additional messages or each dimensionality rom D 2 down to. Each d-vector-aligned lea at the end o a orwarding chain then orwards the message to λ /D L /D d-vector-aligned leaves. Thus, each initially contacted lea initiates a series o M messages or each new lea, where M is: D D D M = Dλ L (7) When a vector-aligned (or cell-aligned) lea receives a join message, it sends a welcome message to the new lea. When the new lea receives a welcome message rom an extant lea not in its lea table, then i the extant lea is vector-aligned with the new lea, the new lea takes three actions: ) It adds the extant lea to its lea table; 2) it updates its estimate o the SALAD system size (see 4.2.3); and 3) it replies to the extant lea with a welcomeacknowledge message. When an extant lea receives a welcome-acknowledge message rom a new lea, it perorms the irst two o these actions but does not reply with an acknowledgement SALAD maintenance removing leaves A new lea must explicitly notiy the SALAD that it wants to join, but a lea can depart without notice, particularly i its departure is due to permanent machine ailure. Thus, the SALAD must include a mechanism or removing stale lea table entries. We employ the standard technique [e.g. 6] o sending periodic reresh messages between leaves, and each lea lushes timed-out entries in its lea table. In addition, a lea that cleanly departs the SALAD sends explicit departure messages to all o the leaves in its lea table SALAD maintenance estimating lea count SALAD leaves use an estimate o the system size L to determine an appropriate value or the cell-id width W. Since each lea knows only the leaves or which it has entries in its lea table, it has to estimate L based on the size T o its lea table. The expected relationship between T and L is given by Eq. 3, so the lea eectively inverts this equation. The actual procedure is a little more complicated, because a change to the estimated value o L can cause a change to the value o W, which in turn can cause leaves to be added to or removed rom the lea table, changing the value o T. Whenever lea I modiies its lea table, it recalculates the appropriate cell-id width W via the procedure illustrated in the pseudo-code o Fig. 6. recalculate W // W is cell-id width calculate known lea ratio r // using Eq. 8 L = T / r // estimated lea count Λ = Λ / ( + ξ) // or decreasing width, use attenuated target redundancy Ŵ = lg (L / Λ ) // target cell-id width while Ŵ < W // while target width is less than actual width... W = W // decrement width request identiiers o newly vector-aligned leaves rom neighbors recalculate r L = T / r Ŵ = lg (L / Λ ) Ŵ = lg (L / Λ) // or increasing width, use non-attenuated target redundancy while Ŵ > W // while target width is greater than actual width T = count o remaining known leaves ater increasing width W = W + // tentative cell-id width r = visible lea raction based on W // tentative known lea ratio L = T / r // tentative estimated lea count Ŵ = lg (L / Λ) // tentative target cell-id width i Ŵ < W // i tentative target width less than tentative width... exit procedure // tentative width is unstable, so exit without incrementing W W = W // increment W orget leaves that are no longer vector-aligned r = r // update actual values to relect tentative values L = L Ŵ = Ŵ Fig. 6: Procedure ollowed by lea I to recalculate cell-id width W when lea table size T changes 8

10 In outline, the lea irst calculates the ratio r o vectoraligned cells to total cells in the SALAD, based on the current value o W: D Wd 2 D + d = (8) r = W 2 Since the mean count o leaves per cell is a constant (λ), this ratio is also the expected ratio o leaves in its lea table T to the total lea count L. So rom r and T, it estimates L, and it then calculates a target cell-id width Ŵ according to Eq. I the target cell-id width Ŵ is less than the current cell-id width W, then the lea decrements W, which olds the hypercube in hal along dimension d = (W ) mod D. This increases the count o leaves with which it is d -vector-aligned or all d d. To obtain identiiers and contact inormation or these leaves, the lea requests this inormation rom a set o leaves in its lea table that should have these newly vector-aligned leaves in their lea tables. In particular, it contacts λ leaves j with which it is d-vector aligned and or which c d (I) = c d (j) when calculated according to the new (smaller) value o W but not according to the old (larger) value o W. The lea then recalculates r, L, and Ŵ based on the new values o W and T, and it repeats the above process until the cell-id width is less than or equal to the target cell-id width. I the target cell-id width Ŵ is greater than the current cell-id width W, the procedure is similar to the above in reverse; however, we have to be careul to prohibit an unstable state. Since increasing the cell-id width causes leaves to be orgotten, it is possible that this will result in a smaller lea table size T that yields a smaller system size estimate L that in turn produces a target cell-id width Ŵ that is less than the newly increased cell-id width W. Thereore, the lea irst calculates the lea table size T that would result i the cell-id width were to be incremented. It then calculates tentative values W, r, L, and Ŵ based on the tentative lea table size T. I the tentative target cell-id width Ŵ is less than the tentative new cell-id width W, then the procedure exits instead o transiting to an unstable state. Otherwise, it increments W; orgets about leaves with which it is no longer vector-aligned; updates r, L, and Ŵ to relect the new values o W and T; and repeats the above process until the cell-id width is greater than or equal to the target cell-id width. The above procedure admits rapid luctuation when the lea table size vacillates around a threshold that separates two stable values o W. We prevent this luctuation with hysteresis: When considering a decrease to the cell-id width W, we calculate the target cell-id width Ŵ using an attenuated target redundancy actor Λ : Λ = Λ +ξ (9) This is shown in the pseudo-code o Fig. 6. The damping actor ξ determines the spread between an L large enough to increase W and an L small enough to decrease W. The damping actor ξ should be relatively small, since it allows a reduction in the actual redundancy actor o the SALAD SALAD attack resilience I SALAD were designed such that its leaves cooperatively determine the ranges o ingerprints that each lea stores, it might be possible or a set o malicious leaves to launch a targeted attack against a particular range o values, by arranging or themselves to be the designated record stores or this range. However, because o the SALAD s purely statistical construction, such an attack is greatly limited: Each lea determines its ingerprint range independently rom the ranges o all other leaves, so the most damage a malicious lea can do is to decrease the overall redundancy o the system. For D >, it is possible to target an attack, but only in a airly weak way. By choosing their own identiiers to be vector-aligned with a victim lea, a set o m malicious leaves can increase the size o the victim s lea table, thereby increasing its system size estimate L, which increases the lea s lossiness as described at the end o Subsection 4.2. The eective redundancy actor λ or the victim lea s records will be: m λ = λ (2) L Thus, not only does increasing a SALAD s dimensionality increase the loss probability or a given redundancy actor (Eq. 4), but also it increases the susceptibility o the system to attack. We thereore suggest constructing a SALAD with a dimensionality no higher than that needed to achieve lea tables o a manageably small size. 5. Simulation evaluation Since the current implementation o Farsite is not complete or stable enough to run on a corporate network, we evaluated the DFC subsystem via large-scale simulation using ile content data collected rom 585 desktop ile systems. We distributed a scanning program to a randomly selected set o Microsot employees and asked them to scan their desktop machines. The program computed a 36-byte cryptographically strong hash o each 64-Kbyte block o all iles on their systems, and it recorded these hashes along with ile sizes and other attributes. The scanned systems contain,54,5 iles in 73,87 directories, totaling 685 GB o ile data. There were 4,6,748 distinct ile contents totaling 368 GB o ile data, implying that coalescing duplicates could ideally reclaim up to 46% o all consumed space. D 9

11 consumed space (GB) messages / machine (thousands) K 32K 256K 2M 6M 28M G K 32K 256K 2M 6M 28M G m inim um ile s ize or coale scing (byte s ) m inim um ile s ize or coale scing (byte s ) ideal Λ =.5 Λ = 2 Λ = 2.5 Fig. 7: Consumed space vs. minimum ile size We ran a two-dimensional DFC system on 585 simulated machines, each o which held content rom one o the scanned desktop ile systems. The SALAD was initialized with a single lea, and the remaining 584 machines were each added to the SALAD by the procedure outlined in Subsection 4.4. We recorded the sizes o each machine s lea table and ingerprint database, as well as the number o messages exchanged. By setting a threshold on the minimum ile size eligible or coalescing, we can substantially reduce the message traic and ingerprint database sizes. Fig. 7 shows the consumed space in the system versus this minimum size. The eect on space consumption is negligible or thresholds below 4 Kbytes. This igure also shows that a target redundancy actor o Λ = 2.5 achieves nearly all possible space reclamation. We tested the resilience o the DFC system to machine ailure by randomly ailing the simulated machines. Fig. 8 shows the consumed space versus machine ailure probability. With Λ = 2.5, even when machines ail hal o the time, the system can still reclaim 38% o used space, comparing avorably to the optimal value o 46%. Λ =.5 Λ = 2 Λ = 2.5 Fig. 9: Message count vs. minimum ile size Fig. 9 shows the mean count o messages each machine sent and received versus the minimum ile-size threshold. By setting this threshold to 4 Kbytes, the mean message count is cut in hal without measurably reducing the eectiveness o the system (c. Fig. 7). Fig. shows the cumulative distribution o machines by count o messages sent and received, with no minimum ile-size threshold. The coeicients o variation [2] CoV Λ or these curves are CoV.5 =.64, CoV 2. =.39, and CoV 2.5 =.39. Combined with the smooth shape o the curves, these values imply that the machines share the communication load relatively evenly, especially as Λ is increased. Fig. shows the mean database size on each machine versus the minimum ile-size threshold. As with the message count, setting this threshold to 4 Kbytes halves the mean database size. Fig. 2 shows the cumulative distribution o machines by database size, with no minimum ile-size threshold. The coeicients o variation are CoV.5 =.28, CoV 2. =.3, and CoV 2.5 = 2.4â 5, which are airly small. However, all three curves show bimodal distributions: For Λ =.5 and Λ = 2., most machines store ewer than 4, records but a substantial minority store nearly 8, records. For Λ = 2.5, the consumed space (GB) m achine ailure probabilty cumulative requency m essage count (thousands) Λ =.5 Λ = 2 Λ = 2.5 Λ =.5 Λ = 2 Λ = 2.5 Fig. 8: Consumed space vs. machine ailure rate Fig. : CDF o machines by message count

12 mean database size (records) K 32K 256K 2M 6M 28M G consumed space (GB) m inim um ile s ize or coale scing (byte s ) database size lim it (records) Λ =.5 Λ = 2 Λ = 2.5 Λ =.5 Λ = 2 Λ = 2.5 Fig. : Database size vs. minimum ile size reverse is true. This suggests that the dierences in storage load among machines is due primarily to slight variations in machines estimates o L, iltered through the step discontinuity in the calculation o W (c. Eq. 6), rather than due to non-uniormity in the distribution o machine identiiers or ile content ingerprints. The substantial skew in Fig. 2 reveals an opportunity or reducing the record storage load in the system. We ran a set o experiments in which we limited the database size on each machine. When a machine receives a record that it should store, i its database size limit has been reached, it discards a record in the database with the lowest ingerprint value (corresponding to the smallest ile) and replaces it with the newly received record. I no record in the database has a lower ingerprint value than the new record, the machine discards the new record (ater orwarding it i appropriate). Fig. 3 shows the consumed space versus the database size limit. A limit o 4, records makes no measurable dierence in the consumed space or any Λ. For Λ = 2.5, even with a limit o 8 records (an order o magnitude smaller than the mean database size), the system can still reclaim 38% o used space, compared to the optimum o 46%. Fig. 3: Consumed space vs. database size limit For our inal experiment, we started with a singleton SALAD and incrementally increased the system size up to, simulated machines. Fig. 4 shows the mean lea table size versus system size. The square-root relationship predicted by Eq. 3 is evident in these curves, as is a periodic variation due to the discretization o W. Fig. 5 shows the cumulative distribution o machines by lea table size or L = 585 (the system size or our earlier experiments) and or L =, (the system size or our last experiment). When Λ =.5, the lossiness is so high that a signiicant (i small) raction o machines have nearly empty lea tables. For larger values o Λ, the curves show that there is reasonably close agreement among machines about the estimated system size, except in the case where Λ = 2.5 and L =,. In this case, lg(l/λ) is very close to an integer, so slight variations in machines estimates o L cause Eq. 6 to produce dierent cell-id widths or dierent machines, yielding this bimodal distribution. cumulative requency mean lea table size (entries) database size (records ) system size (m achines ) Λ =.5 Λ = 2 Λ = 2.5 Λ =.5 Λ = 2 Λ = 2.5 Fig. 2: CDF o machines by database size Fig. 4: Lea table size vs. system size

13 cumulative requency cumulative requency lea table size (entries) lea table size (entries) Λ =.5 Λ = 2 Λ = 2.5 Λ =.5 Λ = 2 Λ = 2.5 (a) Fig. 5: CDF o machines by lea table size or (a) L = 585, (b) L =, (b) 6. Related work To our knowledge, coalescing o identical iles is not perormed by any distributed storage system other than Farsite. The resulting increase in available space could beneit server-based distributed ile systems such as AFS [2] and Ficus [9], serverless distributed ile systems such as xfs [2] and Frangipani [37], content publishing systems such as Publius [38] and Freenet [2], and archival storage systems such as Intermemory [8]. Windows 2 [34] has a Single-Instance Store [7] that coalesces identical iles within a local ile system. LBFS [28] identiies identical portions o dierent iles to reduce network bandwidth rather than storage usage. Convergent encryption deliberately leaks inormation. Other research has studied unintentional leaks through side channels [22] such as computational timing [23], measured power consumption [24], or response to injected aults [5]. Like convergent encryption, BEAR [3] derives an encryption key rom a partial plaintext hash. Song et al. [35] developed techniques or searching encrypted data. SALAD has similarities to the distributed indexing systems Chord [36], Pastry [3], and Tapestry [4], all o which are based on Plaxton trees [29]. These systems use O(log n)-sized neighbor tables to route inormation to the appropriate node in O(log n) hops. Also similar is CAN [3], which uses O(d)-sized neighbor tables to route inormation to nodes in O(d n /d ) hops. SALAD complements these approaches by using O(d n /d )-sized neighbor tables to route in O(d) hops. These other systems are not lossy, but they appear less immune to targeted attack than SALAD is. SALAD s conigurable lossiness is similar to that o a Bloom ilter [6], although it yields alse negatives rather than alse positives. Farsite relocates identical iles to the same machines so their contents may be coalesced. Other research on ile relocation has been to balance the load o ile access [9, 39] to migrate replicas near points o high usage [, 7, 25, 33], or to improve ile availability [4, 26]. 7. Summary Farsite is a distributed ile system that provides security and reliability by storing encrypted replicas o each ile on multiple desktop machines. To ree space or storing these replicas, the system coalesces incidentally duplicated iles, such as shared documents among workgroups or multiple users copies o common application programs. This involves a cryptosystem that enables identical iles to be coalesced even i encrypted with dierent keys, a scalable distributed database to identiy identical iles, a ile-relocation system that co-locates identical iles on the same machines, and a single-instance store that coalesces identical iles while retaining separate-ile semantics. Simulation using ile content data rom 585 desktop ile systems shows that the duplicate-ile coalescing system is scalable, highly eective, and ault-tolerant. 2

Reclaiming Space from Duplicate Files in a Serverless Distributed File System

Reclaiming Space from Duplicate Files in a Serverless Distributed File System Reclaiming Space rom Duplicate Files in a Serverless Distributed File System John R. Douceur, Atul Adya, William J. Bolosky, Dan Simon, Marvin Theimer Microsot Research {johndo, adya, bolosky, dansimon,