Providing File Services using a Distributed Hash Table

Lars Seipel, Alois Schuette
University of Applied Sciences Darmstadt, Department of Computer Science,
Schoefferstr. 8a, 64295 Darmstadt, Germany
lars.seipel@stud.h-da.de alois.schuette@h-da.de
http://www.fbi.h-da.de

Abstract. This paper describes the redesign of an existing Chord-based P2P infrastructure. The new system employs erasure coding to improve the availability and durability of user data. In addition, we give an example of how to use the infrastructure as a base for building a secure, distributed file system.

Keywords: DHT, Peer-to-Peer, Distributed File System, Encryption

1 Introduction

The AChord system has emerged out of a student project at Hochschule Darmstadt University of Applied Sciences. It was originally built to support instant messaging applications. This paper describes a redesign of the core infrastructure with the purpose of making it a suitable base for other kinds of distributed applications. Our ambition is to encourage experimentation and ease the development and deployment of new networked peer-to-peer applications. As an example, we present a simple file system that uses AChord as a storage backend for encrypted file system data.

2 Distributed Hash Table

In peer-to-peer systems there needs to be a mechanism for locating resources available from peers. To accomplish this task, we make use of the Chord system [6]. Chord provides us with a basic, yet powerful, lookup primitive that answers a single question: given a key, what is the node responsible for dealing with it?

(published at IS 2015, The 11th International Conference on Interactive Systems, Ulyanovsk 2015)
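The successor relation behind this lookup primitive can be sketched in a few lines of Python. This is a toy illustration only: the identifier-space size and node set below are assumptions chosen to mirror Fig. 1, not part of the AChord implementation.

```python
# Toy sketch of Chord's successor relation on a 2**M identifier circle.
# M = 3 and the node set {1, 3, 6} mirror Fig. 1; real deployments use
# the output size of the base hash function (e.g. 160 bits for SHA-1).

M = 3                      # identifier space of size 2**M, here 8
nodes = [1, 3, 6]          # node identifiers placed on the circle

def successor(key: int, nodes: list) -> int:
    """Return the first node equal to or following `key` on the circle."""
    size = 2 ** M
    for step in range(size):
        candidate = (key + step) % size
        if candidate in nodes:
            return candidate
    raise ValueError("no nodes on the circle")

# Key 2 is handled by node 3; key 7 wraps around the circle to node 1.
assert successor(2, nodes) == 3
assert successor(7, nodes) == 1
```

A real implementation would, of course, not scan the whole circle; the finger table discussed below is what brings lookups down to O(log n) hops.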
Chord is built around an m-bit circular identifier space, where m corresponds to the output size of a base hash function (e.g. SHA-1 [5]). Both keys and nodes are mapped into this space. The identifier space can be visualized as a circle with nodes representing points on it, as shown in Fig. 1.

Fig. 1. Identifier circle of size 2^3 with nodes at 1, 3, and 6

When looking up a key, the node that most closely follows it on the circle (and thereby in the identifier space) is called its successor. This is the node responsible for handling that particular key. Each node is aware of its immediate predecessor and successor. Additional routing information is kept in a data structure called the finger table and is used to speed up the lookup process. It can be shown that, in a Chord network of n nodes, the number of peers to contact for resolving a key to its successor is, with high probability, O(log n) [6]. How the resulting answer is used is generally up to the application. A common theme is the storage of data (in the form of key/value pairs) on participating nodes. Such systems are called distributed hash tables (DHTs) for their conceptual similarity to hash tables used as in-memory data structures.

3 Block Storage

Relying upon the described lookup mechanism, we were able to build a distributed system for the storage of data blocks. Nodes provide an interface to users through which blocks can be uploaded to or retrieved from the system. There is no inherent structure to these blocks. As far as the receiving node is concerned, they're just opaque chunks of data. Applications give meaning to blocks by interpreting them in a specific way. A stored block is associated with a key by which it can be retrieved again later. This key is specified by the client when it uploads a block into the system. Thus, the fundamental operations available to a client are put(key, value) and get(key), as shown in Table 1.
Operation          Description
get(key)           Retrieve the value associated with key
put(key, value)    Store value under key

Table 1. Core interface provided to clients

The node where a block should be kept is designated by doing a Chord lookup operation for its key. This determines its successor, the node responsible for handling the given key. As a result, uploaded blocks are distributed across all participating nodes. When a node joins the system and responsibilities change, the new node can request a list of keys from its peers and download the corresponding values so it can serve them to clients.

3.1 Fragment Maintenance

There's an obvious omission so far: nodes can fail, and such failures may lead to blocks being (transiently or permanently) unavailable to the system. A solution to this problem involves adding redundancy to the stored data, either by replicating whole blocks or by encoding them using an erasure-resilient code. An erasure code is a form of forward error correction where a chunk of data is encoded into n fragments, of which any m (where m < n) are sufficient to rebuild the original source data. We implement Rabin's Information Dispersal Algorithm [4], which has the handy property of being able to generate new fragments that are, with high probability, distinct, without the need to know which fragments were generated previously. The n created fragments are distributed across nodes as follows: if a node is determined to be the successor of a given key, the fragments are stored on it and the n - 1 nodes immediately following it in the identifier space. This matches the approach used by MIT's DHash [1], from which we've also borrowed the block maintenance protocol. The purpose of this protocol is to detect when fragments are misplaced (that is, stored on a node that shouldn't actually have them, a situation that can arise after other nodes have joined the system) or can no longer be found in the system at all (e.g.
following a node failure). When a node detects that it is in possession of a fragment for a key for which it is not one of the n successors (or slightly more than that, in anticipation of future failures), it tries to resolve the situation by offering the fragment to one of the correct peers until one of them accepts it (due to having no fragment for the corresponding key). In any case, the misplaced fragment is then deleted to avoid the existence of multiple copies of the same fragment in the system. Another part of the maintenance protocol is concerned with recreating fragments that are considered missing. Here, a node synchronizes its local fragment storage with each of its n - 1 immediate successors. If a node is the successor for a stored block, then it and the n - 1 nodes following it should each hold a fragment for that block. When a violation of this rule is discovered during synchronization, the affected node performs a get operation, thereby reconstructing
the block. Now, with the original block data available, it is able to create a new fragment to store locally. Efficient synchronization across nodes makes use of the fact that a cryptographic hash value of a piece of data can be considered a fingerprint and can be used to identify that data. A Merkle tree (also called a hash tree) is a data structure originally proposed by Ralph C. Merkle in the context of digital signatures [3]. Its characteristic property is the labelling of internal nodes with a hash computed over their children. Each node maintains such a tree, in which the keys of held fragments conceptually function as the leaves. Thus, the root of the Merkle tree describes the full set of fragments held by a node. The tree is constructed such that the root node's ith child (where 0 <= i < 64) summarizes the set of fragments held in the interval [i * 2^160/64, (i+1) * 2^160/64), assuming a 160-bit identifier space [1]. Synchronization is done by exchanging tree nodes between hosts, starting from the root. Descent stops when either matching labels are found (indicating that the corresponding sub-trees are the same and can thus be skipped) or we reach a leaf node that is present on one host but not the other. This triggers the above-mentioned fragment re-creation process on the host that is missing the fragment.

4 AChord File System Interface

A library provides its users with a familiar file-like interface. Its main task is the mapping of file system calls (open/create, read, write, ...) to get and put operations in the distributed hash table. The library also provides for security by authenticating and en-/decrypting data blocks as they leave and enter the local system.

4.1 Files

A file is stored by dividing its contents into blocks of a given size.
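This chunking step can be sketched as follows. The 8 KiB block size is an assumption made for illustration (consistent with the size assumed in Sect. 4.1 below); the actual size is a parameter of the system.

```python
# Minimal sketch of dividing file contents into fixed-size blocks.
# The 8192-byte block size is an illustrative assumption.

BLOCK_SIZE = 8192

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Divide file contents into chunks of at most `block_size` bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# A 20000-byte file yields two full blocks and one partial final block.
blocks = split_into_blocks(b"x" * 20000)
assert [len(b) for b in blocks] == [8192, 8192, 3616]
```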
Before uploading a block, it is encrypted and extended with a message authentication code (MAC) which, on later retrieval, can be used to verify the authenticity and integrity of the block data. We compute a cryptographic hash (SHA-1) over the resulting ciphertext and use it as the key to upload the block into the distributed hash table. Given this key, the block can be retrieved again at a later point. Therefore, we can consider this hash to be the block's address. Schemes like this are generally referred to as content-addressable storage. The result of uploading a file's contents is a list of block hashes. To present these to the user as a single entity, a mechanism is needed by which it is possible to identify all blocks belonging to a specific file. A fundamental concept in disk-based file systems is the i-node. In [7], Tanenbaum and Bos describe it as listing the attributes and disk addresses of a file's blocks. Substituting DHT keys for disk addresses, this description matches our use as well.
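The encrypt-then-MAC, hash-as-address scheme from Sect. 4.1 can be sketched as below. Two caveats: the paper does not name the cipher used, so the XOR keystream here is a toy stand-in (it is NOT a secure cipher), and whether the SHA-1 address covers the appended MAC is our assumption.

```python
# Sketch of deriving a content address for an encrypted, authenticated
# block. toy_encrypt is a hypothetical stand-in for a real cipher.
import hashlib
import hmac

def toy_encrypt(key: bytes, data: bytes) -> bytes:
    """Toy XOR keystream 'cipher' (NOT secure); decryption = encryption."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(p ^ s for p, s in zip(data, stream))

def store_block(enc_key: bytes, mac_key: bytes, block: bytes):
    """Encrypt, append a MAC, and derive the block's DHT key (address)."""
    ciphertext = toy_encrypt(enc_key, block)
    tag = hmac.new(mac_key, ciphertext, hashlib.sha1).digest()
    stored = ciphertext + tag               # what is uploaded to the DHT
    address = hashlib.sha1(stored).digest() # 20-byte content address
    return address, stored
```

Because the address is a hash of the stored bytes, any tampering with the block is detectable twice over: the retrieved data must match both its address and its MAC.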
Assuming 8 KiB blocks, we can store only around 400 SHA-1 hashes in a single block, amounting to just a few megabytes of data. The solution consists of storing the keys of blocks that themselves contain block addresses. This indirection scheme can be extended to yield double (and triple, ...) indirect blocks, which increase the maximum file size the system can handle. As an optimization for small files, and to reduce unnecessary overhead, the addresses of the first few blocks are stored directly. The higher levels of indirection are only used when the file reaches a certain size. Taken together with a few pieces of metadata (like the modification time), the first level of block addresses forms the i-node block. It and all the indirection blocks it refers to are encrypted and authenticated in the same way as is done for file blocks. As retrieving the i-node block allows all other parts of a file to be reached, a file can be addressed by specifying the key used for storing this block.

4.2 Directories

In the library interface, the way to refer to files is by user-chosen names. These need to be translated to the keys corresponding to a file's i-node block. The conceptual structure defining this mapping is the directory. We implement simple, linear directories strongly reminiscent of those used in the UNIX research system [2, 7] (but stripped of the 14-character file name limit). Thus, a directory entry consists of a 20-byte block address (referring to the i-node block) and a length-prefixed name of variable size. Directory entries can also refer to other directories. A file path of a/b/c is thus resolved by using the directory a to look up the name b, giving a block address that, when used as a DHT key, allows the directory called b to be retrieved. The process is repeated for c, yielding the i-node block of the given file.

5 Conclusion

The reworked AChord system provides reliable block storage to distributed programs.
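The path resolution walk from Sect. 4.2 can be sketched as below. This is a simplification under stated assumptions: an in-memory dict stands in for the DHT, and directory contents are modelled as a plain name-to-address mapping rather than the length-prefixed on-disk entry format.

```python
# Sketch of resolving a path such as "a/b/c" through linear directories.
# `dht` is an in-memory stand-in for the distributed hash table.
import hashlib

dht = {}  # maps 20-byte keys to directory contents

def put_dir(entries: dict) -> bytes:
    """Store a directory (name -> 20-byte address) and return its key."""
    key = hashlib.sha1(repr(sorted(entries.items())).encode()).digest()
    dht[key] = entries
    return key

def resolve(root_key: bytes, path: str) -> bytes:
    """Look up each path component in successive directories."""
    key = root_key
    for name in path.split("/"):
        directory = dht[key]
        key = directory[name]  # address of the next directory/i-node block
    return key

# Build root -> a -> b -> c, where c's entry holds a stand-in i-node key.
file_key = hashlib.sha1(b"inode-of-c").digest()
root_key = put_dir({"a": put_dir({"b": put_dir({"c": file_key})})})
assert resolve(root_key, "a/b/c") == file_key
```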
Through the use of erasure coding, it improves the availability and durability of stored user data. The redesign was motivated by the desire to make the system a useful base for a wider range of applications. This can be considered a success. The new system already provides the underpinnings for a number of new applications, including a file synchronization program. It has also taken over serving the original instant messaging user base. The system leans heavily on the prior work on Chord and DHash done at the Massachusetts Institute of Technology (MIT). It should be noted, though, that our implementation has not seen as much testing (especially in the wide area) as the one created at MIT. The file system interface allows applications to access stored data using a proven programming abstraction. Written data is encrypted and spread across nodes, while the application programmer can keep using a conventional file-like interface.
References

1. Cates, J.: Robust and Efficient Data Management for a Distributed Hash Table. Master's thesis, Massachusetts Institute of Technology (May 2003)
2. Lions, J.: A Commentary on the Sixth Edition UNIX Operating System. Department of Computer Science, The University of New South Wales (1977)
3. Merkle, R.C.: A digital signature based on a conventional encryption function. In: Advances in Cryptology (CRYPTO '87), pp. 369-378. Springer (1988)
4. Rabin, M.O.: Efficient dispersal of information for security, load balancing, and fault tolerance. Journal of the ACM 36(2), 335-348 (1989)
5. Secure Hash Standard. FIPS PUB 180-1, National Institute of Standards and Technology (NIST), US Department of Commerce (1995)
6. Stoica, I., Morris, R., Liben-Nowell, D., Karger, D.R., Kaashoek, M.F., Dabek, F., Balakrishnan, H.: Chord: a scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Transactions on Networking 11(1), 17-32 (2003)
7. Tanenbaum, A.S., Bos, H.: Modern Operating Systems, 4th edn. Prentice Hall Press, Upper Saddle River, NJ, USA (2014)