MTAT.08.024 Distributed Systems (Hajussüsteemid)
Distributed File Systems
(slides adapted from Meelis Roos' DS12 course)
1/25
Examples
- AFS
- NFS
- SMB/CIFS
- Coda
- InterMezzo
- HDFS
- WebDAV
- 9P
2/25
Andrew File System (AFS)
- Carnegie Mellon University project to interconnect thousands of university workstations
- Started in 1983, initially built on 4.2BSD
- Later developed by Transarc (DFS), then IBM
- Global name space
- Location-independent file names, migration
- Client-side caching
- Server replication
- Kerberos authentication
- Complex :)
3/25
AFS
- Clients have their own local name space (file system root, devices)
- Servers together serve the whole global name space
- Servers and clients are grouped into clusters, which are interconnected over a WAN
- Servers delegate work to clients; caching in 64 KiB blocks
- Client mobility: the same global name space is visible from any client machine
- Security: authentication and encryption of the client-server channel
- Authorization through ACLs (access control lists)
- Consists of volumes, which are combined into one big file tree
4/25
AFS Design Strategy
- Files are small
- Reads are frequent, writes are rare
- Sequential reads are common, random-access reads are rare
- Most files are read and written by a single user
- A file that was used once will be used again
AFS works best when the assumptions above hold.
AFS is not designed for files that multiple users write to.
5/25
AFS Implementation
- On client machines: client software called Venus, plus kernel-level interception of the open() and close() routines
- On the server side: server software called Vice, which serves the file system already used by the OS over the network
- Other implementations:
  - OpenAFS: the open source implementation of IBM AFS
  - Arla: an independent free implementation
6/25
AFS Architecture
- Data transfer and client-side caching work on whole files (not blocks)
- Opening a file creates a local copy of it in the cache on the local disk
- Old files are periodically evicted from the cache
- The file identifier (fid) is 96 bits long and invisible to the user:
  - 32-bit volume id
  - 32-bit file id
  - 32-bit uniqueness-preserving id (uniquifier)
- Cache coherency is managed by the server
7/25
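The 96-bit fid layout can be illustrated with a small packing routine. This is only a sketch: the three 32-bit fields come from the slide, while the big-endian byte order and the function names are assumptions made for the example, not the AFS wire format.

```python
import struct
from typing import Tuple

# Three unsigned 32-bit fields, 96 bits in total (layout from the slide).
# Big-endian byte order is an assumption made for this sketch.
FID_FORMAT = ">III"

def pack_fid(volume_id: int, file_id: int, uniquifier: int) -> bytes:
    """Pack the three 32-bit parts into a 12-byte (96-bit) identifier."""
    return struct.pack(FID_FORMAT, volume_id, file_id, uniquifier)

def unpack_fid(raw: bytes) -> Tuple[int, int, int]:
    """Split a 12-byte identifier back into (volume id, file id, uniquifier)."""
    return struct.unpack(FID_FORMAT, raw)

fid = pack_fid(536870915, 42, 7)
print(len(fid) * 8)      # 96
print(unpack_fid(fid))   # (536870915, 42, 7)
```

The uniquifier lets a server reuse a file id slot while still telling the old and the new file apart.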
NFS
- Sun Network File System
- Each machine can be a client, a server, or both
- Implemented on top of ONC RPC (Sun RPC)
- A client mounts a directory exported by a server into its local name space; from then on access is transparent
- A mount can name several replicated servers, but only one of them is actually selected
- Stateless server
- Used as glue between different file systems
- No replication support
- No file locking support
8/25
NFS Implementation
- The client is a driver running inside the OS kernel
- Servers exist both as user-level applications and as kernel-level implementations
- File identifiers are 32 or 64 bits long
- A separate RPC service returns the ids of the exported file systems
- Given a directory id, the list of file ids it contains can be requested
- File operations are performed on file ids
- NFS file ids correspond, at some level, to the ones used in Unix
- All security relies on the client knowing only the file ids that belong to it
- Which user owns what (the user identifier) is left to the client machine
9/25
Examples of NFS primitives
- lookup(dirfh, name) -> fh, status: find a file in a directory
- getattr(fh) -> attr: request file or directory attributes
- read(fh, offset, count) -> attr, data: read from a file
- rename(dirfh, name, todirfh, toname) -> status: rename a file or directory
- readdir(dirfh, cookie, count) -> entries: return the names of the contained entries with their ids
10/25
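To show why these primitives suit a stateless server, here is a toy in-memory model of a few of them. Everything in it (the class name ToyNfsServer, integer file handles, the status strings, the attribute dictionary) is hypothetical and chosen for illustration; it is not the real NFS protocol or any library API. The point is that every call carries all the information it needs (handle, offset, cookie), so the server keeps no per-client state.

```python
from typing import Dict, List, Tuple

class ToyNfsServer:
    """In-memory stand-in for a server; file handles are plain integers."""

    def __init__(self) -> None:
        # Directory handle -> {name: handle}; file handle -> contents.
        self.dirs: Dict[int, Dict[str, int]] = {1: {"notes.txt": 2}}
        self.files: Dict[int, bytes] = {2: b"hello, nfs"}

    def lookup(self, dirfh: int, name: str) -> Tuple[int, str]:
        """Find a name in a directory; returns (fh, status)."""
        fh = self.dirs.get(dirfh, {}).get(name)
        return (fh, "OK") if fh is not None else (0, "NOENT")

    def getattr(self, fh: int) -> Dict[str, object]:
        """Return a minimal attribute record for a file or directory."""
        if fh in self.dirs:
            return {"type": "dir", "size": len(self.dirs[fh])}
        return {"type": "file", "size": len(self.files[fh])}

    def read(self, fh: int, offset: int, count: int) -> Tuple[Dict[str, object], bytes]:
        """Read count bytes at offset; the client supplies the position each time."""
        data = self.files[fh][offset:offset + count]
        return self.getattr(fh), data

    def readdir(self, dirfh: int, cookie: int, count: int) -> List[Tuple[str, int]]:
        """Return up to count (name, fh) entries starting at position cookie."""
        entries = sorted(self.dirs[dirfh].items())
        return entries[cookie:cookie + count]

server = ToyNfsServer()
fh, status = server.lookup(1, "notes.txt")
attr, data = server.read(fh, 7, 3)
print(status, data)   # OK b'nfs'
```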
NFS v4
- Stateful
- TCP only
- Less protocol overhead
- UTF-8 names
- ACL (access control list) support
- Better security (Kerberos 5 and GSS-API by default)
- Efficient client-side caching:
  - Locking (mandatory)
  - Reservations
  - Delegation
- Replication and migration (less fully specified, however)
- NFS v4.1 and pNFS with new transport layers
11/25
SMB/CIFS
- IBM invention on top of NetBIOS
- Many dialects have been introduced over time
- CIFS: a derivative of SMB from Windows NT4 (the Microsoft SMB dialect)
- Different transport layers in use, mostly TCP/IP
- Stateful
- Designed for Windows (hence heavily supported in the Windows API)
- Samba
- DFS: an automounter on top of SMB:
  - Global file tree across multiple file servers
  - Location-independent name space
  - Migration support
  - Replication support
12/25
SMB Additional Features
- File-level and block-level (record-level) locking
- File and directory update notification
- Unicode support
- Extended attributes
- Oplocks (opportunistic locks)
- General remote communication over named pipes and mailslots
- Authentication at user level and share level, domain support
- Printing support
- Network browsing support (automatic service discovery)
- Unix compatibility (file attributes, permissions, device files, links)
13/25
Coda
- From Carnegie Mellon University, a successor of AFS v2
- Design goals:
  - Disconnected operation (mobility support)
  - High performance through client-side caching
  - Server replication
  - Security scheme with authentication and access control
  - Fault tolerance in case of server-side failures
  - Scalability
  - Well-defined file sharing semantics, also in case of network failures
14/25
Coda Design
- The client has a driver embedded in its OS kernel
- The client also runs Venus, a cache manager between the file system and the network
- Venus communicates over RPC with the server application (Vice)
- Client-side changes are sent to the server when the file is closed
- If submitting the changes fails, they are stored locally
- Some files are kept cached at all times, so they are available even without a connection to a server
- Conflict detection is automatic; resolution, however, is manual
15/25
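The store-on-close behaviour with a local fallback can be sketched roughly as follows. This is an assumption-laden toy, not Coda's implementation: ToyServer, ToyCacheManager, the pending list, and the reintegrate method are hypothetical names, and conflict detection is left out entirely.

```python
from typing import Dict, List, Tuple

class ToyServer:
    """Stand-in for the server side; can be switched to unreachable."""

    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}
        self.reachable = True

    def store(self, path: str, data: bytes) -> None:
        if not self.reachable:
            raise ConnectionError("server unreachable")
        self.files[path] = data

class ToyCacheManager:
    """Plays the role of the client cache manager: ships changes on close."""

    def __init__(self, server: ToyServer) -> None:
        self.server = server
        self.cache: Dict[str, bytes] = {}
        self.pending: List[Tuple[str, bytes]] = []  # changes not yet sent

    def close(self, path: str, data: bytes) -> None:
        """Update the local cache, then try to send the change to the server."""
        self.cache[path] = data
        try:
            self.server.store(path, data)
        except ConnectionError:
            self.pending.append((path, data))  # keep the change locally

    def reintegrate(self) -> int:
        """Replay locally stored changes in order; returns how many were sent."""
        sent = 0
        while self.pending:
            path, data = self.pending[0]
            self.server.store(path, data)  # raises if still unreachable
            self.pending.pop(0)
            sent += 1
        return sent

server = ToyServer()
venus = ToyCacheManager(server)
server.reachable = False
venus.close("/coda/report.txt", b"draft 1")   # stored locally only
server.reachable = True
print(venus.reintegrate())                    # 1
```

A change is removed from the pending list only after the server accepted it, so a failure in the middle of a replay loses nothing.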
InterMezzo
- Design goals:
  - High availability
  - Server replication
  - Mobile clients
  - Management of large clusters
  - Disconnected operation
  - Automatic recovery after network failure
- Operation logging
- InterSync synchronization system
16/25
InterSync
- Software protocol for synchronizing file systems between multiple machines
- Runs as a separate poller application on the server side or as part of the OS kernel
- File operations are logged
- The server is an HTTP server (standard or special-purpose)
- A typical transaction is a log file request and reply, followed by file transfer
- HTTP caching support
- HTTP can be tunneled over SSH for security
17/25
InterMezzo Conflicts
- Conflicts are detected by file size and modification timestamp
- Four types of conflicts:
  - Name/Name
  - Update/Remove
  - Update/Update
  - Rename/Rename
- Conflict resolution schemes:
  - Mobile: the server always has higher priority
  - HA (high availability between servers): the active server always has higher priority than the failed one
  - Resynchronization: used when the logs are not available
18/25
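Detection by size and timestamp, and the mapping to the four conflict types, might look like the sketch below. The Stat record, the operation names ("create", "update", "remove", "rename"), and both functions are hypothetical; InterMezzo's actual log format is not reproduced here.

```python
from typing import NamedTuple, Optional

class Stat(NamedTuple):
    size: int
    mtime: float

def changed(base: Stat, current: Stat) -> bool:
    """A replica counts as modified if its size or timestamp differs
    from the state recorded at the last synchronization."""
    return base.size != current.size or base.mtime != current.mtime

def classify(client_op: str, server_op: str) -> Optional[str]:
    """Map two operations on the same object to one of the four types,
    or None when the pair is not one of them."""
    pair = frozenset((client_op, server_op))
    if pair == {"create"}:
        return "Name/Name"
    if pair == {"update", "remove"}:
        return "Update/Remove"
    if pair == {"update"}:
        return "Update/Update"
    if pair == {"rename"}:
        return "Rename/Rename"
    return None

base = Stat(size=10, mtime=100.0)
print(changed(base, Stat(size=12, mtime=100.0)))   # True
print(classify("update", "remove"))                # Update/Remove
```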
Hadoop Distributed File System
- Apache Hadoop: a MapReduce-style framework for distributed computing
- Implemented in Java
- HDFS (Hadoop Distributed File System): the distributed file system for Hadoop clusters
- Does not implement the whole POSIX API, only the parts essential for Hadoop clusters; speed is preferred over additional features
19/25
HDFS Design Features
- Hardware failures are expected and handled well
- Overall throughput matters more than communication latency
- Big data (files of terabytes in size, millions of such files)
- Simple coherency model: one writer; once written, many reads
- It is easier to move operations to the data (map) than data to the operations
- Portability across different hardware and software platforms is critical
20/25
HDFS Design: Metadata
- Metadata is stored on the NameNode
- The amount of RAM on the NameNode is the main factor limiting scalability
- One or more copies (with transaction logs) on the local disks
- Checkpoint nodes: there can be several; they periodically replicate the metadata from the NameNode
- Backup node: only one; it is kept in sync with the NameNode at all times (the whole content)
21/25
HDFS Design: Data
- Data is stored on the DataNodes
- Replication takes place between racks (the system knows the rack id of each DataNode)
- The NameNode decides where to place a copy (on the same node, in a different rack, on a random node)
- The rebalancer forces data relocation when the placement chosen by the NameNode at initial replication is no longer optimal
- Each copy carries data checksums
22/25
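A rack-aware placement decision can be sketched along the three choices named above: the writer's own node, a node in a different rack, and a random remaining node. This is a simplified illustration under those assumptions, not HDFS's actual placement policy; the function name, the node names, and the rack map are hypothetical.

```python
import random
from typing import Dict, List

def place_replicas(writer: str, racks: Dict[str, str], rng: random.Random) -> List[str]:
    """Pick three distinct DataNodes for one block.

    racks maps a node name to its rack id. Assumes at least one node
    outside the writer's rack and at least three nodes in total.
    """
    local_rack = racks[writer]
    # Second copy: any node whose rack differs from the writer's rack.
    remote = sorted(n for n, r in racks.items() if r != local_rack)
    second = rng.choice(remote)
    # Third copy: any node not used so far.
    rest = sorted(n for n in racks if n not in (writer, second))
    third = rng.choice(rest)
    return [writer, second, third]

racks = {"dn1": "rack-a", "dn2": "rack-a", "dn3": "rack-b", "dn4": "rack-b"}
print(place_replicas("dn1", racks, random.Random(0)))
```

Placing the second copy in another rack means the block survives the loss of a whole rack, at the cost of one cross-rack transfer.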
HDFS Protocol
- RPC-based: client-NameNode, client-DataNode, DataNode-NameNode
- Large block size (typically 64 MB), buffering in the local file system
- Heartbeats and block reports are sent periodically from the DataNodes
- Java API + WebDAV protocol
23/25
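As a small worked example of what the large block size means for metadata, the number of blocks the NameNode has to track for one file can be computed directly. The sketch assumes that "64 MB" means 64 MiB; the function name is hypothetical.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MiB (assumed reading of "64 MB")

def block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed for a file, rounding the last block up."""
    return -(-file_size // block_size)  # ceiling division

print(block_count(1024 ** 4))   # 16384 blocks for a 1 TiB file
print(block_count(1))           # 1
```

With a 4 KiB block size the same 1 TiB file would need 268435456 block entries, which shows why large blocks keep the in-memory metadata manageable.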
WebDAV
- WebDAV: Web-based Distributed Authoring and Versioning
- Collection of HTTP extensions for managing files over HTTP
- HTTP methods: GET, PUT, POST, PROPFIND, PROPPATCH, DELETE, COPY, MOVE, MKCOL, SEARCH, LOCK, UNLOCK
- Objects have properties:
  - Live (the server relies on them)
  - Dead (the server just stores them)
- Locking (shared and exclusive)
24/25
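To make the property model concrete, the sketch below builds the headers and XML body of a PROPFIND request that asks for two properties. Only the request is constructed; nothing is sent. The helper name build_propfind is hypothetical, while the DAV: namespace, the Depth header, and the getcontentlength and getlastmodified property names are standard WebDAV.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

DAV_NS = "DAV:"

def build_propfind(prop_names: List[str]) -> Tuple[Dict[str, str], bytes]:
    """Return (headers, body) for a PROPFIND asking for the named properties."""
    ET.register_namespace("D", DAV_NS)
    root = ET.Element(f"{{{DAV_NS}}}propfind")
    prop = ET.SubElement(root, f"{{{DAV_NS}}}prop")
    for name in prop_names:
        ET.SubElement(prop, f"{{{DAV_NS}}}{name}")
    body = ET.tostring(root, encoding="utf-8")
    headers = {
        "Depth": "1",  # the resource itself plus its direct children
        "Content-Type": 'application/xml; charset="utf-8"',
    }
    return headers, body

headers, body = build_propfind(["getcontentlength", "getlastmodified"])
print(headers["Depth"])
print(body.decode("utf-8"))
```

The body and headers could then be passed to an HTTP client together with the PROPFIND method name; the target host and path would depend on the server at hand.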
9P
- Plan 9 network protocol (the newer version is called 9P2000)
- In Plan 9 all resources are files, and files are usable over the network
- Also used for IPC, for example for communication with the window manager
- The IL protocol is used for network communication:
  - Reliable, ordered packet transmission
  - Runs on top of IP, as an alternative to TCP
  - Fast, low overhead
  - Adaptive timeouts
25/25