Hajussüsteemid (Distributed Systems), MTAT.08.024
Distributed File Systems
(slides adapted from Meelis Roos's DS12 course)

Distributed File Systems (DFS)
- Background
- Naming and transparency
- Remote file invocation
- Stateless vs. state-preserving services
- File replication

Background
- Distributed File System (DFS): a distributed implementation of the classical file system, in which users share files and storage space.
- A DFS manages a set of storage devices that are spread across different locations.
- A storage resource (a machine) contains smaller resources: the individual storage elements (a specific HDD in a specific machine).
- Generally, a specific storage device is associated with a specific set of files.
- Key terms: service, server, client.

Background
- Interface: the primitive file operations.
- Transparency of the interface: local and remote files are accessed through the same primitive operations.

Naming and Transparency
- Naming: the association between logical and physical names.
- Multi-layered naming: a file abstraction that hides how and where files are actually stored.
- A transparent file system also hides from the user where in the network the files are stored.
- A file can be replicated on multiple servers. Name resolution then returns a list of file locations, but this information is hidden from the user.

Transparency: location transparency
- Simplifies data distribution.
- The file name does not contain or point to the actual location of the file.
- The file name still denotes (hidden from the user) a particular set of blocks on a particular machine.
- The scheme can therefore still reveal which machine a file is actually stored on.

Transparency: location independence
- The file name does not change when the file is moved to a different machine within the DFS.
- A higher level of abstraction.
- Distribution happens at the block level, not at the file level.
- The name space hierarchy is separated from the storage device hierarchy.
- Makes file migration possible.
- In practice, static location-dependent names are generally in use.
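Note: location independence can be pictured as a mapping from stable logical names to changeable physical locations. The sketch below is only an illustration; the class NameService, its methods, and the machine names are hypothetical and not part of any real DFS.

```python
class NameService:
    """Toy name service: logical file name -> (machine, local path)."""

    def __init__(self):
        self.location = {}

    def register(self, name, machine, local_path):
        self.location[name] = (machine, local_path)

    def migrate(self, name, new_machine, new_path):
        # Only the mapping changes; the logical name seen by users stays the same.
        if name not in self.location:
            raise KeyError(name)
        self.location[name] = (new_machine, new_path)

    def resolve(self, name):
        return self.location[name]


ns = NameService()
ns.register("/projects/report.txt", "serverA", "/disk1/f0017")
before = ns.resolve("/projects/report.txt")
ns.migrate("/projects/report.txt", "serverB", "/disk3/f0042")
after = ns.resolve("/projects/report.txt")
```

The user keeps opening /projects/report.txt before and after the migration; only the result of resolve() differs.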

Naming schemes: 3 ways
1. The file name contains the machine name and the local file path on that machine (this guarantees file name uniqueness across the entire DFS).
2. Remote directories are mounted under local directories. The result is one big picture, but only mounted directories can be used transparently. Variation: automatic mounting.
3. Total integration of the file servers: one global name space contains information about all files on all servers in the DFS. If one of the file servers goes down, the corresponding subset of files becomes unavailable.
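Note: the second scheme (mounting remote directories under local ones) can be sketched as a lookup in a mount table. This is a simplified illustration; the table contents, the host names, and the function resolve() are assumptions made for the example, and the returned (server, path) pair also shows what a name in the first scheme looks like.

```python
# Hypothetical mount table: local mount point -> (server, exported directory).
MOUNTS = {
    "/home": ("fs1.example.org", "/export/home"),
    "/home/shared/docs": ("fs2.example.org", "/export/docs"),
}


def resolve(path, mounts=MOUNTS):
    """Map a local path to (server, remote path) via the longest matching
    mount point. Returns (None, path) when the path is purely local."""
    best = None
    for mount_point in mounts:
        if path == mount_point or path.startswith(mount_point + "/"):
            if best is None or len(mount_point) > len(best):
                best = mount_point
    if best is None:
        return (None, path)
    server, export = mounts[best]
    # Replace the mount point prefix with the exported directory.
    return (server, export + path[len(best):])
```

Paths outside every mount point stay local, which matches the remark that only mounted directories can be used transparently.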

Remote File Invocation
- Goal: reduce the network load produced by remote file invocations.
- Method: buffer the most recently used blocks on the client machine, so that further invocations are served locally on the client side.
- If no buffered data exists for a requested file, fetch it over the network.
- Files are still identified by the original copy, but the invocations are local, which produces many local copies on client machines.
- Cache-consistency problem: the local cache must be updated when the main copy changes.
- Cache granularity: from single blocks up to entire files.
- The block size depends on the block size of the underlying FS and on the network bandwidth.
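Note: a minimal sketch of client-side block buffering. It assumes least-recently-used eviction, which the slide does not specify; the class BlockCache and the fetch_block callback (standing in for the network request) are hypothetical names.

```python
from collections import OrderedDict


class BlockCache:
    """Client-side cache of file blocks with LRU eviction (an assumption)."""

    def __init__(self, fetch_block, capacity=64):
        self.fetch_block = fetch_block  # callable(file_id, block_no) -> bytes
        self.capacity = capacity
        self.blocks = OrderedDict()     # (file_id, block_no) -> data
        self.remote_fetches = 0

    def read_block(self, file_id, block_no):
        key = (file_id, block_no)
        if key in self.blocks:
            # Cache hit: served locally, no network traffic.
            self.blocks.move_to_end(key)
            return self.blocks[key]
        # Cache miss: fetch from the server and remember the block.
        data = self.fetch_block(file_id, block_no)
        self.remote_fetches += 1
        self.blocks[key] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # drop the least recently used block
        return data


def fake_server(file_id, block_no):
    # Stand-in for a remote read; returns recognizable dummy data.
    return f"{file_id}:{block_no}".encode()


cache = BlockCache(fake_server, capacity=2)
for block_no in (0, 1, 0, 2, 0):
    cache.read_block("report.txt", block_no)
```

Of the five reads above, only the three distinct blocks cause a fetch; the repeated reads of block 0 are served from the cache.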

Cache location on the client machine
- HDD based:
  - reliable;
  - survives crashes;
  - examples: Solaris and Linux cachefs.
- Memory based:
  - shorter file invocation time;
  - the more memory, the faster it works;
  - allows client machines without an HDD;
  - servers buffer their internal HDDs in memory as well.

Write policies
- Write-through: on a write operation, the data is written to the server's HDD at once.
  - Reliable.
  - Slow.
- Delayed-write (write-back): write to memory first and write to the server's HDD later, asynchronously.
  - Writing is fast.
  - Repeated rewrites of the same region are optimized.
  - Less reliable.
  - Variation: write periodically from memory to the server's HDD.
  - Variation: write to the server's HDD when the file is closed.
- NFSv2 is synchronous; NFSv3 can write asynchronously.
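Note: the two policies can be contrasted with a toy model in which a Python dict stands in for the server's HDD. The class names and the server_writes counter are hypothetical and exist only to make the difference visible.

```python
class WriteThroughCache:
    """Every write goes to the server immediately."""

    def __init__(self, server):
        self.server = server      # dict standing in for the server's HDD
        self.cache = {}
        self.server_writes = 0

    def write(self, block_no, data):
        self.cache[block_no] = data
        self.server[block_no] = data
        self.server_writes += 1


class WriteBackCache:
    """Writes stay in memory until flush() or close()."""

    def __init__(self, server):
        self.server = server
        self.cache = {}
        self.dirty = set()        # blocks changed since the last flush
        self.server_writes = 0

    def write(self, block_no, data):
        self.cache[block_no] = data
        self.dirty.add(block_no)

    def flush(self):
        # Could be called periodically (first variation on the slide).
        for block_no in sorted(self.dirty):
            self.server[block_no] = self.cache[block_no]
            self.server_writes += 1
        self.dirty.clear()

    def close(self):
        # Write-on-close (second variation on the slide).
        self.flush()
```

Rewriting the same block three times costs three server writes with write-through but only one with write-back; the price is that unflushed data is lost if the client crashes before flush().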

Consistency
- Question: is the local copy in sync with the copy on the server?
- Client-side based:
  - the client initiates the consistency check;
  - the check runs periodically or on each file operation.
- Server-side based:
  - the server grants permission for buffering on the client side and keeps account of what each client buffers;
  - the server notifies clients when some of the buffered data has been updated;
  - examples: SMB opportunistic locking, delegations.
- Buffering suits read-heavy invocation patterns with few writes.
- Buffered and direct invocations use different interfaces.
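Note: a minimal sketch of the client-initiated variant, assuming the server keeps a version counter per file and the client validates on every read. The classes Server and ValidatingClient are hypothetical; real protocols typically compare modification times or change attributes instead.

```python
class Server:
    """Toy file server that bumps a version number on every write."""

    def __init__(self):
        self.files = {}  # name -> (version, data)

    def write(self, name, data):
        version = self.files.get(name, (0, b""))[0] + 1
        self.files[name] = (version, data)

    def get_version(self, name):
        return self.files[name][0]

    def read(self, name):
        return self.files[name]


class ValidatingClient:
    """Client that checks its cached copy against the server on each read."""

    def __init__(self, server):
        self.server = server
        self.cache = {}  # name -> (version, data)

    def read(self, name):
        cached = self.cache.get(name)
        # Cheap validity check: compare versions before trusting the cache.
        if cached is not None and cached[0] == self.server.get_version(name):
            return cached[1]
        self.cache[name] = self.server.read(name)
        return self.cache[name][1]
```

The version check still costs one small message per read; checking only periodically would save traffic but allow the client to see stale data for a while.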

State-preserving file server
- How it works:
  - the client opens a file;
  - the server opens the file internally (caching part of it in memory) and passes a reference to it on to the client;
  - further invocations are served by the server from the in-memory copy of the file;
  - when the client closes the file, the file opened on the server is closed as well.
- Advantages:
  - good performance;
  - fewer HDD accesses (the file is read once, the remaining operations happen in memory);
  - the server knows whether the file is opened for sequential access and can read ahead if required.
- Disadvantage: inactive clients keep files open on the server.
- Examples: AFS, SMB.

Stateless file server
- File state is not preserved on the server.
- All the state is sent along with each new request: every request contains the file identifier and the offset within the file.
- Requests are longer.
- Slower.
- No need to keep connections open: each request is a separate transaction.
- Survives a server restart: to the client it just looks as if the request took longer this time.
- A client reboot does not affect the server in any way.
- No server-side locking is possible.
- Example: NFS.
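Note: the stateless style can be sketched as a request handler that receives everything it needs in each call. The function names and the dict standing in for server storage are hypothetical; the point is that the offset lives on the client, not on the server.

```python
def handle_read(store, file_id, offset, length):
    """Serve one self-contained read request.
    Nothing about the client or the file is remembered between calls."""
    data = store[file_id]
    return data[offset:offset + length]


def read_all(store, file_id, chunk=4):
    """Client side: the client tracks its own offset and sends it each time."""
    offset = 0
    parts = []
    while True:
        data = handle_read(store, file_id, offset, chunk)
        if not data:
            break
        parts.append(data)
        offset += len(data)
    return b"".join(parts)


STORE = {"f1": b"hello world"}  # stand-in for the server's files
```

Because handle_read() keeps no per-client state, a server restart between two calls of the loop in read_all() would not change the result.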

File replication
- Copies of the same file are located on multiple servers.
- Increases reliability.
- Can reduce the request serving delay (select the nearest server; RAID across servers).
- The naming scheme returns the list of copies; one copy is then selected from that list.
- Higher abstraction layers do not see these copies.
- Lower layers need additional naming to distinguish the copies of the same file.
- On-demand replication: make a local copy only when the file is requested.
- Changes to one copy have to be reflected in all the others, for instance via a primary copy.
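Note: a minimal sketch of primary-copy replication with nearest-replica reads. All names (Replica, PrimaryCopyStore, the distance table) are hypothetical, updates are propagated synchronously for simplicity, and failures are ignored.

```python
class Replica:
    """One server holding copies of files."""

    def __init__(self, name):
        self.name = name
        self.files = {}


class PrimaryCopyStore:
    """All writes go through the primary, which pushes them to the backups."""

    def __init__(self, primary, backups):
        self.primary = primary
        self.backups = list(backups)

    def _holders(self, filename):
        replicas = [self.primary] + self.backups
        return [r for r in replicas if filename in r.files]

    def write(self, filename, data):
        self.primary.files[filename] = data
        for replica in self.backups:
            replica.files[filename] = data  # propagate the change

    def locations(self, filename):
        # What the naming scheme returns: the list of copies.
        return [r.name for r in self._holders(filename)]

    def read(self, filename, distance):
        # Select one copy: here the replica with the smallest distance.
        nearest = min(self._holders(filename), key=lambda r: distance[r.name])
        return nearest.name, nearest.files[filename]


store = PrimaryCopyStore(Replica("s1"), [Replica("s2"), Replica("s3")])
store.write("a.txt", b"x")
```

A caller only asks for a.txt; which of s1, s2 or s3 answers is decided inside read(), so the copies stay invisible to the higher layers.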