COSC 6397 Big Data Analytics
Distributed File Systems
Edgar Gabriel
Spring 2017

What is a file system
* A clearly defined method that the OS uses to store, catalog, and retrieve files
* Manages the bits that make up the file itself and its metadata
* Metadata: data about data, e.g.
  * where the data is logically placed on the hard drive
  * file name
  * organizational hierarchy (i.e., directories)
  * last modification date
  * permissions (read, write, execute, etc.)
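To make the notion of metadata concrete, the following minimal Python sketch queries a few of the metadata fields listed above through the standard `os.stat` call. The helper name `describe` and the throwaway file `example.txt` are assumptions made for illustration only.

```python
import os
import stat
import tempfile
import time


def describe(path):
    """Return a few metadata fields the file system keeps for `path`."""
    info = os.stat(path)
    return {
        "name": os.path.basename(path),
        "directory": os.path.dirname(os.path.abspath(path)),
        "size_bytes": info.st_size,
        "modified": time.ctime(info.st_mtime),        # last modification date
        "permissions": stat.filemode(info.st_mode),   # e.g. read/write/execute bits
    }


# Demo on a temporary file (name and content are arbitrary).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "example.txt")
    with open(demo_path, "w") as f:
        f.write("hello")
    demo = describe(demo_path)
    print(demo)
```

None of these fields is stored in the file's contents; the file system maintains them separately from the data bytes.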
UNIX File Model - overview
* A file is a sequence of bytes
* When a program opens a file, the file system establishes a file pointer. The file pointer is an integer indicating the position in the file where the next byte will be read or written.
* Disk drives read and write data in fixed-size units (disk sectors)
* File systems allocate space in blocks; a block is a fixed number of contiguous disk sectors.
* In UNIX-based file systems, the blocks that hold data are listed in an inode. An inode contains the information needed to find all the blocks that belong to a file.
* If a file is too large for the inode to hold the whole list of blocks, intermediate nodes (indirect blocks) are introduced.

Write operations
* Write: the file system copies bytes from the user buffer into a system buffer. When the buffer fills up, the system sends the data to disk.
* System buffering
  + allows the file system to collect full blocks of data before sending them to disk
  + the file system can send several blocks at once to the disk (delayed write or write-behind)
  - data is not really saved in the case of a system crash
  - for very large write operations, the additional copy from user to system buffer could/should be avoided
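The write path described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the class `WriteBehindBuffer`, its in-memory "disk" list, and the tiny block size are hypothetical and do not correspond to any real kernel interface.

```python
class WriteBehindBuffer:
    """Toy system buffer: collects bytes and writes only full blocks to 'disk'."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.pending = bytearray()   # bytes copied from user buffers, not yet on disk
        self.disk = []               # blocks that have reached the (simulated) disk

    def write(self, user_data):
        # Copy from the user buffer into the system buffer.
        self.pending.extend(user_data)
        # Send every complete block to disk; keep the remainder buffered.
        while len(self.pending) >= self.block_size:
            self.disk.append(bytes(self.pending[:self.block_size]))
            del self.pending[:self.block_size]

    def flush(self):
        # Force the partial block out, as an explicit flush would.
        if self.pending:
            self.disk.append(bytes(self.pending))
            self.pending.clear()


buf = WriteBehindBuffer(block_size=8)
buf.write(b"abc")        # 3 bytes buffered, nothing on disk yet
buf.write(b"defghij")    # 10 bytes in total: one full 8-byte block is written
```

Anything still in `pending` at the time of a crash would be lost, which is the reliability drawback of delayed writes.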
Read operations
* Read:
  * The file system determines which blocks contain the requested data
  * Reads the blocks from disk into the system buffer
  * Copies the data from the system buffer into user memory
* System buffering
  + the file system always reads a full block (file caching)
  + if the application reads data sequentially, prefetching (read-ahead) can improve performance
  - prefetching harms performance if the application has a random access pattern

Hiding disk latency
* Caching and buffering
  * Avoids repeated accesses to the same block
  * Allows a file system to smooth out I/O behavior
  * Helps to hide the latency of the hard drives
  * Lowers the performance of I/O operations for irregular access patterns
* Non-blocking I/O gives users control over prefetching and delayed writing
  * Initiate read/write operations as soon as possible
  * Wait for the read/write operations to finish only when absolutely necessary
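The "initiate early, wait late" pattern of non-blocking I/O can be sketched in Python. As an assumption for illustration, a worker thread from `concurrent.futures` stands in for a true asynchronous I/O interface; the function names `read_file` and `overlapped_read` are hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_file(path):
    with open(path, "rb") as f:
        return f.read()


def overlapped_read(path, compute):
    """Start the read early, do other work, and wait only when data is needed."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(read_file, path)   # initiate the read as soon as possible
        partial = compute()                      # useful work while the I/O is in flight
        data = pending.result()                  # wait only now, when the data is required
    return partial, data


# Demo with a temporary 1000-byte file and a trivial computation.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    with open(path, "wb") as f:
        f.write(b"x" * 1000)
    partial, data = overlapped_read(path, lambda: sum(range(10)))
```

The benefit depends on how much independent computation is available between initiating the operation and needing its result.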
Journaling file systems
* Updating a file typically takes multiple steps. An interruption between the steps leaves the file system in an inconsistent state.
* Example: deleting a file
  1. Remove the directory entry
  2. Mark the inode's blocks as free in the space map
* A journaling file system records the changes it is about to make in a journal before committing them to the main file system
  * Entries are written to the journal before the file system is modified
  * After a crash, the journal is replayed, and each entry is either
    * replayed: the entry is complete and can be fully re-applied during recovery
    * not replayed: the journal entry had not been finished
  * Journal entries often contain a per-entry checksum to detect corruption

Journaling file systems (II)
* Physical journal:
  * Data and metadata are written to the journal before the file system is modified
  * Large overhead -> data is written twice
* Logical journal:
  * Only metadata is written to the journal
  * Modifications to data are written to the file system directly
  -> worst-case scenario: the data is garbage, but the directory structure and file structure are consistent
  -> trade-off between performance and reliability
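The replay step can be sketched as follows. This is a simplified illustration, not the format of any particular journaling file system: the entry layout, the `make_entry` and `replay` helpers, and the dictionary standing in for the file system are all assumptions. An entry without a valid checksum is treated as unfinished and skipped.

```python
import json
import zlib


def make_entry(op):
    """Serialize an operation and attach a CRC32 checksum."""
    payload = json.dumps(op, sort_keys=True)
    return {"payload": payload, "crc": zlib.crc32(payload.encode())}


def replay(journal, fs):
    """Apply every intact journal entry to `fs`; skip unfinished or corrupt ones."""
    for entry in journal:
        payload = entry.get("payload", "")
        if zlib.crc32(payload.encode()) != entry.get("crc"):
            continue                                   # entry not finished: do not replay
        op = json.loads(payload)
        if op["type"] == "delete":
            fs["directory"].pop(op["name"], None)      # step 1: remove the directory entry
            fs["free_blocks"].update(op["blocks"])     # step 2: mark the blocks as free
    return fs


# State found on disk after a crash: both files are still in the directory.
fs = {"directory": {"a.txt": [3, 4], "b.txt": [7]}, "free_blocks": set()}
journal = [
    make_entry({"type": "delete", "name": "a.txt", "blocks": [3, 4]}),
    {"payload": '{"type": "delete", "name": "b.t'},    # torn entry, checksum never written
]
replay(journal, fs)
```

Both steps are written so that replaying an entry twice is harmless, which matters because recovery cannot know how much of an entry had already been applied before the crash.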
Log-structured file systems
* Conventional file systems
  * lay out files to optimize spatial locality
  * make in-place changes to their data structures
  in order to perform well on magnetic disks (seeks are slow)
* Log-structured file systems treat storage as a circular buffer
  * Writes always occur at the head of the log
  * Writes create multiple, chronologically advancing versions of both file data and metadata
  * Can be used to make old file versions nameable and accessible (snapshotting)
  * Recovery from crashes is simpler: upon its next mount, the file system can reconstruct its state from the last consistent point in the log and does not need to walk all its data structures

Distributed File Systems
* The generic term for a client/server file system in which the data is not locally attached to a host.
* Clients, servers, and storage are dispersed across machines. Configuration and implementation may vary.
* Clients should view a DFS the same way they would a centralized FS; the distribution is hidden at a lower level.
* Performance is concerned with throughput and response time.
Slide based on a lecture by Jerry Breecher: http://web.cs.wpi.edu/~jb/cs502/lectures/section17-dist_file_sys.ppt
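For the log-structured approach described above, a minimal sketch of append-only writes and snapshot reads might look like this. The class `LogStructuredStore` is hypothetical, and the sketch deliberately omits the circular-buffer aspect (reclaiming space from old versions).

```python
class LogStructuredStore:
    """Toy append-only store: every write adds a new version at the log head."""

    def __init__(self):
        self.log = []    # list of (name, data) records, oldest first

    def write(self, name, data):
        self.log.append((name, data))   # never overwrite in place
        return len(self.log)            # log position, usable as a snapshot id

    def read(self, name, snapshot=None):
        """Return the newest version of `name` at or before `snapshot`."""
        end = len(self.log) if snapshot is None else snapshot
        for rec_name, data in reversed(self.log[:end]):
            if rec_name == name:
                return data
        return None


store = LogStructuredStore()
snap = store.write("a", b"v1")     # snapshot id taken after the first version
store.write("a", b"v2")            # newer version appended, old one still in the log
latest = store.read("a")
old = store.read("a", snapshot=snap)
```

Because old versions remain in the log until they are reclaimed, reading at an earlier log position yields the earlier file contents.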
Distributed File Systems - Characteristics
* Naming: mapping between logical and physical objects
  * Example: a filename maps to <cylinder, sector>.
  * In a conventional file system, it is understood where the file actually resides; the system and disk are known.
  * In a transparent DFS, the location of a file, somewhere in the network, is hidden.
* Location transparency: the name of a file does not reveal any hint of the file's physical storage location.
* Location independence: the name of a file does not need to be changed when the file's physical storage location changes.

Distributed File Systems - Characteristics
* Caching
  * Reduces network traffic by retaining recently accessed disk blocks in a cache, so that repeated accesses to the same information can be handled locally.
  * If the requested data is not already cached, a copy of the data is brought from the server to the user.
  * Accesses are performed on the cached copy.
  * Files are identified with one master copy residing on the server machine; copies of (parts of) the file are scattered across different caches.
* Cache consistency problem: keeping the cached copies consistent with the master file.
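One possible way to approach the cache consistency problem is to validate a cached block against the master copy before using it. The sketch below assumes a per-block version number and a validate-on-read policy; both are illustrative choices, not something prescribed by the slides, and the `Server` and `Client` classes are hypothetical.

```python
class Server:
    """Holds the master copy of each block together with a version number."""

    def __init__(self):
        self.blocks = {}    # block id -> (version, data)

    def write(self, block_id, data):
        version = self.blocks.get(block_id, (0, b""))[0] + 1
        self.blocks[block_id] = (version, data)

    def version(self, block_id):
        return self.blocks[block_id][0]     # cheap validation request

    def fetch(self, block_id):
        return self.blocks[block_id]        # full transfer of the block


class Client:
    def __init__(self, server):
        self.server = server
        self.cache = {}
        self.fetches = 0                    # number of full transfers from the server

    def read(self, block_id):
        cached = self.cache.get(block_id)
        # Fetch only if the block is missing or the master copy has changed.
        if cached is None or cached[0] != self.server.version(block_id):
            self.cache[block_id] = self.server.fetch(block_id)
            self.fetches += 1
        return self.cache[block_id][1]


server = Server()
client = Client(server)
server.write(0, b"old")
r1 = client.read(0)           # cache miss: block is fetched
r2 = client.read(0)           # cache hit: served locally after validation
fetches_before_update = client.fetches
server.write(0, b"new")       # master copy changes
r3 = client.read(0)           # stale copy detected and replaced
```

Validation still costs one small message per read; other policies (for example, server-initiated invalidation) trade that cost against additional server state.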
Distributed File Systems - Characteristics
* Typical steps for a read operation:
  1. The client makes a request for file access.
  2. The request is passed to the server in message format.
  3. The server performs the file access.
  4. Return messages bring the result back to the client.
* Cache location:
  * Data can be kept in local memory or on the local disk.
  * Caching can be done on both the client and the server side.

Distributed File Systems - Characteristics
* Stateful: the server keeps track of information about client requests.
  * Maintains which files are opened by a client
  * Memory must be reclaimed when the client closes the file or when the client dies.
  * Good for performance: no need to parse the filename each time, or to "open/close" the file on every request.
  * Bad for reliability: a stateful server loses everything on a crash.
* Stateless: each client request provides the complete information needed by the server (i.e., filename, file offset).
  * The server does not need to maintain information on behalf of the client.
  * A stateless server remembers nothing, so it can restart easily after a crash.
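The difference between the two server designs can be shown with a small sketch. The classes, the `FILES` dictionary, and the `crash` method are hypothetical stand-ins used only to contrast where the file position is kept.

```python
FILES = {"report.txt": b"distributed file systems"}   # hypothetical server-side storage


class StatefulServer:
    """Remembers open files and their current offsets on behalf of clients."""

    def __init__(self):
        self.open_files = {}     # handle -> [name, offset]
        self.next_handle = 0

    def open(self, name):
        handle = self.next_handle
        self.next_handle += 1
        self.open_files[handle] = [name, 0]
        return handle

    def read(self, handle, count):
        name, offset = self.open_files[handle]
        data = FILES[name][offset:offset + count]
        self.open_files[handle][1] = offset + len(data)   # server advances the position
        return data

    def crash(self):
        self.open_files.clear()  # all per-client state is lost


class StatelessServer:
    """Every request carries the file name and offset; nothing is remembered."""

    def read(self, name, offset, count):
        return FILES[name][offset:offset + count]


stateful = StatefulServer()
handle = stateful.open("report.txt")
first = stateful.read(handle, 11)
stateful.crash()
try:
    stateful.read(handle, 4)
    lost_state = False
except KeyError:
    lost_state = True            # the handle is unknown after the crash

stateless = StatelessServer()
after = stateless.read("report.txt", 12, 4)   # works regardless of earlier crashes
```

In the stateless case the client tracks the offset itself, which makes each request larger but lets the server restart without any recovery protocol.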
Example: NFS
* The Network File System
* Protocol for a remote file service
* Stateless server (v3)
* Communication based on RPC (Remote Procedure Call)
* NFS provides session semantics: changes to an open file are initially visible only to the process that modified the file
* File locking is not part of the NFS protocol (v3), but is often available through a separate protocol/daemon
* Client caching is not part of the NFS protocol (v3): implementation-dependent behavior
Image taken from a lecture by Jerry Breecher: http://web.cs.wpi.edu/~jb/cs502/lectures/section17-Dist_File_Sys.ppt

Parallel File Systems
* Parallel file system: data blocks are striped across multiple storage devices on multiple storage servers.
* Support for parallel applications: all nodes can access the same files at the same time (concurrent read and write capabilities)
* Three relevant parameters:
  * Stripe factor: number of disks
  * Stripe size: size of each block
  * Which disk contains the first block of the file
[Figure: Block 1, Block 2, Block 3, ..., Block n of a file striped across Disk 1, Disk 2, Disk 3, Disk 4]
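The three striping parameters are enough to compute where any byte of a file lives. The following sketch assumes a simple round-robin layout with 0-based disk indices; the function name `locate` is hypothetical, and real parallel file systems may use different layouts.

```python
def locate(offset, stripe_size, stripe_factor, first_disk=0):
    """Map a byte offset in a file to (disk index, byte offset on that disk).

    Assumes blocks are distributed round-robin, starting at `first_disk`.
    """
    block = offset // stripe_size                  # which stripe block of the file
    disk = (first_disk + block) % stripe_factor    # round-robin over the disks
    local_block = block // stripe_factor           # earlier blocks of this file on that disk
    return disk, local_block * stripe_size + offset % stripe_size


KB = 1024
a = locate(0, 64 * KB, 4)                          # first block, first disk
b = locate(64 * KB, 64 * KB, 4)                    # second block lands on the next disk
c = locate(4 * 64 * KB + 10, 64 * KB, 4)           # fifth block wraps around to disk 0
d = locate(3 * 64 * KB, 64 * KB, 4, first_disk=2)  # layout starting at disk 2
```

With a stripe factor of 4, a large sequential read touches all four disks in turn, which is where the aggregate bandwidth of a parallel file system comes from.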
Parallel File Systems: Conceptual overview
[Figure: compute nodes, metadata server, storage server 0, storage server 1, storage server 2, storage server 3]

Parallel File Systems - Concept
* Metadata server: stores namespace metadata, such as filenames, directories, access permissions, and file layout.
* The metadata server is not necessarily involved in file I/O operations
* Distributed metadata server: e.g., multiple metadata servers are available, each hosting a part of the namespace, assigned by
  * a hashing function on the file name, or
  * subtrees of the directory hierarchy
* Write operations:
  * Require locking of the entire file or of file blocks to ensure consistency
  * Distributed locking protocols can be used
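The two ways of partitioning the namespace across metadata servers can be sketched as follows. Both functions, the choice of SHA-256, and the example subtree map are assumptions for illustration; actual systems define their own placement rules.

```python
import hashlib


def metadata_server_by_hash(path, num_servers):
    """Pick a metadata server by hashing the file name (deterministic across runs)."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


def metadata_server_by_subtree(path, subtree_map, default=0):
    """Pick the server that owns the longest matching directory prefix."""
    best = ""
    for prefix in subtree_map:
        inside = path == prefix or path.startswith(prefix.rstrip("/") + "/")
        if inside and len(prefix) > len(best):
            best = prefix
    return subtree_map.get(best, default)


# Hypothetical assignment of directory subtrees to metadata servers 0..2.
subtrees = {"/home": 0, "/scratch": 1, "/scratch/projects": 2}
s_hash = metadata_server_by_hash("/data/run1/out.dat", 4)
s_deep = metadata_server_by_subtree("/scratch/projects/x.dat", subtrees)
s_mid = metadata_server_by_subtree("/scratch/tmp/y.dat", subtrees)
s_none = metadata_server_by_subtree("/etc/passwd", subtrees)
```

Hashing spreads files evenly but scatters the entries of one directory over several servers, whereas subtree partitioning keeps a directory together at the risk of uneven load.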
Example: Parallel Virtual File System
* Open source project from Clemson University
* Lightweight server daemon that provides simultaneous access to storage
* Each node in the cluster can be a server, a client, or both.
* Best suited for providing large, fast temporary storage.
* The basic PVFS2 package consists of three components: a server, a client, and a kernel module.
* Default stripe size: 64 kB
  * In practice: often changed to 1 MB
  * Can be adjusted on a per-directory basis
Slides based on a talk by James W. Barker: http://www.slideshare.net/lystrata/survey-of-clusteredparallelfilesystems004lanlppt-10538039

Example: Parallel Virtual File System
* Stateless architecture
  * PVFS2 servers do not keep track of typical file system bookkeeping information such as which files have been opened, file positions, etc.
  * No shared lock state to manage
  * Servers can fail and resume without disturbing the system as a whole.
* Distributed metadata server
* Relies on relaxed consistency semantics
  * Defines semantics of data access without requiring locking
Example: Parallel Virtual File System
* No client-side caching of metadata: status operations (e.g. ls) take a long time, as the information is retrieved over the network.
* PVFS2 is better suited for I/O-intensive applications than for hosting a home directory.
* PVFS2 is optimized for efficient reading and writing of large amounts of data, and thus it is very well suited for scientific applications.

Example: Parallel Virtual File System
* Two methods are provided for accessing PVFS2 file systems.
* Mount the PVFS2 file system
  * allows the user to access the file system using regular POSIX commands/functions
  * introduces some performance overhead
* PVFS2 library functions: e.g., used by MPI-IO
  * Do not implement POSIX semantics
  * Optimize access to single files by many processes on different nodes.
  * Provide noncontiguous access operations that allow for efficient access to data spread throughout the file.