File I/O - Filesystems from a user's perspective
Unix Filesystems Seminar

Alexander Holupirek
Database and Information Systems Group
Department of Computer & Information Science
University of Konstanz
13. November 2007

Alexander Holupirek (U KN) File I/O 13. November 2007 1 / 62

Introduction
- Most Unix file I/O boils down to five functions: open, read, write, lseek, and close.
- We speak of unbuffered I/O, in contrast to the standard I/O routines.
- Unbuffered, because each read or write invokes a system call.
- Unbuffered I/O is not part of ANSI C, but it is part of POSIX and XPG3.

File Descriptor
- The kernel refers to open files by file descriptors.
- A file descriptor is a non-negative integer.
- Convention: 0 is stdin, 1 is stdout, 2 is stderr.
- POSIX defines STDIN_FILENO, STDOUT_FILENO, and STDERR_FILENO in unistd.h.
- Each process has a fixed-size descriptor table, which is guaranteed to have at least n slots.
- The entries in the descriptor table are numbered with small integers starting at 0.
- The call getdtablesize(3) returns the size of this table.
Example: Execute and examine fd_max.c

Know about limits
- Unix is available on a huge number of architectures.
- Different architecture, different capabilities (=> portability).
- Standards (ANSI C, POSIX [1], XPG3 [2]) are one way to generalize.
- Minimum values are defined for systems conforming to standard x.
- How can we find out?
[1] Portable Operating System Interface
[2] X/Open Portability Guide

Know your limits
- Two types: compile-time and run-time limits.
- E.g. compile-time limit: What is the largest value of long?
- E.g. run-time limit: How many chars are allowed in a filename?
- Determine compile-time limits via headers.
- Determine run-time limits not associated with files or directories via sysconf(3).
- Determine run-time limits associated with files or directories via pathconf(3) and fpathconf(3).

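The three lookup routes listed above can be sketched in C as follows. This is a minimal sketch under assumptions: the helper names open_max, name_max and print_limits are made up for illustration, and "/" is just a convenient directory to query.

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

/* Run-time limit not tied to a file: max. number of open files per process.
 * Returns -1 with errno left at 0 if the limit is indeterminate. */
long open_max(void)
{
    errno = 0;
    return sysconf(_SC_OPEN_MAX);
}

/* Run-time limit tied to a directory: max. filename length in bytes. */
long name_max(const char *dir)
{
    errno = 0;
    return pathconf(dir, _PC_NAME_MAX);
}

/* Hypothetical helper: print one limit of each kind. */
void print_limits(void)
{
    printf("LONG_MAX = %ld (compile-time, <limits.h>)\n", LONG_MAX);
    printf("OPEN_MAX = %ld (run-time, sysconf)\n", open_max());
    printf("NAME_MAX = %ld (run-time, pathconf on /)\n", name_max("/"));
}
```

Setting errno to 0 before the call lets a caller tell "no fixed limit" (-1, errno still 0) apart from a real error (-1, errno set).
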
Definitions of limits
/usr/src/include/lib/libc/compat-43/getdtablesize.c

    #include <unistd.h>

    int
    getdtablesize(void)
    {
            return sysconf(_SC_OPEN_MAX);
    }

- Constants beginning with _SC_ are used as arguments to sysconf(3).
- Constants beginning with _PC_ are used as arguments to either pathconf(3) or fpathconf(3).

Some examples

Name of limit   Description                                   name argument
ARG_MAX         max. length of args to exec in bytes          _SC_ARG_MAX
OPEN_MAX        max. number of open files per process         _SC_OPEN_MAX
NAME_MAX        max. number of bytes in a filename            _PC_NAME_MAX
PATH_MAX        max. number of bytes in a relative pathname   _PC_PATH_MAX

limits.h        : #define _POSIX_OPEN_MAX 16 /* must be <= OPEN_MAX <sys/syslimits.h> */
stdio.h         : #define FOPEN_MAX 20 /* max open files per process */
sys/syslimits.h : #define OPEN_MAX 64
sys/unistd.h    : #define _SC_OPEN_MAX 5

open(2) - function description
open(2) - open or create a file for reading or writing

    #include <fcntl.h>

    int open(const char *path, int flags, mode_t mode);

O_RDONLY     Open for reading only.
O_WRONLY     Open for writing only.
O_RDWR       Open for reading and writing.
O_NONBLOCK   Do not block on open or for data to become available.
O_APPEND     Append on each write.
O_CREAT      Create file if it does not exist.
O_TRUNC      Truncate size to 0.
O_EXCL       Error if create and file exists.
O_SYNC       Perform synchronous I/O operations.
O_SHLOCK     Atomically obtain a shared lock.
O_EXLOCK     Atomically obtain an exclusive lock.
O_NOFOLLOW   If last path element is a symlink, don't follow it.

open(2)
- flags is formed by OR'ing the O_* constants.
- Success: returns a non-negative integer (the file descriptor). open is guaranteed to return the lowest-numbered unused descriptor.
- Failure: returns -1 and sets errno to indicate the error. E.g. ENAMETOOLONG: a component of the pathname exceeded NAME_MAX.
- creat(3) is the same as:

    open(path, O_CREAT | O_TRUNC | O_WRONLY, mode);

close(2)
close(2) - delete a descriptor

    #include <unistd.h>

    int close(int d);

- When a process exits, all associated file descriptors are freed.
- close(2) may be used to avoid running out of active descriptors per process.
- Returns 0 on success; on failure returns -1 and sets the global int errno.
- close(2) will fail if:
  - the argument is not an active descriptor (EBADF)
  - an interrupt was received (EINTR)

Current file offset and lseek(2)
- Every open file has a current file offset.
- It is a non-negative integer holding the number of bytes from the beginning of the file.
- read(2) and write(2) start at (and increment) the current file offset.
- The default offset is 0 (unless O_APPEND is specified).
- An open file can be positioned by calling lseek(2).

lseek(2) - reposition read/write file offset

    #include <unistd.h>

    off_t lseek(int fildes, off_t offset, int whence);

Primitive System Data Types

dev_t     device numbers
fd_set    file descriptor sets
fpos_t    file position
gid_t     numeric group IDs
ino_t     i-node numbers
mode_t    file type, file creation mode
nlink_t   link counts for directory entries
off_t     file sizes and offsets
pid_t     process IDs and process group IDs
size_t    size of objects (such as strings) (unsigned)
ssize_t   functions that return a count of bytes (signed) (read, write)
time_t    counter of seconds of calendar time
uid_t     numeric user IDs
wchar_t   can represent all distinct character codes

- The header sys/types.h defines implementation-dependent types.
- C typedef is used.
- This provides a layer that hides implementation details.

lseek(2)
lseek(2) - reposition read/write file offset

    #include <unistd.h>

    off_t lseek(int fildes, off_t offset, int whence);

whence     offset (re-)position
SEEK_SET   the offset is set to offset bytes.
SEEK_CUR   the offset is set to the current location plus offset bytes.
SEEK_END   the offset is set to the size of the file plus offset bytes.

- Returns the new file offset measured in bytes from the beginning of the file, or -1 with errno set.

lseek(2) - example
seek.c

    /* Use lseek to check if file is capable of seeking. */
    if (lseek(STDIN_FILENO, 0, SEEK_CUR) == -1)
            err(errno, "can not seek [%d].", errno);
    else
            printf("seek OK.\n");

    $ ./a.out < /etc/motd
    seek OK.
    $ cat /etc/motd | ./a.out
    a.out: can not seek [29].: Illegal seek

/usr/include/sys/errno.h

    #define ESPIPE 29 /* Illegal seek */

lseek(2) - black hole file
Example: Execute and examine hole.c

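For readers without the seminar's hole.c at hand, a file with a hole can be produced roughly as follows. This is a stand-in sketch, not the original hole.c; the function name make_hole, the sample data and the permission bits are assumptions.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write 10 bytes, seek gap bytes past the end, write 10 more bytes.
 * Returns the resulting file offset (== file size), or -1 on error.
 */
off_t make_hole(const char *path, off_t gap)
{
    off_t size = -1;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);

    if (fd == -1)
        return -1;
    if (write(fd, "abcdefghij", 10) == 10 &&
        lseek(fd, gap, SEEK_CUR) != -1 &&      /* offset is now beyond EOF */
        write(fd, "ABCDEFGHIJ", 10) == 10)
        size = lseek(fd, 0, SEEK_CUR);
    if (close(fd) == -1)
        return -1;
    return size;
}
```

The file size ends up as 20 + gap bytes; the bytes in the gap were never written and read back as zeros. Comparing the output of ls -l and du on the result shows whether the filesystem actually stores the hole sparsely.
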
write(2) - write output
write(2) - synopsis

    #include <sys/types.h>
    #include <unistd.h>

    ssize_t write(int d, const void *buf, size_t nbytes);

- nbytes of data in buf are written to the open file d.
- The return value is usually equal to nbytes; otherwise an error occurred.
- Other errors are indicated by -1 and errno.
- write increments the file's offset.

read(2) - read input
read(2) - synopsis

    #include <sys/types.h>
    #include <unistd.h>

    ssize_t read(int d, void *buf, size_t nbytes);

- Reads up to nbytes into the buffer buf.
- Returns the number of bytes read, 0 at EOF, -1 on failure.
- The number of bytes returned is often smaller than requested.
- read starts at the current offset and increments it.

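Because read(2) may return fewer bytes than requested, callers that need a full buffer usually loop. The following is a minimal sketch of such a loop; read_full is a hypothetical helper name, and retrying on EINTR is a design choice of this example.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Keep calling read(2) until n bytes have arrived or EOF is reached.
 * Returns the number of bytes stored in buf (less than n only at EOF),
 * or -1 on error.
 */
ssize_t read_full(int fd, void *buf, size_t n)
{
    char *p = buf;
    size_t got = 0;

    while (got < n) {
        ssize_t r = read(fd, p + got, n - got);
        if (r == 0)                 /* EOF */
            break;
        if (r == -1) {
            if (errno == EINTR)     /* interrupted: try again */
                continue;
            return -1;
        }
        got += (size_t)r;           /* short read: ask for the rest */
    }
    return (ssize_t)got;
}
```
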
I/O efficiency
Example: Copy a file with scopy.c.

    #define BUFFSIZE XXX

    ssize_t rbytes;
    char buf[BUFFSIZE];

    /* Copy stdin to stdout, at most BUFFSIZE bytes per read(2). */
    while ((rbytes = read(STDIN_FILENO, buf, BUFFSIZE)) > 0)
            if (write(STDOUT_FILENO, buf, rbytes) != rbytes)
                    err(errno, "write error.");
    if (rbytes < 0)
            err(errno, "read error.");
    return (0);

File sharing (not P2P)
- Open files can be shared between different processes.
- Three (kernel) data structures are involved:
  1. Process table
  2. File table
  3. v-node structure

Data structures
- Process table
  - file descriptor flags
  - pointer to a file table entry
- File table
  - file status flags (read, write, append, sync, nonblocking, ...)
  - current file offset
  - pointer to the v-node table entry
- v-node structure
  - type of file, pointers to functions
  - i-node (read from disk on file open)
  - contains owner, file size, device location, pointers to data blocks

[Figure: Kernel data structures for open files]

[Figure: Two independent processes with the same file open]

Operations revisited
- write
  - write increments the current file offset in the file table entry.
  - If the current file offset > current file size, the i-node is updated.
- open
  - open with O_APPEND sets a flag in the file table entry.
  - On write, the current file offset is first set to the current file size from the i-node table entry.
  - write is thereby forced to append at the current EOF.
- lseek
  - lseek only modifies the current file offset in the file table. No I/O takes place.
  - Positioning to EOF just copies the file size from the i-node to the file offset in the file table.

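The bookkeeping described above can be mimicked in user space with a toy model. This is not kernel code: the struct and function names are invented for illustration, sizes are plain long values, and no data is actually written.

```c
/* Toy stand-in for a v-node with its i-node data: only the file size. */
struct vnode_model {
    long size;
};

/* Toy stand-in for a file table entry. */
struct file_model {
    int append;                 /* file status flag: O_APPEND set? */
    long offset;                /* current file offset */
    struct vnode_model *vn;     /* pointer to the v-node */
};

/*
 * Model of write: with the append flag the offset is first set to the
 * file size; then the offset advances and the size grows if needed.
 * Returns the offset at which the data would have been stored.
 */
long model_write(struct file_model *f, long nbytes)
{
    if (f->append)
        f->offset = f->vn->size;
    long start = f->offset;
    f->offset += nbytes;
    if (f->offset > f->vn->size)
        f->vn->size = f->offset;
    return start;
}

/* Model of lseek to EOF: copy the size into the offset, no I/O. */
long model_lseek_end(struct file_model *f)
{
    f->offset = f->vn->size;
    return f->offset;
}
```

With two file_model entries pointing at one vnode_model, calling model_lseek_end on both and then model_write on both makes the two writes start at the same offset. Setting append on both entries makes the second write start where the first one ended.
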
File sharing
- Multiple file descriptor entries can point to the same file table entry (dup(2) and fork(2)).
- File descriptor flags and file status flags live in different places (use fcntl(2) to modify them).
- There is no problem with multiple processes reading the same file. Each process has its own file table entry with a distinct current file offset.
- Problems can arise when multiple processes write to the same file.
- To avoid surprises, we need to understand the concept of atomic operations.

Atomic operations - Appending to a file
- Scenario: A single process wants to append to the end of a file.
- Assume the following code:

    if (lseek(fd, 0L, SEEK_END) < 0)    /* position to EOF ... */
            err(errno, "lseek error");
    if (write(fd, buf, 100) != 100)     /* ... and write */
            err(errno, "write error");

- This is fine for a single process, but will cause problems with multiple processes.

Atomic operations - Lost update
- Assume processes A and B are appending to the same file (without the O_APPEND flag).
- Each process has its own file table entry, but they share a single v-node table entry (see 3.3).
- Assume process A does the lseek and sets its current file offset to, say, 1500 (EOF).
- The kernel switches and schedules B to run.
- B performs lseek to 1500 (EOF).
- B performs its write and increments its current file offset to 1600.
- Since the file size has been extended, the kernel also updates the current file size in the v-node to 1600.
- The kernel switches and A resumes.
- When A calls write, the data is written at the current file offset for A, which is 1500.
- This overwrites the data written by process B.
=> Lost update anomaly.

Atomic operations - Lost update
- Problem: The logical operation "position to EOF and write" requires two system calls.
- Solution: Positioning and writing have to be one atomic operation.
- Any operation that requires more than one function call cannot be atomic. There is always the possibility that the kernel suspends the process between the two calls.
- Unix provides an atomic way for our scenario via the O_APPEND flag. The kernel positions the file to its current end before each write (no need for lseek).
- In general, the term atomic operation refers to an operation that is composed of multiple steps. If the operation is performed atomically, either all the steps are performed, or none. It must not be possible for only a subset of the steps to be performed.

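A sketch of the O_APPEND variant, replacing the separate lseek and write calls. The names open_for_append and append_record and the permission bits are assumptions of this example; treating a short write as an error follows the convention used on the write(2) slide.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Open (or create) path so that every write lands at the current EOF. */
int open_for_append(const char *path)
{
    return open(path, O_WRONLY | O_APPEND | O_CREAT, S_IRUSR | S_IWUSR);
}

/*
 * Append one record. No lseek is needed: the kernel positions to EOF
 * and writes as a single step. Returns 0 on success, -1 otherwise.
 */
int append_record(int fd, const void *buf, size_t n)
{
    return write(fd, buf, n) == (ssize_t)n ? 0 : -1;
}
```

Even if one writer's offset is stale because another descriptor extended the file in the meantime, its next append_record still goes to the new end of the file.
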
dup and dup2 - duplicate an existing file descriptor
dup(2) and dup2(2) - synopsis

    #include <unistd.h>

    int dup(int oldd);
    int dup2(int oldd, int newd);

- Both return the new file descriptor, or -1 on error.
- dup: the new file descriptor is guaranteed to be the lowest-numbered available one.
- dup2: the value of the new descriptor can be specified (newd).
- If newd is already open, it is closed first.
- If newd equals oldd, dup2 returns newd (without closing it).

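A typical use of dup2 is redirecting an existing descriptor to a file. The following is a minimal sketch; redirect_to_file is a hypothetical helper, and the open flags and permission bits are choices made for this example.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Make target_fd refer to the file at path (created or truncated).
 * Whatever target_fd referred to before is closed by dup2.
 * Returns 0 on success, -1 on error.
 */
int redirect_to_file(int target_fd, const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);

    if (fd == -1)
        return -1;
    if (fd == target_fd)        /* open already returned the wanted slot */
        return 0;
    if (dup2(fd, target_fd) == -1) {
        close(fd);
        return -1;
    }
    close(fd);                  /* target_fd still points to the file table entry */
    return 0;
}
```

For example, redirect_to_file(STDOUT_FILENO, "out.log") would make later write(2) calls on descriptor 1 go to out.log.
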
[Figure: Kernel data structures after dup(1)]

fcntl(2) - file control
fcntl(2) - synopsis

    #include <fcntl.h>

    int fcntl(int fd, int cmd, ...);

- fcntl can change the properties of a file that is already open.
- It is used for five different purposes:
  1. duplicate an existing descriptor (cmd = F_DUPFD)
  2. get/set file descriptor flags (F_GETFD or F_SETFD)
  3. get/set file status flags (F_GETFL or F_SETFL)
  4. get/set asynchronous I/O ownership (F_GETOWN or F_SETOWN)
  5. get/set record locks (F_GETLK, F_SETLK or F_SETLKW)

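As an example of purpose 3, the sketch below turns on a file status flag for a descriptor that is already open. The helper name add_status_flag is made up; the get-then-set pattern keeps the flags that are already set.

```c
#include <fcntl.h>

/*
 * Turn on one file status flag (e.g. O_NONBLOCK or O_APPEND) on fd.
 * Returns 0 on success, -1 on error.
 */
int add_status_flag(int fd, int flag)
{
    int flags = fcntl(fd, F_GETFL);     /* fetch the current status flags */

    if (flags == -1)
        return -1;
    /* Write back the old flags plus the new one. */
    return fcntl(fd, F_SETFL, flags | flag) == -1 ? -1 : 0;
}
```

After add_status_flag(fd, O_NONBLOCK) on the read end of an empty pipe, read(2) returns -1 immediately instead of blocking.
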
ioctl(2) - control device
ioctl(2) - synopsis

    #include <sys/ioctl.h>
    (additional device-specific headers)

    int ioctl(int d, unsigned long request, ...);

- ioctl() manipulates the underlying device parameters of special files.
- It can control many operating characteristics of character special files (e.g., terminals).
- It has always been the catchall function for I/O operations: anything that couldn't be expressed using one of the other functions.

ioctl - example
/usr/include/sys/ttycom.h

    #define TIOCM_LE  0001 /* line enable */
    #define TIOCM_DTR 0002 /* data terminal ready */
    #define TIOCM_RTS 0004 /* request to send */
    #define TIOCM_ST  0010 /* secondary transmit */
    #define TIOCM_SR  0020 /* secondary receive */
    #define TIOCM_CTS 0040 /* clear to send */
    #define TIOCM_CAR 0100 /* carrier detect */
    ...

- terminal I/O
- disklabels
- file I/O
- magnetic tape I/O
- socket I/O

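A small ioctl call that works without a terminal is FIONREAD, which asks how many bytes are waiting to be read. The sketch assumes a system where FIONREAD is made available by sys/ioctl.h and is supported for the descriptor type in question (pipes, sockets, terminals on common BSD and Linux systems); bytes_pending is a made-up helper name.

```c
#include <sys/ioctl.h>

/*
 * Return the number of bytes that can be read from fd without blocking,
 * or -1 with errno set if the request is not supported for fd.
 */
int bytes_pending(int fd)
{
    int n = 0;

    /* The third argument is request-specific: here a pointer to int. */
    if (ioctl(fd, FIONREAD, &n) == -1)
        return -1;
    return n;
}
```

The request constant selects the operation and also determines what the third argument must be, which is why each group of ioctl requests comes with its own header and documentation.
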
Summary
- We have seen the traditional Unix I/O functions.
- These are often called unbuffered I/O functions.
- Unbuffered, because each read and write invokes a system call.
- Atomic operations were introduced.
- We discussed the data structures used by the kernel to share information about open files.

Lecture Material
The tutorials are based on the following material:
- W. Richard Stevens. Advanced Programming in the UNIX Environment. Addison-Wesley Professional Computing Series, 1999 (19th printing). ISBN 0-201-56317-7.
- Marshall K. McKusick, Keith Bostic, Michael J. Karels, John S. Quarterman. The Design and Implementation of the 4.4BSD Operating System. Addison-Wesley, 1996. ISBN 0-201-54979-4.