Exploring the file system. Johan Montelius HT2016

1 Introduction

This is a fairly easy exercise, but you will learn a lot about how files are represented. We will not look at the actual content of the files but at the so-called meta-data that is kept for each file. To get a feeling for how much meta-data there is, we will implement a command that you have probably used many times in the shell. We will also gather some statistics on the size of files and present the numbers in some nice graphs.

2 List a directory

The command that we will implement is a version of the well-known ls command, the command you use to list the content of a directory. Our first try will not be very impressive but at least it will do something. Let's call it myls so we can use the regular ls to verify that we do the right thing. Create a file myls.c and start coding.

2.1 read the directory

To our help we have the library call opendir(), which will make things a bit easier for us. This procedure returns a pointer to a directory stream, i.e. a sequence of directory entries. We can then access each of these entries using the library procedure readdir(). You should by now be able to figure out what header files you need to include, so we will not list them in the code sections.

    int main(int argc, char *argv[]) {

      if (argc < 2) {
        fprintf(stderr, "usage: myls <dir>\n");
        return 1;
      }

      char *path = argv[1];

      DIR *dirp = opendir(path);

      struct dirent *entry;

      while ((entry = readdir(dirp)) != NULL) {
        printf("type %u", entry->d_type);
        printf("\tinode %lu", (unsigned long)entry->d_ino);
        printf("\tname %s\n", entry->d_name);
      }

      closedir(dirp);
      return 0;
    }

Do man readdir and you will see what the dirent structure looks like and what more we could find there beyond what we have printed (not much). If you compile and run the program you will see your first attempt at mimicking ls. It works, but we're far from done. This is what my attempt looked like.

    > gcc -o myls myls.c
    > ./myls ./
    type 4	inode 917888	name .
    type 8	inode 945912	name myls
    type 8	inode 963825	name myls.c
    type 4	inode 940418	name ..

So what do we have here? A type, the inode number and the name. We can make it a bit more readable by interpreting the type information.

2.2 the type

The interpretation is found in the man pages for readdir() and we can print the types using a switch statement. Each case needs its break, otherwise execution falls through and prints the letters of all the following cases as well.

    while ((entry = readdir(dirp)) != NULL) {

      switch (entry->d_type) {
      case DT_BLK:      // This is a block device.
        printf("b");
        break;
      case DT_CHR:      // This is a character device.
        printf("c");
        break;
      case DT_DIR:      // This is a directory.
        printf("d");
        break;
      case DT_FIFO:     // This is a named pipe.
        printf("p");
        break;
      case DT_LNK:      // This is a symbolic link.
        printf("l");
        break;
      case DT_REG:      // This is a regular file.
        printf("f");
        break;
      case DT_SOCK:     // This is a UNIX domain socket.
        printf("s");
        break;
      case DT_UNKNOWN:  // The file type is unknown.
        printf("u");
        break;
      }
      printf("\tinode %lu", (unsigned long)entry->d_ino);
      printf("\tname %s\n", entry->d_name);
    }

As seen in the code above, we are not talking about the type in terms of pdf, txt or a C source file. The type we are talking about differentiates directories, symbolic links etc. from regular files. To find out if a particular file is a pdf file we would have to look at its name and see if it ends in .pdf. This is only a convention and you're of course free to name a pdf file foo.jpg if you want to; doing so will however make it hard to automatically determine which application to open when you want to view the content of the file.

This is all there is to the directory: it's a mapping from names to inodes. We do not know anything about the file other than its name, and remember that the name is only valid in the directory that we are currently looking at.

2.3 more information

To find more information about a particular file we use the system call fstatat(). This procedure will populate a stat structure with all the information about the file we are looking for. There is also a stat() procedure, but it would look up a relative file name in the current working directory, which will probably not be the directory that we are listing. The fstatat() procedure allows us to specify in which directory the look-up should be done.

    struct stat file_st;

    fstatat(dirfd(dirp), entry->d_name, &file_st, 0);

If we insert this above the printf() statements we can write out more information about the file, for example its size.

    printf("\tinode %lu", (unsigned long)entry->d_ino);
    printf("\tsize %lu", (unsigned long)file_st.st_size);
    printf("\tname %s\n", entry->d_name);

Take a look in the man pages of fstatat() to see what other information the stat structure holds.

3 Things are not that simple

To disturb your world of comfort, where directories map names to inodes and inodes are something that lives on some disk, we will complicate things a bit. First we will create a dummy directory called dram and then mount a tmpfs file system using the directory as a mount point. Let's first create the directory and see what it looks like.

    > mkdir dram
    > ./myls .

As you see, the directory is given a new inode number, and we can verify that we do the right thing by looking at the output of the regular ls command.

    > ls -il .

Now let's try this: mount a tmpfs file system using the dram directory as the destination.

    > sudo mount -t tmpfs tmpfs ./dram

Nothing much has changed, which can be verified by looking at the output from myls. But check what you see when you try the regular command ls -li. Hmm, what is going on? If you want to get even more confused, look at the output when we look inside the dram directory.

    > ./myls dram
    > ls -ila dram

The regular ls command does not report what we see with myls; it collects its information from the stat structure instead. Try the following and things might become a bit clearer.

    printf("\tinode %lu", (unsigned long)entry->d_ino);
    printf("\tdev 0x%lx", (unsigned long)file_st.st_dev);
    printf("\tinode %lu", (unsigned long)file_st.st_ino);
    printf("\tsize %lu", (unsigned long)file_st.st_size);
    printf("\tname %s\n", entry->d_name);

As you see, ls uses the inode number found in the stat structure rather than the one in the directory listing. Also note that the inode number of the mounted directory, 2, is the same as that of your regular root directory. If you try the command df you will see a list of all file systems currently mounted in your system, and if you do some more investigation you will find that many of them have a root directory with inode number 2. Inode numbers are local to a file system, and the operating system needs to keep track of which file system we are talking about or, equivalently, on which device the file system is found.

The hierarchical name space, where directories serve as mount points for different file systems, provides a seamless name space. You're normally not aware of which file system you are accessing. If you're a regular Windows user you might be used to the drive letters C, D etc.; these are the Windows equivalent of mounted file systems. You might wonder where A and B went, but if you attach a floppy disk drive you might have it mounted as the A drive.

3.1 the device

We printed the st_dev value in hex and we did so for a reason. The value that was printed, let's assume 0x801, is interpreted as device 8 (the major number, which here identifies the disk driver) and partition 1 (the minor number). If you examine the /dev directory you will probably see that the device it refers to is a partition on your main disk drive. This might be different depending on which machine you run on, but you should be able to work out the details. The listing shows the major and minor numbers of each device file.

    > ls -l /dev/sda*

So the stat structure contains a local inode number and the device that this inode number belongs to.
This indirection from the true inodes that are on disk is called virtual inodes or vnodes: an abstraction layer that allows your Unix system to present several different file systems through the same interface.

Enough about mounted file systems. Unmount dram (sudo umount ./dram) and we shall try to gather some statistics about the files on your machine.

4 Traverse the tree

Let's write a program that counts the number of files in a directory including all sub-directories. We have most of the components we need, and if we can only avoid the most obvious pitfalls we should be done in ten minutes. Create a new file total.c and start coding.

4.1 a recursive solution

Since we know that the directory tree is not very deep, we will implement a recursive procedure. The procedure count() will open a directory, count the number of files and recursively count the number of files in its sub-directories. We do a quick and dirty implementation where we assume that no directory path is longer than 1024 characters; a proper solution would allocate the required amount of memory on the heap.

    unsigned long count(char *path) {

      unsigned long total = 0;

      DIR *dirp = opendir(path);

      char subdir[1024];

      struct dirent *entry;

      struct stat file_st;

      while ((entry = readdir(dirp)) != NULL) {
        switch (entry->d_type) {

        case DT_DIR:  // This is a directory.
          snprintf(subdir, sizeof(subdir), "%s/%s", path, entry->d_name);
          total += count(subdir);
          break;

        case DT_REG:  // This is a regular file.
          total++;
          break;

        default:
          break;
        }
      }
      closedir(dirp);
      return total;
    }

We have to think a bit here: every directory contains two very special entries, ".." and ".". If we follow these paths we will for sure wait a very long time before we receive any results. We need to prevent the count() procedure from going into an endless loop, so we insert the following check at the start of the DT_DIR case, before the recursive call. The break leaves the switch and we move on to the next entry.

    if ((strcmp(entry->d_name, ".") == 0) || (strcmp(entry->d_name, "..") == 0)) {
      break;
    }

A small main() procedure and we're done. If you have added all the right include directives you should be able to compile and run your file counter.

    int main(int argc, char *argv[]) {

      if (argc < 2) {
        fprintf(stderr, "usage: total <dir>\n");
        return 1;
      }
      char *path = argv[1];

      unsigned long total = count(path);

      printf("The directory %s contains %lu files\n", path, total);
      return 0;
    }

4.2 exceeding your rights

If you try to run your program on, for example, /etc you will probably get a segmentation fault. Instead of fixing this directly (I know the reason) we can use gdb to find out what happened.

    > ./total /etc
    segmentation fault (core dumped)

We first compile the program with the -g flag set. This will produce a binary with some additional debug information that will allow us to use gdb.

    > gcc -g -o total total.c

We then start gdb, giving the program as an argument.

    > gdb total

We now have a (gdb) prompt and can start using the debugger. In this small example we will only run the program, let it crash and then try to figure out what happened. You start the program with the run command, giving /etc as an argument. It will look something like the following.

    (gdb) run /etc
    Starting program: .../src/total /etc

    Program received signal SIGSEGV, Segmentation fault.
    0x00007ffff7ad58a2 in readdir (dirp=0x0) at ../sysdeps/posix/readdir.c:44
    44      ../sysdeps/posix/readdir.c: No such file or directory.
    (gdb)

This tells us that the segmentation fault occurred in the library procedure readdir(), where dirp was a null pointer. To see where in our code this happened we move one step up in the call stack.

    (gdb) up
    #1  0x000000000040084e in count (path=0x7fffffffd9f0 "/etc/...") at total.c:66
    66        while((entry = readdir(dirp)) != NULL) {
    (gdb)

Now we see where we are in our code: the while statement in the count() procedure. We can confirm that dirp is actually a null pointer by printing its value.

    (gdb) p dirp
    $1 = (DIR *) 0x0

The question is why. We received the pointer from the opendir() procedure, so the question is what value path had when we called it. Printing the value of path will (in my case) reveal that something strange happened when we tried to open "/etc/polkit-1/localauthority".

    (gdb) p path
    $2 = 0x7fffffffd9f0 "/etc/polkit-1/localauthority"
    (gdb)

If you look at the directory that caused your problem (if there was one) you might find the reason for the failure.

    > ls -ld /etc/polkit-1/localauthority
    drwx------ 7 root root 4096 2016-04-21 00:11 /etc/polkit-1/localauthority

Hmmm, owned by root with the access rights "rwx------". No wonder a regular user could not read that directory. Let's fix our code so that we take care of the case where we, for some reason, are not able to open the directory. The check goes right after the call to opendir().

    if (dirp == NULL) {
      printf("not able to open %s\n", path);
      return 0;
    }

Ok, give it a try.

4.3 double counting

What we're counting is the number of links to files; if a file object is linked to by several links, this object will be counted several times. We could of course keep track of which files we have seen and make sure that we only count them once. If you tried to implement this, how would you identify unique files? Would the inode number be enough?

5 How large are files?

If we forget about the problem with double counting, we can implement a program that generates some statistics on file sizes. We start with some assumptions and a global data structure where we will store the result. Create a file called freq.c, or rather take a copy of total.c since we're going to reuse most of the code.

A frequency table will keep track of how many files we have of certain sizes. We're not particularly interested in the exact distribution, so we only keep track of the size in steps of powers of two. We do not count files of size 0 and only keep track of files up to the size 2^FREQ_MAX. The last entry in the table will contain all larger files, and if the machine that you're running on is not also the machine that keeps all your movies, I guess 2^32 will be fine for our purposes.

    #define FREQ_MAX 32

    unsigned long freq[FREQ_MAX];

    void add_to_freq(unsigned long size) {
      if (size != 0) {
        int j = 0;
        unsigned long n = 2;   // unsigned long so that n does not overflow
        // stop at the last entry, which collects all larger files
        while (size / n != 0 && j < FREQ_MAX - 1) {
          n = 2 * n;
          j++;
        }
        freq[j]++;
      }
    }

We now change the procedure count() so that it calls add_to_freq() with the size of each file that it sees. We don't have to return anything, so there are only small changes. We check that fstatat() succeeds before using file_st; the call can fail, for example because of missing permissions.

    void count(char *path) {
      // ... open the directory and loop over the entries as in total.c ...

        case DT_REG:  // This is a regular file.
          if (fstatat(dirfd(dirp), entry->d_name, &file_st, 0) == 0) {
            add_to_freq(file_st.st_size);
          }
          break;

      // ...
    }

Almost done; some small changes to the main() procedure and we're finished.

    printf("#The directory %s: number of files smaller than 2^k:\n", path);
    printf("#k\tnumber\n");

    for (int j = 0; j < FREQ_MAX; j++) {
      printf("%d\t%lu\n", (j + 1), freq[j]);
    }

Hmm, should work. Try it on a smaller directory first and then gather some statistics on a larger directory (try /usr). If everything works we should have a nice printout of a table, and we can of course not resist exploring this data using gnuplot.

6 Some nice graphs

Start by saving the histogram in a file called freq.dat and then you can generate your first graph in one line of code.

    > ./freq /usr > freq.dat
    > gnuplot
    gnuplot> plot "freq.dat" using 1:2 with boxes

If your graph looks anything like mine, you see that most files are between 1K and 2K bytes (in the 2^11 bucket). There are plenty of smaller files but few that are smaller than 32 bytes; files above one megabyte are also rare. It looks like half of the files are between 512 and 4K bytes.

An alternative way of presenting these numbers is as a cumulative frequency diagram. Try the following:

    gnuplot> a=0
    gnuplot> cumulative_freq(x)=(a=a+x,a)
    gnuplot> plot "freq.dat" u 1:(cumulative_freq($2)) w linespoints

This diagram adds the frequencies as we go and shows how many files are smaller than a certain size. If we know that there were 280000 files in the /usr directory, we could get a nice y-axis that gives us the fraction of all files. The function accumulates into a, so reset it before plotting again.

    gnuplot> a=0
    gnuplot> plot "freq.dat" u 1:(cumulative_freq($2)/280000) w linespoints

You could of course wonder how much of the hard drive is taken up by files of what size, and we can give a hint of this by multiplying the frequency with the average size of the category. Bucket k covers sizes from 2^(k-1) up to 2^k, so its midpoint is 2^k - 2^(k-1)/2. If we adapt the function to take the size into account we have the following:

    gnuplot> a=0
    gnuplot> cumulative_size(k, x)=(a=a+(x*((2**k)-((2**(k-1)))/2)),a)
    gnuplot> plot "freq.dat" u 1:(cumulative_size($1,$2)) w linespoints

The graphs that we have generated will give you a quick overview of what the file system looks like. If you were to use them in a presentation you would of course do some more work to fix the scales, titles etc. The graph should contain all the information needed to interpret it; the reader should not have to guess what the x-axis really means.

7 Summary

So with some simple coding I hope that you have learned some more about the file system, and even if it was nothing completely new, at least it gave you some first-hand experience of what it looks like. It's one thing to look at the PowerPoint slides, another to actually do it.