LING115 Lecture Note
Session #5: Files, Functions and Modules

1. Introduction

A corpus comes packaged as a set of files, so we must know how to read data from a file into our program. It is also convenient to save the output of our analysis of the corpus to a file. Let's learn how to work with files in section 2 and how to write our programs more efficiently using functions and modules in sections 3 and 4.

2. Files

We sometimes want to read data from a file or write the output of our program to a file. This is done with the file data-type. We create a file object by opening a file with a particular name, using the open function. We specify the name of the file and whether we want to read data from the file or write data to it. For example, the following opens a file named foo.txt under /home/ling115/python_examples/ so we can read its data:

    f=open("/home/ling115/python_examples/foo.txt", "r")

By default, a file is opened for reading, so the above is equivalent to the following:

    f=open("/home/ling115/python_examples/foo.txt")

The following creates a file called blah.txt under /home/hahnkoo/ so we can write data to it:

    f=open("/home/hahnkoo/blah.txt", "w")

If a file with the same name already exists, opening it with the "w" parameter is equivalent to deleting the old file and starting a new file with the same name. To append new data to the existing data instead, we open the file with the "a" parameter. For example, the following allows us to add more data to foo.txt under /home/hahnkoo/, assuming the file exists:

    f=open("/home/hahnkoo/foo.txt", "a")

Note that I specified the path to the directory in addition to the filename. If the directory path is not specified, Python assumes the directory is the current working directory.

A file object is created as a result of opening a file. In the above examples, the file object is called f.
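To see the difference between the "w", "a" and "r" parameters, the following sketch may help. It works in a temporary directory (scratch_dir and path are names made up for this example) so that no real file is touched. It uses the write and close methods described below, and also read(), which is not covered in these notes and returns the whole file as one string.

```python
import os
import tempfile

# Hypothetical scratch location so the example does not touch real files.
scratch_dir = tempfile.mkdtemp()
path = os.path.join(scratch_dir, "blah.txt")

f = open(path, "w")      # "w" creates the file
f.write("first\n")
f.close()

f = open(path, "a")      # "a" keeps the old data and adds after it
f.write("second\n")
f.close()

f = open(path)           # no parameter: same as open(path, "r")
after_append = f.read()
f.close()

f = open(path, "w")      # "w" on an existing file discards its old data
f.write("third\n")
f.close()

f = open(path, "r")
after_rewrite = f.read()
f.close()

print(repr(after_append))
print(repr(after_rewrite))
```

After the "a" step the file holds both lines; after the second "w" step only the newly written line remains.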
Having an object of a particular data-type means we can use its methods. Below are some useful methods specific to file objects.

close()

Close the file. Unless we need to have multiple files open at the same time, it is recommended that you close the file as soon as you are done with it. For example, after we read lines from foo.txt into a list, we close the file as follows:

    >>> foo=open("/home/ling115/python_examples/foo.txt")
    >>> lines=foo.readlines()
    >>> foo.close()

readline()

Read one line from the file. Each time this method is called, it returns the next line in order, starting from the first line of the file. Try the following, for example.

    >>> foo=open("/home/ling115/python_examples/foo.txt")
    >>> foo.readline()
    >>> foo.readline()
    >>> foo.close()

readlines()

Read all lines from the file into a list.

write(string)

Write string to the file.

    >>> f=open("temp.txt", "w")
    >>> a="a string"
    >>> f.write(a)
    >>> f.close()

writelines(sequence)

Write a sequence of strings to the file. Nothing is added between the strings. For example, the following will create a file called foo.txt which contains one line, "astringwithoutspace".

    >>> f=open("foo.txt", "w")
    >>> a=["a", "string", "without", "space"]
    >>> f.writelines(a)
    >>> f.close()

The following, on the other hand, will create a file called foo2.txt which contains three lines, because each string ends with a newline character.

    >>> f=open("foo2.txt", "w")
    >>> a=["line1\n", "line2\n", "line3\n"]
    >>> f.writelines(a)
    >>> f.close()
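As a rough illustration of how these methods fit together, the sketch below writes three lines with writelines and reads them back with readline and readlines. The temporary directory and the variable names (scratch_dir, path, first, second, rest) are assumptions made for this example only.

```python
import os
import tempfile

# Hypothetical scratch location so the example does not touch real files.
scratch_dir = tempfile.mkdtemp()
path = os.path.join(scratch_dir, "foo2.txt")

# Write three lines; each string carries its own newline character.
f = open(path, "w")
f.writelines(["line1\n", "line2\n", "line3\n"])
f.close()

# Read them back.
f = open(path)
first = f.readline()    # first line of the file
second = f.readline()   # next line in order
rest = f.readlines()    # continues from where readline() stopped
f.close()

print(repr(first))
print(repr(second))
print(rest)
```

Note that each string returned by readline or readlines still ends with its newline character, and that readlines only returns the lines that have not been read yet.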
3. Functions

Roughly speaking, a function is like a program inside a program: it receives input arguments, does something with them, and returns a value. For example, we could have a function called avg which takes a list of numbers as its argument and returns their arithmetic mean. We could use it, for instance, to see how often a word appears in the given corpus on average.

A function must be defined before it can be used.

    def N(A):
        B
        return X

N, A, B, X refer to the name of the function, its arguments, the block of code that defines the function, and the value that the function returns, respectively. For example, the avg function would be defined as follows:

    def avg(numbers):
        # start from 0.0 so that the division below is not integer division
        total=0.0
        for number in numbers:
            total=total+number
        n=len(numbers)
        return total/n

A function returns a value, so it can be used in expressions. For example, we can subtract the average from a value to calculate its deviation as follows:

    values=[1,2,3,4]
    deviation = values[2] - avg(values)

As you can see from the above example, we use a function by calling it: specify the name of the function, followed by its arguments in parentheses.

4. Modules

A module is a file that contains function definitions, which lets us use functions defined in another program file. For example, we define the avg function only once and then use it in any program where we want to calculate the average of a list.

In order to call a function defined in a module, we must first import the module. That is, add the following line:

    import <module>

We have already seen an example when we imported the sys module to process standard input.

    import sys

Once we import the module, we call its function in the following format:

    <module>.<function>

For example, we can use the log function defined in the math module as follows:

    import math
    math.log(100)

There are modules like sys which are built-in. See the list at http://docs.python.org/modindex.html for more. For these, we just need to enter import <module> as above. However, to use a function defined in a file we created, we must do the following:

1) Tell Python the directory that contains the file. This is done by first importing sys and then entering sys.path.append(<directory-path>).
2) Enter import <module>, where <module> is the name of your file without the .py extension.

For example, suppose we defined the avg function in a file named ling115_stat.py under /home/ling115/python_examples/. To import the module,

    import sys
    sys.path.append("/home/ling115/python_examples/")
    import ling115_stat

5. Exercise 1

Suppose we wanted to count the number of words in each file under a specified directory.

- Get a list of files under the specified directory.
- For each file in the list, do the following:
    o Count the number of words in the file.

Let's first define how to count the number of words in each file.

- Define a counting variable named word_count. Initialize it to zero.
- Open the file.
- Read the lines in the file into a list.
- For each line in the list, do the following:
    o Remove leading or trailing whitespace, such as the newline character.
    o Split the line by white space and store it in a list of words.
    o Increase word_count by the number of words in the list.
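To make the three per-line steps concrete, here is a small sketch that applies them to a single made-up line rather than to a file. The sample string and the variable name stripped are assumptions for illustration.

```python
# A made-up line, shaped like what readlines() returns
# (extra spaces at both ends and a trailing newline).
line = "  the cat sat on the mat \n"

word_count = 0

# Step 1: remove leading and trailing whitespace.
stripped = line.strip()

# Step 2: split by white space into a list of words.
words = stripped.split()

# Step 3: increase the counter by the number of words.
word_count = word_count + len(words)

print(words)
print(word_count)
```

Calling split() with no argument splits on any run of white space, so repeated spaces between words do not produce empty strings in the list.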
We can define a function that captures the counting process. Let's call it count_words.

    def count_words(filename):
        f=open(filename)
        lines=f.readlines()
        f.close()
        word_count=0
        for line in lines:
            words=line.strip().split()
            word_count=word_count+len(words)
        return word_count

With the definition above, we can write the program as follows:

    import sys
    import os
    directory_name=sys.argv[1]
    file_list=os.listdir(directory_name)
    for filename in file_list:
        # join the directory and the file name into one path
        count=count_words(os.path.join(directory_name, filename))
        print(filename+"\t"+str(count))

In the program above, we import two built-in Python modules: sys and os. The sys module is imported to process command-line arguments, as can be seen in directory_name=sys.argv[1]. The os module is imported to list the files in the specified directory: os.listdir(directory_name). Note the use of the count_words function in count=count_words(os.path.join(directory_name, filename)). We build the path with os.path.join rather than directory_name+filename so that the program also works when the directory name is given without a trailing slash. The program assumes that the directory contains only regular files, since os.listdir also lists subdirectories and those cannot be opened with open.

6. Exercise 2

Now suppose we wanted to calculate the average number of words per file using the avg function defined in /home/ling115/python_examples/ling115_stat.py. Instead of printing the number of words in each file, we add the word count to a list while we're in the for-loop and calculate the mean of the list afterwards. In addition to the definition of count_words given in the previous section, our code should include the following:

    import sys
    import os
    sys.path.append("/home/ling115/python_examples/")
    import ling115_stat
    directory_name=sys.argv[1]
    file_list=os.listdir(directory_name)
    count_list=[]
    for filename in file_list:
        count=count_words(os.path.join(directory_name, filename))
        count_list.append(count)
    print(ling115_stat.avg(count_list))

Note that the path to the directory containing ling115_stat.py must first be added to sys.path in order for Python to import ling115_stat into our program.
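If you do not have access to /home/ling115/python_examples/, the following sketch shows the same import mechanism end to end. It is only an illustration: it creates a stand-in module called demo_stat.py (a made-up name, not the course file) in a temporary directory, adds that directory to sys.path, and then imports it. The word counts in count_list are invented.

```python
import os
import sys
import tempfile

# Hypothetical module directory and file name, used only for this demonstration.
module_dir = tempfile.mkdtemp()
module_path = os.path.join(module_dir, "demo_stat.py")

# Write a small module file that defines an avg function.
f = open(module_path, "w")
f.write("def avg(numbers):\n")
f.write("    total = 0.0\n")
f.write("    for number in numbers:\n")
f.write("        total = total + number\n")
f.write("    return total / len(numbers)\n")
f.close()

# Step 1: tell Python which directory contains the file.
sys.path.append(module_dir)

# Step 2: import the module by its file name without .py.
import demo_stat

# Invented word counts standing in for the output of count_words.
count_list = [120, 80, 100]
print(demo_stat.avg(count_list))
```

If the sys.path.append line is removed, the import statement fails because Python does not know where to find demo_stat.py.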