Chapter 10: Strings and Hashtables

This chapter describes the string and hashtable data types in detail. Strings hold text-- words and phrases-- and are used in all applications with natural language processing. Hashtables are like lists, but they allow for non-integer indexes-- you can find information using a key word instead of by number.

Strings

In computer science, the term character refers to a symbol that appears on the keyboard-- a letter, digit, or punctuation mark. The term string refers to a sequence of characters.

In most programming languages, including Python, string literals are specified within quotation marks. 'dog', 'honey I ate the kids', and '327' are examples of string literals. Quotation marks are used to distinguish such string literals from variables, as variable names are also sequences of characters. In the assignment statement:

    animal = 'dog'

animal is a variable and its value is the string 'dog'. If we were tracing the program, we'd write:

    animal    'dog'

Strings like '327' can be confusing to beginning programmers. Consider the following code:

    number = '327'
    doublenumber = 2*number
    print doublenumber

What do you think will be printed out? If you said 654, you'd be wrong. The correct answer is 327327. The reason is that, to the computer, '327' is not a number, it's just a sequence of characters, a '3' followed by a '2' and then a '7'. '327' is a string literal, so when you assign it to the variable number, the variable number is marked as type string. When you multiply any variable of type string, you just repeat the characters of the string n times,
where n is the multiplier. Thus:

    2*number = 2*'327' = '327327'

If you left off the quotes, you'd get your 654:

    number = 327
    doublenumber = 2*number
    print doublenumber

Python does provide the int() function for converting a string like '327' into its integer equivalent, so you can write:

    number = '327'
    doublenumber = 2*int(number)
    print doublenumber

and get 654. Conversely, you can also convert an integer into a string using the str() function.

ASCII Table

A string is a sequence of characters-- symbols on the keyboard. But how does a computer store each of the characters of a string? For the string 'aaa', does it store three images that look like the letter 'a'?

The answer is no. Instead, there is a mapping table, the ASCII table, that maps a number to each symbol. So the letter 'a' is represented by the number 97, 'b' is 98, and so on. The digit '0' is represented by the number 48, '1' is 49, etc.

At the lowest level, in languages such as C, the string 'aaa' is actually stored as four numbers:

    97 97 97 0

As the example shows, a 0 is stored to denote the end of the string. We call it the end-of-string character. Python hides this detail from you: len('aaa') is 3. The string 'cat' is:

    99 97 116 0

Early computer systems stored the number representation for each character in 8 bits, but that only allowed for 2^8 = 256 possible characters. This worked fine for the English language and Arabic character set, but was insufficient for internationalization and character sets like those used in Japan and China. Many computer systems now store each character with a 16 bit
number in a table known as Unicode. (Unicode has since grown beyond what 16 bits can hold, so encodings such as UTF-8 use a varying number of bytes per character.)

In most programs, the fact that the letter 'a' is really stored as 97 is inconsequential. However, there are times when the ASCII mapping table is needed, such as when converting a string representation of a number to the number itself. Python provides two functions, ord(c) and chr(n), that give programmers access to the ASCII mapping table. ord returns the ASCII number of a character, while chr returns the character for a given number. So ord('a') is 97, and chr(97) is 'a'.

Concatenation

The familiar use of the plus sign (+) is to add numbers-- everyone understands 1+1. In Python, you can also apply the plus sign to strings. Adding two strings is called concatenation and results in the two strings being joined together, e.g., the result of the expression:

    "abc"+"def"

is a new string, "abcdef".

Let's consider an example: say we have a list holding a player's golf scores, and we want to display the scores in the form hole:score, hole:score, etc. So if the golfer scored a 3 on hole 1 and a 5 on hole 2, our code would build the string:

    hole 1:3, hole 2:5

Here's the code:

    golfscorelist = [3,5]   # could be any list of numbers
    allscores = ""          # begin with the empty string
    i = 0
    while i<len(golfscorelist):
        score = "hole " + str(i+1) + ":" + str(golfscorelist[i])   # build element
        allscores = allscores+score                                # concat to allscores
        i = i+1
        if i<len(golfscorelist):
            allscores = allscores+", "
    print allscores

We start by initializing the string allscores to the empty string. This is the string equivalent of the list initialization list=[]. Just as we use append to build lists, we'll use concatenation to incrementally build allscores into our final result.

We iterate through the list, as usual, with a while loop. On each iteration, we build a string for that score (e.g., 'hole 1:3'). The line:

    score = "hole " + str(i+1) + ":" + str(golfscorelist[i])

does this by concatenating the string "hole " with the hole number (i+1), a colon, and the score (golfscorelist[i]). Because (i+1) and golfscorelist[i] are both integers, we use Python's str function to convert them into strings.

On the second line within the while loop:

    allscores = allscores+score   # concat to allscores

we append the current score string to the allscores string using concatenation.

After incrementing our hole counter i, we check whether any scores remain:

    i<len(golfscorelist)

If so, we know another score is coming, so we add a comma and a space to allscores.

Iterating Through a String

To Python, a string is a different type than a list, but in essence a string is a list of characters. And in fact, in Python you can loop through a string like you do a list. Take for instance this code to count the occurrences of the letter 'a' in a string variable word:

    word = "aabbbaaabb"   # could be any string
    i = 0
    count = 0
    while i<len(word):
        # do something with each character
        if word[i]=='a':
            count = count+1
        i = i+1

Note the use of the function len: you can use it to find the number of characters in a string, just as you can use it to find out how many elements are in a list.

Note also the use of the index operator in the statement:

    if word[i]=='a':
Just as the index operator gets an element of a list, it can also get a character in a string. If word was 'cat', then word[0] would be 'c', word[1] would be 'a', and word[2] would be 't'.

One thing you cannot do to a string, which you can do to a list, is modify an element using an index. If word is a string, the following will give you an error:

    word[i]='x'

In this sense, strings are immutable.

You can get a slice of a string. For instance, the following code gets the first two characters of a string:

    astring=astring[0:2]

Slices can be used to get around the 'immutable' nature of strings. Consider this function to replace the ith character of a string:

    def replace(astring,i,replacement):
        return astring[0:i]+replacement+astring[i+1:len(astring)]

The code grabs the slice of the string up to (but not including) index i, then concatenates the replacement character, then appends the last part of the original string (from i+1 to the length of the string). For the following code:

    word = 'cat'
    word2 = replace(word,1,'x')
    print word2

cxt would be printed. Note that the variable word is not modified-- it still holds the value 'cat', and the variable word2 holds 'cxt'. We could have changed the variable word by writing:

    word = replace(word,1,'x')
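
The ideas in this section (looping over characters, immutability, and rebuilding a string from slices) can be tried out together in the short sketch below. The function names replace_char and count_char are made up for this illustration, and the code uses print(...) with parentheses so that it also runs on current Python versions; treat it as one possible way to write these helpers rather than the only one.

```python
def replace_char(text, i, replacement):
    """Return a copy of text with the character at index i replaced."""
    # text[:i] is everything before index i; text[i + 1:] is everything after it.
    return text[:i] + replacement + text[i + 1:]


def count_char(text, target):
    """Count how many times the single character target appears in text."""
    count = 0
    for ch in text:  # a for loop visits each character in turn
        if ch == target:
            count = count + 1
    return count


word = "cat"
try:
    word[1] = "x"              # strings are immutable, so this raises TypeError
    assignment_worked = True
except TypeError:
    assignment_worked = False

changed = replace_char(word, 1, "x")
print(changed)                          # cxt
print(word)                             # cat (the original is untouched)
print(count_char("aabbbaaabb", "a"))    # 5
```

Because replace_char returns a new string instead of changing its argument, the caller decides whether to keep the old value, the new one, or both.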
Python String Functions

Python has many built-in functions for manipulating strings, all listed at the Python web site: http://www.python.org/doc/2.5.2/lib/string-methods.html. The functions are object-oriented, meaning they are called in the form:

    somestring.function(p1,p2)

where somestring is any string and p1 and p2 are the required parameters for the function. This object-oriented way of calling a function is different from previous functions we've looked at in that the string being manipulated is considered the "object" of the function and is found to the left of a dot and the function name. We'll discuss object-oriented programming in depth in chapter 11.

The string library functions perform just about any operation one could think of. One example is the split function, which splits a string into a list of subparts based on a given delimiter. For example, consider a bookmarking site that allows a user to tag articles with keywords. Some such sites allow the user to enter the tags separated by commas. So the user might tag an article about Babe Ruth with "baseball,drinkers", meaning that the article should be categorized under "baseball" and "drinkers". The split function could be used to separate the tags:

    userinput="baseball,drinkers"
    taglist= userinput.split(",")

After the call to split, taglist[0] would be "baseball", and taglist[1] would be "drinkers".

Other string functions include upper and lower, which return the upper-case and lower-case versions of a string; find, which returns the index of the first occurrence of a substring; and startswith, which returns True if a string begins with a particular substring.

Hashtables

A hash table consists of key-value pairs. It is like a list, but whereas a list is indexed with a number, a hash table is indexed with a key, typically a string. Hashtables are useful when data is best accessed using a keyword. One example is an English-to-Spanish mapping:
    engtospan = {}                  # initialize the hash table
    engtospan['hello']='hola'       # map the key 'hello' to the value 'hola'
    engtospan['goodbye']='adios'    # map the key 'goodbye' to the value 'adios'

For the hashtable engtospan, the keys are English words, and they are mapped to values that are Spanish words. Each entry in the hashtable is a key-value pair:

    key:    hello   goodbye
    value:  hola    adios

In Python, hashtables are called dictionaries.

Hashtable Initialization

Recall that lists can be initialized either with an empty list:

    list=[]

or with some initial data:

    list=[3,5,9]

Hashtables are initialized in a similar fashion, using {} instead of []. You can create an empty hashtable with:

    engtospan={}

or you can create one with initial key-value pairs:

    engtospan={'hello':'hola','goodbye':'adios','beer':'cerveza'}

Note that each entry in a hashtable has two parts, the key and the value, separated by a colon. So the first entry in the code above is 'hello':'hola', with 'hello' being the key, and 'hola' the value. Commas separate the entries of the table.

Hashtable Modification

Recall that new items can be added to a list using the function append. With hashtables, new items are added using an assignment statement and an index:
    engtospan["horse"]="caballo"

The index "horse" is the key, and the value is "caballo". If this line of code followed the three-entry initialization above, the hashtable would then have four entries.

Accessing Hashtable Data

Recall that list data is accessed by indexing into the list with a number. One can print the third item of a list with:

    print list[2]

Hashtables are also accessed by indexing, but instead of accessing the ith item, one accesses data using a key (string) index. So one could print the Spanish word for "goodbye" with:

    print engtospan["goodbye"]

Or one could print the Spanish equivalent of a word input by the user with:

    engword = raw_input('please enter an English word:')
    spanword = engtospan[engword]
    print 'The Spanish equivalent is: ',spanword

Python also provides the keys() and values() functions for iterating through all the entries of a hashtable. For instance, one could print an entire hashtable with the following code:

    for key in engtospan.keys():
        print key+":"+engtospan[key]

Using Hashtables in App Engine

Hashtables have many uses. For instance, you could use a hash table to store data for each user in your system, using the user's id as a key.

Hashtables are also used in web programming. In the App Engine system, which we'll be studying soon, a hashtable is used to build dynamic web pages. Consider the following HTML code:

    <html>
    <body>
    <p> The interest is: {{interest}} </p>
    <p> The principal after one year is: {{principal}} </p>
    </body>
    </html>

The interest and principal variables defined within double-curly brackets are called template variables. The template variables are the part of the web page that shows dynamic information, in this case the results of some banking interest computations performed by the web site server.

With App Engine, the web site server code is written in Python. Suppose that the user has entered the original principal and rate of a bank account into an HTML form on a different page. The App Engine controller on the server would compute the interest earned and the new principal, then stick these results into a hash table called 'template_values':

    # imports needed by this handler
    import os
    from google.appengine.ext import webapp
    from google.appengine.ext.webapp import template

    class ComputeInterestHandler(webapp.RequestHandler):
        def get(self):
            principalstring = self.request.get('principal')
            ratestring = self.request.get('rate')
            principal = int(principalstring)   # convert once so it can be reused below
            interest = principal*int(ratestring)/100
            intereststring = str(interest)
            principalstring = str(principal+interest)
            template_values = {'interest':intereststring,'principal':principalstring}
            # render the page using the template engine
            path = os.path.join(os.path.dirname(__file__),'index.html')
            self.response.out.write(template.render(path,template_values))

The template_values variable is a hash table. Each key represents one of the template variables in the HTML template. Each value represents the data that should replace that template variable when the new dynamic page is sent to the browser. In this case, the server replaces the template variable {{interest}} with the value that was computed for the key 'interest' (the value in the variable intereststring), and the template variable {{principal}} with the value that was computed for the key 'principal'.

Summary

Strings and hashtables are fundamental data types used by programmers, along with lists, integers, floating point numbers, and booleans. In the following chapter, we will discuss classes, which allow programmers to define their own data types.
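
As a recap of the hashtable operations covered in this chapter, the sketch below initializes a dictionary, adds an entry, looks up keys, and walks through all the entries. The helper names translate and format_entries are invented for this illustration. One assumption goes beyond the chapter's examples: indexing a dictionary with a key it does not contain raises an error, so translate checks for the key with the in operator first.

```python
# Sample data following the chapter's English-to-Spanish example.
engtospan = {'hello': 'hola', 'goodbye': 'adios', 'beer': 'cerveza'}
engtospan['horse'] = 'caballo'   # add a fourth entry by assigning to a new key


def translate(table, word):
    """Return the value stored under word, or None if word is not a key."""
    if word in table:            # 'in' tests whether word is one of the keys
        return table[word]
    return None


def format_entries(table):
    """Build one 'key:value' line per entry, sorted by key for a stable order."""
    lines = []
    for key in sorted(table.keys()):
        lines.append(key + ":" + table[key])
    return "\n".join(lines)


print(translate(engtospan, 'goodbye'))   # adios
print(translate(engtospan, 'cat'))       # None
print(format_entries(engtospan))
```

Returning None for a missing word is only one option; a program that talks to a user might instead print a message asking for a different word.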
Problems

1. Write a function that takes a string representation of a positive whole number as a parameter and returns the number as an integer. You can assume that the parameter is a valid number, e.g., "327", and does not have any non-digits, as in "3&27".

2. Write a modified version of problem 1 in which your function:
    - handles negative numbers
    - returns 0 if the given string is not a valid whole number

3. Write a Python program that uses a hash table to map US states to their capitals, e.g., 'California' would be a key, and 'Sacramento' a value. The program should initialize the hash table with three entries, add a fourth entry on the next line, then prompt the user to enter a state. The program should print the capital of the state entered by the user.