Chapter 8 More Data Structures We have seen the list data structure and its uses. We will now examine two, more advanced data structures: the set and the dictionary. In particular, the dictionary is an important, very useful part of Python as well as generally useful to solve many problems. What is a Dictionary? In data structure terms, a dictionary is better termed an associative array or associative list or a map. You can think if it as a list of pairs, where the first element of the pair, the key, is used to retrieve the second element, the value. Thus we map a key to a value. Key Value Pairs The key acts as a lookup to find the associated value. Just like a dictionary, you look up a word by its spelling to find the associated definition. A dictionary can be searched to locate the value associated with a key. Python Dictionary Use the { } marker to create a dictionary Use the : marker to indicate key:value pairs: contacts= { bill : 353-1234, rich : 269-1234, jane : 352-1234 } print contacts { jane : 352-1234, bill : 353-1234, rich : 269-1234 } Keys and Values Key must be immutable: strings, integers, tuples are fine lists are NOT Value can be anything. 1
Collection but not a Sequence Dictionaries are collections, but they are not sequences like lists, strings or tuples: there is no order to the elements of a dictionary in fact, the order (for example, when printed) might change as elements are added or deleted. So how to access dictionary elements? Access Dictionary Elements Access requires [ ], but the key is the index! mydict={} an empty dictionary mydict[ bill ]=25 added the pair bill :25 print mydict[ bill ] prints 25 Dictionaries are Mutable Like lists, dictionaries are a mutable data structure: you can change the object via various operations, such as index assignment mydict = { bill :3, rich :10} print mydict[ bill ] # prints 2 mydict[ bill ] = 100 print mydict[ bill ] # prints 100 Again, Common Operators Like others, dictionaries respond to these: len(mydict) number of key:value pairs in the dictionary element in mydict boolean, is element a key in the dictionary for key in mydict: iterates through the keys of a dictionary Lots of Methods mydict.items() all the key/value pairs mydict.keys() all the keys mydict.values() all the values key in mydict does the key exist in the dictionary mydict.clear() empty the dictionary mydict.update(yourdict) for each key in yourdict, updates mydict with that key/value pair 2
Dictionaries are Iterable for key in mydict: print key prints all the keys for key,value in mydict.items(): print key,value prints all the key/value pairs for value in mydict.values(): print value prints all the values Building Dictionaries Faster zip creates pairs from two parallel lists: zip( abc,[1,2,3]) yields [( a,1),( b,2),( c,3)] That s good for building dictionaries. We call the dict function which takes a list of pairs to make a dictionary: dict(zip( abc,[1,2,3])) yields {'a': 1, 'c': 3, 'b': 2} Word Frequency Analysis of Gettysburg Address def addword(w, wcdict): '''Update the word frequency: word is the key, frequency is the value.''' if w in wcdict: wcdict[w] += 1 else: wcdict[w] = 1 import string def processline(line, wcdict): '''Process the line to get lowercase words to add to the dictionary.''' line = line.strip() wordlist = line.split() for word in wordlist: # ignore the '--' that is in the file if word!= '--': word = word.lower() word = word.strip() # get commas, periods and other punctuation out as well word = word.strip(string.punctuation) addword(word, wcdict) def prettyprint(wcdict): '''Print nicely from highest to lowest frequency.''' # create a list of tuples, (value, key) # valkeylist = [(val,key) for key,val in d.items()] valkeylist=[] for key,val in wcdict.items(): valkeylist.append((val,key)) # sort method sorts on list's first element, here the frequency. # Reverse to get biggest first valkeylist.sort(reverse=true) print '%-10s%10s'%('Word', 'Count') print '_'*21 for val,key in valkeylist: print '%-12s %3d'%(key,val) 3
sort(...) L.sort(cmp=None, key=none, reverse=false) -- stable sort *IN PLACE*; cmp(x, y) -> -1, 0, 1 def main (): wcdict={} fobj = open('gettysburg.txt','r') for line in fobj: processline(line, wcdict) print 'Length of the dictionary:',len(wcdict) prettyprint(wcdict) Passing Mutables Because we are passing a mutable data structure, such as a dictionary, we do not have to return the dictionary when the function ends. If all we do is update the dictionary (change the object) then the argument will be associated with the changed object. The only built-in sort method works on lists, so if we sort we must sort a list. For complex elements (like a tuple), the sort compares the first element of each complex element: (1, 3) < (2, 1) # True (3,0) < (1,2,3) # False Sets, as in Mathematical Sets In mathematics, a set is a collection of objects, potentially of many different types. In a set, no two elements are identical. That is, a set consists of elements each of which is unique compared to the other elements. There is no order to the elements of a set A set with no elements is the empty set Creating a Set myset = set( abcd ) The set keyword creates a set. The single argument that follows must be iterable, that is, something that can be walked through one item at a time with a for. The result is a set data structure: print myset set(['a', 'c', 'b', 'd']) Diverse Elements A set can consist of a mixture of different types of elements: myset = set([ a,1,3.14159,true]) As long as the single argument can be iterated through, you can make a set of it. Duplicates are automatically removed. myset = set( aabbccdd ) print myset set(['a', 'c', 'b', 'd']) 4
Common Operators Most data structures respond to these: len(myset) the number of elements in a set element in myset boolean indicating whether element is in the set for element in myset: iterate through the elements in myset Set Operators The set data structure provides some special operators that correspond to the operators you learned in middle school. These are various combinations of set contents. Unlike most Python names, these are full names (this changes in Python 3.x). Set Ops, Intersection myset=set( abcd ); newset=set( cdef ) myset.intersection(newset) # returns set([ c, d ]) Set Ops, Difference myset=set( abcd ); newset=set( cdef ) myset.difference(newset) # returns set([ a, b ]) Set Ops, Union myset=set( abcd ); newset=set( cdef ) myset.union(newset) # returns set([ a, b, c, d, e, f ]) Set Ops, symmetricdifference myset=set( abcd ); newset=set( cdef ) myset.symmetric_difference(newset) # returns set([ a, b, e, f ]) Set Ops, super and sub set myset=set( abc ); newset=set( abcdef ) myset.issubset(newset) # returns True newset.issuperset(myset) # returns True Other Set Ops myset.add( g ) Adds to the set, no effect if item is in set already. myset.clear() Empties the set. myset.remove( g ) versus myset.discard( g ) remove throws an error if g isn t there. discard doesn t care. Both remove g from the set. myset.copy() - Returns a shallow copy of myset. 5
Common/Unique Words def addword(w, theset): '''A the word to the set. if len(w) > 3: theset.add(w) No word smaller than length 3.''' def processline(line, theset): '''Process the line to get lowercase words to be added to the set.''' line = line.strip() wordlist = line.split() for word in wordlist: # ignore the '--' thatt is in the file if word!= '--': word = word.strip( () # get commas, periods and other punctuation out as well word = word.strip( (string.punctuation) word = word.lower( () addword(word, theset) def main (): '''Compare the Gettysburg Address and the Declaration of Independence.''' GASet = set() DoISet = set() GAFileObj = open('gettysburg.txt') DoIFileObj = open('declofind.txt') for line in GAFileObj: processline(line, GASet) for line in DoIFileObj: processline(line,doiset) prettyprint(gaset, DoISet) 6
def prettyprint(gaset, doiset): # print some stats about the two sets print 'Count of unique words of length 4 or greater' print 'Gettysburg Addr: %d, Decl of Ind: %d\n'%(len(gaset),len(doiset)) print '%15s %15s'%('Operation', 'Count') print '-'*35 print '%15s %15d'%('Union', len(gaset.union(doiset))) print '%15s %15d'%('Intersection', len(gaset.intersection(doiset))) print '%15s %15d'%('Sym Diff', len(gaset.symmetric_difference(doiset))) print '%15s %15d'%('GA-DoI', len(gaset.difference(doiset))) print '%15s %15d'%('DoI-GA', len(doiset.difference(gaset))) # list the intersection words, 5 to a line, alphabetical order intersectionset = gaset.intersection(doiset) wordlist = list(intersectionset) wordlist.sort() print '\n Common words to both' print '-'*20 cnt = 0 for w in wordlist: if cnt % 5 == 0: print print '%13s'% (w), cnt += 1 Common words in Gettysburg Address and Declaration of Independence Can reuse or only slightly modify much of the code for document frequency. The overall outline remains much the same. For clarity, we will ignore any word that has three characters or less (typically stop words). More on Scope What is a Namespace? We ve had this discussion, but let s review: A namespace is an association of a name and a value. It looks like a dictionary, and for the most part it is (at least for modules and classes). Scope What namespace you might be using is part of identifying the scope of the variables and function you are using. By scope, we mean the context, the part of the code, where we can make a reference to a variable or function. Multiple Scopes Often, there can be multiple scopes that are candidates for determining a reference. Knowing which one is the right one (or more importantly, knowing the order of scope) is important. Two Kinds of Namespaces Unqualified namespaces: this is what we have pretty much seen so far - functions, assignments, etc. Qualified namespaces: these are modules and classes (we ll talk more about this one later in the Classes section). Unqualified This is the standard assignment and def we have seen so far. 7
Determining the scope of a reference identifies what its true value is. Unqualified Follow the LEGB Rule Local - inside the function in which it was defined. If not there, enclosing/encompassing. Is it defined in an enclosing function? If not there, is it defined in the global namespace? Finally, check the built-in, defined as part of the special builtin scope. Else, ERROR. Local def myfn (myparam = 25): print myparam myfn(100) # prints 100 myparam # ERROR Function Local Values If a reference is assigned in a function, then that reference is only available within that function. If a reference with the same name is provided outside the function, the reference is reassigned. Local and Global myvar = 123 # global def myfn(myvar=456): # parameter is local yourvar = -10 # assignment local print myvar, yourvar myfn() # prints 456-10 myfn(500) # prints 500-10 myvar # prints 123 yourvar # ERROR Builtin This is just the standard library of Python. To see what is there, look at import builtin dir( builtin ) ['ArithmeticError', 'AssertionError', 'AttributeError', 'BaseException', 'BufferError', 'BytesWarning', 'DeprecationWarning', 'EOFError', 'Ellipsis', 'EnvironmentError', 'Exception', 'False', 'FloatingPointError', 'FutureWarning', 'GeneratorExit', 'IOError', 'ImportError', 'ImportWarning', 'IndentationError', 'IndexError', 'KeyError', 'KeyboardInterrupt', 'LookupError', 'MemoryError', 'NameError', 'None', 'NotImplemented', 'NotImplementedError', 'OSError', 'OverflowError', 'PendingDeprecationWarning', 'ReferenceError', 'RuntimeError', 'RuntimeWarning', 'StandardError', 'StopIteration', 'SyntaxError', 'SyntaxWarning', 'SystemError', 'SystemExit', 'TabError', 'True', 'TypeError', 'UnboundLocalError', 'UnicodeDecodeError', 'UnicodeEncodeError', 'UnicodeError', 'UnicodeTranslateError', 'UnicodeWarning', 'UserWarning', 'ValueError', 'Warning', 'ZeroDivisionError', '_', ' debug ', ' doc ', ' import ', ' name ', ' package ', 'abs', 'all', 'any', 'apply', 'basestring', 'bin', 'bool', 'buffer', 'bytearray', 'bytes', 'callable', 'chr', 'classmethod', 'cmp', 'coerce', 'compile', 'complex', 'copyright', 'credits', 'delattr', 'dict', 'dir', 'divmod', 'enumerate', 'eval', 'execfile', 'exit', 'file', 'filter', 'float', 'format', 'frozenset', 'getattr', 'globals', 'hasattr', 'hash', 'help', 'hex', 'id', 'input', 'int', 'intern', 'isinstance', 'issubclass', 'iter', 'len', 'license', 'list', 'locals', 'long', 'map', 'max', 'min', 'next', 'object', 'oct', 'open', 'ord', 'pow', 'print', 'property', 'quit', 'range', 'raw_input', 'reduce', 'reload', 'repr', 'reversed', 'round', 'set', 'setattr', 'slice', 'sorted', 'staticmethod', 'str', 'sum', 'super', 'tuple', 'type', 'unichr', 'unicode', 'vars', 'xrange', 'zip'] 8
The Global Statement You can cheat by using the global keyword in a function: myvar = 100 def myfn(): global myvar myvar = -1 myfn() print myvar # prints -1 Nested Functions Look up through the nested functions to find a binding: def afunc(): x = 5 def bfunc(): # x = 10 print x bfunc() afunc() # prints 5 bfunc() ERROR, bfunc not defined 9