Lab 6: Data Types, Mutability, Sorting
Ling 1330/2330: Computational Linguistics
Na-Rae Han
Objectives
    Data types and conversion
    Tuple
    Mutability
    Sorting: additional parameters
    Text processing overview
    Token, type
    Frequency counts
    Homework 2
9/13/2018
Text: what to find?
    It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to heaven, we were all going direct the other way
When we have a text, we typically would like to know:
    How many words are in it? (Token count)
    How many unique words are in it? (Type count)
    What are their frequency counts? (Frequency distribution)
    How many times is word x found?
    Are there a lot of long/short words? Which ones are long?
    Which words are very frequent? How frequent?
Tokenizer function again
A function that tokenizes a sentence:
    def gettokens(sent):
        new = sent.lower()
        for s in ",.!?':;#$":
            new = new.replace(s, " "+s+" ")
        return new.split()

    >>> joke = "Knock knock, who's there?"
    >>> gettokens(joke)
    ['knock', 'knock', ',', 'who', "'", 's', 'there', '?']
    >>> pal = "A man, a plan, a canal: Panama."
    >>> gettokens(pal)
    ['a', 'man', ',', 'a', 'plan', ',', 'a', 'canal', ':', 'panama', '.']
Text-derived data types
    rose (text as a string):
        'Rose is a rose is a rose is a rose.'
    rosetoks (word token list):
        ['rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']
    rosefreq (frequency dict):
        {'a': 3, 'is': 3, '.': 1, 'rose': 4}
    rosetypes (word type list):
        ['.', 'a', 'is', 'rose']
Tokenization first
Tokenization is the very first step.
    'Rose is a rose is a rose is a rose.'
    gettokens ->
        ['rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']
    gettypes ->
        ['.', 'a', 'is', 'rose']
    getfreq ->
        {'a': 3, 'is': 3, '.': 1, 'rose': 4}
Types and the frequency dictionary are built from the token list.
How about:
    Words that are at least 2 characters long?
    Words that occur at least 3 times?
The text processing pipeline
    'Rose is a rose is a rose is a rose.'
    gettokens ->
        ['rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']
    gettypes (from the token list) ->
        ['.', 'a', 'is', 'rose']
    getfreq (from the token list) ->
        {'a': 3, 'is': 3, '.': 1, 'rose': 4}
    getxlengthwords (from the type list, x = 2) ->
        ['is', 'rose']
    getxfreqwords (from the frequency dict, x = 3) ->
        ['a', 'is', 'rose']
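The pipeline above could be sketched as follows. The function names come from the slide, but the bodies are only one possible implementation (an assumption, not the course's reference solution). gettokens is repeated from the earlier slide so that the block runs on its own.

```python
def gettokens(sent):
    """Lowercase, pad punctuation with spaces, split on whitespace."""
    new = sent.lower()
    for s in ",.!?':;#$":
        new = new.replace(s, " " + s + " ")
    return new.split()

def getfreq(toks):
    """Build a frequency dict from a token list."""
    freq = {}
    for t in toks:
        freq[t] = freq.get(t, 0) + 1   # start at 0 for unseen words
    return freq

def gettypes(toks):
    """Return an alphabetically sorted list of unique tokens."""
    return sorted(set(toks))

def getxlengthwords(types, x):
    """Word types that are at least x characters long."""
    return [w for w in types if len(w) >= x]

def getxfreqwords(freq, x):
    """Word types that occur at least x times, alphabetically sorted."""
    return sorted(w for w in freq if freq[w] >= x)

rose = 'Rose is a rose is a rose is a rose.'
rosetoks = gettokens(rose)
rosefreq = getfreq(rosetoks)
rosetypes = gettypes(rosetoks)

print(len(rosetoks), len(rosetypes))   # token count, type count: 11 4
print(getxlengthwords(rosetypes, 2))   # ['is', 'rose']
print(getxfreqwords(rosefreq, 3))      # ['a', 'is', 'rose']
```

Every later step takes the token list (or something built from it) as input, which is why tokenization has to come first.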
Homework 2: Basic text stats
    It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to heaven, we were all going direct the other way
When we have a text, we typically would like to know:
    How many words are in it? (Token count)
    How many unique words are in it? (Type count)
    What are their frequency counts? (Frequency distribution)
    How many times is word x found?
    Are there a lot of long/short words? Which ones are long?
    Which words are very frequent? How frequent?
Python data types
    int: integer
        33
    float: floating point number
        5.49
    str: string (a piece of text)
        'Bart'    'Hello, world!'
    list
        ['cat', 'dog', 'fox', 'hippo']
    tuple
        ('gold', 'silver', 'bronze')
    dict (dictionary): maps a key to a value
        {'Homer':36, 'Marge':36, 'Bart':10, 'Lisa':8, 'Maggie':1}
Data types in Python
type() displays the data type of the object.
    >>> h = 'hello'        # h is str type
    >>> h = list(h)        # h now refers to a list
    >>> h
    ['h', 'e', 'l', 'l', 'o']
    >>> type(h)
    <class 'list'>
Beware of type conflicts!
    >>> w = 'Mary'
    >>> w+3
    Traceback (most recent call last):
      File "<pyshell#68>", line 1, in <module>
        w+3
    TypeError: can only concatenate str (not "int") to str
(The exact wording of the TypeError message varies across Python versions.)
Converting between types
int(): string, floating point -> integer
    >>> int(3.141592)
    3
float(): string, integer -> floating point number
    >>> float('256')
    256.0
str(): integer, float, list, tuple, dictionary -> string
    >>> str(3.141592)
    '3.141592'
    >>> str([1,2,3,4])
    '[1, 2, 3, 4]'
list(): string, tuple, dictionary -> list
    >>> list('mary')
    ['m', 'a', 'r', 'y']
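A few edge cases of these conversion functions are worth trying out. The sketch below uses only built-in functions; the variable names are made up for illustration.

```python
# int() on a float truncates toward zero; it does not round.
print(int(3.99))          # 3
print(int(-3.99))         # -3

# int() accepts a string only if it looks like an integer.
try:
    int('3.14')
except ValueError as err:
    print('ValueError:', err)

# Going through float() first works for decimal strings.
n = int(float('3.14'))
print(n)                  # 3

# list() on a dictionary keeps only the keys.
ages = {'Bart': 10, 'Lisa': 8}
print(list(ages))         # ['Bart', 'Lisa']

# str() and list() are not inverses of each other:
digits = list(str(2018))
print(digits)             # ['2', '0', '1', '8']
```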
Example: tip calculator
Prompts for bill amount, prints out tip and total.
    bill = input("what's your bill amount? ")
    tip = bill * 0.18
    total = bill + tip
    print("your tip amount is $" + tip)
    print("your total is $" + total)
This script does not work. Why?
Example: tip calculator
Prompts for bill amount, prints out tip and total.
    bill = input("what's your bill amount? ")
    bill = float(bill)
    tip = bill * 0.18
    total = bill + tip
    print("your tip amount is $" + str(tip))
    print("your total is $" + str(total))
input() returns a string: turn it into a float.
tip and total are floats: turn them into strings before concatenating.
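The same conversions can be wrapped in a function so they can be tested without typing at a prompt. This is only a sketch: tip_and_total and its rate parameter are hypothetical names, and format(..., ".2f") is used here instead of str() to show two decimal places.

```python
def tip_and_total(bill_text, rate=0.18):
    """Takes the bill as a string (as input() would return it),
    returns a (tip, total) tuple of floats."""
    bill = float(bill_text)      # str -> float
    tip = bill * rate
    return tip, bill + tip

tip, total = tip_and_total("50")

# float -> str before concatenating, rounded to cents for display
print("your tip amount is $" + format(tip, ".2f"))
print("your total is $" + format(total, ".2f"))
```

To make it interactive, pass input("what's your bill amount? ") as the argument.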
Tuples
Enclosed in (). Pretty much like a list, but once declared, it cannot be changed (i.e., it is immutable).
list:
    >>> li = ['spring', 'summer', 'fall', 'winter']
    >>> li[2] = 'autumn'
    >>> li
    ['spring', 'summer', 'autumn', 'winter']
tuple:
    >>> tu = ('spring', 'summer', 'fall', 'winter')
    >>> tu
    ('spring', 'summer', 'fall', 'winter')
    >>> tu[2] = 'autumn'
    Traceback (most recent call last):
      File "<pyshell#37>", line 1, in <module>
        tu[2] = 'autumn'
    TypeError: 'tuple' object does not support item assignment
Tuple: indexing, length, membership
Can be indexed and sliced:
    >>> tu[0]
    'spring'
    >>> tu[-1]
    'winter'
    >>> tu[2:]
    ('fall', 'winter')
len() for length, in for membership test:
    >>> tu
    ('spring', 'summer', 'fall', 'winter')
    >>> len(tu)
    4
    >>> 'fall' in tu
    True
These operations are shared by all sequence types: list, tuple, string.
Conversion
tuple() converts an object into a tuple.
    >>> li = [1,2,3,4]
    >>> tuple(li)
    (1, 2, 3, 4)
    >>> tuple('penguin')
    ('p', 'e', 'n', 'g', 'u', 'i', 'n')
One more data type: set
Set: a collection of distinct objects. It has no order.
    >>> {'a', 'b', 'c'} == {'b', 'c', 'a'}
    True
    >>> len({'b', 'c', 'a', 'c', 'a', 'c', 'a'})
    3
set() converts an object into a set.
    >>> set(['a', 'man', 'a', 'plan', 'a', 'canal', 'a'])
    {'a', 'canal', 'plan', 'man'}
    >>> set('abracadabra')
    {'c', 'b', 'd', 'r', 'a'}
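Because a set keeps only distinct objects, it gives a quick way to go from tokens to types. A minimal sketch, reusing the rose token list from the earlier slides (the variable names are just for illustration):

```python
rosetoks = ['rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']

rosetypes = set(rosetoks)     # duplicates collapse
print(len(rosetoks))          # token count: 11
print(len(rosetypes))         # type count: 4

# A set has no order, so it cannot be indexed...
try:
    rosetypes[0]
except TypeError as err:
    print('TypeError:', err)

# ...but sorted() turns it into an ordered list.
print(sorted(rosetypes))      # ['.', 'a', 'is', 'rose']
```

The order in which a set displays its members can differ from run to run, so use sorted() whenever the order matters.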
String and list behave differently
    >>> thing = 'iphone'
    >>> thing.upper()
    'IPHONE'
    >>> thing
    'iphone'
string: IMMUTABLE. The method creates and returns a new string. The original string is unchanged.
    >>> li = [1, 2, 3, 4]
    >>> li.append(5)
    >>> li
    [1, 2, 3, 4, 5]
list: MUTABLE. Nothing is returned. The original list has changed!
Tuple and list behave differently
    >>> tu = (10, 20, 30, 40)
    >>> tu[2] = 75
    Traceback (most recent call last):
      File "<pyshell#9>", line 1, in <module>
        tu[2] = 75
    TypeError: 'tuple' object does not support item assignment
tuple: IMMUTABLE. Cannot change individual items.
    >>> li = [10, 20, 30, 40]
    >>> li[2] = 75
    >>> li
    [10, 20, 75, 40]
list: MUTABLE. Individual items can be updated.
Immutable data types
    int      33
    float    33.5
    str      'Hello, world!'
    tuple    ('Spring', 'Summer', 'Winter', 'Fall')
In Python, the data types integer, float, string, and tuple are immutable.
Python functions do NOT directly change these data: they are unchangeable! Instead, a new object is created in memory and returned.
A value change only occurs through a new assignment statement.
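The point about assignment can be seen in a short session. This is a small illustration with made-up variable names, using only built-in string behavior:

```python
word = 'Hello, world!'

shout = word.upper()      # a new string object is created and returned
print(word)               # Hello, world!   (original is unchanged)
print(shout)              # HELLO, WORLD!

# The only way to "change" word is to assign a new value to the name.
word = word.upper()
print(word)               # HELLO, WORLD!

# Item assignment is not allowed on a string, just as with a tuple.
try:
    word[0] = 'J'
except TypeError as err:
    print('TypeError:', err)
```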
Mutable data types
    list    ['cat', 'dog', 'fox', 'hippo']
    dict    {'Homer':36, 'Marge':36, 'Bart':10, 'Lisa':8}
List and dictionary are mutable data types.
When a Python method is called on these data, the operation is done in place. The original data object in memory is changed! And nothing gets returned.*
*Unless it also returns a value, e.g., .pop()
List operations
List methods*
    Functions that are defined on the list data type.
    Called on a list object, with this syntax: listobj.method()
    Lists are mutable, which means list methods modify the caller object (list) in place.
*"Methods" are functions that are specific to a data type. They are "called on" a data object and have object.method() syntax.
Sorting and reversing
    >>> li = ['fox', 'dog', 'owl', 'cat']
    >>> li.sort()
    >>> li
    ['cat', 'dog', 'fox', 'owl']
    >>> li.reverse()
    >>> li
    ['owl', 'fox', 'dog', 'cat']
.sort() sorts a list in place.
.reverse() reverses a list in place.
Assignment statement vs. mutability
    >>> li = ['fox', 'dog', 'owl', 'cat']
    >>> li.sort()
    >>> li
    ['cat', 'dog', 'fox', 'owl']
    >>> li.reverse()
    >>> li
    ['owl', 'fox', 'dog', 'cat']
    >>> li2 = li.reverse()
    >>> li2
    >>>
This assignment statement produces an unintended result: li2 has no value!
.reverse() modifies li in place and returns NOTHING. Hence, li2 is assigned None.
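If the goal really is a reversed copy under a new name, there are ways to get one that do return a value. The sketch below shows two of them (slicing and the built-in reversed()); the variable names are made up for illustration.

```python
li = ['fox', 'dog', 'owl', 'cat']

li2 = li.reverse()           # reverses li in place, returns None
print(li2 is None)           # True
print(li)                    # ['cat', 'owl', 'dog', 'fox']

# Two ways to build a NEW reversed list, leaving li alone:
li3 = li[::-1]               # slice with step -1
li4 = list(reversed(li))     # reversed() returns an iterator; list() collects it

print(li3)                   # ['fox', 'dog', 'owl', 'cat']
print(li4)                   # ['fox', 'dog', 'owl', 'cat']
print(li)                    # ['cat', 'owl', 'dog', 'fox']  (unchanged)
```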
Two ways to sort
list.sort()
    >>> li = [2, 3, 1, 4]
    >>> li.sort()
    >>> li
    [1, 2, 3, 4]
sorted(list)
    >>> li = [2, 3, 1, 4]
    >>> sorted(li)
    [1, 2, 3, 4]
What's the difference?
Two ways to sort
list.sort()
    >>> li = [2, 3, 1, 4]
    >>> li.sort()
    >>> li
    [1, 2, 3, 4]
li itself is changed in memory. Returns nothing.
sorted(list)
    >>> li = [2, 3, 1, 4]
    >>> sorted(li)
    [1, 2, 3, 4]
    >>> li
    [2, 3, 1, 4]
Creates and returns a new sorted list. li is not changed.
x.sort() is a list METHOD: defined on list objects only.
sorted(x) is a general-purpose function: works on lists, strings, etc.
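To see what "general-purpose" means in practice, sorted() can be tried on a string and a tuple. Whatever goes in, a list comes out. A short sketch with made-up variable names:

```python
letters = sorted('penguin')                      # string in, list out
print(letters)        # ['e', 'g', 'i', 'n', 'n', 'p', 'u']

medals = sorted(('gold', 'silver', 'bronze'))    # tuple in, list out
print(medals)         # ['bronze', 'gold', 'silver']

# To get a string back, join the sorted characters.
print(''.join(letters))                          # eginnpu

# .sort() exists on lists only.
print(hasattr([1, 2], 'sort'))    # True
print(hasattr((1, 2), 'sort'))    # False
print(hasattr('12', 'sort'))      # False
```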
Why use sorted()
Once you sort a list through .sort(), the original order is lost forever.
Often, you don't want to change the original object. This is when you use sorted().
    >>> li = [2, 3, 1, 4]
    >>> sorted(li)
    [1, 2, 3, 4]
    >>> li
    [2, 3, 1, 4]
    >>> li2 = sorted(li)        # assign a new name to the returned list
    >>> li2
    [1, 2, 3, 4]
    >>> li
    [2, 3, 1, 4]
Practice: common characters
Build this function:
    def in_both(wd1, wd2):
        "Takes two strs, returns a sorted list of common chars"
        common = []                  # initialize a new empty list
        for c in wd1:                # for each character in wd1,
            if c in wd2:             # if it is also in wd2,
                common.append(c)     # add it to the list
        return sorted(common)        # return the sorted list

    >>> in_both('pear', 'apple')
    ['a', 'e', 'p']
Sorting output with sorted()
Build this function:
    def in_both(wd1, wd2):
        "Takes two strs, returns a sorted list of common chars"
        common = []
        for c in wd1:
            if c in wd2:
                common.append(c)
        return sorted(common)        # sort output before returning, using sorted()

    >>> in_both('pear', 'apple')
    ['a', 'e', 'p']
Works! Can we use .sort() instead?
Alternative: using the .sort() method
Build this function:
    def in_both(wd1, wd2):
        "Takes two strs, returns a sorted list of common chars"
        common = []
        for c in wd1:
            if c in wd2:
                common.append(c)
        return common.sort()

    >>> in_both('pear', 'apple')
    >>>
Whoa! What happened?
Return statement vs. .sort()
Build this function:
    def in_both(wd1, wd2):
        "Takes two strs, returns a sorted list of common chars"
        common = []
        for c in wd1:
            if c in wd2:
                common.append(c)
        return common.sort()         # sorts common in place, and returns NOTHING

    >>> in_both('pear', 'apple')
    >>>
Do NOT use an "in-place-operating" method in a return statement.*
*Unless it also returns a value, e.g., .pop()
Sort list, and then return.
Build this function:
    def in_both(wd1, wd2):
        "Takes two strs, returns a sorted list of common chars"
        common = []
        for c in wd1:
            if c in wd2:
                common.append(c)
        common.sort()                # sort first,
        return common                # and then return

    >>> in_both('pear', 'apple')
    ['a', 'e', 'p']
    >>> in_both('elephant', 'penguin')
    ['e', 'e', 'n', 'p']
Duplicates! How to avoid?
Finished
Build this function:
    def in_both(wd1, wd2):
        "Takes two strs, returns a sorted list of common chars"
        common = []
        for c in wd1:
            if c in wd2 and c not in common:    # append only if c is not already in common
                common.append(c)
        common.sort()
        return common

    >>> in_both('pear', 'apple')
    ['a', 'e', 'p']
    >>> in_both('elephant', 'penguin')
    ['e', 'n', 'p']
Success!
*Also: you could use set()
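The set() route mentioned in the footnote could look like the sketch below. in_both_set is a hypothetical name chosen so it does not clash with in_both; the & operator computes the intersection of two sets.

```python
def in_both_set(wd1, wd2):
    "Takes two strs, returns a sorted list of common chars"
    # set() drops duplicate characters; & keeps those found in both sets.
    common = set(wd1) & set(wd2)
    # sorted() accepts a set and returns a new sorted list.
    return sorted(common)

print(in_both_set('pear', 'apple'))          # ['a', 'e', 'p']
print(in_both_set('elephant', 'penguin'))    # ['e', 'n', 'p']
```

Because a set cannot hold duplicates, the "c not in common" check is no longer needed.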
Try it out (2 minutes)
Build this function:
    def in_both(wd1, wd2):
        "Takes two strs, returns a sorted list of common chars"
        common = []
        for c in wd1:
            if c in wd2 and c not in common:
                common.append(c)
        common.sort()
        return common

    >>> in_both('pear', 'apple')
    ['a', 'e', 'p']
    >>> in_both('elephant', 'penguin')
    ['e', 'n', 'p']
More on sorted(): sorting options
    >>> li = ['x', 'ab', 'd', 'cde']
    >>> sorted(li)
    ['ab', 'cde', 'd', 'x']
    >>> sorted(li, reverse=True)
    ['x', 'd', 'cde', 'ab']
Reverse alphabetical order.
    >>> sorted(li, key=len)
    ['x', 'd', 'ab', 'cde']
Use the len() function as key: order by string length.
    >>> sorted(li, key=len, reverse=True)
    ['cde', 'ab', 'x', 'd']
Likewise, but in descending order.
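The key parameter accepts any function that takes one item and returns the value to sort by. Two more keys that are often handy with word lists are sketched below; the pets list and the lambda are illustrative additions, not from the slides.

```python
pets = ['dog', 'Cat', 'ant', 'Bee']

# Default string order puts uppercase letters before lowercase ones.
print(sorted(pets))                    # ['Bee', 'Cat', 'ant', 'dog']

# key=str.lower compares lowercased copies, ignoring case.
print(sorted(pets, key=str.lower))     # ['ant', 'Bee', 'Cat', 'dog']

li = ['x', 'ab', 'd', 'cde']

# A tuple key sorts by length first, then alphabetically within each length.
by_len_then_alpha = sorted(li, key=lambda w: (len(w), w))
print(by_len_then_alpha)               # ['d', 'x', 'ab', 'cde']
```

With key=len alone, 'x' and 'd' tie and simply keep their original relative order; the tuple key breaks the tie alphabetically.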
Sorting dict keys with sorted()
    >>> sim = {'Homer':36, 'Lisa':8, 'Marge':35, 'Bart':10}
    >>> sim.keys()
    dict_keys(['Homer', 'Lisa', 'Marge', 'Bart'])
    >>> for s in sim:
            print(s, 'is', sim[s], 'years old.')

    Homer is 36 years old.
    Lisa is 8 years old.
    Marge is 35 years old.
    Bart is 10 years old.
Results are in the order in which the keys were inserted (starting from Python 3.6).
Sorting dict keys with sorted()
    >>> sim = {'Homer':36, 'Lisa':8, 'Marge':35, 'Bart':10}
    >>> sim.keys()
    dict_keys(['Homer', 'Lisa', 'Marge', 'Bart'])
    >>> sorted(sim)
    ['Bart', 'Homer', 'Lisa', 'Marge']
sorted(dict) returns a sorted list of keys.
    >>> for s in sorted(sim):
            print(s, 'is', sim[s], 'years old.')

    Bart is 10 years old.
    Homer is 36 years old.
    Lisa is 8 years old.
    Marge is 35 years old.
The dictionary now prints out in alphabetically sorted key order.
You are temporarily sorting the dictionary keys. The dictionary itself is not getting sorted!
Sorting dict keys by VALUE
    >>> sim = {'Homer':36, 'Lisa':8, 'Marge':35, 'Bart':10}
    >>> sim.keys()
    dict_keys(['Homer', 'Lisa', 'Marge', 'Bart'])
    >>> sorted(sim, key=sim.get)
    ['Lisa', 'Bart', 'Marge', 'Homer']
Returns a sorted list of keys in the order of their values.
dict[x] and dict.get(x) both retrieve the value stored under key x, so sim.get can serve as the key function.
    >>> for s in sorted(sim, key=sim.get):
            print(s, 'is', sim[s], 'years old.')

    Lisa is 8 years old.
    Bart is 10 years old.
    Marge is 35 years old.
    Homer is 36 years old.
The dictionary now prints out in the sorted order of its values.
Try it out (2 minutes)
    >>> sim = {'Homer':36, 'Lisa':8, 'Marge':35, 'Bart':10, 'Maggie':1}
    >>> sim.keys()
    dict_keys(['Homer', 'Lisa', 'Marge', 'Bart', 'Maggie'])
    >>> sorted(sim, key=sim.get)
    ['Maggie', 'Lisa', 'Bart', 'Marge', 'Homer']
    >>> for s in sorted(sim, key=sim.get):
            print(s, 'is', sim[s], 'years old.')

    Maggie is 1 years old.
    Lisa is 8 years old.
    Bart is 10 years old.
    Marge is 35 years old.
    Homer is 36 years old.
Also try:
    key=len
    reverse=True
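For checking your answers afterwards, here is one way the two suggested options play out on sim. The helper top_n is a hypothetical name, added only to show how "sort keys by value, then slice" generalizes to any dictionary of counts.

```python
sim = {'Homer': 36, 'Lisa': 8, 'Marge': 35, 'Bart': 10, 'Maggie': 1}

# Keys ordered by value, largest first.
oldest_first = sorted(sim, key=sim.get, reverse=True)
print(oldest_first)     # ['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie']

# Keys ordered by the length of the key itself; ties keep insertion order.
by_name_length = sorted(sim, key=len)
print(by_name_length)   # ['Lisa', 'Bart', 'Homer', 'Marge', 'Maggie']

def top_n(d, n):
    "Returns the n keys of d with the largest values (hypothetical helper)"
    return sorted(d, key=d.get, reverse=True)[:n]

print(top_n(sim, 2))    # ['Homer', 'Marge']

# One difference between the two lookups: for a missing key,
# .get() returns None while [] raises KeyError.
print(sim.get('Ned'))   # None
```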
Population by state (3 minutes)
    Find the popul dictionary in text-samples.txt
    Copy & paste it into IDLE shell
    Print out the most populous states
    Print out the least populous states
Wrap-up
Next class:
    Continue discussion on spell checkers
    File IO: how to read a file
Homework 2
    Review 2 spell checkers of your choice
    textstats.py
    Recitation students: take a look before tomorrow's meeting