2 Objectives
- Data types and conversion
- Tuple
- Mutability
- Sorting: additional parameters
- Text processing overview
- Token, type
- Frequency counts
- Homework 2
9/13/2018

3 Text: what to find?
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to heaven, we were all going direct the other way

When we have a text, we typically would like to know:
- How many words are in it? (Token count)
- How many unique words are in it? (Type count)
- What are their frequency counts? (Frequency distribution)
- How many times is word x found?
- Are there a lot of long/short words? Which ones are long?
- Which words are very frequent? How frequent?

4 Tokenizer function again
A function that tokenizes a sentence:

    def gettokens(sent):
        new = sent.lower()
        for s in ",.!?':;#$":
            new = new.replace(s, " " + s + " ")
        return new.split()

    >>> joke = "Knock knock, who's there?"
    >>> gettokens(joke)
    ['knock', 'knock', ',', 'who', "'", 's', 'there', '?']
    >>> pal = "A man, a plan, a canal: Panama."
    >>> gettokens(pal)
    ['a', 'man', ',', 'a', 'plan', ',', 'a', 'canal', ':', 'panama', '.']

5 Text-derived data types
rose (text as a string):
    'Rose is a rose is a rose is a rose.'
rosetoks (word token list):
    ['rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']
rosefreq (frequency dict):
    {'a': 3, 'is': 3, '.': 1, 'rose': 4}
rosetypes (word type list):
    ['.', 'a', 'is', 'rose']

7 Tokenization first
Tokenization is the very first step.
    'Rose is a rose is a rose is a rose.'
gettokens:
    ['rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']
Types and the frequency dictionary are built from the token list.
gettypes:
    ['.', 'a', 'is', 'rose']
getfreq:
    {'a': 3, 'is': 3, '.': 1, 'rose': 4}
How about:
- Words that are at least 2 characters long?
- Words that occur at least 3 times?

8 The text processing pipeline
Text:
    'Rose is a rose is a rose is a rose.'
gettokens -> token list:
    ['rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']
gettypes -> type list:
    ['.', 'a', 'is', 'rose']
getxlengthwords (x = 2) -> types at least 2 characters long:
    ['is', 'rose']
getfreq -> frequency dict:
    {'a': 3, 'is': 3, '.': 1, 'rose': 4}
getxfreqwords (x = 3) -> words that occur at least 3 times:
    ['a', 'is', 'rose']
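The pipeline can be sketched roughly as below. Only gettokens has a body in the slides; the bodies of gettypes, getfreq, getxlengthwords and getxfreqwords are assumptions written to reproduce the outputs shown on the slide, not the course's reference solution.

```python
def gettokens(sent):
    """Lowercase the text, pad punctuation with spaces, split on whitespace."""
    new = sent.lower()
    for s in ",.!?':;#$":
        new = new.replace(s, " " + s + " ")
    return new.split()

def gettypes(tokens):
    """Assumed body: sorted list of the unique tokens (types)."""
    return sorted(set(tokens))

def getfreq(tokens):
    """Assumed body: dict mapping each type to its count."""
    freq = {}
    for tok in tokens:
        freq[tok] = freq.get(tok, 0) + 1
    return freq

def getxlengthwords(types, x):
    """Assumed body: types that are at least x characters long."""
    return [w for w in types if len(w) >= x]

def getxfreqwords(freq, x):
    """Assumed body: sorted words that occur at least x times."""
    return sorted(w for w in freq if freq[w] >= x)

rose = 'Rose is a rose is a rose is a rose.'
rosetoks = gettokens(rose)
rosetypes = gettypes(rosetoks)
rosefreq = getfreq(rosetoks)

print(len(rosetoks), 'tokens,', len(rosetypes), 'types')
print(getxlengthwords(rosetypes, 2))
print(getxfreqwords(rosefreq, 3))
```

Both derived structures start from the token list, which is why tokenization has to come first.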

9 Homework 2: Basic text stats
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to heaven, we were all going direct the other way

When we have a text, we typically would like to know:
- How many words are in it? (Token count)
- How many unique words are in it? (Type count)
- What are their frequency counts? (Frequency distribution)
- How many times is word x found?
- Are there a lot of long/short words? Which ones are long?
- Which words are very frequent? How frequent?

10 Python data types
- int: integer
- float: floating point number
- str: string (a piece of text), e.g. 'Bart', 'Hello, world!'
- list, e.g. ['cat', 'dog', 'fox', 'hippo']
- tuple, e.g. ('gold', 'silver', 'bronze')
- dict (dictionary): maps a key to a value, e.g. {'Homer':36, 'Marge':36, 'Bart':10, 'Lisa':8, 'Maggie':1}

11 Data types in Python
type() displays the data type of the object.

    >>> h = 'hello'      # h is str type
    >>> h = list(h)      # h now refers to a list
    >>> h
    ['h', 'e', 'l', 'l', 'o']
    >>> type(h)
    <class 'list'>

Beware of type conflicts!

    >>> w = 'Mary'
    >>> w+3
    Traceback (most recent call last):
      File "<pyshell#68>", line 1, in <module>
        w+3
    TypeError: can only concatenate str (not "int") to str

12 Converting between types
int(): string, floating point -> integer

    >>> int( )
    3

float(): string, integer -> floating point number

    >>> float('256')
    256.0

str(): integer, float, list, tuple, dictionary -> string

    >>> str( )
    ' '
    >>> str([1,2,3,4])
    '[1, 2, 3, 4]'

list(): string, tuple, dictionary -> list

    >>> list('Mary')
    ['M', 'a', 'r', 'y']
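A short sketch of these conversions in a script. The input values are illustrative choices, not taken from the slides. Each call builds a new object of the target type.

```python
n = int('3')                            # string -> integer
m = int(3.99)                           # float -> integer: truncates, does not round
f = float('256')                        # string -> float
s = str(3.14)                           # float -> string
t = str([1, 2, 3, 4])                   # list -> string
chars = list('Mary')                    # string -> list of characters
keys = list({'Homer': 36, 'Bart': 10})  # dict -> list of its keys

print(n, m, f, s, t, chars, keys)

# A string that does not look like a number cannot be converted.
try:
    int('three')
except ValueError as err:
    print('ValueError:', err)
```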

13 Example: tip calculator
Tip calculator: prompts for bill amount, prints out tip and total.

    bill = input("what's your bill amount? ")
    tip = bill * 0.18
    total = bill + tip
    print("your tip amount is $" + tip)
    print("your total is $" + total)

This script does not work. Why?

14 Example: tip calculator
Tip calculator: prompts for bill amount, prints out tip and total.

    bill = input("what's your bill amount? ")
    bill = float(bill)
    tip = bill * 0.18
    total = bill + tip
    print("your tip amount is $" + str(tip))
    print("your total is $" + str(total))

- input() returns a string: turn it into a float.
- tip and total are floats: turn them into strings before concatenating.
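One possible variation, offered as a sketch: the same conversions wrapped in a function so they can be tried without typing at a prompt. The name tip_report, its rate parameter, and the two-decimal formatting are assumptions added for illustration.

```python
def tip_report(bill_text, rate=0.18):
    """Take the bill as a string (as input() would return it), return two message strings."""
    bill = float(bill_text)            # str -> float, so arithmetic works
    tip = bill * rate
    total = bill + tip
    # The f-string format spec turns each float back into text with two decimals.
    return (f"your tip amount is ${tip:.2f}", f"your total is ${total:.2f}")

for line in tip_report("50"):
    print(line)
```

Formatting to two decimals also avoids printing long floating point tails such as 8.999999999999998.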

15 Tuples
Enclosed in (). Pretty much like a list, but once created, it cannot be changed (i.e., it is immutable).

list:

    >>> li = ['spring', 'summer', 'fall', 'winter']
    >>> li[2] = 'autumn'
    >>> li
    ['spring', 'summer', 'autumn', 'winter']

tuple:

    >>> tu = ('spring', 'summer', 'fall', 'winter')
    >>> tu
    ('spring', 'summer', 'fall', 'winter')
    >>> tu[2] = 'autumn'
    Traceback (most recent call last):
      File "<pyshell#37>", line 1, in <module>
        tu[2] = 'autumn'
    TypeError: 'tuple' object does not support item assignment

16 Tuple: indexing, length, membership
Can be indexed and sliced:

    >>> tu[0]
    'spring'
    >>> tu[-1]
    'winter'
    >>> tu[2:]
    ('fall', 'winter')

len() for length, in for membership test:

    >>> tu
    ('spring', 'summer', 'fall', 'winter')
    >>> len(tu)
    4
    >>> 'fall' in tu
    True

These operations are shared by all sequence types: list, tuple, string.

17 Conversion
tuple() converts an object into a tuple.

    >>> li = [1,2,3,4]
    >>> tuple(li)
    (1, 2, 3, 4)
    >>> tuple('penguin')
    ('p', 'e', 'n', 'g', 'u', 'i', 'n')

18 One more data type: set
Set: a collection of distinct objects. A set is unordered.

    >>> {'a', 'b', 'c'} == {'b', 'c', 'a'}
    True
    >>> len({'b', 'c', 'a', 'c', 'a', 'c', 'a'})
    3

set() converts an object into a set. The order in which a set displays its items is arbitrary.

    >>> set(['a', 'man', 'a', 'plan', 'a', 'canal', 'a'])
    {'a', 'canal', 'plan', 'man'}
    >>> set('abracadabra')
    {'c', 'b', 'd', 'r', 'a'}
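Because a set drops duplicates, it gives a quick route from a token list to a type count. A minimal sketch, reusing the token list from the slide (the variable names are illustrative):

```python
tokens = ['a', 'man', 'a', 'plan', 'a', 'canal', 'a']
types = set(tokens)            # duplicates collapse into one entry each

print(len(tokens))             # token count
print(len(types))              # type count
print(sorted(types))           # sorted() gives a list with a predictable order
print('plan' in types)         # membership test works on sets too
```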

19 String and list behave differently
string: IMMUTABLE

    >>> thing = 'iphone'
    >>> thing.upper()
    'IPHONE'
    >>> thing
    'iphone'

- Method creates and returns a new string.
- Original string is unchanged.

list: MUTABLE

    >>> li = [1, 2, 3, 4]
    >>> li.append(5)
    >>> li
    [1, 2, 3, 4, 5]

- Nothing returned.
- Original list has changed!

20 Tuple and list behave differently
tuple: IMMUTABLE

    >>> tu = (10, 20, 30, 40)
    >>> tu[2] = 75
    Traceback (most recent call last):
      File "<pyshell#9>", line 1, in <module>
        tu[2] = 75
    TypeError: 'tuple' object does not support item assignment

Cannot change individual items.

list: MUTABLE

    >>> li = [10, 20, 30, 40]
    >>> li[2] = 75
    >>> li
    [10, 20, 75, 40]

Individual items can be updated.

21 Immutable data types
- int: 33
- float: 33.5
- str: 'Hello, world!'
- tuple: ('Spring', 'Summer', 'Winter', 'Fall')

In Python, the data types integer, float, string, and tuple are immutable. Python functions do NOT directly change these data: they are unchangeable! Instead, a new object is created in memory and returned. The value a name refers to changes only through a new assignment statement.

22 Mutable data types
- list: ['cat', 'dog', 'fox', 'hippo']
- dict: {'Homer':36, 'Marge':36, 'Bart':10, 'Lisa':8}

List and dictionary are mutable data types. When a modifying method (such as .append() or .sort()) is called on these data, the operation is done in place. The original data object in memory is changed! And, nothing gets returned.*

*Unless the method also returns a value, e.g., .pop()
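The contrast can be checked directly, as in this small sketch. The variable names are illustrative; the point is what each call returns and whether the original object changes.

```python
word = 'iphone'
shouted = word.upper()         # returns a NEW string; word is untouched

pets = ['cat', 'dog', 'fox', 'hippo']
result = pets.append('owl')    # changes pets in place; the method returns None
last = pets.pop()              # changes pets in place AND returns the removed item

ages = {'Homer': 36, 'Bart': 10}
ages['Lisa'] = 8               # dicts are updated in place as well

print(word, shouted)
print(result, last, pets)
print(ages)
```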

23 List operations
List methods*
- Functions that are defined on the list data type
- Called on a list object, with this syntax: listobj.method()
- Lists are mutable, which means list methods can modify the caller object (list) in place.

*"Methods" are functions that are specific to a data type. They are "called on" a data object and have object.method() syntax.

24 Sorting and reversing

    >>> li = ['fox', 'dog', 'owl', 'cat']
    >>> li.sort()
    >>> li
    ['cat', 'dog', 'fox', 'owl']
    >>> li.reverse()
    >>> li
    ['owl', 'fox', 'dog', 'cat']

- .sort() sorts a list in place.
- .reverse() reverses a list in place.

25 Assignment statement vs. mutability

    >>> li = ['fox', 'dog', 'owl', 'cat']
    >>> li.sort()
    >>> li
    ['cat', 'dog', 'fox', 'owl']
    >>> li.reverse()
    >>> li
    ['owl', 'fox', 'dog', 'cat']
    >>> li2 = li.reverse()
    >>> li2
    >>>

This assignment statement produces an unintended result: li2 has no value! .reverse() modifies li in place, and returns NOTHING. Hence, li2 is assigned None.
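If the goal is a reversed copy under a new name, two standard options are slicing and the built-in reversed(). This is a sketch of alternatives that the slide itself does not cover:

```python
li = ['cat', 'dog', 'fox', 'owl']

li2 = li[::-1]               # slice with step -1: a new, reversed list
li3 = list(reversed(li))     # reversed() returns an iterator, so wrap it in list()

print(li2)
print(li3)
print(li)                    # the original list keeps its order
```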

26 Two ways to sort
list.sort()

    >>> li = [2, 3, 1, 4]
    >>> li.sort()
    >>> li
    [1, 2, 3, 4]

sorted(list)

    >>> li = [2, 3, 1, 4]
    >>> sorted(li)
    [1, 2, 3, 4]

What's the difference?

27 Two ways to sort
list.sort()

    >>> li = [2, 3, 1, 4]
    >>> li.sort()
    >>> li
    [1, 2, 3, 4]

- li itself is changed in memory
- Returns nothing

sorted(list)

    >>> li = [2, 3, 1, 4]
    >>> sorted(li)
    [1, 2, 3, 4]
    >>> li
    [2, 3, 1, 4]

- Creates and returns a new sorted list
- li is not changed

x.sort() is a list METHOD: defined on list objects only.
sorted(x) is a general-purpose function: works on lists, strings, etc.
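To illustrate the "general-purpose" point, here is a small sketch applying sorted() to a string, a tuple, and a set. In each case the result is a new list. The sample values are illustrative.

```python
letters = sorted('pear')                           # string -> sorted list of characters
medals = sorted(('gold', 'silver', 'bronze'))      # tuple -> sorted list
members = sorted({'b', 'c', 'a'})                  # set -> sorted list
word = ''.join(letters)                            # glue the characters back into a string

print(letters, medals, members, word)

# Strings have no .sort() method, since they are immutable.
try:
    'pear'.sort()
except AttributeError as err:
    print('AttributeError:', err)
```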

28 Why use sorted()
Once you sort a list through .sort(), the original order is lost forever. Often, you don't want to change the original object. This is when you use sorted().

    >>> li = [2, 3, 1, 4]
    >>> sorted(li)
    [1, 2, 3, 4]
    >>> li
    [2, 3, 1, 4]
    >>> li2 = sorted(li)
    >>> li2
    [1, 2, 3, 4]
    >>> li
    [2, 3, 1, 4]

Assign a new name to the returned list.

29 Practice: common characters
Build this function:

    def in_both(wd1, wd2):
        "Takes two strs, returns a sorted list of common chars"
        common = []
        for c in wd1:
            if c in wd2:
                common.append(c)
        return sorted(common)

- Initialize a new empty list.
- For each character in wd1, if it is also in wd2, add it to the list.
- Return the sorted list.

    >>> in_both('pear', 'apple')
    ['a', 'e', 'p']

30 Sorting output with sorted()
Build this function:

    def in_both(wd1, wd2):
        "Takes two strs, returns a sorted list of common chars"
        common = []
        for c in wd1:
            if c in wd2:
                common.append(c)
        return sorted(common)

The output is sorted before returning, using the sorted() function.

    >>> in_both('pear', 'apple')
    ['a', 'e', 'p']

Works! Can we use .sort() instead?

31 Alternative: using the .sort() method
Build this function:

    def in_both(wd1, wd2):
        "Takes two strs, returns a sorted list of common chars"
        common = []
        for c in wd1:
            if c in wd2:
                common.append(c)
        return common.sort()

    >>> in_both('pear', 'apple')
    >>>

Whoa! What happened?

32 Return statement vs. .sort()
Build this function:

    def in_both(wd1, wd2):
        "Takes two strs, returns a sorted list of common chars"
        common = []
        for c in wd1:
            if c in wd2:
                common.append(c)
        return common.sort()

common.sort() sorts common in place, and returns NOTHING, so the function returns None.

    >>> in_both('pear', 'apple')
    >>>

Do NOT use an "in-place-operating" method in a return statement.*

*Unless it also returns a value, e.g., .pop()

33 Sort list, and then return.
Build this function:

    def in_both(wd1, wd2):
        "Takes two strs, returns a sorted list of common chars"
        common = []
        for c in wd1:
            if c in wd2:
                common.append(c)
        common.sort()
        return common

Sort first, and then return.

    >>> in_both('pear', 'apple')
    ['a', 'e', 'p']
    >>> in_both('elephant', 'penguin')
    ['e', 'e', 'n', 'p']

Duplicates! How to avoid?

34 Finished
Build this function:

    def in_both(wd1, wd2):
        "Takes two strs, returns a sorted list of common chars"
        common = []
        for c in wd1:
            if c in wd2 and c not in common:
                common.append(c)
        common.sort()
        return common

Append to common only if c is not already in it.

    >>> in_both('pear', 'apple')
    ['a', 'e', 'p']
    >>> in_both('elephant', 'penguin')
    ['e', 'n', 'p']

Success!

*Also: you could use set()
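The set() route mentioned in the footnote might look like the sketch below. The name in_both_set is hypothetical, and the slides do not show this version. The & operator computes the intersection of two sets.

```python
def in_both_set(wd1, wd2):
    "Takes two strs, returns a sorted list of common chars (set-based variant)"
    # set() removes duplicate characters; & keeps only those present in both sets.
    return sorted(set(wd1) & set(wd2))

print(in_both_set('pear', 'apple'))
print(in_both_set('elephant', 'penguin'))
```

Because sorted() returns a new list, it is safe to use directly in the return statement.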

35 Try it out (2 minutes)
Build this function:

    def in_both(wd1, wd2):
        "Takes two strs, returns a sorted list of common chars"
        common = []
        for c in wd1:
            if c in wd2 and c not in common:
                common.append(c)
        common.sort()
        return common

    >>> in_both('pear', 'apple')
    ['a', 'e', 'p']
    >>> in_both('elephant', 'penguin')
    ['e', 'n', 'p']

36 More on sorted(): sorting options

    >>> li = ['x', 'ab', 'd', 'cde']
    >>> sorted(li)
    ['ab', 'cde', 'd', 'x']
    >>> sorted(li, reverse=True)
    ['x', 'd', 'cde', 'ab']

Reverse alphabetical order.

    >>> sorted(li, key=len)
    ['x', 'd', 'ab', 'cde']

Use the len() function as key: order by string length.

    >>> sorted(li, key=len, reverse=True)
    ['cde', 'ab', 'x', 'd']

Likewise, but in descending order.
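Any one-argument function can serve as key. The sketch below shows two further options that the slide does not cover: str.lower for case-insensitive ordering, and a hypothetical helper length_then_alpha that breaks length ties alphabetically.

```python
names = ['Marge', 'bart', 'Homer', 'lisa']
plain = sorted(names)                     # uppercase letters sort before lowercase
no_case = sorted(names, key=str.lower)    # compare lowercased copies instead

def length_then_alpha(s):
    """Hypothetical key function: sort by length first, then alphabetically."""
    return (len(s), s)

li = ['x', 'ab', 'd', 'cde']
by_len_alpha = sorted(li, key=length_then_alpha)

print(plain)
print(no_case)
print(by_len_alpha)
```

The key function only decides the ordering; the items in the returned list are the original, unmodified strings.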

37 Sorting dict keys with sorted()

    >>> sim = {'Homer':36, 'Lisa':8, 'Marge':35, 'Bart':10}
    >>> sim.keys()
    dict_keys(['Homer', 'Lisa', 'Marge', 'Bart'])
    >>> for s in sim:
            print(s, 'is', sim[s], 'years old.')

    Homer is 36 years old.
    Lisa is 8 years old.
    Marge is 35 years old.
    Bart is 10 years old.

Results are in the order of declaration, i.e. insertion order (starting from Python 3.6; guaranteed by the language from 3.7).

38 Sorting dict keys with sorted()

    >>> sim = {'Homer':36, 'Lisa':8, 'Marge':35, 'Bart':10}
    >>> sim.keys()
    dict_keys(['Homer', 'Lisa', 'Marge', 'Bart'])
    >>> sorted(sim)
    ['Bart', 'Homer', 'Lisa', 'Marge']

sorted(dict) returns a sorted list of keys.

    >>> for s in sorted(sim):
            print(s, 'is', sim[s], 'years old.')

    Bart is 10 years old.
    Homer is 36 years old.
    Lisa is 8 years old.
    Marge is 35 years old.

Dictionary now prints out in alphabetically sorted key order. You are temporarily sorting dictionary keys. Dictionary itself is not getting sorted!

39 Sorting dict keys by VALUE

    >>> sim = {'Homer':36, 'Lisa':8, 'Marge':35, 'Bart':10}
    >>> sim.keys()
    dict_keys(['Homer', 'Lisa', 'Marge', 'Bart'])
    >>> sorted(sim, key=sim.get)
    ['Lisa', 'Bart', 'Marge', 'Homer']

For a key that exists, dict[x] and dict.get(x) do the same thing: retrieving the dict value. With key=sim.get, sorted() returns a list of keys in the order of their values.

    >>> for s in sorted(sim, key=sim.get):
            print(s, 'is', sim[s], 'years old.')

    Lisa is 8 years old.
    Bart is 10 years old.
    Marge is 35 years old.
    Homer is 36 years old.

Dictionary now prints out in the sorted order of its values.
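The two lookups differ only when the key is missing, which this short sketch illustrates. The variable names are illustrative.

```python
sim = {'Homer': 36, 'Lisa': 8, 'Marge': 35, 'Bart': 10}

age = sim['Bart']                 # square brackets: value for an existing key
same = sim.get('Bart')            # .get(): same value for an existing key
missing = sim.get('Maggie')       # key absent: .get() returns None
default = sim.get('Maggie', 0)    # ...or the fallback value you supply

print(age, same, missing, default)

# Square brackets raise an error for an absent key.
try:
    sim['Maggie']
except KeyError as err:
    print('KeyError:', err)
```

sorted() needs a function it can call on each key, and sim.get is such a function, which is why it is passed as key without parentheses.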

40 Try it out (2 minutes)

    >>> sim = {'Homer':36, 'Lisa':8, 'Marge':35, 'Bart':10, 'Maggie':1}
    >>> sim.keys()
    dict_keys(['Homer', 'Lisa', 'Marge', 'Bart', 'Maggie'])
    >>> sorted(sim, key=sim.get)
    ['Maggie', 'Lisa', 'Bart', 'Marge', 'Homer']
    >>> for s in sorted(sim, key=sim.get):
            print(s, 'is', sim[s], 'years old.')

    Maggie is 1 years old.
    Lisa is 8 years old.
    Bart is 10 years old.
    Marge is 35 years old.
    Homer is 36 years old.

Also try:
- key=len
- reverse=True
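A sketch of what the two suggested variations produce on the same dictionary. The slicing at the end is an added illustration of picking off the extremes of a value-sorted key list.

```python
sim = {'Homer': 36, 'Lisa': 8, 'Marge': 35, 'Bart': 10, 'Maggie': 1}

oldest_first = sorted(sim, key=sim.get, reverse=True)   # by value, descending
by_name_length = sorted(sim, key=len)                   # by key length; ties keep dict order
youngest_two = sorted(sim, key=sim.get)[:2]             # first two of the ascending list

print(oldest_first)
print(by_name_length)
print(youngest_two)
```

Python's sort is stable, so keys of equal length ('Lisa' and 'Bart', 'Homer' and 'Marge') stay in their original relative order.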

41 Population by state (3 minutes)
- Find the popul dictionary in text-samples.txt
- Copy & paste it into IDLE shell
- Print out the most populous states
- Print out the least populous states

42 Wrap-up
Next class:
- Continue discussion on spell checkers
- File IO: how to read a file
Homework 2
- Review 2 spell checkers of your choice
- Recitation students: take a look before tomorrow's meeting
