Lab 6: Data Types, Mutability, Sorting. Ling 1330/2330: Computational Linguistics Na-Rae Han

Objectives
    Data types and conversion
    Tuple
    Mutability
    Sorting: additional parameters
    Text processing overview
    Token, type
    Frequency counts
    Homework 2
9/13/2018

Text: what to find?

    It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to heaven, we were all going direct the other way

When we have a text, we typically would like to know:
    How many words are in it? (Token count)
    How many unique words are in it? (Type count)
    What are their frequency counts? (Frequency distribution)
    How many times is word x found?
    Are there a lot of long/short words? Which ones are long?
    Which words are very frequent? How frequent?

Tokenizer function again
A function that tokenizes a sentence:

    def gettokens(sent):
        new = sent.lower()
        for s in ",.!?':;#$":
            new = new.replace(s, " " + s + " ")
        return new.split()

    >>> joke = "Knock knock, who's there?"
    >>> gettokens(joke)
    ['knock', 'knock', ',', 'who', "'", 's', 'there', '?']
    >>> pal = "A man, a plan, a canal: Panama."
    >>> gettokens(pal)
    ['a', 'man', ',', 'a', 'plan', ',', 'a', 'canal', ':', 'panama', '.']

Text-derived data types
    rose        text as a string     'Rose is a rose is a rose is a rose.'
    rosetoks    word token list      ['rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']
    rosefreq    frequency dict       {'a': 3, 'is': 3, '.': 1, 'rose': 4}
    rosetypes   word type list       ['.', 'a', 'is', 'rose']

Tokenization first
Tokenization is the very first step.
    'Rose is a rose is a rose is a rose.'
        -> gettokens -> ['rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']
            -> gettypes -> ['.', 'a', 'is', 'rose']
            -> getfreq  -> {'a': 3, 'is': 3, '.': 1, 'rose': 4}
Types and the frequency dictionary are built from the token list.
How about:
    Words that are at least 2 characters long?
    Words that occur at least 3 times?

The text processing pipeline
    'Rose is a rose is a rose is a rose.'
        -> gettokens -> ['rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']
            -> gettypes -> ['.', 'a', 'is', 'rose']
                -> getxlengthwords (x = 2) -> ['is', 'rose']
            -> getfreq -> {'a': 3, 'is': 3, '.': 1, 'rose': 4}
                -> getxfreqwords (x = 3) -> ['a', 'is', 'rose']
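The pipeline above could be sketched as follows. The function names come from the slide, but the bodies and the argument order are assumptions made for illustration, not the course's reference solution; gettokens is repeated from the earlier slide so the sketch runs on its own.

```python
def gettokens(sent):
    """Lowercase the text, pad punctuation with spaces, split on whitespace."""
    new = sent.lower()
    for s in ",.!?':;#$":
        new = new.replace(s, " " + s + " ")
    return new.split()

def gettypes(toks):
    """Return the sorted list of unique tokens (word types)."""
    return sorted(set(toks))

def getfreq(toks):
    """Return a dict that maps each word type to its count."""
    freq = {}
    for t in toks:
        freq[t] = freq.get(t, 0) + 1
    return freq

def getxlengthwords(types, x):
    """Return the word types that are at least x characters long."""
    return [w for w in types if len(w) >= x]

def getxfreqwords(freq, x):
    """Return the sorted word types that occur at least x times."""
    return sorted(w for w in freq if freq[w] >= x)

rose = 'Rose is a rose is a rose is a rose.'
rosetoks = gettokens(rose)
rosetypes = gettypes(rosetoks)
rosefreq = getfreq(rosetoks)

print(rosetypes)                      # ['.', 'a', 'is', 'rose']
print(getxlengthwords(rosetypes, 2))  # ['is', 'rose']
print(getxfreqwords(rosefreq, 3))     # ['a', 'is', 'rose']
```

Everything downstream takes the token list (or something built from it) as input, which is why tokenization has to come first.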

Homework 2: Basic text stats

    It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to heaven, we were all going direct the other way

When we have a text, we typically would like to know:
    How many words are in it? (Token count)
    How many unique words are in it? (Type count)
    What are their frequency counts? (Frequency distribution)
    How many times is word x found?
    Are there a lot of long/short words? Which ones are long?
    Which words are very frequent? How frequent?

Python data types
    int: integer                     33
    float: floating point number     5.49
    str: string (a piece of text)    'Bart'    'Hello, world!'
    list                             ['cat', 'dog', 'fox', 'hippo']
    tuple                            ('gold', 'silver', 'bronze')
    dict: dictionary                 {'Homer':36, 'Marge':36, 'Bart':10, 'Lisa':8, 'Maggie':1}
                                     (maps a key to a value)

Data types in Python
type() displays the data type of the object.
    >>> h = 'hello'       # h is str type
    >>> h = list(h)       # h now refers to a list
    >>> h
    ['h', 'e', 'l', 'l', 'o']
    >>> type(h)
    <class 'list'>
Beware of type conflicts!
    >>> w = 'Mary'
    >>> w+3
    Traceback (most recent call last):
      File "<pyshell#68>", line 1, in <module>
        w+3
    TypeError: can only concatenate str (not "int") to str

Converting between types
int(): string, floating point -> integer
    >>> int(3.141592)
    3
float(): string, integer -> floating point number
    >>> float('256')
    256.0
str(): integer, float, list, tuple, dictionary -> string
    >>> str(3.141592)
    '3.141592'
    >>> str([1,2,3,4])
    '[1, 2, 3, 4]'
list(): string, tuple, dictionary -> list
    >>> list('Mary')
    ['M', 'a', 'r', 'y']
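A few more conversions, as a small sketch of the edge cases. The variable names are made up for illustration. Note that int() truncates a float rather than rounding, that list() on a dictionary keeps only the keys, and that int() refuses a string that looks like a float.

```python
whole = int('256')                # digit string -> int
truncated = int(3.99)             # float -> int drops the fraction: 3
percent = str(10) + '%'           # int -> str, so + concatenates
keys = list({'x': 1, 'y': 2})     # dict -> list of its keys only

print(whole, truncated, percent, keys)

# int() cannot parse a string with a decimal point directly.
try:
    int('3.14')
except ValueError as err:
    print('ValueError:', err)

# Going through float() first works.
via_float = int(float('3.14'))
print(via_float)                  # 3
```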

Example: tip calculator
Tip calculator: prompts for bill amount, prints out tip and total.

    bill = input("what's your bill amount? ")
    tip = bill * 0.18
    total = bill + tip
    print("your tip amount is $" + tip)
    print("your total is $" + total)

This script does not work. Why?

Example: tip calculator
Tip calculator: prompts for bill amount, prints out tip and total.

    bill = input("what's your bill amount? ")
    bill = float(bill)
    tip = bill * 0.18
    total = bill + tip
    print("your tip amount is $" + str(tip))
    print("your total is $" + str(total))

input() returns a string: turn it into a float.
tip and total are floats: turn them into strings before concatenating.
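The same conversions can be wrapped in a function so they are easy to test without typing at a prompt. This is a sketch only: tip_and_total and its rate parameter are hypothetical names, and format(x, ".2f") is used here in place of str() to show two decimal places.

```python
def tip_and_total(bill_text, rate=0.18):
    """Take the bill as a string (as input() would return it),
    and return the tip and the total as floats."""
    bill = float(bill_text)      # str -> float before doing arithmetic
    tip = bill * rate
    total = bill + tip
    return tip, total            # returned together as a tuple

tip, total = tip_and_total("50")
print("your tip amount is $" + format(tip, ".2f"))   # your tip amount is $9.00
print("your total is $" + format(total, ".2f"))      # your total is $59.00
```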

Tuples
Enclosed in (). Pretty much like a list, but once declared, it cannot be changed (i.e., it is immutable).
list:
    >>> li = ['spring', 'summer', 'fall', 'winter']
    >>> li[2] = 'autumn'
    >>> li
    ['spring', 'summer', 'autumn', 'winter']
tuple:
    >>> tu = ('spring', 'summer', 'fall', 'winter')
    >>> tu
    ('spring', 'summer', 'fall', 'winter')
    >>> tu[2] = 'autumn'
    Traceback (most recent call last):
      File "<pyshell#37>", line 1, in <module>
        tu[2] = 'autumn'
    TypeError: 'tuple' object does not support item assignment

Tuple: indexing, length, membership
Can be indexed and sliced:
    >>> tu[0]
    'spring'
    >>> tu[-1]
    'winter'
    >>> tu[2:]
    ('fall', 'winter')
len() for length, in for membership test:
    >>> tu
    ('spring', 'summer', 'fall', 'winter')
    >>> len(tu)
    4
    >>> 'fall' in tu
    True
These operations are shared by all sequence types: list, tuple, string.

Conversion
tuple() converts an object into a tuple.
    >>> li = [1,2,3,4]
    >>> tuple(li)
    (1, 2, 3, 4)
    >>> tuple('penguin')
    ('p', 'e', 'n', 'g', 'u', 'i', 'n')

One more data type: set
Set: a collection of distinct objects. It is order-less.
    >>> {'a', 'b', 'c'} == {'b', 'c', 'a'}
    True
    >>> len({'b', 'c', 'a', 'c', 'a', 'c', 'a'})
    3
set() converts an object into a set.
    >>> set(['a', 'man', 'a', 'plan', 'a', 'canal', 'a'])
    {'a', 'canal', 'plan', 'man'}
    >>> set('abracadabra')
    {'c', 'b', 'd', 'r', 'a'}
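Because a set drops duplicates, it offers a short route from a token list to a type count. A minimal sketch, reusing the rose token list from the earlier slides (the variable names are illustrative):

```python
toks = ['rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']

token_count = len(toks)        # every occurrence counts
type_count = len(set(toks))    # duplicates collapse inside the set
types = sorted(set(toks))      # sorted() turns the set into an ordered list

print(token_count)   # 11
print(type_count)    # 4
print(types)         # ['.', 'a', 'is', 'rose']
```

Since a set has no order, wrapping it in sorted() is what makes the printed result predictable.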

String and list behave differently
string: IMMUTABLE
    >>> thing = 'iphone'
    >>> thing.upper()
    'IPHONE'
    >>> thing
    'iphone'
    The method creates and returns a new string. The original string is unchanged.
list: MUTABLE
    >>> li = [1, 2, 3, 4]
    >>> li.append(5)
    >>> li
    [1, 2, 3, 4, 5]
    Nothing is returned. The original list has changed!

Tuple and list behave differently
tuple: IMMUTABLE
    >>> tu = (10, 20, 30, 40)
    >>> tu[2] = 75
    Traceback (most recent call last):
      File "<pyshell#9>", line 1, in <module>
        tu[2] = 75
    TypeError: 'tuple' object does not support item assignment
    Cannot change individual items.
list: MUTABLE
    >>> li = [10, 20, 30, 40]
    >>> li[2] = 75
    >>> li
    [10, 20, 75, 40]
    Individual items can be updated.

Immutable data types
    int      33
    float    33.5
    str      'Hello, world!'
    tuple    ('Spring', 'Summer', 'Winter', 'Fall')
In Python, the data types integer, float, string, and tuple are immutable.
Python functions do NOT directly change these data: they are unchangeable!
Instead, a new object is created in memory and returned.
A name's value changes only through a new assignment statement.
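A short sketch of what "a new object is created and returned" looks like in practice. The variable names are illustrative; 'Monsoon' is an arbitrary extra item.

```python
word = 'Hello, world!'
shouted = word.upper()           # a new string is created and returned
print(word)                      # Hello, world!   (original untouched)
print(shouted)                   # HELLO, WORLD!

seasons = ('Spring', 'Summer', 'Winter', 'Fall')
more = seasons + ('Monsoon',)    # + builds a new, longer tuple
print(len(seasons), len(more))   # 4 5

word = word.upper()              # re-assignment: word now names the new string
print(word)                      # HELLO, WORLD!
```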

Mutable data types
    list    ['cat', 'dog', 'fox', 'hippo']
    dict    {'Homer':36, 'Marge':36, 'Bart':10, 'Lisa':8}
List and dictionary are mutable data types.
When a Python method is called on these data, the operation is done in place.
The original data object in memory is changed!
And, nothing gets returned.*
*Unless it also returns a value, e.g., .pop()
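The in-place behavior, including the .pop() exception noted in the footnote, can be checked with a few lines. This is an illustrative sketch with made-up variable names.

```python
animals = ['cat', 'dog', 'fox', 'hippo']
result = animals.append('owl')   # changes the list in place
print(result)                    # None: append returns nothing
print(animals)                   # ['cat', 'dog', 'fox', 'hippo', 'owl']

last = animals.pop()             # changes the list AND returns the removed item
print(last)                      # owl
print(animals)                   # ['cat', 'dog', 'fox', 'hippo']

ages = {'Homer': 36, 'Marge': 36, 'Bart': 10, 'Lisa': 8}
ages['Maggie'] = 1               # adds an entry to the same dict object
bart_age = ages.pop('Bart')      # removes the entry and returns its value
print(bart_age)                  # 10
print(ages)
```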

List operations
List methods*
    Functions that are defined on the list data type.
    Called on a list object, with this syntax: listobj.method()
    Lists are mutable, which means list methods such as .append() and .sort() modify the caller object (the list) in place.
*"Methods" are functions that are specific to a data type. They are "called on" a data object and have object.method() syntax.

Sorting and reversing
    >>> li = ['fox', 'dog', 'owl', 'cat']
    >>> li.sort()
    >>> li
    ['cat', 'dog', 'fox', 'owl']
    >>> li.reverse()
    >>> li
    ['owl', 'fox', 'dog', 'cat']
.sort() sorts a list in place.
.reverse() reverses a list in place.

Assignment statement vs. mutability
    >>> li = ['fox', 'dog', 'owl', 'cat']
    >>> li.sort()
    >>> li
    ['cat', 'dog', 'fox', 'owl']
    >>> li.reverse()
    >>> li
    ['owl', 'fox', 'dog', 'cat']
    >>> li2 = li.reverse()
    >>> li2
    >>>
This assignment statement produces an unintended result: li2 has no value!
.reverse() modifies li in place and returns NOTHING. Hence, li2 is assigned None.
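If the goal is a reversed copy under a new name, two standard alternatives are slicing with a step of -1 and the built-in reversed(). A minimal sketch; li3 and li4 are illustrative names.

```python
li = ['fox', 'dog', 'owl', 'cat']

li2 = li.reverse()          # reverses li in place; li2 ends up as None
print(li2)                  # None
print(li)                   # ['cat', 'owl', 'dog', 'fox']

li3 = li[::-1]              # slicing builds a new, reversed list
li4 = list(reversed(li))    # reversed() gives an iterator; list() collects it
print(li3)                  # ['fox', 'dog', 'owl', 'cat']
print(li)                   # ['cat', 'owl', 'dog', 'fox']  (unchanged this time)
```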

Two ways to sort
list.sort()
    >>> li = [2, 3, 1, 4]
    >>> li.sort()
    >>> li
    [1, 2, 3, 4]
sorted(list)
    >>> li = [2, 3, 1, 4]
    >>> sorted(li)
    [1, 2, 3, 4]
What's the difference?

Two ways to sort
list.sort()
    >>> li = [2, 3, 1, 4]
    >>> li.sort()
    >>> li
    [1, 2, 3, 4]
    li itself is changed in memory. Returns nothing.
sorted(list)
    >>> li = [2, 3, 1, 4]
    >>> sorted(li)
    [1, 2, 3, 4]
    >>> li
    [2, 3, 1, 4]
    Creates and returns a new sorted list. li is not changed.
x.sort() is a list METHOD: defined on list objects only.
sorted(x) is a general-purpose function: works on lists, strings, etc.
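To see the "general-purpose" part, here is sorted() applied to a string and a tuple, next to a failed attempt to call .sort() on a string. A small illustrative sketch; the variable names are made up.

```python
letters = sorted('penguin')                        # string in, list of chars out
medals = sorted(('gold', 'silver', 'bronze'))      # tuple in, list out
rejoined = ''.join(letters)                        # glue the chars back into a str

print(letters)     # ['e', 'g', 'i', 'n', 'n', 'p', 'u']
print(medals)      # ['bronze', 'gold', 'silver']
print(rejoined)    # eginnpu

# Strings have no .sort() method, because they are immutable.
try:
    'penguin'.sort()
except AttributeError as err:
    print('AttributeError:', err)
```

Whatever the input type, sorted() always returns a list.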

Why use sorted()
Once you sort a list through .sort(), the original order is lost forever.
Often, you don't want to change the original object. This is when you use sorted().
    >>> li = [2, 3, 1, 4]
    >>> sorted(li)
    [1, 2, 3, 4]
    >>> li
    [2, 3, 1, 4]
    >>> li2 = sorted(li)
    >>> li2
    [1, 2, 3, 4]
    >>> li
    [2, 3, 1, 4]
Assign a new name to the returned list.

Practice: common characters
Build this function:

    def in_both(wd1, wd2):
        "Takes two strs, returns a sorted list of common chars"
        common = []                    # Initialize a new empty list.
        for c in wd1:                  # For each character in wd1,
            if c in wd2:               # if it is also in wd2,
                common.append(c)       # add it to the list.
        return sorted(common)          # Return the sorted list.

    >>> in_both('pear', 'apple')
    ['a', 'e', 'p']

Sorting output with sorted()
Build this function:

    def in_both(wd1, wd2):
        "Takes two strs, returns a sorted list of common chars"
        common = []
        for c in wd1:
            if c in wd2:
                common.append(c)
        return sorted(common)

    >>> in_both('pear', 'apple')
    ['a', 'e', 'p']

Works! The output is sorted before returning, using the sorted() function.
Can we use .sort() instead?

Alternative: using the .sort() method
Build this function:

    def in_both(wd1, wd2):
        "Takes two strs, returns a sorted list of common chars"
        common = []
        for c in wd1:
            if c in wd2:
                common.append(c)
        return common.sort()

    >>> in_both('pear', 'apple')
    >>>

Whoa! What happened?

Return statement vs. .sort()
Build this function:

    def in_both(wd1, wd2):
        "Takes two strs, returns a sorted list of common chars"
        common = []
        for c in wd1:
            if c in wd2:
                common.append(c)
        return common.sort()

    >>> in_both('pear', 'apple')
    >>>

common.sort() sorts common in place and returns NOTHING, so the function returns None.
Do NOT use an "in-place-operating" method in a return statement.*
*Unless it also returns a value, e.g., .pop()

Sort list, and then return.
Build this function:

    def in_both(wd1, wd2):
        "Takes two strs, returns a sorted list of common chars"
        common = []
        for c in wd1:
            if c in wd2:
                common.append(c)
        common.sort()
        return common

Sort first, and then return.

    >>> in_both('pear', 'apple')
    ['a', 'e', 'p']
    >>> in_both('elephant', 'penguin')
    ['e', 'e', 'n', 'p']

Duplicates! How to avoid?

Finished
Build this function:

    def in_both(wd1, wd2):
        "Takes two strs, returns a sorted list of common chars"
        common = []
        for c in wd1:
            if c in wd2 and c not in common:
                common.append(c)
        common.sort()
        return common

Append to common only if c is not already in it.

    >>> in_both('pear', 'apple')
    ['a', 'e', 'p']
    >>> in_both('elephant', 'penguin')
    ['e', 'n', 'p']

Success!
*Also: you could use set()

Try it out (2 minutes)
Build this function:

    def in_both(wd1, wd2):
        "Takes two strs, returns a sorted list of common chars"
        common = []
        for c in wd1:
            if c in wd2 and c not in common:
                common.append(c)
        common.sort()
        return common

    >>> in_both('pear', 'apple')
    ['a', 'e', 'p']
    >>> in_both('elephant', 'penguin')
    ['e', 'n', 'p']
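For comparison, a set-based variant might look like the sketch below. It swaps the loop for set intersection (the & operator on sets), which is one possible reading of the "you could use set()" hint rather than the version the slides build.

```python
def in_both(wd1, wd2):
    "Takes two strs, returns a sorted list of common chars"
    # set() removes duplicate characters; & keeps only those in both sets.
    common = set(wd1) & set(wd2)
    # sorted() returns a new list, so it is safe to use in a return statement.
    return sorted(common)

print(in_both('pear', 'apple'))         # ['a', 'e', 'p']
print(in_both('elephant', 'penguin'))   # ['e', 'n', 'p']
```

No duplicate check is needed here, because a set cannot hold the same character twice.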

More on sorted(): sorting options
    >>> li = ['x', 'ab', 'd', 'cde']
    >>> sorted(li)
    ['ab', 'cde', 'd', 'x']
    >>> sorted(li, reverse=True)
    ['x', 'd', 'cde', 'ab']
    Reverse alphabetical order
    >>> sorted(li, key=len)
    ['x', 'd', 'ab', 'cde']
    Use the len() function as key: order by string length
    >>> sorted(li, key=len, reverse=True)
    ['cde', 'ab', 'x', 'd']
    Likewise, but in descending order
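Any function that takes one item and returns a value to compare can serve as key. The sketch below uses the built-in str.lower for case-insensitive ordering and a hypothetical helper, last_char, written just for this example.

```python
def last_char(word):
    """Hypothetical key function: compare words by their final character."""
    return word[-1]

words = ['banana', 'Cherry', 'apple']

default_order = sorted(words)                # uppercase letters sort before lowercase
ignore_case = sorted(words, key=str.lower)   # compare lowercased copies instead
by_ending = sorted(words, key=last_char)     # 'a' < 'e' < 'y'

print(default_order)   # ['Cherry', 'apple', 'banana']
print(ignore_case)     # ['apple', 'banana', 'Cherry']
print(by_ending)       # ['banana', 'apple', 'Cherry']
```

The key function only decides the order. The items in the returned list are the original strings, unchanged.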

Sorting dict keys with sorted()
    >>> sim = {'Homer':36, 'Lisa':8, 'Marge':35, 'Bart':10}
    >>> sim.keys()
    dict_keys(['Homer', 'Lisa', 'Marge', 'Bart'])
    >>> for s in sim:
            print(s, 'is', sim[s], 'years old.')

    Homer is 36 years old.
    Lisa is 8 years old.
    Marge is 35 years old.
    Bart is 10 years old.
Results are in the order of declaration (starting from Python 3.6; guaranteed by the language from Python 3.7).

Sorting dict keys with sorted()
    >>> sim = {'Homer':36, 'Lisa':8, 'Marge':35, 'Bart':10}
    >>> sim.keys()
    dict_keys(['Homer', 'Lisa', 'Marge', 'Bart'])
    >>> sorted(sim)
    ['Bart', 'Homer', 'Lisa', 'Marge']
    sorted(dict) returns a sorted list of keys.
    >>> for s in sorted(sim):
            print(s, 'is', sim[s], 'years old.')

    Bart is 10 years old.
    Homer is 36 years old.
    Lisa is 8 years old.
    Marge is 35 years old.
The dictionary now prints out in alphabetically sorted key order.
You are temporarily sorting the dictionary keys. The dictionary itself is not getting sorted!

Sorting dict keys by VALUE
    >>> sim = {'Homer':36, 'Lisa':8, 'Marge':35, 'Bart':10}
    >>> sim.keys()
    dict_keys(['Homer', 'Lisa', 'Marge', 'Bart'])
    >>> sorted(sim, key=sim.get)
    ['Lisa', 'Bart', 'Marge', 'Homer']
    Returns a list of keys, sorted in the order of their values.
    >>> for s in sorted(sim, key=sim.get):
            print(s, 'is', sim[s], 'years old.')

    Lisa is 8 years old.
    Bart is 10 years old.
    Marge is 35 years old.
    Homer is 36 years old.
dict[x] and dict.get(x) do the same thing for a key that exists: they retrieve the dict value.
The dictionary now prints out in the sorted order of its values.
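Two follow-ups, as an illustrative sketch: combining key=sim.get with reverse=True for oldest-first order, and the one place where sim[x] and sim.get(x) differ, namely a key that is missing.

```python
sim = {'Homer': 36, 'Lisa': 8, 'Marge': 35, 'Bart': 10}

oldest_first = sorted(sim, key=sim.get, reverse=True)
print(oldest_first)           # ['Homer', 'Marge', 'Bart', 'Lisa']

# For a missing key, .get() quietly returns None ...
missing = sim.get('Maggie')
print(missing)                # None

# ... while square brackets raise KeyError.
try:
    sim['Maggie']
except KeyError as err:
    print('KeyError:', err)

# The dictionary itself keeps its original order.
print(list(sim))              # ['Homer', 'Lisa', 'Marge', 'Bart']
```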

Try it out (2 minutes)
    >>> sim = {'Homer':36, 'Lisa':8, 'Marge':35, 'Bart':10, 'Maggie':1}
    >>> sim.keys()
    dict_keys(['Homer', 'Lisa', 'Marge', 'Bart', 'Maggie'])
    >>> sorted(sim, key=sim.get)
    ['Maggie', 'Lisa', 'Bart', 'Marge', 'Homer']
    >>> for s in sorted(sim, key=sim.get):
            print(s, 'is', sim[s], 'years old.')

    Maggie is 1 years old.
    Lisa is 8 years old.
    Bart is 10 years old.
    Marge is 35 years old.
    Homer is 36 years old.
Also try:
    key=len
    reverse=True

Population by state (3 minutes)
    Find the popul dictionary in text-samples.txt
    Copy & paste it into IDLE shell
    Print out the most populous states
    Print out the least populous states
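One possible approach, sketched on a stand-in dictionary. popul_sample, its place names, and its numbers are all made up for illustration; the real popul dictionary lives in text-samples.txt and is not reproduced here.

```python
# Hypothetical stand-in for popul: invented names and invented counts.
popul_sample = {'Aland': 120, 'Bevia': 45, 'Corin': 300, 'Dunmore': 8, 'Elsin': 77}

# Keys sorted by value, largest first.
by_population = sorted(popul_sample, key=popul_sample.get, reverse=True)

most = by_population[:2]      # first two keys: the most populous
least = by_population[-2:]    # last two keys: the least populous

for state in most:
    print(state, popul_sample[state])
for state in least:
    print(state, popul_sample[state])
```

Slicing the sorted key list is what picks out the top and bottom entries. Changing the 2 gives a longer or shorter list.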

Wrap-up
Next class:
    Continue discussion on spell checkers
    File IO: how to read a file
Homework 2
    Review 2 spell checkers of your choice
    textstats.py
    Recitation students: take a look before tomorrow's meeting