File I/O, Benford's Law, and sets

File I/O, Benford's Law, and sets
Matt Valeriote
11 February 2019

Benford's law

Benford's law describes the (surprising) distribution of first digits of many different sets of numbers. Read about it on Wikipedia or MathWorld.

Benford's law states that in listings, tables of statistics, etc., the digit 1 tends to occur with probability ~30%, much greater than the 11.1% one would expect if all nine digits were equally likely (i.e., one digit out of 9). Benford's law can be observed, for instance, by examining tables of logarithms and noting that the first pages are much more worn and smudged than later pages (Newcomb 1881).

We'll write a Python function benford_count that tabulates the occurrence of leading digits from a set of numbers.

But where do we get our numbers from?
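The ~30% figure can be checked numerically. The sketch below assumes the usual statement of Benford's law, P(d) = log10(1 + 1/d) for a leading digit d from 1 to 9; the formula is not given on the slide, and the function name benford_probability is made up for this illustration.

```python
import math

def benford_probability(d):
    """Expected relative frequency of leading digit d (1 to 9) under Benford's law."""
    return math.log10(1 + 1 / d)

# Table of expected frequencies for the digits 1..9.
expected = {d: benford_probability(d) for d in range(1, 10)}

for d, p in expected.items():
    print(d, f"{p:.1%}")

# The nine probabilities add up to 1 (up to floating-point rounding).
print(sum(expected.values()))
```

The digit 1 comes out at about 30.1%, and the frequencies fall steadily to about 4.6% for the digit 9.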

File I/O

There are a bunch of ways to read a file.

f = open(filename, mode) opens a file and returns a file object. The mode is 'r' for reading, 'w' for writing, or 'a' for appending. Files are opened as text by default.

    f = open("test.txt", 'r')
    print(f)
    print(f.closed)
    f.close()
    print(f.closed)

To free up system resources, files should be closed when they are no longer needed.
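One common way to make sure a file gets closed is the with statement, which the slide does not show. The following is a minimal sketch. It writes its own small file into a temporary directory so that it can be run anywhere; the file contents are made up.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\n")

# The file is closed automatically when the with block ends,
# even if an exception is raised inside the block.
with open(path, "r") as f:
    contents = f.read()
    print(f.closed)   # still open inside the block

print(f.closed)       # closed once the block has ended
print(repr(contents))
```

With this form there is no need to remember the call to f.close().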

If a file is located in another directory, its path must be provided in the open statement.

    f = open('/users/matt/documents/temp/temp.txt', 'r')
    f.close()

Once a file has been opened, its entire contents can be read in as a string using the read() method on the associated file object.

    f = open('test.txt', 'r')
    contents = f.read()
    print(contents)
    print(repr(contents))
    f.close()

Note that each line of the file ends in the newline character \n.

The read method can also be used to read a specified number of characters from the file. f.read(10) will read the next 10 characters from the file. If the end of the file has been reached, this will return '', the empty string. The file object keeps track of how much has been read.

    f = open('test.txt', 'r')
    print('first 10 characters of the file:')
    print(f.read(10))
    print('next 6 characters (this includes the end of line character `\\n`):')
    print(f.read(6))
    print("this is the rest of the file:")
    contents = f.read()
    print(contents)
    f.close()

The lines of the file can be stored in a list of strings using the readlines() method.

    f = open('test.txt')
    lines = f.readlines()
    # print the list of lines.
    # Note that each line ends with a newline character \n.
    print(lines)
    print(lines[2])
    f.close()

A common way to process a file is with a for loop to read its lines:

    f = open("test.txt")
    for line in f:
        print(line)
        upper_line = line.upper()
        print(upper_line)
    f.close()

More I/O details

To get rid of pesky newlines and leading and trailing whitespace, use the strip() method:

    f = open('test.txt')
    for line in f:
        print(repr(line))
        print(line.strip())
    f.close()

To break a string (or line) into words, use the split() method for strings:

    f = open("test.txt")
    line = f.readline()   ## read one line
    words = line.strip().split(" ")
    print(words)
    f.close()
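The difference between split(" ") and split() with no argument matters when words are separated by more than one space or by tabs. The sketch below uses a made-up sample line to show both.

```python
# A made-up line with extra spaces, a tab, and a trailing newline.
line = "  Sometown   Province Country\t12345 \n"

# strip() removes leading and trailing whitespace, including the newline.
stripped = line.strip()

# split(" ") splits at every single space, so a run of spaces
# produces empty strings, and the tab is not treated as a separator.
by_space = stripped.split(" ")

# split() with no argument splits on any run of whitespace
# (spaces, tabs, newlines) and never produces empty strings.
by_whitespace = line.split()

print(by_space)
print(by_whitespace)
print(by_whitespace[-1])   # the last word on the line
```

When the goal is to pick out the last word of each line, split() with no argument is usually the safer choice.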

And even more

import os in order to find out the working directory (os.getcwd()) or to set the directory (os.chdir(newpath)). Use a full path, or use (e.g.) .. to go up one level.

Opening a URL from the web: import urllib.request as ur; ur.urlopen(url)

    # open up a webpage using the urllib package
    import urllib.request as ur
    # open the 1mp3 webpage
    web_page = ur.urlopen('https://ms.mcmaster.ca/~matt/1mp3.html')
    # each line of the response is a bytes object, not a str
    lines = web_page.readlines()
    print(lines[25])
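As a small illustration of os.getcwd() and os.chdir(), the sketch below moves into a temporary directory, goes up one level with "..", and then returns to the starting directory. The temporary directory is only there to make the example runnable anywhere.

```python
import os
import tempfile

start = os.getcwd()            # remember where we started
print("working directory:", start)

new_dir = tempfile.mkdtemp()   # a throwaway directory to move into
os.chdir(new_dir)
print("now in:", os.getcwd())

os.chdir("..")                 # ".." means the parent directory
print("one level up:", os.getcwd())

os.chdir(start)                # go back to where we started
print("back in:", os.getcwd())
```

Relative filenames passed to open() are looked up relative to the current working directory, which is why changing it affects which file is opened.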

Benford's Law Project

Create a Python function benford_count that tabulates the occurrence of leading digits from a set of numbers. The set of numbers should be read in from a file. The filename should be an argument to the function.

The function should return a tuple digits_count of length 10 (one entry for each digit) such that digits_count[i] is equal to the number of occurrences of the digit i as the leading digit of some number from the file, for i = 0, 1, ..., 9. So digits_count[2] is the number of times the digit 2 occurred as the leading digit of some number from the file.

Note that the digit 0 doesn't occur as the leading digit of a number, except for the number 0 itself, so we really don't need to count the number of 0's that occur. A better way to handle this is with a Python dictionary object.

File Format

In order to properly process a data file, we need to make some assumptions about its format. We will assume that each line contains a number of words, for example, the name of a town, followed by its province/state and country, etc., and that the last word will be a number that represents the size of the quantity being counted in the file.

Steps:
1. Initialize a digits_count list of length 10.
2. Open the file.
3. For each line in the file,
   3.1. retrieve the last word from the line.
   3.2. If it is a string of digits that doesn't start with 0,
        3.2.1. get the leading digit and update digits_count.
4. Return tuple(digits_count).
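The steps above can be sketched as follows. This is one possible implementation under the stated file-format assumption, not necessarily the version written in lecture. The sample data and the file name towns.txt are made up, and the file is written to a temporary directory so the example is self-contained.

```python
import os
import tempfile

def benford_count(filename):
    """Return a tuple of 10 counts; entry i counts the lines whose
    last word is a number with leading digit i."""
    digits_count = [0] * 10                      # step 1
    with open(filename) as f:                    # step 2
        for line in f:                           # step 3
            words = line.split()
            if not words:                        # skip blank lines
                continue
            last = words[-1]                     # step 3.1
            # step 3.2: all digits, and the first one is not 0
            if last.isdigit() and last[0] in "123456789":
                digits_count[int(last[0])] += 1  # step 3.2.1
    return tuple(digits_count)                   # step 4

# Made-up sample data: name, region, and a number as the last word.
sample = (
    "Alpha Province 1200\n"
    "Beta Province 180\n"
    "Gamma State 2500\n"
    "Delta State n/a\n"
    "\n"
    "Epsilon State 007\n"
)
path = os.path.join(tempfile.mkdtemp(), "towns.txt")
with open(path, "w") as f:
    f.write(sample)

counts = benford_count(path)
print(counts)
```

Lines whose last word is not a string of digits, or starts with 0, are skipped, so only three of the sample lines are counted.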

Other considerations

- A source for some interesting data sets to use is: http://www.testingbenfordslaw.com/.
- Modify the code so that it will process a list of files.
- Improve the code so that it can correctly deal with a general number format of the form 123,456,789.123456...
- Enable the program to directly access suitable files from the internet.
- The file that we created contains several function definitions. This file could be imported by other files in order to make use of the functions. To prevent the code at the bottom of the file from being executed when the file is imported, first test whether the interpreter's name for the file, __name__, is "__main__". This will be the case when the file itself is being executed and not imported.
- Use a Python dictionary instead of a tuple or list to keep track of the digit counts.
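Three of these suggestions (the general number format, the dictionary, and the __main__ test) can be sketched together. The names leading_digit and benford_count_dict are hypothetical, and the sample lines are made up. The function takes any iterable of lines, so an open file object can be passed in as well.

```python
def leading_digit(word):
    """Return the first nonzero digit of a number such as '123,456.78',
    or None if the word is not a number (or is zero)."""
    cleaned = word.replace(",", "")    # drop thousands separators
    try:
        float(cleaned)                 # only used to check that it is a number
    except ValueError:
        return None
    for ch in cleaned:
        if ch in "123456789":
            return int(ch)
    return None

def benford_count_dict(lines):
    """Count the leading digits 1..9 of the last word on each line."""
    counts = {d: 0 for d in range(1, 10)}
    for line in lines:
        words = line.split()
        if words:
            d = leading_digit(words[-1])
            if d is not None:
                counts[d] += 1
    return counts

# This part runs only when the file is executed directly,
# not when it is imported by another file.
if __name__ == "__main__":
    sample = ["Alpha 1,200", "Beta 0.35", "Gamma n/a", "Delta 19"]
    print(benford_count_dict(sample))
```

Using a dictionary with keys 1 to 9 avoids carrying around an unused entry for the digit 0.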


Sets

A set is an unordered collection of distinct immutable objects. It is similar in some respects to a list, but the elements of a set are not ordered. Sets are mutable objects and so can be altered. Sets are created using braces { and }. Duplicate items in a set are ignored.

    vowels = {'a', 'e', 'i', 'o', 'u'}
    print(type(vowels))
    print(vowels)
    vowels = {'a', 'e', 'i', 'o', 'u', 'e', 'a'}
    print(vowels)
    print({1, 2, 3, 'a'} == {'a', 1, 1, 3, 2, 'a'})

    ## <class 'set'>
    ## {'u', 'e', 'a', 'i', 'o'}
    ## {'u', 'e', 'a', 'i', 'o'}
    ## True

Sets can also be created with the function set. set() produces the empty set. set(list) and set(string) produce the set of items/characters from the list/string. Sets can also be produced from a range. Sets also support iteration, or looping, but the order is arbitrary.

    print(set())
    print(set([1, 2, 0, -1, 3, 1, 1, 2]))
    print(set('hello world!'))
    print(set(range(0, 10, 2)))

    ## set()
    ## {0, 1, 2, 3, -1}
    ## {' ', 'd', 'w', 'h', 'r', '!', 'e', 'o', 'l'}
    ## {0, 2, 4, 6, 8}

A number of operations and methods are available to set objects.

    small = {0, 1, 2, 3}
    mid = {3, 4, 5, 6, 7}
    big = {7, 8, 9}
    big.add(10)
    small.remove(0)
    print(small, big)
    print(small.intersection(mid))
    print(small.union(big))

    ## {1, 2, 3} {8, 9, 10, 7}
    ## {3}
    ## {1, 2, 3, 7, 8, 9, 10}
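The same operations are also available as operators, and a few related methods are worth knowing. The following sketch goes slightly beyond the slide: it shows the operator forms, and the difference between remove() and discard() when the item is not in the set.

```python
small = {1, 2, 3}
mid = {3, 4, 5, 6, 7}

common = small & mid      # intersection, same as small.intersection(mid)
either = small | mid      # union, same as small.union(mid)
only_mid = mid - small    # difference: in mid but not in small
exactly_one = small ^ mid # symmetric difference: in exactly one of the two

print(common, either, only_mid, exactly_one)

# discard() silently does nothing if the item is absent...
small.discard(42)

# ...while remove() raises a KeyError.
try:
    small.remove(42)
except KeyError:
    print("42 is not in the set")

# Membership tests use the in operator.
print(3 in small, 42 in small)
```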

Sets can also be compared using methods or set operators.

    print({0, 1}.issubset({0, 1, 2, 3}))
    print({0, 1} <= {0, 1, 2, 3})

    ## True
    ## True

Recall the code used (inside a function) for testing a hexadecimal string:

    hex_char = "0123456789abcdef"
    for char in word:
        if not char in hex_char:
            return False
    return True

This can be replaced by:

    return set(word) <= set('0123456789abcdef')
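To see that the two versions agree, here they are as complete functions. The function names is_hex_loop and is_hex_set are made up for this sketch; the original function's name is not shown on the slide.

```python
HEX_CHARS = "0123456789abcdef"

def is_hex_loop(word):
    """Check each character in turn."""
    for char in word:
        if char not in HEX_CHARS:
            return False
    return True

def is_hex_set(word):
    """Check that the set of characters is a subset of the hex characters."""
    return set(word) <= set(HEX_CHARS)

for word in ["ff09a", "hello", "", "ABC"]:
    print(repr(word), is_hex_loop(word), is_hex_set(word))
```

Both versions accept the empty string and reject uppercase letters, since only lowercase a to f are listed.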