processing data from the web


MCS 275 Lecture 39
Programming Tools and File Management
Jan Verschelde, 17 April 2017

1 CTA Tables
    general transit feed specification
    stop identification and name
    finding trips for a given stop

2 CTA Tables in MySQL
    files in GTFS feed are tables in database
    filling a table with a Python script
    storing the connections

GTFS of our CTA

We can download the schedules of the CTA:

    http://www.transitchicago.com/developers/gtfs.aspx

GTFS = General Transit Feed Specification is an open format for packaging scheduled service data.

A GTFS feed is a series of text files with data on lines separated by commas (csv format).

Each file is a table in a relational database.

some tables

stops.txt: stop locations for bus or train
routes.txt: route list with unique identifiers
trips.txt: information about each trip by a vehicle
stop_times.txt: scheduled arrival and departure times for each stop on each trip

finding a stop name

    $ python3 ctastopname.py
    opening CTA/stops.txt ...
    give a stop id : 3021
    skipping line 0
    3021 has name "California & Augusta"

The script looks for the line

    3021,3021,"California & Augusta",41.89939053, \
    -87.69688045,0,,1

In a Terminal window, we can type

    $ cat stops.txt | grep ",3021,"
    3021,3021,"California & Augusta",41.89939053, \
    -87.69688045,0,,1
    $

ctastopname.py

    FILENAME = 'CTA/stops.txt'
    print('opening', FILENAME, '...')
    DATAFILE = open(FILENAME, 'r')
    STOPID = int(input('give a stop id : '))
    COUNT = 0
    STOPNAME = None
    while True:
        LINE = DATAFILE.readline()
        if LINE == '':
            break
        L = LINE.split(',')
        try:
            if int(L[0]) == STOPID:
                STOPNAME = L[2]
                break
        except ValueError:  # the header line does not start with a number
            print('skipping line', COUNT)
        COUNT = COUNT + 1
    print(STOPID, 'has name', STOPNAME)
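
The same lookup can be sketched with the csv module of the standard library, which reads the field names from the first line and removes the double quotes around the stop name. This is an alternative, not the script of the lecture: the function name find_stop_name is hypothetical, and the sample text holds only the header line and the first data line of stops.txt, so the example runs without the downloaded feed.

```python
import csv
import io

def find_stop_name(lines, stop_id):
    """
    Returns the stop_name for stop_id, or None if there is no such stop.
    The argument lines is any iterable of text lines,
    for example the opened file CTA/stops.txt.
    """
    for row in csv.DictReader(lines):   # the first line gives the keys
        if row['stop_id'] == str(stop_id):
            return row['stop_name']
    return None

# the header line and the first data line of stops.txt
SAMPLE = ('stop_id,stop_code,stop_name,stop_desc,stop_lat,stop_lon,'
          'location_type,parent_station,wheelchair_boarding\n'
          '1,1,"Jackson & Austin Terminal",'
          '"Jackson & Austin Terminal, Northeastbound, Bus Terminal",'
          '41.87632184,-87.77410482,0,,1\n')

print(1, 'has name', find_stop_name(io.StringIO(SAMPLE), 1))
```

With the real feed, the call would be find_stop_name(open('CTA/stops.txt', 'r'), 3021).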

finding head signs

Given an identification of a stop, we look for all CTA vehicles that make a stop there.

    $ python3 ctastoptimes.py
    opening CTA/stop_times.txt ...
    give a stop id : 3021
    skipping line 0
    adding "63rd Pl/Kedzie"
    adding "Jackson"
    ['"63rd Pl/Kedzie"', '"Jackson"']

We scan the lines in stop_times.txt for where the given stop identification occurs.

ctastoptimes.py

    FILENAME = 'CTA/stop_times.txt'
    print('opening', FILENAME, '...')
    DATAFILE = open(FILENAME, 'r')
    STOPID = int(input('give a stop id : '))
    COUNT = 0
    TIMES = []
    while True:
        LINE = DATAFILE.readline()
        if LINE == '':
            break
        L = LINE.split(',')
        try:
            if int(L[3]) == STOPID:
                if L[5] not in TIMES:
                    print('adding', L[5])
                    TIMES.append(L[5])
        except (ValueError, IndexError):  # header or incomplete line
            print('skipping line', COUNT)
        COUNT = COUNT + 1
    print(TIMES)

Each file of the GTFS feed is a table in a relational database. We call our database CTA and will add tables reading the information from stops.txt.

fields in stops.txt

The first line in stops.txt lists:

1 stop_id: type INT
2 stop_code: type INT
3 stop_name: type CHAR(80)
4 stop_desc: type VARCHAR(80)
5 stop_lat: type FLOAT
6 stop_lon: type FLOAT
7 location_type: type INT
8 parent_station: type INT
9 wheelchair_boarding: type SMALLINT

creating database and table

Note: depending on the setup of mysql, we may have to execute mysql as superuser (use sudo on Mac OS X).

To make a database CTA, we run mysqladmin:

    $ mysqladmin create CTA

Then we start mysql, to create a table in the database:

    mysql> use CTA;
    Database changed
    mysql> create table stops
        -> (id INT, code INT, name CHAR(80),
        -> ndesc VARCHAR(128),
        -> lat FLOAT, lon FLOAT,
        -> tp INT, ps INT, wb SMALLINT);
    Query OK, 0 rows affected (0.01 sec)

explain

    mysql> explain stops;
    +-------+--------------+------+-----+---------+-------+
    | Field | Type         | Null | Key | Default | Extra |
    +-------+--------------+------+-----+---------+-------+
    | id    | int(11)      | YES  |     | NULL    |       |
    | code  | int(11)      | YES  |     | NULL    |       |
    | name  | char(80)     | YES  |     | NULL    |       |
    | ndesc | varchar(128) | YES  |     | NULL    |       |
    | lat   | float        | YES  |     | NULL    |       |
    | lon   | float        | YES  |     | NULL    |       |
    | tp    | int(11)      | YES  |     | NULL    |       |
    | ps    | int(11)      | YES  |     | NULL    |       |
    | wb    | smallint(6)  | YES  |     | NULL    |       |
    +-------+--------------+------+-----+---------+-------+
    9 rows in set (0.00 sec)

manual insertion

The first data line in stops.txt contains

    1,1,"Jackson & Austin Terminal",
    "Jackson & Austin Terminal, Northeastbound, Bus Terminal",
    41.87632184,-87.77410482,0,,1

    mysql> insert into stops values
        -> (1,1,"Jackson & Austin Terminal",
        -> "Jackson & Austin Terminal, Northeastbound, Bus Term
        -> 41.87632184,-87.77410482,0,0,1);
    Query OK, 1 row affected (0.00 sec)

    mysql> select name from stops where id = 1;
    +---------------------------+
    | name                      |
    +---------------------------+
    | Jackson & Austin Terminal |
    +---------------------------+
    1 row in set (0.00 sec)

deleting rows

To delete a row, given its id:

    mysql> delete from stops where id = 1;
    Query OK, 1 row affected (0.65 sec)

    mysql> select * from stops;
    Empty set (0.01 sec)

If the where clause is omitted, then all rows in the table are deleted.

filling a table

Typing 12,165 is rather tedious...

After filling the table stops of the database we query the table for a name:

    mysql> select name from stops where id = 3021;
    +----------------------+
    | name                 |
    +----------------------+
    | California & Augusta |
    +----------------------+
    1 row in set (0.00 sec)

Our data is on file:

    FILENAME = 'CTA/stops.txt'

fillstops() in dbctafillstops.py

    import pymysql

    def fillstops(printonly=True):
        """
        Opens the file with name FILENAME,
        reads every line and inserts the data
        into the table stops.
        """
        if printonly:
            crs = None
        else:
            cta = pymysql.connect(db='CTA')
            crs = cta.cursor()
        print('opening', FILENAME, '...')

fillstops() continued

        datafile = open(FILENAME, 'r')
        line = datafile.readline()  # skip the first line
        while True:
            line = datafile.readline()
            if line == '':
                break
            insert_data(crs, line, printonly)
        if not printonly:
            cta.commit()
            crs.close()
        datafile.close()

For the changes to take effect, we must do a commit().

With rollback(), we can cancel the current transaction, provided the database and tables support transactions.
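
The effect of commit() and rollback() can be tried without a MySQL server. The sketch below uses the sqlite3 module of the standard library as a stand-in for pymysql, an assumption made only to keep the example self-contained; both modules follow the Python DB-API, so the methods on the connection have the same names. The table with two columns is a hypothetical cut-down version of stops.

```python
import sqlite3

def count_rows(crs):
    """Returns the number of rows in the table stops."""
    crs.execute('SELECT COUNT(*) FROM stops')
    return crs.fetchone()[0]

con = sqlite3.connect(':memory:')   # stand-in for pymysql.connect(db='CTA')
crs = con.cursor()
crs.execute('CREATE TABLE stops (id INT, name CHAR(80))')
con.commit()

crs.execute("INSERT INTO stops VALUES (1, 'Jackson & Austin Terminal')")
con.commit()                        # the first row is now permanent
AFTER_COMMIT = count_rows(crs)

crs.execute("INSERT INTO stops VALUES (3021, 'California & Augusta')")
con.rollback()                      # cancels the uncommitted insert
AFTER_ROLLBACK = count_rows(crs)

print(AFTER_COMMIT, AFTER_ROLLBACK)
con.close()
```

Both counts equal one: the second insert was never committed, so the rollback removes it.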

extracting data

    "Jackson & Austin Terminal, Northeastbound, Bus Terminal",

Commas in the string!

    def extract(line):
        """
        Returns a list of 9 elements extracted
        from the string line.
        Missing data are replaced by 0.
        """
        result = []
        strd = line.split('\"')  # extract strings first
        (name, desc) = (strd[1], strd[3])
        data = strd[0].split(',')
        result.append('0' if data[0] == '' else data[0])
        result.append('0' if data[1] == '' else data[1])
        result.append(name)
        result.append(desc)

function extract(line) continued...

Extracting latitude, longitude, and the last 3 integers:

        data = strd[4].split('\n')  # remove newline
        data = data[0].split(',')
        for k in range(1, len(data)):
            result.append('0' if data[k] == '' else data[k])
        while len(result) < 9:
            result.append('0')
        return result

Missing data are replaced by 0. The list on return will always have nine items.
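
Splitting on the double quote works as long as both the name and the description are quoted. As a sketch of an alternative, the csv module of the standard library parses quoted fields with commas inside them. The name extract_csv is hypothetical; the sample line is the first data line of stops.txt.

```python
import csv

def extract_csv(line):
    """
    Returns a list of 9 strings extracted from the string line,
    one line of stops.txt. Missing data are replaced by '0'.
    """
    fields = next(csv.reader([line]))   # handles quotes and the newline
    result = ['0' if item == '' else item for item in fields]
    while len(result) < 9:
        result.append('0')
    return result

LINE = ('1,1,"Jackson & Austin Terminal",'
        '"Jackson & Austin Terminal, Northeastbound, Bus Terminal",'
        '41.87632184,-87.77410482,0,,1\n')
DATA = extract_csv(LINE)
print(len(DATA))
print(DATA[3])
```

The description keeps its commas as one item of the list, and the empty parent_station field becomes '0'.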

inserting data

    def insert_data(cur, line, printonly=True):
        """
        Inserts the data in the string line,
        using the cursor cur, if printonly is False.
        """
        data = extract(line)
        dbc = 'INSERT INTO stops VALUES ('
        dbc += data[0] + ',' + data[1] + ','
        dbc += '\"' + data[2] + '\",'  # name is a string
        dbc += '\"' + data[3] + '\",'  # description
        for k in range(4, 8):
            dbc += data[k] + ','
        dbc += data[8] + ')'
        print(repr(dbc))  # print raw string
        if not printonly:
            cur.execute(dbc)
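
A command built by concatenation breaks when a name itself contains a double quote. A common alternative is to pass the values separately and let the driver do the quoting. The sketch below only builds the command and the tuple of values, so it runs without a database. The name insert_command is hypothetical, and the %s placeholders assume the parameter style of pymysql.

```python
def insert_command(data):
    """
    Returns the SQL command with nine placeholders and the tuple
    of the nine values for one row in the table stops.
    The list data is as returned by extract().
    """
    placeholders = ', '.join(['%s'] * 9)
    cmd = 'INSERT INTO stops VALUES (' + placeholders + ')'
    return cmd, tuple(data[:9])

ROW = ['1', '1', 'Jackson & Austin Terminal',
       'Jackson & Austin Terminal, Northeastbound, Bus Terminal',
       '41.87632184', '-87.77410482', '0', '0', '1']
CMD, VALUES = insert_command(ROW)
print(CMD)
# with a pymysql cursor cur, the row would then be inserted with
#     cur.execute(CMD, VALUES)
```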

querying the table

    mysql> select id from stops
        -> where name = "California & Augusta";
    +-------+
    | id    |
    +-------+
    |  3021 |
    | 17154 |
    +-------+
    2 rows in set (0.00 sec)

    mysql> select name from stops where id = 17154;
    +----------------------+
    | name                 |
    +----------------------+
    | California & Augusta |
    +----------------------+
    1 row in set (0.01 sec)

querying with Python

    $ python3 dbctastopquery.py
    give a stop id : 3021
    3021 has name California & Augusta
    $

    $ python3 dbctastopquery.py
    give a stop id : 0
    0 has name None
    $

the main program

    import pymysql

    def main():
        """
        Connects to the database,
        prompts the user for a stop id
        and then queries the stops table.
        """
        cta = pymysql.connect(db='CTA')
        crs = cta.cursor()
        stop = int(input('give a stop id : '))
        name = get_stop_name(crs, stop)
        print(stop, 'has name', name)
        cta.close()

executing the query

    def get_stop_name(crs, stopid):
        """
        Given a cursor crs to the CTA database,
        queries the stops table for the stop id.
        Returns None if the stop id has not been
        found, otherwise returns the stop name.
        """
        sel = 'SELECT name FROM stops'
        whe = ' WHERE id = %d' % stopid
        query = sel + whe
        returned = crs.execute(query)
        if returned == 0:
            return None
        else:
            tpl = crs.fetchone()
            return tpl[0]
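
The logic of the query can be checked without a running MySQL server. The sketch below is a variant of get_stop_name() that relies on fetchone() returning None when no row matches, instead of on the value returned by execute(). It uses sqlite3 of the standard library as a stand-in for pymysql, so the placeholder is ? where pymysql would use %s. The table with two columns is a hypothetical cut-down version of stops.

```python
import sqlite3

def get_stop_name(crs, stopid):
    """
    Given a cursor crs, queries the stops table for the stop id.
    Returns None if the stop id has not been found,
    otherwise returns the stop name.
    """
    # sqlite3 uses ? as placeholder, pymysql uses %s instead
    crs.execute('SELECT name FROM stops WHERE id = ?', (stopid,))
    tpl = crs.fetchone()   # None when no row matches
    return None if tpl is None else tpl[0]

con = sqlite3.connect(':memory:')   # stand-in for pymysql.connect(db='CTA')
crs = con.cursor()
crs.execute('CREATE TABLE stops (id INT, name CHAR(80))')
crs.execute('INSERT INTO stops VALUES (?, ?)',
            (3021, 'California & Augusta'))
FOUND = get_stop_name(crs, 3021)
MISSING = get_stop_name(crs, 0)
print(3021, 'has name', FOUND)
print(0, 'has name', MISSING)
con.close()
```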

fields in stop_times.txt

The first line in stop_times.txt lists:

1 trip_id: type INT
2 arrival_time: type TIME
3 departure_time: type TIME
4 stop_id: type INT
5 stop_sequence: type INT
6 stop_headsign: type VARCHAR(80)
7 pickup_type: type INT
8 shape_dist_traveled: type INT

adding a new table

    mysql> create table stop_times
        -> (id BIGINT, arrival TIME, departure TIME,
        -> stop INT, seq INT, head VARCHAR(80),
        -> ptp INT, sdt INT);
    Query OK, 0 rows affected (0.02 sec)

Note the types TIME and VARCHAR.

explain

    mysql> explain stop_times;
    +-----------+-------------+------+-----+---------+-------+
    | Field     | Type        | Null | Key | Default | Extra |
    +-----------+-------------+------+-----+---------+-------+
    | id        | bigint(20)  | YES  |     | NULL    |       |
    | arrival   | time        | YES  |     | NULL    |       |
    | departure | time        | YES  |     | NULL    |       |
    | stop      | int(11)     | YES  |     | NULL    |       |
    | seq       | int(11)     | YES  |     | NULL    |       |
    | head      | varchar(80) | YES  |     | NULL    |       |
    | ptp       | int(11)     | YES  |     | NULL    |       |
    | sdt       | int(11)     | YES  |     | NULL    |       |
    +-----------+-------------+------+-----+---------+-------+
    8 rows in set (0.00 sec)

manual insertion

    mysql> insert into stop_times values (
        -> 46035893,"12:09:14","12:09:14",6531,29,
        -> "Midway Orange Line",0,18625);
    Query OK, 1 row affected (0.00 sec)

    mysql> select departure, head from stop_times;
    +-----------+--------------------+
    | departure | head               |
    +-----------+--------------------+
    | 12:09:14  | Midway Orange Line |
    +-----------+--------------------+
    1 row in set (0.00 sec)

filling the table

On Mac OS X laptop:

    $ python3 dbctafillstoptimes.py
    opening CTA/stop_times.txt ...
    dbctafillstoptimes.py:26: Warning: Out of range value for column 'id' at row 1
      c.execute(d)

Redo on a fast Linux Workstation:

    # time python3 dbctafillstoptimes.py
    opening CTA/stop_times.txt ...
    dbctafillstoptimes.py:26: Warning: Out of range value for column 'id' at row 1
      c.execute(d)

    real    5m32.433s
    user    1m11.921s
    sys     0m17.735s

about the complexity

While running dbctafillstoptimes.py, the memory consumption of Python and mysql was of the same magnitude, about 300 MB.

    mysql> select count(*) from stop_times;
    +----------+
    | count(*) |
    +----------+
    |  5455515 |
    +----------+
    1 row in set (1.46 sec)
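
More than five million rows means more than five million calls to execute(). DB-API cursors also provide executemany(), which takes a sequence of rows for one command. The sketch below inserts the rows in batches. It uses sqlite3 of the standard library as a stand-in for pymysql, so the placeholder is ? instead of %s. The name fill_in_batches, the batch size of 1000, and the generated trip ids are assumptions for illustration; whether batching shortens the running time reported above was not measured.

```python
import sqlite3
from itertools import islice

def fill_in_batches(con, rows, size=1000):
    """
    Inserts the tuples in the iterable rows into stop_times,
    size rows at a time. Returns the number of inserted rows.
    """
    crs = con.cursor()
    cmd = 'INSERT INTO stop_times VALUES (?, ?, ?, ?, ?, ?, ?, ?)'
    rows = iter(rows)
    total = 0
    while True:
        batch = list(islice(rows, size))   # the next size rows
        if not batch:
            break
        crs.executemany(cmd, batch)
        total += len(batch)
    con.commit()
    crs.close()
    return total

con = sqlite3.connect(':memory:')   # stand-in for pymysql.connect(db='CTA')
con.execute('CREATE TABLE stop_times '
            '(id BIGINT, arrival TIME, departure TIME, stop INT, '
            'seq INT, head VARCHAR(80), ptp INT, sdt INT)')
# 2500 copies of the manually inserted row, with made-up trip ids
ROWS = ((46035893 + k, '12:09:14', '12:09:14', 6531, 29,
         'Midway Orange Line', 0, 18625) for k in range(2500))
INSERTED = fill_in_batches(con, ROWS, size=1000)
COUNTED = con.execute('SELECT COUNT(*) FROM stop_times').fetchone()[0]
print(INSERTED, COUNTED)
con.close()
```

Reading the rows through a generator keeps only one batch in memory at a time.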

inserting data

    def insert_data(crs, scsv, printonly=True):
        """
        Inserts the data in the comma separated string scsv
        using the cursor crs, if printonly is False.
        """
        data = scsv.split(',')
        cmd = 'INSERT INTO stop_times VALUES ('
        cmd += ('0,' if data[0] == '' else data[0] + ',')
        cmd += '\"' + data[1] + '\"' + ','
        cmd += '\"' + data[2] + '\"' + ','
        cmd += data[3] + ',' + data[4] + ','
        cmd += data[5] + ',' + data[6] + ','
        wrk = data[7]  # must cut off the \n
        data7 = wrk[0:len(wrk)-1]
        cmd += ('0)' if data7 == '' else data7 + ')')
        print(repr(cmd))
        if not printonly:
            crs.execute(cmd)

querying stop_times

    mysql> select head from stop_times
        -> where stop = 3021 and
        -> arrival < "05:30:00";
    +----------------+
    | head           |
    +----------------+
    | 63rd Pl/Kedzie |
    | 63rd Pl/Kedzie |
    | 63rd Pl/Kedzie |
    | 63rd Pl/Kedzie |
    +----------------+
    4 rows in set (0.94 sec)

an involved query

    mysql> select name, departure, head
        -> from stops, stop_times
        -> where stops.id = 3021
        -> and stops.id = stop_times.stop
        -> and stop_times.departure < "05:30:00";
    +----------------------+-----------+----------------+
    | name                 | departure | head           |
    +----------------------+-----------+----------------+
    | California & Augusta | 04:43:49  | 63rd Pl/Kedzie |
    | California & Augusta | 05:03:49  | 63rd Pl/Kedzie |
    | California & Augusta | 05:19:49  | 63rd Pl/Kedzie |
    | California & Augusta | 05:12:49  | 63rd Pl/Kedzie |
    +----------------------+-----------+----------------+
    4 rows in set (0.57 sec)
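
The same query can be written with an explicit JOIN and run from Python. The sketch below uses sqlite3 of the standard library as a stand-in for pymysql, so the placeholder is ? instead of %s. The tables are hypothetical cut-down versions holding the four departures shown above plus one made-up later departure. The ORDER BY clause is an addition, to have the departures sorted.

```python
import sqlite3

QUERY = ('SELECT name, departure, head '
         'FROM stops JOIN stop_times ON stops.id = stop_times.stop '
         'WHERE stops.id = ? AND stop_times.departure < ? '
         'ORDER BY departure')

con = sqlite3.connect(':memory:')   # stand-in for pymysql.connect(db='CTA')
crs = con.cursor()
crs.execute('CREATE TABLE stops (id INT, name CHAR(80))')
crs.execute('CREATE TABLE stop_times '
            '(stop INT, departure TIME, head VARCHAR(80))')
crs.execute("INSERT INTO stops VALUES (3021, 'California & Augusta')")
crs.executemany('INSERT INTO stop_times VALUES (?, ?, ?)',
                [(3021, '04:43:49', '63rd Pl/Kedzie'),
                 (3021, '05:03:49', '63rd Pl/Kedzie'),
                 (3021, '05:19:49', '63rd Pl/Kedzie'),
                 (3021, '05:12:49', '63rd Pl/Kedzie'),
                 (3021, '06:00:00', '63rd Pl/Kedzie')])  # made-up row
crs.execute(QUERY, (3021, '05:30:00'))
RESULT = crs.fetchall()
for row in RESULT:
    print(row)
con.close()
```

In sqlite3 the times are stored as text, and strings in the format HH:MM:SS compare in chronological order, so only the four departures before 05:30:00 are selected.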

Exercises

1 Modify ctastopname.py so the user is prompted for a string instead of a number. The modified script prints all ids and corresponding names that have the given string as substring. Use the in operator.

2 The file stops.txt contains the latitude and longitude of each stop. Use these coordinates to plot (either with pylab, pyplot, or a Tkinter canvas) the blue line from O'Hare to Forest Park. Use a proper scaling so your plot resembles what we see on a map.

3 Write a Python script to return the name of a stop, given its id, using the table stops.

4 Design a GUI with Tkinter to query the stop name: one entry field for the stop id, another for the name of the stop, and one button in the middle to execute the query. Note that the GUI allows querying given the stop id or given the stop name.