processing data from the web 1 CTA Tables general transit feed specification stop identification and name finding trips for a given stop 2 CTA Tables in MySQL files in GTFS feed are tables in database filling a table with a Python script storing the connections MCS 275 Lecture 39 Programming Tools and File Management Jan Verschelde, 17 April 2017 Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 1 / 39
processing data from the web 1 CTA Tables general transit feed specification stop identification and name finding trips for a given stop 2 CTA Tables in MySQL files in GTFS feed are tables in database filling a table with a Python script storing the connections Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 2 / 39
GTFS of our CTA We can download the schedules of the CTA: http://www.transitchicago.com/developers/gtfs.aspx GTFS = General Transit Feed Specification is an open format for packaging scheduled service data. AGTFSfeedisaseriesoftextfileswithdataonlinesseparatedby commas (csv format). Each file is a table in a relational database. Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 3 / 39
some tables stops.txt: stoplocationsforbusortrain routes.txt: routelistwithuniqueidentifiers trips.txt: informationabouteachtripbyavehicle stop_times.txt: scheduledarrivalanddeparturetimesfor each stop on each trip. Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 4 / 39
processing data from the web 1 CTA Tables general transit feed specification stop identification and name finding trips for a given stop 2 CTA Tables in MySQL files in GTFS feed are tables in database filling a table with a Python script storing the connections Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 5 / 39
finding a stop name $ python3 ctastopname.py opening CTA/stops.txt... give a stop id : 3021 skipping line 0 3021 has name "California & Augusta" The script looks for the line 3021,3021,"California & Augusta",41.89939053, \ -87.69688045,0,,1 In a Terminal window, we can type $ cat stops.txt grep ",3021," 3021,3021,"California & Augusta",41.89939053, \ -87.69688045,0,,1 $ Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 6 / 39
ctastopname.py FILENAME = CTA/stops.txt print( opening, FILENAME,... ) DATAFILE = open(filename, r ) STOPID = int(input( give a stop id : )) COUNT = 0 STOPNAME = None while True: LINE = DATAFILE.readline() if LINE == : break L = LINE.split(, ) try: if int(l[0]) == STOPID: STOPNAME = L[2] break except: print( skipping line, COUNT) COUNT = COUNT + 1 print(stopid, has name, STOPNAME) Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 7 / 39
processing data from the web 1 CTA Tables general transit feed specification stop identification and name finding trips for a given stop 2 CTA Tables in MySQL files in GTFS feed are tables in database filling a table with a Python script storing the connections Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 8 / 39
finding head signs Given an identification of a stop, we look for all CTA vehicles that make a stop there. $ python3 ctastoptimes.py opening CTA/stop_times.txt... give a stop id : 3021 skipping line 0 adding "63rd Pl/Kedzie" adding "Jackson" [ "63rd Pl/Kedzie", "Jackson" ] We scan the lines in stop_times.txt for where the given stop identification occurs. Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 9 / 39
ctastoptimes.py FILENAME = CTA/stop_times.txt print( opening, FILENAME,... ) DATAFILE = open(filename, r ) STOPID = int(input( give a stop id : )) COUNT = 0 TIMES = [] while True: LINE = DATAFILE.readline() if LINE == : break L = LINE.split(, ) try: if int(l[3]) == STOPID: if L[5] not in TIMES: print( adding, L[5]) TIMES.append(L[5]) except: print( skipping line, COUNT) COUNT = COUNT + 1 print(times) Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 10 / 39
processing data from the web 1 CTA Tables general transit feed specification stop identification and name finding trips for a given stop 2 CTA Tables in MySQL files in GTFS feed are tables in database filling a table with a Python script storing the connections Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 11 / 39
GTFS of our CTA We can download the schedules of the CTA: http://www.transitchicago.com/developers/gtfs.aspx GTFS = General Transit Feed Specification is an open format for packaging scheduled service data. AGTFSfeedisaseriesoftextfileswithdataonlinesseparatedby commas (csv format). Each file is a table in a relational database. We call our database CTA and will add tables reading the information from stops.txt. Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 12 / 39
fields in stops.txt The first line in stops.txt lists: 1 stop_id: type INT 2 stop_code: type INT 3 stop_name: type CHAR(80) 4 stop_desc: type VARCHAR(80) 5 stop_lat: type FLOAT 6 stop_lon: type FLOAT 7 location_type: type INT 8 parent_station: type INT 9 wheelchair_boarding: type SMALLINT Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 13 / 39
creating database and table Note: depending on the setup of mysql,wemayhave to execute mysql as superuser (use sudo on Mac OS X). To make a database CTA, we run mysqladmin: $ mysqladmin create CTA Then we start mysql, tocreateatableinthedatabase: mysql> use CTA; Database changed mysql> create table stops -> (id INT, code INT, name CHAR(80), -> ndesc VARCHAR(128), -> lat FLOAT, lon FLOAT, -> tp INT, ps INT, wb SMALLINT); Query OK, 0 rows affected (0.01 sec) Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 14 / 39
explain mysql> explain stops; +-------+--------------+------+-----+---------+-------+ Field Type Null Key Default Extra +-------+--------------+------+-----+---------+-------+ id int(11) YES NULL code int(11) YES NULL name char(80) YES NULL ndesc varchar(128) YES NULL lat float YES NULL lon float YES NULL tp int(11) YES NULL ps int(11) YES NULL wb smallint(6) YES NULL +-------+--------------+------+-----+---------+-------+ 9 rows in set (0.00 sec) Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 15 / 39
manual insertion The first data line in stops.txt contains 1,1,"Jackson & Austin Terminal", "Jackson & Austin Terminal, Northeastbound, Bus Terminal", 41.87632184,-87.77410482,0,,1 mysql> insert into stops values -> (1,1,"Jackson & Austin Terminal", -> "Jackson & Austin Terminal, Northeastbound, Bus Term -> 41.87632184,-87.77410482,0,0,1); Query OK, 1 row affected (0.00 sec) mysql> select name from stops where id = 1; +---------------------------+ name +---------------------------+ Jackson & Austin Terminal +---------------------------+ 1 row in set (0.00 sec) Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 16 / 39
deleting rows To delete a row, given its id: mysql> delete from stops where id = 1; Query OK, 1 row affected (0.65 sec) mysql> select * from stops; Empty set (0.01 sec) If the where clause is omitted, then all rows in the table are deleted. Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 17 / 39
processing data from the web 1 CTA Tables general transit feed specification stop identification and name finding trips for a given stop 2 CTA Tables in MySQL files in GTFS feed are tables in database filling a table with a Python script storing the connections Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 18 / 39
filling a table Typing 12,165 is rather tedious... After filling the table stops of the database we query the table for a name: mysql> select name from stops where id = 3021; +----------------------+ name +----------------------+ California & Augusta +----------------------+ 1 row in set (0.00 sec) Our data is on file: FILENAME = CTA/stops.txt Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 19 / 39
fillstops() in dbctafillstops.py import pymysql def fillstops(printonly=true): """ Opens the file with name FILENAME, reads every line and inserts the data into the table stops. """ if printonly: crs = None else: cta = pymysql.connect(db= CTA ) crs = cta.cursor() print( opening, FILENAME,... ) Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 20 / 39
fillstops() continued datafile = open(filename, r ) line = datafile.readline() # skip the first line while True: line = datafile.readline() if(line == ): break insert_data(crs, line, printonly) if not printonly: cta.commit() crs.close() datafile.close() For the changes to take effect, we must do a commit(). With rollback(),wecancancelthecurrenttransaction, provided the database and tables support transactions. Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 21 / 39
extracting data "Jackson & Austin Terminal, Northeastbound, Bus Terminal", Commas in the string! def extract(line): """ Returns a list of 9 elements extracted from the string line. Missing data are replaced by 0. """ result = [] strd = line.split( \" ) # extract strings first (name, desc) = (strd[1], strd[3]) data = strd[0].split(, ) result.append( 0 if data[0] == else data[0]) result.append( 0 if data[1] == else data[1]) result.append(name) result.append(desc) Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 22 / 39
function extract(line) continued... Extracting latitute, longitude, and last 3 integers: data = strd[4].split( \n ) # remove newline data = data[0].split(, ) for k in range(1, len(data)): result.append( 0 if data[k] == else data[k]) while len(result) < 9: result.append( 0 ) return result Missing data are replaced by 0. The list on return will always have nine items. Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 23 / 39
inserting data def insert_data(cur, line, printonly=true): """ Inserts the data in the string line, using the cursor c, if printonly is False. """ data = extract(line) dbc = INSERT INTO stops VALUES ( dbc += data[0] +, + data[1] +, dbc += \" + data[2] + \", # name is a string dbc += \" + data[3] + \", # description for k in range(4, 8): dbc += data[k] +, dbc += data[8] + ) print(repr(dbc)) # print raw string if not printonly: cur.execute(dbc) Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 24 / 39
querying the table mysql> select id from stops -> where name = "California & Augusta"; +-------+ id +-------+ 3021 17154 +-------+ 2 rows in set (0.00 sec) mysql> select name from stops where id = 17154; +----------------------+ name +----------------------+ California & Augusta +----------------------+ 1 row in set (0.01 sec) Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 25 / 39
querying with Python $ python3 dbctastopquery.py give a stop id : 3021 3021 has name California & Augusta $ $ python3 dbctastopquery.py give a stop id : 0 0 has name None $ Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 26 / 39
the main program import pymysql def main(): """ Connects to the database, prompts the user for a stop id and the queries the stops table. """ cta = pymysql.connect(db= CTA ) crs = cta.cursor() stop = int(input( give a stop id : )) name = get_stop_name(crs, stop) print(stop, has name, name) cta.close() Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 27 / 39
executing the query def get_stop_name(crs, stopid): """ Given a cursor crs to the CTA database, queries the stops table for the stop id. Returns None if the stop id has not been found, otherwise returns the stop name. """ sel = SELECT name FROM stops whe = WHERE id = %d % stopid query = sel + whe returned = crs.execute(query) if returned == 0: return None else: tpl = crs.fetchone() return tpl[0] Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 28 / 39
processing data from the web 1 CTA Tables general transit feed specification stop identification and name finding trips for a given stop 2 CTA Tables in MySQL files in GTFS feed are tables in database filling a table with a Python script storing the connections Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 29 / 39
fields in stop_times.txt The first line in stop_times.txt lists: 1 trip_id: type INT 2 arrival_time: type TIME 3 departure_time: type TIME 4 stop_id: type INT 5 stop_sequence: type INT 6 stop_headsign: type VARCHAR(80) 7 pickup_type: type INT 8 shape_dist_traveled: type INT Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 30 / 39
adding a new table mysql> create table stop_times -> (id BIGINT, arrival TIME, departure TIME, -> stop INT, seq INT, head VARCHAR(80), -> ptp INT, sdt INT); Query OK, 0 rows affected (0.02 sec) Note the types TIME and VARCHAR. Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 31 / 39
explain mysql> explain stop_times; +-----------+-------------+------+-----+---------+-------+ Field Type Null Key Default Extra +-----------+-------------+------+-----+---------+-------+ id bigint(20) YES NULL arrival time YES NULL departure time YES NULL stop int(11) YES NULL seq int(11) YES NULL head varchar(80) YES NULL ptp int(11) YES NULL sdt int(11) YES NULL +-----------+-------------+------+-----+---------+-------+ 8 rows in set (0.00 sec) Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 32 / 39
manual insertion mysql> insert into stop_times values ( -> 46035893,"12:09:14","12:09:14",6531,29, -> "Midway Orange Line",0,18625); Query OK, 1 row affected (0.00 sec) mysql> select departure, head from stop_times; +-----------+--------------------+ departure head +-----------+--------------------+ 12:09:14 Midway Orange Line +-----------+--------------------+ 1 row in set (0.00 sec) Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 33 / 39
filling the table On Mac OS X laptop: $ python3 dbctafillstoptimes.py opening CTA/stop_times.txt... dbctafillstoptimes.py:26: Warning: Out of range value for column id at row 1 c.execute(d) Redo on a fast Linux Workstation: # time python3 dbctafillstoptimes.py opening CTA/stop_times.txt... dbctafillstoptimes.py:26: Warning: Out of range value for column id at row 1 c.execute(d) real user sys 5m32.433s 1m11.921s 0m17.735s Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 34 / 39
about the complexity While running dbctafillstoptimes.py, the memory consumption of Python and mysql was of the same magnitude, about 300Mb. mysql> select count(*) from stop_times; +----------+ count(*) +----------+ 5455515 +----------+ 1 row in set (1.46 sec) Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 35 / 39
inserting data def insert_data(crs, scsv, printonly=true): """ Inserts the data in the comma separated string scsv using the cursor crs. """ data = scsv.split(, ) cmd = INSERT INTO stop_times VALUES ( cmd += ( 0, if data[0] == else data[0] +, ) cmd += \" + data[1] + \" +, cmd += \" + data[2] + \" +, cmd += data[3] +, + data[4] +, cmd += data[5] +, + data[6] +, wrk = data[7] # must cut off the \n data7 = wrk[0:len(wrk)-1] + ) cmd += ( 0) if data[7] == else data7) print(repr(cmd)) if not printonly: crs.execute(cmd) Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 36 / 39
querying stop_times mysql> select head from stop_times -> where stop = 3021 and -> arrival < "05:30:00"; +----------------+ head +----------------+ 63rd Pl/Kedzie 63rd Pl/Kedzie 63rd Pl/Kedzie 63rd Pl/Kedzie +----------------+ 4 rows in set (0.94 sec) Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 37 / 39
an involved query mysql> select name, departure, head -> from stops, stop_times -> where stops.id = 3021 -> and stops.id = stop_times.stop -> and stop_times.departure < "05:30:00"; +----------------------+-----------+----------------+ name departure head +----------------------+-----------+----------------+ California & Augusta 04:43:49 63rd Pl/Kedzie California & Augusta 05:03:49 63rd Pl/Kedzie California & Augusta 05:19:49 63rd Pl/Kedzie California & Augusta 05:12:49 63rd Pl/Kedzie +----------------------+-----------+----------------+ 4 rows in set (0.57 sec) Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 38 / 39
Exercises 1 Modify ctastopname.py so the user is prompted for a string instead of a number. The modified script prints all id s and corresponding names that have the given string as substring. Use the in operator. 2 The file stops.txt contains the latitude and longitude of each stop. Use these coordinates to plot (either with pylab, pyplot, or a Tkinter canvas) the blue line from O Hare to Forest Park. Use a proper scaling so your plot resembles what we see on a map. 3 Write a Python script to return the name of stop, given its id, using the table stops. 4 Design a GUI with Tkinter to query the stop name: one entry field for the stop id, another for the name of the stop, and one button in the middle to execute the query. Note that the GUI allows to query given the stop id or given the stop name. Programming Tools (MCS 275) processing gtfs data L-39 17 April 2017 39 / 39