DATA11001 INTRODUCTION TO DATA SCIENCE EPISODE 2
TODAY'S MENU
1. DATABASES
2. DATA TRANSFORMATIONS
3. FILTERING AND IMPUTATION
DATABASES
This isn't a course on databases: hopefully you've already taken one.
But we'll refresh some basics to be able to access data in databases:
- sqlite
- elementary SQL
For the most part, we'll just extract the data we need and manipulate it in, e.g., Python and command-line tools.
EXAMPLE: KAGGLE: EUROPEAN SOCCER DATABASE
EUROPEAN SOCCER DATABASE
Create an account on Kaggle unless you already have one.
Chief Data Scientist's advice: Do Kaggle competitions. [...] Preprocessing, missing values, using libraries [...]
You can find the soccer database on Kaggle (European Soccer Database).
The database is a single zip file: database.sqlite.zip
Zipped 34 MB, unzipped 313 MB
EUROPEAN SOCCER DATABASE
Easy to use from the command line: sqlite3

    $ sqlite3 database.sqlite
    SQLite version 3.8.10.2 2015-05-20 18:17:19
    Enter ".help" for usage hints.
    sqlite> SELECT player_name FROM Player LIMIT 10;
    Aaron Appindangoye
    Aaron Cresswell
    Aaron Doran
    Aaron Galindo
    Aaron Hughes
    Aaron Hunt
    Aaron Kuhl
    Aaron Lennon
    Aaron Lennox
    Aaron Meijers
EUROPEAN SOCCER DATABASE
Same in Python:

    import sqlite3

    database = 'database.sqlite'
    conn = sqlite3.connect(database)
    c = conn.cursor()
    query = "SELECT player_name FROM Player;"
    c.execute(query)
    rows = c.fetchmany(10)
    print(rows)
    conn.close()

    [('Aaron Appindangoye',), ('Aaron Cresswell',), ('Aaron Doran',),
     ('Aaron Galindo',), ('Aaron Hughes',), ('Aaron Hunt',), ('Aaron Kuhl',),
     ('Aaron Lennon',), ('Aaron Lennox',), ('Aaron Meijers',)]
EUROPEAN SOCCER DATABASE
Same in Python with pandas (note the formatting, incl. header):

    import sqlite3
    import pandas as pd

    database = 'database.sqlite'
    conn = sqlite3.connect(database)
    query = "SELECT player_name FROM Player;"
    rows = pd.read_sql(query, conn)
    print(rows[0:10])
    conn.close()

              player_name
    0  Aaron Appindangoye
    1     Aaron Cresswell
    2         Aaron Doran
    3       Aaron Galindo
    4        Aaron Hughes
    5          Aaron Hunt
    6          Aaron Kuhl
    7        Aaron Lennon
    8        Aaron Lennox
    9       Aaron Meijers
EUROPEAN SOCCER DATABASE
Simple SQL tricks:

    sqlite> SELECT player_name, height FROM Player
       ...> ORDER BY height
       ...> LIMIT 10;
    Juan Quero             157.48
    Diego Buonanotte       160.02
    Maxi Moralez           160.02
    Anthony Deroin         162.56
    Bakari Kone            162.56
    Edgar Salli            162.56
    Fouad Rachid           162.56
    Frederic Sammaritano   162.56
    Lorenzo Insigne        162.56
    Pablo Piatti           162.56
EUROPEAN SOCCER DATABASE
TABLE Player: id, player_api_id, player_name, player_fifa_api_id, birthday, height, weight
TABLE Player_Attributes: id, player_fifa_api_id, player_api_id, date, overall_rating, potential, preferred_foot, attacking_work_rate, defensive_work_rate, crossing, finishing, heading_accuracy, short_passing, volleys, dribbling, curve, free_kick_accuracy, long_passing, ball_control, acceleration, sprint_speed, agility, reactions, balance, shot_power, jumping, stamina, strength, long_shots, aggression, interceptions, positioning, vision, penalties, marking, standing_tackle, sliding_tackle, gk_diving, gk_handling, gk_kicking, gk_positioning, gk_reflexes
EUROPEAN SOCCER DATABASE
Joining tables:

    sqlite> SELECT * FROM
       ...> (SELECT player_name, height, weight,
       ...> player_api_id AS p_id FROM Player) a
       ...> INNER JOIN Player_attributes b
       ...> ON a.p_id = b.player_api_id
       ...> LIMIT 10;
    Aaron Appindangoye 182.88 187 505942 1 218353 505942 2016-02-18 00:00:00 67 71 right medium medium 49 44 71 61 44 51 45 39 64 49 60 64 59 47 65 55 58 54 76 35 71 70 45 54 48 65 69 69 6 11 10 8 8
    Aaron Appindangoye 182.88 187 505942 2 218353 505942 2015-11-19 00:00:00 67 71 right medium medium 49 44 71 61 44 51 45 39 64 49 60 64 59 47 65 55 58 54 76 35 71 70 45 54 48 65 69 69 6 11 10 8 8
    ...
EUROPEAN SOCCER DATABASE
More SQL tricks: CREATE TABLE, GROUP BY, aggregate functions (MAX)

    sqlite> CREATE TABLE player_max_date
       ...> AS SELECT player_api_id AS p_id,
       ...> MAX(date) AS date
       ...> FROM player_attributes
       ...> GROUP BY p_id;
    sqlite> SELECT * FROM player_max_date LIMIT 3;
    2625 2015-01-16 00:00:00
    2752 2015-10-16 00:00:00
    2768 2016-03-17 00:00:00
EUROPEAN SOCCER DATABASE
Three-way join:

    sqlite> SELECT * FROM
       ...> (SELECT player_name, height, weight,
       ...> player_api_id AS p_id
       ...> FROM player) a
       ...> INNER JOIN
       ...> player_attributes b
       ...> ON a.p_id = b.player_api_id
       ...> INNER JOIN player_max_date c
       ...> ON b.player_api_id = c.p_id AND
       ...> b.date = c.date;
    Aaron Appindangoye 182.88 187 505942 1 218353 505942 2016-02-18 00:00:00 67 71 right medium medium 49 44 71 61 44 51 45 39 64 49 60 64 59 47 65 55 58 54 76 35 71 70 45 54 48 65 69 69 6 11 10 8 8 505942 2016-02-18 00:00:00
    Aaron Cresswell 170.18 146 155782 6 189615 155782 2016-04-21 00:00:00 74 76 left high medium 80 53 58 71 40 73 70 69 68 71 79 78 78 67 90 71 85 79 56 62 68 67 60 66 59 76 75 78 14 7 9 9 12 155782 2016-04-21 00:00:00
    ...
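The "latest record per player" pattern can be tried out without downloading the Kaggle file. The following is a minimal sketch on an in-memory SQLite database: the table and column names follow the slides, but the rows ('Player A', 'Player B' and their ratings) are made up for illustration, and the per-player maximum date is computed in a subquery instead of a separate player_max_date table.

```python
import sqlite3

# Toy stand-in for the Kaggle database: names follow the slides,
# but all rows below are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Player (player_api_id INTEGER, player_name TEXT);
CREATE TABLE Player_Attributes (
    player_api_id INTEGER, date TEXT, overall_rating INTEGER);
INSERT INTO Player VALUES (1, 'Player A'), (2, 'Player B');
INSERT INTO Player_Attributes VALUES
    (1, '2015-01-01 00:00:00', 60),
    (1, '2016-01-01 00:00:00', 65),
    (2, '2014-06-01 00:00:00', 70);
""")

# Join each player to its attribute rows, then keep only the row whose
# date equals that player's MAX(date) (computed in subquery c).
query = """
SELECT a.player_name, b.date, b.overall_rating
FROM Player a
INNER JOIN Player_Attributes b
    ON a.player_api_id = b.player_api_id
INNER JOIN (SELECT player_api_id AS p_id, MAX(date) AS date
            FROM Player_Attributes
            GROUP BY player_api_id) c
    ON b.player_api_id = c.p_id AND b.date = c.date
ORDER BY a.player_name;
"""
latest = conn.execute(query).fetchall()
conn.close()
print(latest)
```

Comparing dates as text works here only because the timestamps are in the sortable YYYY-MM-DD HH:MM:SS form.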
2. DATA TRANSFORMATIONS
TRANSFORMATIONS

    sqlite> .mode csv
    sqlite> .headers on
    sqlite> .output player_stats.csv
    sqlite> SELECT * FROM
       ...> (SELECT player_name, height, weight,
       ...> player_api_id AS p_id
       ...> FROM player) a
       ...> INNER JOIN player_attributes b
       ...> ON a.p_id = b.player_api_id
       ...> INNER JOIN player_max_date c
       ...> ON b.player_api_id = c.p_id AND
       ...> b.date = c.date;
    sqlite> .output stdout
TRANSFORMATIONS
csv => json is easy with Python and pandas!

    import pandas as pd
    import json

    data = pd.read_csv("player_stats.csv")
    # lines=True writes one JSON object per row (requires orient='records')
    print(data.to_json(orient='records', lines=True))

    {"player_name":"Aaron Appindangoye","height":182.88,"weight":187,"p_id":505942,"id":1,"player_fifa_api_id":218353,"player_api_id":505942,"date":"2016-02-18 00:00:00","overall_rating":67.0,"potential":71.0,"preferred_foot":"right","attacking_work_rate":"medium","defensive_work_rate":"medium","crossing":49.0,"finishing":44.0,"heading_accuracy":71.0,"short_passing":61.0,"volleys":44.0,"dribbling":51.0,"curve":45.0,"free_kick_accuracy":39.0,"long_passing":64.0,"ball_control":49.0,"acceleration":60.0,"sprint_speed":64.0,"agility":59.0,"reactions":47.0,"balance":65.0,"shot_power":
OTHER TRANSFORMATIONS HTML => e.g. CSV "Scraping!" (dirty business)
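As a rough illustration of HTML => CSV, here is a minimal sketch that uses only the Python standard library. The TableParser class and the tiny HTML table are hypothetical examples, not part of the course material; real pages are usually far messier, which is why scraping is dirty business.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of <td>/<th> cells, row by row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row currently being read, or None
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up input for illustration.
html = """<table>
<tr><th>name</th><th>height</th></tr>
<tr><td>Player A</td><td>182.88</td></tr>
<tr><td>Player B</td><td>170.18</td></tr>
</table>"""

parser = TableParser()
parser.feed(html)

# Write the collected rows as CSV text.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
csv_text = buffer.getvalue()
print(csv_text)
```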
TRANSFORMATIONS
Content transformations:
- string to numeric, " 12.002" => 12.002 (float)
- dates (mind the formats, 9/5/2017 vs 5.9.2017)
- NA/ /0/99/etc can mean missing entries
- splitting: name = "Teemu Roos" => first = "Teemu", last = "Roos"
- ...
Especially for text, it may be important to:
- downcase: SuperMan => superman
- remove punctuation
- stem: 'swimming' => 'swim'
3. FILTERING AND IMPUTATION
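The content transformations listed above can be sketched in plain Python as follows. All function names are hypothetical helpers, the set of missing-value markers is an assumption that has to be checked per dataset, and the date example assumes the two strings on the slide denote the same day in US (month/day/year) and Finnish (day.month.year) notation. The stemmer is a deliberately crude toy; a real project would use an established stemming algorithm such as Porter's.

```python
import string
from datetime import datetime


def to_float(text):
    """String to numeric: surrounding whitespace is dropped first."""
    return float(text.strip())


def parse_date(text, fmt):
    """Parse a date with an explicit format, never by guessing."""
    return datetime.strptime(text, fmt).date()


# Assumption: which markers mean "missing" depends on the dataset.
MISSING_MARKERS = {"", "NA"}


def clean_missing(text):
    """Map missing-value markers to None, leave other values as they are."""
    value = text.strip()
    return None if value in MISSING_MARKERS else value


def split_name(name):
    """Split at the first space: 'Teemu Roos' => ('Teemu', 'Roos')."""
    first, _, last = name.partition(" ")
    return first, last


def normalize(text):
    """Downcase and remove ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def naive_stem(word):
    """Toy stemmer: strip 'ing' and undo a doubled final consonant."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) >= 2 and word[-1] == word[-2]:
            word = word[:-1]
    return word


print(to_float(" 12.002"))
print(parse_date("9/5/2017", "%m/%d/%Y"), parse_date("5.9.2017", "%d.%m.%Y"))
print(clean_missing("NA"), split_name("Teemu Roos"))
print(normalize("SuperMan!"), naive_stem("swimming"))
```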
FILTERING
Subsetting: columns and/or rows
Many of these operations are conveniently done using command-line tools such as grep, cut, awk, sed
For big data, it is important to avoid reading all the data into memory before starting: the above tools read and process the data a little at a time (typically line by line), so memory consumption stays roughly constant regardless of file size
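The same streaming idea carries over to Python. Below is a minimal sketch of a generator that subsets rows and columns while holding only one record in memory at a time; filter_rows is a hypothetical helper and the sample rows are made up. With a real file, the StringIO object would be replaced by an open file handle.

```python
import csv
import io


def filter_rows(lines, keep_columns, predicate):
    """Yield the selected columns of each row for which predicate(row) is true.

    Rows are read lazily, so only the current row is held in memory.
    """
    reader = csv.DictReader(lines)
    for row in reader:
        if predicate(row):
            yield {col: row[col] for col in keep_columns}


# Made-up sample data standing in for a large CSV file.
sample = io.StringIO(
    "player_name,height,weight\n"
    "Player A,182.88,187\n"
    "Player B,170.18,146\n"
    "Player C,190.50,198\n"
)

# Row filter (height > 180) and column filter (drop weight) in one pass.
tall = list(filter_rows(sample,
                        ["player_name", "height"],
                        lambda r: float(r["height"]) > 180))
print(tall)
```

Calling list() here materializes the filtered result for printing; for truly large inputs, iterate over the generator and write each row out immediately.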
IMPUTATION
Missing values can be a show-stopper for many analysis methods
A simple way is to filter out all records with missing entries
This may, however, lose a lot of important data
Another option is to impute, i.e., enter "fake" data in the place of the missing entries:
- average for numeric columns
- mode (most typical value) for categorical columns
- also possible to use machine learning to predict the missing entries based on the others
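Both strategies can be compared on a small example. The following is a minimal pandas sketch: the column names are borrowed from the soccer database, but the four rows are made up for illustration. It covers only filtering, mean imputation and mode imputation, not the machine-learning option.

```python
import pandas as pd

# Made-up data: one missing height and one missing preferred_foot.
df = pd.DataFrame({
    "height": [180.0, None, 170.0, 175.0],
    "preferred_foot": ["right", "left", None, "right"],
})

# Option 1: filter out every record with a missing entry.
dropped = df.dropna()

# Option 2: impute instead of dropping.
imputed = df.copy()
# Numeric column: fill with the average of the observed values.
imputed["height"] = imputed["height"].fillna(imputed["height"].mean())
# Categorical column: fill with the mode (most frequent value).
most_common_foot = imputed["preferred_foot"].mode().iloc[0]
imputed["preferred_foot"] = imputed["preferred_foot"].fillna(most_common_foot)

print(len(df), len(dropped))
print(imputed)
```

Filtering halves this toy dataset, while imputation keeps all four rows at the cost of introducing values that were never observed.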