INTRODUCTION TO DATA SCIENCE


DATA11001 INTRODUCTION TO DATA SCIENCE EPISODE 2

TODAY'S MENU
1. DATABASES
2. DATA TRANSFORMATIONS
3. FILTERING AND IMPUTATION

DATABASES
This isn't a course on databases: hopefully you've already taken one. But we'll refresh some basics to be able to access data in databases:
- sqlite
- elementary SQL
For the most part, we'll just extract the data we need and manipulate it in, e.g., Python and command-line tools.

EXAMPLE: KAGGLE: EUROPEAN SOCCER DATABASE

EUROPEAN SOCCER DATABASE
Create an account on Kaggle unless you already have one.
Chief Data Scientist's advice: "Do Kaggle competitions. [...] Preprocessing, missing values, using libraries [...]"
You can find the soccer database here. The database is a single zip file: database.sqlite.zip (zipped 34 MB, unzipped 313 MB).

EUROPEAN SOCCER DATABASE
Easy to use from the command line: sqlite3

$ sqlite3 database.sqlite
SQLite version 3.8.10.2 2015-05-20 18:17:19
Enter ".help" for usage hints.
sqlite> SELECT player_name FROM Player LIMIT 10;
Aaron Appindangoye
Aaron Cresswell
Aaron Doran
Aaron Galindo
Aaron Hughes
Aaron Hunt
Aaron Kuhl
Aaron Lennon
Aaron Lennox
Aaron Meijers

EUROPEAN SOCCER DATABASE
Same in Python:

import sqlite3

database = 'database.sqlite'
conn = sqlite3.connect(database)
c = conn.cursor()
query = "SELECT player_name FROM Player;"
c.execute(query)
rows = c.fetchmany(10)
print(rows)
conn.close()

[('Aaron Appindangoye',), ('Aaron Cresswell',), ('Aaron Doran',), ('Aaron Galindo',), ('Aaron Hughes',), ('Aaron Hunt',), ('Aaron Kuhl',), ('Aaron Lennon',), ('Aaron Lennox',), ('Aaron Meijers',)]

EUROPEAN SOCCER DATABASE
Same in Python with pandas (note the formatting, incl. header):

import sqlite3
import pandas as pd

database = 'database.sqlite'
conn = sqlite3.connect(database)
query = "SELECT player_name FROM Player;"
rows = pd.read_sql(query, conn)
print(rows[0:10])
conn.close()

          player_name
0  Aaron Appindangoye
1     Aaron Cresswell
2         Aaron Doran
3       Aaron Galindo
4        Aaron Hughes
5          Aaron Hunt
6          Aaron Kuhl
7        Aaron Lennon
8        Aaron Lennox
9       Aaron Meijers

EUROPEAN SOCCER DATABASE
Simple SQL tricks:

sqlite> SELECT player_name, height FROM Player
   ...> ORDER BY height
   ...> LIMIT 10;
Juan Quero 157.48
Diego Buonanotte 160.02
Maxi Moralez 160.02
Anthony Deroin 162.56
Bakari Kone 162.56
Edgar Salli 162.56
Fouad Rachid 162.56
Frederic Sammaritano 162.56
Lorenzo Insigne 162.56
Pablo Piatti 162.56
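The same ORDER BY ... LIMIT trick works just as well through the Python API. A minimal sketch, using a tiny in-memory stand-in for the Player table (the names and heights below are sample values, not the real Kaggle data) so it runs without the download:

```python
import sqlite3

# Build a small in-memory stand-in for the Player table (made-up rows)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Player (player_name TEXT, height REAL)")
conn.executemany(
    "INSERT INTO Player VALUES (?, ?)",
    [("Juan Quero", 157.48), ("Maxi Moralez", 160.02), ("Peter Crouch", 200.66)],
)

# Same ORDER BY ... LIMIT query as in the sqlite3 shell above
rows = conn.execute(
    "SELECT player_name, height FROM Player ORDER BY height LIMIT 2"
).fetchall()
print(rows)  # shortest players first
conn.close()
```

Against the real database.sqlite file, you would connect to that file instead of ":memory:" and skip the CREATE/INSERT steps.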

EUROPEAN SOCCER DATABASE
TABLE Player: id, player_api_id, player_name, player_fifa_api_id, birthday, height, weight
TABLE Player_Attributes: id, player_fifa_api_id, player_api_id, date, overall_rating, potential, preferred_foot, attacking_work_rate, defensive_work_rate, crossing, finishing, heading_accuracy, short_passing, volleys, dribbling, curve, free_kick_accuracy, long_passing, ball_control, acceleration, sprint_speed, agility, reactions, balance, shot_power, jumping, stamina, strength, long_shots, aggression, interceptions, positioning, vision, penalties, marking, standing_tackle, sliding_tackle, gk_diving, gk_handling, gk_kicking, gk_positioning, gk_reflexes

EUROPEAN SOCCER DATABASE
Joining tables:

sqlite> SELECT * FROM
   ...>   (SELECT player_name, height, weight,
   ...>    player_api_id AS p_id FROM Player) a
   ...>   INNER JOIN Player_attributes b
   ...>   ON a.p_id = b.player_api_id
   ...>   LIMIT 10;
Aaron Appindangoye 182.88 187 505942 1 218353 505942 2016-02-18 00:00:00 67 71 right medium medium 49 44 71 61 44 51 45 39 64 49 60 64 59 47 65 55 58 54 76 35 71 70 45 54 48 65 69 69 6 11 10 8 8
Aaron Appindangoye 182.88 187 505942 2 218353 505942 2015-11-19 00:00:00 67 71 right medium medium 49 44 71 61 44 51 45 39 64 49 60 64 59 47 65 55 58 54 76 35 71 70 45 54 48 65 69 69 6 11 10 8 8
...

EUROPEAN SOCCER DATABASE
More SQL tricks: CREATE TABLE, GROUP BY, aggregate functions (MAX)

sqlite> CREATE TABLE player_max_date
   ...>   AS SELECT player_api_id AS p_id,
   ...>      MAX(date) AS date
   ...>   FROM player_attributes
   ...>   GROUP BY p_id;
sqlite> SELECT * FROM player_max_date LIMIT 3;
2625 2015-01-16 00:00:00
2752 2015-10-16 00:00:00
2768 2016-03-17 00:00:00

EUROPEAN SOCCER DATABASE
Three-way join:

sqlite> SELECT * FROM
   ...>   (SELECT player_name, height, weight,
   ...>    player_api_id AS p_id
   ...>    FROM player) a
   ...>   INNER JOIN
   ...>   player_attributes b
   ...>   ON a.p_id = b.player_api_id
   ...>   INNER JOIN player_max_date c
   ...>   ON b.player_api_id = c.p_id AND
   ...>      b.date = c.date;
Aaron Appindangoye 182.88 187 505942 1 218353 505942 2016-02-18 00:00:00 67 71 right medium medium 49 44 71 61 44 51 45 39 64 49 60 64 59 47 65 55 58 54 76 35 71 70 45 54 48 65 69 69 6 11 10 8 8 505942 2016-02-18 00:00:00
Aaron Cresswell 170.18 146 155782 6 189615 155782 2016-04-21 00:00:00 74 76 left high medium 80 53 58 71 40 73 70 69 68 71 79 78 78 67 90 71 85 79 56 62 68 67 60 66 59 76 75 78 14 7 9 9 12 155782 2016-04-21 00:00:00
...
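The same "keep each player's latest attribute row, then attach the name" pattern can be done entirely in pandas with groupby + idxmax and merge. A minimal sketch on toy stand-ins for the two tables (the rows below are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for the Player and Player_Attributes tables (made-up rows)
player = pd.DataFrame({
    "player_api_id": [1, 2],
    "player_name": ["Aaron Appindangoye", "Aaron Cresswell"],
})
attrs = pd.DataFrame({
    "player_api_id": [1, 1, 2],
    "date": ["2015-11-19", "2016-02-18", "2016-04-21"],
    "overall_rating": [67, 67, 74],
})
attrs["date"] = pd.to_datetime(attrs["date"])

# Keep only each player's most recent attribute row (the role played by
# the player_max_date helper table above), then join the names back in
latest = attrs.loc[attrs.groupby("player_api_id")["date"].idxmax()]
result = player.merge(latest, on="player_api_id")
print(result)
```

Which tool to use is a matter of taste and data size: doing the reduction in SQL keeps the data in the database, while pandas is convenient once the tables fit in memory.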

2. DATA TRANSFORMATIONS

TRANSFORMATIONS

sqlite> .mode csv
sqlite> .headers on
sqlite> .output player_stats.csv
sqlite> SELECT * FROM
   ...>   (SELECT player_name, height, weight,
   ...>    player_api_id AS p_id FROM player) a
   ...>   INNER JOIN player_attributes b
   ...>   ON a.p_id = b.player_api_id
   ...>   INNER JOIN player_max_date c
   ...>   ON b.player_api_id = c.p_id AND
   ...>      b.date = c.date;
sqlite> .output stdout

TRANSFORMATIONS

TRANSFORMATIONS
csv => json is easy with Python and pandas!

import pandas as pd

data = pd.read_csv("player_stats.csv")
print(data.to_json(orient='records', lines=True))

{"player_name":"Aaron Appindangoye","height":182.88,"weight":187,"p_id":505942,"id":1,"player_fifa_api_id":218353,"player_api_id":505942,"date":"2016-02-18 00:00:00","overall_rating":67.0,"potential":71.0,"preferred_foot":"right","attacking_work_rate":"medium","defensive_work_rate":"medium","crossing":49.0,"finishing":44.0,"heading_accuracy":71.0,"short_passing":61.0,"volleys":44.0,"dribbling":51.0,"curve":45.0,"free_kick_accuracy":39.0,"long_passing":64.0,"ball_control":49.0,"acceleration":60.0,"sprint_speed":64.0,"agility":59.0,"reactions":47.0,"balance":65.0,"shot_power":...

OTHER TRANSFORMATIONS HTML => e.g. CSV "Scraping!" (dirty business)

TRANSFORMATIONS
Content transformations:
- string to numeric: " 12.002" => 12.002 (float)
- dates (mind the formats: 9/5/2017 vs 5.9.2017)
- NA / "" / 0 / 99 / etc. can mean missing entries
- splitting: name = "Teemu Roos" => first = "Teemu", last = "Roos"
- ...
Especially for text, it may be important to:
- downcase: SuperMan => superman
- remove punctuation
- stem: 'swimming' => 'swim'
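Most of the transformations above are one-liners in plain Python. A minimal sketch (stemming is left out, since it needs a library such as NLTK):

```python
import string
from datetime import datetime

# string to numeric: strip stray whitespace before converting
value = float(" 12.002".strip())

# dates: be explicit about the format instead of guessing
d1 = datetime.strptime("9/5/2017", "%m/%d/%Y")   # US style: Sep 5
d2 = datetime.strptime("5.9.2017", "%d.%m.%Y")   # European style: Sep 5

# splitting a name into parts
first, last = "Teemu Roos".split(" ", 1)

# text normalization: downcase and remove punctuation
text = "SuperMan, swimming!"
clean = text.lower().translate(str.maketrans("", "", string.punctuation))

print(value, d1.date(), first, last, clean)
```

Note that the two date strings above denote the same day only because the format string matches the convention each one uses; parsing "9/5/2017" with the European format would silently give May 9.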

3. FILTERING AND IMPUTATION

FILTERING
Subsetting: columns and/or rows.
Many of these operations are conveniently done using command-line tools such as grep, cut, awk, and sed.
For big data, it is important to avoid reading all the data into memory before starting: the above tools store and process the data little by little, so memory consumption stays constant.
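The same stream-instead-of-load idea is available in pandas via read_csv(chunksize=...), which processes the file piece by piece so memory use stays bounded. A minimal sketch, using a tiny in-memory CSV (made-up rows) as a stand-in for a big file on disk:

```python
import io
import pandas as pd

# Stand-in for a large CSV file on disk (made-up sample rows)
csv = io.StringIO(
    "player_name,height\n"
    "Juan Quero,157.48\n"
    "Peter Crouch,200.66\n"
    "Maxi Moralez,160.02\n"
)

tall = 0
for chunk in pd.read_csv(csv, chunksize=1):   # tiny chunks for the demo
    # filter rows (like grep) without ever holding the whole file in memory
    tall += len(chunk[chunk["height"] > 180])
print(tall)
```

On a real file you would pass the filename instead of the StringIO object and use a much larger chunksize (e.g. 100_000 rows).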

IMPUTATION
Missing values can be a show-stopper for many analysis methods.
A simple way is to filter out all records with missing entries. This may, however, lose a lot of important data.
Another option is to impute, i.e., enter "fake" data in place of the missing entries:
- average for numeric columns
- mode (most typical value) for categorical columns
- also possible to use machine learning to predict the missing entries based on the others
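The first two imputation strategies are one-liners with pandas fillna. A minimal sketch on a toy table with missing entries (the values are made up for illustration):

```python
import pandas as pd

# Toy table with missing entries (made-up data)
df = pd.DataFrame({
    "height": [170.0, None, 190.0],
    "foot": ["right", None, "right"],
})

# numeric column: fill with the column average
df["height"] = df["height"].fillna(df["height"].mean())

# categorical column: fill with the mode (most common value)
df["foot"] = df["foot"].fillna(df["foot"].mode()[0])

print(df)
```

The model-based option would instead fit a predictor (e.g. a regression) on the complete rows and use it to fill in the gaps; that is more work but can preserve relationships between columns that mean/mode imputation destroys.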