INTRODUCTION TO DATA SCIENCE


1 DATA11001 INTRODUCTION TO DATA SCIENCE EPISODE 2

2 TODAY'S MENU
1. DATABASES
2. DATA TRANSFORMATIONS
3. FILTERING AND IMPUTATION

3 DATABASES
This isn't a course on databases: hopefully you've already taken one.
But we'll refresh some basics to be able to access data in databases:
- sqlite
- elementary SQL
For the most part, we'll just extract the data we need and manipulate it in, e.g., Python and command-line tools.

4 EXAMPLE: KAGGLE: EUROPEAN SOCCER DATABASE

5 EUROPEAN SOCCER DATABASE
Create an account on Kaggle unless you already have one.
Chief Data Scientist's advice: Do Kaggle competitions. [...] Preprocessing, missing values, using libraries [...]
You can find the soccer database here
The database is a single zip file: database.sqlite.zip
Zipped 34 MB, unzipped 313 MB

6 EUROPEAN SOCCER DATABASE
Easy to use from the command line: sqlite3

    $ sqlite3 database.sqlite
    SQLite version :17:19
    Enter ".help" for usage hints.
    sqlite> SELECT player_name FROM Player LIMIT 10;
    Aaron Appindangoye
    Aaron Cresswell
    Aaron Doran
    Aaron Galindo
    Aaron Hughes
    Aaron Hunt
    Aaron Kuhl
    Aaron Lennon
    Aaron Lennox
    Aaron Meijers

7 EUROPEAN SOCCER DATABASE
Same in Python:

    import sqlite3

    database = 'database.sqlite'
    conn = sqlite3.connect(database)
    c = conn.cursor()

    # Run the query, then fetch only the first 10 result rows
    query = "SELECT player_name FROM Player;"
    c.execute(query)
    rows = c.fetchmany(10)
    print(rows)
    conn.close()

    [('Aaron Appindangoye',), ('Aaron Cresswell',), ('Aaron Doran',),
     ('Aaron Galindo',), ('Aaron Hughes',), ('Aaron Hunt',), ('Aaron Kuhl',),
     ('Aaron Lennon',), ('Aaron Lennox',), ('Aaron Meijers',)]

8 EUROPEAN SOCCER DATABASE
Same in Python with pandas (note the formatting, incl. header):

    import sqlite3
    import pandas as pd

    database = 'database.sqlite'
    conn = sqlite3.connect(database)

    # read_sql returns the query result as a DataFrame
    query = "SELECT player_name FROM Player;"
    rows = pd.read_sql(query, conn)
    print(rows[0:10])
    conn.close()

              player_name
    0  Aaron Appindangoye
    1     Aaron Cresswell
    2         Aaron Doran
    3       Aaron Galindo
    4        Aaron Hughes
    5          Aaron Hunt
    6          Aaron Kuhl
    7        Aaron Lennon
    8        Aaron Lennox
    9       Aaron Meijers

9 EUROPEAN SOCCER DATABASE
Simple SQL tricks:

    sqlite> SELECT player_name, height FROM Player
       ...> ORDER BY height
       ...> LIMIT 10;
    Juan Quero
    Diego Buonanotte
    Maxi Moralez
    Anthony Deroin
    Bakari Kone
    Edgar Salli
    Fouad Rachid
    Frederic Sammaritano
    Lorenzo Insigne
    Pablo Piatti

10 EUROPEAN SOCCER DATABASE
TABLE Player: id, player_api_id, player_name, player_fifa_api_id, birthday, height, weight
TABLE Player_Attributes: id, player_fifa_api_id, player_api_id, date, overall_rating, potential, preferred_foot, attacking_work_rate, defensive_work_rate, crossing, finishing, heading_accuracy, short_passing, volleys, dribbling, curve, free_kick_accuracy, long_passing, ball_control, acceleration, sprint_speed, agility, reactions, balance, shot_power, jumping, stamina, strength, long_shots, aggression, interceptions, positioning, vision, penalties, marking, standing_tackle, sliding_tackle, gk_diving, gk_handling, gk_kicking, gk_positioning, gk_reflexes

11 EUROPEAN SOCCER DATABASE
Joining tables:

    sqlite> SELECT * FROM
       ...> (SELECT player_name, height, weight,
       ...> player_api_id AS p_id FROM Player) a
       ...> INNER JOIN Player_attributes b
       ...> ON a.p_id = b.player_api_id
       ...> LIMIT 10;
    Aaron Appindangoye :00: right medium medium
    Aaron Appindangoye :00: right medium medium

12 EUROPEAN SOCCER DATABASE
More SQL tricks: CREATE TABLE, GROUP BY, aggregate functions (MAX)

    sqlite> CREATE TABLE player_max_date
       ...> AS SELECT player_api_id AS p_id,
       ...> MAX(date) AS date
       ...> FROM player_attributes
       ...> GROUP BY p_id;
    sqlite> SELECT * FROM player_max_date LIMIT 3;
    :00: :00: :00:00

13 EUROPEAN SOCCER DATABASE
Three-way join:

    sqlite> SELECT * FROM
       ...> (SELECT player_name, height, weight,
       ...> player_api_id AS p_id
       ...> FROM player) a
       ...> INNER JOIN
       ...> player_attributes b
       ...> ON a.p_id = b.player_api_id
       ...> INNER JOIN player_max_date c
       ...> ON b.player_api_id = c.p_id AND
       ...> b.date = c.date;
    Aaron Appindangoye :00: right medium medium :00:00
    Aaron Cresswell :00: left high medium :00:00
    ...
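
The GROUP BY / MAX / join pattern from the last two slides can be tried without downloading the full database. The following is a minimal sketch, assuming a toy in-memory database with made-up rows and only a few of the real columns; it uses a subquery in place of the player_max_date table, which is a simplification and not what the slides do.

```python
import sqlite3

# Toy in-memory stand-in for database.sqlite; all rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Player (player_api_id INTEGER, player_name TEXT);
CREATE TABLE Player_Attributes (
    player_api_id INTEGER, date TEXT, overall_rating INTEGER);
INSERT INTO Player VALUES (1, 'Player A'), (2, 'Player B');
INSERT INTO Player_Attributes VALUES
    (1, '2015-01-01', 60), (1, '2016-01-01', 65),
    (2, '2014-06-01', 70), (2, '2016-03-01', 72);
""")

# For each player, keep only the attribute row with the latest date.
# The subquery "c" plays the role of the player_max_date table.
query = """
SELECT a.player_name, b.date, b.overall_rating
FROM Player a
INNER JOIN Player_Attributes b
        ON a.player_api_id = b.player_api_id
INNER JOIN (SELECT player_api_id AS p_id, MAX(date) AS date
            FROM Player_Attributes
            GROUP BY player_api_id) c
        ON b.player_api_id = c.p_id AND b.date = c.date
ORDER BY a.player_name;
"""
latest = conn.execute(query).fetchall()
conn.close()

print(latest)
# [('Player A', '2016-01-01', 65), ('Player B', '2016-03-01', 72)]
```

Taking MAX over the date column works here because the dates are stored as ISO-formatted text, which sorts in chronological order.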

14 2. DATA TRANSFORMATIONS

15 TRANSFORMATIONS

    sqlite> .mode csv
    sqlite> .headers on
    sqlite> .output player_stats.csv
    sqlite> SELECT * FROM
       ...> (SELECT player_name, height, weight,
       ...> player_api_id AS p_id
       ...> FROM player) a
       ...> INNER JOIN player_attributes b
       ...> ON a.p_id = b.player_api_id
       ...> INNER JOIN player_max_date c
       ...> ON b.player_api_id = c.p_id AND
       ...> b.date = c.date;
    sqlite> .output stdout

16 TRANSFORMATIONS

17 TRANSFORMATIONS
csv => json is easy with Python and pandas!

    import pandas as pd
    import json

    data = pd.read_csv("player_stats.csv")
    # lines=True writes one JSON object per line (requires orient='records')
    print(data.to_json(orient='records', lines=True))

    {"player_name":"Aaron Appindangoye","height": ,"weight":187,"p_id":505942,"id":1,
    "player_fifa_api_id":218353,"player_api_id": ,"date":" :00:00","overall_rating":67.0,
    "potential":71.0,"preferred_foot":"right","attacking_work_rate":"medium",
    "defensive_work_rate":"medium","crossing":49.0,"finishing":44.0,"heading_accuracy":71.0,
    "short_passing":61.0,"volleys":44.0,"dribbling":51.0,"curve":45.0,
    "free_kick_accuracy":39.0,"long_passing":64.0,"ball_control":49.0,"acceleration":60.0,
    "sprint_speed":64.0,"agility":59.0,"reactions":47.0,"balance":65.0,"shot_power":

18 OTHER TRANSFORMATIONS
HTML => e.g. CSV
"Scraping!" (dirty business)
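
As an illustration of HTML => CSV, here is a minimal sketch that pulls the cells out of an HTML table using only the Python standard library. The TableParser class and the sample HTML are hypothetical and made up for this example; real pages are usually messier (nested tables, missing closing tags), which is why scraping is dirty business.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row currently being read, if any
        self._cell = None  # text pieces of the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample page content.
html = """
<table>
  <tr><th>name</th><th>height</th></tr>
  <tr><td>Player A</td><td>180.3</td></tr>
  <tr><td>Player B</td><td>175.0</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)

# Write the collected rows as CSV into a string buffer.
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(parser.rows)
csv_text = out.getvalue()
print(csv_text)
```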

19 TRANSFORMATIONS
Content transformations:
- string to numeric, " " => (float)
- dates (mind the formats, 9/5/2017 vs )
- NA/ /0/99/etc can mean missing entries
- splitting: name = "Teemu Roos" => first = "Teemu", last = "Roos"
- ...
Especially for text, it may be important to:
- downcase: SuperMan => superman
- remove punctuation
- stem: 'swimming' => 'swim'
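
The content transformations listed above can be sketched with the Python standard library. All function names here are hypothetical helpers, the set of missing-value markers is an assumption that depends on the data set, and naive_stem is a deliberately crude stand-in for a real stemmer.

```python
import string
from datetime import datetime


def to_float(text):
    """Parse a numeric string; treat '' and 'NA' as missing (None).

    Which markers mean "missing" is an assumption: in some data sets
    0 or 99 play that role instead.
    """
    text = text.strip()
    if text in ("", "NA"):
        return None
    return float(text)


def parse_date(text, fmt):
    """Parse a date with an explicit format, so 9/5 is never guessed."""
    return datetime.strptime(text, fmt).date()


def normalize(text):
    """Downcase and remove ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def naive_stem(word):
    """Very rough stemmer: drop 'ing' and undo a doubled final letter."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) >= 2 and word[-1] == word[-2]:
            word = word[:-1]
    return word


height = to_float("182.88")                      # 182.88
missing = to_float("NA")                         # None
first, last = "Teemu Roos".split(" ", 1)         # "Teemu", "Roos"
day_first = parse_date("9/5/2017", "%d/%m/%Y")   # 9 May 2017
month_first = parse_date("9/5/2017", "%m/%d/%Y") # 5 September 2017
clean = normalize("SuperMan!")                   # "superman"
stem = naive_stem("swimming")                    # "swim"
```

The two parse_date calls show why the format matters: the same string yields two different dates depending on whether the day or the month comes first.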

20 3. FILTERING AND IMPUTATION

21 FILTERING
Subsetting: columns and/or rows
Many of these operations are conveniently done using command-line tools such as grep, cut, awk, sed.
For big data, it is important to avoid reading all the data into memory before starting: the above tools store and process the data little by little, so memory consumption stays constant regardless of the file size.
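
The same little-by-little idea carries over to Python. This is a minimal sketch, assuming a CSV input with a header row; filter_rows is a hypothetical helper and the sample rows are made up. It reads one row at a time, so only the current row is held in memory.

```python
import csv
import io


def filter_rows(lines, keep_columns, predicate):
    """Yield the selected columns of matching rows, one row at a time."""
    reader = csv.DictReader(lines)
    for row in reader:
        if predicate(row):
            yield {col: row[col] for col in keep_columns}


# Stand-in for an open file; a real file object streams the same way.
data = io.StringIO(
    "player_name,height,preferred_foot\n"
    "Player A,180.3,right\n"
    "Player B,175.0,left\n"
    "Player C,190.5,left\n"
)

# list() is used only to show the result; with big data, write each
# row out as it arrives instead of collecting everything.
left_footed = list(
    filter_rows(
        data,
        ["player_name", "height"],
        lambda row: row["preferred_foot"] == "left",
    )
)
print(left_footed)
```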

22 IMPUTATION
Missing values can be a show-stopper for many analysis methods.
A simple way is to filter out all records with missing entries. This may, however, lose a lot of important data.
Another option is to impute, i.e., enter "fake" data in the place of the missing entries:
- average for numeric columns
- mode (most typical value) for categorical columns
- also possible to use machine learning to predict the missing entries based on the others
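
Mean and mode imputation can be sketched in a few lines. This is a minimal example under stated assumptions: records are plain dictionaries, missing entries are represented by None, and impute is a hypothetical helper; the sample values are made up.

```python
from statistics import mean, mode


def impute(records, numeric_cols, categorical_cols):
    """Return copies of the records with None replaced by mean or mode."""
    fill = {}
    for col in numeric_cols:
        fill[col] = mean(r[col] for r in records if r[col] is not None)
    for col in categorical_cols:
        fill[col] = mode(r[col] for r in records if r[col] is not None)
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in records
    ]


records = [
    {"height": 180.0, "preferred_foot": "right"},
    {"height": None, "preferred_foot": "right"},
    {"height": 170.0, "preferred_foot": None},
    {"height": 190.0, "preferred_foot": "left"},
]

imputed = impute(records, ["height"], ["preferred_foot"])
print(imputed[1]["height"])          # 180.0 (mean of 180, 170, 190)
print(imputed[2]["preferred_foot"])  # right (most frequent value)
```

The fill values are computed from the observed entries only, and the original records are left unchanged so the imputed and raw data can still be compared.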

INTRODUCTION TO DATA SCIENCE

INTRODUCTION TO DATA SCIENCE DATA11001 INTRODUCTION TO DATA SCIENCE EPISODE 2: MINIPROJECT, ARRAY DATA, STORAGE FORMATS & TRANSFORMATIONS TODAY S MENU 1. M I N I P R O J E C T S 2. A R R AY D ATA 3. D ATA T R A N S F O R M AT I O

More information

Traffic violations revisited

Traffic violations revisited Traffic violations revisited November 9, 2017 In this lab, you will once again extract data about traffic violations from a CSV file, but this time you will use SQLite. First, download the following files

More information

Converting categorical data into numbers with Pandas and Scikit-learn -...

Converting categorical data into numbers with Pandas and Scikit-learn -... 1 of 6 11/17/2016 11:02 AM FastML Machine learning made easy RSS Home Contents Popular Links Backgrounds About Converting categorical data into numbers with Pandas and Scikit-learn 2014-04-30 Many machine

More information

Databases. Course October 23, 2018 Carsten Witt

Databases. Course October 23, 2018 Carsten Witt Databases Course 02807 October 23, 2018 Carsten Witt Databases Database = an organized collection of data, stored and accessed electronically (Wikipedia) Different principles for organization of data:

More information

15-388/688 - Practical Data Science: Relational Data. J. Zico Kolter Carnegie Mellon University Spring 2018

15-388/688 - Practical Data Science: Relational Data. J. Zico Kolter Carnegie Mellon University Spring 2018 15-388/688 - Practical Data Science: Relational Data J. Zico Kolter Carnegie Mellon University Spring 2018 1 Announcements Piazza etiquette: Changing organization of threads to be easier to search (starting

More information

INTRODUCTION TO DATA SCIENCE

INTRODUCTION TO DATA SCIENCE INTRODUCTION TO DATA SCIENCE JOHN P DICKERSON Lecture #7 2/16/2017 CMSC320 Tuesdays & Thursdays 3:30pm 4:45pm ANNOUNCEMENTS Anant s office hours have changed: Old: 2PM-3PM on Tuesdays New: 11AM-12PM on

More information

IMPORTING DATA IN PYTHON I. Introduction to relational databases

IMPORTING DATA IN PYTHON I. Introduction to relational databases IMPORTING DATA IN PYTHON I Introduction to relational databases What is a relational database? Based on relational model of data First described by Edgar Ted Codd Example: Northwind database Orders table

More information

Data Analyst Nanodegree Syllabus

Data Analyst Nanodegree Syllabus Data Analyst Nanodegree Syllabus Discover Insights from Data with Python, R, SQL, and Tableau Before You Start Prerequisites : In order to succeed in this program, we recommend having experience working

More information

NCSS: Databases and SQL

NCSS: Databases and SQL NCSS: Databases and SQL Tim Dawborn Lecture 2, January, 2017 Python/sqlite3 DB Design API JOINs 2 Outline 1 Connecting to an SQLite database using Python 2 What is a good database design? 3 A nice API

More information

DATA STRUCTURE AND ALGORITHM USING PYTHON

DATA STRUCTURE AND ALGORITHM USING PYTHON DATA STRUCTURE AND ALGORITHM USING PYTHON Common Use Python Module II Peter Lo Pandas Data Structures and Data Analysis tools 2 What is Pandas? Pandas is an open-source Python library providing highperformance,

More information

Now go to bash and type the command ls to list files. The unix command unzip <filename> unzips a file.

Now go to bash and type the command ls to list files. The unix command unzip <filename> unzips a file. wrangling data unix terminal and filesystem Grab data-examples.zip from top of lecture 4 notes and upload to main directory on c9.io. (No need to unzip yet.) Now go to bash and type the command ls to list

More information

Data Analyst Nanodegree Syllabus

Data Analyst Nanodegree Syllabus Data Analyst Nanodegree Syllabus Discover Insights from Data with Python, R, SQL, and Tableau Before You Start Prerequisites : In order to succeed in this program, we recommend having experience working

More information

CIS 192: Lecture 11 Databases (SQLite3)

CIS 192: Lecture 11 Databases (SQLite3) CIS 192: Lecture 11 Databases (SQLite3) Lili Dworkin University of Pennsylvania In-Class Quiz app = Flask( main ) @app.route('/') def home():... app.run() 1. type(app.run) 2. type(app.route( / )) Hint:

More information

STAT 408. Data Scraping and SQL STAT 408. Data Scraping SQL. March 8, 2018

STAT 408. Data Scraping and SQL STAT 408. Data Scraping SQL. March 8, 2018 and and March 8, 2018 and and scraping is defined as using a computer to extract information, typically from human readable websites. We could spend multiple weeks on this, so this will be a basic introduction

More information

SQLite vs. MongoDB for Big Data

SQLite vs. MongoDB for Big Data SQLite vs. MongoDB for Big Data In my latest tutorial I walked readers through a Python script designed to download tweets by a set of Twitter users and insert them into an SQLite database. In this post

More information

STATS Data Analysis using Python. Lecture 15: Advanced Command Line

STATS Data Analysis using Python. Lecture 15: Advanced Command Line STATS 700-002 Data Analysis using Python Lecture 15: Advanced Command Line Why UNIX/Linux? As a data scientist, you will spend most of your time dealing with data Data sets never arrive ready to analyze

More information

Command-Line Data Analysis INX_S17, Day 15,

Command-Line Data Analysis INX_S17, Day 15, Command-Line Data Analysis INX_S17, Day 15, 2017-05-12 General tool efficiency, tr, newlines, join, column Learning Outcome(s): Discuss the theory behind Unix/Linux tool efficiency, e.g., the reasons behind

More information

CSE 115. Introduction to Computer Science I

CSE 115. Introduction to Computer Science I CSE 115 Introduction to Computer Science I Road map Review (sorting) Persisting data Databases Sorting Given a sequence of values that can be ordered, sorting involves rearranging these values so they

More information

SQL I: Introduction. Relational Databases. Attribute. Tuple. Relation

SQL I: Introduction. Relational Databases. Attribute. Tuple. Relation 1 SQL I: Introduction Lab Objective: Being able to store and manipulate large data sets quickly is a fundamental part of data science. The SQL language is the classic database management system for working

More information

CS 2316 Exam 3 ANSWER KEY

CS 2316 Exam 3 ANSWER KEY CS 2316 Exam 3 Practice ANSWER KEY Failure to properly fill in the information on this page will result in a deduction of up to 5 points from your exam score. Signing signifies you are aware of and in

More information

Data Science. Data Analyst. Data Scientist. Data Architect

Data Science. Data Analyst. Data Scientist. Data Architect Data Science Data Analyst Data Analysis in Excel Programming in R Introduction to Python/SQL/Tableau Data Visualization in R / Tableau Exploratory Data Analysis Data Scientist Inferential Statistics &

More information

Exceptions & a Taste of Declarative Programming in SQL

Exceptions & a Taste of Declarative Programming in SQL Exceptions & a Taste of Declarative Programming in SQL David E. Culler CS8 Computational Structures in Data Science http://inst.eecs.berkeley.edu/~cs88 Lecture 12 April 18, 2016 Computational Concepts

More information

Lecture #12: Quick: Exceptions and SQL

Lecture #12: Quick: Exceptions and SQL UC Berkeley EECS Adj. Assistant Prof. Dr. Gerald Friedland Computational Structures in Data Science Lecture #12: Quick: Exceptions and SQL Administrivia Open Project: Starts Monday! Creative data task

More information

SOFTWARE DEVELOPMENT: DATA SCIENCE

SOFTWARE DEVELOPMENT: DATA SCIENCE PROFESSIONAL CAREER TRAINING INSTITUTE SOFTWARE DEVELOPMENT: DATA SCIENCE www.pcti.edu/data-science applicant@pcti.edu 832-484-9100 PROGRAM OVERVIEW Prepare for a life changing career as a data scientist

More information

Pandas UDF Scalable Analysis with Python and PySpark. Li Jin, Two Sigma Investments

Pandas UDF Scalable Analysis with Python and PySpark. Li Jin, Two Sigma Investments Pandas UDF Scalable Analysis with Python and PySpark Li Jin, Two Sigma Investments About Me Li Jin (icexelloss) Software Engineer @ Two Sigma Investments Analytics Tools Smith Apache Arrow Committer Other

More information

CS108 Lecture 18: Databases and SQL

CS108 Lecture 18: Databases and SQL CS108 Lecture 18: Databases and SQL Databases for data storage and access The Structured Query Language Aaron Stevens 4 March 2013 What You ll Learn Today How does Facebook generate unique pages for each

More information

LECTURE 21. Database Interfaces

LECTURE 21. Database Interfaces LECTURE 21 Database Interfaces DATABASES Commonly, Python applications will need to access a database of some sort. As you can imagine, not only is this easy to do in Python but there is a ton of support

More information

Oracle Big Data Cloud Service, Oracle Storage Cloud Service, Oracle Database Cloud Service

Oracle Big Data Cloud Service, Oracle Storage Cloud Service, Oracle Database Cloud Service Demo Introduction Keywords: Oracle Big Data Cloud Service, Oracle Storage Cloud Service, Oracle Database Cloud Service Goal of Demo: Oracle Big Data Preparation Cloud Services can ingest data from various

More information

Prometheus. A Next Generation Monitoring System. Brian Brazil Founder

Prometheus. A Next Generation Monitoring System. Brian Brazil Founder Prometheus A Next Generation Monitoring System Brian Brazil Founder Who am I? Engineer passionate about running software reliably in production. Based in Ireland Core-Prometheus developer Contributor to

More information

An Introduction to Preparing Data for Analysis with JMP. Full book available for purchase here. About This Book... ix About The Author...

An Introduction to Preparing Data for Analysis with JMP. Full book available for purchase here. About This Book... ix About The Author... An Introduction to Preparing Data for Analysis with JMP. Full book available for purchase here. Contents About This Book... ix About The Author... xiii Chapter 1: Data Management in the Analytics Process...

More information

Pandas. Data Manipulation in Python

Pandas. Data Manipulation in Python Pandas Data Manipulation in Python 1 / 27 Pandas Built on NumPy Adds data structures and data manipulation tools Enables easier data cleaning and analysis import pandas as pd 2 / 27 Pandas Fundamentals

More information

NoSQL Databases An efficient way to store and query heterogeneous astronomical data in DACE. Nicolas Buchschacher - University of Geneva - ADASS 2018

NoSQL Databases An efficient way to store and query heterogeneous astronomical data in DACE. Nicolas Buchschacher - University of Geneva - ADASS 2018 NoSQL Databases An efficient way to store and query heterogeneous astronomical data in DACE DACE https://dace.unige.ch Data and Analysis Center for Exoplanets. Facility to store, exchange and analyse data

More information

NCSS: Databases and SQL

NCSS: Databases and SQL NCSS: Databases and SQL Tim Dawborn Lecture 1, January, 2016 Motivation SQLite SELECT WHERE JOIN Tips 2 Outline 1 Motivation 2 SQLite 3 Searching for Data 4 Filtering Results 5 Joining multiple tables

More information

Designing dashboards for performance. Reference deck

Designing dashboards for performance. Reference deck Designing dashboards for performance Reference deck Basic principles 1. Everything in moderation 2. If it isn t fast in database, it won t be fast in Tableau 3. If it isn t fast in desktop, it won t be

More information

Optimizer Challenges in a Multi-Tenant World

Optimizer Challenges in a Multi-Tenant World Optimizer Challenges in a Multi-Tenant World Pat Selinger pselinger@salesforce.come Classic Query Optimizer Concepts & Assumptions Relational Model Cost = X * CPU + Y * I/O Cardinality Selectivity Clustering

More information

CS 170 Algorithms Fall 2014 David Wagner HW12. Due Dec. 5, 6:00pm

CS 170 Algorithms Fall 2014 David Wagner HW12. Due Dec. 5, 6:00pm CS 170 Algorithms Fall 2014 David Wagner HW12 Due Dec. 5, 6:00pm Instructions. This homework is due Friday, December 5, at 6:00pm electronically via glookup. This homework assignment is a programming assignment

More information

Microsoft Excel & The Internet. J. Carlton Collins ASA Research

Microsoft Excel & The Internet. J. Carlton Collins ASA Research Microsoft Excel & The Internet J. Carlton Collins ASA Research Carlton@ASAResearch.com 770.734.0950 Excel and the Internet There are at least 9 good ways in which Excel and the Internet can work together,

More information

Extract API: Build sophisticated data models with the Extract API

Extract API: Build sophisticated data models with the Extract API Welcome # T C 1 8 Extract API: Build sophisticated data models with the Extract API Justin Craycraft Senior Sales Consultant Tableau / Customer Consulting My Office Photo Used with permission Agenda 1)

More information

Why I Use Python for Academic Research

Why I Use Python for Academic Research Why I Use Python for Academic Research Academics and other researchers have to choose from a variety of research skills. Most social scientists do not add computer programming into their skill set. As

More information

Big Data, Right Tools: Computational Resources for Empirical Research 2014

Big Data, Right Tools: Computational Resources for Empirical Research 2014 Big Data, Right Tools: Computational Resources for Empirical Research 2014 Dokyun Lee, PhD Candidate, OPIM Dept. July 30, 2014 The aim of this course is to familiarize beginning Wharton PhD studentswithbothpubliclyavailable

More information

42 Building a Report with a Text Pluggable Data Source

42 Building a Report with a Text Pluggable Data Source 42 Building a Report with a Text Pluggable Data Source Figure 42 1 Report output using a text PDS Reports Builder enables you to use any data source you wish. In this chapter, you will learn how to use

More information

10 things I wish I knew. about Machine Learning Competitions

10 things I wish I knew. about Machine Learning Competitions 10 things I wish I knew about Machine Learning Competitions Introduction Theoretical competition run-down The list of things I wish I knew Code samples for a running competition Kaggle the platform Reasons

More information

Data Foundations. Topic Objectives. and list subcategories of each. its properties. before producing a visualization. subsetting

Data Foundations. Topic Objectives. and list subcategories of each. its properties. before producing a visualization. subsetting CS 725/825 Information Visualization Fall 2013 Data Foundations Dr. Michele C. Weigle http://www.cs.odu.edu/~mweigle/cs725-f13/ Topic Objectives! Distinguish between ordinal and nominal values and list

More information

Investigating Source Code Reusability for Android and Blackberry Applications

Investigating Source Code Reusability for Android and Blackberry Applications Investigating Source Code Reusability for Android and Blackberry Applications Group G8 Jenelle Chen Aaron Jin 1 Outline Recaps Challenges with mobile development Problem definition Approach Demo Detailed

More information

#mstrworld. Analyzing Multiple Data Sources with Multisource Data Federation and In-Memory Data Blending. Presented by: Trishla Maru.

#mstrworld. Analyzing Multiple Data Sources with Multisource Data Federation and In-Memory Data Blending. Presented by: Trishla Maru. Analyzing Multiple Data Sources with Multisource Data Federation and In-Memory Data Blending Presented by: Trishla Maru Agenda Overview MultiSource Data Federation Use Cases Design Considerations Data

More information

Things You Will Most Likely Want to Do in TeamSnap

Things You Will Most Likely Want to Do in TeamSnap How to Use TeamSnap for Parents This is a How To Guide for parents of children playing in Beaumont Soccer Association who want to learn how to utilize TeamSnap effectively. TeamSnap helps Managers: Organize

More information

Pandas. Data Manipulation in Python

Pandas. Data Manipulation in Python Pandas Data Manipulation in Python 1 / 26 Pandas Built on NumPy Adds data structures and data manipulation tools Enables easier data cleaning and analysis import pandas as pd 2 / 26 Pandas Fundamentals

More information

Using PostgreSQL, Prometheus & Grafana for Storing, Analyzing and Visualizing Metrics

Using PostgreSQL, Prometheus & Grafana for Storing, Analyzing and Visualizing Metrics Using PostgreSQL, Prometheus & Grafana for Storing, Analyzing and Visualizing Metrics Erik Nordström, PhD Core Database Engineer hello@timescale.com github.com/timescale Why PostgreSQL? Reliable and familiar

More information

Data and Text Mining

Data and Text Mining Data representation and manipulation I prof. dr. Bojan Cestnik Temida d.o.o. & Jozef Stefan Institute Ljubljana bojan.cestnik@temida.si prof. dr. Bojan Cestnik 1 Contents Introduction Basic Data Mining

More information

Scalable Web Software. CS193S - Jan Jannink - 1/07/10

Scalable Web Software. CS193S - Jan Jannink - 1/07/10 Scalable Web Software CS193S - Jan Jannink - 1/07/10 Administrative Stuff Computer Forum Career Fair: Wed. 13, 11-4 Lawn between Hewlett Teaching Center and Gilbert Building Looking forward to your emails!

More information

A detailed comparison of EasyMorph vs Tableau Prep

A detailed comparison of EasyMorph vs Tableau Prep A detailed comparison of vs We at keep getting asked by our customers and partners: How is positioned versus?. Well, you asked, we answer! Short answer and are similar, but there are two important differences.

More information

Chapter The Juice: A Podcast Aggregator

Chapter The Juice: A Podcast Aggregator Chapter 12 The Juice: A Podcast Aggregator For those who may not be familiar, podcasts are audio programs, generally provided in a format that is convenient for handheld media players. The name is a play

More information

Data Science Bootcamp Curriculum. NYC Data Science Academy

Data Science Bootcamp Curriculum. NYC Data Science Academy Data Science Bootcamp Curriculum NYC Data Science Academy 100+ hours free, self-paced online course. Access to part-time in-person courses hosted at NYC campus Machine Learning with R and Python Foundations

More information

Python & Spark PTT18/19

Python & Spark PTT18/19 Python & Spark PTT18/19 Prof. Dr. Ralf Lämmel Msc. Johannes Härtel Msc. Marcel Heinz The Big Picture [Aggarwal15] Plenty of Building Blocks are involved in this Big Picture Back to the Big Picture [Aggarwal15]

More information

Financial Statements Using Crystal Reports

Financial Statements Using Crystal Reports Sessions 6-7 & 6-8 Friday, October 13, 2017 8:30 am 1:00 pm Room 616B Sessions 6-7 & 6-8 Financial Statements Using Crystal Reports Presented By: David Hardy Progressive Reports Original Author(s): David

More information

Training. Data Modelling. Framework Manager Projects (2 days) Contents

Training. Data Modelling. Framework Manager Projects (2 days) Contents We aim to provide you with the right training, at the right time and at the right price'. A cost effective solution to your business objectives. Our trainers are experts in IBM Cognos applications and

More information

Microsoft Access Illustrated. Unit B: Building and Using Queries

Microsoft Access Illustrated. Unit B: Building and Using Queries Microsoft Access 2010- Illustrated Unit B: Building and Using Queries Objectives Use the Query Wizard Work with data in a query Use Query Design View Sort and find data (continued) Microsoft Office 2010-Illustrated

More information

Dealing with Data Especially Big Data

Dealing with Data Especially Big Data Dealing with Data Especially Big Data INFO-GB-2346.01 Fall 2017 Professor Norman White nwhite@stern.nyu.edu normwhite@twitter Teaching Assistant: Frenil Sanghavi fps241@stern.nyu.edu Administrative Assistant:

More information

Please pick up your name card

Please pick up your name card L06: SQL 233 Announcements! Please pick up your name card - always come with your name card - If nobody answers my question, I will likely pick on those without a namecard or in the last row Polls on speed:

More information

Part 1: Collecting and visualizing The Movie DB (TMDb) data

Part 1: Collecting and visualizing The Movie DB (TMDb) data CSE6242 / CX4242: Data and Visual Analytics Georgia Tech Fall 2015 Homework 1: Analyzing The Movie DB dataset; SQLite; D3 Warmup; OpenRefine Due: Friday, 11 September, 2015, 11:55PM EST Prepared by Meera

More information

CITS4009 Introduction to Data Science

CITS4009 Introduction to Data Science School of Computer Science and Software Engineering CITS4009 Introduction to Data Science SEMESTER 2, 2017: CHAPTER 4 MANAGING DATA 1 Chapter Objectives Fixing data quality problems Organizing your data

More information

CSE 115. Introduction to Computer Science I

CSE 115. Introduction to Computer Science I CSE 115 Introduction to Computer Science I Road map Review HTML injection SQL injection Persisting data Central Processing Unit CPU Random Access Memory RAM persistent storage (e.g. file or database) Persisting

More information

Jaql. Kevin Beyer, Vuk Ercegovac, Eugene Shekita, Jun Rao, Ning Li, Sandeep Tata. IBM Almaden Research Center

Jaql. Kevin Beyer, Vuk Ercegovac, Eugene Shekita, Jun Rao, Ning Li, Sandeep Tata. IBM Almaden Research Center Jaql Running Pipes in the Clouds Kevin Beyer, Vuk Ercegovac, Eugene Shekita, Jun Rao, Ning Li, Sandeep Tata IBM Almaden Research Center http://code.google.com/p/jaql/ 2009 IBM Corporation Motivating Scenarios

More information

Today s Presentation

Today s Presentation Banish the I/O: Together, SSD and Main Memory Storage Accelerate Database Performance Today s Presentation Conventional Database Performance Optimization Goal: Minimize I/O Legacy Approach: Cache 21 st

More information

Help: Importing Contacts User Guide

Help: Importing Contacts User Guide Help: Importing Contacts User Guide Contents 1. PURPOSE OF THIS GUIDE:... 2 2. OVERVIEW OF THE IMPORT PROCESS:... 2 3. PREPARING YOUR IMPORT CONTACTS FILE... 3 4. STARTING THE IMPORT CONTACTS WIZARD...

More information

Best Practices for Choosing Content Reporting Tools and Datasources. Andrew Grohe Pentaho Director of Services Delivery, Hitachi Vantara

Best Practices for Choosing Content Reporting Tools and Datasources. Andrew Grohe Pentaho Director of Services Delivery, Hitachi Vantara Best Practices for Choosing Content Reporting Tools and Datasources Andrew Grohe Pentaho Director of Services Delivery, Hitachi Vantara Agenda Discuss best practices for choosing content with Pentaho Business

More information

Databases in Python. MySQL, SQLite. Accessing persistent storage (Relational databases) from Python code

Databases in Python. MySQL, SQLite. Accessing persistent storage (Relational databases) from Python code Databases in Python MySQL, SQLite Accessing persistent storage (Relational databases) from Python code Goal Making some data 'persistent' When application restarts When computer restarts Manage big amounts

More information

Introduction to Database Systems CSE 414

Introduction to Database Systems CSE 414 Introduction to Database Systems CSE 414 Lectures 4 and 5: Aggregates in SQL CSE 414 - Spring 2013 1 Announcements Homework 1 is due on Wednesday Quiz 2 will be out today and due on Friday CSE 414 - Spring

More information

Querying Data with Transact SQL

Querying Data with Transact SQL Course 20761A: Querying Data with Transact SQL Course details Course Outline Module 1: Introduction to Microsoft SQL Server 2016 This module introduces SQL Server, the versions of SQL Server, including

More information

Six Core Data Wrangling Activities. An introductory guide to data wrangling with Trifacta

Six Core Data Wrangling Activities. An introductory guide to data wrangling with Trifacta Six Core Data Wrangling Activities An introductory guide to data wrangling with Trifacta Today s Data Driven Culture Are you inundated with data? Today, most organizations are collecting as much data in

More information

HOST A GET CODING! CLUB TAKEOVER

HOST A GET CODING! CLUB TAKEOVER HOST A GET CODING! CLUB TAKEOVER www.getcodingkids.com #GetCoding @WalkerBooksUK GETTING STARTED THE LUCKY CAT CLUB We re The Lucky Cat Club! Welcome to our club takeover. Join us for a top-secret mission

More information

Python and Databases

Python and Databases Python and Databases Wednesday 25 th March CAS North East Conference, Newcastle Sue Sentance King s College London/CAS/Python School @suesentance sue.sentance@computingatschool.org.uk This handout includes

More information

Connecting Spotfire to Data Sources with Information Designer

Connecting Spotfire to Data Sources with Information Designer Connecting Spotfire to Data Sources with Information Designer Margot Goodwin, Senior Manager, Application Consulting September 15, 2016 HUMAN HEALTH ENVIRONMENTAL HEALTH 2014 PerkinElmer Spotfire Information

More information

CSC 411 Lecture 4: Ensembles I

CSC 411 Lecture 4: Ensembles I CSC 411 Lecture 4: Ensembles I Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla University of Toronto UofT CSC 411: 04-Ensembles I 1 / 22 Overview We ve seen two particular classification algorithms:

More information
