Data Science with PostgreSQL

Similar documents
BIG DATA COURSE CONTENT

End-to-End data mining feature integration, transformation and selection with Datameer Datameer, Inc. All rights reserved.

Big Data Applications with Spring XD

KNIME for the life sciences Cambridge Meetup

Specialist ICT Learning

Accessing other data fdw, dblink, pglogical, plproxy,...

Oracle Big Data Cloud Service, Oracle Storage Cloud Service, Oracle Database Cloud Service

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

ACHIEVEMENTS FROM TRAINING

Prototyping Data Intensive Apps: TrendingTopics.org

Python With Data Science

Data Science and Open Source Software. Iraklis Varlamis Assistant Professor Harokopio University of Athens

Data 101 Which DB, When. Joe Yong Azure SQL Data Warehouse, Program Management Microsoft Corp.

Big Data with Hadoop Ecosystem

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Oracle Big Data. A NA LYT ICS A ND MA NAG E MENT.

SCIENCE. An Introduction to Python Brief History Why Python Where to use

Webinar Series TMIP VISION

Innovatus Technologies

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Challenges for Data Driven Systems

Big Data Architect.

Shen PingCAP 2017

Data Science Bootcamp Curriculum. NYC Data Science Academy

Spotfire Advanced Data Services. Lunch & Learn Tuesday, 21 November 2017

@Pentaho #BigDataWebSeries

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Data Analyst Nanodegree Syllabus

CSE 444: Database Internals. Lecture 23 Spark

Databases and Big Data Today. CS634 Class 22

Oral Questions and Answers (DBMS LAB) Questions & Answers- DBMS

Developer Internship Opportunity at I-CC

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

Lecture 25 Overview. Last Lecture Query optimisation/query execution strategies

Spotfire Data Science with Hadoop Using Spotfire Data Science to Operationalize Data Science in the Age of Big Data

CIS 612 Advanced Topics in Database Big Data Project Lawrence Ni, Priya Patil, James Tench

DATABASE DESIGN II - 1DL400

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018

The Future of Analytics or The New SQL

COPYRIGHT DATASHEET

Unifying Big Data Workloads in Apache Spark

Data 101 Which DB, When Joe Yong Sr. Program Manager Microsoft Corp.

CIB Session 12th NoSQL Databases Structures

Stages of Data Processing

Index. bfs() function, 225 Big data characteristics, 2 variety, 3 velocity, 3 veracity, 3 volume, 2 Breadth-first search algorithm, 220, 225

Verarbeitung von Vektor- und Rasterdaten auf der Hadoop Plattform DOAG Spatial and Geodata Day 2016

Oracle Big Data Connectors

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?

Tour of Database Platforms as a Service. June 2016 Warner Chaves Christo Kutrovsky Solutions Architect

Data Lake Based Systems that Work

Management Information Systems Review Questions. Chapter 6 Foundations of Business Intelligence: Databases and Information Management

The TIBCO Insight Platform 1. Data on Fire 2. Data to Action. Michael O Connell Catalina Herrera Peter Shaw September 7, 2016

Who we are: Database Research - Provenance, Integration, and more hot stuff. Boris Glavic. Department of Computer Science

Chapter 6. Foundations of Business Intelligence: Databases and Information Management VIDEO CASES

Tackling Big Data Using MATLAB

Review - Relational Model Concepts

DATA SCIENCE USING SPARK: AN INTRODUCTION

Databricks, an Introduction

TECHNOLOGY SOLUTION EVOLUTION

Approaching the Petabyte Analytic Database: What I learned

Going Big Data on Apache Spark. KNIME Italy Meetup

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Part I What are Databases?

Python & Spark PTT18/19

Big Data and FrameWorks; Perspectives to Applied Machine Learning

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Data Architectures in Azure for Analytics & Big Data

Cloud Computing & Visualization

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture

Data Analyst Nanodegree Syllabus

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

Luigi Build Data Pipelines of batch jobs. - Pramod Toraskar

Hal Varian, Google s Chief Economist The McKinsey Quarterly, Jan 2009

Application monitoring with BELK. Nishant Sahay, Sr. Architect Bhavani Ananth, Architect

Higher level data processing in Apache Spark

An Introduction to Big Data Formats

Data in the Cloud and Analytics in the Lake

Certified Data Science with Python Professional VS-1442

DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI

Facebook data extraction using R & process in Data Lake

Introducing Microsoft SQL Server 2016 R Services. Julian Lee Advanced Analytics Lead Global Black Belt Asia Timezone

Understanding the latent value in all content

Is NiFi compatible with Cloudera, Map R, Hortonworks, EMR, and vanilla distributions?

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks

Netezza The Analytics Appliance

Module - 17 Lecture - 23 SQL and NoSQL systems. (Refer Slide Time: 00:04)

SQL in the Hybrid World

Talend Big Data Sandbox. Big Data Insights Cookbook

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

PROFESSIONAL. NoSQL. Shashank Tiwari WILEY. John Wiley & Sons, Inc.

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

microsoft

Hadoop Online Training

R Language for the SQL Server DBA

Dealing with Data Especially Big Data

Dr. SubraMANI Paramasivam. Think & Work like a Data Scientist with SQL 2016 & R

Hadoop An Overview. - Socrates CCDH

Big Data Infrastructures & Technologies

Computational Databases: Inspirations from Statistical Software. Linnea Passing, Technical University of Munich

Transcription:

Balázs Bárány Data Scientist pgconf.de 2015

Contents Introduction What is Data Science? Process model Tools and methods of Data Scientists Business & data understanding Preprocessing Modeling Evaluation Deployment Summary

Introduction What is Data Science? Sexiest job of the 21st century According to Google, LinkedIn,...

Introduction What is Data Science? Sexiest job of the 21st century According to Google, LinkedIn,... Who is a Data Scientist?

Introduction What is Data Science? Data Science Venn Diagram (c) Drew Conway, 2010. CC-BY-NC

Introduction What is Data Science? Tasks of data scientists Get data from various sources Big data?

Introduction What is Data Science? Tasks of data scientists Get data from various sources Big data? Mash up & format for analysis

Introduction What is Data Science? Tasks of data scientists Get data from various sources Big data? Mash up & format for analysis Analyze & visualize

Introduction What is Data Science? Tasks of data scientists Get data from various sources Big data? Mash up & format for analysis Analyze & visualize Predict & prescribe

Introduction What is Data Science? Tasks of data scientists Get data from various sources Big data? Mash up & format for analysis Analyze & visualize Predict & prescribe Operationalize

Introduction What is Data Science? Process model The Data Mining process Cross Industry Standard Process for Data Mining (Kenneth Jensen/Wikimedia Commons)

Tools and methods of Data Scientists Tools and methods Tools and methods

Tools and methods of Data Scientists Scripting and programming R Python with extensions Octave/Matlab, other mathematic languages Hadoop and Big Data programming libraries (mostly Java) Cloud services

Tools and methods of Data Scientists Integrated GUI tools (partly) Open Source: RapidMiner, KNIME, Orange Data Warehouse tools extended for analytics: Pentaho, Talend Many commercial tools, e. g. SAS, IBM SPSS Hadoop-related newcomers: e. g. Datameer

Tools and methods of Data Scientists Data Infrastructure Databases and data stores Relational, NoSQL Hadoop clusters In-memory Data streams Free-form data: text, images, video, audio,... Web APIs Open Data

Tools and methods of Data Scientists Data acquisition and preprocessing Data ingestion in raw format

Tools and methods of Data Scientists Data acquisition and preprocessing Data ingestion in raw format Joining, aggregating, ltering, calculating,...

Tools and methods of Data Scientists Data acquisition and preprocessing Data ingestion in raw format Joining, aggregating, ltering, calculating,... Data cleansing Missing values Abnormal values

Tools and methods of Data Scientists Data acquisition and preprocessing Data ingestion in raw format Joining, aggregating, ltering, calculating,... Data cleansing Missing values Abnormal values Result: data set suitable for analytics

Tools and methods of Data Scientists Predictive Modeling Supervised and unsupervised methods Target variable known or not

Tools and methods of Data Scientists Predictive Modeling Supervised and unsupervised methods Target variable known or not Classication (supervised): Prediction of a class or category Regression (supervised): Prediction of numeric value

Tools and methods of Data Scientists Predictive Modeling Supervised and unsupervised methods Target variable known or not Classication (supervised): Prediction of a class or category Regression (supervised): Prediction of numeric value Clustering (unsupervised): Automatic grouping of data Association analysis, outlier detection, time series prediction,...

Tools and methods of Data Scientists Deployment and operationalization Model application to new data => prediction + condence What to do with predictions?

Tools and methods of Data Scientists Deployment and operationalization Model application to new data => prediction + condence What to do with predictions? Store in ERP or CRM Tell someone (email, popup) Add a label (e. g. mark email as spam)

Tools and methods of Data Scientists Deployment and operationalization Model application to new data => prediction + condence What to do with predictions? Store in ERP or CRM Tell someone (email, popup) Add a label (e. g. mark email as spam) Interrupt nancial transaction => prescription Order supplies => prescription...

Doing

Caveats This stu is not easy

Caveats This stu is not easy Must be root and postgres Maintain your PostgreSQL yourself Able to compile stu

Caveats This stu is not easy Must be root and postgres Maintain your PostgreSQL yourself Able to compile stu You should ask ;-) your boss co-workers customer

Business & data understanding

Business & data understanding Business understanding What is the purpose of the business? What are existing processes? Drivers of business success

Business & data understanding Business understanding What is the purpose of the business? What are existing processes? Drivers of business success Project goals and challenges Availability of data and resources Success criteria

Business & data understanding Business understanding What is the purpose of the business? What are existing processes? Drivers of business success Project goals and challenges Availability of data and resources Success criteria Not a technical activity, PostgreSQL can't help much

Business & data understanding Data understanding Existing data Entities and covered concepts Complete? Correct? In suitable form? Usable? (regulations, access constraints,...)

Business & data understanding Data understanding Existing data Entities and covered concepts Complete? Correct? In suitable form? Usable? (regulations, access constraints,...) Connecting separate data sources Simple or complex IDs

Business & data understanding Data understanding Existing data Entities and covered concepts Complete? Correct? In suitable form? Usable? (regulations, access constraints,...) Connecting separate data sources Data size Simple or complex IDs Too small Too big

Business & data understanding Data understanding Existing data Entities and covered concepts Complete? Correct? In suitable form? Usable? (regulations, access constraints,...) Connecting separate data sources Data size Simple or complex IDs Too small Too big Suitability for predictive modeling Target variable? Attribute types

Business & data understanding Data understanding with PostgreSQL Get data into PostgreSQL Classical import process Foreign Data Wrappers

Business & data understanding Data understanding with PostgreSQL Get data into PostgreSQL Classical import process Foreign Data Wrappers Analyze data distribution Group by and aggregate Count, Count Distinct, Min, Max Count NULLs Search for missing links (incomplete foreign keys)

Business & data understanding Data understanding with PostgreSQL Get data into PostgreSQL Classical import process Foreign Data Wrappers Analyze data distribution Group by and aggregate Count, Count Distinct, Min, Max Count NULLs Search for missing links (incomplete foreign keys) Analyze surprizes Impossible values Missing values in required elds

Business & data understanding Data understanding with PostgreSQL summary Good SQL knowledge required Tedious manual process repetitive not suitable for large number of attributes No built-in visualization

Business & data understanding Data understanding with PostgreSQL summary Good SQL knowledge required Tedious manual process repetitive not suitable for large number of attributes No built-in visualization Or maybe...

Business & data understanding SQL barchart output

Business & data understanding Bar chart from GUI tool

Business & data understanding Boxplot output

Business & data understanding Data understanding wrap up DBMS not built for this It can support more specialized tools

Business & data understanding Data understanding wrap up DBMS not built for this It can support more specialized tools Introduction: R A free software environment for statistical computing and graphics Available in PostgreSQL

Business & data understanding PL/R: A statistical language for PostgreSQL R as a standalone language Mathematical and statistical methods Powerful visualization functions Classical, modern and bleeding edge modeling Arrays and data frames are central data types Operates only in memory

Business & data understanding PL/R: A statistical language for PostgreSQL R as a standalone language Mathematical and statistical methods Powerful visualization functions Classical, modern and bleeding edge modeling Arrays and data frames are central data types Operates only in memory PL/R: R as a loadable procedural language for PostgreSQL First released in 2003 by Joe Conway License: GPL

Business & data understanding R usage outside of PostgreSQL Development environments Frontends RStudio (AGPL or commercial, local & web) RKWard, Cantor (KDE projects) StatET (Eclipse) R Commander Deducer Rattle Web framework: Shiny (AGPL or commercial)

Business & data understanding Working with R in PostgreSQL Install functions in the database Example select install_rcmd(' myfunction <-function(x) {print(x)} '); Install without function body Example CREATE FUNCTION rnorm (n integer, mean double precision, sd double precision) RETURNS double precision[] AS LANGUAGE 'plr';

Business & data understanding Using R in PostgreSQL for data understanding Advanced visualization Data distributions Advanced statistics

Business & data understanding Using R in PostgreSQL for data understanding Advanced visualization Data distributions Advanced statistics Execution in the database Clumsy, but direct data access

Business & data understanding Using R in PostgreSQL for data understanding Advanced visualization Data distributions Advanced statistics Execution in the database Clumsy, but direct data access Execution outside Simple and interactive, but data transfer

Preprocessing Preprocessing

Preprocessing Preprocessing What databases are built for

Preprocessing Preprocessing What databases are built for Rows: very dynamic Easy to create new rows by joining Easy to lter Columns: not so much Easy to create new columns Only explicit access

Preprocessing Preprocessing What databases are built for Rows: very dynamic Easy to create new rows by joining Easy to lter Columns: not so much Easy to create new columns Only explicit access Wider interpretation of preprocessing Enrichment with external data New attributes from existing ones Recoding, recalculation Missing value handling

Preprocessing Preprocessing: organizing workow Common Table Expressions organize processing steps partial and intermediate results Example WITH source AS ( SELECT *, ROW_NUMBER() OVER () AS rownum FROM source_table ), no_missings AS ( SELECT * FROM source WHERE field1 IS NOT NULL AND field2 IS NOT NULL ) etc.

Preprocessing Preprocessing: attribute creation Aggregation

Preprocessing Preprocessing: attribute creation Aggregation Partial aggregation by window functions In-group measures, e. g. ratio att / SUM(att) OVER (PARTITION BY...)

Preprocessing Preprocessing: attribute creation Aggregation Partial aggregation by window functions In-group measures, e. g. ratio att / SUM(att) OVER (PARTITION BY...) In-group numbering ROW_NUMBER() OVER (PARTITION BY... ORDER BY...)

Preprocessing Preprocessing: attribute creation Aggregation Partial aggregation by window functions In-group measures, e. g. ratio att / SUM(att) OVER (PARTITION BY...) In-group numbering ROW_NUMBER() OVER (PARTITION BY... ORDER BY...) Comparing to previous/next value att - LAG(att, 1) OVER (ORDER BY...)

Preprocessing Preprocessing: attribute creation Aggregation Partial aggregation by window functions In-group measures, e. g. ratio att / SUM(att) OVER (PARTITION BY...) In-group numbering ROW_NUMBER() OVER (PARTITION BY... ORDER BY...) Comparing to previous/next value att - LAG(att, 1) OVER (ORDER BY...) Much easier in SQL than programming languages and data mining tools

Preprocessing Preprocessing: enrichment PostGIS for geodata

Preprocessing Preprocessing: enrichment PostGIS for geodata Foreign data wrappers (see PostgreSQL Wiki)

Preprocessing Preprocessing: enrichment PostGIS for geodata Foreign data wrappers (see PostgreSQL Wiki) Other databases (other PostgreSQL server, MySQL, Oracle, MSSQL, JDBC, SQL Alchemy...) NoSQL databases (MongoDB, Cassandra, CouchDB, Redis,...)

Preprocessing Preprocessing: enrichment PostGIS for geodata Foreign data wrappers (see PostgreSQL Wiki) Other databases (other PostgreSQL server, MySQL, Oracle, MSSQL, JDBC, SQL Alchemy...) NoSQL databases (MongoDB, Cassandra, CouchDB, Redis,...) Big Data (Hadoop Hive, Impala)

Preprocessing Preprocessing: enrichment PostGIS for geodata Foreign data wrappers (see PostgreSQL Wiki) Other databases (other PostgreSQL server, MySQL, Oracle, MSSQL, JDBC, SQL Alchemy...) NoSQL databases (MongoDB, Cassandra, CouchDB, Redis,...) Big Data (Hadoop Hive, Impala) Network sources - Multicorn (RSS, IMAP, Twitter, S3,...) Files (CSV, ZIP, JSON,...)

Preprocessing Preprocessing: enrichment PostGIS for geodata Foreign data wrappers (see PostgreSQL Wiki) Other databases (other PostgreSQL server, MySQL, Oracle, MSSQL, JDBC, SQL Alchemy...) NoSQL databases (MongoDB, Cassandra, CouchDB, Redis,...) Big Data (Hadoop Hive, Impala) Network sources - Multicorn (RSS, IMAP, Twitter, S3,...) Files (CSV, ZIP, JSON,...) Write your own in C or Python or Ruby

Modeling Modeling

Modeling Model development Machine learning algorithms not well suited for SQL

Modeling Model development Machine learning algorithms not well suited for SQL Some attempts to build them Naive Bayes, Linear Regression Dicult for more advanced algorithms

Modeling Model development Machine learning algorithms not well suited for SQL Some attempts to build them Naive Bayes, Linear Regression Dicult for more advanced algorithms Better done in specialized language or tool PL/R PL/Python

Modeling PL/Python Python procedural language available in PostgreSQL scikit-learn: Machine learning toolbox for Python Classication, regression, clustering Model selection, validation Preprocessing matplotlib: Generic and statistical plotting library

Modeling PL/Python Python procedural language available in PostgreSQL scikit-learn: Machine learning toolbox for Python Classication, regression, clustering Model selection, validation Preprocessing matplotlib: Generic and statistical plotting library PL/Python is an alternative to PL/R

Modeling Evaluation

Modeling Evaluation of modeling results Models return predictions Prediction can be compared to known result (target variable) Measures of model performance: Accuracy, precision, recall,...

Modeling Evaluation of modeling results Models return predictions Prediction can be compared to known result (target variable) Measures of model performance: Accuracy, precision, recall,... Results on the training set meaningless

Modeling Evaluation of modeling results Models return predictions Prediction can be compared to known result (target variable) Measures of model performance: Accuracy, precision, recall,... Results on the training set meaningless Split validation Cross validation

Modeling Evaluation results Good result depends on the application If not good enough,

Modeling Evaluation results Good result depends on the application If not good enough, get more data

Modeling Evaluation results Good result depends on the application If not good enough, get more data do more preprocessing

Modeling Evaluation results Good result depends on the application If not good enough, get more data do more preprocessing select better classier

Modeling Evaluation results Good result depends on the application If not good enough, get more data do more preprocessing select better classier optimize classier parameters

Modeling Evaluation results Good result depends on the application If not good enough, get more data do more preprocessing select better classier optimize classier parameters Cycle: preprocessing - modeling - evaluation

Modeling Evaluation results Good result depends on the application If not good enough, get more data do more preprocessing select better classier optimize classier parameters Cycle: preprocessing - modeling - evaluation Better done in data mining environment

Deployment Deployment

Deployment Deployment Advantages of deployment in the database: Less overhead

Deployment Deployment Advantages of deployment in the database: Less overhead Instant application using triggers

Deployment Deployment Advantages of deployment in the database: Less overhead Instant application using triggers Well-known execution environment

Deployment Deployment Advantages of deployment in the database: Less overhead Instant application using triggers Well-known execution environment Functionality available over standard interface

Deployment Deployment Advantages of deployment in the database: Less overhead Instant application using triggers Well-known execution environment Functionality available over standard interface Some models easily expressed in SQL

Deployment Deployment of PL/R or PL/Python models Model developed in database or outside

Deployment Deployment of PL/R or PL/Python models Model developed in database or outside Put into global context PL/R: load(saved object,.globalenv) PL/Python: Global dictionary GD Application function in matching language Uses existing model Returns target variable

Deployment Deployment of PL/R or PL/Python models Model developed in database or outside Put into global context PL/R: load(saved object,.globalenv) PL/Python: Global dictionary GD Application function in matching language Uses existing model Returns target variable Trigger func or UPDATE uses application function

Summary Summary PostgreSQL's support for data science tasks Best: preprocessing, deployment

Summary Summary PostgreSQL's support for data science tasks Best: preprocessing, deployment Modern SQL for preprocessing

Summary Summary PostgreSQL's support for data science tasks Best: preprocessing, deployment Modern SQL for preprocessing Foreign Data Wrappers for data integration

Summary Summary PostgreSQL's support for data science tasks Best: preprocessing, deployment Modern SQL for preprocessing Foreign Data Wrappers for data integration Procedural languages for data mining

Summary Questions? Balázs Bárány, <balazs@tud.at> https://datascientist.at/