Computational Databases: Inspirations from Statistical Software. Linnea Passing, Technical University of Munich

Computational Databases: Inspirations from Statistical Software Linnea Passing, linnea.passing@tum.de Technical University of Munich

Data Science Meets Databases
- Data Cleansing: pipelines, fuzzy joins, data input errors (e.g., Alteryx)
- Statistics: pre-tests, handling missing data (e.g., SPSS, R)
- Big Data, BI: OLAP cubes, joins and aggregation, roll-up and drill-down (e.g., Spark, DBS)
- Data Mining: classification, clustering, forecasting, regression (e.g., R, Python)
- ML, AI: neural networks, linear algebra (e.g., TensorFlow)

Inspirations from Statistical Software
Similarities
- Connecting to flat files, RDBMS, Big Data environments
- Fixed-point numerics as main datatype
- Extensions for e.g. geospatial and streaming data
Differences
- New queries (e.g. statistical properties)
- Higher quality requirements (e.g. numerical stability)
- New workflows (e.g. metadata output)
- Richer semantics (e.g. handling missing values)

Outline
1. Optimizing common queries
2. Quality improvements
3. Workflows
4. Self-contained database

OPTIMIZING COMMON QUERIES

Schema, Joins, Enums
Statistical software:
- One big fact table, user-defined enum datatypes¹
- Derived columns are added to the fact table
Database systems:
- Normalized schema with fact and dimension tables (avoid anomalies, keep database small, ...)
- Derived columns can be added to (materialized) views (joins are then required to query all columns) or via ALTER TABLE ADD COLUMN
Inspiration:
- Column stores: light-weight ALTER TABLE ADD COLUMN, maybe even for non-materialized columns?
¹ e.g. education: {(0, none), (1, high school diploma), (2, college degree), (3, PhD degree), ...}

Cumulative and total values
Statistical software:
- E.g. variance explained using 1, 2, 3, ... factors
- E.g. number of participants per age group and in total
Database systems:
- Running sums (cumulative values) are supported via window functions
- No native support for analyses on multiple levels in one query: one has to use multiple GROUP BY expressions combined with UNION ALL, or non-standardized ROLLUP/CUBE
Inspiration:
- A combination of rollup/cube (implemented as part of GROUP BY) and window functions is required
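The two database-side techniques named on this slide can be sketched with Python's built-in `sqlite3` module. This is only an illustration under assumptions: the `participants` table and its contents are hypothetical, and SQLite is used merely because it ships with Python (window functions need SQLite 3.25 or newer; SQLite has no ROLLUP/CUBE, so the total is produced with UNION ALL).

```python
import sqlite3

# Hypothetical survey data: one row per participant.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE participants (id INTEGER PRIMARY KEY, age_group TEXT)")
conn.executemany(
    "INSERT INTO participants (age_group) VALUES (?)",
    [("18-29",), ("18-29",), ("30-49",), ("30-49",), ("30-49",), ("50+",)],
)

# Cumulative values: a running sum over the per-group counts via a window function.
running = conn.execute(
    """
    WITH per_group AS (
        SELECT age_group, COUNT(*) AS n
        FROM participants
        GROUP BY age_group
    )
    SELECT age_group, n, SUM(n) OVER (ORDER BY age_group) AS cumulative_n
    FROM per_group
    ORDER BY age_group
    """
).fetchall()

# Two aggregation levels in one result: per-group counts plus the grand total,
# stitched together with UNION ALL because a single GROUP BY yields one level.
levels = dict(
    conn.execute(
        """
        SELECT age_group, COUNT(*) FROM participants GROUP BY age_group
        UNION ALL
        SELECT 'total', COUNT(*) FROM participants
        """
    ).fetchall()
)

for row in running:
    print(row)
print(levels)
```

Note that the table is scanned once per aggregation level in the UNION ALL variant, which is one reason a native multi-level operator is attractive.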

Common Statistical Properties/Pre-tests
Statistical software:
- Often required for statistical methods
- E.g. homoscedasticity¹ for the Pearson correlation coefficient, normal distribution for the t-test
Database systems:
- Small materialized aggregates for common aggregates (SUM, MIN, ...)
Inspiration:
- Small materialized aggregates for common statistical properties/pre-tests, also spanning multiple columns, e.g. standard deviation of column x; covariance of columns x, y
¹ The variance around the regression line is the same for all values; this does not hold in e.g. weather forecast data
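One way to picture a small materialized aggregate for a statistical property is a per-block summary that can be merged without rereading the data. The sketch below is an assumption, not something the slides specify: the `Summary` class is hypothetical, and the merge step uses the pairwise update formula of Chan, Golub and LeVeque for combining means and sums of squared deviations.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Sequence


@dataclass
class Summary:
    """Hypothetical per-block summary: count, mean, sum of squared deviations."""
    n: int
    mean: float
    m2: float

    @classmethod
    def from_block(cls, values: Sequence[float]) -> "Summary":
        # Two passes over one (small) block: first the mean, then the deviations.
        n = len(values)
        mean = sum(values) / n
        m2 = sum((v - mean) ** 2 for v in values)
        return cls(n, mean, m2)

    def merge(self, other: "Summary") -> "Summary":
        # Pairwise merge (Chan, Golub, LeVeque): no access to the raw rows needed.
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Summary(n, mean, m2)

    @property
    def variance(self) -> float:
        # Sample variance (divides by n - 1).
        return self.m2 / (self.n - 1)

    @property
    def stddev(self) -> float:
        return sqrt(self.variance)


# Two blocks of a column x, summarized independently and merged at query time.
block_a = Summary.from_block([1.0, 2.0, 3.0, 4.0])
block_b = Summary.from_block([5.0, 6.0, 7.0, 8.0])
merged = block_a.merge(block_b)
print(merged.n, merged.mean, merged.variance)
```

The same idea extends to covariance of two columns by also storing the sum of co-deviations per block.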

QUALITY IMPROVEMENTS

High-quality Sampling Techniques
Statistical software:
- For running expensive statistical simulations on representative subsets
- To adjust weights for underrepresented participant groups in a survey
Database systems:
- Often only low-quality sampling using LIMIT or WHERE (ROWID % X) = 0, or expensive ORDER BY RAND() sampling, is supported
Inspiration:
- Providing high-quality sampling techniques that go with the query execution flow of databases
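As one possible example of a sampling technique that fits a streaming execution flow, the sketch below uses reservoir sampling (Algorithm R). The slides do not name a specific technique, so this choice and the function name `reservoir_sample` are assumptions. It draws a uniform sample of fixed size in a single pass, without sorting and without knowing the row count up front.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return a uniform random sample of up to k rows from a stream (Algorithm R)."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Row i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                reservoir[j] = row
    return reservoir


# The input can be any iterator, e.g. rows produced by a scan operator.
sample = reservoir_sample(range(1_000), k=10, seed=42)
print(sample)
```

Unlike `ORDER BY RAND()`, this needs only O(k) memory and no sort; unlike `LIMIT`, every row has the same chance of being selected.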

Numerical Stability
Statistical software:
- Often deals with small (e.g. z-transformed) numbers
- Stable two-pass implementations
Database systems:
- AVG is rewritten to SUM/COUNT
- Naive one-pass implementation of covariance
Challenge:
- Stable implementations require re-streaming or materialization of the input data, which is expensive
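The difference between the two implementation styles can be sketched as follows. Both functions are illustrative assumptions rather than the code of any particular system: the one-pass variant uses the textbook formula based on running sums, the two-pass variant first computes the means and then sums the products of deviations.

```python
from typing import Sequence


def covariance_one_pass(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Naive sample covariance from running sums; a single pass over the input."""
    n = len(xs)
    sum_x = sum_y = sum_xy = 0.0
    for x, y in zip(xs, ys):
        sum_x += x
        sum_y += y
        sum_xy += x * y
    # Subtracts two large, nearly equal numbers: prone to catastrophic cancellation.
    return (sum_xy - sum_x * sum_y / n) / (n - 1)


def covariance_two_pass(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Sample covariance in two passes: means first, then products of deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Values with a large offset and a small spread; the exact covariance is 5/3.
xs = [1e9 + i for i in range(4)]
ys = list(xs)

print("one-pass:", covariance_one_pass(xs, ys))
print("two-pass:", covariance_two_pass(xs, ys))
```

With inputs like these, the products in the one-pass variant are around 1e18, far beyond the range in which double precision can resolve a difference of a few units, so its result can deviate noticeably from 5/3. The two-pass variant works on small deviations but has to read the input twice.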

WORKFLOWS

Data Provenance and Scripts
Statistical software:
- A single GUI action often corresponds to multiple queries, e.g. adding statistical pre- and post-tests
- Algorithms are often run repeatedly to overcome local minima
- Interactive data preparation
Database systems:
- Transactional guarantees allow executing pre-tests and the actual actions on exactly the same data
Inspiration:
- Capture and replay of queries
- Provenance metadata attached to tables and views

Multiple Output Tables
Statistical software:
- One action/query often results in multiple output tables, e.g. containing additional statistics, results of pre- and post-tests, comments about result quality, footnotes
Database systems:
- The output of an operator/query can only be one relation. Thus, while many temporary tables can be created using common table expressions, a query can only materialize/output one table
Inspiration:
- Being able to output multiple relations?

SELF-CONTAINED DATABASE

Handling Missing Values / Replacing NULL values
Statistical software:
- The replacement value for missing data is stored as part of the schema and applied throughout all functions that use the column
Database systems:
- Replacement values for missing data can only be given per query, using COALESCE expressions, not as part of the schema
Inspiration:
- Achieving self-contained databases, where less business logic or domain knowledge is stored solely in applications
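The per-query nature of COALESCE can be illustrated with Python's built-in `sqlite3` module. The `survey` table, its values, and the replacement value 0.0 are hypothetical. The view at the end is only a common workaround that stores the replacement rule once in the database; it is not the schema-level feature the slide asks for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE survey (id INTEGER PRIMARY KEY, income REAL)")
conn.executemany(
    "INSERT INTO survey (income) VALUES (?)",
    [(10.0,), (None,), (30.0,)],  # the second participant gave no answer
)

# Default SQL semantics: aggregates silently skip NULL values.
avg_ignoring_nulls = conn.execute("SELECT AVG(income) FROM survey").fetchone()[0]

# Per-query replacement: every query must repeat the COALESCE itself.
avg_with_replacement = conn.execute(
    "SELECT AVG(COALESCE(income, 0.0)) FROM survey"
).fetchone()[0]

# Workaround: keep the replacement rule in a view so queries need not repeat it.
conn.execute(
    "CREATE VIEW survey_clean AS "
    "SELECT id, COALESCE(income, 0.0) AS income FROM survey"
)
avg_via_view = conn.execute("SELECT AVG(income) FROM survey_clean").fetchone()[0]

print(avg_ignoring_nulls, avg_with_replacement, avg_via_view)
```

Any query that bypasses the view falls back to the default NULL handling, which is the gap a replacement value stored in the schema would close.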

Restricting Computations / Levels of Measurement
Statistical software:
- Levels of measurement: nominal (enum, e.g. gender), ordinal (ordered enum, e.g. educational degree), interval (number with arbitrary zero point, e.g. degrees Celsius), ratio (number with meaningful zero point, e.g. quantity)
- Operations are restricted to what the level allows (e.g. no mean for ordinal data, no ratio for interval data)
Database systems:
- String, bool, numbers, (sometimes enum), and more
- Operations are restricted to what is mathematically possible
Inspiration:
- Enrich table metadata to achieve a more self-contained database
- Can speed up computations, e.g. GROUP BY on a known-to-be limited number of groups
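A minimal sketch of how level-of-measurement metadata could restrict operations is shown below. The `Level` enum, the `MIN_LEVEL` table, and the helper names are all hypothetical; the mapping of operations to minimum levels follows the usual textbook convention, which matches the examples on the slide (no mean for ordinal data, no ratio for interval data).

```python
from enum import IntEnum
from statistics import mean
from typing import Sequence


class Level(IntEnum):
    """Levels of measurement, ordered from least to most permissive."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Minimum level a column must have for each operation (illustrative selection).
MIN_LEVEL = {
    "mode": Level.NOMINAL,
    "median": Level.ORDINAL,
    "mean": Level.INTERVAL,
    "ratio": Level.RATIO,
}


def is_allowed(operation: str, level: Level) -> bool:
    """Check whether an operation is meaningful at the given level."""
    return level >= MIN_LEVEL[operation]


def checked_mean(values: Sequence[float], level: Level) -> float:
    """Compute the mean only if the column's level of measurement permits it."""
    if not is_allowed("mean", level):
        raise TypeError(f"mean is not defined for {level.name.lower()} data")
    return mean(values)


print(is_allowed("mean", Level.ORDINAL))    # education codes: mean not meaningful
print(is_allowed("ratio", Level.INTERVAL))  # degrees Celsius: ratios not meaningful
print(checked_mean([18.5, 21.0, 23.5], Level.INTERVAL))
```

In a database, the level would be stored as column metadata and checked when a query is analyzed, rather than at run time as in this sketch.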

Summary and Outlook
- Enabling new types of data scientists to do new types of analytics in computational databases
- Focus on statistics also benefits other types of data scientists and workloads
- Further inspirations for database systems