Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix


Carlos Ordonez, Yiqun Zhang
Department of Computer Science, University of Houston, USA

Abstract. We study the serial and parallel computation of Γ (Gamma), a comprehensive data summarization matrix for linear Gaussian models, widely used in big data analytics. Computing Gamma can be reduced to a single matrix multiplication with the data set, where such multiplication can be evaluated as a sum of vector outer products, which enables incremental and parallel computation, essential features for scalable computation. By exploiting Gamma, iterative algorithms are reorganized to work in two phases: (1) incremental, parallel data set summarization (i.e., in one scan and distributive); (2) iteration in main memory exploiting the summarization matrix in intermediate matrix computations (i.e., reducing the number of scans). Most intermediate computations on large matrices collapse to computations based on Gamma, a much smaller matrix. We present specialized database algorithms for dense and sparse matrices, respectively. Assuming a distributed memory model (i.e., shared-nothing) and a larger number of points than processing nodes, we show the parallel computation of Gamma has close to linear speedup. We explain how to compute Gamma with existing database system processing mechanisms and their impact on time complexity.

1 Introduction

Machine learning models generally require demanding iterative computations on matrices. This problem becomes significantly more difficult when the input matrices (i.e., the data set) are large, reside on secondary storage, and must be processed in parallel. We propose the Γ (Gamma) data summarization matrix [7] for linear Gaussian models [8]. From a computational perspective Γ has outstanding features: it produces an exact solution, it can be incrementally computed in one pass, and it can be computed in parallel with low communication overhead in a distributed memory model (i.e., shared-nothing [1]). We show that computing Γ can be reduced to a single matrix self-multiplication, and we study the time and space complexity and the parallel speedup of such multiplication.

2 Definitions

We first define X, the input data set. Let X = {x_1, ..., x_n}, where x_i is a column vector in R^d. In other words, X is a d × n matrix; intuitively, X can be visualized as a wide rectangular matrix.

Supervised (predictive) models require an extra attribute. For regression models X is augmented with a (d+1)th dimension containing an output variable Y. For classification models, on the other hand, there is an extra discrete attribute G, where G is in general binary (e.g., false/true, bad/good). We use i = 1, ..., n and j = 1, ..., d as matrix subscripts. The model, a bag of vectors and matrices, is called Θ; it can represent principal component analysis (PCA) [4], linear regression [5], K-means clustering [3], and Naive Bayes classification [6], among others.

3 The Gamma Data Summarization Matrix

We introduce a general augmented matrix Z, obtained by prefixing X with a row of 1s and appending a (d+2)th row containing vector Y (in transposed form). Since X is d × n, Z is (d+2) × n, where row [1] contains 1s and row [d+2] is Y. For classification and clustering models X is partitioned using a discrete attribute G (e.g., class 0/1, cluster 1, ..., k). In this paper, we focus on Z extended with Y.

Multidimensional sufficient statistics [4] summarize statistical properties of a data set: n = |X|, L = \sum_{i=1}^{n} x_i, and Q = XX^T. Intuitively, n counts points, L is a simple linear sum (a vector) and Q is a quadratic matrix. We now introduce Γ, our main contribution, which generalizes these sufficient statistics:

\Gamma = \begin{bmatrix} n & L^T & \mathbf{1}^T Y^T \\ L & Q & X Y^T \\ Y \mathbf{1} & Y X^T & Y Y^T \end{bmatrix}
       = \begin{bmatrix} n & \sum x_i^T & \sum y_i \\ \sum x_i & \sum x_i x_i^T & \sum y_i x_i \\ \sum y_i & \sum y_i x_i^T & \sum y_i^2 \end{bmatrix}

The fundamental property of Γ is that it can be computed by a single matrix multiplication with Z: Γ = ZZ^T. Notice ZZ^T ≠ Z^T Z, a property of Gramian matrices [2]. Matrix Γ is fundamentally the result of squaring matrix Z. Γ is much smaller than X for large n: O(d^2) ≪ O(dn); it is symmetric and computable via vector outer products. The proposition below gives two equivalent forms to summarize the data set: one based on matrix-matrix multiplication and a second one based on vector-vector multiplication, which is more efficient for incremental and parallel processing.

Proposition: Γ can be equivalently computed as follows:

\Gamma = ZZ^T = \sum_{i=1}^{n} z_i z_i^T.   (1)
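To make the two equivalent forms in Equation (1) concrete, the following minimal NumPy sketch (illustrative only; the toy sizes and array names are not from the paper) builds the augmented matrix Z for a small data set and checks that the single multiplication ZZ^T matches the sum of outer products z_i z_i^T:

```python
import numpy as np

# Toy data set: X is d x n (points are column vectors), Y holds one output per point.
d, n = 3, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(d, n))
Y = rng.normal(size=(1, n))

# Augmented matrix Z: a row of 1s, then X, then Y as the (d+2)th row.
Z = np.vstack([np.ones((1, n)), X, Y])      # (d+2) x n

# Form 1: one matrix self-multiplication.
Gamma_mm = Z @ Z.T                          # (d+2) x (d+2), symmetric

# Form 2: sum of vector outer products, the incremental / parallel-friendly form.
Gamma_outer = np.zeros((d + 2, d + 2))
for i in range(n):
    z_i = Z[:, i:i + 1]                     # column vector z_i
    Gamma_outer += z_i @ z_i.T

assert np.allclose(Gamma_mm, Gamma_outer)

# The classical sufficient statistics can be read off Gamma.
n_points = Gamma_mm[0, 0]                   # n
L = Gamma_mm[1:d + 1, 0]                    # L = sum of x_i
Q = Gamma_mm[1:d + 1, 1:d + 1]              # Q = X X^T
```

The outer-product form is the one Phase 1 exploits below: Γ can be updated incrementally as each z_i is read, and partial sums over disjoint partitions of X can simply be added at the end.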

4 Serial Computation

Recall from Section 2 that Θ represents a model. We start by presenting the standard iterative algorithm to compute a model Θ in two phases. Phase 1: initialize Θ. Phase 2: iterate until convergence (i.e., until Θ does not change), reading X from secondary storage at every iteration (once or several times per iteration). Our proposed, equivalent, optimized two-phase iterative algorithm based on Γ follows. Phase 1: compute Γ, updating Γ in main memory while reading X from secondary storage; initialize Θ from Γ. Phase 2: iterate until convergence exploiting Γ in main memory in intermediate matrix computations.

Lemma (succinctness, asymptotically small size): Given a large data set X where d ≪ n, the size of matrix Γ is O(d^2) ≪ O(dn) as n → ∞.

Our time and space analysis is based on the two-phase algorithm introduced above. Phase 1, which corresponds to the summarization matrix operator, is the most important. Computing Γ with a dense matrix multiplication is O(d^2 n). On the other hand, computing Γ with a sparse matrix multiplication is O(k^2 n) in the average case, assuming k entries of x_i are non-zero; assuming X is hyper-sparse with k^2 = O(d), the time is O(dn) on average. In Phase 2 we take advantage of Γ to accelerate computations involving X. Since we compute matrix factorizations derived from Γ, time is Ω(d^3), and for a dense matrix it may approach O(d^4) when the number of iterations of the factorization's numerical method is proportional to d. In short, the time complexity of Phase 2 for the models we consider does not depend on n. The I/O cost is just that of reading X, with the corresponding number of I/Os in time O(dn) for dense matrix storage and O(kn) for sparse matrix storage.

5 Parallel Computation

Let N be the number of processing nodes, under a distributed memory architecture (shared-nothing [1]). That is, each node has its own main memory and secondary storage. We assume d ≪ n and N ≪ n, but d is independent from N (either may be the larger). Our parallel algorithm follows. Phase 1, with N nodes: compute Γ in parallel, updating Γ in main memory while reading X from secondary storage. Phase 2, with one node: iterate until convergence exploiting Γ in main memory in intermediate matrix computations to compute Θ.

Recall we assume d < n and n → ∞. For Phase 1 each x_i ∈ X is hashed to one of the N processors. Then our matrix operator can compute each outer product z_i z_i^T in parallel, resulting in O(d^2 n/N) work per processor for a dense matrix and O(dn/N) work per processor for a sparse matrix. That is, both the dense and the sparse matrix multiplication algorithms become optimal as N increases or n → ∞. Given its importance from a time complexity point of view, we present the following result as a theorem.

Theorem (linear speedup): Let T_j be the processing time using j nodes, where 1 ≤ j ≤ N. Under our main assumption, and provided Θ fits in main memory, our optimized algorithm gets close to optimal speedup: T_1/T_N ∈ O(N).
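The parallel Phase 1 can be pictured with the short sketch below: each of N workers scans only its partition of X (here a hash partition of the columns of Z), accumulates a local partial Γ as a sum of outer products, and a coordinator adds the N partial matrices, the sequential-coordinator scheme analyzed in the lemmas that follow. Python's multiprocessing is only a stand-in for the shared-nothing processing nodes; names and sizes are illustrative.

```python
import numpy as np
from multiprocessing import Pool

def partial_gamma(Z_part):
    """Local Phase 1 on one worker: sum of outer products over its partition."""
    rows = Z_part.shape[0]
    G = np.zeros((rows, rows))
    for i in range(Z_part.shape[1]):
        z_i = Z_part[:, i:i + 1]
        G += z_i @ z_i.T                        # O(d^2) per point
    return G

if __name__ == "__main__":
    d, n, N = 4, 10_000, 4
    rng = np.random.default_rng(1)
    X = rng.normal(size=(d, n))
    Y = rng.normal(size=(1, n))
    Z = np.vstack([np.ones((1, n)), X, Y])      # (d+2) x n

    # Hash-partition the points (columns of Z) across N workers.
    parts = [Z[:, np.arange(n) % N == w] for w in range(N)]

    with Pool(N) as pool:
        partials = pool.map(partial_gamma, parts)   # O(d^2 n / N) work per worker

    # Coordinator: add the N partial Gammas, O(N d^2) merge/communication cost.
    Gamma = sum(partials)
    assert np.allclose(Gamma, Z @ Z.T)
```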

For completeness, we provide an analysis of a parallel system where N is large, that is, N = O(n).

Lemma (sequential communication bottleneck): Assume only one node can receive partial summarizations. Then the time complexity with N messages sent to a single coordinator is O(d^2 n/N + N d^2).

Lemma (improved communication: global summarization with a balanced binary tree): Assume partial results Γ[I] can be sent to other workers in a binary tree fashion. Then T(d, n, N) ∈ O(d^2 n/N + \log_2(N) d^2) (a lower communication bound).

6 Computing Γ in a Database System

Here we connect our matrix-based parallel computation with a database system in two ways: (1) with relational queries; (2) by extending the system with matrix functions. We assume X can be evenly partitioned by i, a reasonable assumption when d ≪ n. We now treat X as a table in order to process it with relational algebra, the formalization of SQL. To simplify the presentation we use X as the table name and we treat Y as another dimension. There are two major alternatives: (1) defining a table with d columns plus primary key i, X(i, X_1, X_2, ..., X_d), ideal for dense matrices; (2) defining a table of triples, using the matrix subscripts as primary key, X(i, j, v), where i is the point id, j the dimension subscript, and v the matrix entry value, excluding zero entries (v = 0), i.e., a sparse matrix. Alternative (1) requires expressing the matrix product as d^2 aggregations, which requires a wrapper program to generate the query; notice X_j is a column name, so we cannot assume there is subscript-based access. Alternative (2) can express the computation as a single query, making the solution expressible within relational algebra, and it is natural for sparse matrices. In the following query X1 and X2 are aliases of X because it is referenced twice:

\Gamma = \pi_{X1.j, X2.j, \mathrm{sum}(X1.v * X2.v)} (X1 \bowtie_{X1.i = X2.i} X2).

The time complexity to evaluate the Gamma query depends on the join algorithm: it ranges from O(d^2 n) (hash join, balanced buckets), to O(d^2 n log(n)) (sort-merge join, merge join), to O(d^2 n^2) (nested loop join). These bounds can be tighter if the join operation can be avoided using a subscript-based mechanism in main memory, when Γ fits in main memory. The previous query solution requires an expensive join, which can be slow. Therefore, our goal is to eliminate the join operation by enabling efficient subscript-based access. Our proposal is to introduce a matrix function in an extended database model, where tuples are transformed into vectors and a matrix can be the result of a query. This function bridges relational tables on one side with matrices on the other side. Let Gamma() be an aggregation which returns a (d+2) × (d+2) matrix. There are two elegant solutions: (1) Γ = Gamma(X_1, X_2, ..., X_d)(X) for X with a horizontal schema; (2) Γ = Gamma(X) for a vertical schema, where the function assumes X has the triple schema defined above. In this case the time complexity ranges from O(d^2 n) to O(d^2 n log(n)), depending on the aggregation algorithm. The price we pay is that the query is no longer expressible within relational algebra, and therefore it is not feasible to treat Γ as a new relational operator. This second mechanism corresponds to so-called SQL UDFs [4, 9] or user-defined array operators [7, 10].
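As an illustration of alternative (2) and of the vertical schema, the sketch below simulates the triple table X(i, j, v) and the semantics of the self-join aggregation query in plain Python: grouping the joined rows by (X1.j, X2.j) and summing X1.v * X2.v yields the entries of Γ. The hand-coded dictionary evaluation is only a stand-in for a real DBMS join, or for the proposed Gamma() aggregate; the sample triples are made up for the example.

```python
from collections import defaultdict

# Vertical (sparse) schema: X(i, j, v), one row per non-zero entry.
# Subscript j = 0 encodes the constant 1s row of Z and j = d + 1 the output Y,
# so the same triple table already represents the augmented matrix Z (d = 2 here).
X = [
    (1, 0, 1.0), (1, 1, 2.0), (1, 3, 4.0),   # point 1: constant, x_1, y
    (2, 0, 1.0), (2, 2, 3.0), (2, 3, 5.0),   # point 2: constant, x_2, y
]

# Evaluate  pi_{X1.j, X2.j, sum(X1.v * X2.v)} (X1 join_{X1.i = X2.i} X2)
# by first grouping the triples by point id i (what a hash join on i does).
by_point = defaultdict(list)
for i, j, v in X:
    by_point[i].append((j, v))

Gamma = defaultdict(float)                    # sparse Gamma keyed by (j1, j2)
for entries in by_point.values():
    for j1, v1 in entries:
        for j2, v2 in entries:
            Gamma[(j1, j2)] += v1 * v2

print(Gamma[(0, 0)])   # n, the number of points: 2.0
print(Gamma[(3, 3)])   # sum of y_i^2: 4^2 + 5^2 = 41.0
```

A Gamma() aggregate computes the same nested accumulation per point directly, avoiding the explicit join and keeping Γ in main memory when it fits.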

References

1. D. DeWitt and J. Gray. Parallel database systems: the future of high performance database systems. Communications of the ACM, 35(6):85-98, 1992.
2. G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins University Press, 3rd edition, 1996.
3. C. Ordonez. Integrating K-means clustering with a relational DBMS using SQL. IEEE Transactions on Knowledge and Data Engineering (TKDE), 18(2):188-201, 2006.
4. C. Ordonez. Statistical model computation with UDFs. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22(12):1752-1765, 2010.
5. C. Ordonez, C. Garcia-Alvarado, and V. Baladandayuthapani. Bayesian variable selection in linear regression in one pass for large datasets. ACM TKDD, 9(1):3, 2014.
6. C. Ordonez and S. Pitchaimalai. Bayesian classifiers programmed in SQL. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22(1):139-144, 2010.
7. C. Ordonez, Y. Zhang, and W. Cabrera. The Gamma matrix to summarize dense and sparse data sets for big data analytics. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2016.
8. S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11:305-345, 1999.
9. M. Stonebraker, D. Abadi, D.J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. MapReduce and parallel DBMSs: friends or foes? Communications of the ACM, 53(1):64-71, 2010.
10. M. Stonebraker, P. Brown, D. Zhang, and J. Becla. SciDB: a database management system for applications with complex analytics. Computing in Science and Engineering, 15(3):54-62, 2013.