Presentation of Current Industry Research Papers on Significant Topics in Enterprise Database Systems and Data Analytics


Arabella Hudson
5 years ago

Technical Topics: Select one paper listed below and prepare a 25 min talk on the subject of the paper.

1. Data Warehouse and OLAP: Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals
Jim Gray (Microsoft), Adam Bosworth, Andrew Layman, Hamid Pirahesh

Data Mining with data warehouse: Integrating association rule mining with relational database systems: Alternatives and implications, in SIGMOD'98
S. Sarawagi, IBM Almaden Research Center; S. Thomas, IBM Almaden Research Center; R. Agrawal, IBM Almaden Research Center

Abstract
Data mining on large data warehouses is becoming increasingly important. In support of this trend, we consider a spectrum of architectural alternatives for coupling mining with database systems. These alternatives include: loose-coupling through a SQL cursor interface; encapsulation of a mining algorithm in a stored procedure; caching the data to a file system on-the-fly and mining; tight-coupling using primarily user-defined functions; and SQL implementations for processing in the DBMS. We comprehensively study the option of expressing the mining algorithm in the form of SQL queries using Association rule mining as a case in point. We consider four options in SQL-92 and six options in SQL enhanced with object-relational extensions (SQL-OR). Our evaluation of the different architectural alternatives shows that from a performance perspective, the Cache-Mine option is superior, although the performance of the SQL-OR option is within a factor of two. Both the Cache-Mine and the SQL-OR approaches incur a higher storage penalty than the loose-coupling approach which performance-wise is a factor of 3 to 4 worse than Cache-Mine. The SQL-92 implementations were too slow to qualify as a competitive option. We also compare these alternatives on the basis of qualitative factors like automatic parallelization, development ease, portability and inter-operability.

3. MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat, Google, Inc.

Abstract
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

Data Warehousing and Analytics Infrastructure at Facebook, in SIGMOD 2010, by Ashish Thusoo (Facebook), et al.

Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook, both engineering and non-engineering. Apart from ad hoc analysis of data and creation of business intelligence dashboards by analysts across the company, a number of Facebook's site features are also based on analyzing large data sets.
These features range from simple reporting applications like Insights for the Facebook Advertisers, to more advanced kinds such as friend recommendations. In order to support this diversity of use cases on the ever increasing amount of data, a flexible infrastructure that scales up in a cost effective manner, is critical. We have leveraged, authored and contributed to a number of open source technologies in order to address these requirements at Facebook. These include Scribe, Hadoop and Hive which together form the cornerstones of the log collection, storage and analytics infrastructure at Facebook. In this paper we will present how these systems have come together and enabled us to implement a data warehouse that stores more than 15PB of data (2.5PB after compression) and loads more than 60TB of new data (10TB after compression) every day. We discuss the motivations behind our design choices, the capabilities of this solution, the challenges that we face in day-to-day operations and future capabilities and improvements that we are working on.

16. Pig Latin: A Not-So-Foreign Language for Data Processing, in SIGMOD 2008
Christopher Olston (Yahoo! Research), Benjamin Reed (Yahoo! Research), Utkarsh Srivastava (Yahoo! Research), Ravi Kumar (Yahoo! Research), Andrew Tomkins (Yahoo! Research)

There is a growing need for ad-hoc analysis of extremely large data sets, especially at internet companies where innovation critically depends on being able to analyze terabytes of data collected every day. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively expensive at this scale. Besides, many of the people who analyze this data are entrenched procedural programmers, who find the declarative, SQL style to be unnatural. The success of the more procedural map-reduce programming model, and its associated scalable implementations on commodity hardware, is evidence of the above. However, the map-reduce paradigm is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain, and reuse. We describe a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL, and the low-level, procedural style of map-reduce. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source, map-reduce implementation. We give a few examples of how engineers at Yahoo! are using Pig to dramatically reduce the time required for the development and execution of their data analysis tasks, compared to using Hadoop directly. We also report on a novel debugging environment that comes integrated with Pig, that can lead to even higher productivity gains. Pig is an open-source, Apache-incubator project, and available for general use.

4. Query Optimization in Microsoft SQL Server PDW, in SIGMOD 2012
Srinath Shankar, Microsoft; Rimma Nehme, Microsoft; Josep Aguilar-Saborit, Microsoft; Andrew Chung, Microsoft; David DeWitt, Microsoft; César Galindo-Legaria, Microsoft

5. Advanced Partitioning Techniques for Massively Distributed Computation, in SIGMOD 2012
Jingren Zhou, Microsoft; Nico Bruno, Microsoft; Wei Lin, Microsoft

6. Efficient Transaction Processing in SAP HANA Database--The End of a Column Store Myth (SAP), in SIGMOD 2012
By Vishal Sikka, SAP; Franz Färber, SAP; Wolfgang Lehner, TUD/SAP; Sang Kyun Cha, SAP; Thomas Peh, SAP; Christof Bornhövd, SAP

7. A Storage Advisor for Hybrid-Store Databases
Philipp Rösch, SAP Research, SAP AG; Lars Dannecker, SAP Research, SAP AG

With the SAP HANA database, SAP offers a high-performance in-memory hybrid-store database. Hybrid-store databases (that is, databases supporting row- and column-oriented data management) are getting more and more prominent. While the columnar management offers high-performance capabilities for analyzing large quantities of data, the row-oriented store can handle transactional point queries as well as inserts and updates more efficiently. To effectively take advantage of both stores at the same time, the novel question arises whether to store the given data row- or column-oriented. We tackle this problem with a storage advisor tool that supports database administrators in this decision. Our proposed storage advisor recommends the optimal store based on data and query characteristics; its core is a cost model to estimate and compare query execution times for the different stores. Besides a per-table decision, our tool also considers horizontally and vertically partitioning the data and managing the partitions on different stores. We evaluated the storage advisor for use in the SAP HANA database; we show the recommendation quality as well as the benefit of having the data in the optimal store with respect to increased query performance.

April 23-2 Finding Related Tables, in SIGMOD 2012

Anish Das Sarma, Google; Lujun Fang, Google; Nitin Gupta, Google; Alon Halevy, Google; Hongrae Lee, Google; Fei Wu, Google; Reynold Xin, Google; Cong Yu, Google

17. Oracle In-Database Hadoop: When MapReduce Meets RDBMS, in SIGMOD 2012
Xueyuan Su, Yale University; Garret Swart, Oracle

Integrating Hadoop and parallel DBMS, in SIGMOD 2010
Yu Xu (Teradata), Pekka Kostamaa (Teradata), Like Gao (Teradata)

Teradata's parallel DBMS has been successfully deployed in large data warehouses over the last two decades for large scale business analysis in various industries over data sets ranging from a few terabytes to multiple petabytes. However, due to the explosive data volume increase in recent years at some customer sites, some data such as web logs and sensor data are not managed by Teradata EDW (Enterprise Data Warehouse), partially because it is very expensive to load those extremely large volumes of data to a RDBMS, especially when those data are not frequently used to support important business decisions. Recently the MapReduce programming paradigm, started by Google and made popular by the open source Hadoop implementation with major support from Yahoo!, is gaining rapid momentum in both academia and industry as another way of performing large scale data analysis. By now most data warehouse researchers and practitioners agree that both parallel DBMS and MapReduce paradigms have advantages and disadvantages for various business applications and thus both paradigms are going to coexist for a long time [16]. In fact, a large number of Teradata customers, especially those in the e-business and telecom industries, have seen increasing needs to perform BI over both data stored in Hadoop and data in Teradata EDW. One common thing between Hadoop and Teradata EDW is that data in both systems are partitioned across multiple nodes for parallel computing, which creates integration optimization opportunities not possible for DBMSs running on a single node. In this paper we describe our three efforts towards tight and efficient integration of Hadoop and Teradata EDW.

19. Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads
Yanpei Chen, Sara Alspaugh, Randy Katz, University of California, Berkeley

Within the past few years, organizations in diverse industries have adopted MapReduce-based systems for large-scale data processing. Along with these new users, important new workloads have emerged which feature many small, short, and increasingly interactive jobs in addition to the large, long-running batch jobs for which MapReduce was originally designed. As interactive, large-scale query processing is a strength of the RDBMS community, it is important that lessons from that field be carried over and applied where possible in this new domain. However, these new workloads have not yet been described in the literature.
We fill this gap with an empirical analysis of MapReduce traces from six separate business-critical deployments inside Facebook and at Cloudera customers in e-commerce, telecommunications, media, and retail. Our key contribution is a characterization of new MapReduce workloads which are driven in part by interactive analysis, and which make heavy use of query-like programming frameworks on top of MapReduce. These workloads display diverse behaviors which invalidate prior assumptions about MapReduce such as uniform data access, regular diurnal patterns, and prevalence of large jobs. A secondary contribution is a first step towards creating a TPC-like data processing benchmark for MapReduce.
ldb2012.pdf

20. Avatara: OLAP for Webscale Analytics Products
Lili Wu, Roshan Sumbaly, Chris Riccomini, Gordon Koo, Hyung Jin Kim, Jay Kreps, Sam Shah (LinkedIn)

Multidimensional data generated by members on websites has seen massive growth in recent years. OLAP is a well-suited solution for mining and analyzing this data. Providing insights derived from this analysis has become crucial for these websites to give members greater value. For example, LinkedIn, the largest professional social network, provides its professional members rich analytics features like "Who's Viewed My Profile?" and "Who's Viewed This Job?" The data behind these features form cubes that must be efficiently served at scale, and can be neatly sharded to do so. To serve our growing 160 million member base, we built a scalable and fast OLAP serving system called Avatara to solve this "many, small cubes" problem. At LinkedIn, Avatara has been powering several analytics features on the site for the past two years.

The Unified Logging Infrastructure for Data Analytics at Twitter
George Lee, Jimmy Lin, Chuang Liu, Andrew Lorek, and Dmitriy Ryaboy @squarecog

In recent years, there has been a substantial amount of work on large-scale data analytics using Hadoop-based platforms running on large clusters of commodity machines. A less-explored topic is how those data, dominated by application logs, are collected and structured to begin with. In this paper, we present Twitter's production logging infrastructure and its evolution from application-specific logging to a unified client events log format, where messages are captured in common, well-formatted, flexible Thrift messages. Since most analytics tasks consider the user session as the basic unit of analysis, we pre-materialize session sequences, which are compact summaries that can answer a large class of common queries quickly. The development of this infrastructure has streamlined log collection and data analysis, thereby improving our ability to rapidly experiment and iterate on various aspects of the service.

Can the Elephants Handle the NoSQL Onslaught?
Avrilia Floratou, University of Wisconsin-Madison, floratou@cs.wisc.edu; Nikhil Teletia, Microsoft Jim Gray Systems Lab, nikht@microsoft.com; David J. DeWitt, Microsoft Jim Gray Systems Lab, dewitt@microsoft.com; Jignesh M. Patel, University of Wisconsin-Madison, jignesh@cs.wisc.edu; Donghui Zhang, Paradigm4, dzhang@paradigm4.com

In this new era of big data, traditional DBMSs are under attack from two sides. At one end of the spectrum, the use of document store NoSQL systems (e.g. MongoDB) threatens to move modern Web 2.0 applications away from traditional RDBMSs. At the other end of the spectrum, big data DSS analytics that used to be the domain of parallel RDBMSs is now under attack by another class of NoSQL data analytics systems, such as Hive on Hadoop. So, are the traditional RDBMSs, aka "big elephants", doomed as they are challenged from both ends of this big data spectrum? In this paper, we compare one representative NoSQL system from each end of this spectrum with SQL Server, and analyze the performance and scalability aspects of each of these approaches (NoSQL vs. SQL) on two workloads (decision support analysis and interactive data-serving) that represent the two ends of the application spectrum. We present insights from this evaluation and speculate on potential trends for the future.

MapReduce-based Dimensional ETL Made Easy
Xiufeng Liu, Christian Thomsen, Torben Bach Pedersen, Dept. of Computer Science, Aalborg University, Denmark {xiliu, chr,

This paper demonstrates ETLMR, a novel dimensional Extract Transform Load (ETL) programming framework that uses MapReduce to achieve scalability. ETLMR has built-in native support of data warehouse (DW) specific constructs such as star schemas, snowflake schemas, and slowly changing dimensions (SCDs). This makes it possible to build MapReduce-based dimensional ETL flows very easily. The ETL process can be configured with only a few lines of code. We will demonstrate the concrete steps in using ETLMR to load data into a (partly snowflaked) DW schema. This includes configuration of data sources and targets, dimension processing schemes, fact processing, and deployment. In addition, we also present the scalability on large data sets.

Locality-aware Partitioning in Parallel Database Systems (SIGMOD 2015)
Erfan Zamanian, Carsten Binnig, Abdallah Salama. Brown University, Providence, USA; Baden-Wuerttemberg Cooperative State University, Mannheim, Germany

Parallel database systems horizontally partition large amounts of structured data in order to provide parallel data processing capabilities for analytical workloads in shared-nothing clusters. One major challenge when horizontally partitioning large amounts of data is to reduce the network costs for a given workload and a database schema. A common technique to reduce the network costs in parallel database systems is to co-partition tables on their join key in order to avoid expensive remote join operations. However, existing partitioning schemes are limited in that respect since only subsets of tables in complex schemata sharing the same join key can be co-partitioned unless tables are fully replicated. In this paper we present a novel partitioning scheme called predicate-based reference partition (or PREF for short) that allows sets of tables to be co-partitioned based on given join predicates. Moreover, based on PREF, we present two automatic partitioning design algorithms to maximize data-locality. One algorithm only needs the schema and data whereas the other algorithm additionally takes the workload as input. In our experiments we show that our automated design algorithms can partition database schemata of different complexity and thus help to selectively reduce the runtime of queries under a given workload when compared to existing partitioning approaches.

ByteSlice: Pushing the Envelop of Main Memory Data Processing with a New Storage Layout (SIGMOD 2015)
Ziqiang Feng, Eric Lo, Ben Kao, Wenjian Xu. Department of Computing, The Hong Kong Polytechnic University

Scan and lookup are two core operations in main memory column stores. A scan operation scans a column and returns a result bit vector that indicates which records satisfy a filter. Once a column scan is completed, the result bit vector is converted into a list of record numbers, which is then used to look up values from other columns of interest for a query. Recently there have been several in-memory data layout proposals that aim to improve the performance of in-memory data processing. However, these solutions all stand at either end of a trade-off: each is either good in lookup performance or good in scan performance, but not both. In this paper we present ByteSlice, a new main memory storage layout that supports both highly efficient scans and lookups. ByteSlice is a byte-level columnar layout that fully leverages SIMD data-parallelism. Micro-benchmark experiments show that ByteSlice achieves a data scan speed at less than 0.5 processor cycle per column value, a new limit of main memory data scan, without sacrificing lookup performance. Our experiments on TPC-H data and real data show that ByteSlice offers significant performance improvement over all state-of-the-art approaches.
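
For a presenter preparing the ByteSlice talk, the scan-then-lookup pipeline that the abstract describes (column scan, result bit vector, record numbers, lookup in another column) can be sketched in a few lines. This is only an illustration on plain Python lists: it does not reproduce ByteSlice's byte-level SIMD layout, and the column names (`price`, `name`) and helper functions are hypothetical.

```python
# Illustrative scan -> bit vector -> record numbers -> lookup pipeline.
# Plain Python lists stand in for columns; no SIMD or ByteSlice layout here.

def scan(column, predicate):
    """Scan one column and return a bit vector: 1 where the filter holds."""
    return [1 if predicate(value) else 0 for value in column]

def to_record_numbers(bits):
    """Convert the result bit vector into a list of matching record numbers."""
    return [i for i, bit in enumerate(bits) if bit]

def lookup(column, record_numbers):
    """Fetch values of another column for the matching records."""
    return [column[i] for i in record_numbers]

# Two hypothetical columns of the same table, stored column-wise.
price = [5, 12, 7, 30, 18]
name = ["a", "b", "c", "d", "e"]

bits = scan(price, lambda v: v > 10)   # filter: price > 10
rows = to_record_numbers(bits)
names = lookup(name, rows)

print(bits, rows, names)
```

The trade-off named in the abstract sits in how `price` is physically stored: a layout tuned for `scan` tends to make `lookup` slower, and the other way round.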

27. Implicit Parallelism through Deep Language Embedding (SIGMOD 2015)

The appeal of MapReduce has spawned a family of systems that implement or extend it. In order to enable parallel collection processing with User-Defined Functions (UDFs), these systems expose extensions of the MapReduce programming model as library-based dataflow APIs that are tightly coupled to their underlying runtime engine. Expressing data analysis algorithms with complex data and control flow structure using such APIs reveals a number of limitations that impede programmer's productivity. In this paper we show that the design of data analysis languages and APIs from a runtime engine point of view bloats the APIs with low-level primitives and affects programmer's productivity. Instead, we argue that an approach based on deeply embedding the APIs in a host language can address the shortcomings of current data analysis languages. To demonstrate this, we propose a language for complex data analysis embedded in Scala, which (i) allows for declarative specification of dataflows and (ii) hides the notion of data-parallelism and distributed runtime behind a suitable intermediate representation. We describe a compiler pipeline that facilitates efficient data-parallel processing without imposing runtime engine-bound syntactic or semantic restrictions on the structure of the input programs. We present a series of experiments with two state-of-the-art systems that demonstrate the optimization potential of our approach.

From Theory to Practice: Efficient Join Query Evaluation in a Parallel Database System (SIGMOD 2015)

Big data analytics often requires processing complex queries using massive parallelism, where the main performance metric is the communication cost incurred during data reshuffling. In this paper, we describe a system that can compute efficiently complex join queries, including queries with cyclic joins, on a massively parallel architecture. We build on two independent lines of work for multi-join query evaluation: a communication-optimal algorithm for distributed evaluation, and a worst-case optimal algorithm for sequential evaluation. We evaluate these algorithms together, then describe novel, practical optimizations for both algorithms.
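
To make "cyclic join" concrete for a talk on this paper, here is a minimal sketch of the triangle query Q(a, b, c) :- R(a, b), S(b, c), T(a, c), evaluated one variable at a time by set intersection. That is the general idea behind worst-case optimal sequential joins, but this is not the paper's algorithm: it runs on a single machine, ignores the communication-optimal distributed part entirely, and the relation names and sample tuples are made up for illustration.

```python
# Toy triangle query Q(a, b, c) :- R(a, b), S(b, c), T(a, c).
# Single-node illustration only; relations and data are hypothetical.

def index(pairs):
    """Map each first component to the set of its second components."""
    idx = {}
    for x, y in pairs:
        idx.setdefault(x, set()).add(y)
    return idx

def triangles(R, S, T):
    """Bind a, then b, then c, intersecting candidate sets at each step."""
    r, s, t = index(R), index(S), index(T)
    out = []
    for a in sorted(set(r) & set(t)):            # a must occur in R and in T
        for b in sorted(r[a].intersection(s)):   # b follows a in R and starts an S tuple
            for c in sorted(s[b] & t[a]):        # c closes the cycle via S and T
                out.append((a, b, c))
    return out

R = [(1, 2), (1, 3), (2, 3)]
S = [(2, 3), (3, 1), (3, 4)]
T = [(1, 3), (1, 4), (2, 1)]

print(triangles(R, S, T))
```

A plan built from pairwise joins would first materialize R joined with S and only then filter by T; binding one variable at a time avoids building that intermediate result.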

29. Skew-Aware Join Optimization for Array Databases (SIGMOD 2015)
Jennie Duggan, Olga Papaemmanouil, Leilani Battle, Michael Stonebraker. Northwestern University, Brandeis University, MIT {leilani,

Science applications are accumulating an ever-increasing amount of multidimensional data. Although some of it can be processed in a relational database, much of it is better suited to array-based engines. As such, it is important to optimize the query processing of these systems. This paper focuses on efficient query processing of join operations within an array database. These engines invariably chunk their data into multidimensional tiles that they use to efficiently process spatial queries. As such, traditional relational algorithms need to be substantially modified to take advantage of array tiles. Moreover, most n-dimensional science data is unevenly distributed in array space because its underlying observations rarely follow a uniform pattern. It is crucial that the optimization of array joins be skew-aware. In addition, owing to the scale of science applications, their query processing usually spans multiple nodes. This further complicates the planning of array joins. In this paper, we introduce a join optimization framework that is skew-aware for distributed joins. This optimization consists of two phases. In the first, a logical planner selects the query's algorithm (e.g., merge join), the granularity of its tiles, and the reorganization operations needed to align the data. The second phase implements this logical plan by assigning tiles to cluster nodes using an analytical cost model. Our experimental results, on both synthetic and real-world data, demonstrate that this optimization framework speeds up array joins by up to 2.5X in comparison to the baseline.

Analytics in Motion: High Performance Event-Processing AND Real-Time Analytics in the Same Database

Modern data-centric flows in the telecommunications industry require real time analytical processing over a rapidly changing and large dataset. The traditional approach of separating OLTP and OLAP workloads cannot satisfy this requirement. Instead, a new class of integrated solutions for handling hybrid workloads is needed. This paper presents an industrial use case and a novel architecture that integrates key-value-based event processing and SQL-based analytical processing on the same distributed store while minimizing the total cost of ownership. Our approach combines several well-known techniques such as shared scans, delta processing, a PAX-fashioned storage layout, and an interleaving of scanning and delta merging in a completely new way. Performance experiments show that our system scales out linearly with the number of servers. For instance, our system sustains event streams of 100,000 events per second while simultaneously processing 100 ad-hoc analytical queries per second, using a cluster of 12 commodity servers. In doing so, our system meets all response time goals of our telecommunication customers; that is, 10 milliseconds per event and 100 milliseconds for an ad-hoc analytical query. Moreover, our system beats commercial competitors by a factor of 2.5 in analytical and two orders of magnitude in update performance.

Cost-based Fault-tolerance for Parallel Data Processing (VLDB 2015)

In order to deal with mid-query failures in parallel data engines (PDEs), different fault-tolerance schemes are implemented today: (1) fault-tolerance in parallel databases is typically implemented in a coarse-grained manner by restarting a query completely when a mid-query failure occurs, and (2) modern MapReduce-style PDEs implement a fine-grained fault-tolerance scheme, which either materializes intermediate results or implements a lineage model to recover from mid-query failures. However, neither of these schemes can efficiently handle mixed workloads with both short running interactive queries as well as long running batch queries, nor do these schemes efficiently support a wide range of different cluster setups which vary in cluster size and other parameters such as the mean time between failures. In this paper, we present a novel cost-based fault-tolerance scheme which tackles this issue. Compared to the existing schemes, our scheme selects a subset of intermediates to be materialized such that the total query runtime is minimized under mid-query failures. Our experiments show that our cost-based fault-tolerance scheme outperforms all existing strategies and always selects the sweet spot for short- and long running queries as well as for different cluster setups.

Schema-Agnostic Indexing with Azure DocumentDB (VLDB 2015)
Dharma Shukla (Microsoft)

Azure DocumentDB is Microsoft's multi-tenant distributed database service for managing JSON documents at Internet scale. DocumentDB is now generally available to Azure developers. In this paper, we describe the DocumentDB indexing subsystem. DocumentDB indexing enables automatic indexing of documents without requiring a schema or secondary indices. Uniquely, DocumentDB provides real-time consistent queries in the face of very high rates of document updates. As a multi-tenant service, DocumentDB is designed to operate within extremely frugal resource budgets while providing predictable performance and robust resource isolation to its tenants. This paper describes the DocumentDB capabilities, including document representation, query language, document indexing approach, core index support, and early production experiences.

JetScope: Reliable and Interactive Analytics at Cloud Scale (VLDB 2015)
Eric Boutin, Paul Brett, Xiaoyu Chen, Jaliya Ekanayake, Tao Guan, Anna Korsun, Zhicheng Yin, Nan Zhang, Jingren Zhou (Microsoft)

Interactive, reliable, and rich data analytics at cloud scale is a key capability to support low latency data exploration and experimentation over terabytes of data for a wide range of business scenarios. Besides the challenges in massive scalability and low latency distributed query processing, it is imperative to achieve all these requirements with effective fault tolerance and efficient recovery, as failures and fluctuations are the norm in such a distributed environment. We present a cloud scale interactive query processing system, called JetScope, developed at Microsoft. The system has a SQL-like declarative scripting language and delivers massive scalability and high performance through advanced optimizations. In order to achieve low latency, the system leverages various access methods, optimizes delivering first rows, and maximizes network and scheduling efficiency. The system also provides a fine-grained fault tolerance mechanism which is able to efficiently detect and mitigate failures without significantly impacting the query latency and user experience. JetScope has been deployed to hundreds of servers in production at Microsoft, serving a few million queries every day.

34. Optimization of Common Table Expressions in MPP Database Systems

Big Data analytics often include complex queries with similar or identical expressions, usually referred to as Common Table Expressions (CTEs). CTEs may be explicitly defined by users to simplify query formulations, or implicitly included in queries generated by business intelligence tools, financial applications and decision support systems. In Massively Parallel Processing (MPP) database systems, CTEs pose new challenges due to the distributed nature of query processing, the overwhelming volume of underlying data and the scalability criteria that systems are required to meet. In these settings, the effective optimization and efficient execution of CTEs are crucial for the timely processing of analytical queries over Big Data. In this paper, we present a comprehensive framework for the representation, optimization and execution of CTEs in the context of Orca, Pivotal's query optimizer for Big Data. We demonstrate experimentally the benefits of our techniques using an industry standard decision support benchmark.
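
As a quick reminder of what a CTE looks like before presenting this paper, the sketch below defines one CTE and references it twice in the same query. It uses SQLite through Python's standard `sqlite3` module purely for convenience: SQLite is a single-node engine, not an MPP system, and has nothing to do with Orca. The table `sales` and its contents are hypothetical.

```python
import sqlite3

# In-memory database with a small, made-up sales table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('east', 10), ('east', 30),
        ('west', 5),  ('west', 15),
        ('north', 50);
""")

# The CTE region_totals is referenced twice: once as the main input
# and once inside the subquery that computes the average total.
query = """
WITH region_totals AS (
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
)
SELECT region, total
FROM region_totals
WHERE total > (SELECT AVG(total) FROM region_totals)
ORDER BY region
"""

rows = conn.execute(query).fetchall()
print(rows)
```

Because `region_totals` has two consumers, an optimizer has to decide whether to compute it once and share the result or to expand it at each reference. In a distributed setting that choice also affects how much data moves between nodes, which is the kind of question the abstract raises.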
