Combining MapReduce with Parallel DBMS Techniques for Large-Scale Data Analytics

EDIC RESEARCH PROPOSAL
Ioannis Klonatos, DATA, I&C, EPFL

Abstract: High scalability is becoming an essential requirement of data analytics systems, as the amount of data being collected, stored and processed on a daily basis continues to grow rapidly. While the MapReduce framework [7] is designed specifically to satisfy this requirement, it has often been criticized for its poor performance [14]. In this report we present three papers that show how the performance of the MapReduce framework can be significantly improved using techniques inherited from Parallel Database Management Systems (PDBMS). Two of these papers present techniques for data placement, indexing and a new join operator designed specifically for MapReduce, while the third examines whether the performance enhancements of a database can be used directly, as is, by inserting such a database as a resource in each MapReduce node. Our conclusions show that while these approaches look promising, there is still room for further improvements.

Index Terms: Analytical Workloads, MapReduce, Parallel DBMS, Data Placement, Indexing, Joins on MapReduce.

Proposal submitted to committee: August 27, 2012; candidacy exam date: September 3, 2012; candidacy exam committee: Prof. Willy Zwaenepoel, Prof. Christoph Koch, Prof. Anastasia Ailamaki. This research plan has been approved: Date; Doctoral candidate (name and signature); Thesis director (name and signature); Thesis co-director, if applicable (name and signature); Doct. prog. director, R. Urbanke (signature).

I. INTRODUCTION

Nowadays, we are witnessing an explosion in the amount of data being collected, stored and processed in data warehouses on a daily basis. It is becoming increasingly common to hear companies claiming to load more than a terabyte of data per day into their systems, whose total data footprint may exceed one petabyte [12]. For example, Facebook reported pushing 20TB of data every day to its (currently) 2.5PB warehouse ([10], [13]). As a result of this "data deluge", as it has been called, corporations have significantly changed the way they process data [17]. Companies are moving away from performing data analysis¹ on high-end proprietary machines in favor of shared-nothing cluster architectures composed of thousands of low-cost machines built from unreliable commodity hardware. These architectures are usually provided in the form of virtualized environments in private or public clouds, such as Amazon's Elastic Compute Cloud (EC2) [4]. The fundamental reason behind this shift is cost. Given that efficiently processing datasets at the petabyte scale requires exploiting a great degree of parallelism, the proven scalability [11], price-to-performance and power-to-performance [9] characteristics of these architectures make them ideal candidates for large-scale data analysis. This is because data analysis workloads tend to consist of many large scan operations, multi-dimensional aggregations and star-schema joins, all of which are fairly easy to parallelize across nodes in a shared-nothing network [2]. Furthermore, cloud offerings today can reduce operational costs by maximizing the utilization of their underlying hardware. This reduction is then passed on to clients, who essentially pay only for what they use, thus reducing their own operating, facilities, and other hardware-related costs.
As a result, cloud offerings can today be used as the infrastructure on top of which one can build highly scalable services [9]. Unfortunately, as we explain in Sections II-B and II-C, there are many reasons that make it difficult for existing parallel database systems to scale to thousands of nodes located in such cloud environments. Based on the above analysis, and on the observation that the scalability requirements of applications are not likely to go down, the MapReduce framework [7] and its open-source implementation Hadoop [1] have been developed. This framework has been designed from the start to scale to thousands of nodes, and its use at Google is proof of that [6]. Despite its high scalability, MapReduce has been frequently criticized for its poor performance, which can be up to an order of magnitude slower than that of a traditional parallel DBMS [14]. In this report we present three papers that show that one can use techniques inherited from parallel databases, or even a database itself directly, as is, in order to boost the performance of MapReduce.

This report is structured as follows. In Section II we summarize the requirements a modern state-of-the-art system should satisfy in order to efficiently process data at the petabyte scale. Then, in Section III, we study whether data placement techniques proposed for parallel databases can boost the performance of MapReduce.

¹ Data analysis workloads are read-intensive and produce write requests only during data loading.

The conclusion of this discussion is that those techniques do not completely satisfy our requirements and, thus, a new technique called RCFile [10] is developed. In Section IV, we present HadoopDB [2], which examines whether the performance enhancements of a database can be used directly, as is, by inserting such a database as a resource in each node of MapReduce. Hadoop++ [8], on the other hand, presented in Section V, aims to enrich MapReduce not through a database, but by extending the framework so that it uses indices and a special join technique. Finally, in Section VI, we present our conclusions and state some interesting research paths which may extend the systems presented in this report.

II. DESIRED PROPERTIES OF ANALYTICS PLATFORMS

In this section we present the properties a modern state-of-the-art data analytics system should have in order to meet the requirements of modern workloads. While doing so, we also analyze whether parallel databases and MapReduce satisfy each requirement.

A. High Performance

High performance is the primary requirement for a modern data analytics system. High performance brings cost savings, as it can help delay a costly hardware upgrade as the application's CPU, storage and network requirements continue to grow. There are three main aspects to this requirement:

Fast data loading. Short loading times are becoming a crucial requirement, since data loading causes significant network and disk traffic, which interferes with normal query operations. Furthermore, short loading times increase the precision and freshness of related computations, since they are performed on more up-to-date data.

Fast query processing. Traditionally, data warehouses must be able to concurrently execute multiple batches of heavy decision-support queries. Nowadays, there are also response-time-critical queries, originating from users on the web. To guarantee short query execution times, the volume of network and storage I/O must be minimized. The system must also retain this characteristic as the number of queries increases over time.

Highly efficient storage space utilization. Though storage is continuously becoming cheaper, there are still power (fewer hard disks means less power consumption) and performance (fewer hard disk seeks means better performance) considerations that make this requirement essential. Most importantly, the trend is that data footprints will soon grow faster than storage densities [17].

Of the two systems, parallel databases have long been optimized for high performance, and they currently incorporate at least a decade of optimization techniques published in the database literature (such as indexing, compression and directly operating on compressed data, materialized views, result caching and I/O sharing). Furthermore, PDBMS usually use cost-based optimizers that employ statistics or simple optimization rules to improve performance. Those statistics, along with modeling data on well-defined schemas, help make optimal choices for each executed query. MapReduce, on the other hand, performs unstructured data analysis and thus cannot employ many of the performance optimizations of databases. This can lead to an order of magnitude worse performance compared to a PDBMS [14].

B. Fault Tolerance and Scalability

For data analysis systems, where there are no write operations except those caused by data loading, fault tolerance is defined as not having to restart a query when some node working on that query fails.
Given that the probability of a node failure increases as one adds more failure-prone commodity machines to the cluster, restarting a query on every node failure may make long-running queries difficult to complete. There is usually a clear tradeoff between performance and fault tolerance. For instance, checkpointing the results of completed sub-tasks increases the fault tolerance of long-running queries, but imposes a significant overhead which reduces performance. As another example, consider the pipelining of intermediate results between query operators: a design choice that improves performance, but also increases the amount of work lost when a failure occurs. This last observation guides the design choices of the two data analytics platforms. On the one hand, PDBMS are historically designed for small clusters, an environment in which failures are rare events. Thus, these systems simply restart the entire query when a failure is detected, using replica nodes, in order to avoid the overheads of techniques like checkpointing. As a result, those systems cannot efficiently scale in today's commodity clusters, and there is no known parallel database system today that scales to more than one hundred nodes. On the other hand, MapReduce is designed based on the observation that failures are frequently occurring events. It therefore optimizes for fault tolerance by checkpointing the results of the map phase, and achieves high scalability to thousands of nodes.

C. Managing Heterogeneity

Acquiring homogeneous performance from all cluster nodes becomes an increasingly difficult task as more nodes are assigned to a data analytics system. This is true even if all machines have the same, virtualized or physical, hardware. Partial node failures, fragmentation of individual disks, concurrent execution of tasks by different users, or even software configuration errors can all seriously degrade performance. Contrary to parallel DBMS, MapReduce uses a runtime-determined execution plan. This allows MapReduce to trade runtime scheduling overhead for the ability to handle slow nodes through redundant task execution. With this technique, the tasks of slow workers are also executed on faster nodes, so that the execution time equals the time required by the faster node to complete the task.

D. Not Workload Specific

Data stored in modern warehouses is analyzed by different applications and users in significantly diverse ways. As a result, warehouse accesses do not conform to any regular workload pattern. Thus, a data analytics system should not assume any a priori knowledge (e.g., of the queries to be executed). Instead, it should quickly adapt to any workload it receives.

E. Wide Variety of Exported Interfaces

There are three aspects to this requirement. First, given that many users of data analytics systems are not experienced programmers but analysts, the interface exported by the system should be as flexible as possible. To this end, access through both SQL and general-purpose languages like Java or C++ should be available to users. Furthermore, there should be no restrictions on the data format used (e.g., structured data only). Second, there has been extensive prior work on business tools that help with visualization, query generation and data analysis. Analysts should be able to continue using these tools on a new system, without any modifications to their code base. Finally, users should be able to extend the system by programming user-defined functions (UDFs), which should then be automatically parallelized by the system.

F. Low Cost

Finally, there is the issue of cost. On the one hand, Hadoop is an open-source implementation of MapReduce. On the other hand, parallel databases are extremely costly, usually with seven-figure prices. Furthermore, while there have been advances in automatic configuration, deployment and tuning of parallel databases, these systems still usually require highly skilled database administrators for maintenance. Ideally, users would like a free system that satisfies all the above requirements out of the box.

III. DATA PLACEMENT TECHNIQUES FOR MAPREDUCE

Data placement can fundamentally affect the performance and workload independence aspects we described in Sections II-A and II-D, respectively. Thus, there has been a long line of work in the field of parallel databases on how data should be placed on the underlying storage. Currently, there are three popular data placement techniques, namely row-store, column-store and PAX-store [3]. Given that the default placement technique of the MapReduce framework resembles a row-store, one reasonable question to ask is whether the other two schemes can perform better than a row-store in this environment. To answer this question, one can examine the merits and drawbacks of each technique in all aspects of performance. We present such an analysis next. In what follows, note that an HDFS block contains many disk pages, each in the format described in each case below.

A. Row Storage

In this data placement technique, which is formally called the N-ary Storage Model [15] and is displayed in Figure 1(A), the records of a database relation are stored one after another in each disk page, in the order of their appearance. The major advantage of this scheme is that it achieves fast loading times, since it requires no preprocessing of the relational records before they are transmitted to the cluster nodes. It also does not require any a priori knowledge of the queries to be executed, so it is workload independent. However, this scheme typically performs poorly at query execution time and does not efficiently utilize the underlying storage space. This is because relational records nowadays tend to contain hundreds of attributes, out of which only a few are used in the typical aggregation queries of warehouse workloads. As a result, row storage performs unnecessary read requests, fetching from the underlying storage devices columns that are not used by the query.
Furthermore, this scheme suffers from a poor compression ratio, since the mixed data domains of the different attributes of each record cause high information entropy, which is known to hurt the effectiveness of compression algorithms.

B. Column Storage

Column stores improve I/O performance by taking advantage of the previous observation that only a few attributes of a relation are used in typical data warehousing queries. They do so by vertically partitioning the attributes of a database relation into several sub-relations, so that each sub-relation contains one or more attributes of the original relation. Each such sub-relation is then stored separately from the others in one or more regular disk pages. There are two basic variations of this format, described next.

The first variation, called the Decomposition Storage Model [5], stores one column per sub-relation in a page, as shown in Figure 1(B). The performance of this placement technique is highly dependent on the query being executed. On the one hand, if the query references only columns in HDFS blocks that are locally available to the cluster node, then this scheme avoids unnecessary column reads from the underlying storage, since it reads only the columns required by the query. On the other hand, this requirement is not always met in the MapReduce framework. If the query references columns that are not locally available, it has to perform excessive network transfers to fetch columns from multiple other cluster nodes in order to do record reconstruction, resulting in poor query performance. Despite this fact, the scheme does not require any prior workload knowledge and is thus workload independent.

The second variation, called column-group storage, stores multiple columns per sub-relation in a page instead of only one, as shown in Figure 1(C). The way data is organized inside a group (row- or column-oriented) depends on the system implementation, and some columns may exist in multiple column groups. This scheme improves over plain column stores when the queries to be executed are known a priori and appropriate column groups have been created for them. In this case, column-group storage avoids the excessive network transfers of column stores, since all the required columns exist in local storage. Otherwise, a record reconstruction is still necessary to merge two or more column groups. Thus, column-group storage sacrifices workload independence to obtain better performance. Moreover, overlapping columns increase the data footprint, resulting in underutilization of the storage space.

Finally, we note two things. First, neither of the two column schemes achieves fast data loading, since both require preprocessing of each relational record in order to split the different attributes into different pages. Second, both schemes achieve a high compression ratio by compressing only one column at a time, thus keeping data domains separate. This results in efficient storage space utilization. The sketch below illustrates the basic I/O tradeoff between row and column layouts.
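To make the I/O tradeoff concrete, the following small, self-contained Java sketch (our own illustration, not code from any of the papers discussed) lays out the same toy relation in row-major (NSM) and column-major (DSM) order and counts the bytes a single-attribute aggregation must touch under each layout:

    import java.nio.ByteBuffer;

    // Toy comparison of the N-ary storage model (row store) and the
    // decomposition storage model (one column per page). A query that
    // aggregates a single attribute reads far fewer bytes under DSM.
    public class LayoutSketch {

        static final int N = 4;                       // records in this toy "page"
        static final long[] a = {1, 2, 3, 4};         // attribute a (the one queried)
        static final long[] b = {10, 20, 30, 40};     // attribute b
        static final long[] c = {100, 200, 300, 400}; // attribute c

        public static void main(String[] args) {
            // Row store: a1 b1 c1 a2 b2 c2 ... (attributes interleaved).
            ByteBuffer rowPage = ByteBuffer.allocate(N * 3 * Long.BYTES);
            for (int i = 0; i < N; i++) {
                rowPage.putLong(a[i]).putLong(b[i]).putLong(c[i]);
            }

            // Column store: one page per attribute; SUM(a) only needs a's page.
            ByteBuffer colPageA = ByteBuffer.allocate(N * Long.BYTES);
            for (long v : a) colPageA.putLong(v);
            // (pages for b and c would be built the same way, but never read here)

            // SUM(a) over the row store: we stride over b and c as well,
            // i.e., the whole page travels through the I/O path.
            rowPage.flip();
            long rowSum = 0;
            for (int i = 0; i < N; i++) {
                rowSum += rowPage.getLong(); // read a
                rowPage.getLong();           // skip b (still fetched with the page)
                rowPage.getLong();           // skip c
            }

            // SUM(a) over the column store: only a's page is touched.
            colPageA.flip();
            long colSum = 0;
            for (int i = 0; i < N; i++) colSum += colPageA.getLong();

            System.out.println("row page:    sum = " + rowSum
                    + ", bytes touched = " + N * 3 * Long.BYTES);
            System.out.println("column page: sum = " + colSum
                    + ", bytes touched = " + N * Long.BYTES);
        }
    }

With hundreds of attributes per record, as is typical in warehouse relations, the gap between the two byte counts grows accordingly.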

Fig. 1: The four data placement strategies: row-store, column-store, column-group store and PAX-store.

C. PAX Store

The Partition Attributes Across (PAX) [3] technique, whose design is shown in Figure 1(D), enhances overall performance by improving CPU cache behavior. PAX first partitions data horizontally, like a row store, and then partitions data vertically within each page, like a column store. Thus, it inherits the fast data loading and the workload independence of row stores. Furthermore, it avoids network transfers for record reconstruction by storing all the attributes of a record inside the same page. Finally, it improves cache performance by grouping all values of an attribute together in a mini-page inside the page, so that a cache miss brings into main memory related elements of the same column.

TABLE I: Merits and drawbacks of the four data placement structures for each requirement of big data processing ('?' = satisfied only conditionally).

                    Fast Data   Fast Query    Efficient Storage   Not Workload
                    Loading     Processing    Space Utilization   Specific
    Row Store       yes         no            no                  yes
    Column Store    no          ?             yes                 yes
    Column Group    no          ?             ?                   no
    PAX             yes         ?             no                  yes

Though PAX improves system performance by exploiting better cache usage, it is not designed to satisfy the requirements of big data processing, for the following three reasons. First, PAX does not employ data compression, a technique that can significantly reduce the required storage space. Second, since PAX does not actually change the contents of a page, it performs exactly the same number of I/O requests as a row store. This causes poor query execution performance. Finally, given that modern data analytics databases contain records with thousands of wide attributes, a single record may not fit in a 4KB page, which is the fixed unit of data organization for PAX. Such an event causes multiple (and not necessarily sequential) I/O requests for a single record, significantly degrading performance.

D. Summary of the Analysis so Far

Table I summarizes the behavior of the four placement techniques with respect to each requirement of big data processing. We use the symbol '?' to denote that a scheme satisfies a requirement only conditionally (e.g., when there is no overlapping of groups in column-group stores). We observe that no placement scheme satisfies all the requirements. This observation is the starting point of the RCFile placement format, described next.

E. Record Columnar File (RCFile) [10]

The authors of [10] propose a new data placement technique called Record Columnar File (RCFile), whose aim is to satisfy all the requirements of large-scale data analysis in systems like MapReduce. Since PAX already satisfies two of the requirements, RCFile adopts its concept of partitioning data first horizontally and then vertically, while addressing its missing I/O performance. We describe the overall design of RCFile next, and how it overcomes the limitations of PAX.

To begin with, PAX performs poorly when a relational record cannot fit in its fixed operational unit of 4KB pages. To overcome this limitation, RCFile increases the basic operational unit and organizes records into row groups, whose size can be configured to much more than 4KB. For instance, the RCFile deployment at Facebook uses 4MB as the default row-group size. Thus, each HDFS block may contain one or more row groups, each in turn containing multiple records.
With this design, RCFile achieves a minimal record reconstruction cost since, given the relational schema, all the attributes of a record are guaranteed to reside in the same row group on the same node. Then, similarly to PAX, each row group is internally organized like a column store. This makes it possible to skip unnecessary column reads at query execution time, since RCFile can read from the underlying storage only the columns required by the executed query. Together, these two design choices enable fast query processing for RCFile.

Secondly, RCFile achieves efficient storage space utilization by compressing each column separately, like a column store. RCFile uses the GZIP algorithm, which provides high compression ratios; its high decompression overheads are alleviated in RCFile by a technique called lazy decompression, which decompresses only some of the columns of a row group. For instance, consider the query SELECT c1 FROM tbl(c1, c2, c3, c4) WHERE c4 = 1. At query execution time, a mapper processes the row groups inside its assigned HDFS block sequentially. However, RCFile does not read the contents of the whole row group into memory. Instead, it reads only the row-group metadata and the referenced columns; in the example above, these are c1 and c4. At this point, column c4 must be decompressed in memory in order to check the WHERE condition. Column c1, however, is decompressed only if there exists a record within the row group that satisfies the condition, and not otherwise.

Finally, we make two observations. First, while a larger row-group size may improve compression ratios, it may cancel out the performance benefits of lazy decompression: as a row group gets bigger, it becomes increasingly likely that some record within it satisfies the WHERE condition. Second, while the design of RCFile manages to satisfy the requirements of efficient storage space utilization and fast query processing, it also inherits fast data loading and workload independence from PAX. The sketch below walks through the lazy-decompression logic.
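The following minimal Java sketch reconstructs the lazy-decompression idea for the example query above. All names are hypothetical and the real RCFile reader (part of Apache Hive) is considerably more involved; this is only meant to show when each column's bytes are actually decompressed:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    // Sketch of RCFile-style lazy decompression for:
    //   SELECT c1 FROM tbl(c1, c2, c3, c4) WHERE c4 = 1
    // Each row group stores one GZIP-compressed byte chunk per column.
    public class LazyDecompressionSketch {

        static byte[] gzip(byte[] raw) throws Exception {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) { gz.write(raw); }
            return bos.toByteArray();
        }

        static byte[] gunzip(byte[] compressed) throws Exception {
            try (GZIPInputStream gz =
                    new GZIPInputStream(new ByteArrayInputStream(compressed))) {
                return gz.readAllBytes();
            }
        }

        public static void main(String[] args) throws Exception {
            // One toy row group of 4 records; one byte per value for simplicity.
            byte[] c1 = {10, 20, 30, 40};
            byte[] c4 = {0, 0, 1, 0};
            byte[] c1Compressed = gzip(c1), c4Compressed = gzip(c4);

            // Predicate column: decompressed eagerly, to evaluate WHERE c4 = 1.
            byte[] c4Values = gunzip(c4Compressed);
            boolean anyMatch = false;
            for (byte v : c4Values) anyMatch |= (v == 1);

            // Projected column: decompressed lazily, only if some record matched.
            if (anyMatch) {
                byte[] c1Values = gunzip(c1Compressed);
                for (int i = 0; i < c4Values.length; i++) {
                    if (c4Values[i] == 1) System.out.println("c1 = " + c1Values[i]);
                }
            } else {
                System.out.println("row group skipped: c1 never decompressed");
            }
            // Columns c2 and c3 are never even read from storage.
        }
    }

Note how a larger row group makes the anyMatch test more likely to succeed, which is exactly the first observation above.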

IV. HADOOPDB: A STACKED ARCHITECTURE

In the previous section we discussed how the performance of the MapReduce framework can be improved by using (and extending) data placement techniques from parallel DBMS. An alternative and orthogonal approach examines whether one can use a database system directly, as is, by inserting such a database as a resource in each of the MapReduce nodes. The main intuition is that, if we push as many operations as possible into those databases, we can effectively take full advantage of their performance optimizations. Thus, the main research question is whether one can build a hybrid system that has the scalability and fault-tolerance behavior of MapReduce while maintaining the performance characteristics of a parallel database, so as to effectively achieve all the requirements presented in Section II.

HadoopDB [2], whose architecture is presented in Figure 2(a), follows exactly this idea and places a database system in each node, while using MapReduce as a coordination and network communication medium between those databases. By doing so, HadoopDB effectively transforms any single-node database system into a shared-nothing parallel database, while inheriting the job tracking, runtime scheduling and fault tolerance of MapReduce. HadoopDB integrates almost seamlessly with Hadoop, since the databases are just data sources to the framework, similar to the data blocks in HDFS. The system operates as follows. First, SQL queries are translated into MapReduce jobs using a modified version of Hive [18]. Then, each MapReduce job connects to the underlying databases, executes as much of the query as possible there, and returns a set of key-value pairs to the MapReduce framework for further processing. Given this description, HadoopDB operates in three phases: data loading, query generation and, finally, query execution. We describe each of these phases next.

A. Data Loading

HadoopDB must first load the input data into the individual databases before processing can commence inside those engines. This operation consists of the following two steps. First, the data loader component of HadoopDB performs a global and a local repartitioning of the raw HDFS data files, before loading them into the databases. During the global repartitioning, the HDFS data is re-partitioned across the cluster nodes on a given partition key. Then, each node separately performs a secondary local repartitioning, breaking its HDFS partition apart into several smaller chunks, based on a given secondary key.
The partitioning functions are chosen to ensure good load balancing and uniform chunk sizes. Second, each chunk is bulk-loaded at each node into its corresponding single-node database, at which point an appropriate index may be created inside the DBMS.

Though these repartitioning operations incur high network overheads, they only affect performance at data loading time. Thus, not only do they pose no overhead at query execution time, they can also significantly improve the performance of analytical workloads. For instance, consider a query including a join between two relations. In the MapReduce framework, joins are traditionally performed by repartitioning the records of the two relations by the join key in the map phase, so that reducers can join groups of records with the same key in the reduce phase. While this approach requires no schema knowledge, it incurs high network transfers due to the reshuffling of data between the map and reduce phases, causing high query execution overheads. In contrast, HadoopDB can choose its repartitioning keys to match the join attribute of both relations, effectively placing records with the same key on the same node at data loading time. By doing so, the join operation can be pushed completely into the local databases. Notice that in order to do this, HadoopDB assumes knowledge of the join keys, and thus requires a priori knowledge of the queries to be executed. A sketch of this co-partitioning idea follows below.

Finally, we note that HadoopDB must also maintain information about the individual databases. This information is stored in the Catalog component of HadoopDB and includes, among other things, connection parameters and credentials, schema information, and replication and partitioning properties.
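The following minimal Java sketch (our own illustration with hypothetical names, not HadoopDB code) shows the co-partitioning idea: both relations are partitioned with the same function on the join key at load time, so every joining pair of records ends up local to one node:

    import java.util.ArrayList;
    import java.util.List;

    // Co-partitioning two relations R and S by their join key at load time.
    // Records that share a join key land in the same partition, so a join
    // over them never crosses node boundaries.
    public class CoPartitionSketch {

        record Tuple(int joinKey, String payload) {}

        // The same deterministic partitioning function is applied to both relations.
        static int partitionOf(int joinKey, int numNodes) {
            return Math.floorMod(Integer.hashCode(joinKey), numNodes);
        }

        public static void main(String[] args) {
            int numNodes = 3;
            List<Tuple> r = List.of(new Tuple(1, "r1"), new Tuple(2, "r2"), new Tuple(7, "r7"));
            List<Tuple> s = List.of(new Tuple(1, "s1"), new Tuple(7, "s7"), new Tuple(9, "s9"));

            // One (R-chunk, S-chunk) pair per node, filled at "data loading time".
            List<List<Tuple>> rParts = new ArrayList<>(), sParts = new ArrayList<>();
            for (int i = 0; i < numNodes; i++) {
                rParts.add(new ArrayList<>());
                sParts.add(new ArrayList<>());
            }
            for (Tuple t : r) rParts.get(partitionOf(t.joinKey(), numNodes)).add(t);
            for (Tuple t : s) sParts.get(partitionOf(t.joinKey(), numNodes)).add(t);

            // At query time, each node joins its local chunks only: no shuffle phase.
            for (int node = 0; node < numNodes; node++) {
                for (Tuple rt : rParts.get(node)) {
                    for (Tuple st : sParts.get(node)) {
                        if (rt.joinKey() == st.joinKey()) {
                            System.out.printf("node %d: %s JOIN %s on key %d%n",
                                    node, rt.payload(), st.payload(), rt.joinKey());
                        }
                    }
                }
            }
        }
    }

In HadoopDB the partitions are loaded into the local databases, which then execute the join themselves; the sketch only illustrates why the shuffle becomes unnecessary.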

B. Query Generation

HadoopDB provides an SQL-to-MapReduce-to-SQL (SMS) planner in order to execute SQL queries on the system. This component is basically an extension of Hive [18]. Due to space constraints, we omit how Hive produces the original MapReduce plans and refer the reader to the original paper for further details. We note, however, that Hive by default assumes no co-location of data and produces its plans accordingly, whereas HadoopDB depends on such co-location to improve query performance. Next, we present the SMS modifications that achieve co-location of data and push most of the query logic into the local databases at query generation time.

SMS extends Hive in two main areas. First, it updates the metadata catalog of Hive with schema information and references to the database tables. Though Hive by default stores each table in a separate file in HDFS, it also allows tables to exist externally, outside HDFS. Second, it transforms Hive's physical execution plan by performing two passes over it. In the first pass, it retrieves the partitioning keys used by Hive's repartitioning operators, called ReduceSink operators. In the second pass, SMS goes through this list of operators until it determines the first operator for which the database repartitioning key differs from the operator's key. At this point, it creates an SQL query covering all the operators encountered so far, using a rule-based SQL generator that produces SQL out of Hive physical operators. This SQL query is executed by the local databases, as we describe in the next section. SMS then continues to find the next group of operators with matching keys, until all operators have been examined.

Fig. 2: (a) The overall architecture of HadoopDB, with its various components; (b) an example query, showing the modifications made by the SMS planner of HadoopDB to the physical execution plan produced by Hive.

For instance, consider the query shown in Figure 2(b), where the original Hive physical plan is shown on the left. There is only one ReduceSink operation for this query, and the two possible SMS plans produced are shown on the right. If the sales table is partitioned by YEAR(saleDate), then the entire processing logic is pushed into the databases, as shown in the top-right part of the figure. In this case, a single map task per node suffices to complete the given query. Otherwise, SMS must produce the plan shown in the lower-right part of the figure, in which partial aggregates are first produced in the map phase, while the reduce phase merges the partial aggregates from each node to produce the final result.

C. Query Execution

Finally, at query execution time, the MapReduce job is executed over the cluster nodes: each node connects to its local database engine to execute the queries generated by the SMS planner, as described above. The interface between the database engines and the MapReduce framework is provided by the Database Connector of HadoopDB. This component provides information to MapReduce about which JDBC driver to use, as well as other query tuning parameters (such as the query fetch size). The connector is basically an extension of the InputFormat library class, which is responsible for transforming data into key-value pairs and connecting to various resources (like the databases, in the case of HadoopDB). After the query executes, the connector returns all produced results as key-value pairs to MapReduce for further processing. A sketch of this flow appears below.
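As a rough illustration of the connector's role, the Java sketch below runs an SMS-style SQL query on a node-local database over plain JDBC and surfaces the rows as key-value pairs. It is our own standalone approximation: the real connector extends Hadoop's InputFormat, and the JDBC URL, credentials, table and query here are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.Statement;
    import java.util.AbstractMap.SimpleEntry;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Run the SMS-generated SQL on the node-local database and hand the
    // result set to MapReduce as key-value pairs.
    public class DatabaseConnectorSketch {

        static List<Map.Entry<String, String>> runLocalQuery(String jdbcUrl, String user,
                String password, String smsGeneratedSql, int fetchSize) throws Exception {
            List<Map.Entry<String, String>> keyValuePairs = new ArrayList<>();
            try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
                 Statement stmt = conn.createStatement()) {
                stmt.setFetchSize(fetchSize); // a query tuning parameter from the Catalog
                try (ResultSet rs = stmt.executeQuery(smsGeneratedSql)) {
                    ResultSetMetaData md = rs.getMetaData();
                    while (rs.next()) {
                        // Convention for this sketch: first column is the key,
                        // the remaining columns are concatenated into the value.
                        String key = rs.getString(1);
                        StringBuilder value = new StringBuilder();
                        for (int col = 2; col <= md.getColumnCount(); col++) {
                            if (col > 2) value.append('\t');
                            value.append(rs.getString(col));
                        }
                        keyValuePairs.add(new SimpleEntry<>(key, value.toString()));
                    }
                }
            }
            return keyValuePairs; // handed to the map function for further processing
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical local PostgreSQL instance holding one chunk of the data.
            List<Map.Entry<String, String>> rows = runLocalQuery(
                    "jdbc:postgresql://localhost/chunk0", "hadoopdb", "secret",
                    "SELECT EXTRACT(YEAR FROM saleDate), SUM(revenue) FROM sales GROUP BY 1",
                    1000);
            rows.forEach(kv -> System.out.println(kv.getKey() + " -> " + kv.getValue()));
        }
    }

The fetch size matters because the connector streams potentially large result sets out of the database into the MapReduce job.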
V. HADOOP++: EXTENDING MAPREDUCE WITH INDEXING AND EFFICIENT JOIN PROCESSING [8]

In the previous section we presented HadoopDB, which aims to improve the performance of the MapReduce framework by using the performance enhancements of a database directly, as is, by inserting such a database in each node of MapReduce. This work, while it optimizes MapReduce performance, raises the next research question: is it possible to match, or even exceed, the performance enhancements of HadoopDB without using a DBMS? The question becomes even more pronounced when we consider the following two facts. First, the performance benefits of HadoopDB are mostly due to indexing and proper co-partitioning of data; both are techniques that can be implemented outside a DBMS and used in any processing system. Second, HadoopDB effectively changes the interface to SQL and requires installing and configuring a database at each node, a task that can be tedious, as we discussed in Section II-F. Thus, it is imperative that we study, given schema and query knowledge, the feasibility of incorporating indexing and data co-partitioning techniques directly into the MapReduce framework, so that the original map-reduce interface is maintained. Such an analysis is carried out in the Hadoop++ system [8]. Our presentation of this system is structured as follows. First, we analyze how Hadoop++ manages to transparently integrate its changes into Hadoop. Then, we present the indexing and joining techniques proposed by Hadoop++, called Trojan Index and Trojan Join, respectively.

A. Hadoop++ Modifications to Hadoop

Hadoop++ incorporates indexing and data co-partitioning in Hadoop by changing the internal layout of a Hadoop split, which is a large horizontal partition of the data. However, Hadoop++ does not change the underlying framework implementation directly; instead, it overrides specific Hadoop functions with its own UDFs. This allows Hadoop++ to behave like the original Hadoop whenever the user requires it.

Fig. 3: Split formats used for (a) indexing, (b) data co-partitioning, and (c) indexing over co-partitioned data.

More specifically, Hadoop++ overrides the cmp, grp, sh, split and itemize functions of Hadoop. The first two, cmp and grp, are used when sorting data on a given key. Sorting is performed after the map and reduce functions have been executed, to perform partial and full aggregations on the key-value pairs. The function sh repartitions data between the map and reduce phases, while the split and itemize functions define how data is organized in, and read from, the HDFS data blocks. We provide more details about these UDFs in the following discussion.

B. Trojan Index

The basic idea behind the Trojan Index is similar to that employed by parallel database systems: create the index at data loading time, so that its use at query execution time improves performance by avoiding unnecessary I/O and processing. Trojan Indexes are optional, require no SQL engine for their creation or use, make no modifications to the underlying implementation of Hadoop, and allow partial and multiple indexes to be built on an input split.

To begin with, at data loading time, Hadoop++ uses a cache-conscious CSS-tree [16] to represent the index. In order to be I/O efficient, the system places this structure, along with all related metadata (header and footer), next to the corresponding data, as shown in Figure 3(a). Index creation operates as follows. First, a custom-built MapReduce job reads the (non-indexed) input data set stored on HDFS. For each record read, the corresponding mapper constructs and emits a new record with splitid || prj_a(k || v) as composite key and the old record (k || v) as value, where || stands for concatenation and prj_a for projection onto the index attribute a. By performing such a map function, and by properly overriding specific Hadoop functions as described next, data arrive at the reduce side sorted on the index attribute, per split. At this point, the reduce function creates a clustered index by simply emitting the set of values concatenated with the Trojan index, the index header and the split footer. The output data is then stored on the distributed file system. We note that indexing incurs an overhead of about 8MB per 1GB of initial data.

Hadoop++ needs to override three Hadoop functions to implement the above functionality properly. First, in order to guarantee that all reducers receive almost the same amount of work, the partitioning UDF sh is changed so that Hadoop repartitions using the splitid portion of the composite key modulo the number of nodes in the cluster. Notice that, since the split size is fixed and splitid is an ever-increasing counter, this hashing function ensures negligible work imbalance between different nodes. Second, Hadoop++ changes the cmp function of the framework so that it sorts records by considering only the index attribute of the composite key. Finally, since Hadoop++ builds one index per split, the split format must be preserved in each reducer call. Thus, Hadoop++ provides the grp UDF so that records with the same split identifier are grouped together, based on the splitid part of the composite key. A sketch of this composite-key construction follows below.
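The Java sketch below (ours, with hypothetical names and simplified types; see [8] for the real UDFs) mimics the composite-key construction of the index-building job and the three overridden functions as plain methods, to show how the pieces fit together:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Each mapper emits (splitid, a-value) as a composite key and the original
    // record as value; combined with the sh/cmp/grp overrides, every reducer
    // receives the records of one split sorted on the index attribute a.
    public class TrojanIndexKeySketch {

        record CompositeKey(int splitId, long indexAttr) {}
        record Emitted(CompositeKey key, String record) {}

        // sh: partition on the splitid part only, modulo the number of nodes.
        static int sh(CompositeKey k, int numNodes) {
            return Math.floorMod(k.splitId(), numNodes);
        }

        // cmp: order records by the index-attribute part of the composite key.
        static final Comparator<Emitted> CMP =
                Comparator.comparingLong(e -> e.key().indexAttr());

        // grp: two keys belong to the same reduce group iff their splitid matches.
        static boolean grp(CompositeKey x, CompositeKey y) {
            return x.splitId() == y.splitId();
        }

        public static void main(String[] args) {
            // Map phase over one split: project attribute a out of each record.
            int splitId = 7;
            String[] records = {"a=42,rest=x", "a=17,rest=y", "a=99,rest=z"};
            List<Emitted> emitted = new ArrayList<>();
            for (String rec : records) {
                long a = Long.parseLong(rec.substring(2, rec.indexOf(',')));
                emitted.add(new Emitted(new CompositeKey(splitId, a), rec));
            }
            // The framework sorts with cmp before each reduce call; the reducer
            // can then write the sorted split, its CSS-tree index and the footer.
            emitted.sort(CMP);
            emitted.forEach(e -> System.out.println(e.key() + " -> " + e.record()));
        }
    }

The CSS-tree construction itself is omitted; the point is that the sorted, per-split grouping it needs falls out of the composite key plus the three UDFs.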
Then, at query execution time, Hadoop++ proceeds as follows. First, the MapReduce query job extracts the logical splits at each node from the HDFS data files created at data loading time. To do so, the framework overrides the default split function used by Hadoop, so that it uses the split footers to identify the boundaries of the logical splits within the HDFS blocks of each file. Then, for each logical split obtained, Hadoop++ first reads the header of the split in order to obtain the key range of the index in that split. If the key requested by the query does not overlap with the key range of the split, the whole split is skipped. Otherwise, there is some overlap, and the CSS-tree is read into main memory. The CSS index is then used to read only the records that satisfy the search predicate of the executed query, and only those records are passed to the map function. This functionality is achieved by overriding the itemize function of Hadoop.

C. Trojan Join

Trojan Join allows for more efficient join processing by exploiting schema knowledge and properly co-partitioning data at loading time. With this design, similarly to HadoopDB, it becomes possible at query execution time to compute all join results locally, in the map phase only, thus reducing network overheads (since the shuffle and reduce phases of MapReduce are skipped entirely). Finally, like the Trojan Index, this join technique does not require any modification to the underlying implementation of Hadoop.

Hadoop++ implements co-partitioning by placing records with the same join key from the two relations in the same split, thus forming co-groups that are processed at query execution time on the same node. The co-partitioned data layout generated by Hadoop++ is shown in Figure 3(b) and is produced by executing the following MapReduce job. At the map phase, the job outputs the join attribute of each record as key and the record itself as value. By doing so, records from both relations that have the same key go to the same reducer, and it then suffices to put them together in the same co-group and split.

At query execution time, the following algorithm is used. First, the split footer is read to obtain the split boundaries and the boundaries of each co-group. The map function that processes the split has all the records from both relations available locally to perform the join. Thus, it suffices to read the records from the underlying local storage, buffer them in memory, perform the join, and output the join result. Since there is no need for a reduce function, the output is written to HDFS immediately. A sketch of this map-side processing follows below.
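A minimal Java sketch of this map-side co-group join, again with hypothetical names rather than actual Hadoop++ code:

    import java.util.ArrayList;
    import java.util.List;

    // How a map task might process one co-group of a co-partitioned Trojan
    // Join split: all R- and S-records sharing a join key are already local,
    // so the join needs no shuffle or reduce phase.
    public class TrojanJoinSketch {

        record Rec(String relation, int joinKey, String payload) {}

        // Join one co-group: buffer the R side, then stream the S side against it.
        static void joinCoGroup(List<Rec> coGroup) {
            List<Rec> rSide = new ArrayList<>();
            for (Rec rec : coGroup) if (rec.relation().equals("R")) rSide.add(rec);
            for (Rec rec : coGroup) {
                if (rec.relation().equals("S")) {
                    for (Rec r : rSide) {
                        // Within a co-group all records share the join key,
                        // so every (R, S) pair is a join result.
                        System.out.printf("join: %s x %s (key %d)%n",
                                r.payload(), rec.payload(), rec.joinKey());
                    }
                }
            }
            // The map task writes these results to HDFS directly.
        }

        public static void main(String[] args) {
            // One co-group of the split, as delimited by the footer boundaries.
            joinCoGroup(List.of(
                    new Rec("R", 7, "r1"), new Rec("R", 7, "r2"), new Rec("S", 7, "s1")));
        }
    }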

D. Trojan Index over Co-Partitioned Data

Finally, we discuss how the two aforementioned techniques can be combined in Hadoop++, which allows a Trojan Index to be deployed over co-partitioned data. The index can be built over one or both of the two co-partitioned relations. An example of this format is shown in Figure 3(c), where only one of the relations is indexed. Furthermore, the indexing key does not necessarily have to be the same as the join key; if it is not, however, additional sorting may have to be performed on the indexed relation in order to build the clustered index. In any case, the only change in the design of Hadoop++ is a customization of the itemize UDF, so that it corresponds to the hybrid index and join structures and properly scans or skips the corresponding information inside the split. Such adaptations are straightforward.

VI. CONCLUSIONS & RESEARCH DIRECTIONS

In this report we presented three papers that aim to improve the performance of the MapReduce framework by using techniques typically found in PDBMS, or a database itself directly, as is. Such techniques are necessary today because, though MapReduce provides great scalability, its performance often does not match that of a parallel database system [14].

The first paper, RCFile [10], concludes that none of the traditional data placement techniques of parallel databases can satisfy all the performance requirements of modern analytical workloads, and thus proposes a new model especially for this purpose. The model, which is essentially an extension of PAX, employs compression and large operational units called row groups, and manages to provide fast data loading, fast query processing and efficient storage space utilization, while requiring no prior workload knowledge.

HadoopDB [2] examines whether one can take advantage of the performance enhancements provided by a database directly, by inserting such a database as a resource in each node of MapReduce. The proposed system mainly exploits indexing and data co-partitioning in the database to achieve this goal, thus significantly enhancing performance. However, it changes the interface to SQL and requires installing and maintaining a database in each node, a task that can be quite tedious.

For this reason, Hadoop++ [8] attempts to improve the performance of MapReduce not through a database, but by directly extending the framework so that it uses indexes and a special join technique. This allows Hadoop++ to keep the original map-reduce interface. The authors justify their approach with the observation that indexing and data co-partitioning can be implemented outside a DBMS and used in any processing system. An important aspect of Hadoop++ is that it does not modify Hadoop directly; instead, it overrides specific functions of the framework, which allows it to behave like the original Hadoop when the user requires it.

There are two possible research directions to follow. First, there are still performance improvements from PDBMS that can be applied to the MapReduce framework; as two examples, we mention result caching and directly operating on compressed data. Furthermore, given our emphasis on data analytics workloads, it would be interesting to examine how to adjust a result set incrementally when new data arrives, instead of recalculating the whole query.
Second, and most importantly, observe that even though the performance of MapReduce is significantly improved by the systems presented in this report, none of the architectures satisfies all of our requirements yet. For instance, HadoopDB and Hadoop++ manage to improve query performance, but they sacrifice the fast-data-loading requirement to do so. The author believes that such a system is possible, especially if a from-scratch approach is followed. Hybrid approaches, like those presented here, achieve high performance and scalability, but they also incorporate the inherent limitations of the individual approaches. Most importantly, there are interface overheads (e.g., converting relational records to key-value pairs in HadoopDB) that only grow as more data is pushed into the system. A newly designed system would not suffer from such problems, while coming closer to satisfying the requirements of modern analytics workloads.

REFERENCES

[1] Apache Hadoop. http://hadoop.apache.org/.
[2] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc. VLDB Endow., 2(1), Aug. 2009.
[3] A. Ailamaki, D. J. DeWitt, M. D. Hill, and M. Skounakis. Weaving relations for cache performance. In VLDB '01, 2001.
[4] Amazon Inc. Amazon Elastic Compute Cloud (Amazon EC2). http://aws.amazon.com/ec2/.
[5] G. P. Copeland and S. N. Khoshafian. A decomposition storage model. In SIGMOD '85, New York, NY, USA, 1985. ACM.
[6] G. Czajkowski. Sorting 1PB with MapReduce. Nov. 2008.
[7] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1), Jan. 2008.
[8] J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). Proc. VLDB Endow., 3(1-2), Sept. 2010.
[9] J. Hamilton. Cooperative expendable micro-slice servers (CEMS): low cost, low power servers for internet-scale services. In CIDR '09, 2009.
[10] Y. He, R. Lee, Y. Huai, Z. Shao, N. Jain, X. Zhang, and Z. Xu. RCFile: a fast and space-efficient data placement structure in MapReduce-based warehouse systems. In ICDE '11, 2011.
[11] S. Madden, D. DeWitt, and M. Stonebraker. Database parallelism choices greatly impact scalability. The Database Column, Oct. 2007.
[12] C. Monash. The 1-petabyte barrier is crumbling. Aug. 2008.
[13] C. Monash. Cloudera presents the MapReduce bull case. Apr. 2009.
[14] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. In SIGMOD '09, New York, NY, USA, 2009. ACM.
[15] R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill, 3rd edition, 2002.
[16] J. Rao and K. A. Ross. Cache conscious indexing for decision-support in main memory. In VLDB '99, San Francisco, CA, USA, 1999.
[17] The Economist. The data deluge. Feb. 2010.
[18] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy. Hive - a petabyte scale data warehouse using Hadoop. In ICDE '10, Mar. 2010.

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models RCFile: A Fast and Space-efficient Data

More information

HadoopDB: An open source hybrid of MapReduce

HadoopDB: An open source hybrid of MapReduce HadoopDB: An open source hybrid of MapReduce and DBMS technologies Azza Abouzeid, Kamil Bajda-Pawlikowski Daniel J. Abadi, Avi Silberschatz Yale University http://hadoopdb.sourceforge.net October 2, 2009

More information

MINING OF LARGE SCALE DATA USING BESTPEER++ STRATEGY

MINING OF LARGE SCALE DATA USING BESTPEER++ STRATEGY MINING OF LARGE SCALE DATA USING BESTPEER++ STRATEGY *S. ANUSUYA,*R.B. ARUNA,*V. DEEPASRI,**DR.T. AMITHA *UG Students, **Professor Department Of Computer Science and Engineering Dhanalakshmi College of

More information

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Alexander Rasin and Avi Silberschatz Presented by

More information

Large Scale OLAP. Yifu Huang. 2014/11/4 MAST Scientific English Writing Report

Large Scale OLAP. Yifu Huang. 2014/11/4 MAST Scientific English Writing Report Large Scale OLAP Yifu Huang 2014/11/4 MAST612117 Scientific English Writing Report 2014 1 Preliminaries OLAP On-Line Analytical Processing Traditional solutions: data warehouses built by parallel databases

More information

Jumbo: Beyond MapReduce for Workload Balancing

Jumbo: Beyond MapReduce for Workload Balancing Jumbo: Beyond Reduce for Workload Balancing Sven Groot Supervised by Masaru Kitsuregawa Institute of Industrial Science, The University of Tokyo 4-6-1 Komaba Meguro-ku, Tokyo 153-8505, Japan sgroot@tkl.iis.u-tokyo.ac.jp

More information

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin Presented by: Suhua Wei Yong Yu Papers: MapReduce: Simplified Data Processing on Large Clusters 1 --Jeffrey Dean

More information

An Introduction to Big Data Formats

An Introduction to Big Data Formats Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION

More information

April Copyright 2013 Cloudera Inc. All rights reserved.

April Copyright 2013 Cloudera Inc. All rights reserved. Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and the Virtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here April 2014 Analytic Workloads on

More information

Query processing on raw files. Vítor Uwe Reus

Query processing on raw files. Vítor Uwe Reus Query processing on raw files Vítor Uwe Reus Outline 1. Introduction 2. Adaptive Indexing 3. Hybrid MapReduce 4. NoDB 5. Summary Outline 1. Introduction 2. Adaptive Indexing 3. Hybrid MapReduce 4. NoDB

More information

Part 1: Indexes for Big Data

Part 1: Indexes for Big Data JethroData Making Interactive BI for Big Data a Reality Technical White Paper This white paper explains how JethroData can help you achieve a truly interactive interactive response time for BI on big data,

More information

Modeling and evaluation on Ad hoc query processing with Adaptive Index in Map Reduce Environment

Modeling and evaluation on Ad hoc query processing with Adaptive Index in Map Reduce Environment DEIM Forum 213 F2-1 Adaptive indexing 153 855 4-6-1 E-mail: {okudera,yokoyama,miyuki,kitsure}@tkl.iis.u-tokyo.ac.jp MapReduce MapReduce MapReduce Modeling and evaluation on Ad hoc query processing with

More information

A Fast and High Throughput SQL Query System for Big Data

A Fast and High Throughput SQL Query System for Big Data A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190

More information

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09 Presented by: Daniel Isaacs It all starts with cluster computing. MapReduce Why

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

SQL-to-MapReduce Translation for Efficient OLAP Query Processing

SQL-to-MapReduce Translation for Efficient OLAP Query Processing , pp.61-70 http://dx.doi.org/10.14257/ijdta.2017.10.6.05 SQL-to-MapReduce Translation for Efficient OLAP Query Processing with MapReduce Hyeon Gyu Kim Department of Computer Engineering, Sahmyook University,

More information

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Huayu Wu Institute for Infocomm Research, A*STAR, Singapore huwu@i2r.a-star.edu.sg Abstract. Processing XML queries over

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

EXTRACT DATA IN LARGE DATABASE WITH HADOOP

EXTRACT DATA IN LARGE DATABASE WITH HADOOP International Journal of Advances in Engineering & Scientific Research (IJAESR) ISSN: 2349 3607 (Online), ISSN: 2349 4824 (Print) Download Full paper from : http://www.arseam.com/content/volume-1-issue-7-nov-2014-0

More information

Accelerating BI on Hadoop: Full-Scan, Cubes or Indexes?

V. CONCLUSIONS

V.1 Related work

Even though MapReduce appears to be constructed specifically for performing group-by aggregations, much interesting research work is also being done on studying critical
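To make the group-by flavor of the programming model concrete, the following minimal sketch (our own illustration, not code from any of the surveyed systems; all function names are hypothetical) expresses a per-key SUM aggregation as a map function, a framework-provided shuffle, and a reduce function:

```python
from collections import defaultdict

def map_fn(record):
    # Emit one (key, value) pair per input record; here, a sale per product.
    product, amount = record
    yield product, amount

def shuffle(pairs):
    # The framework groups intermediate values by key between the two phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_fn(key, values):
    # Each group is aggregated independently, so reducers can run in parallel.
    yield key, sum(values)

records = [("shoes", 10), ("hats", 5), ("shoes", 7)]
mapped = (pair for rec in records for pair in map_fn(rec))
for key, values in shuffle(mapped):
    for result in reduce_fn(key, values):
        print(result)  # -> ('shoes', 17) and ('hats', 5)
```

Because the reduce function sees each key's values in isolation, this pattern parallelizes across nodes exactly as a GROUP BY with an aggregate would in a shared-nothing parallel DBMS.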
