Combining MapReduce with Parallel DBMS Techniques for Large-Scale Data Analytics

EDIC RESEARCH PROPOSAL
Ioannis Klonatos, DATA, I&C, EPFL

Abstract: High scalability is becoming an essential requirement of data analytics systems, as the amount of data being collected, stored and processed on a daily basis continues to grow rapidly. While the MapReduce framework [7] is designed specifically to satisfy this requirement, it has often been criticized for its poor performance [14]. In this report we present three papers that show how the performance of the MapReduce framework can be significantly improved using techniques inherited from Parallel Database Management Systems (PDBMS). Two of these papers present techniques for data placement, indexing and a new join operator designed specifically for MapReduce, while the third examines whether the performance enhancements of a database can be used directly, as is, by inserting such a database as a resource in each MapReduce node. Our conclusions show that while these approaches look promising, there is still room for further improvements.

Index Terms: Analytical Workloads, MapReduce, Parallel DBMS, Data Placement, Indexing, Joins on MapReduce.

Proposal submitted to committee: August 27, 2012; candidacy exam date: September 3, 2012; candidacy exam committee: Prof. Willy Zwaenepoel, Prof. Christoph Koch, Prof. Anastasia Ailamaki. This research plan has been approved: Date; Doctoral candidate (name and signature); Thesis director (name and signature); Thesis co-director, if applicable (name and signature); Doct. prog. director, R. Urbanke (signature).

I. INTRODUCTION

Nowadays, we are witnessing an explosion in the amount of data being collected, stored and processed in data warehouses on a daily basis. It is becoming increasingly common to hear companies claiming to load more than a terabyte of data per day into their systems, whose total data footprint may exceed one petabyte [12]. For example, Facebook reported pushing 20TB of data every day to its (currently) 2.5PB warehouse ([10], [13]). As a result of this "data deluge", as it has been called, corporations have significantly changed the way they process data [17]. Companies are moving away from performing data analysis¹ on high-end proprietary machines in favor of shared-nothing cluster architectures composed of thousands of low-cost machines built from unreliable commodity hardware. These architectures are usually provided in the form of virtualized environments in private or public clouds, such as Amazon's Elastic Compute Cloud (EC2) [4]. The fundamental reason behind this shift is cost. Given that efficiently processing datasets at the petabyte scale requires exploiting a great degree of parallelism, the proven scalability [11], price-to-performance and power-to-performance [9] characteristics of these architectures make them ideal candidates for large-scale data analysis. This is because data analysis workloads tend to consist of many large scan operations, multi-dimensional aggregations and star-schema joins, all of which are fairly easy to parallelize across nodes in a shared-nothing network [2]. Furthermore, cloud offerings today can reduce operational costs by maximizing the utilization of their underlying hardware. This reduction is then passed on to clients, who essentially pay only for what they use, thus reducing their own operating, facilities, and other hardware-related costs.
As a result, cloud offerings can today be used as the infrastructure on top of which one can build highly scalable services [9]. Unfortunately, as we explain in Sections II-B and II-C, there are many reasons that make it difficult for existing parallel database systems to scale to thousands of nodes located in such cloud environments. Based on the above analysis, and on the observation that the scalability requirements of applications are not likely to go down, the MapReduce framework [7] and its open-source implementation Hadoop [1] have been developed. This framework has been designed from the start to scale to thousands of nodes, and its use at Google is proof of that [6]. Despite its high scalability, MapReduce has been frequently criticized for its poor performance, which can be up to an order of magnitude slower than that of a traditional parallel DBMS [14]. In this report we present three papers that show that one can use techniques inherited from parallel databases, or even a database itself directly, as is, in order to boost the performance of MapReduce.

This report is structured as follows. In Section II we summarize the requirements a modern state-of-the-art system should satisfy in order to efficiently process data at the petabyte scale. Then, in Section III, we study whether data placement techniques proposed for parallel databases can boost the performance of MapReduce.

¹ Data analysis workloads are read-intensive and produce write requests only during data loading.

The conclusion of this discussion is that those techniques do not completely satisfy our requirements and, thus, a new technique called RCFile [10] is developed. In Section IV, we present HadoopDB [2], which examines whether the performance enhancements of a database can be used directly, as is, by inserting such a database as a resource in each node of MapReduce. Hadoop++ [8], on the other hand, presented in Section V, aims to enrich MapReduce not through a database, but by extending the framework so that it uses indices and a special join technique. Finally, in Section VI, we present our conclusions and state some interesting research paths which may extend the systems presented in this report.

II. DESIRED PROPERTIES OF ANALYTICS PLATFORMS

In this section we present the properties a modern state-of-the-art data analytics system should have in order to meet the requirements of modern workloads. While doing so, we also analyze whether parallel databases and MapReduce satisfy each requirement.

A. High Performance

High performance is the primary requirement for a modern data analytics system. High performance brings cost savings, as it can help delay a costly hardware upgrade as the application's CPU, storage and network requirements continue to grow. There are three main aspects to this requirement:

Fast data loading. Short loading times are becoming a crucial requirement, since data loading causes significant network and disk traffic, which interferes with normal query operations. Furthermore, short loading times increase the precision and freshness of related computations, since they are performed on more up-to-date data.

Fast query processing. Traditionally, data warehouses must be able to concurrently execute multiple batches of heavy decision-support queries. Nowadays, there are also response-time-critical queries, originating from users on the web. To guarantee short query execution times, the volume of network and storage I/O must be minimized. The system must also retain this characteristic as the number of queries increases over time.

Highly efficient storage space utilization. Though storage is continuously becoming cheaper, there are still power (fewer hard disks means less power consumption) and performance (fewer hard disk seeks means better performance) considerations that make this requirement essential. Most importantly, the trend is that data footprints will soon grow faster than storage densities [17].

Of the two systems, parallel databases have long been optimized for high performance, and they currently incorporate at least a decade of optimization techniques published in the database literature (such as indexing, compression and directly operating on compressed data, materialized views, result caching and I/O sharing). Furthermore, PDBMS usually use cost-based optimizers that employ statistics or simple optimization rules to improve performance. Those statistics, along with modeling data on well-defined schemas, help make optimal choices for each executed query. MapReduce, on the other hand, performs unstructured data analysis and thus cannot employ many of the performance optimizations of databases. This can lead to an order of magnitude worse performance compared to a PDBMS [14].

B. Fault Tolerance and Scalability

For data analysis systems, where there are no write operations except those caused by data loading, fault tolerance is defined as not having to restart a query when some node working on that query fails.
Given that the probability of a node failure increases as one adds more failure-prone commodity machines to the cluster, restarting a query on every node failure may make long-running queries difficult to complete. There is usually a clear tradeoff between performance and fault tolerance. For instance, checkpointing the results of completed sub-tasks increases the fault tolerance of long-running queries, but imposes a significant overhead which reduces performance. As another example, consider the pipelining of intermediate results between query operators: a design choice that improves performance, but also increases the amount of work lost when a failure occurs. This last observation guides the design choices of the two data analytics platforms. On the one hand, PDBMS are historically designed for small clusters, an environment in which failures are rare events. Thus, these systems simply restart the entire query when a failure is detected, using replica nodes, in order to avoid the overheads of techniques like checkpointing. As a result, those systems cannot efficiently scale in today's commodity clusters, and there is no known parallel database system today that scales to more than one hundred nodes. On the other hand, MapReduce is designed based on the observation that failures are frequently occurring events. It therefore optimizes for fault tolerance by checkpointing the results of the map phase, and achieves high scalability to thousands of nodes.

C. Managing Heterogeneity

Acquiring homogeneous performance from all cluster nodes becomes an increasingly difficult task as more nodes are assigned to a data analytics system. This is true even if all machines have the same, virtualized or physical, hardware. Partial node failures, fragmentation of individual disks, concurrent execution of tasks by different users, or even software configuration errors can all seriously degrade performance. Contrary to parallel DBMS, MapReduce uses a runtime-determined execution plan. This allows MapReduce to trade runtime scheduling overhead for the ability to handle slow nodes through redundant task execution. With this technique, the tasks of slow workers are also executed on faster nodes, so that the execution time equals the time required by the faster node to complete the task.

D. Not Workload Specific

Data stored in modern warehouses is analyzed by different applications and users in significantly diverse ways. As a result, warehouse accesses do not conform to any regular workload pattern. Thus, a data analytics system should not assume any a priori knowledge (e.g., of the queries to be executed). Instead, it should quickly adapt to any workload it receives.

E. Wide Variety of Exported Interfaces

There are three aspects to this requirement. First, given that many users of data analytics systems are not experienced programmers but analysts, the interface exported by the system should be as flexible as possible. To this end, access through both SQL and general-purpose languages like Java or C++ should be available to users. Furthermore, there should be no restrictions on the data format used (e.g., structured data only). Second, there has been extensive prior work on business tools that help with visualization, query generation and data analysis. Analysts should be able to continue using these tools on a new system, without any modifications to their code base. Finally, users should be able to extend the system by programming user-defined functions (UDFs), which should then be automatically parallelized by the system.

F. Low Cost

Finally, there is the issue of cost. On the one hand, Hadoop is an open-source implementation of MapReduce. On the other hand, parallel databases are extremely costly, usually with seven-figure prices. Furthermore, while there have been advances in automatic configuration, deployment and tuning of parallel databases, these systems still usually require highly skilled database administrators for maintenance. Ideally, users would like a free system that satisfies all the above requirements out of the box.

III. DATA PLACEMENT TECHNIQUES FOR MAPREDUCE

Data placement can fundamentally affect the performance and workload independence aspects we described in Sections II-A and II-D, respectively. Thus, there has been a long line of work in the field of parallel databases on how data should be placed on the underlying storage. Currently, there are three popular data placement techniques, namely row-store, column-store and PAX-store [3]. Given that the default placement technique of the MapReduce framework resembles a row-store, one reasonable question to ask is whether the other two schemes can perform better than a row-store in this environment. To answer this question, one can examine the merits and drawbacks of each technique in all aspects of performance. We present such an analysis next. In what follows, note that an HDFS block contains many disk pages, each in the format described in each case below.

A. Row Storage

In this data placement technique, which is formally called the N-ary Storage Model [15] and is displayed in Figure 1(A), the records of a database relation are stored one after another in each disk page, in the order of their appearance. The major advantage of this scheme is that it achieves fast loading times, since it requires no preprocessing of the relational records before they are transmitted to the cluster nodes. It also does not require any a priori knowledge of the queries to be executed, so it is workload independent. However, this scheme typically performs poorly at query execution time and does not efficiently utilize the underlying storage space. This is because relational records nowadays tend to contain hundreds of attributes, out of which only a few are used in the typical aggregation queries of warehouse workloads. As a result, row storage performs unnecessary read requests, fetching from the underlying storage devices columns that are not used by the query.
Furthermore, this scheme suffers from a poor compression ratio, since the mixed data domains of the different attributes of each record cause high information entropy, which is known to hurt the effectiveness of compression algorithms.

B. Column Storage

Column stores improve I/O performance by taking advantage of the previous observation that only a few attributes of a relation are used in typical data warehousing queries. They do so by vertically partitioning the attributes of a database relation into several sub-relations, so that each sub-relation contains one or more attributes of the original relation. Each such sub-relation is then stored separately from the others in one or more regular disk pages. There are two basic variations of this format, described next.

The first variation, called the Decomposition Storage Model [5], stores one column per sub-relation in a page, as shown in Figure 1(B). The performance of this placement technique is highly dependent on the query being executed. On the one hand, if the query references only columns in HDFS blocks that are locally available to the cluster node, then this scheme avoids unnecessary column reads from the underlying storage, since it reads only the columns required by the query. On the other hand, this requirement is not always met in the MapReduce framework. If the query references columns that are not locally available, it has to perform excessive network transfers to fetch columns from multiple other cluster nodes in order to do record reconstruction, resulting in poor query performance. Despite this fact, the scheme does not require any prior workload knowledge and is thus workload independent.

The second variation, called column-group storage, stores multiple columns per sub-relation in a page instead of only one, as shown in Figure 1(C). The way data is organized inside a group (row- or column-oriented) depends on the system implementation, and some columns may exist in multiple column groups. This scheme improves over plain column stores when the queries to be executed are known a priori and appropriate column groups have been created for them. In this case, column-group storage avoids the excessive network transfers of column stores, since all the required columns exist in local storage. Otherwise, a record reconstruction is still necessary to merge two or more column groups. Thus, column-group storage sacrifices workload independence to obtain better performance. Moreover, overlapping columns increase the data footprint, resulting in underutilization of the storage space.

Finally, we note two things. First, neither of the two column schemes achieves fast data loading, since both require preprocessing of each relational record in order to split the different attributes into different pages. Second, both schemes achieve a high compression ratio by compressing only one column at a time, thus keeping data domains separate. This results in efficient storage space utilization. The sketch below illustrates the basic I/O tradeoff between row and column layouts.
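To make the I/O tradeoff concrete, the following small, self-contained Java sketch (our own illustration, not code from any of the papers discussed) lays out the same toy relation in row-major (NSM) and column-major (DSM) order and counts the bytes a single-attribute aggregation must touch under each layout:

    import java.nio.ByteBuffer;

    // Toy comparison of the N-ary storage model (row store) and the
    // decomposition storage model (one column per page). A query that
    // aggregates a single attribute reads far fewer bytes under DSM.
    public class LayoutSketch {

        static final int N = 4;                       // records in this toy "page"
        static final long[] a = {1, 2, 3, 4};         // attribute a (the one queried)
        static final long[] b = {10, 20, 30, 40};     // attribute b
        static final long[] c = {100, 200, 300, 400}; // attribute c

        public static void main(String[] args) {
            // Row store: a1 b1 c1 a2 b2 c2 ... (attributes interleaved).
            ByteBuffer rowPage = ByteBuffer.allocate(N * 3 * Long.BYTES);
            for (int i = 0; i < N; i++) {
                rowPage.putLong(a[i]).putLong(b[i]).putLong(c[i]);
            }

            // Column store: one page per attribute; SUM(a) only needs a's page.
            ByteBuffer colPageA = ByteBuffer.allocate(N * Long.BYTES);
            for (long v : a) colPageA.putLong(v);
            // (pages for b and c would be built the same way, but never read here)

            // SUM(a) over the row store: we stride over b and c as well,
            // i.e., the whole page travels through the I/O path.
            rowPage.flip();
            long rowSum = 0;
            for (int i = 0; i < N; i++) {
                rowSum += rowPage.getLong(); // read a
                rowPage.getLong();           // skip b (still fetched with the page)
                rowPage.getLong();           // skip c
            }

            // SUM(a) over the column store: only a's page is touched.
            colPageA.flip();
            long colSum = 0;
            for (int i = 0; i < N; i++) colSum += colPageA.getLong();

            System.out.println("row page:    sum = " + rowSum
                    + ", bytes touched = " + N * 3 * Long.BYTES);
            System.out.println("column page: sum = " + colSum
                    + ", bytes touched = " + N * Long.BYTES);
        }
    }

With hundreds of attributes per record, as is typical in warehouse relations, the gap between the two byte counts grows accordingly.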

Fig. 1: The four data placement strategies: row-store, column-store, column-group store and PAX-store.

C. PAX Store

The Partition Attributes Across (PAX) [3] technique, whose design is shown in Figure 1(D), enhances overall performance by improving CPU cache behavior. PAX first partitions data horizontally, like a row store, and then partitions data vertically within each page, like a column store. Thus, it inherits the fast data loading and the workload independence of row stores. Furthermore, it avoids network transfers for record reconstruction by storing all the attributes of a record inside the same page. Finally, it improves cache performance by grouping all values of an attribute together in a mini-page inside the page, so that a cache miss brings into main memory related elements of the same column.

TABLE I: Merits and drawbacks of the four data placement structures for each requirement of big data processing ('?' = satisfied only conditionally).

                    Fast Data   Fast Query    Efficient Storage   Not Workload
                    Loading     Processing    Space Utilization   Specific
    Row Store       yes         no            no                  yes
    Column Store    no          ?             yes                 yes
    Column Group    no          ?             ?                   no
    PAX             yes         ?             no                  yes

Though PAX improves system performance by exploiting better cache usage, it is not designed to satisfy the requirements of big data processing, for the following three reasons. First, PAX does not employ data compression, a technique that can significantly reduce the required storage space. Second, since PAX does not actually change the contents of a page, it performs exactly the same number of I/O requests as a row store. This causes poor query execution performance. Finally, given that modern data analytics databases contain records with thousands of wide attributes, a single record may not fit in a 4KB page, which is the fixed unit of data organization for PAX. Such an event causes multiple (and not necessarily sequential) I/O requests for a single record, significantly degrading performance.

D. Summary of the Analysis so Far

Table I summarizes the behavior of the four placement techniques with respect to each requirement of big data processing. We use the symbol '?' to denote that a scheme satisfies a requirement only conditionally (e.g., when there is no overlapping of groups in column-group stores). We observe that no placement scheme satisfies all the requirements. This observation is the starting point of the RCFile placement format, described next.

E. Record Columnar File (RCFile) [10]

The authors of [10] propose a new data placement technique called Record Columnar File (RCFile), whose aim is to satisfy all the requirements of large-scale data analysis in systems like MapReduce. Since PAX already satisfies two of the requirements, RCFile adopts its concept of partitioning data first horizontally and then vertically, while addressing its missing I/O performance. We describe the overall design of RCFile next, and how it overcomes the limitations of PAX.

To begin with, PAX performs poorly when a relational record cannot fit in its fixed operational unit of 4KB pages. To overcome this limitation, RCFile increases the basic operational unit and organizes records into row groups, whose size can be configured to much more than 4KB. For instance, the RCFile deployment at Facebook uses 4MB as the default row-group size. Thus, each HDFS block may contain one or more row groups, each in turn containing multiple records.
With this design, RCFile achieves a minimal record reconstruction cost since, given the relational schema, all the attributes of a record are guaranteed to reside in the same row group on the same node. Then, similarly to PAX, each row group is internally organized like a column store. This makes it possible to skip unnecessary column reads at query execution time, since RCFile can read from the underlying storage only the columns required by the executed query. Together, these two design choices enable fast query processing for RCFile.

Secondly, RCFile achieves efficient storage space utilization by compressing each column separately, like a column store. RCFile uses the GZIP algorithm, which provides high compression ratios; its high decompression overheads are alleviated in RCFile by a technique called lazy decompression, which decompresses only some of the columns of a row group. For instance, consider the query SELECT c1 FROM tbl(c1, c2, c3, c4) WHERE c4 = 1. At query execution time, a mapper processes the row groups inside its assigned HDFS block sequentially. However, RCFile does not read the contents of the whole row group into memory. Instead, it reads only the row-group metadata and the referenced columns; in the example above, these are c1 and c4. At this point, column c4 must be decompressed in memory in order to check the WHERE condition. Column c1, however, is decompressed only if there exists a record within the row group that satisfies the condition, and not otherwise.

Finally, we make two observations. First, while a larger row-group size may improve compression ratios, it may cancel out the performance benefits of lazy decompression: as a row group gets bigger, it becomes increasingly likely that some record within it satisfies the WHERE condition. Second, while the design of RCFile manages to satisfy the requirements of efficient storage space utilization and fast query processing, it also inherits fast data loading and workload independence from PAX. The sketch below walks through the lazy-decompression logic.
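The following minimal Java sketch reconstructs the lazy-decompression idea for the example query above. All names are hypothetical and the real RCFile reader (part of Apache Hive) is considerably more involved; this is only meant to show when each column's bytes are actually decompressed:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    // Sketch of RCFile-style lazy decompression for:
    //   SELECT c1 FROM tbl(c1, c2, c3, c4) WHERE c4 = 1
    // Each row group stores one GZIP-compressed byte chunk per column.
    public class LazyDecompressionSketch {

        static byte[] gzip(byte[] raw) throws Exception {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) { gz.write(raw); }
            return bos.toByteArray();
        }

        static byte[] gunzip(byte[] compressed) throws Exception {
            try (GZIPInputStream gz =
                    new GZIPInputStream(new ByteArrayInputStream(compressed))) {
                return gz.readAllBytes();
            }
        }

        public static void main(String[] args) throws Exception {
            // One toy row group of 4 records; one byte per value for simplicity.
            byte[] c1 = {10, 20, 30, 40};
            byte[] c4 = {0, 0, 1, 0};
            byte[] c1Compressed = gzip(c1), c4Compressed = gzip(c4);

            // Predicate column: decompressed eagerly, to evaluate WHERE c4 = 1.
            byte[] c4Values = gunzip(c4Compressed);
            boolean anyMatch = false;
            for (byte v : c4Values) anyMatch |= (v == 1);

            // Projected column: decompressed lazily, only if some record matched.
            if (anyMatch) {
                byte[] c1Values = gunzip(c1Compressed);
                for (int i = 0; i < c4Values.length; i++) {
                    if (c4Values[i] == 1) System.out.println("c1 = " + c1Values[i]);
                }
            } else {
                System.out.println("row group skipped: c1 never decompressed");
            }
            // Columns c2 and c3 are never even read from storage.
        }
    }

Note how a larger row group makes the anyMatch test more likely to succeed, which is exactly the first observation above.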

IV. HADOOPDB: A STACKED ARCHITECTURE

In the previous section we discussed how the performance of the MapReduce framework can be improved by using (and extending) data placement techniques from parallel DBMS. An alternative and orthogonal approach examines whether one can use a database system directly, as is, by inserting such a database as a resource in each of the MapReduce nodes. The main intuition is that, if we push as many operations as possible into those databases, we can effectively take full advantage of their performance optimizations. Thus, the main research question is whether one can build a hybrid system that has the scalability and fault-tolerance behavior of MapReduce while maintaining the performance characteristics of a parallel database, so as to effectively achieve all the requirements presented in Section II.

HadoopDB [2], whose architecture is presented in Figure 2(a), follows exactly this idea and places a database system in each node, while using MapReduce as a coordination and network communication medium between those databases. By doing so, HadoopDB effectively transforms any single-node database system into a shared-nothing parallel database, while inheriting the job tracking, runtime scheduling and fault tolerance of MapReduce. HadoopDB integrates almost seamlessly with Hadoop, since the databases are just data sources to the framework, similar to the data blocks in HDFS. The system operates as follows. First, SQL queries are translated into MapReduce jobs using a modified version of Hive [18]. Then, each MapReduce job connects to the underlying databases, executes as much of the query as possible there, and returns a set of key-value pairs to the MapReduce framework for further processing. Given this description, HadoopDB operates in three phases: data loading, query generation and, finally, query execution. We describe each of these phases next.

A. Data Loading

HadoopDB must first load the input data into the individual databases before processing can commence inside those engines. This operation consists of the following two steps. First, the data loader component of HadoopDB performs a global and a local repartitioning of the raw HDFS data files, before loading them into the databases. During the global repartitioning, the HDFS data is re-partitioned across the cluster nodes on a given partition key. Then, each node separately performs a secondary local repartitioning, breaking its HDFS partition apart into several smaller chunks, based on a given secondary key.
The partitioning functions are chosen to ensure good load balancing and uniform chunk sizes. Second, each chunk is bulk-loaded at each node into its corresponding single-node database, at which point an appropriate index may be created inside the DBMS.

Though these repartitioning operations incur high network overheads, they only affect performance at data loading time. Thus, not only do they pose no overhead at query execution time, they can also significantly improve the performance of analytical workloads. For instance, consider a query including a join between two relations. In the MapReduce framework, joins are traditionally performed by repartitioning the records of the two relations by the join key in the map phase, so that reducers can join groups of records with the same key in the reduce phase. While this approach requires no schema knowledge, it incurs high network transfers due to the reshuffling of data between the map and reduce phases, causing high query execution overheads. In contrast, HadoopDB can choose its repartitioning keys to match the join attribute of both relations, effectively placing records with the same key on the same node at data loading time. By doing so, the join operation can be pushed completely into the local databases. Notice that in order to do this, HadoopDB assumes knowledge of the join keys, and thus requires a priori knowledge of the queries to be executed. A sketch of this co-partitioning idea follows below.

Finally, we note that HadoopDB must also maintain information about the individual databases. This information is stored in the Catalog component of HadoopDB and includes, among other things, connection parameters and credentials, schema information, and replication and partitioning properties.
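The following minimal Java sketch (our own illustration with hypothetical names, not HadoopDB code) shows the co-partitioning idea: both relations are partitioned with the same function on the join key at load time, so every joining pair of records ends up local to one node:

    import java.util.ArrayList;
    import java.util.List;

    // Co-partitioning two relations R and S by their join key at load time.
    // Records that share a join key land in the same partition, so a join
    // over them never crosses node boundaries.
    public class CoPartitionSketch {

        record Tuple(int joinKey, String payload) {}

        // The same deterministic partitioning function is applied to both relations.
        static int partitionOf(int joinKey, int numNodes) {
            return Math.floorMod(Integer.hashCode(joinKey), numNodes);
        }

        public static void main(String[] args) {
            int numNodes = 3;
            List<Tuple> r = List.of(new Tuple(1, "r1"), new Tuple(2, "r2"), new Tuple(7, "r7"));
            List<Tuple> s = List.of(new Tuple(1, "s1"), new Tuple(7, "s7"), new Tuple(9, "s9"));

            // One (R-chunk, S-chunk) pair per node, filled at "data loading time".
            List<List<Tuple>> rParts = new ArrayList<>(), sParts = new ArrayList<>();
            for (int i = 0; i < numNodes; i++) {
                rParts.add(new ArrayList<>());
                sParts.add(new ArrayList<>());
            }
            for (Tuple t : r) rParts.get(partitionOf(t.joinKey(), numNodes)).add(t);
            for (Tuple t : s) sParts.get(partitionOf(t.joinKey(), numNodes)).add(t);

            // At query time, each node joins its local chunks only: no shuffle phase.
            for (int node = 0; node < numNodes; node++) {
                for (Tuple rt : rParts.get(node)) {
                    for (Tuple st : sParts.get(node)) {
                        if (rt.joinKey() == st.joinKey()) {
                            System.out.printf("node %d: %s JOIN %s on key %d%n",
                                    node, rt.payload(), st.payload(), rt.joinKey());
                        }
                    }
                }
            }
        }
    }

In HadoopDB the partitions are loaded into the local databases, which then execute the join themselves; the sketch only illustrates why the shuffle becomes unnecessary.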

B. Query Generation

HadoopDB provides an SQL-to-MapReduce-to-SQL (SMS) planner in order to execute SQL queries on the system. This component is basically an extension of Hive [18]. Due to space constraints, we omit how Hive produces the original MapReduce plans and refer the reader to the original paper for further details. We note, however, that Hive by default assumes no co-location of data and produces its plans accordingly, whereas HadoopDB depends on such co-location to improve query performance. Next, we present the SMS modifications that achieve co-location of data and push most of the query logic into the local databases at query generation time.

SMS extends Hive in two main areas. First, it updates the metadata catalog of Hive with schema information and references to the database tables. Though Hive by default stores each table in a separate file in HDFS, it also allows tables to exist externally, outside HDFS. Second, it transforms Hive's physical execution plan by performing two passes over it. In the first pass, it retrieves the partitioning keys used by Hive's repartitioning operators, called ReduceSink operators. In the second pass, SMS goes through this list of operators until it determines the first operator for which the database repartitioning key differs from the operator's key. At this point, it creates an SQL query covering all the operators encountered so far, using a rule-based SQL generator that produces SQL out of Hive physical operators. This SQL query is executed by the local databases, as we describe in the next section. SMS then continues to find the next group of operators with matching keys, until all operators have been examined.

Fig. 2: (a) The overall architecture of HadoopDB, with its various components; (b) an example query, showing the modifications made by the SMS planner of HadoopDB to the physical execution plan produced by Hive.

For instance, consider the query shown in Figure 2(b), where the original Hive physical plan is shown on the left. There is only one ReduceSink operation for this query, and the two possible SMS plans produced are shown on the right. If the sales table is partitioned by YEAR(saleDate), then the entire processing logic is pushed into the databases, as shown in the top-right part of the figure. In this case, a single map task per node suffices to complete the given query. Otherwise, SMS must produce the plan shown in the lower-right part of the figure, in which partial aggregates are first produced in the map phase, while the reduce phase merges the partial aggregates from each node to produce the final result.

C. Query Execution

Finally, at query execution time, the MapReduce job is executed over the cluster nodes: each node connects to its local database engine to execute the queries generated by the SMS planner, as described above. The interface between the database engines and the MapReduce framework is provided by the Database Connector of HadoopDB. This component provides information to MapReduce about which JDBC driver to use, as well as other query tuning parameters (such as the query fetch size). The connector is basically an extension of the InputFormat library class, which is responsible for transforming data into key-value pairs and connecting to various resources (like the databases, in the case of HadoopDB). After the query executes, the connector returns all produced results as key-value pairs to MapReduce for further processing. A sketch of this flow appears below.
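As a rough illustration of the connector's role, the Java sketch below runs an SMS-style SQL query on a node-local database over plain JDBC and surfaces the rows as key-value pairs. It is our own standalone approximation: the real connector extends Hadoop's InputFormat, and the JDBC URL, credentials, table and query here are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.Statement;
    import java.util.AbstractMap.SimpleEntry;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Run the SMS-generated SQL on the node-local database and hand the
    // result set to MapReduce as key-value pairs.
    public class DatabaseConnectorSketch {

        static List<Map.Entry<String, String>> runLocalQuery(String jdbcUrl, String user,
                String password, String smsGeneratedSql, int fetchSize) throws Exception {
            List<Map.Entry<String, String>> keyValuePairs = new ArrayList<>();
            try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
                 Statement stmt = conn.createStatement()) {
                stmt.setFetchSize(fetchSize); // a query tuning parameter from the Catalog
                try (ResultSet rs = stmt.executeQuery(smsGeneratedSql)) {
                    ResultSetMetaData md = rs.getMetaData();
                    while (rs.next()) {
                        // Convention for this sketch: first column is the key,
                        // the remaining columns are concatenated into the value.
                        String key = rs.getString(1);
                        StringBuilder value = new StringBuilder();
                        for (int col = 2; col <= md.getColumnCount(); col++) {
                            if (col > 2) value.append('\t');
                            value.append(rs.getString(col));
                        }
                        keyValuePairs.add(new SimpleEntry<>(key, value.toString()));
                    }
                }
            }
            return keyValuePairs; // handed to the map function for further processing
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical local PostgreSQL instance holding one chunk of the data.
            List<Map.Entry<String, String>> rows = runLocalQuery(
                    "jdbc:postgresql://localhost/chunk0", "hadoopdb", "secret",
                    "SELECT EXTRACT(YEAR FROM saleDate), SUM(revenue) FROM sales GROUP BY 1",
                    1000);
            rows.forEach(kv -> System.out.println(kv.getKey() + " -> " + kv.getValue()));
        }
    }

The fetch size matters because the connector streams potentially large result sets out of the database into the MapReduce job.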
V. HADOOP++: EXTENDING MAPREDUCE WITH INDEXING AND EFFICIENT JOIN PROCESSING [8]

In the previous section we presented HadoopDB, which aims to improve the performance of the MapReduce framework by using the performance enhancements of a database directly, as is, by inserting such a database in each node of MapReduce. This work, while it optimizes MapReduce performance, raises the next research question: is it possible to match, or even exceed, the performance enhancements of HadoopDB without using a DBMS? The question becomes even more pronounced when we consider the following two facts. First, the performance benefits of HadoopDB are mostly due to indexing and proper co-partitioning of data; both are techniques that can be implemented outside a DBMS and used in any processing system. Second, HadoopDB effectively changes the interface to SQL and requires installing and configuring a database at each node, a task that can be tedious, as we discussed in Section II-F. Thus, it is imperative that we study, given schema and query knowledge, the feasibility of incorporating indexing and data co-partitioning techniques directly into the MapReduce framework, so that the original map-reduce interface is maintained. Such an analysis is carried out in the Hadoop++ system [8]. Our presentation of this system is structured as follows. First, we analyze how Hadoop++ manages to transparently integrate its changes into Hadoop. Then, we present the indexing and joining techniques proposed by Hadoop++, called Trojan Index and Trojan Join, respectively.

A. Hadoop++ Modifications to Hadoop

Hadoop++ incorporates indexing and data co-partitioning in Hadoop by changing the internal layout of a Hadoop split, which is a large horizontal partition of the data. However, Hadoop++ does not change the underlying framework implementation directly; instead, it overrides specific Hadoop functions with its own UDFs. This allows Hadoop++ to behave like the original Hadoop whenever the user requires it.

Fig. 3: Split formats used for (a) indexing, (b) data co-partitioning, and (c) indexing over co-partitioned data.

More specifically, Hadoop++ overrides the cmp, grp, sh, split and itemize functions of Hadoop. The first two, cmp and grp, are used when sorting data on a given key. Sorting is performed after the map and reduce functions have been executed, to perform partial and full aggregations on the key-value pairs. The function sh repartitions data between the map and reduce phases, while the split and itemize functions define how data is organized in, and read from, the HDFS data blocks. We provide more details about these UDFs in the following discussion.

B. Trojan Index

The basic idea behind the Trojan Index is similar to that employed by parallel database systems: create the index at data loading time, so that its use at query execution time improves performance by avoiding unnecessary I/O and processing. Trojan Indexes are optional, require no SQL engine for their creation or use, make no modifications to the underlying implementation of Hadoop, and allow partial and multiple indexes to be built on an input split.

To begin with, at data loading time, Hadoop++ uses a cache-conscious CSS-tree [16] to represent the index. In order to be I/O efficient, the system places this structure, along with all related metadata (header and footer), next to the corresponding data, as shown in Figure 3(a). Index creation operates as follows. First, a custom-built MapReduce job reads the (non-indexed) input data set stored on HDFS. For each record read, the corresponding mapper constructs and emits a new record with splitid || prj_a(k || v) as composite key and the old record (k || v) as value, where || stands for concatenation and prj_a for projection onto the index attribute a. By performing such a map function, and by properly overriding specific Hadoop functions as described next, data arrive at the reduce side sorted on the index attribute, per split. At this point, the reduce function creates a clustered index by simply emitting the set of values concatenated with the Trojan index, the index header and the split footer. The output data is then stored on the distributed file system. We note that indexing incurs an overhead of about 8MB per 1GB of initial data.

Hadoop++ needs to override three Hadoop functions to implement the above functionality properly. First, in order to guarantee that all reducers receive almost the same amount of work, the partitioning UDF sh is changed so that Hadoop repartitions using the splitid portion of the composite key modulo the number of nodes in the cluster. Notice that, since the split size is fixed and splitid is an ever-increasing counter, this hashing function ensures negligible work imbalance between different nodes. Second, Hadoop++ changes the cmp function of the framework so that it sorts records by considering only the index attribute of the composite key. Finally, since Hadoop++ builds one index per split, the split format must be preserved in each reducer call. Thus, Hadoop++ provides the grp UDF so that records with the same split identifier are grouped together, based on the splitid part of the composite key. A sketch of this composite-key construction follows below.
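The Java sketch below (ours, with hypothetical names and simplified types; see [8] for the real UDFs) mimics the composite-key construction of the index-building job and the three overridden functions as plain methods, to show how the pieces fit together:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Each mapper emits (splitid, a-value) as a composite key and the original
    // record as value; combined with the sh/cmp/grp overrides, every reducer
    // receives the records of one split sorted on the index attribute a.
    public class TrojanIndexKeySketch {

        record CompositeKey(int splitId, long indexAttr) {}
        record Emitted(CompositeKey key, String record) {}

        // sh: partition on the splitid part only, modulo the number of nodes.
        static int sh(CompositeKey k, int numNodes) {
            return Math.floorMod(k.splitId(), numNodes);
        }

        // cmp: order records by the index-attribute part of the composite key.
        static final Comparator<Emitted> CMP =
                Comparator.comparingLong(e -> e.key().indexAttr());

        // grp: two keys belong to the same reduce group iff their splitid matches.
        static boolean grp(CompositeKey x, CompositeKey y) {
            return x.splitId() == y.splitId();
        }

        public static void main(String[] args) {
            // Map phase over one split: project attribute a out of each record.
            int splitId = 7;
            String[] records = {"a=42,rest=x", "a=17,rest=y", "a=99,rest=z"};
            List<Emitted> emitted = new ArrayList<>();
            for (String rec : records) {
                long a = Long.parseLong(rec.substring(2, rec.indexOf(',')));
                emitted.add(new Emitted(new CompositeKey(splitId, a), rec));
            }
            // The framework sorts with cmp before each reduce call; the reducer
            // can then write the sorted split, its CSS-tree index and the footer.
            emitted.sort(CMP);
            emitted.forEach(e -> System.out.println(e.key() + " -> " + e.record()));
        }
    }

The CSS-tree construction itself is omitted; the point is that the sorted, per-split grouping it needs falls out of the composite key plus the three UDFs.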
Then, at query execution time, Hadoop++ proceeds as follows. First, the MapReduce query job extracts the logical splits at each node from the HDFS data files created at data loading time. To do so, the framework overrides the default split function used by Hadoop, so that it uses the split footers to identify the boundaries of the logical splits within the HDFS blocks of each file. Then, for each logical split obtained, Hadoop++ first reads the header of the split in order to obtain the key range of the index in that split. If the key requested by the query does not overlap with the key range of the split, the whole split is skipped. Otherwise, there is some overlap, and the CSS-tree is read into main memory. The CSS index is then used to read only the records that satisfy the search predicate of the executed query, and only those records are passed to the map function. This functionality is achieved by overriding the itemize function of Hadoop.

C. Trojan Join

Trojan Join allows for more efficient join processing by exploiting schema knowledge and properly co-partitioning data at loading time. With this design, similarly to HadoopDB, it becomes possible at query execution time to compute all join results locally, in the map phase only, thus reducing network overheads (since the shuffle and reduce phases of MapReduce are skipped entirely). Finally, like the Trojan Index, this join technique does not require any modification to the underlying implementation of Hadoop.

Hadoop++ implements co-partitioning by placing records with the same join key from the two relations in the same split, thus forming co-groups that are processed at query execution time on the same node. The co-partitioned data layout generated by Hadoop++ is shown in Figure 3(b) and is produced by executing the following MapReduce job. At the map phase, the job outputs the join attribute of each record as key and the record itself as value. By doing so, records from both relations that have the same key go to the same reducer, and it then suffices to put them together in the same co-group and split.

At query execution time, the following algorithm is used. First, the split footer is read to obtain the split boundaries and the boundaries of each co-group. The map function that processes the split has all the records from both relations available locally to perform the join. Thus, it suffices to read the records from the underlying local storage, buffer them in memory, perform the join, and output the join result. Since there is no need for a reduce function, the output is written to HDFS immediately. A sketch of this map-side processing follows below.
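A minimal Java sketch of this map-side co-group join, again with hypothetical names rather than actual Hadoop++ code:

    import java.util.ArrayList;
    import java.util.List;

    // How a map task might process one co-group of a co-partitioned Trojan
    // Join split: all R- and S-records sharing a join key are already local,
    // so the join needs no shuffle or reduce phase.
    public class TrojanJoinSketch {

        record Rec(String relation, int joinKey, String payload) {}

        // Join one co-group: buffer the R side, then stream the S side against it.
        static void joinCoGroup(List<Rec> coGroup) {
            List<Rec> rSide = new ArrayList<>();
            for (Rec rec : coGroup) if (rec.relation().equals("R")) rSide.add(rec);
            for (Rec rec : coGroup) {
                if (rec.relation().equals("S")) {
                    for (Rec r : rSide) {
                        // Within a co-group all records share the join key,
                        // so every (R, S) pair is a join result.
                        System.out.printf("join: %s x %s (key %d)%n",
                                r.payload(), rec.payload(), rec.joinKey());
                    }
                }
            }
            // The map task writes these results to HDFS directly.
        }

        public static void main(String[] args) {
            // One co-group of the split, as delimited by the footer boundaries.
            joinCoGroup(List.of(
                    new Rec("R", 7, "r1"), new Rec("R", 7, "r2"), new Rec("S", 7, "s1")));
        }
    }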

D. Trojan Index over Co-Partitioned Data

Finally, we discuss how the two aforementioned techniques can be combined in Hadoop++, which allows a Trojan Index to be deployed over co-partitioned data. The index can be built over one or both of the two co-partitioned relations. An example of this format is shown in Figure 3(c), where only one of the relations is indexed. Furthermore, the indexing key does not necessarily have to be the same as the join key; if it is not, however, additional sorting may have to be performed on the indexed relation in order to build the clustered index. In any case, the only change in the design of Hadoop++ is a customization of the itemize UDF, so that it corresponds to the hybrid index and join structures and properly scans or skips the corresponding information inside the split. Such adaptations are straightforward.

VI. CONCLUSIONS & RESEARCH DIRECTIONS

In this report we presented three papers that aim to improve the performance of the MapReduce framework by using techniques typically found in PDBMS, or a database itself directly, as is. Such techniques are necessary today because, though MapReduce provides great scalability, its performance often does not match that of a parallel database system [14].

The first paper, RCFile [10], concludes that none of the traditional data placement techniques of parallel databases can satisfy all the performance requirements of modern analytical workloads, and thus proposes a new model especially for this purpose. The model, which is essentially an extension of PAX, employs compression and large operational units called row groups, and manages to provide fast data loading, fast query processing and efficient storage space utilization, while requiring no prior workload knowledge.

HadoopDB [2] examines whether one can take advantage of the performance enhancements provided by a database directly, by inserting such a database as a resource in each node of MapReduce. The proposed system mainly exploits indexing and data co-partitioning in the database to achieve this goal, thus significantly enhancing performance. However, it changes the interface to SQL and requires installing and maintaining a database in each node, a task that can be quite tedious.

For this reason, Hadoop++ [8] attempts to improve the performance of MapReduce not through a database, but by directly extending the framework so that it uses indexes and a special join technique. This allows Hadoop++ to keep the original map-reduce interface. The authors justify their approach with the observation that indexing and data co-partitioning can be implemented outside a DBMS and used in any processing system. An important aspect of Hadoop++ is that it does not modify Hadoop directly; instead, it overrides specific functions of the framework, which allows it to behave like the original Hadoop when the user requires it.

There are two possible research directions to follow. First, there are still performance improvements from PDBMS that can be applied to the MapReduce framework; as two examples, we mention result caching and directly operating on compressed data. Furthermore, given our emphasis on data analytics workloads, it would be interesting to examine how to adjust a result set incrementally when new data arrives, instead of recalculating the whole query.
Second, and most importantly, observe that even though the performance of MapReduce is significantly improved by the systems presented in this report, none of the architectures satisfies all of our requirements yet. For instance, HadoopDB and Hadoop++ manage to improve query performance, but they sacrifice the fast-data-loading requirement to do so. The author believes that such a system is possible, especially if a from-scratch approach is followed. Hybrid approaches, like those presented here, achieve high performance and scalability, but they also incorporate the inherent limitations of the individual approaches. Most importantly, there are interface overheads (e.g., converting relational records to key-value pairs in HadoopDB) that only grow as more data is pushed into the system. A newly designed system would not suffer from such problems, while coming closer to satisfying the requirements of modern analytics workloads.

REFERENCES

[1] Apache Hadoop. http://hadoop.apache.org/.
[2] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc. VLDB Endow., 2(1), Aug. 2009.
[3] A. Ailamaki, D. J. DeWitt, M. D. Hill, and M. Skounakis. Weaving relations for cache performance. In VLDB '01, 2001.
[4] Amazon Inc. Amazon Elastic Compute Cloud (Amazon EC2). http://aws.amazon.com/ec2/.
[5] G. P. Copeland and S. N. Khoshafian. A decomposition storage model. In SIGMOD '85, New York, NY, USA, 1985. ACM.
[6] G. Czajkowski. Sorting 1PB with MapReduce. Nov. 2008.
[7] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1), Jan. 2008.
[8] J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). Proc. VLDB Endow., 3(1-2), Sept. 2010.
[9] J. Hamilton. Cooperative expendable micro-slice servers (CEMS): low cost, low power servers for internet-scale services. In CIDR '09, 2009.
[10] Y. He, R. Lee, Y. Huai, Z. Shao, N. Jain, X. Zhang, and Z. Xu. RCFile: a fast and space-efficient data placement structure in MapReduce-based warehouse systems. In ICDE '11, 2011.
[11] S. Madden, D. DeWitt, and M. Stonebraker. Database parallelism choices greatly impact scalability. The Database Column, Oct. 2007.
[12] C. Monash. The 1-petabyte barrier is crumbling. Aug. 2008.
[13] C. Monash. Cloudera presents the MapReduce bull case. Apr. 2009.
[14] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. In SIGMOD '09, New York, NY, USA, 2009. ACM.
[15] R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill, 3rd edition, 2002.
[16] J. Rao and K. A. Ross. Cache conscious indexing for decision-support in main memory. In VLDB '99, San Francisco, CA, USA, 1999.
[17] The Economist. The data deluge. Feb. 2010.
[18] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy. Hive - a petabyte scale data warehouse using Hadoop. In ICDE '10, Mar. 2010.

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models RCFile: A Fast and Space-efficient Data

More information

HadoopDB: An open source hybrid of MapReduce

HadoopDB: An open source hybrid of MapReduce HadoopDB: An open source hybrid of MapReduce and DBMS technologies Azza Abouzeid, Kamil Bajda-Pawlikowski Daniel J. Abadi, Avi Silberschatz Yale University http://hadoopdb.sourceforge.net October 2, 2009

More information

MINING OF LARGE SCALE DATA USING BESTPEER++ STRATEGY

MINING OF LARGE SCALE DATA USING BESTPEER++ STRATEGY MINING OF LARGE SCALE DATA USING BESTPEER++ STRATEGY *S. ANUSUYA,*R.B. ARUNA,*V. DEEPASRI,**DR.T. AMITHA *UG Students, **Professor Department Of Computer Science and Engineering Dhanalakshmi College of

More information

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Alexander Rasin and Avi Silberschatz Presented by

More information

Large Scale OLAP. Yifu Huang. 2014/11/4 MAST Scientific English Writing Report

Large Scale OLAP. Yifu Huang. 2014/11/4 MAST Scientific English Writing Report Large Scale OLAP Yifu Huang 2014/11/4 MAST612117 Scientific English Writing Report 2014 1 Preliminaries OLAP On-Line Analytical Processing Traditional solutions: data warehouses built by parallel databases

More information

Jumbo: Beyond MapReduce for Workload Balancing

Jumbo: Beyond MapReduce for Workload Balancing Jumbo: Beyond Reduce for Workload Balancing Sven Groot Supervised by Masaru Kitsuregawa Institute of Industrial Science, The University of Tokyo 4-6-1 Komaba Meguro-ku, Tokyo 153-8505, Japan sgroot@tkl.iis.u-tokyo.ac.jp

More information

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin Presented by: Suhua Wei Yong Yu Papers: MapReduce: Simplified Data Processing on Large Clusters 1 --Jeffrey Dean

More information

An Introduction to Big Data Formats

An Introduction to Big Data Formats Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION

More information

April Copyright 2013 Cloudera Inc. All rights reserved.

April Copyright 2013 Cloudera Inc. All rights reserved. Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and the Virtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here April 2014 Analytic Workloads on

More information

Query processing on raw files. Vítor Uwe Reus

Query processing on raw files. Vítor Uwe Reus Query processing on raw files Vítor Uwe Reus Outline 1. Introduction 2. Adaptive Indexing 3. Hybrid MapReduce 4. NoDB 5. Summary Outline 1. Introduction 2. Adaptive Indexing 3. Hybrid MapReduce 4. NoDB

More information

Part 1: Indexes for Big Data

Part 1: Indexes for Big Data JethroData Making Interactive BI for Big Data a Reality Technical White Paper This white paper explains how JethroData can help you achieve a truly interactive interactive response time for BI on big data,

More information

Modeling and evaluation on Ad hoc query processing with Adaptive Index in Map Reduce Environment

Modeling and evaluation on Ad hoc query processing with Adaptive Index in Map Reduce Environment DEIM Forum 213 F2-1 Adaptive indexing 153 855 4-6-1 E-mail: {okudera,yokoyama,miyuki,kitsure}@tkl.iis.u-tokyo.ac.jp MapReduce MapReduce MapReduce Modeling and evaluation on Ad hoc query processing with

More information

A Fast and High Throughput SQL Query System for Big Data

A Fast and High Throughput SQL Query System for Big Data A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190

More information

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09 Presented by: Daniel Isaacs It all starts with cluster computing. MapReduce Why

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

SQL-to-MapReduce Translation for Efficient OLAP Query Processing

SQL-to-MapReduce Translation for Efficient OLAP Query Processing , pp.61-70 http://dx.doi.org/10.14257/ijdta.2017.10.6.05 SQL-to-MapReduce Translation for Efficient OLAP Query Processing with MapReduce Hyeon Gyu Kim Department of Computer Engineering, Sahmyook University,

More information

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Huayu Wu Institute for Infocomm Research, A*STAR, Singapore huwu@i2r.a-star.edu.sg Abstract. Processing XML queries over

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

EXTRACT DATA IN LARGE DATABASE WITH HADOOP

EXTRACT DATA IN LARGE DATABASE WITH HADOOP International Journal of Advances in Engineering & Scientific Research (IJAESR) ISSN: 2349 3607 (Online), ISSN: 2349 4824 (Print) Download Full paper from : http://www.arseam.com/content/volume-1-issue-7-nov-2014-0

More information

Accelerating BI on Hadoop: Full-Scan, Cubes or Indexes?

V. CONCLUSIONS

V.1 Related work

Even though MapReduce appears to be constructed specifically for performing group-by aggregations, much interesting research work is also being done on studying critical
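To make the group-by flavor of the programming model concrete, the following minimal sketch (our own illustration, not code from any of the surveyed systems; all function names are hypothetical) expresses a per-key SUM aggregation as a map function, a framework-provided shuffle, and a reduce function:

```python
from collections import defaultdict

def map_fn(record):
    # Emit one (key, value) pair per input record; here, a sale per product.
    product, amount = record
    yield product, amount

def shuffle(pairs):
    # The framework groups intermediate values by key between the two phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_fn(key, values):
    # Each group is aggregated independently, so reducers can run in parallel.
    yield key, sum(values)

records = [("shoes", 10), ("hats", 5), ("shoes", 7)]
mapped = (pair for rec in records for pair in map_fn(rec))
for key, values in shuffle(mapped):
    for result in reduce_fn(key, values):
        print(result)  # -> ('shoes', 17) and ('hats', 5)
```

Because the reduce function sees each key's values in isolation, this pattern parallelizes across nodes exactly as a GROUP BY with an aggregate would in a shared-nothing parallel DBMS.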
