Evaluating Data Storage Structures of Map Reduce


The 8th International Conference on Computer Science & Education (ICCSE 2013)
April 26-28, 2013. Colombo, Sri Lanka
MoB3.2

Evaluating Data Storage Structures of Map Reduce

Haiming Lai, Ming Xu, Jian Xu, Yizhi Ren, Ning Zheng
College of Computer, Hangzhou Dianzi University, Hangzhou, China
{mxu,

Abstract-The MapReduce framework and its open-source implementation Hadoop, a scalable and fault-tolerant infrastructure for big data analysis on large clusters, can achieve different performance with different data storage structures. This paper evaluates the performance of three kinds of data storage structures of MapReduce, namely row-store, column-store, and RCFile. The experiments are designed to test the three data storage structures in terms of data loading time, data storage space, and query execution time. The experimental results show that the RCFile data storage structure achieves better performance in most cases.

Keywords-MapReduce; data storage structure; row-store; column-store; RCFile

I. INTRODUCTION

We have entered an era of data explosion, where many of the data sets being processed and analyzed are called "big data". Big data not only requires a huge amount of storage, but also demands new data management on large distributed systems, because conventional database systems have difficulty managing big data.

The popular MapReduce framework [2] and its open-source implementation Hadoop [3] provide a scalable and fault-tolerant infrastructure for big data analysis on large clusters. However, the performance of MapReduce is still far from ideal in the database context. According to a recent study [4], Hadoop is slower than two state-of-the-art parallel database systems by a factor of 3.1 to 6.5 in performing a variety of analytical tasks. To achieve better performance, more compute nodes can be allocated to speed up computation; however, this approach is not really cost effective.

This paper considers three data storage structures of MapReduce, namely row-store, column-store, and RCFile.
MapReduce can achieve different performance with different data storage structures. Row-store is the most common data storage structure and stores data by rows. The data storage structures commonly used in Hadoop are TextFile and SequenceFile, both of which are row-stores. Column-store is a data storage structure that stores data by columns; it is integrated in the Pig data analysis system. RCFile is a data storage structure that combines many advantages of row-store and column-store. It has been adopted by Hive [5] and Pig [6]. This paper briefly summarizes the major features of these structures for big data and provides a detailed evaluation of the three data storage structures.

The rest of this paper is organized as follows. Section II reviews related work. Section III presents the MapReduce programming model, the execution flow of a MapReduce job, and several compression formats in Hadoop. Section IV presents a detailed analysis of the three existing data storage structures. Section V describes the evaluation method. Section VI presents and discusses our benchmark results. Section VII concludes the paper.

II. RELATED WORKS

MapReduce has become a popular tool for processing large-scale data analytical tasks. However, the performance of MapReduce is still far from ideal in the database context. In [4], [7-8], the authors compared the performance of MapReduce with two parallel database systems. They noted that although loading data into the DBMSs and tuning them took much longer than for a MapReduce system, the performance of the parallel DBMSs was significantly better.

So far, three main approaches have been proposed to improve the performance of MapReduce, based on configuration parameters [9], scheduling algorithms [10], and data storage structures [11-13]. First, in [9], an extensive experiment was performed to study how the job configuration parameters affect the observed performance of Hadoop.
Second, in [10], the authors investigated the scheduling algorithm of Hadoop and proposed the LATE scheduling algorithm, which improves Hadoop response times by a factor of two. Third, in [11], the authors extended C-Store (a column-oriented DBMS) with a compression sub-system and showed how compression schemes not traditionally used in row-oriented DBMSs can be applied to column-oriented systems. In [12], the authors describe how the column-oriented storage techniques found in many parallel DBMSs can be used to improve Hadoop's performance. In [13], the authors present a data storage structure called RCFile, together with its implementation in Hadoop, to improve the performance of MapReduce.

In addition, in [8], the authors presented some techniques to improve the performance of MapReduce, including using a binary record format, indexing, and merging the results. In [14], the authors conducted an in-depth study of MapReduce and identified several design factors that affect the performance of Hadoop. In [15], the authors described the Manimal system for optimizing MapReduce programs.

This paper analyzes and compares three kinds of data storage structures of MapReduce, namely row-store, column-store, and RCFile, and evaluates their performance with different compression formats in terms of data loading time, data storage space, and query execution time. The aim is to investigate which data storage structure and which compression format are more suitable for MapReduce.

III. BRIEF INTRODUCTION ABOUT MAPREDUCE

This section briefly introduces MapReduce and the common compression formats used in Hadoop.

A. Map and Reduce Operation

According to [2], MapReduce is a programming model for processing large-scale datasets in computer clusters. The MapReduce programming model consists of two functions, map() and reduce(). The map() function takes an input key/value pair and produces a list of intermediate key/value pairs. The MapReduce runtime system groups together all intermediate pairs based on the intermediate keys and passes them to the reduce() function, which produces the final results.

A MapReduce cluster employs a master-slave architecture in which one master node manages a number of slave nodes. In Hadoop, the master node is called the JobTracker and a slave node is called a TaskTracker. Hadoop launches a MapReduce job by first splitting the input dataset into even-sized data blocks. Each data block is then scheduled to one TaskTracker node and is processed by a map task. The task assignment process is implemented as a heartbeat protocol: a TaskTracker node notifies the JobTracker when it is idle, and the scheduler then assigns new tasks to it. The scheduler takes data locality into account when it disseminates data blocks. It always tries to assign a local data block to a TaskTracker. If the attempt fails, the scheduler assigns a rack-local or random data block to the TaskTracker instead. When the map() functions complete, the runtime system groups all intermediate pairs and launches a set of reduce tasks to produce the final results.

B. Compression Format

Hadoop uses compression to reduce the space needed to store files and to speed up data transfer across the network and to or from disk. There are many different compression formats, tools and algorithms, each with different characteristics. Table I lists some common compression formats used in Hadoop.

TABLE I. SOME COMMON COMPRESSION FORMATS USED IN HADOOP

Compression format   Tool    Algorithm   Filename extension   Multiple files   Splittable
DEFLATE              N/A     DEFLATE     .deflate             No               No
gzip                 gzip    DEFLATE     .gz                  No               No
bzip2                bzip2   bzip2       .bz2                 No               Yes
LZO                  lzop    LZO         .lzo                 No               No

All compression algorithms exhibit a space/time trade-off: faster compression and decompression speeds usually come at the expense of smaller space savings. The different tools have very different compression characteristics. DEFLATE is a compression algorithm whose standard implementation is zlib. (Note that the gzip file format is DEFLATE with extra headers and a footer.) The ".deflate" filename extension is a Hadoop convention. Gzip is a general-purpose compressor and sits in the middle of the space/time trade-off. Bzip2 compresses more effectively than gzip, but is slower. Bzip2's decompression speed is faster than its compression speed, but it is still slower than the other formats.

IV. DATA STORAGE STRUCTURES OF MAPREDUCE

This section compares three data storage structures, namely row-store, column-store, and RCFile.

A. Row-store

Row-store is the most common data storage structure of MapReduce and stores data by rows. Records are placed contiguously in a disk page. All fields of one record are packed one by one in the order of their occurrence. Figure 1 gives an example of how a table is placed by the row-store in an HDFS (Hadoop Distributed File System) block. The major advantage of row-store is that it has fast data loading and adapts well to dynamic workloads.
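
The row-store placement described above can be illustrated with a small sketch. It is a toy model under stated assumptions, not Hadoop's TextFile or SequenceFile implementation: the sample table, the "|" field separator, and the helper names write_row_store and read_column are all hypothetical. It only shows that the fields of each record sit next to each other in the serialized block.

```python
# Toy row-store: one record per line, fields stored in column order.
# The separator and the helper names are illustrative assumptions.
FIELD_SEP = "|"


def write_row_store(rows):
    """Serialize rows so that all fields of a record are contiguous."""
    lines = (FIELD_SEP.join(str(value) for value in row) for row in rows)
    return "\n".join(lines).encode("utf-8")


def read_column(block, index):
    """Return one column; every full record is still decoded on the way."""
    values = []
    for line in block.decode("utf-8").split("\n"):
        fields = line.split(FIELD_SEP)
        values.append(fields[index])
    return values


table = [(101, 111, 121), (102, 112, 122), (103, 113, 123), (104, 114, 124)]
block = write_row_store(table)
second_column = read_column(block, 1)
print(block)
print(second_column)
```

Because a record never spans more than one position in the block, writing is a simple append, which is consistent with the fast loading attributed to row-store.
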
Row-store has these advantages because it guarantees that all fields of a record are located on the same cluster node, since they are in the same HDFS block. However, row-store has two major weaknesses. First, row-store cannot provide fast query processing, because columns are read unnecessarily when a query needs only a subset of the columns in a table. Second, it is not easy for row-store to achieve a high data compression ratio, because columns with different data domains are mixed [16].

[Fig. 1. An Example about Row-store in an HDFS Block.]

B. Column-store

Column-store is a column-oriented storage model, which stores data tables as sections of columns of data rather than as rows of data. Figure 2 shows an example of how a table is stored by column-store on HDFS. In this example, column C1, column C2 and column C3 are stored in three independent columns. The major advantage of column-store is that it can avoid reading unnecessary columns during query execution and can easily achieve a high compression ratio by compressing each column within the same data domain. However, the major weakness of column-store is that it cannot provide fast query processing because of the high overhead of tuple reconstruction. Column-store cannot guarantee that all fields of a record are located on the same cluster node. For instance, in the example in Figure 2, the three fields of a record are stored in three HDFS blocks that can be located on different nodes. Therefore, a record reconstruction will cause a large amount of data transfer over the network among multiple cluster nodes. As introduced in the original MapReduce paper [2], excessive network transfer in a MapReduce cluster can be a source of bottlenecks and should be avoided if possible.

[Fig. 2. An Example about Column-store in an HDFS Block.]

C. RCFile

RCFile [13] (Record Columnar File) is a data storage structure that determines how to store data tables on computer clusters. It is designed for data warehouse systems using the MapReduce framework. The RCFile structure is a systematic combination of multiple components, including the data storage format, the data compression approach, and optimization techniques for data reading. Figure 3 shows an example of how a table is stored by RCFile on HDFS.

RCFile applies the concept of "first horizontally-partition, then vertically-partition" from PAX [17]. It combines the advantages of both row-store and column-store. First, like row-store, RCFile guarantees that data in the same row are located on the same node, so it has a low cost of tuple reconstruction. Second, like column-store, RCFile can exploit column-wise data compression and skip unnecessary column reads. Moreover, RCFile provides a lazy decompression technique to avoid unnecessary column decompression during query execution. RCFile allows a flexible row group size. A default size is given that considers both data compression performance and query execution performance.

[Fig. 3. An Example about RCFile in an HDFS Block.]

V. EVALUATING METHOD

This section presents the method used to evaluate the data storage structures. The performance of MapReduce is determined by its data storage structure and its data compression format.
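
To make the structures under evaluation concrete, the RCFile idea described earlier ("first horizontally-partition, then vertically-partition", with lazy decompression) can be sketched in a few lines. This is a minimal model under stated assumptions, not the RCFile implementation in Hadoop: the row group size, the comma-joined column encoding, the use of zlib, and the names build_row_groups, decompress_column and select_where are all hypothetical.

```python
import zlib


def build_row_groups(rows, rows_per_group):
    """Split rows horizontally into groups, then store each group by column."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        chunk = rows[start:start + rows_per_group]
        columns = list(zip(*chunk))  # vertical partition inside the group
        groups.append([
            zlib.compress(",".join(str(v) for v in col).encode("utf-8"))
            for col in columns
        ])
    return groups


def decompress_column(group, index):
    """Decompress a single column of a single row group."""
    text = zlib.decompress(group[index]).decode("utf-8")
    return [int(v) for v in text.split(",")]


def select_where(groups, filter_col, predicate, output_col):
    """Decompress output_col only in groups where some row passes the filter."""
    results, output_reads = [], 0
    for group in groups:
        filter_values = decompress_column(group, filter_col)
        hits = [i for i, v in enumerate(filter_values) if predicate(v)]
        if not hits:
            continue  # lazy decompression: the output column stays compressed
        output_values = decompress_column(group, output_col)
        output_reads += 1
        results.extend(output_values[i] for i in hits)
    return results, output_reads


rows = [(r, 10 + r, 20 + r) for r in range(1, 9)]  # 8 rows, 3 columns
groups = build_row_groups(rows, rows_per_group=4)
values, reads = select_where(groups, 0, lambda v: v >= 7, 2)
print(values, reads)
```

In this toy run only the second row group contains matching rows, so the output column of the first group is never decompressed, and the middle column is never touched in either group. All fields of a row stay in the same group, which mirrors the low tuple-reconstruction cost claimed for RCFile.
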
To evaluate the effectiveness of different data storage structures comprehensively, the evaluation should be run on datasets stored with different data storage structures and compression formats. The effectiveness of different data storage structures with different data compression formats is evaluated on three aspects: data storage space, data loading time, and query execution time.

1) Data Storage Space

Datasets keep growing, and how to store such big datasets is a critical problem. Different data storage structures require different amounts of storage space, so the data storage space is an important factor in evaluating performance.

2) Data Loading Time

Data loading time is an important factor in evaluating a data storage structure. In an era of data explosion, big data often needs to be loaded, so the data loading time becomes more and more important. A data storage structure with a shorter data loading time will become more popular.

3) Query Execution Time

Query execution time is an important factor for a data storage structure. To evaluate the performance of the lazy decompression of data storage structures, it is better to execute several queries with different selectivities according to their WHERE conditions.

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and to query the data using a SQL-like language called HiveQL. At present, Hive supports three data storage structures: TextFile, SequenceFile and RCFile. We compared these three data storage structures with different compression formats. First, the dataset was loaded into Hive using each data storage structure with each compression format.
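
The paper does not list the statements used for this loading step. As a hedged illustration only, the sketch below generates one plausible set of HiveQL statements for every combination of storage structure and compression format. The STORED AS clauses, the SET properties and the codec class names are standard Hive and Hadoop identifiers; the table names (lineitem_raw and the generated targets) and the helper load_statements are hypothetical, and the use of CREATE TABLE ... AS SELECT is an assumption about how such a load could be done.

```python
# STORED AS clauses, SET properties and codec class names are standard
# Hive/Hadoop identifiers; the table names are hypothetical.
STORAGE_FORMATS = ["TEXTFILE", "SEQUENCEFILE", "RCFILE"]
CODECS = {
    "uncompressed": None,
    "gzip": "org.apache.hadoop.io.compress.GzipCodec",
    "bzip2": "org.apache.hadoop.io.compress.BZip2Codec",
    "deflate": "org.apache.hadoop.io.compress.DefaultCodec",
}


def load_statements(storage, codec_name, source="lineitem_raw"):
    """Build HiveQL that copies `source` into one storage/codec combination."""
    codec = CODECS[codec_name]
    statements = []
    if codec is None:
        statements.append("SET hive.exec.compress.output=false;")
    else:
        statements.append("SET hive.exec.compress.output=true;")
        # Hadoop 1.x-era property name for the output codec.
        statements.append(f"SET mapred.output.compression.codec={codec};")
    target = f"lineitem_{storage.lower()}_{codec_name}"
    statements.append(
        f"CREATE TABLE {target} STORED AS {storage} AS SELECT * FROM {source};"
    )
    return statements


plan = {(s, c): load_statements(s, c) for s in STORAGE_FORMATS for c in CODECS}

for statement in plan[("RCFILE", "gzip")]:
    print(statement)
```

Timing each generated script separately would give one loading-time measurement per combination, which is the kind of grid the evaluation method calls for.
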
The loading time of each data storage structure was recorded during loading. Then, the data storage space of each data storage structure was read from the HDFS management page of Hadoop. Finally, several queries with different selectivities were executed in Hive. The query execution times were recorded to evaluate the query execution effectiveness of different data storage structures with different data compression formats.

VI. EXPERIMENT AND RESULTS

This section describes an experiment that evaluates different data storage structures with different compression formats.

A. Experimental Setup

The experiments are run on a cluster of 5 nodes connected by a 1 Gbit Ethernet switch. One node is reserved to run the Hadoop JobTracker and the NameNode. The remaining 4 nodes are used for HDFS and MapReduce. The NameNode has two 2.20GHz Intel Core processors, 4GB of main memory, and one locally attached 500GB IDE disk. Each of the other nodes has one 2.66GHz Intel Pentium processor, 1GB of main memory, and one locally attached 40GB IDE disk. Hadoop version and Hive version have been used in the experiment.

B. Dataset

The industry-standard TPC-H benchmark is used as our dataset. We selected the LINEITEM table from the benchmark and generated an 8.7GB dataset. The generated data is in plain text. Each task was executed three times and the average of the trials is reported. Each system executes the benchmark tasks separately to ensure exclusive access to the cluster's resources. This paper only reports results from trials where all nodes were available and the system's software operated correctly during the benchmark execution.

C. Evaluation

In the experiment, we examined the loading time and storage space for the LINEITEM table, and the query execution times. The aim is to demonstrate the effectiveness of different data storage structures in the following three aspects.

1) Data Storage Space

The generated data are loaded into Hive using different data storage structures. During loading, data is compressed with a different format for each data storage structure. Figure 4 shows the data storage space required by the different data storage structures. Data compression can significantly reduce the data storage space, and different data storage structures show different compression efficiencies. Several major observations can be made. Row-store has the worst compression efficiency. This is expected because column-wise data compression is better than row-wise data compression with mixed data domains. Bzip2 has a higher data compression ratio than the other compression formats. The data storage space of compressed data is dramatically less than that of uncompressed data.

[Fig. 4. Data Storage Space. Horizontal axis: compression format (Uncompressed, Gzip, Bzip2, DEFLATE); series: Textfile, RCFile.]

2) Data Loading Time

Data loading time is an important factor in evaluating a data storage structure. The data loading times were recorded for the LINEITEM table in the experiment. The results are shown in Figure 5. Several major observations can be made. Among all cases, row-store always has the smallest data loading time. This is because it has the minimum overhead to reorganize records in the raw text file. RCFile is slightly slower than row-store, with comparable performance in practice. This reflects the small overhead of RCFile, since it only needs to reorganize records inside each row group. Among all cases, uncompressed data has the smallest data loading time while bzip2 has the largest.

[Fig. 5. Data Loading Time. Horizontal axis: compression format (Uncompressed, Gzip, Bzip2, DEFLATE).]

3) Query Execution Time

In this section, two queries are executed on the largest table, LINEITEM. LINEITEM has 16 columns, and the two queries use only a small number of them.

Query 1:
    SELECT sum(l_extendedprice * l_discount) as revenue
    FROM lineitem
    WHERE l_shipdate >= ' ' and l_discount >= 0.02 and l_quantity > 4;

Query 2:
    SELECT l_returnflag, l_linestatus,
           sum(l_quantity) as sum_qty,
           sum(l_extendedprice) as sum_base_price,
           sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
           sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
           avg(l_quantity) as avg_qty,
           avg(l_extendedprice) as avg_price,
           avg(l_discount) as avg_disc,
           count(*) as count_order
    FROM lineitem
    WHERE l_shipdate <= ' '
    GROUP BY l_returnflag, l_linestatus
    ORDER BY l_returnflag, l_linestatus;

To evaluate the performance of lazy decompression, the two queries were designed to have different selectivities according to their WHERE conditions. The selectivity is less than 1% for Query 1 and about 72% for Query 2. Figures 6 and 7 show the execution times of the two queries with different data storage structures. Several major observations can be made. For Query 1, RCFile outperforms the other data storage structures. This is because the lazy decompression technique in RCFile can accelerate query execution when the query selectivity is low. For Query 2, as the selectivity becomes higher, lazy decompression becomes useless. In this case, RCFile still outperforms the other data storage structures, but its advantage becomes smaller. Among all cases, gzip has the fastest query speed while bzip2 has the slowest.

[Fig. 6. Query Execution Time of Query 1. Horizontal axis: compression format (Uncompressed, Gzip, Bzip2, DEFLATE); series: Textfile, RCFile.]

[Fig. 7. Query Execution Time of Query 2. Horizontal axis: compression format (Uncompressed, Gzip, Bzip2, DEFLATE); series: Textfile, RCFile.]

D. Summary

The data storage structures are compared in three aspects: data loading time, storage space, and query execution time. The results show that each structure has its own merits. Among all cases, RCFile, which adopts advantages of the other structures, outperforms the other data storage structures. Although bzip2 has a higher data compression ratio, it takes too much time to load data and execute queries. Gzip is a general-purpose compressor and sits in the middle of the space/time trade-off.

VII. CONCLUSION

In this paper, we describe and compare three data storage structures, namely row-store, column-store, and RCFile, in the context of large data analysis using MapReduce. We also present some common compression formats in Hadoop. Finally, a benchmark was conducted to evaluate different data storage structures with different compression formats. The results show that each data storage structure has its own merits, and that the RCFile storage structure with the gzip compression format outperforms the other structures in most cases.
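
The space/time trade-off between the compression formats summarized above can be observed in miniature on a single machine. The sketch below is only an illustration under stated assumptions: it uses Python's standard zlib, gzip and bz2 modules rather than the Hadoop codecs, the input is synthetic delimited text rather than TPC-H data, and the helper name measure is hypothetical. It does not reproduce the cluster measurements reported in this paper, and the timings it prints will vary from run to run.

```python
import bz2
import gzip
import time
import zlib


def measure(name, compress, decompress, data):
    """Time one compress/decompress round trip and report the size ratio."""
    start = time.perf_counter()
    packed = compress(data)
    compress_seconds = time.perf_counter() - start

    start = time.perf_counter()
    restored = decompress(packed)
    decompress_seconds = time.perf_counter() - start

    if restored != data:
        raise ValueError(f"{name} round trip did not restore the input")
    return {
        "format": name,
        "ratio": len(packed) / len(data),
        "compress_s": compress_seconds,
        "decompress_s": decompress_seconds,
    }


# Synthetic pipe-delimited rows; the values are made up for illustration.
sample = "".join(
    f"{i}|{i % 50}|{(i % 11) / 100:.2f}|1996-03-13\n" for i in range(20000)
).encode("ascii")

report = [
    measure("DEFLATE", zlib.compress, zlib.decompress, sample),
    measure("gzip", gzip.compress, gzip.decompress, sample),
    measure("bzip2", bz2.compress, bz2.decompress, sample),
]

for row in report:
    print(
        f"{row['format']:8s} ratio={row['ratio']:.3f} "
        f"compress={row['compress_s']:.4f}s decompress={row['decompress_s']:.4f}s"
    )
```

Comparing the printed ratio and timing columns for a given input gives a quick, local feel for which format favors space and which favors speed before running a full cluster benchmark.
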
ACKNOWLEDGMENT

This paper is supported by NSFC ( and ), the Natural Science Foundation of Zhejiang Province, China (No. Y and LY2F02006), the State Key Program of Major Science and Technology (Priority Topics) of Zhejiang Province, China under (No. 20OC050), and the science and technology search planned projects of Zhejiang Province (No. 2012C21040). Corresponding author:

REFERENCES

[1] D. A. Patterson. Technical Perspective: The Data Center is the Computer. Commun. ACM, 2008, 51(1).
[2] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI '04, 2004.
[3] Hadoop.
[4] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A Comparison of Approaches to Large-Scale Data Analysis. In SIGMOD, ACM, 2009.
[5] Hive.
[6] Pig.
[7] M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. MapReduce and Parallel DBMSs: Friends or Foes? Commun. ACM, 2010, 53(1).
[8] J. Dean and S. Ghemawat. MapReduce: A Flexible Data Processing Tool. Commun. ACM, 2010, 53(1).
[9] S. Babu. Towards Automatic Optimization of MapReduce Programs. In SoCC, ACM, 2010.
[10] M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica. Improving MapReduce Performance in Heterogeneous Environments. In OSDI, 2008.
[11] D. J. Abadi, S. Madden, and M. Ferreira. Integrating Compression and Execution in Column-Oriented Database Systems. In SIGMOD, 2006.
[12] A. Floratou et al. Column-Oriented Storage Techniques for MapReduce. PVLDB, 2011, 4(7).
[13] Y. He, R. Lee, Y. Huai, Z. Shao, N. Jain, X. Zhang, and Z. Xu. RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems. In ICDE, 2011.
[14] D. Jiang, B. C. Ooi, L. Shi, and S. Wu. The Performance of MapReduce: An In-depth Study. PVLDB, 2010, 3(1).
[15] E. Jahani, M. J. Cafarella, and C. Ré. Automatic Optimization for MapReduce Programs. PVLDB, 2011, 4(6).
[16] D. Abadi, S. R. Madden, and N. Hachem. Column-Stores vs. Row-Stores: How Different Are They Really? In SIGMOD, 2008.
[17] A. Ailamaki, D. J. DeWitt, M. D. Hill, and M. Skounakis. Weaving Relations for Cache Performance. In VLDB, 2001.


More information

Juxtaposition of Apache Tez and Hadoop MapReduce on Hadoop Cluster - Applying Compression Algorithms

Juxtaposition of Apache Tez and Hadoop MapReduce on Hadoop Cluster - Applying Compression Algorithms , pp.289-295 http://dx.doi.org/10.14257/astl.2017.147.40 Juxtaposition of Apache Tez and Hadoop MapReduce on Hadoop Cluster - Applying Compression Algorithms Dr. E. Laxmi Lydia 1 Associate Professor, Department

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large

More information

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( ) Guide: CIS 601 Graduate Seminar Presented By: Dr. Sunnie S. Chung Dhruv Patel (2652790) Kalpesh Sharma (2660576) Introduction Background Parallel Data Warehouse (PDW) Hive MongoDB Client-side Shared SQL

More information

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Alexander Rasin and Avi Silberschatz Presented by

More information

Accelerating Analytical Workloads

Accelerating Analytical Workloads Accelerating Analytical Workloads Thomas Neumann Technische Universität München April 15, 2014 Scale Out in Big Data Analytics Big Data usually means data is distributed Scale out to process very large

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

Research and Realization of AP Clustering Algorithm Based on Cloud Computing Yue Qiang1, a *, Hu Zhongyu2, b, Lei Xinhua1, c, Li Xiaoming3, d

Research and Realization of AP Clustering Algorithm Based on Cloud Computing Yue Qiang1, a *, Hu Zhongyu2, b, Lei Xinhua1, c, Li Xiaoming3, d 4th International Conference on Machinery, Materials and Computing Technology (ICMMCT 2016) Research and Realization of AP Clustering Algorithm Based on Cloud Computing Yue Qiang1, a *, Hu Zhongyu2, b,

More information

Delft University of Technology Parallel and Distributed Systems Report Series

Delft University of Technology Parallel and Distributed Systems Report Series Delft University of Technology Parallel and Distributed Systems Report Series An Empirical Performance Evaluation of Distributed SQL Query Engines: Extended Report Stefan van Wouw, José Viña, Alexandru

More information

A REVIEW PAPER ON BIG DATA ANALYTICS

A REVIEW PAPER ON BIG DATA ANALYTICS A REVIEW PAPER ON BIG DATA ANALYTICS Kirti Bhatia 1, Lalit 2 1 HOD, Department of Computer Science, SKITM Bahadurgarh Haryana, India bhatia.kirti.it@gmail.com 2 M Tech 4th sem SKITM Bahadurgarh, Haryana,

More information

Research on Load Balancing in Task Allocation Process in Heterogeneous Hadoop Cluster

Research on Load Balancing in Task Allocation Process in Heterogeneous Hadoop Cluster 2017 2 nd International Conference on Artificial Intelligence and Engineering Applications (AIEA 2017) ISBN: 978-1-60595-485-1 Research on Load Balancing in Task Allocation Process in Heterogeneous Hadoop

More information

April Copyright 2013 Cloudera Inc. All rights reserved.

April Copyright 2013 Cloudera Inc. All rights reserved. Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and the Virtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here April 2014 Analytic Workloads on

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

Query processing on raw files. Vítor Uwe Reus

Query processing on raw files. Vítor Uwe Reus Query processing on raw files Vítor Uwe Reus Outline 1. Introduction 2. Adaptive Indexing 3. Hybrid MapReduce 4. NoDB 5. Summary Outline 1. Introduction 2. Adaptive Indexing 3. Hybrid MapReduce 4. NoDB

More information

Load Balancing Through Map Reducing Application Using CentOS System

Load Balancing Through Map Reducing Application Using CentOS System Load Balancing Through Map Reducing Application Using CentOS System Nidhi Sharma Research Scholar, Suresh Gyan Vihar University, Jaipur (India) Bright Keswani Associate Professor, Suresh Gyan Vihar University,

More information

MixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp

MixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp MixApart: Decoupled Analytics for Shared Storage Systems Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp Hadoop Pig, Hive Hadoop + Enterprise storage?! Shared storage

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Big Data 7. Resource Management

Big Data 7. Resource Management Ghislain Fourny Big Data 7. Resource Management artjazz / 123RF Stock Photo Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models Syntax Encoding Storage

More information

Chapter 5. The MapReduce Programming Model and Implementation

Chapter 5. The MapReduce Programming Model and Implementation Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing

More information

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16 MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti 1 Department

More information

4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015)

4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015) 4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015) Benchmark Testing for Transwarp Inceptor A big data analysis system based on in-memory computing Mingang Chen1,2,a,

More information

ORC Files. Owen O June Page 1. Hortonworks Inc. 2012

ORC Files. Owen O June Page 1. Hortonworks Inc. 2012 ORC Files Owen O Malley owen@hortonworks.com @owen_omalley owen@hortonworks.com June 2013 Page 1 Who Am I? First committer added to Hadoop in 2006 First VP of Hadoop at Apache Was architect of MapReduce

More information

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework Scientific Journal of Impact Factor (SJIF): e-issn (O): 2348- International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 A Study: Hadoop Framework Devateja

More information

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel

More information

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

A Recovery Approach for SQLite History Recorders from YAFFS2

A Recovery Approach for SQLite History Recorders from YAFFS2 A Recovery Approach for SQLite History Recorders from YAFFS2 Beibei Wu, Ming Xu, Haiping Zhang, Jian Xu, Yizhi Ren, and Ning Zheng College of Computer, Hangzhou Dianzi University, Hangzhou 310018 Jhw_1314@126.com,{mxu,zhanghp}@hdu.edu.cn

More information

CLoud computing is a service through which a

CLoud computing is a service through which a 1 MAP-JOIN-REDUCE: Towards Scalable and Efficient Data Analysis on Large Clusters Dawei Jiang, Anthony K. H. TUNG, and Gang Chen Abstract Data analysis is an important functionality in cloud computing

More information

CISC 7610 Lecture 2b The beginnings of NoSQL

CISC 7610 Lecture 2b The beginnings of NoSQL CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone

More information

Hadoop. copyright 2011 Trainologic LTD

Hadoop. copyright 2011 Trainologic LTD Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides

More information

Shark: SQL and Rich Analytics at Scale. Michael Xueyuan Han Ronny Hajoon Ko

Shark: SQL and Rich Analytics at Scale. Michael Xueyuan Han Ronny Hajoon Ko Shark: SQL and Rich Analytics at Scale Michael Xueyuan Han Ronny Hajoon Ko What Are The Problems? Data volumes are expanding dramatically Why Is It Hard? Needs to scale out Managing hundreds of machines

More information

Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce

Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce Shiori KURAZUMI, Tomoaki TSUMURA, Shoichi SAITO and Hiroshi MATSUO Nagoya Institute of Technology Gokiso, Showa, Nagoya, Aichi,

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

Chisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique

Chisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique Chisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique Prateek Dhawalia Sriram Kailasam D. Janakiram Distributed and Object Systems Lab Dept. of Comp.

More information

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha

More information

Exploiting Bloom Filters for Efficient Joins in MapReduce

Exploiting Bloom Filters for Efficient Joins in MapReduce Exploiting Bloom Filters for Efficient Joins in MapReduce Taewhi Lee, Kisung Kim, and Hyoung-Joo Kim School of Computer Science and Engineering, Seoul National University 1 Gwanak-ro, Seoul 151-742, Republic

More information

HadoopDB: An open source hybrid of MapReduce

HadoopDB: An open source hybrid of MapReduce HadoopDB: An open source hybrid of MapReduce and DBMS technologies Azza Abouzeid, Kamil Bajda-Pawlikowski Daniel J. Abadi, Avi Silberschatz Yale University http://hadoopdb.sourceforge.net October 2, 2009

More information

Modeling and evaluation on Ad hoc query processing with Adaptive Index in Map Reduce Environment

Modeling and evaluation on Ad hoc query processing with Adaptive Index in Map Reduce Environment DEIM Forum 213 F2-1 Adaptive indexing 153 855 4-6-1 E-mail: {okudera,yokoyama,miyuki,kitsure}@tkl.iis.u-tokyo.ac.jp MapReduce MapReduce MapReduce Modeling and evaluation on Ad hoc query processing with

More information

IC-Data: Improving Compressed Data Processing in Hadoop

IC-Data: Improving Compressed Data Processing in Hadoop 2015 IEEE 22nd International Conference on High Performance Computing IC-Data: Improving Compressed Data Processing in Hadoop Adnan Haider, Xi Yang, Ning Liu, Xian-He Sun Computer Science Department Illinois

More information

Big Data Hadoop Stack

Big Data Hadoop Stack Big Data Hadoop Stack Lecture #1 Hadoop Beginnings What is Hadoop? Apache Hadoop is an open source software framework for storage and large scale processing of data-sets on clusters of commodity hardware

More information

MapReduce for Data Warehouses 1/25

MapReduce for Data Warehouses 1/25 MapReduce for Data Warehouses 1/25 Data Warehouses: Hadoop and Relational Databases In an enterprise setting, a data warehouse serves as a vast repository of data, holding everything from sales transactions

More information

I ++ Mapreduce: Incremental Mapreduce for Mining the Big Data

I ++ Mapreduce: Incremental Mapreduce for Mining the Big Data IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 3, Ver. IV (May-Jun. 2016), PP 125-129 www.iosrjournals.org I ++ Mapreduce: Incremental Mapreduce for

More information

cstore_fdw Columnar store for analytic workloads Hadi Moshayedi & Ben Redman

cstore_fdw Columnar store for analytic workloads Hadi Moshayedi & Ben Redman cstore_fdw Columnar store for analytic workloads Hadi Moshayedi & Ben Redman What is CitusDB? CitusDB is a scalable analytics database that extends PostgreSQL Citus shards your data and automa/cally parallelizes

More information

Decision analysis of the weather log by Hadoop

Decision analysis of the weather log by Hadoop Advances in Engineering Research (AER), volume 116 International Conference on Communication and Electronic Information Engineering (CEIE 2016) Decision analysis of the weather log by Hadoop Hao Wu Department

More information

April Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model.

April Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model. 1. MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model. MapReduce is a framework for processing big data which processes data in two phases, a Map

More information

A priority based dynamic bandwidth scheduling in SDN networks 1

A priority based dynamic bandwidth scheduling in SDN networks 1 Acta Technica 62 No. 2A/2017, 445 454 c 2017 Institute of Thermomechanics CAS, v.v.i. A priority based dynamic bandwidth scheduling in SDN networks 1 Zun Wang 2 Abstract. In order to solve the problems

More information

50 Must Read Hadoop Interview Questions & Answers

50 Must Read Hadoop Interview Questions & Answers 50 Must Read Hadoop Interview Questions & Answers Whizlabs Dec 29th, 2017 Big Data Are you planning to land a job with big data and data analytics? Are you worried about cracking the Hadoop job interview?

More information

Clustering Lecture 8: MapReduce

Clustering Lecture 8: MapReduce Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data

More information

Distributed Face Recognition Using Hadoop

Distributed Face Recognition Using Hadoop Distributed Face Recognition Using Hadoop A. Thorat, V. Malhotra, S. Narvekar and A. Joshi Dept. of Computer Engineering and IT College of Engineering, Pune {abhishekthorat02@gmail.com, vinayak.malhotra20@gmail.com,

More information

Map Reduce Group Meeting

Map Reduce Group Meeting Map Reduce Group Meeting Yasmine Badr 10/07/2014 A lot of material in this presenta0on has been adopted from the original MapReduce paper in OSDI 2004 What is Map Reduce? Programming paradigm/model for

More information

MapReduce, Hadoop and Spark. Bompotas Agorakis

MapReduce, Hadoop and Spark. Bompotas Agorakis MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)

More information

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud HiTune Dataflow-Based Performance Analysis for Big Data Cloud Jinquan (Jason) Dai, Jie Huang, Shengsheng Huang, Bo Huang, Yan Liu Intel Asia-Pacific Research and Development Ltd Shanghai, China, 200241

More information

Mitigating Data Skew Using Map Reduce Application

Mitigating Data Skew Using Map Reduce Application Ms. Archana P.M Mitigating Data Skew Using Map Reduce Application Mr. Malathesh S.H 4 th sem, M.Tech (C.S.E) Associate Professor C.S.E Dept. M.S.E.C, V.T.U Bangalore, India archanaanil062@gmail.com M.S.E.C,

More information

COLUMN-STORES VS. ROW-STORES: HOW DIFFERENT ARE THEY REALLY? DANIEL J. ABADI (YALE) SAMUEL R. MADDEN (MIT) NABIL HACHEM (AVANTGARDE)

COLUMN-STORES VS. ROW-STORES: HOW DIFFERENT ARE THEY REALLY? DANIEL J. ABADI (YALE) SAMUEL R. MADDEN (MIT) NABIL HACHEM (AVANTGARDE) COLUMN-STORES VS. ROW-STORES: HOW DIFFERENT ARE THEY REALLY? DANIEL J. ABADI (YALE) SAMUEL R. MADDEN (MIT) NABIL HACHEM (AVANTGARDE) PRESENTATION BY PRANAV GOEL Introduction On analytical workloads, Column

More information

Column-Stores vs. Row-Stores. How Different are they Really? Arul Bharathi

Column-Stores vs. Row-Stores. How Different are they Really? Arul Bharathi Column-Stores vs. Row-Stores How Different are they Really? Arul Bharathi Authors Daniel J.Abadi Samuel R. Madden Nabil Hachem 2 Contents Introduction Row Oriented Execution Column Oriented Execution Column-Store

More information

Progress on Efficient Integration of Lustre* and Hadoop/YARN

Progress on Efficient Integration of Lustre* and Hadoop/YARN Progress on Efficient Integration of Lustre* and Hadoop/YARN Weikuan Yu Robin Goldstone Omkar Kulkarni Bryon Neitzel * Some name and brands may be claimed as the property of others. MapReduce l l l l A

More information

BIGData (massive generation of content), has been growing

BIGData (massive generation of content), has been growing 1 HEBR: A High Efficiency Block Reporting Scheme for HDFS Sumukhi Chandrashekar and Lihao Xu Abstract Hadoop platform is widely being used for managing, analyzing and transforming large data sets in various

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model

More information

Exploring MapReduce Efficiency with Highly-Distributed Data

Exploring MapReduce Efficiency with Highly-Distributed Data Exploring MapReduce Efficiency with Highly-Distributed Data Michael Cardosa, Chenyu Wang, Anshuman Nangia, Abhishek Chandra, Jon Weissman University of Minnesota Minneapolis, MN, A {cardosa,chwang,nangia,chandra,jon}@cs.umn.edu

More information

A Review Approach for Big Data and Hadoop Technology

A Review Approach for Big Data and Hadoop Technology International Journal of Modern Trends in Engineering and Research www.ijmter.com e-issn No.:2349-9745, Date: 2-4 July, 2015 A Review Approach for Big Data and Hadoop Technology Prof. Ghanshyam Dhomse

More information

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Tharso Ferreira 1, Antonio Espinosa 1, Juan Carlos Moure 2 and Porfidio Hernández 2 Computer Architecture and Operating

More information

called Hadoop Distribution file System (HDFS). HDFS is designed to run on clusters of commodity hardware and is capable of handling large files. A fil

called Hadoop Distribution file System (HDFS). HDFS is designed to run on clusters of commodity hardware and is capable of handling large files. A fil Parallel Genome-Wide Analysis With Central And Graphic Processing Units Muhamad Fitra Kacamarga mkacamarga@binus.edu James W. Baurley baurley@binus.edu Bens Pardamean bpardamean@binus.edu Abstract The

More information

An Indian Journal FULL PAPER ABSTRACT KEYWORDS. Trade Science Inc. The study on magnanimous data-storage system based on cloud computing

An Indian Journal FULL PAPER ABSTRACT KEYWORDS. Trade Science Inc. The study on magnanimous data-storage system based on cloud computing [Type text] [Type text] [Type text] ISSN : 0974-7435 Volume 10 Issue 11 BioTechnology 2014 An Indian Journal FULL PAPER BTAIJ, 10(11), 2014 [5368-5376] The study on magnanimous data-storage system based

More information

Huge Data Analysis and Processing Platform based on Hadoop Yuanbin LI1, a, Rong CHEN2

Huge Data Analysis and Processing Platform based on Hadoop Yuanbin LI1, a, Rong CHEN2 2nd International Conference on Materials Science, Machinery and Energy Engineering (MSMEE 2017) Huge Data Analysis and Processing Platform based on Hadoop Yuanbin LI1, a, Rong CHEN2 1 Information Engineering

More information

The Stratosphere Platform for Big Data Analytics

The Stratosphere Platform for Big Data Analytics The Stratosphere Platform for Big Data Analytics Hongyao Ma Franco Solleza April 20, 2015 Stratosphere Stratosphere Stratosphere Big Data Analytics BIG Data Heterogeneous datasets: structured / unstructured

More information

MapReduce Simplified Data Processing on Large Clusters

MapReduce Simplified Data Processing on Large Clusters MapReduce Simplified Data Processing on Large Clusters Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) MapReduce 1393/8/5 1 /

More information

MapReduce & Resilient Distributed Datasets. Yiqing Hua, Mengqi(Mandy) Xia

MapReduce & Resilient Distributed Datasets. Yiqing Hua, Mengqi(Mandy) Xia MapReduce & Resilient Distributed Datasets Yiqing Hua, Mengqi(Mandy) Xia Outline - MapReduce: - - Resilient Distributed Datasets (RDD) - - Motivation Examples The Design and How it Works Performance Motivation

More information

V Conclusions. V.1 Related work

V Conclusions. V.1 Related work V Conclusions V.1 Related work Even though MapReduce appears to be constructed specifically for performing group-by aggregations, there are also many interesting research work being done on studying critical

More information

An Enhanced Approach for Resource Management Optimization in Hadoop

An Enhanced Approach for Resource Management Optimization in Hadoop An Enhanced Approach for Resource Management Optimization in Hadoop R. Sandeep Raj 1, G. Prabhakar Raju 2 1 MTech Student, Department of CSE, Anurag Group of Institutions, India 2 Associate Professor,

More information

The MapReduce Framework

The MapReduce Framework The MapReduce Framework In Partial fulfilment of the requirements for course CMPT 816 Presented by: Ahmed Abdel Moamen Agents Lab Overview MapReduce was firstly introduced by Google on 2004. MapReduce

More information

Weaving Relations for Cache Performance

Weaving Relations for Cache Performance VLDB 2001, Rome, Italy Best Paper Award Weaving Relations for Cache Performance Anastassia Ailamaki David J. DeWitt Mark D. Hill Marios Skounakis Presented by: Ippokratis Pandis Bottleneck in DBMSs Processor

More information

LITERATURE SURVEY (BIG DATA ANALYTICS)!

LITERATURE SURVEY (BIG DATA ANALYTICS)! LITERATURE SURVEY (BIG DATA ANALYTICS) Applications frequently require more resources than are available on an inexpensive machine. Many organizations find themselves with business processes that no longer

More information