SURVEY OF MAPREDUCE OPTIMIZATION METHODS

1 Parmeshwari P. Sabnis, 2 Chaitali A. Laulkar
Computer Department, Sinhgad College of Engineering, Pune, India
1 sabnis.parmeshwari@gmail.com, 2 calaulkar.scoe@sinhgad.edu

Abstract
MapReduce is a widely used data-parallel programming model for large-scale data analysis. The framework has been shown to scale to thousands of computing nodes and to run reliably on commodity clusters. MapReduce provides a simple programming interface with two functions: map and reduce. The functions are executed automatically in parallel on a cluster without any intervention from the programmer. MapReduce also offers other benefits, including load balancing, high scalability, and fault tolerance. The challenge escalates when data are produced dynamically and continuously from different geographical locations. For dynamically generated data, an efficient online algorithm is needed to guide the timely transfer of data into the cloud. For geo-dispersed data sets, the best data center must be selected to aggregate all the data, because a MapReduce-like framework is most efficient when the data to be processed are in one place rather than spread across data centers, owing to the enormous overhead of moving data between data centers during the shuffle and reduce stages. Recently, many researchers have implemented and deployed data-intensive or computation-intensive algorithms on the MapReduce parallel computing framework for high processing efficiency.

Keywords: MapReduce, Hadoop, Optimization, MapReduce Framework, MapReduce Performance

I. INTRODUCTION

MapReduce is a simple but efficient solution for large-scale data processing and analysis. Apache Hadoop is an open-source implementation of GFS and MapReduce. Hadoop's MapReduce framework consists of a job scheduler (JobTracker) running on the master node and a task manager (TaskTracker) running on each slave node.

Although the MapReduce framework frees users from the labor of cluster management and job scheduling, it has many performance problems. Hadoop uses a single master server to control task execution on the slave servers, which leads to shortcomings such as a fatal single point of failure and limited space capacity that seriously affect its scalability. HDFS represents each file, directory, and block as an object in the NameNode's memory, and each object occupies about 150 bytes. If a large number of small files is stored, the NameNode needs a large amount of space, which severely restricts the scalability of the cluster. The JobTracker may be overloaded because it is responsible for monitoring and dispatching at the same time.

Like a database, Hadoop requires specialized optimization according to actual application needs. Many experiments show that there is still much room for improvement in processing performance. To be able to process large-scale datasets, the fundamental design of standard Hadoop places more emphasis on high data throughput than on job execution performance. This limits performance when Hadoop MapReduce is used to execute short jobs that require quick responses. In a widely distributed environment with high network heterogeneity, Hadoop does not always perform well. [3] The main reason for this performance degradation is the interaction and heavy dependency across the different MapReduce phases, which arises because data placement and task execution are highly coupled in the MapReduce paradigm. [4]

Users usually expect a short execution or response time from a short MapReduce job. To provide SQL-like queries or analysis, some query systems are available, such as Google's Sawzall [22], Facebook's Hive, and Yahoo!'s Pig. These systems execute user requests by converting SQL-like queries into a series of MapReduce jobs, which are usually short. These systems are therefore very sensitive to the execution time of the underlying MapReduce jobs, and reducing the execution time of short jobs is very important to these types of applications. For this purpose, an optimized version of Hadoop has been designed to reduce the time consumed in the execution process of a job. Many other performance optimization methods have been introduced to improve the performance of MapReduce for the reasons above.

The model of remote computing is not new and has been very successful in supporting computationally intensive jobs. One of the challenges in applying this model to a Hadoop cluster is the cost of transferring data into and out of the dynamically constructed cluster. For an on-demand Hadoop cluster, the data to be processed must first be imported into the cluster. Because processing large amounts of data is central to the typical Hadoop program, the cost of importing those data into a cluster is extremely relevant to overall performance.

MapReduce has inherent limitations on its performance and efficiency, and many studies have endeavored to overcome them. When selecting a replica of the input files for map tasks, the MapReduce framework does not take into account the distribution of the input data blocks in the distributed file system or the load of the computing nodes themselves, which increases the amount of network data transfer and the system load when map tasks run. The performance of the framework is especially low when it uses the FIFO job scheduling strategy to deal with a large number of small jobs.

II. BACKGROUND AND WORKING OF MAPREDUCE

The MapReduce framework is scheduled by the JobTracker and the TaskTrackers.
[16] A figure shows the relationship of task allocation. The JobTracker is the sole master. It can run on any computer in the cluster, and it schedules and manages the TaskTrackers, allocates Map and Reduce tasks to free TaskTrackers for parallel execution, and monitors the state of the tasks. There can be more than one TaskTracker. A TaskTracker is in charge of executing tasks [15]. It must run on a DataNode, which means that a DataNode is not only a data storage node but also a computing node. If a task on a TaskTracker fails, the JobTracker allocates the task to another free TaskTracker and reruns it.

TABLE I shows how Hadoop deals with large data sets. The MapReduce model abstracts the parallel computing process on large clusters into two functions, the Map function and the Reduce function. The Map function accepts a set of key-value pairs as input and outputs a set of one or more intermediate key-value pairs.

Fig. 1.1 Working of MapReduce

When a job is submitted to the MapReduce framework, MapReduce divides it into several Map tasks and assigns them to different nodes. Every Map task deals with only a part of the input data. After Map task processing, the results, which are the intermediate key-value pairs, are sent to the Reduce function. The Reduce function merges the pairs that share a specific key, then generates and outputs the key-value pairs that the client requires.

TABLE I: MapReduce I/O

Function   Input             Output          Description
Map        (K1, V1)          list(K2, V2)    Each input pair (K1, V1) is mapped to a collection of intermediate pairs (K2, V2).
Reduce     (K2, list(V2))    list(V2)        The group of intermediate values associated with K2 is reduced to a smaller set of values.

III. POSSIBLE OPTIMIZATIONS

Listed below are some optimization methods for increasing the performance of MapReduce.

A. Optimization from the application perspective [26]: Since MapReduce parses the data file iteratively, line by line, writing efficient application programs under this constraint is one route to optimization. The performance of MapReduce can be improved in the following ways:
- Avoid unnecessary Reduce tasks.
- Pull in external files.
- Add a Combiner to the job.
- Reuse Writable types.
- Use StringBuffer instead of String to track program bottlenecks.

B. Hadoop system parameter optimization: There are over 190 configuration parameters in the current Hadoop system. Adjusting these parameters so that jobs run as fast as possible is another kind of optimization. Hadoop system parameter optimization can start from the following three aspects:
- Parameters of the Linux file system.
- General parameters of Hadoop.
- Hadoop job parameters.

C. Hadoop job scheduling algorithm optimization: It has been shown that configuring Hadoop according to the cluster hardware and the number of nodes can greatly improve the performance of a Hadoop cluster. However, this method optimizes performance only statically. It cannot modify the configuration files, reload them, or put them into force dynamically at run time. Optimizing the job scheduling algorithm addresses this problem. The scheduler is a pluggable module in Hadoop, and users can design their own dispatchers according to their application requirements. Here are three task dispatchers.
1. The default dispatcher: This dispatcher adopts a FIFO algorithm, which is simple and clear and places only a light burden on the JobTracker.
2. Dispatcher with computational capability (the Capacity Scheduler): This kind of dispatcher supports multiple queues. Each queue has a certain amount of resources and uses a FIFO scheduling policy. Jobs are scheduled according to job priority and submission time.
3. Fair share scheduling algorithms: This solution was proposed by Facebook. The design philosophy is to ensure that all jobs obtain as equal an amount of resources as possible. When only one job is running in the system, it uses all the resources of the cluster. When more than one job is running, TaskTracker resources are released and assigned to the newly submitted jobs so that all jobs obtain roughly the same computing resources.

D. Data Transfer Bottlenecks: A big challenge is how to minimize the cost of data transmission for the cloud user. Map-Reduce-Merge [8] is a model that adds a Merge phase after the Reduce phase to combine two reduced outputs from two different MapReduce jobs into one. Map-Join-Reduce adds a Join stage before the Reduce stage. MRShare [29], proposed by T. Nykiel et al., is a sharing framework that transforms a batch of queries into a new batch that can be executed more efficiently by merging jobs into groups and then evaluating each group as a single query. Data skew is also an important factor that affects data transfer cost. To overcome it, a method was proposed that divides a MapReduce job into two phases: a sampling MapReduce job and the expected MapReduce job. The first phase samples the input data, gathers the inherent distribution of key frequencies, and builds a partition scheme. In the second phase, the expected MapReduce job applies the partition scheme in every mapper so that the intermediate keys are grouped quickly.

E. Iterative Optimization: For iterative problems, MapReduce needs a large amount of input and output and performs unnecessary computation. Twister, proposed by J. Ekanayake et al., is an enhanced MapReduce runtime that supports iterative MapReduce computations efficiently. It adds an extra Combine stage after the Reduce stage, and the output of the Combine stage flows into the Map stage of the next iteration. It avoids instantiating workers repeatedly: previously instantiated workers are reused for the next iteration with different inputs. HaLoop is similar to Twister. It is a modified version of the MapReduce framework that supports iterative applications by adding loop control.

F. Online: MapReduce Online raises the issue that frequent checkpointing and shuffling of intermediate results limit pipelined processing.
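The two-phase approach to data skew described under Data Transfer Bottlenecks can be sketched in a few lines of Python. This is a minimal illustration, not the method of any cited paper: the function names, the greedy "heaviest key to the least-loaded reducer" rule, and the hash fallback for unsampled keys are all assumptions made for the example.

```python
import random
import zlib
from collections import Counter


def sample_key_frequencies(records, sample_rate, seed=0):
    """Phase 1: sample the input and count how often each key appears."""
    rng = random.Random(seed)
    return Counter(key for key, _ in records if rng.random() < sample_rate)


def build_partition_scheme(freqs, num_reducers):
    """Assign keys to reducers, heaviest key first, always to the reducer
    with the smallest estimated load (an assumed greedy heuristic)."""
    loads = [0] * num_reducers
    scheme = {}
    for key, count in sorted(freqs.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        target = loads.index(min(loads))
        scheme[key] = target
        loads[target] += count
    return scheme


def partition(key, scheme, num_reducers):
    """Phase 2: every mapper applies the scheme; keys that were never
    sampled fall back to a deterministic hash partition."""
    if key in scheme:
        return scheme[key]
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def reducer_loads(records, scheme, num_reducers):
    """Count how many intermediate records each reducer would receive."""
    loads = [0] * num_reducers
    for key, _ in records:
        loads[partition(key, scheme, num_reducers)] += 1
    return loads


if __name__ == "__main__":
    skewed = [("a", 1)] * 6 + [("b", 1)] * 3 + [("c", 1)] * 2 + [("d", 1)]
    scheme = build_partition_scheme(sample_key_frequencies(skewed, 1.0), 2)
    print(scheme, reducer_loads(skewed, scheme, 2))
```

With a sampling rate below 1.0 the frequencies are only estimates, so the resulting balance depends on how representative the sample is.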
The MapReduce framework was modified so that Mappers periodically push the data they have temporarily stored in local storage to the Reducers of the same MapReduce job. In addition, map-side pre-aggregation is used to reduce communication. The Hadoop Online Prototype (HOP), proposed by Tyson Condie et al., is similar to MapReduce Online. HOP is a modified version of the MapReduce framework that lets users see results from a job while it is still being computed. D. Jiang et al. [28] found that the merge sort in MapReduce costs a large amount of I/O and seriously affects the performance of MapReduce.

G. Short Job Optimizations: The focus here is on improving the execution performance of short jobs on Hadoop. After an analysis of the shortcomings of the job execution mechanism in standard Hadoop, an optimized version of Hadoop MapReduce was implemented to reduce the time consumed in the execution process of a job. The first optimization reduces the time cost of the initialization and termination stages of a job by removing the constant cost of 4 heartbeats for its setup and cleanup tasks. The second optimization replaces the heartbeat-based pull-model task assignment with a push-model task assignment mechanism. The third optimization adds an instant message communication mechanism for event notification between the JobTracker and the TaskTrackers, which separates message communication from heartbeats.
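The effect of the heartbeat-related changes in the short job optimizations can be illustrated with a small timing model. This is a sketch under stated assumptions, not a measurement of Hadoop: the 3-second heartbeat interval, the 0.05-second push latency, and all function names are hypothetical values chosen for the example.

```python
import math

HEARTBEAT_INTERVAL_S = 3.0  # assumed heartbeat period, for illustration only
PUSH_LATENCY_S = 0.05       # assumed one-way message latency, for illustration only


def next_heartbeat(t, interval=HEARTBEAT_INTERVAL_S):
    """Time of the first heartbeat at or after t (heartbeats at 0, interval, ...)."""
    return math.ceil(t / interval) * interval


def pull_start_times(ready_times, interval=HEARTBEAT_INTERVAL_S):
    """Pull model: a ready task is handed out only at the next heartbeat."""
    return [next_heartbeat(t, interval) for t in ready_times]


def push_start_times(ready_times, latency=PUSH_LATENCY_S):
    """Push model: the master sends the task as soon as it is ready."""
    return [t + latency for t in ready_times]


def mean_delay(ready_times, start_times):
    """Average wait between a task becoming ready and being assigned."""
    waits = [s - r for r, s in zip(ready_times, start_times)]
    return sum(waits) / len(waits)


def fixed_heartbeat_overhead(heartbeats=4, interval=HEARTBEAT_INTERVAL_S):
    """Constant setup and cleanup cost expressed in heartbeats."""
    return heartbeats * interval


if __name__ == "__main__":
    ready = [0.5, 1.0, 4.0, 7.5]
    print("pull mean delay:", mean_delay(ready, pull_start_times(ready)))
    print("push mean delay:", mean_delay(ready, push_start_times(ready)))
    print("setup/cleanup overhead:", fixed_heartbeat_overhead())
```

In this model the pull delay is bounded by one heartbeat interval per assignment, which matters little for long jobs but can dominate a job that runs for only a few seconds.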

H. Optimization of Data Import: In the traditional HDFS architecture, the Client uses a single input stream and buffer to import a local file, which makes the transfer process sequential. That is, the Client passes the input stream to the Datanode to copy the first block of the file, and then waits for a response from the Datanode indicating that the transfer succeeded before continuing with the second block. This creates a bottleneck, because all data from the file must pass through the Client before it is transferred to the Datanodes. The last block of every file must wait for all of the previous blocks to finish before it is copied. The sequential transfer process is a hindrance, particularly when extremely large files are transferred from the local file system to HDFS. For the case where the data can be accessed directly by the Datanodes, an alternate approach has been proposed that keeps much of the traditional process while allowing the initial data transfer into HDFS to bypass the buffer in the Client. In the implementation, the initial part of the process, in which the Client and the Namenode communicate to determine where the file will be stored on the Datanodes, occurs normally. But instead of opening an input stream to the local file and passing it to the first Datanode through a socket, the new Client sends, in a byte array, the file path together with the offset in the file and the number of bytes to be copied. The Datanode then parses the incoming byte array to determine the path to the file on the local file system, how far to offset within that file, and how many bytes of the file to transfer to itself.

I. Task Assignment: To overcome the shortcomings of the FIFO scheduling strategy, multiple FIFO queues are added to the MapReduce framework in Hadoop. With this change, several jobs can run at the same time in the MapReduce framework. The optimized map task assignment strategy consists of two parts:
1. Data locality scheduling strategy: Several job queues are added to the MapReduce framework, so that the JobTracker (JT) can schedule more than one job into the running state at the same time. This takes full advantage of the computing capacity of the nodes and shortens the average execution time of the jobs, and ultimately improves the performance and efficiency of the system.
2. Replica selection strategy: On the premise that all map tasks are scheduled to execute locally, the load of the system should also be considered. Load balancing can further enhance the performance and efficiency of the entire system. All the nodes in the Hadoop cluster are homogeneous, so the overhead of the operating system and hardware on each node can be assumed to be a constant value, referred to as A. The parameters used to describe the load information include the number of tasks in the run queue, the speed of system calls, the CPU context switching rate, the percentage of CPU idle time, the idle memory size, and so on.

IV. PROPOSED METHOD

To be able to process large-scale datasets, the fundamental design of standard Hadoop places more emphasis on high data throughput than on job execution performance. This limits performance when Hadoop MapReduce is used to execute short jobs that require quick responses. To speed up the execution of short jobs, optimization methods are required that improve the execution performance of MapReduce jobs. This can be achieved by improving communication between the JobTracker and the TaskTrackers. To compare the previous working of MapReduce with the optimized one, the system needs to be tested on an application. The K-means clustering algorithm would be considered for this purpose.

V. COMPARATIVE STUDY

With the task assignment strategy, all the map tasks can be assigned to the TaskTracker that contains the input data fragments for those tasks. By taking load balancing into account, the improved model can effectively increase the throughput and reduce the average response time of the system.
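The load-aware replica selection discussed under Task Assignment can be sketched as follows. This is a minimal illustration under assumptions: it uses only three of the load parameters listed there (run queue length, CPU idle percentage, idle memory), and the NodeLoad type, the weights, and the linear score are hypothetical choices made for the example rather than the formula of the cited work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeLoad:
    """Load report for one node that holds a replica of the input block."""
    name: str
    run_queue_len: int   # number of tasks in the run queue
    cpu_idle_pct: float  # percentage of CPU idle time
    free_mem_mb: float   # idle memory size in MB


def load_score(node, w_queue=1.0, w_cpu=0.05, w_mem=0.001):
    """Higher score means a busier node. The weights are assumed values."""
    return (w_queue * node.run_queue_len
            + w_cpu * (100.0 - node.cpu_idle_pct)
            - w_mem * node.free_mem_mb)


def select_replica(replica_holders):
    """Pick the least-loaded node among those holding a local replica,
    so the map task stays data-local while the load is balanced."""
    if not replica_holders:
        raise ValueError("no node holds a replica of this block")
    return min(replica_holders, key=lambda n: (load_score(n), n.name))


if __name__ == "__main__":
    holders = [
        NodeLoad("node-a", run_queue_len=4, cpu_idle_pct=20.0, free_mem_mb=1000.0),
        NodeLoad("node-b", run_queue_len=1, cpu_idle_pct=80.0, free_mem_mb=4000.0),
        NodeLoad("node-c", run_queue_len=2, cpu_idle_pct=50.0, free_mem_mb=2000.0),
    ]
    print(select_replica(holders).name)
```

Restricting the choice to nodes that already hold a replica keeps every map task data-local, and the score only decides which local copy to read.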
The process of importing data into Hadoop, as discussed under Optimization of Data Import, has drawn significant attention in the high performance computing industry. Short job optimization has successfully increased the execution speed of standard Hadoop. Data transfer optimization helps to reduce the transmission cost. The Combine stage for iterative optimization has increased the data processing speed of iterative processes. Proper selection of the dispatcher and of the Hadoop parameters can increase the execution speed and performance of Hadoop. Various optimizations are available for increasing the performance of Hadoop, and one can be chosen according to the needs of the application.

VI. CONCLUSIONS

As an open-source implementation of a cloud computing system, Hadoop is receiving more and more attention from academia and industry, and its application is increasingly widespread. Though Hadoop shows good performance in dealing with large data sets concurrently, it still has some shortcomings. This paper describes the working of MapReduce, analyzes existing problems of the Hadoop data processing platform, and gives some suggestions for Hadoop cluster optimization.

ACKNOWLEDGMENT

We take this opportunity to express our deep sense of gratitude towards those who have helped us in various ways in preparing this paper. We would like to thank the reviewers of this paper for their constructive comments.

REFERENCES

[1] Jinshuang Yan, Xiaoliang Yang, Rong Gu, Chunfeng Yuan, and Yihua Huang, Performance Optimization for Short MapReduce Job Execution in Hadoop, 2012 Second International Conference on Cloud and Green Computing.
[2] Changqing Ji, Yu Li, Wenming Qiu, Uchechukwu Awada, Keqiu Li, Big Data Processing in Cloud Computing Environments, 2012 International Symposium on Pervasive Systems, Algorithms and Networks.
[3] Benjamin Heintz, Chenyu Wang, Abhishek Chandra, and Jon Weissman, Cross-Phase Optimization in MapReduce, 2013 IEEE International Conference on Cloud Engineering.
[4] Xiaohong Zhang, Guowei Wang, Zijing Yang, Yang Ding, A Two-phase Execution Engine of Reduce Tasks In Hadoop MapReduce, 2012 International Conference on Systems and Informatics (ICSAI 2012).
[5] Congchong Liu and Shujia Zhou, Local and Global Optimization of MapReduce Program Model, 2011 IEEE World Congress on Services.
[6] Huang Lu, Hu Ting-ting and Chen Hai-shan, Research on Hadoop Cloud Computing Model and its Applications, 2012 Third International Conference on Networking and Distributed Computing.
[7] Lijie Xu, MapReduce Framework Optimization via Performance Modeling, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.
[8] H. Yang, A. Dasdan, R. Hsiao, and D. Parker, Map-reduce-merge: simplified relational data processing on large clusters, in Proceedings of the 2007 ACM SIGMOD international conference on Management of data. ACM, 2007.
[9] D. Jiang, A. Tung, and G. Chen, Map-Join-Reduce: Toward scalable and efficient data analysis on large clusters, Knowledge and Data Engineering, IEEE Transactions on, vol. 23, no. 9, 2011.
[10] C. Wang, J. Wang, X. Lin, W. Wang, H. Wang, H. Li, W. Tian, J. Xu, and R. Li, Mapdupreducer: detecting near duplicates over massive datasets, in Proceedings of the 2010 international conference on Management of data. ACM, 2010.
[11] B. He, W. Fang, Q. Luo, N. Govindaraju, and T. Wang, Mars: a mapreduce framework on graphics processors, in Proceedings of the 17th international conference on Parallel architectures and compilation techniques. ACM, 2008.
[12] C. Zhang, F. Li, and J. Jestes, Efficient parallel knn joins for large data in mapreduce, in Proceedings of the 15th International Conference on Extending Database Technology. ACM, 2012.
[13] T. Condie, N. Conway, P. Alvaro, J. Hellerstein, J. Gerth, J. Talbot, K. Elmeleegy, and R. Sears, Online aggregation and continuous query support in mapreduce, in ACM SIGMOD, 2010.
[14] S. Das, Y. Sismanis, K. Beyer, R. Gemulla, P. Haas, and J. McPherson, Ricardo: integrating r and hadoop, in Proceedings of the 2010 international conference on Management of data. ACM, 2010.
[15] J. Dean and S. Ghemawat, MapReduce: simplified data processing on large clusters, Commun. ACM, vol. 51, no. 1.
[16] T. White, Hadoop: The Definitive Guide, O'Reilly Media.
[17] Weijia Xu, Wei Luo, Nicholas Woodward, Analysis and Optimization of Data Import with Hadoop, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.
[18] K. Shvachko, K. Hairong, S. Radia et al., The Hadoop Distributed File System.
[19] Songchang Jin, Shuqiang Yang, Yan Jia, Optimization of Task Assignment Strategy for Map-Reduce, International Conference on Computer Science and Network Technology.
[20] Guangbin Xu, Load balancing principle and algorithm implementation on the Linux cluster [EB/OL], 1.html,
[21] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S. Bae, J. Qiu, and G. Fox, Twister: a runtime for iterative mapreduce, in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing. ACM, 2010.
[22] R. Pike, S. Dorward, R. Griesemer, S. Quinlan, Interpreting the Data: Parallel Analysis with Sawzall, Scientific Programming Journal, vol. 13, no. 4, Oct. 2005.
[23] K. Morton, A. Friesen, M. Balazinska, and D. Grossman, Estimating the progress of MapReduce pipelines, in ICDE.
[24] Sangwon Seo, Ingook Jang, Kyungchang Woo, Inkyo Kim, Jin-Soo Kim, Seungryoul Maeng, HPMR: Prefetching and Pre-Shuffling in Shared MapReduce Computation Environment, IEEE International Conference on Cluster Computing and Workshops, 2009.
[25] Shubin Zhang, Jizhong Han, Zhiyong Liu, Kai Wang, Shengzhong Feng, Accelerating MapReduce with Distributed Memory Cache, IEEE International Conference on Parallel and Distributed Systems.
[26] Xin Daxin, Liu Fei, Research on optimization techniques for Hadoop cluster performance [J], Computer Knowledge and Technology, 2011, 8(7): 5484~5486.
[27] G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, Pregel: a system for large-scale graph processing, in Proceedings of the 2010 international conference on Management of data. ACM, 2010.
[28] D. Jiang, B. Ooi, L. Shi, and S. Wu, The performance of mapreduce: An in-depth study, Proceedings of the VLDB Endowment, vol. 3, no. 1-2.
[29] T. Nykiel, M. Potamias, C. Mishra, G. Kollios, and N. Koudas, Mrshare: Sharing across multiple queries in mapreduce, Proceedings of the VLDB Endowment, vol. 3, no. 1-2.


Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E.

18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E. 18-hdfs-gfs.txt Thu Oct 27 10:05:07 2011 1 Notes on Parallel File Systems: HDFS & GFS 15-440, Fall 2011 Carnegie Mellon University Randal E. Bryant References: Ghemawat, Gobioff, Leung, "The Google File

More information

Survey on MapReduce Scheduling Algorithms

Survey on MapReduce Scheduling Algorithms Survey on MapReduce Scheduling Algorithms Liya Thomas, Mtech Student, Department of CSE, SCTCE,TVM Syama R, Assistant Professor Department of CSE, SCTCE,TVM ABSTRACT MapReduce is a programming model used

More information

Hadoop MapReduce Framework

Hadoop MapReduce Framework Hadoop MapReduce Framework Contents Hadoop MapReduce Framework Architecture Interaction Diagram of MapReduce Framework (Hadoop 1.0) Interaction Diagram of MapReduce Framework (Hadoop 2.0) Hadoop MapReduce

More information

Novel Algorithm with In-node Combiner for enhanced performance of MapReduce on Amazon EC2. Ashwini Rajaram Chandanshive x

Novel Algorithm with In-node Combiner for enhanced performance of MapReduce on Amazon EC2. Ashwini Rajaram Chandanshive x Novel Algorithm with In-node Combiner for enhanced performance of MapReduce on Amazon EC2 MSc Research Project Cloud Computing Ashwini Rajaram Chandanshive x15043584 School of Computing National College

More information

The MapReduce Abstraction

The MapReduce Abstraction The MapReduce Abstraction Parallel Computing at Google Leverages multiple technologies to simplify large-scale parallel computations Proprietary computing clusters Map/Reduce software library Lots of other

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

Data Prefetching for Scientific Workflow Based on Hadoop

Data Prefetching for Scientific Workflow Based on Hadoop Data Prefetching for Scientific Workflow Based on Hadoop Gaozhao Chen, Shaochun Wu, Rongrong Gu, Yongquan Xu, Lingyu Xu, Yunwen Ge, and Cuicui Song * Abstract. Data-intensive scientific workflow based

More information

Chapter 5. The MapReduce Programming Model and Implementation

Chapter 5. The MapReduce Programming Model and Implementation Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing

More information

THE SURVEY ON MAPREDUCE

THE SURVEY ON MAPREDUCE THE SURVEY ON MAPREDUCE V.VIJAYALAKSHMI Assistant professor, Department of Computer Science and Engineering, Christ College of Engineering and Technology, Puducherry, India, E-mail: vivenan09@gmail.com.

More information
