Performance Optimization for Short MapReduce Job Execution in Hadoop


2012 Second International Conference on Cloud and Green Computing

Performance Optimization for Short MapReduce Job Execution in Hadoop

Jinshuang Yan, Xiaoliang Yang, Rong Gu, Chunfeng Yuan, and Yihua Huang
Department of Computer Science and Technology, National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China

Abstract—Hadoop MapReduce is a widely used parallel computing framework for solving data-intensive problems. To be able to process large-scale datasets, the fundamental design of the standard Hadoop places more emphasis on high data throughput than on job execution performance. This causes a performance limitation when Hadoop MapReduce is used to execute short jobs that require quick responses. In order to speed up the execution of short jobs, this paper proposes optimization methods to improve the execution performance of MapReduce jobs. We make three major optimizations: first, we reduce the time cost of the initialization and termination stages of a job by optimizing its setup and cleanup tasks; second, we replace the pull-model task assignment mechanism with a push model; third, we replace the heartbeat-based communication mechanism with an instant message communication mechanism for event notifications between the JobTracker and the TaskTrackers. Experimental results show that the job execution performance of our improved version of Hadoop is about 23% faster on average than the standard Hadoop for our test application.

Keywords—MapReduce; parallel computing; job execution; performance optimization

I. INTRODUCTION

The MapReduce parallel computing framework [1], proposed by Google in 2004, has become an effective and attractive solution for large-scale data processing problems. Unlike classic parallel programming models such as MPI [2] and PVM [3], MapReduce provides simple programming interfaces with two functions, map and reduce, which can be executed in parallel on a cluster without any intervention from the programmer. Moreover, MapReduce offers other benefits, including load balancing, high scalability, and fault tolerance, which have made it a widely adopted parallel computing framework that has also gained attention in academia. Hadoop [4], an open-source project of the Apache Software Foundation, is an implementation of Google's MapReduce framework and is widely used by research communities and industry.

Recently, many researchers have implemented and deployed data-intensive and/or computation-intensive algorithms (e.g., machine learning algorithms [5]) on the MapReduce parallel computing framework for high processing efficiency. At the same time, quite a few researchers are exploring performance improvements for MapReduce or designing and implementing novel parallel computing architectures inspired by the MapReduce framework. Facebook proposed the fair scheduler [6] to better solve the job scheduling problem in the MapReduce framework. Researchers at UC Berkeley proposed a task scheduling algorithm called LATE (Longest Approximate Time to End) [7], which executes speculative tasks to improve the response time of Hadoop. A delay scheduling method is introduced in [8], which aims to achieve locality and fairness in task scheduling for increased throughput. HPMR [9] adopts pre-fetching and pre-shuffling optimization strategies, which improve the overall performance of the standard Hadoop.
Other researchers have tried to accelerate MapReduce job execution by caching intermediate data in distributed memory [10], estimating the progress of MapReduce pipelines [11], or optimizing MapReduce programs automatically [12].

Users usually expect short execution or quick response times from a short MapReduce job. This is especially true for online query or analysis-type applications. To provide SQL-like queries or analysis, several query systems are available, such as Google's Sawzall [13], Facebook's Hive [14], and Yahoo!'s Pig [15]. These systems execute users' requests by converting SQL-like queries into a series of MapReduce jobs, which are usually short. Obviously, such systems are very sensitive to the execution time of the underlying MapReduce jobs. Therefore, reducing the execution time of short jobs is very important to these types of applications.

In this paper, we focus on improving the execution performance of short jobs on Hadoop. After analyzing the shortcomings of the job execution mechanism in the standard Hadoop, we implement an optimized version of Hadoop MapReduce designed to reduce the time consumed in the execution process of a job. The first optimization reduces the time cost of the initialization and termination stages of a job by removing the constant cost of 4 heartbeats for its setup and cleanup tasks. For the second optimization, instead of the heartbeat-based pull-model task assignment, we design and implement a push-model task assignment mechanism. For the third optimization, we design and implement an instant message communication mechanism for event notification between the JobTracker and the TaskTrackers, separating this message traffic from heartbeats. Experimental results on our test application show improved performance of the optimized Hadoop for executing short jobs. In contrast to a long job, a short job here refers to a job whose whole execution takes only several minutes.

(Correspondence should be addressed to Yihua Huang, Ph.D., yhuang@nju.edu.cn.)

The rest of the paper is organized as follows. Section II gives a brief introduction to the Hadoop MapReduce architecture, analyzes the current job execution mechanism, and identifies the places that cause performance problems in the standard Hadoop. Section III describes our optimization methods for the problems discussed in Section II. Section IV discusses the experiments and performance evaluation of the optimizations.

II. BACKGROUND

A. Hadoop MapReduce Job Execution Process

The Hadoop MapReduce framework, deployed on top of HDFS, consists of a JobTracker running on the master node and many TaskTrackers running on slave nodes. As the core component of the MapReduce framework, the JobTracker is in charge of scheduling and monitoring all the tasks of a MapReduce job. Tasks are distributed to the TaskTrackers, on which the map and reduce functions implemented by users are executed. When a MapReduce job is submitted to Hadoop, the input data of the job is split into several independent, equal-sized data splits, with each map task processing one split. The map tasks run in parallel, and their outputs are sorted by the framework and then fetched by reduce tasks for further processing. During the job execution, the JobTracker monitors the execution of each task, reschedules failed tasks, and alters the state of the job in each phase.

Job and task are two important concepts in the Hadoop MapReduce architecture. To elaborate on the problems we address, we first present the state transition of a job and then the sequential processing of a task. Figure 1 shows the state transition of a job. Generally, the execution of a job can be partitioned into three stages: PREP, RUNNING, and FINISHED. When a client submits a job to the Hadoop cluster, the execution process is as follows:

PREP stage: at the beginning, the job is in the NEW state. The job then enters the PREP.INITIALIZING state for initialization processing, such as reading split information from HDFS and creating the map and reduce tasks on the JobTracker. Next, the setup task of the job is scheduled to a TaskTracker. When the setup task is completed, the job enters the RUNNING state.

RUNNING stage: first, the job is in the RUNNING.RUN_WAIT state, waiting for tasks to be scheduled. When a task has been scheduled to a TaskTracker, the job enters the RUNNING.RUNNING_TASKS state to execute all map/reduce tasks. Once all the map and reduce tasks are completed, the job moves to the RUNNING.SUC_WAIT state. In this state, the cleanup task of the job is scheduled to a TaskTracker to clean up the job's environment.

FINISHED stage: after the job cleanup task is done, the job goes into the SUCCEEDED state; in other words, the job has been completed.

In any state, a job can be killed by the client and go into the KILLED state, or go into the FAILED state due to some failure.

Figure 1. The state transition of a job
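To make the state progression described above concrete, the following is a minimal, self-contained Java sketch of the job life cycle as presented in this section. It is an illustrative model only; the class and state names (JobLifecycle, JobState) are our own simplification and do not correspond to Hadoop's actual JobInProgress/JobStatus code.

```java
// Illustrative model of the job life cycle described above (PREP -> RUNNING -> FINISHED).
// The names below are a simplification for explanation, not Hadoop's real classes.
public class JobLifecycle {

    enum JobState {
        NEW,                    // job object created
        PREP_INITIALIZING,      // reading split info, creating map/reduce tasks
        RUNNING_RUN_WAIT,       // waiting for tasks to be scheduled
        RUNNING_RUNNING_TASKS,  // map/reduce tasks executing on TaskTrackers
        RUNNING_SUC_WAIT,       // all tasks done, cleanup task being scheduled
        SUCCEEDED, KILLED, FAILED
    }

    private JobState state = JobState.NEW;

    // Advance along the normal (failure-free) path described in Section II.A.
    public void advance() {
        switch (state) {
            case NEW:                   state = JobState.PREP_INITIALIZING;     break;
            case PREP_INITIALIZING:     state = JobState.RUNNING_RUN_WAIT;      break; // setup task done
            case RUNNING_RUN_WAIT:      state = JobState.RUNNING_RUNNING_TASKS; break; // first task scheduled
            case RUNNING_RUNNING_TASKS: state = JobState.RUNNING_SUC_WAIT;      break; // all tasks completed
            case RUNNING_SUC_WAIT:      state = JobState.SUCCEEDED;             break; // cleanup task done
            default: /* SUCCEEDED, KILLED, FAILED are terminal */                break;
        }
    }

    // From any non-terminal state the client may kill the job, or it may fail.
    public void kill() { if (!isTerminal()) state = JobState.KILLED; }
    public void fail() { if (!isTerminal()) state = JobState.FAILED; }

    private boolean isTerminal() {
        return state == JobState.SUCCEEDED || state == JobState.KILLED || state == JobState.FAILED;
    }

    public JobState state() { return state; }
}
```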
Figure 2 shows the sequential processing of a task. When a task is being processed in Hadoop, it has two components: a TaskInProgress in the JobTracker and a TaskInProgress in the TaskTracker. When a job is initialized, many map/reduce tasks of the job are created; these tasks wait to be scheduled to the TaskTrackers for execution. Figure 2 illustrates how a task is processed:

1) The JobTracker creates a JobInProgress instance for each job, and the corresponding map/reduce tasks are created. At this time, the tasks are in the UNASSIGNED state.

2) Each TaskTracker sends a heartbeat to the JobTracker to request tasks. In response, the JobTracker allocates one or several tasks to the TaskTracker. This is done by the first heartbeat.

3) After receiving tasks, the TaskTracker creates a TaskTracker.TaskInProgress instance, runs an independent child thread to execute the task, and changes its task state to RUNNING.

4) Each TaskTracker reports the information of its task to the JobTracker, and the JobTracker changes the task state to RUNNING. This is done by the second heartbeat.

5) When the child thread completes, the TaskTracker changes the task state to COMMIT_PENDING.

6) The TaskTracker reports this information to the JobTracker again through a heartbeat. In response, the JobTracker changes the task state to COMMIT_PENDING to allow the TaskTracker to commit the task results.

7) The TaskTracker commits the task results and changes the task state to SUCCEEDED.

8) The TaskTracker reports success through a heartbeat, and the JobTracker changes the task state to SUCCEEDED. By this time, the task is completed.

Figure 2. The sequential process of a task

B. Hadoop MapReduce Job Setup/Cleanup

As shown in the job state transition, before the map/reduce tasks of a job are scheduled, a setup task must be scheduled first. In brief, this task is processed as follows:

1) Launch the job setup task: through a heartbeat, the JobTracker discovers a TaskTracker with a free map/reduce slot that can accept a new task, and the JobTracker schedules the setup task to this TaskTracker.

2) Complete the job setup task: the TaskTracker processes the task and then reports the task information back to the JobTracker.

These two steps take two heartbeats (at least 6 seconds, as the heartbeat interval is at least 3 seconds). Similarly, a cleanup task must be scheduled after all map/reduce tasks are completed and takes another two heartbeats, i.e., at least another 6 seconds. As a result, the setup and cleanup tasks take at least 12 seconds in total. For a short job that runs for only a couple of minutes, these two special tasks may account for around 10% or more of the total execution time. If we can cut down this fixed cost of 4 heartbeats for a short job, it will be a noticeable improvement in job execution performance. By taking a closer look at the implementation of the setup and cleanup tasks, we find that we can modify them to remove the cost of these 4 heartbeats. Section III discusses this optimization in more detail.
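The fixed-overhead argument above can be written out as a short calculation. The following Java sketch only illustrates the arithmetic of Section II.B (two heartbeats for setup plus two for cleanup, at a minimum interval of 3 seconds); the 120-second example job length is a hypothetical value chosen to match the "couple of minutes" case.

```java
// Back-of-the-envelope estimate of the fixed setup/cleanup overhead described in Section II.B.
public class SetupCleanupOverhead {

    static final double MIN_HEARTBEAT_INTERVAL_SEC = 3.0; // minimum heartbeat interval
    static final int SETUP_CLEANUP_HEARTBEATS = 4;        // 2 for setup + 2 for cleanup

    // Minimum time spent on setup + cleanup, independent of the job's real work.
    static double fixedOverheadSec() {
        return SETUP_CLEANUP_HEARTBEATS * MIN_HEARTBEAT_INTERVAL_SEC; // >= 12 s
    }

    // Fraction of the total job time lost to this fixed overhead.
    static double overheadFraction(double totalJobSec) {
        return fixedOverheadSec() / totalJobSec;
    }

    public static void main(String[] args) {
        // Hypothetical 2-minute short job: the fixed cost alone is about 10% of the run time.
        System.out.printf("overhead = %.0f s (%.0f%% of a 120 s job)%n",
                fixedOverheadSec(), 100 * overheadFraction(120));
    }
}
```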
C. Heartbeat Delay

From Figure 2 we can see that each TaskTracker periodically sends information to the JobTracker and performs pull-model task requests, to which the JobTracker responds. We refer to this as the pull-model heartbeat communication mechanism. With heartbeat communication, the TaskTrackers report node information to the JobTracker, and the JobTracker then issues control commands to the TaskTrackers. To control and manage a Hadoop cluster, an appropriate heartbeat period must be set. Currently, for a cluster with fewer than 100 nodes, the default heartbeat interval is 3 seconds; an additional 1 second is added per 100 extra nodes. To some extent, the pull-model heartbeat communication mechanism helps prevent the JobTracker from being overwhelmed. But it comes with a heavy time cost:

1) The JobTracker has to wait passively for the TaskTrackers to request tasks; as a result, there is a delay between submitting a job and scheduling it, caused by the heartbeat interval.

2) Important information (task success, commit, failure, etc.) cannot be reported immediately from the TaskTrackers to the JobTracker, which delays task scheduling, further increases the time cost of job execution, and decreases the utilization of computing resources.
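As an illustration of where these delays come from, here is a simplified Java sketch of the pull-model loop on the TaskTracker side. The interfaces and names (JobTrackerStub, PullModelTaskTracker) are hypothetical stand-ins rather than Hadoop's real RPC classes, and the interval rule follows the description above (3 seconds for fewer than 100 nodes, plus 1 second per additional 100 nodes).

```java
import java.util.List;

// Simplified pull-model heartbeat loop: the TaskTracker drives all communication,
// so new tasks can only be assigned when the next heartbeat happens to arrive.
public class PullModelTaskTracker {

    interface JobTrackerStub {
        // The JobTracker answers a heartbeat with the actions (e.g., "launch task X")
        // it wants this TaskTracker to perform.
        List<String> heartbeat(String trackerName, int freeMapSlots, int freeReduceSlots);
    }

    // Heartbeat interval as described in Section II.C: 3 s base, +1 s per extra 100 nodes.
    static long heartbeatIntervalMillis(int clusterSize) {
        int extraHundreds = Math.max(0, (clusterSize - 1) / 100); // 0 for clusters below 100 nodes
        return (3 + extraHundreds) * 1000L;
    }

    public static void run(JobTrackerStub jobTracker, String trackerName, int clusterSize)
            throws InterruptedException {
        while (true) {
            // 1. Report status and ask for work (the "pull").
            List<String> actions = jobTracker.heartbeat(trackerName, /*freeMapSlots=*/2, /*freeReduceSlots=*/2);

            // 2. Launch whatever the JobTracker handed back.
            for (String action : actions) {
                System.out.println(trackerName + " launching: " + action);
            }

            // 3. Sleep until the next heartbeat. A job submitted right after this call
            //    waits up to a full interval before any of its tasks can be assigned.
            Thread.sleep(heartbeatIntervalMillis(clusterSize));
        }
    }
}
```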

Merely decreasing the heartbeat interval cannot improve the resource utilization of a Hadoop cluster. In parts B and C of Section III, we propose a task assignment and communication mechanism between the JobTracker and the TaskTrackers that reduces the delay caused by the pull-model heartbeat communication mechanism.

III. OPTIMIZATION METHODS

A. Dismissal of the Job Setup and Cleanup Tasks

In the standard Hadoop, we observe that the job setup task simply creates a temporary directory for data output, and the job cleanup task deletes that directory. The actual time cost of these two tasks is very small. Thus, instead of sending messages to the TaskTrackers via heartbeats to launch the job setup/cleanup tasks, we can execute the job setup/cleanup work immediately in the JobTracker. That is, when the JobTracker initializes a job, the setup work of the job is executed immediately in the JobTracker; after all map/reduce tasks of the job are completed, the cleanup work of the job is likewise executed immediately in the JobTracker.

Figure 3 shows the modified state transition of a job after this optimization (for simplicity, we omit the kill and failure conditions). After the optimization, the PREP.SETUP and CLEANUP states are incorporated into the PREP.INITIALIZING and RUNNING.SUC_WAIT states, respectively. We implement a new version of Hadoop MapReduce based on the standard Hadoop framework to realize this proposal:

1) Add the methods setupJob() and cleanupJob() to JobInProgress (the object representing a job); setupJob() implements what runJobSetupTask() in the Task class does, and similarly cleanupJob() implements what runJobCleanup() in the Task class does.

2) Call setupJob() from JobInProgress.initTask() and then alter the state of the job to RUNNING.

3) Call cleanupJob() from JobInProgress.completedTask() once all map/reduce tasks have been completed.

Figure 3. Job state transition after optimization

B. Change the Task Assignment from Pull to Push

In the standard Hadoop, the JobTracker never actively communicates with the TaskTrackers, even after it has received a new job and created many map/reduce tasks. It has to wait for the TaskTrackers to issue requests via heartbeats, which delays task scheduling and further increases the time cost of job execution. As shown in Figure 4, we change the task assignment mechanism from the original pull model to a push model: after initializing a job, the JobTracker actively sends messages to the TaskTrackers to start task assignments.

Figure 4. Push-model task assignment and instant message communication mechanisms after optimization
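To make the changes in Sections III.A and III.B more concrete, the following is a schematic Java sketch of a JobTracker-side job object that runs the setup/cleanup work inline and then pushes assignments to the TaskTrackers. It is a simplified illustration using our own stand-in types (OptimizedJobInProgress, TaskTrackerStub), not the authors' actual patch to JobInProgress and Task; the real Hadoop methods involved (initTask(), completedTask(), runJobSetupTask()) are only referenced in the comments.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Schematic sketch of the optimized flow: setup/cleanup run inline in the JobTracker
// (Section III.A) and task assignments are pushed to TaskTrackers (Section III.B).
// Stand-in types only; not Hadoop's real JobInProgress/JobTracker code.
public class OptimizedJobInProgress {

    interface TaskTrackerStub {
        void launchTask(String taskId); // push-model assignment (no waiting for a heartbeat)
    }

    private final Path tempOutputDir;
    private final List<TaskTrackerStub> trackers;

    OptimizedJobInProgress(Path tempOutputDir, List<TaskTrackerStub> trackers) {
        this.tempOutputDir = tempOutputDir;
        this.trackers = trackers;
    }

    // Analogous to calling setupJob() from initTask(): the temporary output directory
    // is created directly in the JobTracker, so no setup task and no extra heartbeats.
    void initTasksAndSetup(List<String> taskIds) throws IOException {
        Files.createDirectories(tempOutputDir); // the work the setup task used to do
        pushAssignments(taskIds);               // job goes straight to RUNNING
    }

    // Push model: the JobTracker actively hands tasks to TaskTrackers right after init.
    private void pushAssignments(List<String> taskIds) {
        for (int i = 0; i < taskIds.size(); i++) {
            trackers.get(i % trackers.size()).launchTask(taskIds.get(i));
        }
    }

    // Analogous to calling cleanupJob() from completedTask() once every task is done:
    // the temporary directory is removed in the JobTracker itself.
    void completedAllTasksAndCleanup() throws IOException {
        Files.deleteIfExists(tempOutputDir);    // the work the cleanup task used to do
    }
}
```

The cleanup here simply removes the temporary directory, mirroring the observation above that the real setup/cleanup tasks do very little work. The instant-message mechanism of Section III.C would, in the same spirit, let a TaskTracker report task completion over a direct call rather than waiting for its next heartbeat.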

C. Separate the Job/Task Control Messages from Heartbeats

The JobTracker and TaskTrackers communicate with each other through heartbeats. The content of a heartbeat includes information about the TaskTracker, task states, and so on. To improve communication performance, we separate the job/task control message communication from the heartbeats and provide an instant message communication mechanism, as shown in Figure 4. In this new mechanism, when important events such as task completion happen, the information is sent to the JobTracker immediately. For all job/task scheduling events we use instant message communication, while for cluster management events that are not performance-sensitive we still use the heartbeat communication mechanism. Figure 4 shows how the job/task scheduling messages are sent after the optimization. It is worth noting that, to achieve good scalability, the heartbeat interval can be increased when this optimization is used.

IV. PERFORMANCE EVALUATION

In this section, we conduct experiments to evaluate the performance of our optimized version of the Hadoop MapReduce framework compared to the standard Hadoop.

A. Environment Setup

We build an experimental Hadoop cluster with 19 nodes connected by Gigabit Ethernet. The hardware of the cluster is described in Table 1.

Table 1. Hardware information

We use the parallelized BLAST [16] that we developed as our test application. BLAST is a sequence alignment tool widely used by biology researchers. In our parallelized BLAST, we designed and implemented two types of BLAST algorithms: the map-side extension BLAST without any reduce processing, and the reduce-side extension BLAST with both map and reduce processing. The details of these two parallelized BLAST algorithms with MapReduce are given in [16]. We chose BLAST for our experiments because it represents a typical class of query applications. We downloaded the latest release of the FASTA-format sequence database nt (the nucleotide sequence database, with entries from all traditional divisions of GenBank, EMBL, and DDBJ, excluding bulk divisions) from NCBI as our test dataset. The original nt database was 32 GB in size, and around 16 GB after conversion to SequenceFile format.

B. Evaluation

During the execution of a job, we record the slot state of each TaskTracker every second. We run the BLAST job that processes the 16 GB dataset on the standard Hadoop 0.20 environment and on our optimized version of Hadoop MapReduce, respectively. The results are shown in Figure 5 and Figure 6.

Figure 5. Performance evaluation for the map-side extension BLAST: (a) standard Hadoop; (b) dismissal of the job setup/cleanup tasks; (c) push-model task assignment and instant message communication mechanisms

As shown in Figures 5(b) and 6(b), after applying the optimization that dismisses the setup/cleanup tasks, the setup and cleanup time costs are noticeably reduced. The total job execution time is shortened from about 47 seconds down to 34 seconds for the map-side extension BLAST, and from about 58 seconds down to 46 seconds for the reduce-side extension BLAST.

As shown in Figures 5(c) and 6(c), after further applying the optimization of the push-model task assignment and instant message communication mechanisms, the total job execution time is further shortened to 27 seconds for the map-side extension BLAST and to 40 seconds for the reduce-side extension BLAST. Comparing Figure 6(c) with Figure 6(b), it is obvious that, while the cluster is executing a job, the slots are used at a higher level of utilization in both the map phase and the reduce phase.

Figure 6. Performance evaluation for the reduce-side extension BLAST: (a) standard Hadoop; (b) dismissal of the job setup/cleanup tasks; (c) push-model task assignment and instant message communication mechanisms

In Figure 7, the horizontal axis represents queries with DNA sequences of different lengths, and the vertical axis the time cost. Compared to the standard Hadoop, our optimized Hadoop reduces the time cost by about 23% on average.

Figure 7. Time cost comparison of the standard and optimized Hadoop: (a) map-side extension BLAST; (b) reduce-side extension BLAST

V. CONCLUSIONS

Hadoop MapReduce has proven to be a successful model and framework for large-scale data processing. In this paper, we conducted research on performance optimization for Hadoop MapReduce job execution. The optimization is especially helpful for query or analysis-type short jobs. By optimizing the job initialization and termination stages, changing the task assignment from a heartbeat-based pull model to a push model, and providing an instant message communication mechanism instead of heartbeats, we achieved a 23% performance improvement on average for our test application. Our optimized version of Hadoop also preserves the full features of the standard Hadoop MapReduce framework without changing any of Hadoop's programming APIs. Further work needs to be done to perform stability tests of our optimized version of Hadoop and to run more tests with a variety of benchmark applications and datasets for further improvement.

REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, vol. 51, no. 1, 2008.
[2] MPI (Message Passing Interface). Available online.
[3] PVM (Parallel Virtual Machine). Available online.
[4] Apache Software Foundation, Hadoop. Available online.
[5] D. Gillick, A. Faria, and J. DeNero, "MapReduce: Distributed Computing for Machine Learning." Available online.
[6] Hadoop Fair Scheduler Guide. Available online.
[7] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica, "Improving MapReduce Performance in Heterogeneous Environments," 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 29-42, San Diego, California, December 2008.
[8] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling," Proceedings of the 5th European Conference on Computer Systems, 2010.
[9] S. Seo, I. Jang, K. Woo, I. Kim, J.-S. Kim, and S. Maeng, "HPMR: Prefetching and Pre-shuffling in Shared MapReduce Computation Environment," IEEE International Conference on Cluster Computing and Workshops, 2009.
[10] S. Zhang, J. Han, Z. Liu, K. Wang, and S. Feng, "Accelerating MapReduce with Distributed Memory Cache," IEEE International Conference on Parallel and Distributed Systems.
[11] K. Morton, A. Friesen, M. Balazinska, and D. Grossman, "Estimating the Progress of MapReduce Pipelines," IEEE 26th International Conference on Data Engineering, 2010.
[12] S. Babu, "Towards Automatic Optimization of MapReduce Programs," Proceedings of the 1st ACM Symposium on Cloud Computing, 2010.
[13] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, "Interpreting the Data: Parallel Analysis with Sawzall," Scientific Programming Journal, vol. 13, no. 4, Oct. 2005.
[14] Apache Software Foundation, Hive. Available online.
[15] I. S. Jacobs and C. P. Bean, "Fine Particles, Thin Films and Exchange Anisotropy," in Magnetism, vol. III, G. T. Rado and H. Suhl, Eds. New York: Academic, 1963.
[16] X.-L. Yang, Y.-L. Liu, C.-F. Yuan, and Y.-H. Huang, "Parallelization of BLAST with MapReduce for Long Sequence Alignment," International Symposium on Parallel Architectures, Algorithms and Programming (PAAP), 2011.
