Research on Load Balancing in Task Allocation Process in Heterogeneous Hadoop Cluster
International Conference on Artificial Intelligence and Engineering Applications (AIEA 2017) ISBN:

ZHIHAO TENG and ZHENGPING JIN

ABSTRACT

The original task scheduling algorithm of Hadoop cannot meet the performance requirements of heterogeneous clusters. The existing algorithm assumes that the performance of each node is consistent, so it performs well in homogeneous clusters. In a heterogeneous Hadoop cluster, however, differences in CPU, disk and memory performance between nodes lead to load imbalance. To address the unbalanced load of Hadoop clusters in heterogeneous environments, this paper proposes a new algorithm named Load Balancing Algorithm based on Heterogeneous Environment (LBAHE), which takes into account the performance differences of each node. When measuring the performance of nodes, the number of slots is no longer the only criterion; CPU, disk, memory and other factors are also considered. Experiments show that the new algorithm is efficient and completes tasks faster in heterogeneous clusters than the original algorithm.

KEYWORDS

load balancing, heterogeneous cluster, task scheduling, Hadoop

INTRODUCTION

Hadoop is one of the most important processing frameworks for big data. Load balancing has always been an important factor affecting a Hadoop cluster's performance. Increasing data volumes pose a challenge to the processing capabilities of Hadoop clusters [1]. An inappropriate load balancing strategy wastes computing resources, increases execution time, and can even lead to system downtime. An appropriate load balancing strategy, on the other hand, can satisfy the demands of users and make rational use of resources to complete tasks as soon as possible. The current load balancing strategy of Hadoop performs well in homogeneous clusters.
However, in heterogeneous environments, differences in data processing capacity, disk usage, and file read frequency cause load imbalance in Hadoop clusters. In a heterogeneous environment, the performance of the nodes differs. The current Hadoop load balancing strategy does not take these differences into account, which results in unreasonable allocation of tasks and, in turn, unreasonable use of resources.

Zhihao Teng, @163.com, Zhengping Jin, zhpjin@bupt.edu.cn, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, , China
Moreover, when allocating tasks in Hadoop's MapReduce stage, the current strategy only considers whether the node has enough memory. CPU, disk, and I/O capabilities are not used as reference factors for node load capacity, which can lead to unreasonable allocation of tasks. Therefore, the Hadoop load balancing strategy for heterogeneous environments needs further study [2].

In this paper, a novel Hadoop load balancing strategy is proposed for heterogeneous environments. The main research results are as follows:

1. We analyze the existing Hadoop load balancing strategy in depth and improve it for heterogeneous environments.
2. Through experimental evaluation, the same task is performed under the new and original load balancing strategies, and the task finishes in less time under the new strategy than under the original one.

The second section introduces related work on load balancing. The third section describes the existing problems of load balancing in heterogeneous environments and the new load balancing strategy proposed in this paper. In the fourth section, we evaluate the efficiency of the strategy through experiments. The fifth section is the summary.

RELATED WORK

Load balancing has always been an important factor affecting the performance of Hadoop clusters. In this section, we introduce the development of Hadoop load balancing strategies in heterogeneous environments. Matei Zaharia et al. [3] proposed the LATE scheduling algorithm to improve MapReduce performance in an unbalanced environment. Guo et al. [4] put forward a new resource scheduling algorithm in which tasks are assigned to those nodes whose resources are not fully utilized. Current studies of load balancing mainly include the following. Smriti R. Ramakrishnan et al. [5] studied the load balancing of Reduce. Venkata Swamy et al. [6] proposed the h-mapreduce model, which evaluates overloaded Reduce nodes.
Yuanquan Fan et al. [7] proposed the LBVP algorithm, a strategy that ensures each Reduce obtains roughly the same amount of data and thereby achieves load balancing. In [8], heterogeneous cluster load balancing and file response time are studied, but the heterogeneity of node capacity in heterogeneous clusters is neglected. In [9], a strategy of prorating data is proposed which takes into account the heterogeneity of nodes, but ignores the influence of heterogeneous node storage space on data storage. In [10], an efficient load balancing method for Hadoop data is proposed, which can balance the data load of each rack in a short time but does not consider the heterogeneity of nodes.

LOAD BALANCING STRATEGY

Hadoop consists of two core components: HDFS and MapReduce. MapReduce is a parallel computing framework that is mainly responsible for data processing. JobTracker is the master of MapReduce and is responsible for scheduling tasks. These tasks are distributed to different TaskTracker nodes. TaskTracker is the MapReduce slave, which is responsible only for executing the tasks assigned by the master. JobTracker assigns tasks to TaskTrackers according to a certain scheduling algorithm. The default scheduling algorithm is shown in Figure 1:
Figure 1. Task-obtaining process of the default Hadoop scheduling algorithm. [1]

1. The TaskTracker counts the number of currently executing tasks (RunningTasks).
2. The TaskTracker determines whether the number of currently executing tasks is less than the fixed number of task slots. The fixed number of slots limits the number of tasks running simultaneously on a TaskTracker and is denoted FixedTasksCapacity.
3. AskForNewTask is a flag that indicates whether to obtain a new task. If RunningTasks is less than FixedTasksCapacity, the TaskTracker can accept new tasks and sets the flag to true; otherwise it sets the flag to false.

We can see that when the JobTracker allocates tasks, it first collects node information through the heartbeat mechanism. The execution of a MapReduce job consists of three main phases: Map, Shuffle, and Reduce. In the Map phase, the Partition function is called to assign data to Reduce:

PartitionNum = Hash(key) % ReduceNum    (3-1)

PartitionNum is the partition number and ReduceNum is the number of Reduce tasks. As can be seen from the formula, the data will be distributed evenly across the Reduce tasks. In a homogeneous environment, it is reasonable to call the Partition function to assign the results of the Map calculations to Reduce equally.

In heterogeneous clusters, however, the network bandwidth, CPU frequency, memory size, and disk read and write speed of the nodes are inconsistent. The performance of the nodes therefore differs considerably and the load of each node changes dynamically, which may cause several problems.

For the same task, the load on different nodes is different. Assume that nodes A and B have X and 2X memory, respectively, and that each is 50% used. The usage rate is the same, but the remaining memory differs (0.5X versus X), so the number of new tasks each node can still accept differs. This results in an irrational allocation of tasks in heterogeneous clusters.
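The slot check of Figure 1, the partition rule of Eq. (3-1), and the memory example for nodes A and B can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and `zlib.crc32` stands in for the unspecified `Hash` function simply because it returns the same value on every run.

```python
import zlib


def ask_for_new_task(running_tasks: int, fixed_tasks_capacity: int) -> bool:
    """Default rule from Figure 1: ask for work only while a slot is free."""
    return running_tasks < fixed_tasks_capacity


def partition_num(key: str, reduce_num: int) -> int:
    """Eq. (3-1): PartitionNum = Hash(key) % ReduceNum.

    zlib.crc32 is an assumed stand-in for Hash, chosen because it is
    deterministic across runs.
    """
    if reduce_num <= 0:
        raise ValueError("reduce_num must be positive")
    return zlib.crc32(key.encode("utf-8")) % reduce_num


def remaining_memory(total: float, used_fraction: float) -> float:
    """Memory still available on a node, in the same unit as `total`."""
    return total * (1.0 - used_fraction)


if __name__ == "__main__":
    # The slot rule ignores every resource except the task count.
    print(ask_for_new_task(running_tasks=1, fixed_tasks_capacity=2))
    print(ask_for_new_task(running_tasks=2, fixed_tasks_capacity=2))

    # The partition rule depends only on the key, not on node capability.
    counts = [0] * 4
    for i in range(1000):
        counts[partition_num(f"key{i}", 4)] += 1
    print(counts)

    # Nodes A and B: same 50% usage, different remaining memory (X = 1.0).
    print(remaining_memory(1.0, 0.5), remaining_memory(2.0, 0.5))
```

Neither the slot check nor the partition rule takes a node's hardware into account, which is why two nodes with the same usage rate but different capacity are treated identically.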
If the TaskTracker is a high-performance computing node but cannot get more tasks, it is in a state of starvation. If the TaskTracker is a low-performance computing node and continues to get new tasks even though it cannot run more, it is in a state of "saturation", which results in overloading.

In addition, tasks differ in their resource requirements while they run. Some tasks are computation-heavy, i.e., CPU-intensive, while others require a large number of I/O operations. However, as shown above, the default Hadoop scheduling algorithm assigns tasks based only on the slot count of each TaskTracker node, which can cause unreasonable assignment. The following situation may also occur: a node has enough slots to accept a new task but does not have enough CPU resources. If the new task needs CPU, assigning it will aggravate CPU contention, resulting in node load imbalance and slower overall job execution.

A new algorithm is proposed to address the problems of default task allocation. The new algorithm takes into account the heterogeneity of the cluster, so that it can assign different numbers of tasks to different nodes. It also uses CPU, disk, memory and other resources as evaluation criteria for a node's load capacity, which is more comprehensive than the default method. The algorithm steps are as follows:

1. JobTracker accepts the job and starts the task.
2. TaskTracker detects the node's resource information (slot number, CPU, disk, memory, etc.) and the task execution status of each node.
3. The detected information is passed to the JobTracker via the heartbeat mechanism.
4. Taking heterogeneity into account, the JobTracker decides whether to continue assigning new tasks to the TaskTracker according to the information transmitted, the actual load of each node and its ability to run tasks.

When the new algorithm detects the resource information, it not only tests whether there are enough slots, but also checks node information more comprehensively: CPU, disk, memory and so on. All of this information is combined to judge the load capacity of the node. This avoids the situation in which the number of slots is sufficient but other resources are insufficient, which would cause load imbalance. When the node information is delivered to the JobTracker, task assignment is decided according to the information obtained and the actual situation of each node. This process takes into account the differences in performance and load capacity between nodes.
Therefore, nodes with different performance have different overload thresholds. The new algorithm judges each node against its own load thresholds. In homogeneous clusters, the hash function distributes work evenly across the Slave nodes. In a heterogeneous environment, however, we must first evaluate the performance of each node. For the CPU, because the number of cores differs from node to node, we first determine the number of cores, then measure the CPU load, and combine the core count with the per-core load to obtain the CPU usage of the whole node. Different thresholds are set for nodes with different core counts, and when the corresponding threshold is exceeded, the node is considered overloaded. Similarly, for the disk, a single uniform standard cannot be used to judge different nodes. A larger threshold is set for a node with a larger disk and a smaller threshold for a node with a smaller disk, so that nodes with different disks are judged by different criteria.

The Slave node periodically reports a heartbeat to the Master node through the RPC protocol; the heartbeat includes memory, CPU, disk, and other information. The Master node assigns tasks to Slave nodes according to the information of each node and a certain scheduling algorithm. After the Master receives the node information from a Slave, it checks the memory, CPU, disk and other information of the node. If no resource is overloaded, a new task is assigned to this node. If any of the reported resources is overloaded, i.e., its threshold is exceeded, no new task is assigned to this node for the time being, but the Master continues to monitor the node. At the next heartbeat, if the load of this node has returned to its normal level and it has enough resources to run a new task, it will get a new task.
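The per-heartbeat decision described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `Heartbeat` fields, the threshold rules (a load of 0.8 per core, 90% of disk capacity, 85% of memory) and all function names are assumptions chosen only to show how per-node thresholds change the outcome.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Heartbeat:
    """Resource snapshot sent by a TaskTracker (illustrative fields)."""
    free_slots: int
    cpu_cores: int
    load_avg: float        # number of runnable processes on the node
    mem_used_frac: float   # 0.0 .. 1.0
    disk_total_gb: float
    disk_used_gb: float


MEM_THRESHOLD = 0.85  # hypothetical memory threshold


def cpu_threshold(cores: int) -> float:
    # Hypothetical rule: more cores tolerate a proportionally higher load.
    return 0.8 * cores


def disk_threshold_gb(total_gb: float) -> float:
    # Hypothetical rule: a larger disk gets a larger absolute threshold.
    return 0.9 * total_gb


def overloaded_resources(hb: Heartbeat) -> List[str]:
    """Return the names of the resources that exceed this node's thresholds."""
    over = []
    if hb.free_slots <= 0:
        over.append("slots")
    if hb.load_avg > cpu_threshold(hb.cpu_cores):
        over.append("cpu")
    if hb.mem_used_frac > MEM_THRESHOLD:
        over.append("memory")
    if hb.disk_used_gb > disk_threshold_gb(hb.disk_total_gb):
        over.append("disk")
    return over


def should_assign_task(hb: Heartbeat) -> bool:
    """Assign a new task only if no resource is overloaded."""
    return not overloaded_resources(hb)


if __name__ == "__main__":
    # Same load on a 1-core and a 4-core node: only the small node is overloaded.
    small = Heartbeat(2, 1, 1.5, 0.4, 10.0, 3.0)
    large = Heartbeat(2, 4, 1.5, 0.4, 20.0, 3.0)
    print(overloaded_resources(small), should_assign_task(small))
    print(overloaded_resources(large), should_assign_task(large))
```

A node that is skipped at one heartbeat is simply evaluated again at the next one, which matches the "continue to monitor" behaviour in the text without any extra state.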
EXPERIMENTAL EVALUATION

To verify the performance of the new algorithm, running time in the heterogeneous Hadoop system is used as the criterion. The running time of a task includes the response time and the actual running time. Response time is the interval from the submission of a task to the start of its execution. This indicator reflects the system's ability to provide services. The shorter the running time, the stronger the system's ability to handle the task, the more reasonable its task dispatch, and the better its load condition.

Experimental environment: The experiments require a sufficient number of tasks and a heterogeneous cluster environment. Because the experimental environment requires an unbalanced cluster, a heterogeneous Hadoop cluster is built with virtual machines. Each VM is a Hadoop node, and the hardware of the nodes differs: three nodes have 2G memory and two nodes have 1G memory. One node is set as JobTracker, and the other four nodes are set as TaskTrackers. All hard disk space is 20 GB. The virtual machines run CentOS, JDK jdk-7u79-linux-x64, and Hadoop. The configuration of the cluster environment is shown in Table 1. Different shell scripts are run on the Slave nodes to consume their limited resources.

In the above experimental environment, different tasks are executed under the different algorithms, and each task is executed many times. The following experimental results are obtained: Figure 2 shows the running time of the two algorithms for different amounts of raw data. The line labeled "new" represents the running time of the improved algorithm, and the line labeled "original" represents the running time of the original algorithm. As can be seen from the diagram, the two algorithms are run on data of 100M, 200M, 500M, 1024M and 2048M, and the improved algorithm has the shorter running time.
This shows that, in the same heterogeneous environment, the new algorithm allocates tasks more reasonably, uses resources more rationally, and completes tasks more quickly. This demonstrates that the new algorithm is effective and has better load balancing capability.

Table 1. Hadoop cluster communication information.

Host name   Memory   Disk
Master      2G       20G
Slave1      2G       10G
Slave2      2G       20G
Slave3      1G       20G
Slave4      1G       10G

Figure 2. Experimental results. (Axes: execution time (s) versus size of task (MB).)
Figure 3. Experimental results of the original algorithm. (Axes: task ratio/performance ratio versus size of task (MB).)

Figure 4. Experimental results of the new algorithm (LBAHE). (Axes: task ratio/performance ratio versus size of task (MB).)

The task ratio is the share of the job's total tasks that a node actually runs. The performance ratio is the weight of the node's performance within the performance of the whole cluster, which reflects the share of tasks the node should be allocated. The quotient of the two is used to measure the load status of nodes in a cluster. In a reasonably loaded system, the task ratio divided by the performance ratio should be close to 1. A value near 1 means that a node with a certain capability runs a corresponding number of tasks.

Figure 3 shows the experimental results of the original algorithm for two selected nodes. As can be seen from the experimental data, with the original algorithm the values all lie above or below 1, which shows that the task assignment is unreasonable and the load is unbalanced. In the results of the new algorithm (LBAHE), the value fluctuates around 1, which shows that the task assignment is reasonable and the load is better balanced. The experimental results show that the new algorithm (LBAHE) is effective for heterogeneous cluster load balancing.

SUMMARY

In this paper, we address the Hadoop load balancing problem in heterogeneous environments. First, we analyze the causes of load imbalance in an unbalanced cluster: the performance of the nodes is inconsistent and the load criterion is too simple. To address these two problems, a new algorithm is proposed that adapts to Hadoop clusters in heterogeneous environments. The experiments show that the new algorithm performs well when the nodes are heterogeneous and under different load conditions.

ACKNOWLEDGEMENTS

This work is supported by NSFC (Grant No ), and the Fundamental Research Funds for the Central Universities (Grant No. 2015RC23).

REFERENCES

1. Apache. (2012, Aug.). Hadoop, The Apache Software Foundation, Forest Hill, MD, USA. [Online]. Available:
2. Xu X., Cao L., Wang X. Adaptive Task Scheduling Strategy Based on Dynamic Workload Adjustment for Heterogeneous Hadoop Clusters [J]. IEEE Systems Journal, 2016, 10(2):
3. Zaharia M., Konwinski A., Joseph A.D., et al. Improving MapReduce Performance in Heterogeneous Environments [C]//OSDI. 2008, 8(4):
4. Guo Z., Fox G. Improving MapReduce performance in heterogeneous network environments and resource utilization [C]//Proceedings of the IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012). IEEE Computer Society, 2012:
5. Ramakrishnan S.R., Swart G., Urmanov A. Balancing Reducer skew in MapReduce workloads using progressive sampling [C]//Proceedings of the Third ACM Symposium on Cloud Computing. ACM, 2012:
6. Martha V.S., Zhao W., Xu X. h-mapreduce: A Framework for Workload Balancing in MapReduce [C]//Advanced Information Networking and Applications (AINA), 2013 IEEE 27th International Conference on. IEEE, 2013:
7. Fan Y., Wu W., Cao H., et al. LBVP: A load balance algorithm based on Virtual Partition in Hadoop cluster [C]//Cloud Computing Congress (APCloudCC), 2012 IEEE Asia Pacific. IEEE, 2012:
8. Liu Kun, Niu Wenliang. An improved Hadoop data load balancing algorithm [J]. Journal of Henan Polytechnic University: Natural Science Edition, 2013, 32(3):
9. Xie J., Yin S., Ruan X., et al. Improving MapReduce performance through data placement in heterogeneous Hadoop clusters [C]//Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on. IEEE, 2010:
10. Liu Kun, Xiao Lin, Zhao Haiyan. Research and optimization of cloud data load balancing in Hadoop [J]. Microelectronics & Computer, 2012, 29(9):
More informationDynamic Load-balance Scheduling Approach in Linux Based on. Real-time Modifying LVS Weight. YihuaLIAO1, a,min LIN2
7th International Conference on Advanced Design and Manufacturing Engineering (ICADME 217) Dynamic Load-balance Scheduling Approach in Linux Based on Real-time Modifying LVS Weight YihuaLIAO1, a,min LIN2
More informationLecture 11 Hadoop & Spark
Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem
More informationDynamic Replication Management Scheme for Cloud Storage
Dynamic Replication Management Scheme for Cloud Storage May Phyo Thu, Khine Moe Nwe, Kyar Nyo Aye University of Computer Studies, Yangon mayphyothu.mpt1@gmail.com, khinemoenwe@ucsy.edu.mm, kyarnyoaye@gmail.com
More informationA Hybrid Architecture for Video Transmission
2017 Asia-Pacific Engineering and Technology Conference (APETC 2017) ISBN: 978-1-60595-443-1 A Hybrid Architecture for Video Transmission Qian Huang, Xiaoqi Wang, Xiaodan Du and Feng Ye ABSTRACT With the
More informationChisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique
Chisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique Prateek Dhawalia Sriram Kailasam D. Janakiram Distributed and Object Systems Lab Dept. of Comp.
More informationImprovement on PageRank Algorithm Based on User Influence
Improvement on Algorithm Based on User Influence Yang Wang Basic Medical College ShaanXi University of Chinese Medicine Xianyang, Shanxi, China Abstract With the rapid development of the Internet, web
More information2/26/2017. For instance, consider running Word Count across 20 splits
Based on the slides of prof. Pietro Michiardi Hadoop Internals https://github.com/michiard/disc-cloud-course/raw/master/hadoop/hadoop.pdf Job: execution of a MapReduce application across a data set Task:
More informationPreliminary Research on Distributed Cluster Monitoring of G/S Model
Available online at www.sciencedirect.com Physics Procedia 25 (2012 ) 860 867 2012 International Conference on Solid State Devices and Materials Science Preliminary Research on Distributed Cluster Monitoring
More informationStudy of Load Balancing Schemes over a Video on Demand System
Study of Load Balancing Schemes over a Video on Demand System Priyank Singhal Ashish Chhabria Nupur Bansal Nataasha Raul Research Scholar, Computer Department Abstract: Load balancing algorithms on Video
More informationA Method of Identifying the P2P File Sharing
IJCSNS International Journal of Computer Science and Network Security, VOL.10 No.11, November 2010 111 A Method of Identifying the P2P File Sharing Jian-Bo Chen Department of Information & Telecommunications
More informationPSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets
2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department
More informationThe Optimization and Improvement of MapReduce in Web Data Mining
Journal of Software Engineering and Applications, 2015, 8, 395-406 Published Online August 2015 in SciRes. http://www.scirp.org/journal/jsea http://dx.doi.org/10.4236/jsea.2015.88039 The Optimization and
More informationA Multicast Routing Algorithm for 3D Network-on-Chip in Chip Multi-Processors
Proceedings of the World Congress on Engineering 2018 ol I A Routing Algorithm for 3 Network-on-Chip in Chip Multi-Processors Rui Ben, Fen Ge, intian Tong, Ning Wu, ing hang, and Fang hou Abstract communication
More informationConfiguring a MapReduce Framework for Dynamic and Efficient Energy Adaptation
Configuring a MapReduce Framework for Dynamic and Efficient Energy Adaptation Jessica Hartog, Zacharia Fadika, Elif Dede, Madhusudhan Govindaraju Department of Computer Science, State University of New
More informationDelegated Access for Hadoop Clusters in the Cloud
Delegated Access for Hadoop Clusters in the Cloud David Nuñez, Isaac Agudo, and Javier Lopez Network, Information and Computer Security Laboratory (NICS Lab) Universidad de Málaga, Spain Email: dnunez@lcc.uma.es
More informationData Prefetching for Scientific Workflow Based on Hadoop
Data Prefetching for Scientific Workflow Based on Hadoop Gaozhao Chen, Shaochun Wu, Rongrong Gu, Yongquan Xu, Lingyu Xu, Yunwen Ge, and Cuicui Song * Abstract. Data-intensive scientific workflow based
More informationMitigating Data Skew Using Map Reduce Application
Ms. Archana P.M Mitigating Data Skew Using Map Reduce Application Mr. Malathesh S.H 4 th sem, M.Tech (C.S.E) Associate Professor C.S.E Dept. M.S.E.C, V.T.U Bangalore, India archanaanil062@gmail.com M.S.E.C,
More informationBig Data Processing: Improve Scheduling Environment in Hadoop Bhavik.B.Joshi
IJSRD - International Journal for Scientific Research & Development Vol. 4, Issue 06, 2016 ISSN (online): 2321-0613 Big Data Processing: Improve Scheduling Environment in Hadoop Bhavik.B.Joshi Abstract
More informationAnalyzing and Improving Load Balancing Algorithm of MooseFS
, pp. 169-176 http://dx.doi.org/10.14257/ijgdc.2014.7.4.16 Analyzing and Improving Load Balancing Algorithm of MooseFS Zhang Baojun 1, Pan Ruifang 1 and Ye Fujun 2 1. New Media Institute, Zhejiang University
More informationDesign of an Optimal Data Placement Strategy in Hadoop Environment
Design of an Optimal Data Placement Strategy in Hadoop Environment Shah Dhairya Vipulkumar 1, Saket Swarndeep 2 1 PG Scholar, Computer Engineering, L.J.I.E.T., Gujarat, India 2 Assistant Professor, Dept.
More information4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015)
4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015) Benchmark Testing for Transwarp Inceptor A big data analysis system based on in-memory computing Mingang Chen1,2,a,
More informationAn Optimization Algorithm of Selecting Initial Clustering Center in K means
2nd International Conference on Machinery, Electronics and Control Simulation (MECS 2017) An Optimization Algorithm of Selecting Initial Clustering Center in K means Tianhan Gao1, a, Xue Kong2, b,* 1 School
More informationResearch and Realization of AP Clustering Algorithm Based on Cloud Computing Yue Qiang1, a *, Hu Zhongyu2, b, Lei Xinhua1, c, Li Xiaoming3, d
4th International Conference on Machinery, Materials and Computing Technology (ICMMCT 2016) Research and Realization of AP Clustering Algorithm Based on Cloud Computing Yue Qiang1, a *, Hu Zhongyu2, b,
More informationClash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics
Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Presented by: Dishant Mittal Authors: Juwei Shi, Yunjie Qiu, Umar Firooq Minhas, Lemei Jiao, Chen Wang, Berthold Reinwald and Fatma
More informationQADR with Energy Consumption for DIA in Cloud
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,
More informationIndexing Strategies of MapReduce for Information Retrieval in Big Data
International Journal of Advances in Computer Science and Technology (IJACST), Vol.5, No.3, Pages : 01-06 (2016) Indexing Strategies of MapReduce for Information Retrieval in Big Data Mazen Farid, Rohaya
More informationLocality Aware Fair Scheduling for Hammr
Locality Aware Fair Scheduling for Hammr Li Jin January 12, 2012 Abstract Hammr is a distributed execution engine for data parallel applications modeled after Dryad. In this report, we present a locality
More informationAn Improved KNN Classification Algorithm based on Sampling
International Conference on Advances in Materials, Machinery, Electrical Engineering (AMMEE 017) An Improved KNN Classification Algorithm based on Sampling Zhiwei Cheng1, a, Caisen Chen1, b, Xuehuan Qiu1,
More informationIntroduction to MapReduce
Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed
More informationGlobal Journal of Engineering Science and Research Management
A FUNDAMENTAL CONCEPT OF MAPREDUCE WITH MASSIVE FILES DATASET IN BIG DATA USING HADOOP PSEUDO-DISTRIBUTION MODE K. Srikanth*, P. Venkateswarlu, Ashok Suragala * Department of Information Technology, JNTUK-UCEV
More informationSQL Query Optimization on Cross Nodes for Distributed System
2016 International Conference on Power, Energy Engineering and Management (PEEM 2016) ISBN: 978-1-60595-324-3 SQL Query Optimization on Cross Nodes for Distributed System Feng ZHAO 1, Qiao SUN 1, Yan-bin
More informationCorrelation based File Prefetching Approach for Hadoop
IEEE 2nd International Conference on Cloud Computing Technology and Science Correlation based File Prefetching Approach for Hadoop Bo Dong 1, Xiao Zhong 2, Qinghua Zheng 1, Lirong Jian 2, Jian Liu 1, Jie
More informationDISTRIBUTED VIRTUAL CLUSTER MANAGEMENT SYSTEM
DISTRIBUTED VIRTUAL CLUSTER MANAGEMENT SYSTEM V.V. Korkhov 1,a, S.S. Kobyshev 1, A.B. Degtyarev 1, A. Cubahiro 2, L. Gaspary 3, X. Wang 4, Z. Wu 4 1 Saint Petersburg State University, 7/9 Universitetskaya
More informationTowards Makespan Minimization Task Allocation in Data Centers
Towards Makespan Minimization Task Allocation in Data Centers Kangkang Li, Ziqi Wan, Jie Wu, and Adam Blaisse Department of Computer and Information Sciences Temple University Philadelphia, Pennsylvania,
More informationOracle Big Data. A NA LYT ICS A ND MA NAG E MENT.
Oracle Big Data. A NALYTICS A ND MANAG E MENT. Oracle Big Data: Redundância. Compatível com ecossistema Hadoop, HIVE, HBASE, SPARK. Integração com Cloudera Manager. Possibilidade de Utilização da Linguagem
More informationEnergy efficient optimization method for green data center based on cloud computing
4th ational Conference on Electrical, Electronics and Computer Engineering (CEECE 2015) Energy efficient optimization method for green data center based on cloud computing Runze WU1, a, Wenwei CHE1, b,
More informationAn Improved Weighted Least Connection Scheduling Algorithm for Load Balancing in Web Cluster Systems
An Improved Weighted Least Connection Scheduling Algorithm for Load Balancing in Web Cluster Systems Gurasis Singh 1, Kamalpreet Kaur 2 1Assistant professor,department of Computer Science, Guru Nanak Dev
More informationPaaS and Hadoop. Dr. Laiping Zhao ( 赵来平 ) School of Computer Software, Tianjin University
PaaS and Hadoop Dr. Laiping Zhao ( 赵来平 ) School of Computer Software, Tianjin University laiping@tju.edu.cn 1 Outline PaaS Hadoop: HDFS and Mapreduce YARN Single-Processor Scheduling Hadoop Scheduling
More informationK-MEANS METHOD FOR GROUPING IN HYBRID MAPREDUCE CLUSTERS
K-MEANS METHOD FOR GROUPING IN HYBRID MAPREDUCE CLUSTERS 1 YANG YANG, XIANG LONG, BO JIANG, YU LIU, 1 School of Computer Science and Engineering, Beihang University, Haidian 100191, Being, China :Corresponding
More informationPerformance Optimization for Short MapReduce Job Execution in Hadoop
2012 Second International Conference on Cloud and Green Computing Performance Optimization for Short MapReduce Job Execution in Hadoop Jinshuang Yan, Xiaoliang Yang, Rong Gu, Chunfeng Yuan, and Yihua Huang
More informationConstruction Scheme for Cloud Platform of NSFC Information System
, pp.200-204 http://dx.doi.org/10.14257/astl.2016.138.40 Construction Scheme for Cloud Platform of NSFC Information System Jianjun Li 1, Jin Wang 1, Yuhui Zheng 2 1 Information Center, National Natural
More informationA Simple Model for Estimating Power Consumption of a Multicore Server System
, pp.153-160 http://dx.doi.org/10.14257/ijmue.2014.9.2.15 A Simple Model for Estimating Power Consumption of a Multicore Server System Minjoong Kim, Yoondeok Ju, Jinseok Chae and Moonju Park School of
More informationMixing and matching virtual and physical HPC clusters. Paolo Anedda
Mixing and matching virtual and physical HPC clusters Paolo Anedda paolo.anedda@crs4.it HPC 2010 - Cetraro 22/06/2010 1 Outline Introduction Scalability Issues System architecture Conclusions & Future
More informationLITERATURE SURVEY (BIG DATA ANALYTICS)!
LITERATURE SURVEY (BIG DATA ANALYTICS) Applications frequently require more resources than are available on an inexpensive machine. Many organizations find themselves with business processes that no longer
More informationAutomated Control for Elastic Storage Harold Lim, Shivnath Babu, Jeff Chase Duke University
D u k e S y s t e m s Automated Control for Elastic Storage Harold Lim, Shivnath Babu, Jeff Chase Duke University Motivation We address challenges for controlling elastic applications, specifically storage.
More informationJumbo: Beyond MapReduce for Workload Balancing
Jumbo: Beyond Reduce for Workload Balancing Sven Groot Supervised by Masaru Kitsuregawa Institute of Industrial Science, The University of Tokyo 4-6-1 Komaba Meguro-ku, Tokyo 153-8505, Japan sgroot@tkl.iis.u-tokyo.ac.jp
More informationADVANCES in NATURAL and APPLIED SCIENCES
ADVANCES in NATURAL and APPLIED SCIENCES ISSN: 1995-0772 Published BY AENSI Publication EISSN: 1998-1090 http://www.aensiweb.com/anas 2016 May 10(5): pages 166-171 Open Access Journal A Cluster Based Self
More informationCloud Computing. Hwajung Lee. Key Reference: Prof. Jong-Moon Chung s Lecture Notes at Yonsei University
Cloud Computing Hwajung Lee Key Reference: Prof. Jong-Moon Chung s Lecture Notes at Yonsei University Cloud Computing Cloud Introduction Cloud Service Model Big Data Hadoop MapReduce HDFS (Hadoop Distributed
More informationParallelization of K-Means Clustering Algorithm for Data Mining
Parallelization of K-Means Clustering Algorithm for Data Mining Hao JIANG a, Liyan YU b College of Computer Science and Engineering, Southeast University, Nanjing, China a hjiang@seu.edu.cn, b yly.sunshine@qq.com
More information