HADOOP BLOCK PLACEMENT POLICY FOR DIFFERENT FILE FORMATS
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 5, Issue 12, December (2014), IAEME.

Suhas V. Ambade, Pune University, MIT College of Engineering, Kothrud, Pune, India
Prof. Priya Deshpande, Pune University, MIT College of Engineering, Kothrud, Pune, India

ABSTRACT

Nowadays, petabytes of data are the norm in industry, and handling and analyzing such big data is a challenging task. Although frameworks like Hadoop (an open-source implementation of the MapReduce paradigm) and NoSQL databases like Cassandra and HBase can store and analyze such large data, heterogeneity of data remains an issue. Data centers usually have clusters formed from heterogeneous nodes. An ecosystem like Hadoop can manage such clusters, but it cannot schedule jobs (applications) efficiently on a heterogeneous cluster when the data itself is heterogeneous, whether because of its format or because of its complexity. This paper reviews systems and algorithms for distributed management of very large data sets, and the efficiency of these approaches.

Keywords: Hadoop, Block Placement Policy, Heterogeneous Cluster, HDFS.

1. INTRODUCTION

Big data refers to both unstructured and structured data that is very large in volume, velocity and variety. The challenges include capture, storage, search, sharing, transfer, analysis, visualization and violation of privacy [2]. According to IBM [16], 80% of the data present today is unstructured; it comes from sensors gathering information about climate, posts on social networking sites, purchase transaction records, GPS signals from cell phones, digital images and videos, and many other sources. All of this generated unstructured data from various sources is big data. Hadoop [1] is a platform for structuring big data and making it useful for further analytics. Hadoop was designed and developed by the Apache Software Foundation for distributed processing of large data, i.e., big data, on commodity hardware. Because Hadoop is implemented in Java, any Java-compatible machine supports the Hadoop architecture. Hadoop consists of four modules [1], as listed below:

1. Hadoop Common: utilities supporting the other Hadoop modules.
2. Hadoop Distributed File System (HDFS): stores and serves large data sets across the cluster.
3. Hadoop YARN: a framework for job scheduling and cluster resource management.
4. Hadoop MapReduce: processes large data sets.

All of these modules are designed with hardware failure in mind. Other assumptions and goals of HDFS are large data processing, streaming data access, portability across heterogeneous hardware and software platforms, and a simple coherency model. HDFS uses a random block placement strategy in which replicas are not distributed fairly and equally across the cluster. HDFS does not take each DataNode's disk-space utilization into consideration, which can unbalance the cluster and degrade MapReduce performance. When a new rack containing multiple DataNodes is added to an existing Hadoop cluster, the cluster becomes unbalanced. Suppose there are two existing racks, A and B, holding the file suhas.pdf, with MapReduce jobs running on it. When one more rack is added to the cluster, the data of suhas.pdf is not automatically redistributed to the new rack; all the data stays where it was placed before, and the DataNodes in the newly added rack sit idle, holding no data until new data is loaded into the cluster.
If the servers in racks A and B are busy with other tasks, the JobTracker may have no choice but to assign map tasks on suhas.pdf to the new DataNodes, which hold no local data. The cluster is unbalanced, which increases network traffic and job completion time. To solve this unbalancing problem, a utility program named balancer must be run explicitly to reorganize the data among the nodes; the balancer moves data to the nodes where it finds more free storage space, which costs I/O bandwidth. Recently, many authors have presented strategies for managing replicas and placing data blocks in both homogeneous and heterogeneous clusters. Some of these strategies are explained in the literature review of this paper.

1.1 HDFS Architecture

HDFS [3] is the file-system-management component of Hadoop and supports hierarchical file organization. It follows a write-once, read-many-times policy. It has a master-slave architecture: the NameNode is the master and the DataNodes are slaves. The architecture also contains a Secondary NameNode, which acts as a backup for the NameNode.

NameNode: the system that manages the file system namespace (i.e., the metadata) and controls client access to the data stored in files on the DataNodes. Changes to the file system namespace and its properties are recorded by the NameNode.

DataNode: the system that actually stores the data; files are internally split into equal-size blocks that are placed in the file system.
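This master-slave split can be made concrete with a minimal sketch (class and method names here are illustrative, not Hadoop's actual code): the NameNode holds only metadata, mapping a file path to its block ids and each block id to the DataNodes holding a replica, while the DataNodes hold the bytes themselves.

```python
# Minimal sketch of the NameNode's metadata role (illustrative, not HDFS source).
class NameNode:
    def __init__(self):
        self.namespace = {}        # file path -> list of block ids
        self.block_locations = {}  # block id -> list of DataNode ids

    def add_file(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def register_replica(self, block_id, datanode_id):
        # A DataNode reports that it now stores a copy of this block.
        self.block_locations.setdefault(block_id, []).append(datanode_id)

    def locate(self, path):
        # Answer a client's read request: block ids with their replica sites.
        return [(b, self.block_locations.get(b, [])) for b in self.namespace[path]]

nn = NameNode()
nn.add_file("/user/suhas.pdf", ["blk_1", "blk_2"])
nn.register_replica("blk_1", "dn-A")
nn.register_replica("blk_1", "dn-B")
print(nn.locate("/user/suhas.pdf"))
```

A client thus never streams data through the NameNode; it only asks it where the blocks are and then reads from the DataNodes directly.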
Fig: HDFS Architecture

Data is generally divided into blocks of equal size (64 MB, except possibly the last block of a file), and these blocks are placed on DataNodes using the default block placement policy (random block placement). The replication factor is 3 by default, meaning three copies of each block are kept in HDFS, which improves availability. DataNodes send heartbeats to the NameNode at regular intervals so that it can recognize which nodes are alive and which are dead. By default the interval is 3 seconds; if the NameNode gets no response from a DataNode within 10 seconds, that DataNode is treated as a failed node. A heartbeat carries the node id, the total storage space, the amount of storage space in use, and the number of data transfers currently in progress.

1.2 Block Placement Policy

In HDFS, a file is divided into chunks whose size is governed by the parameter dfs.block.size in the configuration file hdfs-site.xml; each block (64 MB by default) is placed on a different DataNode. The number of copies stored is governed by the parameter dfs.replication in the same file, which helps achieve fault tolerance; each stored copy of a block is called a replica. HDFS uses a rack-aware data placement strategy: if a block is placed in one rack, a copy of it is placed in another rack, so fault tolerance is preserved on node failure or switch failure. The default block placement policy in HDFS is as follows [4]:

1. Place the first replica on the local node if the HDFS client is running inside the cluster, otherwise on a random node.
2. Place the second replica on a rack other than the one holding the first replica.
3. Place the third replica in the rack where the second replica is placed.
4. If replicas remain, distribute them randomly across the racks in the network, with the restriction that no rack holds more than two replicas of the block.
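The four rules above can be simulated in a few lines (a sketch, not the HDFS source; the rack layout and node names are made up for illustration):

```python
import random

# Simulation of the default rack-aware placement rules listed above:
# replica 1 on the writer's node, replica 2 on a different rack,
# replica 3 on the same rack as replica 2, extras random with at most
# two replicas per rack.
def place_replicas(writer, racks, replication=3):
    """racks: dict rack_id -> list of node names; writer must be a known node."""
    rack_of = {n: r for r, nodes in racks.items() for n in nodes}
    placed = [writer]                                    # rule 1: local node
    remote_racks = [r for r in racks if r != rack_of[writer]]
    second = random.choice(racks[random.choice(remote_racks)])
    placed.append(second)                                # rule 2: other rack
    third_candidates = [n for n in racks[rack_of[second]] if n not in placed]
    if replication >= 3 and third_candidates:
        placed.append(random.choice(third_candidates))   # rule 3: same rack as 2
    while len(placed) < replication:                     # rule 4: <=2 per rack
        candidates = [n for nodes in racks.values() for n in nodes
                      if n not in placed
                      and sum(rack_of[p] == rack_of[n] for p in placed) < 2]
        placed.append(random.choice(candidates))
    return placed

racks = {"rackA": ["a1", "a2", "a3"], "rackB": ["b1", "b2", "b3"]}
print(place_replicas("a1", racks))
```

Note how the policy trades balance for write cost: two of the three replicas share a rack, so only one copy crosses the inter-rack switch during the write, which is exactly why disk-space utilization is not considered.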
2. LITERATURE REVIEW

JiongXie et al. [5] explain the need for Hadoop in data-intensive large-scale systems with heterogeneous clusters, such as data mining and web indexing. Data locality plays a key role in improving MapReduce performance. In a heterogeneous cluster, high-computing-power nodes finish their local data faster than low-computing-power nodes, so data must be moved from the low-power nodes to the high-power nodes, which reduces performance and also creates a load-balancing issue. To overcome these problems, the authors propose a data reorganization algorithm to support data placement in the cluster. Heterogeneity is measured as a metric of the ratio of computing power across nodes, called the computing ratio. Fragments of a file are distributed according to the computing ratio so that all nodes can finish processing their local data at the same time. The advantages of this work are better utilization of each node's computing power and I/O performance, and data rebalancing. The limitation is that the authors do not take data replication into consideration, because of the higher disk-space utilization it implies, which can create fault-tolerance issues.

Xianglong Ye et al. [6] propose a novel block placement strategy based on the space remaining on each DataNode, mainly taking load balancing into consideration. HDFS considers network bandwidth and fault tolerance but has shortcomings: it does not take disk-space utilization into account while placing blocks, and it does not consider the real-time situation of each node, so the balancer tool is needed to achieve load balancing. Considering these two shortcomings, the authors propose a new block placement policy that takes disk-space utilization into account.
Load balancing is achieved by giving priority to the node with the lowest utilization, allowing no more than a single replica per DataNode and no more than two replicas per rack, and preferring the local rack. The advantage of this work is proper load balancing according to the real-time situation of the DataNodes; disk-space utilization is known before a block is placed, so no balancer is needed. The limitation is the control overhead when a large number of DataNodes are present in the Hadoop cluster.

Jun Wang et al. [7] propose DRAW (Data-gRouping-AWare data placement), a scheme designed to work at rack level. DRAW considers the locality of interest of frequently accessed data blocks: if two blocks are accessed one after another, they are considered related and are grouped together. The random block placement strategy does not take data grouping into consideration; under it, related blocks are unlikely to land on the same DataNode, so a MapReduce task is spread over multiple DataNodes. DRAW has three parts: (a) a data-access history graph, created from the system log maintained on the NameNode; (b) a data grouping matrix, which expresses the relation between two blocks of data; and (c) an optimal data placement algorithm, based on a submatrix of the grouping matrix, which indicates the dependency between data already placed and data being placed. The advantage is that maximum parallelism is achieved, which enhances load balancing. The limitations are that blocks accessed consecutively are not necessarily related, and that the log file on the NameNode is huge, so reading it and extracting patterns from it is costly.

Krish K.R. et al. [8] propose hatS, a heterogeneity-aware tiered storage, which takes into consideration the storage devices attached to each DataNode.
Each DataNode may have several types of storage media attached, such as solid-state devices or network-attached storage, so there is heterogeneity at the node level. Storage devices of the same type are logically placed in a single tier, so a node with different types of storage devices is part of multiple tiers. Based on these tiers and this heterogeneity, new block placement policies are designed: (a) a network-aware policy, in which blocks are randomly distributed; (b) a tier-aware policy, which considers the characteristics of the storage devices and replicates blocks across multiple tiers (to prevent data loss on node failure, a node stores only a single copy even if it spans multiple tiers); and (c) a hybrid policy, a combination of (a) and (b).

Fig: hatS architecture

The advantages of this work are higher I/O bandwidth and lower job completion time. The limitation is that each node is part of multiple tiers, so the metadata is hard to maintain.

Madhu Kumar et al. [12] propose a bandwidth-aware data placement scheme for Hadoop, observing that data retrieval can be affected by many parameters; their scheme focuses on bandwidth as the parameter, storing data on the DataNodes with the highest bandwidth so that retrieval time is low. Blocks are placed in the cluster in a bandwidth-aware fashion. In their implementation the authors use Iperf (an open-source package) to measure bandwidth. The bandwidth between each DataNode and the client is measured periodically, and data is placed accordingly.

Madhu Kumar et al. [13] propose "A Dynamic Data Placement Scheme for Hadoop Using Real-time Access Patterns", which focuses on the real-time access patterns of the data fetched by users. Data is placed near its users so that access time is reduced and bandwidth is utilized properly. To reflect the real situation, optimal nodes are chosen based on who accesses the DataNode frequently and on the most relevant location; distance is measured using ping round-trip time (RTT). The advantage of this work is that DataNodes are chosen based on the real-time situation.

CoHadoop [14] is a lightweight extension of Hadoop that selects a DataNode randomly for every new key.
Co-location is achieved by adding a property called the locator: a locator table is maintained at the master node, and the data placement policy is modified to consult the locator while placing blocks. Many files can be assigned to the same locator; a grouping key (an attribute common to a set of blocks) is used to identify related logs, and co-location is performed for all files corresponding to the same key. Files without a locator are placed with the default block placement strategy. Log processing is also improved in CoHadoop.
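The locator mechanism described above can be sketched as follows (a simplified illustration, not CoHadoop's actual implementation; the round-robin fallback stands in for the default policy):

```python
import itertools

# Sketch of CoHadoop's locator table: files sharing a locator key are
# co-located on the same DataNode; files without a locator fall back to
# a default choice (here: round-robin).
class LocatorTable:
    def __init__(self, datanodes):
        self.datanodes = itertools.cycle(datanodes)
        self.table = {}  # locator key -> DataNode chosen for that key

    def place(self, filename, locator=None):
        if locator is None:               # no locator: default policy
            return next(self.datanodes)
        if locator not in self.table:     # first file seen with this key
            self.table[locator] = next(self.datanodes)
        return self.table[locator]        # co-locate with earlier files

lt = LocatorTable(["dn1", "dn2", "dn3"])
a = lt.place("log_part1", locator="jan-logs")
b = lt.place("log_part2", locator="jan-logs")  # lands with log_part1
c = lt.place("misc.txt")                       # default placement
print(a, b, c)
```

Because every file with the same key is pinned to the node chosen for the first such file, joins and groupings over related logs become node-local; the imbalance the next paragraph mentions follows directly from this pinning.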
The advantages of CoHadoop are improved efficiency of many operations such as indexing, columnar storage, joins, grouping, sessionization and aggregation [14]. The disadvantage is that it leads to an imbalanced cluster, so load balancing is not proper. To improve this load balancing, another approach, CoHadoop++, was developed by Nishanth S. et al. [11]. Load balancing is achieved by selecting DataNodes based on their current load, i.e., the node selection policy is improved. Block placement is based on two key factors: (1) the remaining storage capacity of the node, and (2) the number of grouping keys to which the DataNode is currently assigned [11]. The limitation is that the nature of the data is not identified for co-location.

3. CHALLENGING ISSUES

Hadoop is an open-source implementation of the MapReduce paradigm. In Hadoop, HDFS (the Hadoop Distributed File System) is used for storing data in a cluster of DataNodes. By default Hadoop divides files into blocks and then distributes these blocks to various nodes. For fault tolerance, it replicates each block to two other nodes by default, so that three copies in total are stored; if one node fails, the same copy is available on another node. The cluster administrator can configure this factor in hdfs-site.xml. But Hadoop cannot handle applications with different file formats. One solution is to convert all data into text format and then upload it to HDFS for processing, but the conversion is an overhead. Also, Hadoop by default does not consider the heterogeneity of cluster nodes when processing data, which leads to performance degradation. Heterogeneity of nodes can be beneficial for applications with heterogeneous data, i.e., data in different file formats, but Hadoop has no support for such applications.

4. PROPOSED SYSTEM

Currently Hadoop has no support for working directly on various file formats. It does not support file formats like PDF because of the complexity of handling them.
Even if we add support for PDF processing as a plugin to Hadoop, the issue remains for applications whose input mixes different file types. Our work is to add support for more complex file types like PDF, and to modify both the Hadoop MapReduce runtime and the Hadoop file system so that users can submit applications with more complex file types. Such heterogeneous data can create a load-balancing issue when the Hadoop cluster contains nodes of different compute power. The Hadoop runtime will be modified so that the master node knows each node's compute power in advance, allowing the complete node set to be divided into subsets. The Hadoop file system must be modified so that, when a user copies such data into the cluster, the more complex data is copied to the partition of the node set with the higher compute power. Here, the compute power of each slave node will be calculated at the node itself, considering processor, memory and hard disk, and then sent to the master node as part of the heartbeat. To allow the processing of more complex file types, a customized InputFormat will be written that actually parses the data blocks, and customized RecordReader and InputSplit implementations will be required to produce the key-value pairs given as input to the mapper.

5. PROPOSED ARCHITECTURE

The following figure shows the proposed architecture we are going to implement, in which data is divided among the cluster according to complexity and file format. Files of different formats are divided into blocks, and these blocks are placed on different nodes in the cluster according to their file formats.
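The per-node compute-power score and node-set partitioning described above can be sketched as follows. This is a hedged illustration of the idea only: the weighting of processor, memory and disk is an assumption of ours, not specified in the paper, and the node names are hypothetical.

```python
# Sketch: each slave computes a scalar "compute power" from its resources
# (the weights below are illustrative assumptions) and reports it with its
# heartbeat; the master splits the node set so that more complex file
# formats go to the stronger partition.
def compute_power(cores, ram_gb, disk_mb_s, w=(0.5, 0.3, 0.2)):
    return w[0] * cores + w[1] * ram_gb + w[2] * disk_mb_s / 100

def partition_nodes(scores):
    """scores: dict node -> power. Split at the median into high/low groups."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    half = len(ranked) // 2
    return ranked[:half], ranked[half:]   # (high-power set, low-power set)

reports = {"slave1": (16, 64, 500), "slave2": (4, 8, 120),
           "slave3": (8, 32, 250), "slave4": (2, 4, 100)}
scores = {n: compute_power(c, r, d) for n, (c, r, d) in reports.items()}
high, low = partition_nodes(scores)
print(high, low)   # complex formats (e.g. PDF) would be routed to `high`
```

In the proposed system this score would travel inside the existing heartbeat message, so the master needs no extra protocol to learn the cluster's heterogeneity.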
Fig: proposed architecture

GOALS

1. Enhance the current system for applications with different file formats.
2. Leverage the heterogeneity of nodes for managing the total load.
3. Properly utilize the compute power of the various DataNodes.

6. METHODOLOGY AND TECHNIQUES TO BE USED

Nodes can be partitioned into sub-clusters based on compute power. Hadoop currently works on the principle of a replication factor, copying blocks to more than one machine according to that factor. This default policy can be changed so that data with more complex types is placed on the group of high-compute-power nodes. It also requires implementing the InputFormat interface for handling heterogeneous data.

7. CONCLUSION

This paper discussed current solutions to the challenges of handling big data using an ecosystem like Hadoop, its associated distributed file system, and other distributed data management frameworks, covering issues related to data complexity, heterogeneous clusters and load balancing. Since Hadoop has no direct support for file-format categorization, nor for PDF, we have proposed a system that can handle PDF files directly, categorize files by format, and make better use of the computing power of each node. Results for this system will be evaluated with the CloudSim simulator and presented in a further paper.
REFERENCES

Dhruba Borthakur, "The Hadoop Distributed File System: Architecture and Design", retrieved from hadoop.apache.org/docs/r1.2.1/hdfs_design.html.
5. JiongXie, Shu Yin, Xiaojun Ruan, Zhiyang Ding, Yun Tian, J. Majors, A. Manzanares and Xiao Qin, "Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters", 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), pp. 1-9.
6. Xianglong Ye, Mengxing Huang, Donghai Zhu and Peng Xu, "A Novel Blocks Placement Strategy for Hadoop", 2012 IEEE/ACIS 11th International Conference on Computer and Information Science (ICIS), pp. 3-7.
7. J. Wang, Q. Xiao, J. Yin and P. Shang, "DRAW: A New Data-gRouping-AWare Data Placement Scheme for Data Intensive Applications With Interest Locality", IEEE Transactions on Magnetics, vol. 49, no. 6, pp. 2514-2520, June 2013.
8. K.R. Krish, A. Anwar and A.R. Butt, "hatS: A Heterogeneity-Aware Tiered Storage for Hadoop", IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 502-511.
Shanjiang Tang, Bu-Sung Lee and Bingsheng He, "DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters", IEEE Transactions on Cloud Computing, vol. 2, no. 3, pp. 333-347.
11. S. Nishanth, B. Radhikaa, T.J. Ragavendar, C. Babu and B. Prabavathy, "CoHadoop++: A Load Balanced Data Co-location in Hadoop Distributed File System", 2013 Fifth International Conference on Advanced Computing (ICoAC), pp. 100-105.
12. T.P. Shabeera and S.D. Madhu Kumar, "Bandwidth-Aware Data Placement Scheme for Hadoop", 2013 IEEE Recent Advances in Intelligent Computational Systems (RAICS), pp. 64-67.
13. V.P. Poonthottam and S.D. Madhu Kumar, "A Dynamic Data Placement Scheme for Hadoop Using Real-time Access Patterns", 2013 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 225-229.
14. M.Y. Eltabakh, Y. Tian, F. Ozcan, R. Gemulla, A. Krettek and J. McPherson, "CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop", in Proceedings of the 37th International Conference on Very Large Data Bases, 2011, Seattle, Washington.
15. J. Dittrich et al., "Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing)", in VLDB, volume 3.
Gandhali Upadhye and Asst. Prof. Trupti Dange, "Nephele: Efficient Data Processing Using Hadoop", International Journal of Computer Engineering & Technology (IJCET), Volume 5, Issue 7, 2014.
Kuldeep Deshpande and Dr. Bhimappa Desai, "Limitations of Datawarehouse Platforms and Assessment of Hadoop as an Alternative", International Journal of Computer Engineering & Technology (IJCET), Volume 5, Issue 2, 2014.
More informationCS370 Operating Systems
CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2017 Lecture 26 File Systems Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 FAQ Cylinders: all the platters?
More informationIntroduction to MapReduce
Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed
More informationClustering Lecture 8: MapReduce
Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data
More informationA Study of Comparatively Analysis for HDFS and Google File System towards to Handle Big Data
A Study of Comparatively Analysis for HDFS and Google File System towards to Handle Big Data Rajesh R Savaliya 1, Dr. Akash Saxena 2 1Research Scholor, Rai University, Vill. Saroda, Tal. Dholka Dist. Ahmedabad,
More informationMAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti
International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16 MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti 1 Department
More informationKonstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia,
Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com Presenter: Alex Hu } Introduction } Architecture } File
More informationData Analysis Using MapReduce in Hadoop Environment
Data Analysis Using MapReduce in Hadoop Environment Muhammad Khairul Rijal Muhammad*, Saiful Adli Ismail, Mohd Nazri Kama, Othman Mohd Yusop, Azri Azmi Advanced Informatics School (UTM AIS), Universiti
More informationBIG DATA & HADOOP: A Survey
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 6.017 IJCSMC,
More informationHadoop محبوبه دادخواه کارگاه ساالنه آزمایشگاه فناوری وب زمستان 1391
Hadoop محبوبه دادخواه کارگاه ساالنه آزمایشگاه فناوری وب زمستان 1391 Outline Big Data Big Data Examples Challenges with traditional storage NoSQL Hadoop HDFS MapReduce Architecture 2 Big Data In information
More informationDistributed File Systems II
Distributed File Systems II To do q Very-large scale: Google FS, Hadoop FS, BigTable q Next time: Naming things GFS A radically new environment NFS, etc. Independence Small Scale Variety of workloads Cooperation
More informationCPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University
CPSC 426/526 Cloud Computing Ennan Zhai Computer Science Department Yale University Recall: Lec-7 In the lec-7, I talked about: - P2P vs Enterprise control - Firewall - NATs - Software defined network
More informationDeep Data Locality on Apache Hadoop
UNLV Theses, Dissertations, Professional Papers, and Capstones May 2018 Deep Data Locality on Apache Hadoop Sungchul Lee lsungchul@gmail.com Follow this and additional works at: https://digitalscholarship.unlv.edu/thesesdissertations
More informationA priority based dynamic bandwidth scheduling in SDN networks 1
Acta Technica 62 No. 2A/2017, 445 454 c 2017 Institute of Thermomechanics CAS, v.v.i. A priority based dynamic bandwidth scheduling in SDN networks 1 Zun Wang 2 Abstract. In order to solve the problems
More informationMitigating Data Skew Using Map Reduce Application
Ms. Archana P.M Mitigating Data Skew Using Map Reduce Application Mr. Malathesh S.H 4 th sem, M.Tech (C.S.E) Associate Professor C.S.E Dept. M.S.E.C, V.T.U Bangalore, India archanaanil062@gmail.com M.S.E.C,
More informationSTATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns
STATS 700-002 Data Analysis using Python Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns Unit 3: parallel processing and big data The next few lectures will focus on big
More informationIndexing Strategies of MapReduce for Information Retrieval in Big Data
International Journal of Advances in Computer Science and Technology (IJACST), Vol.5, No.3, Pages : 01-06 (2016) Indexing Strategies of MapReduce for Information Retrieval in Big Data Mazen Farid, Rohaya
More informationHadoop/MapReduce Computing Paradigm
Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications
More informationSinbad. Leveraging Endpoint Flexibility in Data-Intensive Clusters. Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica. UC Berkeley
Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica UC Berkeley Communication is Crucial for Analytics at Scale Performance Facebook analytics
More informationMounica B, Aditya Srivastava, Md. Faisal Alam
International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2017 IJSRCSEIT Volume 2 Issue 3 ISSN : 2456-3307 Clustering of large datasets using Hadoop Ecosystem
More informationA Survey on Job Scheduling in Big Data
BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 16, No 3 Sofia 2016 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.1515/cait-2016-0033 A Survey on Job Scheduling in
More informationMapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1
MapReduce-II September 2013 Alberto Abelló & Oscar Romero 1 Knowledge objectives 1. Enumerate the different kind of processes in the MapReduce framework 2. Explain the information kept in the master 3.
More informationADAPTIVE HANDLING OF 3V S OF BIG DATA TO IMPROVE EFFICIENCY USING HETEROGENEOUS CLUSTERS
INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS ISSN 2320-7345 ADAPTIVE HANDLING OF 3V S OF BIG DATA TO IMPROVE EFFICIENCY USING HETEROGENEOUS CLUSTERS Radhakrishnan R 1, Karthik
More informationDistributed Computation Models
Distributed Computation Models SWE 622, Spring 2017 Distributed Software Engineering Some slides ack: Jeff Dean HW4 Recap https://b.socrative.com/ Class: SWE622 2 Review Replicating state machines Case
More informationDistributed Filesystem
Distributed Filesystem 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributing Code! Don t move data to workers move workers to the data! - Store data on the local disks of nodes in the
More informationBatch Inherence of Map Reduce Framework
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 6, June 2015, pg.287
More informationHadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved
Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop
More informationParallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce
Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The
More informationEsgynDB Enterprise 2.0 Platform Reference Architecture
EsgynDB Enterprise 2.0 Platform Reference Architecture This document outlines a Platform Reference Architecture for EsgynDB Enterprise, built on Apache Trafodion (Incubating) implementation with licensed
More informationFlash Storage Complementing a Data Lake for Real-Time Insight
Flash Storage Complementing a Data Lake for Real-Time Insight Dr. Sanhita Sarkar Global Director, Analytics Software Development August 7, 2018 Agenda 1 2 3 4 5 Delivering insight along the entire spectrum
More informationIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large
More informationHadoop & Big Data Analytics Complete Practical & Real-time Training
An ISO Certified Training Institute A Unit of Sequelgate Innovative Technologies Pvt. Ltd. www.sqlschool.com Hadoop & Big Data Analytics Complete Practical & Real-time Training Mode : Instructor Led LIVE
More informationHADOOP 3.0 is here! Dr. Sandeep Deshmukh Sadepach Labs Pvt. Ltd. - Let us grow together!
HADOOP 3.0 is here! Dr. Sandeep Deshmukh sandeep@sadepach.com Sadepach Labs Pvt. Ltd. - Let us grow together! About me BE from VNIT Nagpur, MTech+PhD from IIT Bombay Worked with Persistent Systems - Life
More informationA Novel Architecture to Efficient utilization of Hadoop Distributed File Systems for Small Files
A Novel Architecture to Efficient utilization of Hadoop Distributed File Systems for Small Files Vaishali 1, Prem Sagar Sharma 2 1 M. Tech Scholar, Dept. of CSE., BSAITM Faridabad, (HR), India 2 Assistant
More informationCAVA: Exploring Memory Locality for Big Data Analytics in Virtualized Clusters
2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing : Exploring Memory Locality for Big Data Analytics in Virtualized Clusters Eunji Hwang, Hyungoo Kim, Beomseok Nam and Young-ri
More informationTop 25 Hadoop Admin Interview Questions and Answers
Top 25 Hadoop Admin Interview Questions and Answers 1) What daemons are needed to run a Hadoop cluster? DataNode, NameNode, TaskTracker, and JobTracker are required to run Hadoop cluster. 2) Which OS are
More informationWe are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info
We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : PH NO: 9963799240, 040-40025423
More informationHadoop Distributed File System(HDFS)
Hadoop Distributed File System(HDFS) Bu eğitim sunumları İstanbul Kalkınma Ajansı nın 2016 yılı Yenilikçi ve Yaratıcı İstanbul Mali Destek Programı kapsamında yürütülmekte olan TR10/16/YNY/0036 no lu İstanbul
More informationGoogle File System (GFS) and Hadoop Distributed File System (HDFS)
Google File System (GFS) and Hadoop Distributed File System (HDFS) 1 Hadoop: Architectural Design Principles Linear scalability More nodes can do more work within the same time Linear on data size, linear
More informationCLOUD-SCALE FILE SYSTEMS
Data Management in the Cloud CLOUD-SCALE FILE SYSTEMS 92 Google File System (GFS) Designing a file system for the Cloud design assumptions design choices Architecture GFS Master GFS Chunkservers GFS Clients
More informationA New HadoopBased Network Management System with Policy Approach
Computer Engineering and Applications Vol. 3, No. 3, September 2014 A New HadoopBased Network Management System with Policy Approach Department of Computer Engineering and IT, Shiraz University of Technology,
More informationYuval Carmel Tel-Aviv University "Advanced Topics in Storage Systems" - Spring 2013
Yuval Carmel Tel-Aviv University "Advanced Topics in About & Keywords Motivation & Purpose Assumptions Architecture overview & Comparison Measurements How does it fit in? The Future 2 About & Keywords
More informationINTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK DISTRIBUTED FRAMEWORK FOR DATA MINING AS A SERVICE ON PRIVATE CLOUD RUCHA V. JAMNEKAR
More informationReview On Data Replication with QoS and Energy Consumption for Data Intensive Applications in Cloud Computing
Review On Data Replication with QoS and Energy Consumption for Data Intensive Applications in Cloud Computing Ms. More Reena S 1, Prof.Nilesh V. Alone 2 Department of Computer Engg, University of Pune
More informationCertified Big Data and Hadoop Course Curriculum
Certified Big Data and Hadoop Course Curriculum The Certified Big Data and Hadoop course by DataFlair is a perfect blend of in-depth theoretical knowledge and strong practical skills via implementation
More informationWhat is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?
Simple to start What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? What is the maximum download speed you get? Simple computation
More informationApril Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model.
1. MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model. MapReduce is a framework for processing big data which processes data in two phases, a Map
More information2/26/2017. For instance, consider running Word Count across 20 splits
Based on the slides of prof. Pietro Michiardi Hadoop Internals https://github.com/michiard/disc-cloud-course/raw/master/hadoop/hadoop.pdf Job: execution of a MapReduce application across a data set Task:
More informationA Micro Partitioning Technique in MapReduce for Massive Data Analysis
A Micro Partitioning Technique in MapReduce for Massive Data Analysis Nandhini.C, Premadevi.P PG Scholar, Dept. of CSE, Angel College of Engg and Tech, Tiruppur, Tamil Nadu Assistant Professor, Dept. of
More informationResearch Article Mobile Storage and Search Engine of Information Oriented to Food Cloud
Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 DOI:10.19026/ajfst.5.3106 ISSN: 2042-4868; e-issn: 2042-4876 2013 Maxwell Scientific Publication Corp. Submitted: May 29, 2013 Accepted:
More informationIntroduction to BigData, Hadoop:-
Introduction to BigData, Hadoop:- Big Data Introduction: Hadoop Introduction What is Hadoop? Why Hadoop? Hadoop History. Different types of Components in Hadoop? HDFS, MapReduce, PIG, Hive, SQOOP, HBASE,
More informationDept. Of Computer Science, Colorado State University
CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [HADOOP/HDFS] Trying to have your cake and eat it too Each phase pines for tasks with locality and their numbers on a tether Alas within a phase, you get one,
More informationA Review on Backup-up Practices using Deduplication
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 9, September 2015,
More informationCA485 Ray Walshe Google File System
Google File System Overview Google File System is scalable, distributed file system on inexpensive commodity hardware that provides: Fault Tolerance File system runs on hundreds or thousands of storage
More informationCorrelation based File Prefetching Approach for Hadoop
IEEE 2nd International Conference on Cloud Computing Technology and Science Correlation based File Prefetching Approach for Hadoop Bo Dong 1, Xiao Zhong 2, Qinghua Zheng 1, Lirong Jian 2, Jian Liu 1, Jie
More informationUNIT-IV HDFS. Ms. Selva Mary. G
UNIT-IV HDFS HDFS ARCHITECTURE Dataset partition across a number of separate machines Hadoop Distributed File system The Design of HDFS HDFS is a file system designed for storing very large files with
More informationThe Google File System
The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system
More information