Implementing MapReduce Algorithms in Hadoop Framework. Guide: Dr. Sobhan Babu

Implementing MapReduce Algorithms in Hadoop Framework
Guide: Dr. Sobhan Babu
CS13B1033 T Satya Vasanth Reddy
CS13B1035 Hrishikesh Vaidya
CS13S1041 Arjun V Anand

Hadoop Architecture

Hadoop Architecture
NameNode: The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
DataNode: A DataNode stores data in the Hadoop file system. A functional filesystem has more than one DataNode, with data replicated across them. On startup, a DataNode connects to the NameNode and then responds to requests from it.
TaskTracker: A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.

Hadoop Architecture
JobTracker: The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data.
1. Client applications submit jobs to the JobTracker.
2. The JobTracker talks to the NameNode to determine the location of the data.
3. The JobTracker locates TaskTracker nodes with available slots at or near the data.
4. The JobTracker submits the work to the chosen TaskTracker nodes.
5. The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.

Hadoop Architecture
6. A TaskTracker notifies the JobTracker when a task fails. The JobTracker then decides what to do: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
7. When the work is completed, the JobTracker updates its status.
The steps and a detailed explanation for setting up the Hadoop multi-node cluster are in the comprehensive report.

Single-source shortest path algorithm
OVERVIEW: A path in a graph is a sequence of nodes such that there is an edge from each node to the next node in the sequence. The shortest path between two nodes is the path with the minimum total weight of edges along it. A variant of breadth-first search is used to solve the single-source shortest path problem.

LOGIC OF ALGORITHM: The single-source shortest path problem can be solved with MapReduce by running parallel breadth-first search (BFS) iteratively. The source node is processed first, then the nodes connected to the source node, and so on.
Input format: ID EDGE-EDGEWEIGHT DISTANCE_FROM_SOURCE COLOR
COLOUR CODE
WHITE - unvisited
GRAY - visited
BLACK - finished

INPUT FORMAT
Before the start, all nodes are coloured white except the source, which is gray. For example:
1 2-2,3-6,6-3 0 GRAY
Here the 1-2 edge weight is 2, the 1-3 edge weight is 6, and so on. The distance 0 from the source indicates that this is the source node, and GRAY indicates that it has been discovered.
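To make the record layout concrete, here is a minimal Python sketch (not part of the original implementation, which is Java MapReduce; the helper name is hypothetical) that parses one such line:

```python
# Hypothetical helper: parse one node record of the form
#   ID  EDGE-WEIGHT,EDGE-WEIGHT,...  DISTANCE_FROM_SOURCE  COLOR
# "NULL" means no adjacency list; Integer.MAX_VALUE maps to infinity.
def parse_node(line):
    node_id, edges_field, dist_field, color = line.split()
    edges = {}
    if edges_field != "NULL":
        for pair in edges_field.split(","):
            target, weight = pair.split("-")
            edges[int(target)] = int(weight)
    dist = float("inf") if dist_field == "Integer.MAX_VALUE" else int(dist_field)
    return int(node_id), edges, dist, color

node_id, edges, dist, color = parse_node("1 2-2,3-6,6-3 0 GRAY")
# edges is {2: 2, 3: 6, 6: 3}; dist is 0; color is "GRAY"
```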

Algorithm
A gray node is one that has been visited and whose neighbours should be processed. All white nodes adjacent to a gray node are coloured gray, indicating that they have been visited. The original gray node is then coloured black, indicating that all its neighbours have been visited and the processing of the node is finished. The process continues until there are no more gray nodes to process in the graph.
INPUT
1 2-2,3-6,6-3 0 GRAY
2 1-2 Integer.MAX_VALUE WHITE
3 1-6,4-4,5-1,6-1 Integer.MAX_VALUE WHITE
4 3-4,5-2 Integer.MAX_VALUE WHITE
5 3-1,4-2 Integer.MAX_VALUE WHITE
6 1-3,3-1 Integer.MAX_VALUE WHITE

Stages of Algorithm

Mapper
The mapper is responsible for "exploding" all gray nodes, i.e. all nodes that live at the current depth in the tree. For each gray node, the mapper emits a new gray node for every neighbour, with distance = distance of the gray node from the source + weight of the edge. It then emits the input gray node again, but coloured black. The mapper also emits all non-gray nodes unchanged.
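The map step above can be sketched in plain Python (a simulation of the logic only, outside Hadoop; the real project implements this as a Java Mapper). A node record is modelled as (edges, distance, color):

```python
INF = float("inf")  # stands in for Integer.MAX_VALUE

def bfs_map(node_id, record):
    """Explode a GRAY node; pass every other node through unchanged.

    A record is (edges, distance_from_source, color), where edges maps
    a neighbour id to the edge weight."""
    edges, dist, color = record
    if color != "GRAY":
        return [(node_id, record)]          # non-gray nodes: no change
    out = []
    for target, weight in edges.items():
        # newly discovered node: NULL edge list, tentative distance
        out.append((target, ({}, dist + weight, "GRAY")))
    out.append((node_id, (edges, dist, "BLACK")))   # finished exploding
    return out

# exploding the source node of the running example
emitted = bfs_map(1, ({2: 2, 3: 6, 6: 3}, 0, "GRAY"))
```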

After Map Iteration 1
1 2-2,3-6,6-3 0 BLACK
2 NULL 2 GRAY
3 NULL 6 GRAY
6 NULL 3 GRAY
2 1-2 Integer.MAX_VALUE WHITE
3 1-6,4-4,5-1,6-1 Integer.MAX_VALUE WHITE
4 3-4,5-2 Integer.MAX_VALUE WHITE
5 3-1,4-2 Integer.MAX_VALUE WHITE
6 1-3,3-1 Integer.MAX_VALUE WHITE

Reducer
The reducer receives all data for a given key; here that means it receives the data for all "copies" of each node. For example, the reducer that receives the data for key = 2 gets the following list of values:
2 NULL 2 GRAY
2 1-2 Integer.MAX_VALUE WHITE
The reducer's job is to take all this data and construct a new node using:
- the non-null list of edges
- the minimum distance
- the darkest colour
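The same reduce logic, sketched in plain Python (again a simulation, not the actual Java code): keep the non-null edge list, the minimum distance, and the darkest colour.

```python
INF = float("inf")  # stands in for Integer.MAX_VALUE
DARKNESS = {"WHITE": 0, "GRAY": 1, "BLACK": 2}

def bfs_reduce(values):
    """Merge all "copies" of one node emitted by the mappers."""
    edges, best_dist, best_color = {}, INF, "WHITE"
    for e, dist, color in values:
        if e:                                 # the non-null list of edges
            edges = e
        best_dist = min(best_dist, dist)      # the minimum distance
        if DARKNESS[color] > DARKNESS[best_color]:
            best_color = color                # the darkest colour
    return edges, best_dist, best_color

# the two copies of node 2 after map iteration 1
merged = bfs_reduce([({}, 2, "GRAY"), ({1: 2}, INF, "WHITE")])
# merged is ({1: 2}, 2, "GRAY")
```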

After Iteration 1
Using this logic, after the first iteration the output will be:
1 2-2,3-6,6-3 0 BLACK
2 1-2 2 GRAY
3 1-6,4-4,5-1,6-1 6 GRAY
4 3-4,5-2 Integer.MAX_VALUE WHITE
5 3-1,4-2 Integer.MAX_VALUE WHITE
6 1-3,3-1 3 GRAY

Terminating Condition
The iteration stops when there are no more gray nodes to process in the graph.
FINAL OUTPUT:
1 2-2,3-6,6-3 0 BLACK
2 1-2 2 BLACK
3 1-6,4-4,5-1,6-1 4 BLACK
4 3-4,5-2 9 BLACK
5 3-1,4-2 7 BLACK
6 1-3,3-1 3 BLACK
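Putting the map and reduce steps together, the whole iteration can be simulated in plain Python (a sketch of the logic only, outside Hadoop); on the running example it reproduces the final output above:

```python
from collections import defaultdict

INF = float("inf")  # stands in for Integer.MAX_VALUE
DARKNESS = {"WHITE": 0, "GRAY": 1, "BLACK": 2}

def bfs_map(node_id, record):
    # explode gray nodes; pass everything else through unchanged
    edges, dist, color = record
    if color != "GRAY":
        return [(node_id, record)]
    out = [(t, ({}, dist + w, "GRAY")) for t, w in edges.items()]
    out.append((node_id, (edges, dist, "BLACK")))
    return out

def bfs_reduce(values):
    # non-null edge list, minimum distance, darkest colour
    edges, dist, color = {}, INF, "WHITE"
    for e, d, c in values:
        edges = e or edges
        dist = min(dist, d)
        color = c if DARKNESS[c] > DARKNESS[color] else color
    return edges, dist, color

def parallel_bfs(graph):
    # iterate map + group-by-key + reduce until no gray nodes remain
    while any(color == "GRAY" for _, _, color in graph.values()):
        grouped = defaultdict(list)
        for node_id, record in graph.items():
            for key, value in bfs_map(node_id, record):
                grouped[key].append(value)
        graph = {k: bfs_reduce(vs) for k, vs in grouped.items()}
    return graph

graph = {
    1: ({2: 2, 3: 6, 6: 3}, 0, "GRAY"),
    2: ({1: 2}, INF, "WHITE"),
    3: ({1: 6, 4: 4, 5: 1, 6: 1}, INF, "WHITE"),
    4: ({3: 4, 5: 2}, INF, "WHITE"),
    5: ({3: 1, 4: 2}, INF, "WHITE"),
    6: ({1: 3, 3: 1}, INF, "WHITE"),
}
final = parallel_bfs(graph)
distances = {n: d for n, (_, d, _) in final.items()}
# distances matches the final output: {1: 0, 2: 2, 3: 4, 4: 9, 5: 7, 6: 3}
```

Note that a node turned black is never re-exploded, so a later improvement to its distance (as happens to node 3) does not propagate further; the simulation reproduces the deck's output rather than Dijkstra's distances.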

Analysis of running times on single node and multi-node cluster

No. of nodes in input graph | Time taken in single node (sec) | Time taken in 3-node cluster (sec)
10    |    8.217 |   12.035
20    |    8.307 |    9.734
50    |   10.669 |   14.674
100   |    8.214 |   10.307
200   |   14.273 |   11.385
500   |   29.296 |   16.261
10000 | 1963.610 |  319.704
30000 | 9772.563 | 1956.860

All-pairs shortest path
Overview: To calculate shortest paths between all pairs of vertices without parallel computing, we use the standard Floyd-Warshall algorithm. To implement it in the Hadoop framework, the main task is to reduce the problem statement to key-value pairs. After the nth iteration of relaxation, we get the shortest path from node i to node j having path length at most n.

Input format
- The input format is in the form of node id and adjacency list.
- The graph is undirected, and there can be multiple edges between a pair of vertices.
- The adjacency list is a list of pairs, each denoting a neighbouring vertex and the weight of the edge joining them.
1 3,43
2 4,18
3 2,31 1,32 4,14
4 2,27 3,23 5,48
5 1,23

Mapper class
- The mapper class takes the entire file as input and parses it line by line.
- For each trio of vertices present in the graph it relaxes the edge weights.
- If node i and node j are adjacent to node k, then it sums dist(i,k) and dist(k,j) as a candidate for dist(i,j).
- The implementation is similar to that of Floyd-Warshall: it considers the kth vertex to be present in the path from i to j.
- For each vertex in the adjacency list it emits a new node with the same node id as that of the adjacent vertex.
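The emission rule above can be sketched in plain Python (a simulation of the logic, not the project's Java code). For node 3, with adjacency list 2,31 1,32 4,14, it reproduces the records emitted for that node in the mapper output:

```python
# Sketch of the map step: re-emit the node's own adjacency list, and for
# each neighbour v emit, keyed by v, the paths that go from v through this
# node u: v -> u (weight w_uv) and v -> u -> x (weight w_uv + w_ux).
def apsp_map(node_id, adj):
    out = [(node_id, dict(adj))]          # pass the original list through
    for v, w_uv in adj.items():
        via = {node_id: w_uv}             # v reaches u directly
        for x, w_ux in adj.items():
            if x != v:
                via[x] = w_uv + w_ux      # v reaches x through u
        out.append((v, via))
    return out

emitted = apsp_map(3, {2: 31, 1: 32, 4: 14})
# emitted contains (2, {3: 31, 1: 63, 4: 45}), (1, {3: 32, 2: 63, 4: 46}),
# (4, {3: 14, 2: 45, 1: 46}) and the pass-through (3, {2: 31, 1: 32, 4: 14})
```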

Output of Mapper
1 3,43
2 4,18
3 2,31 1,32 4,14
4 2,27 3,23 5,48
5 1,23
3 1,43
4 2,18
2 3,31 1,63 4,45
1 3,32 2,63 4,46
4 3,14 2,45 1,46
2 4,27 3,50 5,75
3 4,23 2,50 5,71
5 4,48 2,75 3,71

Reducer
- The output of the mapper is fed to the reducer. The list of values having the same key is sent to a particular reducer.
- Each value is an adjacency list holding an adjacent node and the path weight from the key to that node id.
- For the shortest path, the minimum of all the path weights to a particular vertex is taken and added to the adjacency list of the key.
- After each iteration, the reducer generates an output file holding the shortest path found so far from each node i to each node j.
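Combining the two stages, one relaxation round can be simulated in plain Python (again a sketch of the logic, not the project's Java code); on the sample graph it reproduces the reducer output:

```python
from collections import defaultdict

def apsp_map(node_id, adj):
    # re-emit the original adjacency list, plus two-hop paths through this node
    out = [(node_id, dict(adj))]
    for v, w_uv in adj.items():
        via = {node_id: w_uv}
        for x, w_ux in adj.items():
            if x != v:
                via[x] = w_uv + w_ux
        out.append((v, via))
    return out

def apsp_reduce(node_id, dicts):
    # keep the minimum path weight seen to each target vertex
    best = {}
    for d in dicts:
        for target, w in d.items():
            if target != node_id and w < best.get(target, float("inf")):
                best[target] = w
    return best

graph = {1: {3: 43}, 2: {4: 18}, 3: {2: 31, 1: 32, 4: 14},
         4: {2: 27, 3: 23, 5: 48}, 5: {1: 23}}

# one MapReduce round: map every node, group by key, reduce each group
grouped = defaultdict(list)
for node_id, adj in graph.items():
    for key, value in apsp_map(node_id, adj):
        grouped[key].append(value)
result = {k: apsp_reduce(k, vs) for k, vs in grouped.items()}
# result matches the reducer output, e.g. result[1] == {2: 63, 3: 32, 4: 46, 5: 23}
```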

Output of Reducer
1 2,63 3,32 4,46 5,23
2 1,63 3,31 4,18 5,75
3 1,32 2,31 4,14 5,71
4 1,46 2,18 3,14 5,48
5 1,23 2,75 3,71 4,48

Final output
For each vertex, a boolean variable is maintained to check whether we have obtained the minimum weights for all pairs. If the path weight of a particular node gets updated, isConverged is set to false, indicating that the current distance may not yet be the shortest. The final output for the above graph is:
1 2,63 3,32 4,46 5,23
2 1,63 3,31 4,18 5,75
3 1,32 2,31 4,14 5,71
4 1,46 2,18 3,14 5,48
5 1,23 2,75 3,71 4,48

Analysis of running times of single node and multi-node cluster

S.no | No. of nodes in input graph | Time taken in single node (sec) | Time taken in 3-node cluster (sec)
1 | 20  |    10.447 |    9.432
2 | 50  |    17.866 |   11.102
3 | 100 |    58.159 |   23.124
4 | 200 |   582.186 |   73.743
5 | 500 | 10025.565 | 1396.768

Summary and future enhancements
MapReduce is not an efficient approach for small inputs, since creating the map and reduce jobs takes an amount of time comparable to the processing time itself. Before formatting the namenode, it is better practice to delete the namenode and datanode directories in Hadoop_store, to ensure all the nodes get the same cluster ID. The analysis was done over a wireless network; performance could be improved by using a LAN, which has greater bandwidth.

Summary and future enhancements
Dijkstra's algorithm is more efficient because at any step it only pursues edges from the minimum-cost path inside the frontier, whereas our algorithm explores all paths in parallel, which is not as efficient overall. We calculate only the shortest path length here, but the trace of the shortest path could also be recovered, along with the shortest distance, by keeping track of the parent vertex.

Acknowledgement
We would like to thank Dr. Sobhan Babu for guiding us, and Ms. Samanvi and Mr. Kanishka Chauhan for helping us understand the concepts from time to time. Thanks to Tanya Marwah for giving us an extra slave node.
PS: Complete details of the implementation in the framework and the procedure of analysis are covered in the extensive reports and the video.

Bibliography
- Google MapReduce paper
- Hadoop wiki
- Hadoop Operations, by Eric Sammer
- Big Data University

THANK YOU