Implementing MapReduce Algorithms in Hadoop Framework. Guide: Dr. Sobhan Babu

Implementing MapReduce Algorithms in Hadoop Framework
Guide: Dr. Sobhan Babu
CS13B1033 T Satya Vasanth Reddy
CS13B1035 Hrishikesh Vaidya
CS13S1041 Arjun V Anand

Hadoop Architecture

Hadoop Architecture
NameNode: The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
DataNode: A DataNode stores data in the Hadoop file system. A functional filesystem has more than one DataNode, with data replicated across them. On startup, a DataNode connects to the NameNode and then responds to requests from it.
TaskTracker: A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.

Hadoop Architecture
JobTracker: The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data.
1. Client applications submit jobs to the JobTracker.
2. The JobTracker talks to the NameNode to determine the location of the data.
3. The JobTracker locates TaskTracker nodes with available slots at or near the data.
4. The JobTracker submits the work to the chosen TaskTracker nodes.
5. The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.

Hadoop Architecture
6. A TaskTracker notifies the JobTracker when a task fails. The JobTracker then decides what to do: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
7. When the work is completed, the JobTracker updates its status.
The steps and a detailed explanation for setting up the Hadoop multi-node cluster are in the comprehensive report.

Single-source shortest path algorithm
OVERVIEW: A path in a graph is a sequence of nodes such that there is an edge from each node to the next node in the sequence. The shortest path between two nodes is the path with the minimum total weight of edges along it. A variant of breadth-first search is used to solve the single-source shortest path problem.

LOGIC OF ALGORITHM: The single-source shortest path problem can be solved with MapReduce by running parallel breadth-first search (BFS) iteratively. The source node is processed first, then the nodes connected to the source node, and so on.
Input format: ID EDGE-EDGEWEIGHT DISTANCE_FROM_SOURCE COLOR
COLOUR CODE
WHITE - unvisited
GRAY - visited
BLACK - finished

INPUT FORMAT
Before the start, all nodes are coloured white except the source, which is gray. For example:
1 2-2,3-6,6-3 0 GRAY
Here the 1-2 edge weight is 2, the 1-3 edge weight is 6, and so on. The distance 0 from the source indicates that this is the source node, and GRAY indicates that it has been discovered.
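To make the record layout concrete, here is a minimal Python sketch (not part of the original implementation, which is Java MapReduce; the helper name is hypothetical) that parses one such line:

```python
# Hypothetical helper: parse one node record of the form
#   ID  EDGE-WEIGHT,EDGE-WEIGHT,...  DISTANCE_FROM_SOURCE  COLOR
# "NULL" means no adjacency list; Integer.MAX_VALUE maps to infinity.
def parse_node(line):
    node_id, edges_field, dist_field, color = line.split()
    edges = {}
    if edges_field != "NULL":
        for pair in edges_field.split(","):
            target, weight = pair.split("-")
            edges[int(target)] = int(weight)
    dist = float("inf") if dist_field == "Integer.MAX_VALUE" else int(dist_field)
    return int(node_id), edges, dist, color

node_id, edges, dist, color = parse_node("1 2-2,3-6,6-3 0 GRAY")
# edges is {2: 2, 3: 6, 6: 3}; dist is 0; color is "GRAY"
```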

Algorithm
A gray node is one that has been visited and whose neighbours should be processed. All white nodes adjacent to a gray node are coloured gray, indicating that they have been visited. The original gray node is then coloured black, indicating that all its neighbours have been visited and the processing of the node is finished. The process continues until there are no more gray nodes to process in the graph.
INPUT
1 2-2,3-6,6-3 0 GRAY
2 1-2 Integer.MAX_VALUE WHITE
3 1-6,4-4,5-1,6-1 Integer.MAX_VALUE WHITE
4 3-4,5-2 Integer.MAX_VALUE WHITE
5 3-1,4-2 Integer.MAX_VALUE WHITE
6 1-3,3-1 Integer.MAX_VALUE WHITE

Stages of Algorithm

Mapper
The mapper is responsible for "exploding" all gray nodes, i.e. all nodes that live at the current depth in the tree. For each gray node, the mapper emits a new gray node for every neighbour, with distance = distance of the gray node from the source + weight of the edge. It then emits the input gray node again, but coloured black. The mapper also emits all non-gray nodes unchanged.
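The map step above can be sketched in plain Python (a simulation of the logic only, outside Hadoop; the real project implements this as a Java Mapper). A node record is modelled as (edges, distance, color):

```python
INF = float("inf")  # stands in for Integer.MAX_VALUE

def bfs_map(node_id, record):
    """Explode a GRAY node; pass every other node through unchanged.

    A record is (edges, distance_from_source, color), where edges maps
    a neighbour id to the edge weight."""
    edges, dist, color = record
    if color != "GRAY":
        return [(node_id, record)]          # non-gray nodes: no change
    out = []
    for target, weight in edges.items():
        # newly discovered node: NULL edge list, tentative distance
        out.append((target, ({}, dist + weight, "GRAY")))
    out.append((node_id, (edges, dist, "BLACK")))   # finished exploding
    return out

# exploding the source node of the running example
emitted = bfs_map(1, ({2: 2, 3: 6, 6: 3}, 0, "GRAY"))
```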

After Map Iteration 1
1 2-2,3-6,6-3 0 BLACK
2 NULL 2 GRAY
3 NULL 6 GRAY
6 NULL 3 GRAY
2 1-2 Integer.MAX_VALUE WHITE
3 1-6,4-4,5-1,6-1 Integer.MAX_VALUE WHITE
4 3-4,5-2 Integer.MAX_VALUE WHITE
5 3-1,4-2 Integer.MAX_VALUE WHITE
6 1-3,3-1 Integer.MAX_VALUE WHITE

Reducer
The reducer receives all data for a given key; here that means it receives the data for all "copies" of each node. For example, the reducer that receives the data for key = 2 gets the following list of values:
2 NULL 2 GRAY
2 1-2 Integer.MAX_VALUE WHITE
The reducer's job is to take all this data and construct a new node using:
- the non-null list of edges
- the minimum distance
- the darkest colour
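The same reduce logic, sketched in plain Python (again a simulation, not the actual Java code): keep the non-null edge list, the minimum distance, and the darkest colour.

```python
INF = float("inf")  # stands in for Integer.MAX_VALUE
DARKNESS = {"WHITE": 0, "GRAY": 1, "BLACK": 2}

def bfs_reduce(values):
    """Merge all "copies" of one node emitted by the mappers."""
    edges, best_dist, best_color = {}, INF, "WHITE"
    for e, dist, color in values:
        if e:                                 # the non-null list of edges
            edges = e
        best_dist = min(best_dist, dist)      # the minimum distance
        if DARKNESS[color] > DARKNESS[best_color]:
            best_color = color                # the darkest colour
    return edges, best_dist, best_color

# the two copies of node 2 after map iteration 1
merged = bfs_reduce([({}, 2, "GRAY"), ({1: 2}, INF, "WHITE")])
# merged is ({1: 2}, 2, "GRAY")
```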

After Iteration 1
Using this logic, after the first iteration the output will be:
1 2-2,3-6,6-3 0 BLACK
2 1-2 2 GRAY
3 1-6,4-4,5-1,6-1 6 GRAY
4 3-4,5-2 Integer.MAX_VALUE WHITE
5 3-1,4-2 Integer.MAX_VALUE WHITE
6 1-3,3-1 3 GRAY

Terminating Condition
The iteration stops when there are no more gray nodes to process in the graph.
FINAL OUTPUT:
1 2-2,3-6,6-3 0 BLACK
2 1-2 2 BLACK
3 1-6,4-4,5-1,6-1 4 BLACK
4 3-4,5-2 9 BLACK
5 3-1,4-2 7 BLACK
6 1-3,3-1 3 BLACK
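Putting the map and reduce steps together, the whole iteration can be simulated in plain Python (a sketch of the logic only, outside Hadoop); on the running example it reproduces the final output above:

```python
from collections import defaultdict

INF = float("inf")  # stands in for Integer.MAX_VALUE
DARKNESS = {"WHITE": 0, "GRAY": 1, "BLACK": 2}

def bfs_map(node_id, record):
    # explode gray nodes; pass everything else through unchanged
    edges, dist, color = record
    if color != "GRAY":
        return [(node_id, record)]
    out = [(t, ({}, dist + w, "GRAY")) for t, w in edges.items()]
    out.append((node_id, (edges, dist, "BLACK")))
    return out

def bfs_reduce(values):
    # non-null edge list, minimum distance, darkest colour
    edges, dist, color = {}, INF, "WHITE"
    for e, d, c in values:
        edges = e or edges
        dist = min(dist, d)
        color = c if DARKNESS[c] > DARKNESS[color] else color
    return edges, dist, color

def parallel_bfs(graph):
    # iterate map + group-by-key + reduce until no gray nodes remain
    while any(color == "GRAY" for _, _, color in graph.values()):
        grouped = defaultdict(list)
        for node_id, record in graph.items():
            for key, value in bfs_map(node_id, record):
                grouped[key].append(value)
        graph = {k: bfs_reduce(vs) for k, vs in grouped.items()}
    return graph

graph = {
    1: ({2: 2, 3: 6, 6: 3}, 0, "GRAY"),
    2: ({1: 2}, INF, "WHITE"),
    3: ({1: 6, 4: 4, 5: 1, 6: 1}, INF, "WHITE"),
    4: ({3: 4, 5: 2}, INF, "WHITE"),
    5: ({3: 1, 4: 2}, INF, "WHITE"),
    6: ({1: 3, 3: 1}, INF, "WHITE"),
}
final = parallel_bfs(graph)
distances = {n: d for n, (_, d, _) in final.items()}
# distances matches the final output: {1: 0, 2: 2, 3: 4, 4: 9, 5: 7, 6: 3}
```

Note that a node turned black is never re-exploded, so a later improvement to its distance (as happens to node 3) does not propagate further; the simulation reproduces the deck's output rather than Dijkstra's distances.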

Analysis of running times on single node and multi-node cluster

No. of nodes in input graph | Time taken in single node (sec) | Time taken in 3-node cluster (sec)
10    |    8.217 |   12.035
20    |    8.307 |    9.734
50    |   10.669 |   14.674
100   |    8.214 |   10.307
200   |   14.273 |   11.385
500   |   29.296 |   16.261
10000 | 1963.610 |  319.704
30000 | 9772.563 | 1956.860

All-pairs shortest path
Overview: To calculate shortest paths between all pairs of vertices without parallel computing, we use the standard Floyd-Warshall algorithm. To implement it in the Hadoop framework, the main task is to reduce the problem statement to key-value pairs. After the nth iteration of relaxation, we get the shortest path from node i to node j having path length at most n.

Input format
- The input format is in the form of node id and adjacency list.
- The graph is undirected, and there can be multiple edges between a pair of vertices.
- The adjacency list is a list of pairs, each denoting a neighbouring vertex and the weight of the edge joining them.
1 3,43
2 4,18
3 2,31 1,32 4,14
4 2,27 3,23 5,48
5 1,23

Mapper class
- The mapper class takes the entire file as input and parses it line by line.
- For each trio of vertices present in the graph it relaxes the edge weights.
- If node i and node j are adjacent to node k, then it sums dist(i,k) and dist(k,j) as a candidate for dist(i,j).
- The implementation is similar to that of Floyd-Warshall: it considers the kth vertex to be present in the path from i to j.
- For each vertex in the adjacency list it emits a new node with the same node id as that of the adjacent vertex.
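The emission rule above can be sketched in plain Python (a simulation of the logic, not the project's Java code). For node 3, with adjacency list 2,31 1,32 4,14, it reproduces the records emitted for that node in the mapper output:

```python
# Sketch of the map step: re-emit the node's own adjacency list, and for
# each neighbour v emit, keyed by v, the paths that go from v through this
# node u: v -> u (weight w_uv) and v -> u -> x (weight w_uv + w_ux).
def apsp_map(node_id, adj):
    out = [(node_id, dict(adj))]          # pass the original list through
    for v, w_uv in adj.items():
        via = {node_id: w_uv}             # v reaches u directly
        for x, w_ux in adj.items():
            if x != v:
                via[x] = w_uv + w_ux      # v reaches x through u
        out.append((v, via))
    return out

emitted = apsp_map(3, {2: 31, 1: 32, 4: 14})
# emitted contains (2, {3: 31, 1: 63, 4: 45}), (1, {3: 32, 2: 63, 4: 46}),
# (4, {3: 14, 2: 45, 1: 46}) and the pass-through (3, {2: 31, 1: 32, 4: 14})
```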

Output of Mapper
1 3,43
2 4,18
3 2,31 1,32 4,14
4 2,27 3,23 5,48
5 1,23
3 1,43
4 2,18
2 3,31 1,63 4,45
1 3,32 2,63 4,46
4 3,14 2,45 1,46
2 4,27 3,50 5,75
3 4,23 2,50 5,71
5 4,48 2,75 3,71

Reducer
- The output of the mapper is fed to the reducer. The list of values having the same key is sent to a particular reducer.
- Each value is an adjacency list holding an adjacent node and the path weight from the key to that node id.
- For the shortest path, the minimum of all the path weights to a particular vertex is taken and added to the adjacency list of the key.
- After each iteration, the reducer generates an output file holding the shortest path found so far from each node i to each node j.
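Combining the two stages, one relaxation round can be simulated in plain Python (again a sketch of the logic, not the project's Java code); on the sample graph it reproduces the reducer output:

```python
from collections import defaultdict

def apsp_map(node_id, adj):
    # re-emit the original adjacency list, plus two-hop paths through this node
    out = [(node_id, dict(adj))]
    for v, w_uv in adj.items():
        via = {node_id: w_uv}
        for x, w_ux in adj.items():
            if x != v:
                via[x] = w_uv + w_ux
        out.append((v, via))
    return out

def apsp_reduce(node_id, dicts):
    # keep the minimum path weight seen to each target vertex
    best = {}
    for d in dicts:
        for target, w in d.items():
            if target != node_id and w < best.get(target, float("inf")):
                best[target] = w
    return best

graph = {1: {3: 43}, 2: {4: 18}, 3: {2: 31, 1: 32, 4: 14},
         4: {2: 27, 3: 23, 5: 48}, 5: {1: 23}}

# one MapReduce round: map every node, group by key, reduce each group
grouped = defaultdict(list)
for node_id, adj in graph.items():
    for key, value in apsp_map(node_id, adj):
        grouped[key].append(value)
result = {k: apsp_reduce(k, vs) for k, vs in grouped.items()}
# result matches the reducer output, e.g. result[1] == {2: 63, 3: 32, 4: 46, 5: 23}
```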

Output of Reducer
1 2,63 3,32 4,46 5,23
2 1,63 3,31 4,18 5,75
3 1,32 2,31 4,14 5,71
4 1,46 2,18 3,14 5,48
5 1,23 2,75 3,71 4,48

Final output
For each vertex, a boolean variable is maintained to check whether we have obtained the minimum weights for all pairs. If the path weight of a particular node gets updated, isConverged is set to false, indicating that the current distance may not yet be the shortest. The final output for the above graph is:
1 2,63 3,32 4,46 5,23
2 1,63 3,31 4,18 5,75
3 1,32 2,31 4,14 5,71
4 1,46 2,18 3,14 5,48
5 1,23 2,75 3,71 4,48

Analysis of running times of single node and multi-node cluster

S.no | No. of nodes in input graph | Time taken in single node (sec) | Time taken in 3-node cluster (sec)
1 | 20  |    10.447 |    9.432
2 | 50  |    17.866 |   11.102
3 | 100 |    58.159 |   23.124
4 | 200 |   582.186 |   73.743
5 | 500 | 10025.565 | 1396.768

Summary and future enhancements
MapReduce is not an efficient approach for small inputs, since creating the map and reduce jobs takes an amount of time comparable to the processing time itself. Before formatting the namenode, it is better practice to delete the namenode and datanode directories in Hadoop_store, to ensure all the nodes get the same cluster ID. The analysis was done over a wireless network; performance could be improved by using a LAN, which has greater bandwidth.

Summary and future enhancements
Dijkstra's algorithm is more efficient because at any step it only pursues edges from the minimum-cost path inside the frontier, whereas our algorithm explores all paths in parallel, which is not as efficient overall. We calculate only the shortest path length here, but the trace of the shortest path could also be recovered, along with the shortest distance, by keeping track of the parent vertex.

Acknowledgement
We would like to thank Dr. Sobhan Babu for guiding us, and Ms. Samanvi and Mr. Kanishka Chauhan for helping us understand the concepts from time to time. Thanks to Tanya Marwah for giving us an extra slave node.
PS: Complete details of the implementation in the framework and the procedure of analysis are covered in the extensive reports and the video.

Bibliography
- Google MapReduce paper
- Hadoop wiki
- Hadoop Operations, by Eric Sammer
- Big Data University

THANK YOU