From prefix computation on PRAM for finding Euler tours to usage of the Hadoop framework for distributed breadth-first search


Mark Sevalnev
November 22

1 Introduction

In the era of parallelism, problems can be solved in less time simply by increasing the number of computing nodes. The challenge is then, of course, to express the problem so that parallel computing can be used. Many problems are impossible to parallelize by their nature; an example of such a problem is depth-first search (DFS). Also, as parallel algorithms are much more complicated than sequential ones, there is a strong wish to obtain parallel versions of sequential algorithms systematically.

Computing an Euler tour is a problem that can be solved in parallel, and different techniques can be used for it. One of them is prefix computation. This technique is simple but covers a very broad set of problems. The drawback of prefix computation is that we need to reshape problems to make them solvable by prefix computation.

Hadoop is a software framework for creating and running parallel programs. It implements the map/reduce paradigm, in which you reshape the problem into two parts: you specify a mapper program and a reducer program, after which you just feed data to them. Hadoop works in parallel by splitting the input files and sending the splits first to mappers, from which the processed data is sent to reducers. As Hadoop hides all details of parallelism, a programmer need not understand the underlying implementation of parallel algorithms. It is enough to fit the whole program into a mapper and a reducer, and Hadoop takes care of the rest.

This paper introduces two techniques for solving problems. The first is an algorithm design technique, which tells you what can be done if certain constraints are satisfied. The second is an application platform which implements another design technique. We will study which problems can be solved by prefix computation and by Hadoop, how efficiently they can be solved, and how much of the designer's work is required.

2 Algorithms on pointer-based data structures

2.1 Prefix computation

The prefix computation technique is an efficient way to process data stored in linked lists in parallel. Many computational problems can be solved by transforming them into linked lists and then applying the prefix computation technique. The main idea behind prefix computation is pointer jumping. If we have a tree in which a leaf wants to access the root, then a sequential algorithm can do this in time linearly proportional to the number of nodes between the leaf and the root. A parallel algorithm, on the other hand, allows us to reach the root from the leaf in logarithmic time. Each tree node except the root has a pointer to its parent. Logarithmic time can be achieved if we instruct the nodes to update their parent pointers in the following manner: if node x points to node y and node y points to node z, then node x makes its pointer point to node z. Figure 1 shows how the nodes update their parent pointers.

Here is another example where prefix computation is used. Given a linked list L with values x_1, x_2, x_3, ... and an associative operation ∘, it is required to compute the prefixes x_1, x_1 ∘ x_2, x_1 ∘ x_2 ∘ x_3, and so on.

Figure 1: An example of prefix computation
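To make the pointer-jumping step of Figure 1 concrete, here is a small sequential simulation in Python (our own illustration, not code from [1]; the array representation of parent pointers is an assumption). Every round performs, for all nodes at once, the update parent(x) := parent(parent(x)), so after roughly log n rounds every node points directly to the root.

# Sequential simulation of parallel pointer jumping on a tree.
# parent[i] is the parent of node i; the root points to itself.
def pointer_jump_to_root(parent):
    parent = list(parent)
    while True:
        # One "parallel" round: every node jumps to its grandparent.
        new_parent = [parent[parent[i]] for i in range(len(parent))]
        if new_parent == parent:
            return parent
        parent = new_parent

# A chain 0 <- 1 <- 2 <- ... <- 7 needs only three jumping rounds
# before every node points at the root (node 0).
print(pointer_jump_to_root([0, 0, 1, 2, 3, 4, 5, 6]))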

In a sequential setting, we have a pointer to the head of the list, and we perform the prefix computation by a single traversal of L; the time required is linear in the number of nodes in the list. Now assume we want to solve the same problem in parallel. If all we have is a pointer to the head of the list, there is very little to improve. But in a typical parallel setting it is likely that each processor has a pointer to its own node: L has probably been constructed in parallel, each processor contributing one node. A parallel algorithm for prefix computation on L uses pointer jumping. In each iteration a processor:

1) uses the ∘-operation to combine its own value with the value stored in its successor node;
2) makes its successor pointer point to its successor's successor node.

Although no processor knows the number of nodes in the list, the algorithm terminates when all nodes point to nil. The algorithm PRAM LINKED LIST PREFIX is given below. It is assumed that we have as many processors as there are nodes in the list. The algorithm also uses separate next fields so as not to destroy the succ fields.

for all i do in parallel
    next(i) := succ(i)
end for
finished := false
while not finished do
    finished := true
    for all i do in parallel
        if next(i) != nil then
            val(next(i)) := val(i) * val(next(i))
            next(i) := next(next(i))
        end if
        if next(i) != nil then
            finished := false   (COMMON)
        end if
    end for
end while

The algorithm runs in logarithmic time and uses a linear number of processors.

We can use prefix computation as a subroutine to perform another useful computation, called list ranking. Given a linked list L, we may want to know the distance from each node to the end of the list. Specifically, we wish to compute for each node i: rank(i) = rank(succ(i)) + 1 if succ(i) ≠ nil, and rank(i) = 0 if succ(i) = nil. Sequentially the problem is solved by traversing the list from beginning to end and making each node point to its predecessor; after that the list is traversed in the opposite direction, assigning a rank to each node visited. This takes O(n) time. In parallel the problem can again be solved in O(log n) time. First, each processor makes its successor node point back to itself. Second, the value 1 is assigned to each node. Finally, the prefix computation is performed with the operation ∘ taken as +. [1]

2.2 Euler tours

Now that we have learned the concept of prefix computation, we can use it to compute something more practical, namely Euler tours. An Euler tour in a graph is a list of edges such that every edge of the graph is present exactly once in the list and consecutive edges are neighbours in the graph. If an Euler tour exists in a graph, then all vertices have an even degree. This comes from the fact that each vertex must be entered and left, which contributes two edges; if the vertex is visited more than once, 2n edges are needed, where n is the number of visits. It follows that Euler tours always exist in directed trees. By a directed tree we mean any undirected tree which is turned into a directed graph by splitting every edge into two edges going in opposite directions.

Given a directed tree DT with n vertices, we describe a parallel algorithm for computing an Euler tour ET of DT. The input to the algorithm is a set of linked lists in which DT is stored. Every vertex has its own linked list in which its outgoing edges are stored. A node ij in the linked list for vertex v_i consists of two fields: a field edge containing the edge (v_i, v_j) and a field next containing a pointer to the next node. The purpose of the algorithm is to arrange the edges of DT into a single list such that each edge (v_i, v_j) is followed by an edge (v_j, v_k).
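Before moving on to the construction of the tour itself, the list-ranking procedure of Section 2.1 can be illustrated with a small sequential simulation in Python (a sketch under our own data layout, not the PRAM program above); it is also the step that would turn a list of edge successors, such as the one computed below, into explicit positions in the tour.

# Sequential simulation of parallel list ranking by pointer jumping.
# succ[i] is the successor of node i, or None for the last node;
# rank[i] becomes the distance from node i to the end of the list.
def list_rank(succ):
    n = len(succ)
    rank = [0 if succ[i] is None else 1 for i in range(n)]
    nxt = list(succ)
    while any(p is not None for p in nxt):
        # One "parallel" round: add the successor's rank, then jump
        # over the successor.  O(log n) rounds in total.
        new_rank, new_nxt = list(rank), list(nxt)
        for i in range(n):
            if nxt[i] is not None:
                new_rank[i] = rank[i] + rank[nxt[i]]
                new_nxt[i] = nxt[nxt[i]]
        rank, nxt = new_rank, new_nxt
    return rank

# List 0 -> 1 -> 2 -> 3 -> 4: the ranks are 4, 3, 2, 1, 0.
print(list_rank([1, 2, 3, 4, None]))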

On a PRAM, we assume the availability of n - 1 processors, with each processor P_ij, i < j, in charge of two edges of DT, namely (v_i, v_j) and (v_j, v_i). The task of each processor is to determine the position of each of its edges in the final linked list forming ET. It does so by determining the successor of its two edges as follows. If, in the linked list for v_j, edge (v_j, v_i) is followed by some edge (v_j, v_k), then the successor of (v_i, v_j) in ET is (v_j, v_k). Otherwise, that is, if (v_j, v_i) is the last edge in the linked list for v_j, then the successor of (v_i, v_j) in ET is the first edge in the linked list of v_j. The successor of (v_j, v_i) is computed similarly. The same formally:

Successor of (v_i, v_j):
    if next(ji) == jk then succ(ij) := jk
    else succ(ij) := head(v_j)

Successor of (v_j, v_i):
    if next(ij) == im then succ(ji) := im
    else succ(ji) := head(v_i)

The successor of each edge is found in constant time; thus the ET of a DT with n vertices is found in constant time using n - 1 processors. [1]

3 Solving problems using an application platform

3.1 Map/reduce paradigm

Hadoop is a software platform which implements the map/reduce paradigm. The map/reduce paradigm was inspired by functional programming. A map is any user-defined function which, applied to a list of values, produces a list of results; for example it can be the power-of-three function cube(x) = x * x * x. Calling it on the list [1,2,3,4,5] yields [1,8,27,64,125]. A reduce is also any user-defined function that takes a list of values as input, processes them in some order and returns a single result. For instance it can be a product function: res := 1; while there are elements in the input do res := res * next_element; return res. Thus, given the list [1,2,3,4,5], the reducer returns 1*2*3*4*5, in other words 120.

Another common example of map/reduce is word counting. One can count the words occurring in a text document in a parallel manner. To implement this, the map function is defined to assign to every word the number 1, and the reduce function to sum together the numbers attached to each word. The separate words act as keys and are grouped together, after which their ones are summed; this of course gives the number of occurrences of each word in the text. [2] (A small word-count sketch in the Hadoop style is given below.)

3.2 Hadoop implementation

Hadoop is an implementation of the map/reduce paradigm. Using Hadoop you can process petabytes of data in a distributed manner on a cluster of nodes. Hadoop takes data in key/value form, which is then split and fed to a number of user-defined map functions. Each mapper processes the data appropriately and outputs a list of intermediate key/value pairs. The key/value list is sorted according to the keys and partitioned among a number of reducers such that no two pieces of data with the same key are fed to different reducers. Every partition is then sent to the user-defined reduce function. A reducer outputs the final list of key/value pairs after processing its own data.

Hadoop's map/reduce framework is built on top of the Hadoop distributed file system (HDFS), whose architecture closely resembles the Google file system [3]. The basic idea of HDFS is that it is intended to run on inexpensive hardware for large data-intensive applications. To improve fault tolerance and efficiency, HDFS replicates the same data among several nodes. If some datanode of the cluster goes down, the data assigned to it is re-replicated to other datanodes from the replica datanodes still storing this data. HDFS is also aware of the locations of all replicas. It is rack-aware and tries to send data to the nearest available datanode with respect to the data's location, so it optimises data flow across the network.
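The word count mentioned above, written as it might look with Hadoop Streaming, where the mapper and the reducer are ordinary scripts that read lines from standard input and emit tab-separated key/value lines (a sketch under those assumptions; these are not the programs used in our experiments):

#!/usr/bin/env python
# mapper.py -- emits "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t1" % word)

#!/usr/bin/env python
# reducer.py -- input lines arrive sorted by key, so occurrences of
# the same word are adjacent; sum the 1s attached to each word.
import sys

current, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = word, 0
    count += int(value)
if current is not None:
    print("%s\t%d" % (current, count))

Hadoop sorts the mapper output by key, so the reducer sees all occurrences of a word on consecutive lines.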

Figure 2: The master/slave architecture of Hadoop

Hadoop consists of two layers: a map/reduce layer and an HDFS layer. The HDFS layer provides the infrastructure for the map/reduce layer, and the map/reduce layer executes the tasks sent to the mappers and reducers. HDFS provides a familiar file system interface: files are organized hierarchically and identified by pathnames. The HDFS layer contains a namenode, a secondary namenode and datanodes. The namenode is set up on an arbitrary computer which is going to be the master, and it keeps information about all data replicas. HDFS splits data into blocks such that all blocks have the same size (64 MB by default) except possibly the last block of a file. A large block size offers some advantages; for example, it reduces the need to access the namenode, as files consist of only a few blocks. The namenode also ensures that every piece of data is replicated on three different machines. When a client wants to access some piece of data, the namenode translates the request into the blocks the data consists of and returns the block numbers and their locations to the client. The secondary namenode periodically takes snapshots of the namenode's logs for further debugging in case the namenode crashes. Datanodes run on slave machines. Slave machines are the computers in a cluster on which Hadoop's datanodes are running. In the map/reduce layer there is a jobtracker and there are tasktrackers. The jobtracker runs on the master machine and decides which task will be sent to which tasktracker. Tasktrackers execute the tasks assigned to them.

3.3 Distributed breadth-first search

For state space exploration two well-known algorithms can be used: depth-first search (DFS) and breadth-first search (BFS). They are both equally efficient and widely used to traverse the nodes of a graph. DFS is also used as a subroutine for finding the strongly connected components of a graph. However, the problem with DFS is that it cannot be parallelized [4]. As DFS is a central algorithm in many model checkers, other graph-traversal algorithms have been introduced to work around its lack of parallelism. However, most of them are not as efficient as DFS or BFS. Instead of DFS, BFS can be parallelized. Distributed breadth-first search (DBFS), as we will call the parallel version of BFS, works as follows. It shares the nodes of a frontier between the execution nodes, and every execution node generates the successors of its nodes. The produced nodes are then gathered together, and duplicates are removed, as are nodes that were already seen in previous iterations. The rest is the new frontier. It is concatenated with the list of seen nodes and fed again to the execution nodes. This is repeated until there are no unseen nodes.
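A minimal single-machine sketch of this iteration in Python (get_successors is a placeholder for the model-specific successor function; in the distributed version each execution node would expand only its share of the frontier):

# Frontier-by-frontier exploration: expand, drop duplicates and
# already seen nodes, repeat until no unseen nodes remain.
def dbfs(initial_node, get_successors):
    seen = {initial_node}
    frontier = {initial_node}
    while frontier:
        produced = set()
        for node in frontier:
            produced.update(get_successors(node))
        frontier = produced - seen      # the new frontier
        seen |= frontier                # all nodes seen so far
    return seen

# Toy example: a cyclic state space with ten states.
print(len(dbfs(0, lambda n: [(n + 1) % 10, (n + 2) % 10])))   # 10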

The efficiency of DBFS is O((V + E) · V log V), where V is the total number of nodes and E is the total number of edges in the graph. O(V + E) is the time efficiency of BFS, and the factor V log V comes from the fact that on every iteration of DBFS the frontier has to be extracted. Theoretically we could achieve in DBFS the same efficiency as in BFS. For that we would need a table for checking in constant time which nodes have already been visited. The problem is that such a table would need a slot for every possible node and would therefore be of exponential size.

3.4 Simple node space exploration

We designed several implementations of DBFS to run on Hadoop. Each of them has its weaknesses and strengths. The simpler implementations give a rough idea of DBFS but are not optimal. The more efficient ones are, on the other hand, more complicated and thus error-prone. The first implementation, called simple node space exploration, puts nodes into the input folder and calls Hadoop, which in turn generates the successors of those nodes plus the nodes themselves and outputs them into the output folder. This is repeated until the file containing the nodes does not grow any more. Formally:

    S_0 = { initial_node }
    S_{i+1} = ⋃_{j=0..i} get_successors(S_j),   i ∈ N      (1)

In terms of an algorithm, the above mathematical notation looks like this:

S := getinitialstate(model);
new_size := getsize(S);
do
    old_size := new_size;
    S := generatesuccessors(model, S);
    S := sort(S);
    S := removeduplicates(S);
    new_size := getsize(S);
until (old_size == new_size)

Nodes of the graph are expressed as bit vectors. Because the final purpose was to test this approach on Hadoop, and the expected bottleneck of Hadoop is the network bandwidth, we decided to compress the character bit-vector representing a node into a binary bit-vector. In this approach we divide a node into groups of eight bits and encode every group into one integer. Because eight bits can represent numbers up to 255 (a number representable with three characters), we use three characters instead of the original eight. In that way we save space at the price of additional computation. For efficient duplicate removal we first need to sort the node file. For that we used the standard UNIX command sort. After that we remove duplicates with a self-made program, Non-duplicate-output. The size of the node file is obtained with the standard UNIX command wc.

We implemented the above algorithm to run sequentially. In this way we get a rough upper bound for the parallel version of the same algorithm. It is not exact, because the same thing can be done more efficiently, but it gives a bound we do not want to exceed. The parallel version of this algorithm looks much the same, except that sorting and duplicate removal are done by Hadoop. Here the keys are the nodes of the node space, the values are not used, the mapper is a program which outputs the given node and the successors of that node, and the reducer is the identity function. Internally Hadoop works as follows: it splits the input file and assigns every split to some mapper; the mappers in turn output the keys, in other words the nodes they got and the successors of those nodes. Hadoop sorts the output of the mappers with respect to the keys (there are actually no values). The reducers copy the sorted output from each mapper over HTTP across the network (this part is called the shuffle). Simultaneously the reducers merge-sort the keys, because the same key may come from different mappers. This phase, shuffle and sort, will turn out to be the slowest part. After that the reducers would execute a secondary sort on the values of every key, but this is not done because there are no values. Then the actual reduce function, the identity function, is called and the result is output.
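Sketched as Hadoop Streaming scripts (our own illustration of the scheme, not the programs used in the experiments; successors() stands for the model-specific successor generator, and nodes are assumed to arrive one per line):

#!/usr/bin/env python
# mapper.py -- for every node read from standard input, emit the node
# itself and all of its successors; the keys are the nodes, no values.
import sys

def successors(node):
    # Placeholder for the model-specific successor generator.
    return []

for line in sys.stdin:
    node = line.strip()
    if node:
        print(node)
        for succ in successors(node):
            print(succ)

#!/usr/bin/env python
# reducer.py -- the input arrives sorted, so duplicates are adjacent;
# emitting each distinct node once is what the identity reduce
# function amounts to when reduce is called once per distinct key.
import sys

previous = None
for line in sys.stdin:
    node = line.rstrip("\n")
    if node != previous:
        print(node)
        previous = node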

Figure 3: The master/slave architecture of Hadoop

3.5 Node space exploration using set subtraction

The second version of the algorithm distinguishes between internal nodes, whose successors we have already seen, and frontier nodes, whose successors might include new, unseen nodes. So here, in addition to the keys (the nodes), we use values, which can take two different values, f or s: s means that the node is internal and there is no reason to generate its successors, and f means that it is a frontier node and its successors should be generated. The algorithm puts nodes into the input folder and calls Hadoop, which in turn generates, for the frontier nodes, their successors plus the nodes themselves, and outputs them into the output folder. This is repeated as long as the file containing frontier nodes has at least one node. Formally:

    S_0 = { initial_node }
    F_0 = { initial_node }
    S_{i+1} = S_i ∪ get_successors(F_i),   i ∈ N
    F_{i+1} = S_{i+1} \ S_i,               i ∈ N      (2)

In terms of an algorithm, the above mathematical notation looks like this:

S := getinitialstate(model);
F := S;
frontier_size := getsize(F);
do
    F' := generatesuccessors(model, F);
    F := F' \ S;
    S := S ∪ F';
    frontier_size := getsize(F);
until (frontier_size == 0)

As in the case of the first algorithm, we implemented this algorithm as a sequential script to get a rough bound on the running time. When running this algorithm on Hadoop, it works as follows. First, the internal nodes (keys) are put into one input file with the value s and the frontier nodes into another with the value f. The input folder with these files is split and fed to the mappers. A mapper first reads the value of a key: if it is s, the key (node) is output as is; if it is f, the mapper outputs this key with the value s, generates the successors of that node and outputs them with the value f.
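The mapper just described, again sketched as a Hadoop Streaming script (an illustration only; the node<TAB>flag line format and the successors() placeholder are our assumptions):

#!/usr/bin/env python
# mapper.py -- every input line is "node<TAB>flag", where the flag is
# s for an internal node and f for a frontier node.
import sys

def successors(node):
    # Placeholder for the model-specific successor generator.
    return []

for line in sys.stdin:
    node, flag = line.rstrip("\n").split("\t")
    if flag == "s":
        # Internal node: pass it through unchanged.
        print("%s\ts" % node)
    else:
        # Frontier node: it becomes internal, and its successors are
        # candidates for the new frontier.
        print("%s\ts" % node)
        for succ in successors(node):
            print("%s\tf" % succ)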

To reduce network traffic, we also added combiners at this point. Usually the combiner executes the reducer's function; it processes a mapper's output more efficiently than the reducer, as the output of the mapper is still available in memory. The combiner in our algorithm performs a slightly different function than the reducer, because it sees the output of a single mapper only. Our combiner looks through the value list of each key: if it finds the value s there, it outputs the key with the value s, and if it does not, it outputs the key with the value f. In this way we save time by doing some of the reducer's work while the data is still available in memory (not on disk), and we save space (and thus time) by sending, for example, only <node, <f>> and not <node, <f, f, f, f, f, f>>.

4 Conclusions

There are many ways to solve problems in parallel. In this paper we concentrated on two approaches: an algorithm design technique named prefix computation, and map/reduce. We also introduced an application platform which implements map/reduce. Some pros and cons of both approaches were discussed, and many different examples of problems solvable by those techniques were inspected. As we said, prefix computation is more general and thus applicable to a broader set of problems. The drawback is that more time is needed for algorithm design. Map/reduce is simpler, but much of the design work has already been done for you. As we saw from the experiments on Hadoop, its running time for node space generation depends, besides the size of the graph, strongly on the number of iterations. The minimum running time of one Hadoop call is about half a minute, no matter how small the amount of data fed to it; this time is needed for starting the mappers and reducers. This gives a lower bound on how fast Hadoop can solve an instance. To work around this problem, multiple iterations can be done within one run: we can generate not only the successors of a given node but also the successors of the successors.

References

[1] Selim G. Akl. Parallel Computation: Models and Methods. Prentice Hall.
[2] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107-113.
[3] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In 19th ACM Symposium on Operating Systems Principles.
[4] John H. Reif. Depth-first search is inherently sequential.
