From prefix computation on PRAM for finding Euler tours to usage of the Hadoop framework for distributed breadth-first search
Mark Sevalnev

November 22

1 Introduction

In the era of parallelism, problems can be solved in less time simply by increasing the number of computing nodes. The challenge is then, of course, to express the problem so that parallel computing can be used. Many problems are impossible to parallelize by their nature; an example of such a problem is depth-first search (DFS). Also, since parallel algorithms are much more complicated than sequential ones, there is a strong desire to derive parallel versions of sequential algorithms systematically. The Euler tour is a problem that can be computed in parallel. Different techniques can be used for that; one of them is prefix computation. This technique is simple but covers a very broad set of problems. The drawback of prefix computation is that we need to reshape problems to make them solvable by prefix computation. Hadoop is a software framework for creating and running parallel programs. It implements the map/reduce paradigm, in which you reshape the problem into two parts: you specify a mapper program and a reducer program, after which you just feed data to them. Hadoop works in parallel by splitting the input files and sending them first to mappers, from which the processed data is sent to reducers. As Hadoop hides all the details of parallelism, a programmer need not understand the underlying implementation of parallel algorithms. It suffices to fit the whole program into a mapper and a reducer, and Hadoop takes care of the rest. This paper introduces two techniques for solving problems. The first is an algorithm design technique, which tells you what can be done if some constraints are satisfied. The second is an application platform which implements another design technique.
We will study which problems can be solved by prefix computation and Hadoop, how efficiently they can be solved, and how much of the designer's work is required.

2 Algorithms on pointer-based data structures

2.1 Prefix computation

The prefix computation technique is an efficient way to process data stored in linked lists in parallel. Many computational problems can be solved by transforming them into linked lists and then applying the prefix computation technique. The main idea behind prefix computation is pointer jumping. If we have a tree in which a leaf wants to access the root, then using a sequential algorithm we can perform this in time linearly proportional to the number of nodes between the leaf and the root. On the other hand, using a parallel algorithm allows us to access the root from the leaf in logarithmic time. As we know, each tree node except the root has a pointer to its parent. The logarithmic time can be achieved if we instruct the nodes to update their parent pointers in the following manner: if node x points to node y and node y points to node z, then node x makes its pointer point to node z. The figure shows how the nodes update their parent pointers. Here is another example where prefix computation is used. Given a linked list L, it is required to perform the following computation: x_1, x_1 * x_2, x_1 * x_2 * x_3, ... In a sequential setting, we have a pointer
Figure 1: An example of prefix computation
to the head of the list, and we perform the prefix computation by a single traversal of L; the time required is linear in the number of nodes in the list. Now assume we want to solve the same problem in parallel. If all we have is a pointer to the head of the list, there is very little to improve. But in a typical parallel setting it is likely that each processor has a pointer to its own node: L has probably been constructed in parallel, each processor contributing a node. A parallel algorithm for prefix computation on L uses pointer jumping. In each iteration a processor: 1) uses the *-operation to combine its own value with the value stored in its successor node; 2) makes its successor pointer point to its successor's successor node. Although no processor knows the number of nodes in the list, the algorithm terminates when all nodes point to nil. The algorithm PRAM LINKED LIST PREFIX is given below. It is assumed that we have as many processors as there are nodes in the list. The algorithm uses next fields so that the original succ fields are not destroyed.

    for all i do in parallel
        next(i) := succ(i)
    end for
    finished := false
    while not finished do
        finished := true
        for all i do in parallel
            if next(i) != nil then
                val(next(i)) := val(i) * val(next(i))
                next(i) := next(next(i))
            end if
            if next(i) != nil then
                finished := (COMMON) false
            end if
        end for
    end while

The algorithm runs in logarithmic time and uses a linear number of processors. We can use prefix computation as a subroutine to perform another useful computation called list ranking. Given a linked list L, we may want to know the distance from each node to the end of the list. Specifically, for each node i for which succ(i) ≠ nil, we wish to compute rank(i) = rank(succ(i)) + 1; if succ(i) = nil, then rank(i) = 0. Sequentially the problem is solved by traversing the list from beginning to end, converting each node to point to its predecessor.
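As a sanity check, the pointer-jumping algorithm PRAM LINKED LIST PREFIX above can be simulated sequentially in Python. The names val, succ and next follow the pseudocode; the snapshot copies stand in for the synchronous steps of the PRAM, so this is only a sketch of the algorithm's logic, not a parallel implementation:

```python
# Sequential simulation of PRAM LINKED LIST PREFIX. One round of the
# while-loop is simulated by updating all nodes from a snapshot of the
# previous state, mimicking one synchronous parallel step.

def linked_list_prefix(vals, succ, op=lambda a, b: a + b):
    """vals[i]: value of node i; succ[i]: index of node i's successor or None."""
    val = list(vals)
    nxt = list(succ)
    finished = False
    while not finished:
        finished = True
        new_val, new_nxt = list(val), list(nxt)   # snapshot = synchronous step
        for i in range(len(val)):                 # "for all i do in parallel"
            if nxt[i] is not None:
                new_val[nxt[i]] = op(val[i], val[nxt[i]])  # combine into successor
                new_nxt[i] = nxt[nxt[i]]                   # pointer jumping
                if new_nxt[i] is not None:
                    finished = False
        val, nxt = new_val, new_nxt
    return val
```

With op taken as +, node i ends up holding x_1 + ... + x_{i+1}, exactly the prefix values described above; passing a different op gives prefix products and the like.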
After that, the list is traversed in the opposite direction, assigning a rank to each node visited. This takes O(n) time. In parallel this can again be solved in O(log n) time. First, each processor reverses its node's successor pointer so that each node points to its predecessor. Second, the value 1 is assigned to each node. Finally, the prefix computation is performed with the operation taken as +. [1]

2.2 Euler tours

Now that we have learned the concept of prefix computation, we can use it to compute something more practical, namely Euler tours. An Euler tour in a graph is a list of edges such that every edge of the graph is present exactly once in the list and consecutive edges are neighbors in the graph. If an Euler tour exists in a graph, then all vertices have an even degree. This comes from the fact that each vertex must be entered and left, which contributes two edges; if the vertex is visited n times, 2n edges are needed. This leads to the fact that Euler tours always exist in directed trees. By a directed tree we mean any undirected tree which is turned into a directed graph by splitting every edge into two edges going in both directions. Given a directed tree DT with n vertices, we describe a parallel algorithm for computing an Euler tour ET of DT. The input to the algorithm is the set of linked lists in which DT is stored. Every vertex has its own linked list in which its outgoing edges are stored. A node ij in the linked list for vertex v_i consists of two fields: a field edge containing the edge (v_i, v_j) and a field next containing a pointer to the next node. The purpose of the algorithm is to arrange the edges of DT into a single list such that each edge (v_i, v_j) is followed by an edge (v_j, v_k).
On a PRAM, we assume the availability of n - 1 processors, with each processor P_ij, i < j, in charge of two edges of DT, namely (v_i, v_j) and (v_j, v_i). The task of each processor is to determine the position of each of its edges in the final linked list forming ET. It does so by determining the successor of its two edges as follows. If, in the linked list for v_j, edge (v_j, v_i) is followed by some edge (v_j, v_k), then the successor of (v_i, v_j) in ET is (v_j, v_k). Otherwise, that is, if (v_j, v_i) is the last edge in the linked list for v_j, then the successor of (v_i, v_j) in ET is the first edge in the linked list of v_j. The successor of (v_j, v_i) is computed similarly. The same formally:

    Successor of (v_i, v_j):
        if next(ji) == jk then succ(ij) := jk
        else succ(ij) := head(v_j)

    Successor of (v_j, v_i):
        if next(ij) == im then succ(ji) := im
        else succ(ji) := head(v_i)

The successor of each edge is found in constant time; thus the ET of a DT with n vertices is found in constant time using n - 1 processors. [1]

3 Solving problems using an application platform

3.1 Map/reduce paradigm

Hadoop is a software platform which implements the map/reduce paradigm. The map/reduce paradigm was inspired by functional programming. A map is any user-defined function which, applied to a list of values, produces a list of results; for example, it can be the power-of-three function cube(x) = x * x * x. Calling it on the list [1, 2, 3, 4, 5] results in [1, 8, 27, 64, 125]. A reduce is also any user-defined function that takes as input a list of values, processes them in some order, and returns a single result. For instance, it can be the product function (which computes the factorial when fed [1, 2, ..., n]):

    res := 1
    while there are elements in input do
        res := res * next_element
    return res

Thus given the list [1, 2, 3, 4, 5] the reducer returns 1*2*3*4*5, in other words 120. Another common example of map/reduce is word counting. One can count the words occurring in a text document in a parallel manner.
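The Euler-tour successor rule above can be written out sequentially in Python. Here adj plays the role of the per-vertex linked lists (plain Python lists for simplicity); this is an illustrative sketch, not PRAM code:

```python
# Successor computation for an Euler tour of a directed tree.
# adj maps each vertex v to the ordered list of its outgoing edges (v, w).

def euler_tour_successors(adj):
    """Return succ: edge -> next edge in the Euler tour."""
    # Record each edge's position in its vertex's list.
    pos = {e: k for edges in adj.values() for k, e in enumerate(edges)}
    succ = {}
    for edges in adj.values():
        for (v, w) in edges:
            k = pos[(w, v)]                   # position of the reverse edge in w's list
            if k + 1 < len(adj[w]):
                succ[(v, w)] = adj[w][k + 1]  # edge following (w, v)
            else:
                succ[(v, w)] = adj[w][0]      # wrap to the head of w's list
    return succ
```

Following succ from any starting edge visits all 2(n - 1) directed edges exactly once before returning to the start, which is exactly the Euler tour ET.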
To implement this, the map function is defined to assign every word the number 1, and the reduce function to sum together the numbers attached to each word. The separate words work as keys and are grouped together, after which their ones are summed; this of course gives the number of occurrences in the text. [2]

3.2 Hadoop implementation

Hadoop is an implementation of the map/reduce paradigm. Using Hadoop you can process petabytes of data in a distributed manner. Hadoop takes data in key/value form, which is then split and fed to a number of user-defined map functions. Each mapper processes the data appropriately and outputs a list of intermediate key/value pairs. The key/value list is sorted according to the keys and partitioned among a number of reducers such that no two pieces of data with the same key are fed to different reducers. Every partition is then sent to the user-defined reduce function. A reducer outputs the final list of key/value pairs after processing its own data. Hadoop's map/reduce framework is built on top of the Hadoop distributed file system (HDFS), whose architecture closely resembles the Google file system. [3] The basic idea of HDFS is that it is meant to run on inexpensive hardware for large data-intensive applications. To improve fault tolerance and efficiency, HDFS distributes the same data among several nodes. If some datanode of the cluster goes down, the data which was assigned to it is resent to other datanodes from the replica datanodes still storing this data. HDFS is also aware of the data location of all replicas. It is rack-aware and tries to send data to the nearest available datanode with respect to that data's location, thus optimising data flow across the network.
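The word-counting map/reduce from Section 3.1 can be sketched in plain Python; the functions below only mimic what Hadoop does (map, sort by key, reduce) and use no Hadoop API:

```python
# A minimal word count in map/reduce style: the mapper emits (word, 1)
# pairs, the framework sorts and groups by key, and the reducer sums
# the ones attached to each word.

from itertools import groupby

def mapper(line):
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    return (word, sum(counts))

def map_reduce(lines):
    pairs = [kv for line in lines for kv in mapper(line)]
    pairs.sort(key=lambda kv: kv[0])            # Hadoop's sort phase
    return dict(reducer(k, [v for _, v in g])   # one reduce call per key
                for k, g in groupby(pairs, key=lambda kv: kv[0]))
```

On Hadoop the sort-and-group step happens inside the framework between the map and reduce phases; here it is spelled out explicitly.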
Figure 2: The master/slave architecture of Hadoop

Hadoop consists of two layers: the map/reduce layer and the HDFS layer. The HDFS layer provides the infrastructure for the map/reduce layer, and the map/reduce layer executes the tasks sent to the mappers and reducers. HDFS provides a familiar file system interface: files are organized hierarchically and identified by pathnames. The HDFS layer contains a namenode, a secondary namenode and datanodes. The namenode is set up on an arbitrary computer which is going to be the master, and it keeps information on all data replicas. HDFS splits data into blocks such that all blocks have the same size (64 MB by default) except possibly the last block of a file. A large block size offers some advantages; for example, it reduces the need to access the namenode, as files consist of few blocks. The namenode also ensures that every piece of data is replicated on three different machines. When a client wants to access some piece of data, the namenode translates the request into the locations of the blocks the data consists of and returns the block numbers and their locations to the client. The secondary namenode periodically takes snapshots of the namenode's logs for further debugging in case the namenode crashes. Datanodes work on the slave machines, the computers in the cluster on which Hadoop's datanodes are running. In the map/reduce layer there are a jobtracker and tasktrackers. The jobtracker works on the master machine and decides which task will be sent to which tasktracker. Tasktrackers execute the tasks assigned to them.

3.3 Distributed breadth-first search

For state space exploration, two well-known algorithms can be used: depth-first search (DFS) and breadth-first search (BFS). They are equally efficient and widely used to traverse the nodes of a graph. DFS is also used as a subroutine for finding the strongly connected components of a graph. However, the problem with DFS is that it cannot be parallelized [4].
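For reference, plain sequential BFS, the algorithm whose parallel version is discussed in this section, can be sketched over an adjacency-list graph; it runs in O(V + E) time:

```python
# Standard sequential breadth-first search: visit vertices level by level,
# marking each vertex as seen the first time it is reached.

from collections import deque

def bfs(adj, start):
    """adj: dict vertex -> list of neighbors. Returns vertices in BFS order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in seen:
                seen.add(w)       # mark before enqueueing to avoid duplicates
                queue.append(w)
    return order
```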
As DFS is a central algorithm in many model checkers, other graph-traversal algorithms were introduced to work around its resistance to parallelization. However, most of them are not as efficient as DFS or BFS. Instead of DFS, BFS can be parallelized. Distributed breadth-first search (DBFS), as we will call the parallel version of BFS, works as follows. It shares the nodes of the frontier between execution nodes, and every execution node generates the successors of its nodes. Then the produced nodes are gathered together; duplicates are removed, as well as nodes that were already seen in previous iterations. The rest is the new frontier. It is concatenated with the list of seen nodes and fed again to the execution nodes. This is repeated until there are no unseen nodes. The efficiency of DBFS is
O((V + E) · V log V), where V is the total number of nodes and E is the total number of edges in the graph. O(V + E) is the time efficiency of BFS, and the factor V log V comes from the fact that on every iteration of DBFS a frontier must be extracted. Theoretically we can achieve in DBFS the same efficiency as in BFS. For that we need a table in which we can inspect in constant time which nodes have already been visited. The problem is that such a table would have exponential size with respect to the length of a node's representation.

3.4 Simple node space exploration

We designed several implementations of DBFS to run on Hadoop. Each of them has its weaknesses and strengths. The simpler implementations give a rough idea of DBFS but are not optimal. The more efficient ones are, on the other hand, more complicated and thus error-prone. The first implementation, called simple node space exploration, puts nodes into the input folder and calls Hadoop, which in turn generates the successors of those nodes plus the nodes themselves and outputs them into the output folder. This is repeated until the file containing nodes does not grow any more. Formally:

    S_0 = initial_node
    S_{i+1} = ∪_{j=0}^{i} get_successors(S_j), i ∈ N    (1)

In terms of an algorithm, the above mathematical notation looks like this:

    S := getinitialstate(model)
    new_size := getsize(S)
    do
        old_size := new_size
        S := generatesuccessors(model, S)
        S := sort(S)
        S := removeduplicates(S)
        new_size := getsize(S)
    while (old_size != new_size)

Nodes of the graph are expressed as bit-vectors. Because the final purpose was to test this approach on Hadoop, and presumably the bottleneck of Hadoop is the network's bandwidth, we decided to compress the character bit-vector representing a node into a binary bit-vector. In this approach we divide a node into groups of eight digits and every group is encoded into one integer. Because eight bits can represent numbers up to 255 (a number written with three characters), we use at most three characters instead of the original eight.
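The compression step just described might be sketched as follows; the comma-separated output format is our assumption for illustration, since the exact encoding is not specified:

```python
# Pack a node's character bit-vector into groups of eight bits, each
# group becoming one integer 0..255 (at most three characters instead
# of the original eight).

def compress(bits):
    """bits: string of '0'/'1' characters, length a multiple of 8."""
    return ",".join(str(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))

def decompress(packed):
    """Inverse of compress: expand each integer back to eight bit characters."""
    return "".join(format(int(x), "08b") for x in packed.split(","))
```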
In that way we save space at the price of additional computation. For efficient removal of duplicates we first need to sort the node file. For that we used the standard UNIX command sort. After that we remove duplicates with the self-made program Non-duplicate-output. The size of the node file is obtained using the standard UNIX command wc. We implemented the above algorithm to run sequentially. In this way we get a rough upper bound for the parallel version of the same algorithm. It is not exact, because the same thing can be done more efficiently, but it gives a bound we do not want to exceed. The parallel version of this algorithm looks much the same, with the exception that sorting and duplicate removal are done by Hadoop. So here the keys are the nodes of the node space, values are not used, the mapper is a program which outputs the given node and the successors of that node, and the reducer is the identity function. Internally Hadoop works as follows: it splits the input file and assigns every split to some mapper; the mappers in turn output the keys, in other words the nodes they received and the successors of those nodes. Hadoop sorts the output of the mappers with respect to the keys (there are actually no values). The reducers copy the sorted output from each mapper over the network using HTTP (this part is called the shuffle). Simultaneously the reducers merge-sort the keys, because the same key may come from different mappers. This phase, shuffle and sort, will turn out to be the slowest part. After that the reducers would execute a secondary sort on the values of every key, but this
Figure 3: The master/slave architecture of Hadoop

is not done, because there are no values. Then the actual reduce function, the identity function, is called, and the result is output.

3.5 Node space exploration using set subtraction

The second version of the algorithm distinguishes between internal nodes, whose successors we have already seen, and frontier nodes, whose successors might include new, unseen nodes. So here, in addition to the keys (nodes), we use values, which can take two different values, f or s: s means that the node is internal and there is no reason to generate its successors, and f means that it is a frontier node and its successors should be generated. The algorithm puts nodes into the input folder and calls Hadoop, which in turn generates the successors of the frontier nodes plus the nodes themselves and outputs them into the output folder. This is repeated as long as the file containing frontier nodes has at least one node. Formally:

    S_0 = initial_node
    F_0 = initial_node
    S_{i+1} = S_i ∪ get_successors(F_i), i ∈ N    (2)
    F_{i+1} = S_{i+1} \ S_i, i ∈ N

In terms of an algorithm, the above mathematical notation looks like this:

    S := getinitialstate(model)
    F := S
    frontier_size := getsize(F)
    do
        F' := generatesuccessors(model, F)
        F := F' \ S
        S := S ∪ F
        frontier_size := getsize(F)
    while (frontier_size > 0)

As in the case of the first algorithm, we implemented this algorithm as a sequential script to get a rough bound on the running time. When run on Hadoop, it works as follows. First, the internal nodes (keys) are put into one input file with the value s and the frontier nodes into another with the value f. The input folder with these files is split and fed to the mappers. The mapper first reads the value of a key: if it is s, the key (node) is output as is; if it is f, the mapper outputs this key with the value s, generates the successors of that node, and outputs them with the value f. To reduce network traffic we also added combiners at this point.
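The frontier-based exploration just described can be simulated sequentially in Python; get_successors stands in for the model's transition function, which on Hadoop is computed by the mappers:

```python
# Sequential sketch of frontier-based node space exploration with set
# subtraction: S accumulates all seen nodes, F is the frontier whose
# successors still need to be generated.

def explore(initial, get_successors):
    S = {initial}
    F = {initial}
    while F:                      # repeat while the frontier is non-empty
        F_new = set()
        for node in F:
            F_new |= set(get_successors(node))
        F = F_new - S             # set subtraction removes already-seen nodes
        S |= F                    # the new frontier joins the seen set
    return S
```

Only the (usually small) frontier is expanded on each round, which is exactly what the f/s values buy on Hadoop.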
Usually the combiner executes the reducer's function; it processes a mapper's output more efficiently than the reducer, since the output of the mapper is available
in memory. The combiner in our algorithm performs a slightly different function than the reducer, because it sees the output of a single mapper only. Our combiner looks through the value list of each key: if it finds the value s there, it outputs the key with the value s, and if it does not, it outputs the key with the value f. In such a way we save time by doing some of the reducer's work on data already available in memory (not on disk), and save space (and thus time) by sending, for example, only <node, <f>> and not <node, <f, f, f, f, f, f>>.

4 Conclusions

There are many ways to solve problems in parallel. In this paper we concentrated on two approaches: an algorithm design technique named prefix computation, and map/reduce. We also introduced an application platform which implements map/reduce. Some pros and cons of both approaches were discussed, and many different examples of problems solvable by those techniques were inspected. As we said, prefix computation is more general and thus applicable to a broader set of problems. The drawback is that more time is needed for algorithm design. Map/reduce is simpler, but much of the design work is already done for you. As we saw from the experiments on Hadoop, the running time of node space generation depends, besides on the size of the graph, strongly on the number of iterations. The minimum running time of one Hadoop call is about half a minute, no matter how small the amount of data fed to it; this time is needed for starting the mappers and reducers. Thus we get a lower bound on how fast Hadoop can solve some instance. To work around this problem, multiple iterations can be done within one run: we can generate not only the successors of a given node but also the successors of the successors.

References

[1] Selim G. Akl. Parallel Computation: Models and Methods. Prentice Hall, 1997.

[2] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107-113, 2008.

[3] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung.
The Google file system. In 19th ACM Symposium on Operating Systems Principles, 2003.

[4] John H. Reif. Depth-first search is inherently sequential. Information Processing Letters, 1985.
More informationDepartment of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang
Department of Computer Science San Marcos, TX 78666 Report Number TXSTATE-CS-TR-2010-24 Clustering in the Cloud Xuan Wang 2010-05-05 !"#$%&'()*+()+%,&+!"-#. + /+!"#$%&'()*+0"*-'(%,1$+0.23%(-)+%-+42.--3+52367&.#8&+9'21&:-';
More informationtime using O( n log n ) processors on the EREW PRAM. Thus, our algorithm improves on the previous results, either in time complexity or in the model o
Reconstructing a Binary Tree from its Traversals in Doubly-Logarithmic CREW Time Stephan Olariu Michael Overstreet Department of Computer Science, Old Dominion University, Norfolk, VA 23529 Zhaofang Wen
More informationYour First Hadoop App, Step by Step
Learn Hadoop in one evening Your First Hadoop App, Step by Step Martynas 1 Miliauskas @mmiliauskas Your First Hadoop App, Step by Step By Martynas Miliauskas Published in 2013 by Martynas Miliauskas On
More informationChapter Fourteen Bonus Lessons: Algorithms and Efficiency
: Algorithms and Efficiency The following lessons take a deeper look at Chapter 14 topics regarding algorithms, efficiency, and Big O measurements. They can be completed by AP students after Chapter 14.
More informationGraph implementations :
Graphs Graph implementations : The two standard ways of representing a graph G = (V, E) are adjacency-matrices and collections of adjacencylists. The adjacency-lists are ideal for sparse trees those where
More informationTIE Graph algorithms
TIE-20106 239 11 Graph algorithms This chapter discusses the data structure that is a collection of points (called nodes or vertices) and connections between them (called edges or arcs) a graph. The common
More informationParallel Euler tour and Post Ordering for Parallel Tree Accumulations
Parallel Euler tour and Post Ordering for Parallel Tree Accumulations An implementation technical report Sinan Al-Saffar & David Bader University Of New Mexico Dec. 2003 Introduction Tree accumulation
More informationWhat Is Datacenter (Warehouse) Computing. Distributed and Parallel Technology. Datacenter Computing Architecture
What Is Datacenter (Warehouse) Computing Distributed and Parallel Technology Datacenter, Warehouse and Cloud Computing Hans-Wolfgang Loidl School of Mathematical and Computer Sciences Heriot-Watt University,
More information6 ROUTING PROBLEMS VEHICLE ROUTING PROBLEMS. Vehicle Routing Problem, VRP:
6 ROUTING PROBLEMS VEHICLE ROUTING PROBLEMS Vehicle Routing Problem, VRP: Customers i=1,...,n with demands of a product must be served using a fleet of vehicles for the deliveries. The vehicles, with given
More informationBig Data Management and NoSQL Databases
NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model
More informationEvaluation of Apache Hadoop for parallel data analysis with ROOT
Evaluation of Apache Hadoop for parallel data analysis with ROOT S Lehrack, G Duckeck, J Ebke Ludwigs-Maximilians-University Munich, Chair of elementary particle physics, Am Coulombwall 1, D-85748 Garching,
More informationBatch Inherence of Map Reduce Framework
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 6, June 2015, pg.287
More informationBigData and Map Reduce VITMAC03
BigData and Map Reduce VITMAC03 1 Motivation Process lots of data Google processed about 24 petabytes of data per day in 2009. A single machine cannot serve all the data You need a distributed system to
More informationThe Google File System. Alexandru Costan
1 The Google File System Alexandru Costan Actions on Big Data 2 Storage Analysis Acquisition Handling the data stream Data structured unstructured semi-structured Results Transactions Outline File systems
More informationCPSC W1: Midterm 1 Sample Solution
CPSC 320 2017W1: Midterm 1 Sample Solution January 26, 2018 Problem reminders: EMERGENCY DISTRIBUTION PROBLEM (EDP) EDP's input is an undirected, unweighted graph G = (V, E) plus a set of distribution
More informationA BigData Tour HDFS, Ceph and MapReduce
A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!
More informationOptimum Alphabetic Binary Trees T. C. Hu and J. D. Morgenthaler Department of Computer Science and Engineering, School of Engineering, University of C
Optimum Alphabetic Binary Trees T. C. Hu and J. D. Morgenthaler Department of Computer Science and Engineering, School of Engineering, University of California, San Diego CA 92093{0114, USA Abstract. We
More informationLocality Aware Fair Scheduling for Hammr
Locality Aware Fair Scheduling for Hammr Li Jin January 12, 2012 Abstract Hammr is a distributed execution engine for data parallel applications modeled after Dryad. In this report, we present a locality
More informationHadoop. copyright 2011 Trainologic LTD
Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides
More informationInternational Journal of Advance Engineering and Research Development. A Study: Hadoop Framework
Scientific Journal of Impact Factor (SJIF): e-issn (O): 2348- International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 A Study: Hadoop Framework Devateja
More informationApril Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model.
1. MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model. MapReduce is a framework for processing big data which processes data in two phases, a Map
More informationData Structure. IBPS SO (IT- Officer) Exam 2017
Data Structure IBPS SO (IT- Officer) Exam 2017 Data Structure: In computer science, a data structure is a way of storing and organizing data in a computer s memory so that it can be used efficiently. Data
More informationHadoop MapReduce Framework
Hadoop MapReduce Framework Contents Hadoop MapReduce Framework Architecture Interaction Diagram of MapReduce Framework (Hadoop 1.0) Interaction Diagram of MapReduce Framework (Hadoop 2.0) Hadoop MapReduce
More informationCLIENT DATA NODE NAME NODE
Volume 6, Issue 12, December 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Efficiency
More informationAlgorithms and Data Structures. Marcin Sydow. Introduction. QuickSort. Sorting 2. Partition. Limit. CountSort. RadixSort. Summary
Sorting 2 Topics covered by this lecture: Stability of Sorting Quick Sort Is it possible to sort faster than with Θ(n log(n)) complexity? Countsort Stability A sorting algorithm is stable if it preserves
More informationAn On-line Variable Length Binary. Institute for Systems Research and. Institute for Advanced Computer Studies. University of Maryland
An On-line Variable Length inary Encoding Tinku Acharya Joseph F. Ja Ja Institute for Systems Research and Institute for Advanced Computer Studies University of Maryland College Park, MD 242 facharya,
More informationMore PRAM Algorithms. Techniques Covered
More PRAM Algorithms Arvind Krishnamurthy Fall 24 Analysis technique: Brent s scheduling lemma Techniques Covered Parallel algorithm is simply characterized by W(n) and S(n) Parallel techniques: Scans
More informationMap-Reduce. John Hughes
Map-Reduce John Hughes The Problem 850TB in 2006 The Solution? Thousands of commodity computers networked together 1,000 computers 850GB each How to make them work together? Early Days Hundreds of ad-hoc
More informationChapter 6. Parallel Algorithms. Chapter by M. Ghaari. Last update 1 : January 2, 2019.
Chapter 6 Parallel Algorithms Chapter by M. Ghaari. Last update 1 : January 2, 2019. This chapter provides an introduction to parallel algorithms. Our highlevel goal is to present \how to think in parallel"
More information(Refer Slide Time: 01.26)
Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture # 22 Why Sorting? Today we are going to be looking at sorting.
More informationPLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
More informationBigtable. Presenter: Yijun Hou, Yixiao Peng
Bigtable Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber Google, Inc. OSDI 06 Presenter: Yijun Hou, Yixiao Peng
More informationPhysical Level of Databases: B+-Trees
Physical Level of Databases: B+-Trees Adnan YAZICI Computer Engineering Department METU (Fall 2005) 1 B + -Tree Index Files l Disadvantage of indexed-sequential files: performance degrades as file grows,
More informationD. Θ nlogn ( ) D. Ο. ). Which of the following is not necessarily true? . Which of the following cannot be shown as an improvement? D.
CSE 0 Name Test Fall 00 Last Digits of Mav ID # Multiple Choice. Write your answer to the LEFT of each problem. points each. The time to convert an array, with priorities stored at subscripts through n,
More informationHadoop and HDFS Overview. Madhu Ankam
Hadoop and HDFS Overview Madhu Ankam Why Hadoop We are gathering more data than ever Examples of data : Server logs Web logs Financial transactions Analytics Emails and text messages Social media like
More information17/05/2018. Outline. Outline. Divide and Conquer. Control Abstraction for Divide &Conquer. Outline. Module 2: Divide and Conquer
Module 2: Divide and Conquer Divide and Conquer Control Abstraction for Divide &Conquer 1 Recurrence equation for Divide and Conquer: If the size of problem p is n and the sizes of the k sub problems are
More informationMITOCW watch?v=w_-sx4vr53m
MITOCW watch?v=w_-sx4vr53m The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To
More informationWe assume uniform hashing (UH):
We assume uniform hashing (UH): the probe sequence of each key is equally likely to be any of the! permutations of 0,1,, 1 UH generalizes the notion of SUH that produces not just a single number, but a
More informationYuval Carmel Tel-Aviv University "Advanced Topics in Storage Systems" - Spring 2013
Yuval Carmel Tel-Aviv University "Advanced Topics in About & Keywords Motivation & Purpose Assumptions Architecture overview & Comparison Measurements How does it fit in? The Future 2 About & Keywords
More informationCOMP Parallel Computing. PRAM (3) PRAM algorithm design techniques
COMP 633 - Parallel Computing Lecture 4 August 30, 2018 PRAM algorithm design techniques Reading for next class PRAM handout section 5 1 Topics Parallel connected components algorithm representation of
More information2/26/2017. For instance, consider running Word Count across 20 splits
Based on the slides of prof. Pietro Michiardi Hadoop Internals https://github.com/michiard/disc-cloud-course/raw/master/hadoop/hadoop.pdf Job: execution of a MapReduce application across a data set Task:
More informationOutline. Graphs. Divide and Conquer.
GRAPHS COMP 321 McGill University These slides are mainly compiled from the following resources. - Professor Jaehyun Park slides CS 97SI - Top-coder tutorials. - Programming Challenges books. Outline Graphs.
More informationresidual residual program final result
C-Mix: Making Easily Maintainable C-Programs run FAST The C-Mix Group, DIKU, University of Copenhagen Abstract C-Mix is a tool based on state-of-the-art technology that solves the dilemma of whether to
More informationHash Tables. CS 311 Data Structures and Algorithms Lecture Slides. Wednesday, April 22, Glenn G. Chappell
Hash Tables CS 311 Data Structures and Algorithms Lecture Slides Wednesday, April 22, 2009 Glenn G. Chappell Department of Computer Science University of Alaska Fairbanks CHAPPELLG@member.ams.org 2005
More informationAlgorithms for Grid Graphs in the MapReduce Model
University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Computer Science and Engineering: Theses, Dissertations, and Student Research Computer Science and Engineering, Department
More informationDistributed computing: index building and use
Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput
More informationGlobal Journal of Engineering Science and Research Management
A FUNDAMENTAL CONCEPT OF MAPREDUCE WITH MASSIVE FILES DATASET IN BIG DATA USING HADOOP PSEUDO-DISTRIBUTION MODE K. Srikanth*, P. Venkateswarlu, Ashok Suragala * Department of Information Technology, JNTUK-UCEV
More informationIntroduction to MapReduce
732A54 Big Data Analytics Introduction to MapReduce Christoph Kessler IDA, Linköping University Towards Parallel Processing of Big-Data Big Data too large to be read+processed in reasonable time by 1 server
More informationSolutions to relevant spring 2000 exam problems
Problem 2, exam Here s Prim s algorithm, modified slightly to use C syntax. MSTPrim (G, w, r): Q = V[G]; for (each u Q) { key[u] = ; key[r] = 0; π[r] = 0; while (Q not empty) { u = ExtractMin (Q); for
More informationImproved MapReduce k-means Clustering Algorithm with Combiner
2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation Improved MapReduce k-means Clustering Algorithm with Combiner Prajesh P Anchalia Department Of Computer Science and Engineering
More informationTHE EULER TOUR TECHNIQUE: EVALUATION OF TREE FUNCTIONS
PARALLEL AND DISTRIBUTED ALGORITHMS BY DEBDEEP MUKHOPADHYAY AND ABHISHEK SOMANI http://cse.iitkgp.ac.in/~debdeep/courses_iitkgp/palgo/index.htm THE EULER TOUR TECHNIQUE: EVALUATION OF TREE FUNCTIONS 2
More informationHadoop On Demand: Configuration Guide
Hadoop On Demand: Configuration Guide Table of contents 1 1. Introduction...2 2 2. Sections... 2 3 3. HOD Configuration Options...2 3.1 3.1 Common configuration options...2 3.2 3.2 hod options... 3 3.3
More information