
Nathan Crandall, 3970001
Dane Pitkin, 4085726

CS140 Final Project

Introduction:

Our goal was to parallelize the breadth-first search (BFS) algorithm using Cilk++. BFS works by starting at an initial vertex and exploring all of its neighbors, pushing them into a queue until every vertex one level away is in the queue. It then pops off the initial vertex and repeats the process for every neighbor in the queue. The algorithm thus searches the graph's levels 0, 1, 2, 3, ..., k away from the starting vertex, in that order, which is a breadth-first search. We started with a sequential version of this algorithm that uses a queue as described above. To parallelize it, a different data structure must be used in place of the queue: since a queue can only push or pop one item at a time, it cannot be accessed in parallel. Several data structures could replace the queue. The two implementations we tried were STL lists wrapped in a predefined Cilk reducer, and our own data structure, called a Bag, wrapped in a custom reducer.

Algorithm:

The parallel algorithm is fairly similar to the sequential version. We still have a while loop that runs once per level of the graph, but instead of computing the neighbors inside the while loop, we call a recursive function to do it. This recursive function takes two Bags: one containing the vertices of the current level, and one for storing all newly explored neighbors at the next level. It first compares the size of the input Bag against the grain size (the grain size is customizable and was tuned to provide maximum runtime efficiency). If the input Bag is larger than the grain size, the function splits the Bag in half, spawns a recursive call on the new half Bag, and then recurses on the old half Bag in the same thread. Once the Bag has been split into pieces smaller than the grain size, each call computes the neighbors of the elements in its piece and stores them in the output Bag (which is a hyperobject, so it can be written to in parallel). A cilk_for loop explores the neighbors of each vertex; this is the limit of the parallelism in our algorithm.
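
The sketch below shows the shape of this recursion. It is a minimal illustration rather than our exact code: Bag, BagReducer, and Graph stand in for our classes, GRAIN_SIZE and the visited-marking details are simplified placeholders, and only the Cilk++ keywords (cilk_spawn, cilk_sync, cilk_for) are taken as-is from the language.

```cpp
const int GRAIN_SIZE = 128;   // tuned empirically per machine and input

void process_level(Bag& in_bag, BagReducer& out_bag,
                   const Graph& g, bool* visited)
{
    if (in_bag.size() > GRAIN_SIZE) {
        Bag other = in_bag.split();                  // O(log n): take half
        cilk_spawn process_level(other, out_bag, g, visited);
        process_level(in_bag, out_bag, g, visited);  // other half, same thread
        cilk_sync;
        return;
    }
    // Base case: this piece is below the grain size; expand its vertices.
    for (int j = 0; j < in_bag.size(); ++j) {
        int v = in_bag.element(j);                   // placeholder accessor
        cilk_for (int i = 0; i < g.degree(v); ++i) { // parallel over edges
            int w = g.neighbor(v, i);
            if (!visited[w]) {       // a duplicate insert would be harmless
                visited[w] = true;   // here, since a Bag is a multiset
                out_bag.insert(w);   // hyperobject view: safe in parallel
            }
        }
    }
}
```

The outer while loop then feeds each level's output Bag back in as the next level's input, until a level produces no new vertices.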

Data structure:

The final implementation of our parallel breadth-first search uses a Bag data structure containing Pennant trees. A Bag is a multiset, and therefore supports the union operation. The Bag is an array of uniquely sized Pennant trees. A Pennant tree is a complete binary tree with one extra node above the complete binary tree's root; that extra node serves as the Pennant tree's root. Each index i of the Bag corresponds to a Pennant tree of size 2^i, i.e. a complete binary tree of 2^i - 1 nodes plus the Pennant tree's root. Because of this property, the Bag works like a binary counter: each index of the Bag either contains a Pennant tree or a null pointer, with each Pennant tree representing a bit. Inserting an element into the Bag is just a matter of unioning a single-element Pennant tree into the Bag, incrementing the Bag's size by 1 exactly as a binary counter increments.

Inserting takes O(1) amortized time. Any Bag of even size has an insert cost of O(1) (a Pennant tree of size 1 is placed in the first position of the Bag in O(1) time), while a Bag of odd size may take up to O(log n) to insert one element. But because that O(log n) insert pushes the Pennant trees deeper into the Bag, it frees up many O(1) inserts before another O(log n) insert needs to happen, which is why insertion is O(1) amortized.

Our Bag can not only merge with another Bag; it can also split itself in half and return a second Bag containing the other half of the elements. Both operations take O(log n) time. The merge operation works like a binary full adder. It starts at the lowest index of each Bag and merges the two trees there, where three cases may occur. First, if exactly one of the Pennant trees exists, the non-empty Pennant tree is pushed onto the resulting Bag. Second, if both Bags contain a tree at this index, the trees are merged together and carried to the next index of the merge. Third, if neither Bag has a tree at this index, a null pointer is assigned to the resulting Bag. Once a carry exists, every subsequent step must take a fourth case into account: the carried tree must be merged with the next two Pennant trees, still following the rules of a binary full adder. split() is the inverse of merge(): each Pennant tree at index i is split in half, and the two resulting Pennant trees are placed at index i - 1, one in each output Bag. These operations are O(log n) because they perform k steps, where k = the number of Pennant trees in the Bag = log n (n = the number of elements in the Bag), and each step works in O(1) time (merging or splitting Pennant trees only rearranges their roots). These functions allow the Bag to be added to and split apart at any time, which makes it a great choice for a parallel breadth-first search.

However, in order for our data structure to be accessed in parallel, it must be wrapped in a Cilk hyperobject. Since we wrote our own data structure, there is no pre-made Cilk hyperobject we could use, so we had to write our own. The type of hyperobject we implemented is a Cilk reducer. This is the appropriate option because our parallel algorithm requires the Bag to be written to in parallel, and the parallel views must be reduced to a single entity when the algorithm reaches a sequential portion of code. The reducer is written in the Bag header file and is essentially a wrapper around the Bag class. This Bag reducer includes a Cilk Monoid class (a monoid in the mathematical sense) with our static reduce function, which simply merges the Bags from different workers into a single entity. A reducer is very efficient and has very little overhead; the overhead occurs during work stealing, when one Bag instance temporarily has multiple views on multiple workers and the current versions must be reduced back into the original memory location, destroying the outdated Bags. In our Bag reducer wrapper class, we use the only data member, a cilk::reducer<Monoid>, to access the Bag's functions. All of this implements the functionality of the Bag as stated previously while abstracting away the problem of data races.
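
To make the binary-counter analogy concrete, here is a minimal sketch of the two pointer manipulations everything rests on. The node layout and names (Pennant, backbone) are illustrative placeholders, not our exact code:

```cpp
#include <cstddef>   // NULL

struct Pennant {
    int      value;   // payload: a vertex id
    Pennant* left;    // root of a complete binary tree of 2^k - 1 nodes
    Pennant* right;   // unused at a Pennant's root node
};

// Union two Pennant trees of equal size 2^k into one of size 2^(k+1).
// Only two pointers move, so this is O(1).
Pennant* pennant_union(Pennant* x, Pennant* y) {
    y->right = x->left;   // y absorbs x's complete tree as a subtree,
    x->left  = y;         // becoming the root of x's new complete tree
    return x;
}

struct Bag {
    static const int MAX = 64;  // supports up to 2^64 - 1 elements
    Pennant* backbone[MAX];     // backbone[k]: size-2^k Pennant or NULL

    // Insert one element: a binary-counter increment. Carry equal-sized
    // Pennant trees upward until an empty slot absorbs the result.
    void insert(Pennant* p) {   // p is a single-node Pennant (size 2^0)
        int k = 0;
        while (backbone[k] != NULL) {       // slot taken: "carry the bit"
            p = pennant_union(backbone[k], p);
            backbone[k] = NULL;
            ++k;
        }
        backbone[k] = p;        // first empty slot stops the carry
    }
};
```

Splitting a Pennant is the same trick in reverse: detach the root's child and divide its two subtrees between the halves, again in O(1).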
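
And a sketch of the reducer wrapper itself, following the cilk::monoid_base / cilk::reducer pattern from the Cilk headers. The member spellings below follow the Cilk Plus headers (the Cilk++ release we used differed slightly), and it assumes a Bag class with insert(int) and an O(log n) merge(Bag&) as described above:

```cpp
#include <cilk/reducer.h>

class BagReducer
{
    // The monoid: the identity is an empty (default-constructed) Bag,
    // and reduce() folds the right-hand view into the left-hand one.
    struct Monoid : cilk::monoid_base<Bag>
    {
        static void reduce(Bag* left, Bag* right)
        {
            left->merge(*right);    // O(log n) Bag union
        }
    };

    cilk::reducer<Monoid> imp_;     // the only data member

public:
    void insert(int v)  { imp_.view().insert(v); }
    Bag& get_value()    { return imp_.view(); }
};
```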

3D Graph Viewer:

Our 3D graph viewer turned out surprisingly well: it depicts the graph read in from any edge file, annotated with useful information, and offers a different way to view and think about the file. When we started the project, looking at an edge file was essentially useless once there were more than 10 or 20 edges; it was just a huge, incomprehensible file of large numbers. Once we ran a file through our graph viewer, we could easily understand how the graph was laid out and how it should be represented in memory.

The program was created with OpenGL, using Professor Gilbert's code to read in an edge file; window creation was handled by GLUT (the OpenGL Utility Toolkit). At first, each node was represented by a 2D disk created with the GLU quadric library, and edges were drawn as solid red lines. Eventually we updated the program for clarity: the nodes became 3D spheres (again via GLU quadrics), and each edge was drawn with a white vertex at its tail and a red vertex at its head. This gradient along the edges made it much easier to tell how vertices were connected and where edges begin and end. After this, we used GLUT bitmap fonts to identify the nodes so we could tell which vertex was which: we assigned an ID variable to each node, converted the label to a string, and fed it to the render call for a GLUT bitmap font.

We then realized our graph viewer was rather static, since the user could not move around or zoom into the graph. To solve this, we added keyboard input: the arrow keys move the camera along the x- and y-axes, W advances the camera along the z-axis (in this case, the negative z-axis), S retreats the camera from the graph, and A and D move the camera along the x-axis without updating the look-at vector, which gives the user a better sense of depth because the camera effectively rotates around the y-axis. The last step in making this a nice-looking program was adding smooth shading and lighting. Luckily, the GLU quadric library provides a function for computing normals on spheres, so those calculations were already done for us, and OpenGL makes it easy to add lighting to the scene, which in turn takes care of the shading on the spheres. At that point we felt our graph viewer was as effective as it could be, balanced with a nice front end.

However, one thing that disappointed us was that we could not compile the graph viewer with Cilk++. GLUT registers its callbacks through plain function pointers, for example glutKeyboardFunc(void (*)(unsigned char, int, int)). When compiling with Cilk++, function signatures are changed to void (cilk*)(unsigned char, int, int), which broke the GLUT callback registration and rendered GLUT useless. This was disappointing because we had wanted to run the OpenGL render calls on one thread and use the rest of the threads to perform the BFS; it would have been an interesting experiment to see how the OpenGL render calls impact system performance, and more importantly how they would impact our BFS algorithm.
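
For reference, a minimal sketch of the GLUT wiring described above. The camera step sizes and key handling are illustrative rather than our exact values, and the arrow keys would be registered separately through glutSpecialFunc:

```cpp
#include <GL/glut.h>

static float camX = 0.0f, camY = 0.0f, camZ = 10.0f;

static void display(void)
{
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    // ... position the camera at (camX, camY, camZ), then draw the
    //     spheres, edges, and bitmap-font node labels ...
    glutSwapBuffers();
}

static void keyboard(unsigned char key, int x, int y)
{
    switch (key) {
    case 'w': camZ -= 0.5f; break;   // advance into the scene (-z)
    case 's': camZ += 0.5f; break;   // retreat from the graph
    case 'a': camX -= 0.5f; break;   // strafe left, look-at unchanged
    case 'd': camX += 0.5f; break;   // strafe right
    }
    glutPostRedisplay();             // redraw with the new camera
}

int main(int argc, char** argv)
{
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB | GLUT_DEPTH);
    glutCreateWindow("graph viewer");
    glutDisplayFunc(display);
    glutKeyboardFunc(keyboard);      // void (*)(unsigned char, int, int):
                                     // the signature that Cilk++ mangled
    glutMainLoop();
    return 0;
}
```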

[Figure: graph rendered from the small graph file sample.txt provided by Professor Gilbert.]

[Figure: graph rendered from the small RMAT graph file tinyrmat.txt provided by Professor Gilbert; this one helped show how many edges some vertices can have.]

Different Implementations:

1) Unoptimized STL list with the predefined append-list reducer: This was our first implementation of the BFS. We searched the Cilk include directory and discovered a predefined list reducer made for the push_back function on STL lists (roughly as sketched below, after this list). We eventually got our algorithm to run both sequentially and in parallel, only to discover it was painfully slow: Professor Gilbert's code ran in 0.005 seconds on a 130,000-edge graph, whereas ours took 11.2 seconds. We agreed this was far too slow and went looking for a faster implementation.

2) Optimized STL list with the predefined append-list reducer: Our next step was to go through our code and remove any unnecessary copies of the STL list, along with any other steps that could be done in fewer operations. We found and deleted several copy-constructor calls on STL lists, and we realized we had been searching sequentially through our list instead of simply calling pop_back(), which runs in O(1) time. After this cleanup, our runtime dropped from 11.2 seconds to 0.1 seconds. That was a huge improvement, but with a big flaw: it was still slower than Professor Gilbert's sequential code, and the more processors we added, the slower it became. We were getting the opposite of speedup, so once again we decided to try a different implementation.

3) Two predefined append-list reducer hyperobjects: We took one last approach before attempting to create the Bag data structure. Instead of a data structure wrapping a single STL list, we created two hyperobjects, one holding the current vertices being explored and one holding the parents of those vertices. We thought this would parallelize very well, since we could simultaneously explore all the neighbors of a node and keep track of several nodes at once. Unfortunately, this was not the case: the hyperobjects were very slow, and this implementation was by far the worst, taking 90.52 seconds to explore 130,000 edges. We had to scrap the idea and build the Bag data structure if we wanted parallel efficiency.
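
A minimal sketch of the list-reducer approach from implementations 1 and 2, assuming an adjacency-list graph. The header and class names follow the Cilk Plus distribution (the Cilk++ release we used shipped an equivalent reducer), and the visited handling is simplified:

```cpp
#include <cilk/cilk.h>
#include <cilk/reducer_list.h>
#include <list>
#include <vector>

// Expand one BFS level, collecting the next level in a stock list reducer.
std::list<int> expand_level(const std::vector< std::vector<int> >& adj,
                            const std::list<int>& level,
                            std::vector<bool>& visited)
{
    cilk::reducer_list_append<int> next;     // parallel-safe push_back
    // cilk_for needs a random-access range, so copy the list to a vector.
    std::vector<int> cur(level.begin(), level.end());
    cilk_for (std::size_t i = 0; i < cur.size(); ++i) {
        const std::vector<int>& nbrs = adj[cur[i]];
        for (std::size_t j = 0; j < nbrs.size(); ++j) {
            int w = nbrs[j];
            if (!visited[w]) {               // duplicates possible but
                visited[w] = true;           // harmless, as with the Bag
                next.push_back(w);
            }
        }
    }
    return next.get_value();                 // merge all views into one list
}
```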

Performance:

This graph shows the time to run a BFS on one billion edges using 1, 2, 4, 8, 16, and 32 processors. As you can see, the more processors we add, the faster the algorithm runs. The speedup is largest at 32 processors, showing that the larger the data set, the more efficient the parallel algorithm becomes with the maximum number of processors on Triton. The line at 1.7 seconds marks the time it took the optimized sequential code to run.

This graph depicts the same setup as the one above, except the parallel BFS was run on 10^8 edges instead of 10^9. As you can see, we still get parallel speedup, but if you look closely, the 16-processor run was slightly faster than the 32-processor run. This is probably due to Cilk++ overhead and to not having enough data to parallelize efficiently.

This graph depicts the same setup again, now on 10^6 edges. It shows that our algorithm does not reach its full potential without a larger data set: this one is too small to parallelize efficiently, which is why the 32-processor run is so slow. The last graph shows almost exactly the same behavior on 10^5 edges. Once again, Cilk++ overhead dominates the runtime and slows our program down considerably as we add more processors.

This graph shows how the size of the data set affects the running time of our parallel BFS. The smaller data sets run faster with fewer processors, because the larger data sets are still doing more computation per processor than can be done efficiently. All of the data sets appear to converge at eight processors. This is the turning point for our algorithm: beyond it, the larger data sets start to run faster, while the smaller data sets incur a large amount of overhead relative to the computation they perform.

This is the parallel efficiency of our parallel BFS on our largest data set of one billion edges, using 1, 2, 4, 8, 16, and 32 processors. We achieve superlinear speedup at 2 processors and almost perfect linear speedup at 4 processors; after that, the parallel efficiency declines quickly. This is most likely because we measured efficiency against the one-processor run of our parallel algorithm (which carries the overhead of the Bag data structure and the Cilk reducer, neither of which is needed sequentially) rather than against the optimized sequential version that uses a queue.

Conclusion:

Parallelizing the BFS algorithm was quite challenging. Our initial implementations using pre-made hyperobjects and STL list reducers were either too slow or showed inverse speedup (proportionally, the more processors we used, the more time the algorithm took). The Bag data structure built from Pennant trees was an excellent choice for parallelizing this algorithm, thanks to the worst-case runtime bounds on its operations and the simplicity of the custom reducer it required. With this implementation we observed excellent speedup and parallel efficiency, and the larger the data set, the more efficiently our program ran with the maximum number of processors (32), which means our program scales extremely well. In the end, we measured TEPS (traversed edges per second) on a graph traversal of roughly 500 million edges on 32 processors and achieved 1,572,160,924 TEPS, which would put Triton at number 26 on the Graph500 list.
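
For reference, the standard definitions behind these figures, plus a rough sanity check on the throughput number (here T_1 and T_p denote the one-processor and p-processor running times):

```latex
\[
S(p) = \frac{T_1}{T_p}, \qquad
E(p) = \frac{S(p)}{p}, \qquad
\mathrm{TEPS} = \frac{\text{edges traversed}}{\text{search time}}
\]
\[
t \;\approx\; \frac{5 \times 10^{8}\ \text{edges}}
                   {1.572 \times 10^{9}\ \text{TEPS}}
  \;\approx\; 0.32\ \text{s of search time on 32 processors.}
\]
```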