Using OpenCL for Implementing Simple Parallel Graph Algorithms


Michael J. Dinneen, Masoud Khosravani and Andrew Probert
Department of Computer Science, University of Auckland, Auckland, New Zealand
{mjd,

Abstract

For the typical graph algorithms encountered most frequently in practice (such as those introduced in entry-level algorithms courses: graph searching/traversals, shortest-path problems, strongly connected components and minimum spanning trees) we want to consider practical non-sequential platforms, such as the emerging cost-effective General-Purpose computation on Graphics Processing Units (GPGPU). In this paper we provide two simple design techniques that allow a nonspecialist computer scientist to harness the power of their GPUs as parallel compute devices. These two natural ideas are (a) using a host CPU script to synchronize a distributed view of a graph algorithm where each node of the input graph is associated with a unique processing thread ID and (b) using GPU atomic operations to synchronize a single kernel launch where a set of threads, bounded above by the number of streaming processing units available, continuously stay active and time-slice the total workload until the algorithm completes. We give concrete comparative implementations of both of these approaches for the simple problem of exploring a graph using breadth-first search. Finally we conclude that OpenCL, in addition to CUDA, is a natural tool for modern graph algorithm designers, especially those who are not experts in GPU hardware architecture, to develop real-world usable graph applications.

Keywords: parallel graph algorithms, GPGPU, OpenCL, CUDA
Contact Author: M.J. Dinneen
Conference: PDPTA'11

I. INTRODUCTION

Parallel programming is a generic concept describing a range of technologies and approaches. In general, however, it describes a system whereby threads of instructions are executed truly in parallel over a shared or partitioned data source.
As part of parallel computing, General-Purpose computation on Graphics Processing Units (GPGPU) is a new and active field. The main goal in GPGPU is to find parallel algorithms capable of concurrently processing huge amounts of data over a number of Graphics Processing Units (GPUs), using the advanced parallel GPU devices that are now readily available for general-purpose parallel programming. Within GPGPU research, implementing graph algorithms is an important sub-field and is the focus of this paper. Recently, GPUs have found their place among general computing devices. They are affordable and easily accessible for enterprises looking for relatively low-cost devices to process their massive data. In some applications the size of the input data is so large that even a low-order polynomial-time algorithm exceeds the time limit. Here one may scale down the running time by using more processors to accomplish the computational task concurrently. The main challenge is then to find a parallel GPU algorithm that achieves a significant speedup over a well-designed sequential one. CUDA [11] is the GPGPU platform provided by Nvidia Corporation that enables software developers to access the low-level instructions and memory of Nvidia GPUs. With respect to the current architecture of GPUs, CUDA follows the Single Instruction, Multiple Threads approach to parallel processing. While CUDA is restricted to Nvidia GPUs, OpenCL [8] is a generic overlay whose purpose is to provide a common interface for heterogeneous and parallel processing on both CPU-based and GPU-based systems and on different devices, such as AMD Radeon graphics cards. Each GPGPU OpenCL application consists of a host program (or script) that runs on the CPU and launches the kernels (kernel programs), which are compiled and run on the OpenCL devices.
We believe OpenCL makes programming on GPUs easier and safer because it limits access to the kernel (e.g. sandboxing).

Designing parallel algorithms for graph problems has been studied for many years [1], [12], [13]. Implementing these algorithms efficiently on GPUs is a challenging task. In [2], Dehne and Yogaratnam show that one may need to make non-trivial changes to port a PRAM graph algorithm efficiently to GPUs. They mention the irregularities among graphs as one of the main challenges. Graph irregularities, as an obstacle in designing fast parallel GPU algorithms for graph problems, are also addressed in [5], [7], and [14]. The Harish and Narayanan paper [6] on parallel GPU algorithms for graphs is widely cited. Another notable result is due to Luo, Wong and Hwu [10]. Both propose parallel implementations of basic graph algorithms which are implemented directly on the Nvidia CUDA platform. As far as we know, the latter paper provides the fastest known breadth-first search graph algorithm for GPUs. In principle we agree with the authors of [10] that the complexity of a GPGPU algorithm should be the same as the best known sequential one. However, from a practical point of view, simple (possibly non-optimal) correct algorithms are also of value. For instance, when it is known that the expected input cases are relatively small, the extra time and overhead of implementing an optimal algorithm may not be justifiable.

II. TWO DIFFERENT GPGPU DESIGN APPROACHES

We now explain two simple ways that may be used to synchronize graph (and other types of) GPGPU computations where we have a set of well-defined stages that need to be completed. For example, in doing a breadth-first search (BFS) in a graph, the stages correspond to the times when the set of nodes at a given depth/level is determined.

A. Host-based synchronization design

The first natural approach is to use a host CPU program (or script) to synchronize stages of a graph algorithm where each part (usually a node) of the input graph is associated with a unique processing thread.
This is a standard way of synchronizing processing threads. Here a global variable, shared by all threads, is set to false by the host and set to true by any thread inside the kernel that requires another stage. For example we have the PyOpenCL [9] snippet, shown in Figure 1, from our breadth-first search implementation using host-based synchronization, where n is the number of threads. Note that we use the PyOpenCL method call enqueue_write_buffer to send data from the host to the GPU, while enqueue_read_buffer retrieves data from the GPU to the host. Each of the kernel threads will set the global variable continue_flag if the algorithm needs another synchronization stage. Note that GPUs operate asynchronously from the host. Thus an explicit wait method call is required to wait for all kernel tasks to finish before going on to the next stage.

Comment: In its original form the BFS algorithm of [6] uses kernel relaunch to provide a global inter-block barrier between search frontiers. Indeed their program launches two kernels for each iteration: one to check the neighbors of each visited node and another to update the next frontier. In addition to our other changes we have developed an algorithm (see DKP-Host Sync in Section III) that needs only one kernel launch per stage, by using a method for synchronizing threads plus efficiently allocating data among the available threads.

B. Kernel synchronization using atomic operations

The second natural approach is to use GPU atomic operations to synchronize a single kernel launch. In this case a set of threads continuously stays active and time-slices the total workload until the algorithm completes. An important requirement for this approach is having an efficient way to partition an algorithm's workload. Suppose we have $n$ tasks to complete and we only have a fixed number $m$ = MAX_THREADS of parallel processing threads. Thus, after distributing the tasks evenly, each thread should perform $c = n/m$ tasks.
This can be done in a number of ways depending on the stride through the tasks $t_1, t_2, \dots, t_n$. If data is stored in memory we usually want to partition and stride through as $[t_1, \dots, t_c], [t_{c+1}, \dots, t_{2c}], \dots, [t_{(m-1)c+1}, \dots, t_n]$ or as $[t_1, t_{1+m}, t_{1+2m}, \dots, t_{1+(c-1)m}], [t_2, t_{2+m}, \dots, t_{2+(c-1)m}], \dots, [t_m, t_{m+m}, \dots, t_n]$. The listing of kernel code given in Figure 2 illustrates the distribution of these tasks to processing threads, where tid represents one of the active threads operating in parallel. By our convention the thread with tid=0 does the synchronization management for the algorithm. In this kernel listing, we use current_stage to represent a global clock; each thread keeps a local clock and only executes its set of tasks when the two match. Note the use of atomic operations to ensure correctness of shared data.
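The two strides described above can be illustrated with a small host-side sketch. This is only an illustration under stated assumptions, not the authors' kernel code: the function names blocked_partition and strided_partition are hypothetical, tasks and thread IDs are numbered from 0 rather than 1, and the chunk size is taken as the ceiling of n/m so that every task is assigned even when m does not divide n.

```python
def blocked_partition(n, m):
    """First stride: thread tid gets one contiguous chunk of about n/m tasks."""
    c = -(-n // m)  # ceiling of n / m (assumption: round up so no task is dropped)
    return [list(range(tid * c, min((tid + 1) * c, n))) for tid in range(m)]


def strided_partition(n, m):
    """Second stride: thread tid gets tasks tid, tid + m, tid + 2m, ... below n."""
    return [list(range(tid, n, m)) for tid in range(m)]


if __name__ == "__main__":
    # Hypothetical sizes: 10 tasks shared by 4 threads.
    for tid, tasks in enumerate(blocked_partition(10, 4)):
        print("blocked  tid", tid, tasks)
    for tid, tasks in enumerate(strided_partition(10, 4)):
        print("strided  tid", tid, tasks)
```

In both layouts every task is handled by exactly one thread; only the memory access pattern differs, which is why the choice of stride can affect running time.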

    while continue_flag[0]:
        # host clears the flag; any kernel thread may set it again
        continue_flag[0] = 0
        cl.enqueue_write_buffer(queue, continue_flag_buf, continue_flag)
        cl.enqueue_write_buffer(queue, current_stage_buf, current_stage)
        # one kernel launch per stage, with n threads
        cl.enqueue_nd_range_kernel(queue, kernel, (n, 1), None)
        # wait() blocks until the kernel and the read-back have finished
        cl.enqueue_read_buffer(queue, continue_flag_buf, continue_flag).wait()
        current_stage[0] += 1

Fig. 1. Host-based synchronization using PyOpenCL.

    while (1) // spin lock
    {
        // using current_stage as global clock
        if (*current_stage == local_clock[tid])
        {
            // Is everything done?
            if (continue_flag[*current_stage] == 0) return;
            // process n/max_threads work at this sync time stage
            // ...
            // if needed, we set next continue_flag[*current_stage+1]=1
        }
        atom_inc(finish_count);
        local_clock[tid] += 1; // this work thread is done
        if (tid==0) // thread 0 detects if everybody is done with stage
        {
            while (atom_cmpxchg(finish_count, MAX_THREADS, 0) != 0) {}
            atom_inc(current_stage);
        }
    } // end kernel's algorithm loop

Fig. 2. Thread-based synchronization in OpenCL kernel.

The finish_count value is used by this kernel to synchronize the threads (or processors) between stages. For example, our Nvidia C2050 device has MAX_THREADS = 1024, and we used the second stride technique, described above, in our program DKP-Kernel Sync, which is discussed in Section III.

Comment: Initially we experimented with a lock-based inter-block barrier as described in the paper of Xiao and Feng [15]. This worked reliably for small graphs and for very dense graphs. Unfortunately, Nvidia Corporation does not officially support this inter-block barrier technique, and its use can lead to unpredictable results when run on the current family of Nvidia GPGPU offerings. We eventually came up with our own single-block synchronization technique (presented above) that makes use of atomic operations without the disadvantages of the approach of [15].

III. EXPERIMENTAL RESULTS

As an illustrative example we develop two BFS algorithms similar to the ideas first presented as CUDA implementations by Harish and Narayanan [6]. We have recompiled their implementation to run on our platform (see below) to get comparable running times. Partly due to the available precision of timing GPU computations, all our result times are given in elapsed milliseconds. Calculation times are the kernel run-time from launch until after the host program's final wait call. As is common practice, we do not include disk I/O time or host-to-device copy times. We argue that an application will often copy a graph to memory or to the GPU and run many algorithms upon that copy (often set read-only). So the real issue is how long the actual algorithm takes, assuming the graph data structure is already available.

For graph algorithms (on sparse graphs) one often prefers adjacency lists over adjacency matrices since it is easier to iterate through the neighbors of a node in time proportional to the node's out-degree [3]. For a GPU representation one usually linearizes these two-dimensional adjacency lists into a one-dimensional array such that no loss of efficiency occurs. We represent a flattened adjacency list representation of a graph of $n$ nodes and $m$ edges as a vector of length $O(n + m)$ consisting of $[n, l_0, \dots, l_n, v_0, v_1, \dots, v_{m-1}]$. Here $n$ is the order and $l_i$ is the index in this vector of the first neighbor of node $i$ (i.e., it points to some $v_j$ index, $0 \le j < m$). Figure 3 illustrates this array representation. In particular, $l_n$ (plus the sub-index offset $n + 2$) is an index one past the end of the vector, and the degree of node $i$ is $l_{i+1} - l_i$.

[Figure 3: a sample graph shown as adjacency lists for nodes 0 to 4 and as its linear array representation, with sub-index entries $n, l_0, l_1, \dots, l_5$ (here $n = 5$).]
Fig. 3. An effective way to represent sparse graphs in an array.

The expected performance of all three tested algorithms, mentioned in Table I, is $O(nx + m)$, for a graph of order $n$, eccentricity $x$ (distance of a farthest node from the starting node) and size $m$. We note that for sparse graphs the value of $x$ is much less than $\log n$, based on the known average height of a random rooted tree. So the dominant term is $m$ in the complexity of these algorithms for most graphs; thus, those chosen as our test cases are exceptions. There are a number of variants of the BFS algorithm. One can gather information about predecessors or parents, about BFS tree levels and about the list of children. In addition one can gather all the BFS trees for each of the possible starting nodes, which approaches the task of computing the distance matrix.

A. Test cases and benchmarking environment

As a selection of somewhat tough test cases, as suggested by Luo, Wong and Hwu's paper [10], we picked sparse graphs from the 9th DIMACS Implementation Challenge [4]. We took each of the 51 state road [di]graphs and made them connected so a BFS from starting node 0 would span and process the entire graph. We removed loops and connected the graphs by adding $k - 1$ arcs to connect those with $k > 1$ components; e.g. for each lowest node index $i$ not in the first component, we added arc $(i - 1, i)$ to the graph. The orders (number of nodes), sizes (number of arcs), and eccentricity of node 0 (distance of the farthest node from node 0) are listed in the first few columns of Table I.

The system used by the authors in implementing the BFS graph algorithm, using the above two design approaches, consists of a rack-mountable server with 2 quad-core 2.5 GHz Intel CPUs and 2 Nvidia Tesla C2050 series (Fermi class) cards. The Tesla C2050 is classified as having Nvidia compute capability 2.0, which defines a range of attributes. In particular, the C2050 has 14 multiprocessors (MPs), each with 32 cores, and 3 GB of global memory. Each of the 448 cores operates at a frequency of 1.15 GHz. The Tesla C2050 supports block sizes of up to 1024, which can be viewed as the MAX_THREADS value discussed earlier.

B. Observations from our experiments

All programs produced the same (correct) expected BFS depths for each node as a standard CPU-based BFS program. In addition to computing depths from the source node, as the original Harish-Narayanan algorithm does, our DKP programs also record a BFS parent for each node. We later performed a sequential BFS DAG search to verify correctness by ensuring that each parent is, in fact, adjacent to its child and has depth one less than the child.
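The flattened graph layout, the stage-by-stage host synchronization and the parent check described above can be tied together in one small sketch. It is a sequential Python simulation under stated assumptions, not the DKP programs themselves: the inner loop over tid stands in for one kernel launch with one thread per node, the entries $l_i$ are stored as offsets into the neighbor part of the vector (so the offset $n + 2$ is added when indexing), and all function names and the five-node sample graph are hypothetical.

```python
def flatten(adj):
    """Pack adjacency lists into the vector [n, l_0..l_n, v_0..v_{m-1}]."""
    offsets, neighbours = [0], []
    for nbrs in adj:
        neighbours.extend(nbrs)
        offsets.append(len(neighbours))
    return [len(adj)] + offsets + neighbours


def neighbours_of(vec, i):
    """Neighbours of node i; l_i and l_{i+1} are offsets past index n + 2."""
    base = vec[0] + 2
    return vec[base + vec[1 + i]: base + vec[2 + i]]


def bfs_kernel(tid, vec, depth, parent, stage, continue_flag):
    """Work of the simulated thread for node tid during one stage."""
    if depth[tid] != stage:
        return  # this node is not on the current frontier
    for v in neighbours_of(vec, tid):
        if depth[v] == -1:
            depth[v] = stage + 1
            parent[v] = tid
            continue_flag[0] = 1  # ask the host for another stage


def bfs_host_sync(vec, source=0):
    """Host loop: one simulated kernel launch per BFS level."""
    n = vec[0]
    depth, parent = [-1] * n, [-1] * n
    depth[source] = 0
    continue_flag, stage = [1], 0
    while continue_flag[0]:
        continue_flag[0] = 0
        for tid in range(n):  # stands in for launching n parallel threads
            bfs_kernel(tid, vec, depth, parent, stage, continue_flag)
        stage += 1
    return depth, parent


def parents_are_valid(vec, depth, parent, source=0):
    """Check that each parent is adjacent and one level above its child."""
    for v in range(vec[0]):
        if v == source or depth[v] == -1:
            continue
        p = parent[v]
        if v not in neighbours_of(vec, p) or depth[p] != depth[v] - 1:
            return False
    return True


if __name__ == "__main__":
    vec = flatten([[3, 1], [2], [1, 3], [4], [3]])  # hypothetical sample digraph
    depth, parent = bfs_host_sync(vec)
    print(vec, depth, parent, parents_are_valid(vec, depth, parent))
```

On a real device the threads of one stage run concurrently, so several threads may write the same depth and parent entries; the sequential loop here sidesteps that and is meant only to show the data layout and the role of continue_flag.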
We conclude that the OpenCL and CUDA approaches differ very little in overall performance. Here, in addition to what is reported, we actually converted our OpenCL DKP-Kernel Sync implementation to a pure CUDA implementation and measured the running times on these same test cases. The times, in all cases, were within roughly ±2% of the times listed in the last column of Table I. There are a couple of extreme cases (MI and MO) in our experiment where we do not fully understand why the times are so high relative to the other two programs (we reran each program a few times to double-check the reliability of our times). Recall that the Harish-Narayanan program also uses host synchronization but is written purely in C rather than PyOpenCL; we do not believe that difference is the cause.

TABLE I. GPU BFS algorithm running times (in milliseconds) on USA state road graphs (originating from node 0).
Columns: State Graph; Nodes; Arcs; Eccentricity; Harish-Narayanan; DKP-Host Sync; DKP-Kernel Sync.
Rows: AK, AL, AR, AZ, CA, CO, CT, DC, DE, FL, GA, HI, IA, ID, IL, IN, KS, KY, LA, MA, MD, ME, MI, MN, MO, MS, MT, NC, ND, NE, NH, NJ, NM, NV, NY, OH, OK, OR, PA, RI, SC, SD, TN, TX, UT, VA, VT, WA, WI, WV, WY; Total; Average.

It turns out that for large graphs such as the USA road graph with 24 million nodes, DKP-Host Sync (about 20 seconds GPU time) is about 3 to 5 times faster than DKP-Kernel Sync. It appears that at about one million nodes (on sparse graphs) DKP-Host Sync starts to run faster than DKP-Kernel Sync (e.g. see the big states such as CA, FL and TX). However, for these very rare extreme cases, it might be better to use a more efficient algorithm such as the one given in [10]. For small (or dense) graphs one should probably prefer the DKP-Kernel Sync program.

Finally we want to mention that the common best practice of ensuring memory coalescence should not necessarily be taken as absolute advice. Our DKP-Kernel Sync program actually performs better with memory strides in increments of MAX_THREADS than a version that uses memory strides of distance one. We suggest that users try several equivalent variations of their program and take the best performer for their expected input cases.

IV. CONCLUSIONS AND OPEN PROBLEMS

In this paper we introduced and compared two simple techniques for synchronizing processes in parallel graph algorithms. We first considered the scenario where the host CPU is responsible for the synchronization of kernel launches. For example, in DKP-Host Sync, each node of the graph is uniquely associated with a multiprocessor thread. For our second approach we considered the case where the input is partitioned into at most the number of possible parallel threads. For example, in DKP-Kernel Sync, we use only one block (work group) of parallel threads to avoid inter-block synchronization issues. Here the multiprocessor threads use atomic operations for synchronizing the computation.
Our experiments showed that both these approaches work well for specific categories of graphs. DKP-Kernel Sync is better for small and dense graphs, while DKP-Host Sync is more efficient on sparse graphs of over a million nodes. We also compared the running times of different implementations of the same algorithm in OpenCL and CUDA, and found no notable difference in computation time between them. Hence OpenCL seems to be as mature and usable as CUDA, with at least one additional advantage of being portable onto more devices (CPUs and GPUs). There are many problems left to be investigated in this area. For example, we are interested in testing other graph algorithms with these synchronization techniques. Finding a way to reliably implement an inter-block barrier on GPU platforms would also be extremely valuable. In addition, further work could include developing an OpenCL library of efficient parallel graph algorithms for GPUs.

ACKNOWLEDGMENTS

The authors would like to thank both P.J. Narayanan and Wen-mei Hwu for providing samples of their BFS GPU code for comparison and Radu Nicolescu for discussions and encouragement in designing GPGPU graph algorithms.

REFERENCES

[1] F. Y. Chin, J. Lam John and I. Chen, Efficient parallel algorithms for some graph problems, Communications of the ACM, 25(9) 1982,
[2] F. Dehne and K. Yogaratnam, Exploring the Limits of GPUs With Parallel Graph Algorithms,
[3] M. J. Dinneen, G. Gimel'farb, and M. C. Wilson. Introduction to Algorithms, Data Structures and Formal Languages, 2nd Edition. Pearson (Education New Zealand), ISBN
[4] D. Schultes. 9th DIMACS Implementation Challenge, challenge9; USA state road graphs, http: // challenge9/data/tiger/, October
[5] Y. Frishman and A. Tal, Multi-Level Graph Layout on the GPU, IEEE Transactions on Visualization and Computer Graphics, 13, 2007,
[6] P. Harish and P. J. Narayanan, Accelerating large graph algorithms on the GPU using CUDA, in IEEE High Performance Computing, 2007, LNCS 4873, pp
[7] K.A. Hawick, A. Leist and D.P. Playne, Parallel graph component labelling with GPUs and CUDA, Parallel Computing, 36(12), 2010,
[8] Khronos Group. Open Standards for Media Authoring and Acceleration,
[9] A. Klöckner. PyCUDA and PyOpenCL: Even Simpler GPU Programming with Python. Nvidia GPU Technology Conference, (see
[10] L. Luo, M. Wong, W-M. Hwu, An Effective GPU Implementation of Breadth-First Search, in Proceedings of the 47th Design Automation Conference (Anaheim, California, NY,
[11] Nvidia, CUDA.
[12] M. J. Quinn and N. Deo, Parallel Graph Algorithms, ACM Computing Surveys, 16(3) 1984,
[13] V. Rao and V. Kumar, Parallel depth first search. Part I. Implementation, International Journal of Parallel Programming, 16(6) 1984,
[14] J. Soman, K. Kishore, P. J. Narayanan, A fast GPU algorithm for graph connectivity. IEEE International Symposium on Parallel Distributed Processing, 2010, 1-8.
[15] S. Xiao and W. Feng, Inter-block GPU communication via fast barrier synchronization, Technical Report TR-09-19, Dept. of Computer Science, Virginia Tech., 2009
