
LITERATURE REVIEW: Parallel Computational Geometry on Multicore Processors
A Review of Basic Algorithms

Hamidreza Zaboli
School of Computer Science
Carleton University
Ottawa, Canada K1S 5B6
hzaboli@connect.carleton.ca

October 13

1 Introduction

Parallel computing has long been one of the central topics of computer science, as well as a means of satisfying the ever-growing demand for computational speed. Although it has been an active research area for decades, it is only now becoming popular and useful for ordinary users. In the past, the area was out of reach for home users because of the high cost of installing a parallel machine; the large space such systems required and the lack of efficient algorithms further confined the field to computer scientists. Today, with the falling cost of producing parallel systems of various architectures, such machines can be found in virtually every PC, and Intel's multicore processors are among the most common parallel processors in current desktops and laptops. The easy availability of these simple parallel processors is leading experts to redesign and revise old scientific algorithms and software architectures that were originally designed for sequential processing. In fact, to keep today's computations efficient, most of the basic algorithms that are optimal for single-core processors must be modified; among them are sorting algorithms, searching algorithms, matrix operations, and graph algorithms. As a consequence of revising the basic algorithms, advanced algorithms in different scientific areas, which use the basic algorithms as building blocks, need to be revised as well. Scientific algorithms in fields such as computational geometry, image processing, and bioinformatics clearly use these basic algorithms to solve larger problems specific to their fields. Therefore, because of the widespread use of parallel processors, especially multicore processors, and in order to keep their use efficient, it is necessary to redesign both basic and advanced algorithms across scientific fields.

In this project, we review basic and advanced problems of computational geometry on parallel machines. The parallel platforms we chose are current multicore platforms: the Cell-BE of IBM, Sony, and Toshiba, and the multicore processors of Intel. In past years, researchers produced a large body of work on parallel computational geometry, mostly targeting expensive, large-scale parallel machines and sometimes virtual or theoretical parallel models such as the PRAM.

Today, with small parallel processing units (multiprocessors and multicores) running in ordinary PCs, it seems important to revise the algorithms of computational geometry to make them optimal for multicore processors. Thus, in this work we start by reviewing current work on these processors; we are especially interested in basic algorithms that are widely used in computational geometry, such as list ranking and sorting. Because the multicore processors in question are still new, only a few works target them directly. In this survey, we review these works and point to related useful work. In addition, we give a brief review of work on basic and advanced computational geometry algorithms on the graphics processing unit (GPU). This is interesting because the GPU is a helpful co-processor that can take over a significant portion of computations, especially graphic and geometric ones. Current GPUs are becoming huge SIMD (single instruction, multiple data) processors inside commodity PCs. Though they were originally designed to perform graphics computations on the pixels of an output image, their strong parallel computation abilities and the increasing number of simple processors inside a GPU have made them interesting parallel machines for researchers. As illustrated in Fig. 1, GPUs are large parallel processors compared to current multicore CPUs: one of the most powerful recent Cell processors (the Cell Broadband Engine) has 8 parallel cores (Fig. 2), which is few compared to an Nvidia GeForce 8 GPU with 128 simple thread processors. Although a single core of a multicore processor is computationally much more powerful than a single GPU core, the number of parallel streams a GPU can process is many times larger than on a multicore processor. Especially when computations can be performed in a SIMDized way, the memory-access workloads of a multicore are time-consuming compared to those of a GPU. Besides their massively parallel architecture, GPUs also have higher bandwidth to main memory; note that even in current multicore processors, data transfer between the CPU and main memory, and the bandwidth available for it, are problematic issues that experts must deal with. These advantages of GPUs over CPUs in parallel computation create a good opportunity to move some problems and computations from CPUs to GPUs, even problems unrelated to graphics or geometry. On the other hand, transferring and solving problems, especially computational geometry problems, on GPUs necessitates revising and assessing some basic algorithms that are widely used in advanced computational geometry problems. In this literature review, we are interested in both basic and advanced computational geometry problems and present a brief review of them.

Fig. 1. The architecture of an Nvidia GeForce 8 GPU. Each group of 8 thread processors shares a local memory of 16KB [1].

Fig. 2. The architecture of a Cell processor with 1 PPE and 8 SPEs [2].

Afterwards, as a perspective for the project, we intend to identify optimizations, modifications, and solutions applicable both to the basic algorithms studied here and to advanced problems of computational geometry on multicore platforms. We expect to find an encouraging variety of computational geometry problems that can be made to run more efficiently on multicore platforms.

2 Literature Review

In this section, we start with a general review of studies on multicore platforms and point to related work. Next, we concentrate on the basic algorithms and problems of interest to us, namely list ranking and sorting on multicore platforms, and present and discuss our selected works in detail. We assume the reader is familiar with the importance of these problems in computational geometry; as a hint, we point out that most graph and tree problems cannot be solved without sorting and list ranking. Please refer to [3] for more information on the use of sorting and list ranking in different computational geometry problems.

2.1 Hardware-Related Optimization Techniques on Multicore Processors

In recent years, with the introduction of multicore processors, researchers have focused on these processors, studied their properties, and tried to determine their impact on the execution of different problems and algorithms. Because multicore architectures are new, researchers have, as is traditional, first tried to characterize them and the particular features that may affect algorithms and the properties of scientific computations, such as computational complexity, communication complexity, execution time, and I/O load. For example, K. Nyberg in [4] and V. Pankratius et al. in [5] introduced the opportunities that the multicore architecture offers a new generation of algorithms and software. K. Nyberg in [4] discussed the old multi-tasking scheme of the Ada programming language and tried to renovate Ada's multi-tasking programming environment with respect to the new multicore architecture, testing the impact of Ada multi-tasking on multicore. General-purpose applications and software engineering aspects of multicore are presented in [5]; that paper considers four case studies of general-purpose applications written in different programming languages (C++ with OpenMP, Java, C#) and reports the results of running them on two multicore platforms manufactured by Intel and Sun Microsystems. In addition to basic concepts of algorithms and software systems on the new parallel architecture, an important area that has caught researchers' attention is how to find and design methods that achieve better performance on multicore.

There are reports on the tricks and techniques applied to algorithms to improve performance on multicore. Most of these approaches try to hide memory-access latency behind DMA requests in order to keep all cores of the processor busy. Generally speaking, these techniques try to use all resources of the machine (the multicore processor as well as memory, I/O, and the GPU) as much as possible. M. Rafique et al. in [6] presented an approach for I/O-intensive workloads on the Cell processor. The paper overlaps the time spent getting/putting data from/to memory with computation using a technique called asynchronous prefetching: the blocks of data that will be processed in the next iterations are requested in advance, while the cores of the Cell work on the current data (which was itself prefetched in earlier iterations). Briefly, the paper studies the following items:

1. the I/O path in the Cell processor,
2. I/O tricks and techniques applicable to the Cell processor, and
3. an evaluation of data prefetching techniques on the Cell processor.

A fairly popular prefetching technique is double buffering. It is discussed in [7], a paper we study in detail in the next section.

A thread-to-core assignment method is proposed in [8]. In parallel execution, the code is divided into parallel parts, each of which is run by a thread, and every thread is assigned to a core. The paper uses the fact that each thread, depending on its section of code, needs certain resources, while each core of the CPU has access at any moment to only a subset of the resources, not all of them. The method assigns threads to cores so that each thread-core pair is the best match at the moment and the core can provide the thread with the resources it requires; fairness and overall throughput are also taken into account. The method further exploits the fact that most codes have several execution phases with approximately similar runtime characteristics; if we can approximately predict similarity among program phases, we can use this information to assign threads to cores optimally [8]. M. Chu et al. in [9] and Anderson in [10] proposed methods that are similar in that they partition data-access requests over cores and their associated local stores, instead of partitioning the execution of the code. L. Liu et al. in [11] show that if the number of cores exceeds a certain threshold, performance degrades due to the bandwidth bottleneck between the CPU and memory. As can be seen in Fig. 3, Intel's multicore processor has 4 cores connected to main memory through a memory controller hub (MCH), which also provides a connection for data transfer between main memory and other devices. As the number and volume of main-memory accesses increase, the MCH becomes a bottleneck [11]. The paper defines a measure called memory access intensity and uses it to determine the mentioned threshold.

Fig. 3. Block diagram of a 4-core Intel multicore processor [11].

2.1.1 Double-Buffering Technique on Multicore

In this section we briefly review double buffering, an efficient technique used by the cores of multicore processors to prefetch data and thereby hide data-transfer latency. Double buffering is a special case of a more general technique called multi-buffering. Both are popular techniques with applications in many areas of computer science and engineering, both hardware- and software-related. Double buffering is interesting because it is the most effective and suitable variant of multi-buffering given the small local stores of the cores of multicore processors [7]; for example, the local store available to each core of the Cell processor is only 256KB. For more information on multi-buffering techniques on multicore, including single, double, and triple buffering, please refer to [12]. Note that using more buffers avoids most of the complexities of double buffering, but it requires more memory, which is undesirable on multicore processors [7]. Double buffering has recently been used in basic algorithms on multicore platforms; for example, B. Gedik et al. in [2] and D. Bader et al. in [13] used it to improve the performance of sorting and list ranking algorithms, in particular to hide memory-access latency. In what follows we briefly review a recent work by J. Sancho et al. [7] that analyzed the impact of double buffering on two multicore processors: the Cell processor of IBM, Sony, and Toshiba, and the Quad-core Opteron of AMD.

Consider the two major operations performed when data must be transferred from the cores of the CPU to main memory and in the reverse direction; we call these put transfers and get transfers, respectively [7]. We place two separate buffers in the local store of each core, and at any moment one buffer is used for computation while the other is used for getting/putting data from/to main memory [7]. This model is shown in Fig. 4. Double buffering is possible and useful for computations in which it is possible to know or predict which data will be needed in the next iterations. Using this method, data transfers are overlapped with computation, and consequently the latency of data transfer is hidden.

Fig. 4. The double-buffering model for the local store of a core: one buffer is designated for computation and the other for getting/putting data from/to memory [7].
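
To make the pattern concrete, the following is a minimal C++ sketch of a double-buffered processing loop. It is illustrative only: std::async and memcpy stand in for the platform's asynchronous transfer primitives (on the Cell, these would be DMA get/put requests issued by an SPE), and the function names and buffer sizes are our own hypothetical choices, not code from [7].

    #include <cstddef>
    #include <cstring>
    #include <future>
    #include <vector>

    void compute(float* block, std::size_t n) {       // toy kernel
        for (std::size_t i = 0; i < n; ++i) block[i] *= 2.0f;
    }

    // Process nblocks blocks of `block` floats each, always fetching block
    // i+1 asynchronously while computing on block i.
    void process_stream(const float* src, float* dst,
                        std::size_t nblocks, std::size_t block) {
        std::vector<float> buf[2] = {std::vector<float>(block),
                                     std::vector<float>(block)};
        // Prefetch block 0 into buffer 0 (a "get" transfer).
        std::future<void> pending = std::async(std::launch::async, [&] {
            std::memcpy(buf[0].data(), src, block * sizeof(float)); });
        for (std::size_t i = 0; i < nblocks; ++i) {
            const std::size_t cur = i % 2, nxt = 1 - cur;
            pending.wait();                         // block i is now resident
            std::future<void> next;
            if (i + 1 < nblocks)                    // start fetching block i+1 ...
                next = std::async(std::launch::async, [&, i, nxt] {
                    std::memcpy(buf[nxt].data(), src + (i + 1) * block,
                                block * sizeof(float)); });
            compute(buf[cur].data(), block);        // ... while computing on block i
            std::memcpy(dst + i * block, buf[cur].data(),
                        block * sizeof(float));     // "put" transfer
            pending = std::move(next);
        }
    }

On the Cell, the same structure appears with DMA tags in place of futures; the key point is that the loop waits only for the buffer it is about to consume.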

Experiments were performed on the two mentioned multicore processors. In the first experiment, on the Quad-core Opteron, the hardware prefetcher is used to prefetch data from main memory into the L1 cache. To do this, the prefetcher watches memory accesses over a set of iterations, predicts the memory blocks or lines most likely to be requested in the next iterations, and prefetches them. For example, if two consecutive accesses hit blocks n and n+1, the hardware prefetcher fetches block n+2. Although the hardware prefetcher works automatically and needs no particular attention, it lacks the flexibility of software prefetching. The Cell processor, on the other hand, has dedicated DMA engines that transfer data between the cores of the Cell processor (called SPEs) and main memory. These DMA engines are under user control and can transfer data while the SPEs are processing other data.

The following plots show the elapsed time and the speedup achieved with the double-buffering technique in comparison to a single buffer. Fig. 5 shows the results of double buffering on 4 Quad-core Opteron processors working in parallel [7]. Fig. 5(a) shows the execution time of double buffering compared to single buffering as the number of cores increases; note that the Y axis is on a logarithmic scale. As can be seen, execution time with both techniques decreases as the number of cores grows. Fig. 5(b) shows the speedup of double buffering over single buffering for two prefetching cases, stride-1 and stride-1K, where the stride is the distance between two consecutive blocks of the data being prefetched from memory.

Fig. 5. (a) Execution time of the double- and single-buffering schemes as the number of cores increases, on 4 Quad-core Opterons; (b) speedup of the double-buffering technique over single buffering with two strides, as computation intensity increases [7].

Finally, Fig. 6 compares the execution times of double buffering on the Cell processor and the Quad-core Opteron as computation intensity increases. As can be seen, the Quad-core Opteron outperforms the Cell processor; as the authors state, this superiority is due to its higher aggregate memory bandwidth and higher peak processing rate [7]. Although the Opteron's advantage shrinks as computation intensity increases, it remains faster than the Cell: at a computation intensity of 1 the Opteron is about 6 times faster, while at a computation intensity of 20 it is only about 2 times faster. Please refer to [7] for more information on computation intensity and how the authors measure it.

Fig. 6. Execution times of the double-buffering technique on the Quad-core Opteron and the Cell processor [7].
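
Before moving on, note that the stride-based prediction the Opteron's prefetcher performs in hardware can also be approximated in software. The sketch below uses the __builtin_prefetch intrinsic available in GCC and Clang; the function and its parameters are our own illustrative choices, not code from [7].

    #include <cstddef>

    // While working on element i, request the cache line `stride` elements
    // ahead, mimicking a stride-1 hardware prefetch stream.
    void scale(float* a, std::size_t n, float c) {
        const std::size_t stride = 16;  // 16 floats = one 64-byte cache line
        for (std::size_t i = 0; i < n; ++i) {
            if (i + stride < n)
                __builtin_prefetch(&a[i + stride], /*rw=*/0, /*locality=*/1);
            a[i] *= c;
        }
    }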

2.2 Basic Problems and Algorithms on Multicore Processors

Up to this point, we have reviewed studies on multicore processors that are not about algorithms but about hardware aspects and techniques that improve the performance and speed of execution. As mentioned earlier, because multicore processors are new, such works are not numerous; indeed, works on specific problems and algorithms for multicore are few, and they address basic methods such as sorting, matrix multiplication, dynamic programming, and list ranking.

Another important issue is the compilers and programming environments needed to implement and evaluate algorithms on multicore. Unfortunately, there is as yet no dedicated compiler for multicore platforms, and current programming and execution are done in traditional environments originally designed for sequential processing. At best, the programming environment is a modified or optimized compiler or programming language that may not extract the highest possible performance from a multicore processor. Some of these environments are MPI, OpenMP, and Cilk. MPI is a message-passing library interface, used from languages such as C and C++, designed for the message-passing parallel programming model; A. Kumar et al. in [14] studied the feasibility of using MPI to program the Cell multicore processor. OpenMP is a set of compiler directives and library routines for C/C++ designed for shared-memory parallel architectures. As the architecture of multicore processors shows, and as recent research and projects have treated them, multicore processors fall mostly into the category of shared-memory architectures; therefore, OpenMP is a natural choice for programming them. In addition, a new extension of C++ called Cilk has been developed for programming multicore platforms; please refer to [15] for more information.

Other instances of basic algorithms on multicore are dynamic programming and matrix operations, discussed in [16] and [17], respectively. The first evaluates three classes of dynamic programming algorithms on multicore (local-dependency dynamic programming, the Gaussian elimination paradigm, and the parenthesis problem), with experiments performed on an 8-core AMD Opteron. The second evaluates linear-algebra matrix operations on multicore. Recently, new work on the sequence alignment problem was carried out on the Cell processor. Although this is not an advanced problem, it sits one level above the basic algorithms mentioned so far, suggesting that researchers are gradually moving from basic algorithms toward complex, specialized algorithms with applications in a particular scientific field. A. Sarje et al. in [18] presented two advanced alignment techniques, spliced alignment and syntenic alignment, implemented on the Cell processor.

These alignment techniques are used in biological applications. Experimental results show speedups of about 4 on the Cell processor compared to serial algorithms on the Cell and on a Pentium 4 processor [18].

2.3 Sorting Algorithms on Multicore Processors

In this section we give a brief review of sorting algorithms on multicore processors. There is earlier work on large parallel machines and multiprocessors, which can be found in [19]; so far, however, only a few works have studied the optimization of sorting algorithms on multicore [2], [20]. In most cases, the routine for designing a sorting algorithm for a multicore processor is as follows. First, a basic sorting algorithm that appears effective and efficient for parallel sorting is chosen. It is then used as the kernel of the whole sorting process and possibly modified for better performance. Finally, optimizations are applied to the whole process for a more effective implementation; multi-buffering and memory-latency hiding are popular ones. In the following, we discuss the CellSort algorithm of B. Gedik et al., which uses a bitonic sorting kernel to build a sorting process on the Cell processor. More information about this algorithm can be found in [2].

2.3.1 CellSort: A Sorting Algorithm on the Cell Processor

CellSort, as its authors describe it, is based on a distributed bitonic merge with a SIMDized bitonic sorting kernel [2]. The sorting process has three levels. The innermost level, called single-SPE local sort, sorts the data items local to one core of the Cell processor using only that core; we refer to the cores of the Cell processor as SPEs. At the second level, the data items stored in the local stores of the SPEs are sorted using a distributed bitonic sort. This level is called in-core sort because the sorting happens entirely inside the Cell processor, between SPEs or within an SPE. The SPEs are connected by a bus called the Element Interconnect Bus (EIB); the architecture of the Cell processor, the SPEs, the PPE, and the EIB is shown in Fig. 2. In practice, we also need to sort data that does not fit in the local stores of the Cell processor, and a sorting algorithm must handle this situation. CellSort does so by moving data back and forth between the Cell processor and main memory; as the authors state, this level is called distributed out-of-core sort. It is very costly and time-consuming because of the data transfers between the Cell processor and main memory.

Single-SPE local sort is a simple bitonic sort applied to the data items in the local memory of an SPE; Fig. 7 shows the scheme [2]. Given a list of n items, bitonic sort starts with lists of size 1 and merges them stage by stage, doubling the list size, until it merges two sorted bitonic lists of size n/2. This bitonic merging process is shown in Fig. 7. The computational complexity of simple bitonic sort for sorting m items is O(m log^2 m).
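
For reference, the following is a generic, sequential C++ sketch of the bitonic sorting network of Fig. 7 for a power-of-two input; this is the textbook formulation, not CellSort's own code. Note that all compare-exchanges at a given distance j are independent of one another, which is what makes the network easy to parallelize and SIMDize.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Sorts a in ascending order; a.size() must be a power of two.
    void bitonic_sort(std::vector<int>& a) {
        const std::size_t n = a.size();
        for (std::size_t k = 2; k <= n; k <<= 1)          // merged run length
            for (std::size_t j = k >> 1; j > 0; j >>= 1)  // compare distance
                for (std::size_t i = 0; i < n; ++i) {     // independent pairs
                    const std::size_t l = i ^ j;
                    if (l > i) {
                        const bool up = (i & k) == 0;     // run direction
                        if ((a[i] > a[l]) == up) std::swap(a[i], a[l]);
                    }
                }
    }
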
One optimization that can be applied to the sorting process is the use of the SIMD instructions provided by the SPEs of the Cell processor. With the SIMD instruction set, two vectors of 4 items each can be compared and sorted with only three SIMD instructions, whereas without them up to 8 operations are needed [2]. Given two vectors of size 4 in the bitonic merging process, the SIMD comparison instruction combined with a selection instruction puts the 4 lower items into one vector and the 4 higher items into the other. These instructions, and the support the Cell processor provides for them, improve the whole sorting process and yield a higher speedup.
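
The following sketch shows this comparison-plus-selection pattern. We use x86 SSE4.1 intrinsics as a stand-in for the Cell SPU equivalents (on the SPU, spu_cmpgt and spu_sel play these roles); the three intrinsic calls mirror the three SIMD instructions mentioned above.

    #include <smmintrin.h>  // SSE4.1: _mm_cmpgt_epi32, _mm_blendv_epi8

    // After the call, a holds the element-wise minima of the two 4-int
    // vectors and b the element-wise maxima: one vector compare-exchange.
    static inline void simd_compare_exchange(__m128i& a, __m128i& b) {
        const __m128i gt = _mm_cmpgt_epi32(a, b);      // 1: a > b, per lane
        const __m128i lo = _mm_blendv_epi8(a, b, gt);  // 2: select smaller
        const __m128i hi = _mm_blendv_epi8(b, a, gt);  // 3: select larger
        a = lo;
        b = hi;
    }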

The SIMD instruction set can be used to compare any two vectors with 4 or more data items; even for comparisons involving fewer than 4 items, the data can be rearranged so that SIMD instructions apply.

Fig. 7. Phases of the bitonic merging process for sorting 8 numbers [2].

For example, consider the last phase of comparisons in Fig. 7, where comparisons are performed on every two consecutive data items. In this case, SIMD instructions cannot directly compare and swap each item with its neighbor, but it is possible to relocate the items so that each item is compared and swapped with its neighbor using SIMD instructions. This is done by shuffling: Fig. 8 shows the shuffling that can be applied to the last stage of Fig. 7. After shuffling, SIMD instructions are used, and finally another shuffle restores the items to their positions.

Fig. 8. Using SIMD instructions for comparing and swapping vectors with fewer than 4 items [2].

After the local sorts, the next level is the distributed in-core sort, in which all SPEs cooperate to sort a vector of size 8m, where m is the number of items in the local store of one SPE. In this level, every SPE first sorts its local data items using the single-SPE local sort, with each two consecutive SPEs sorting in opposite orders (one ascending, the other descending). Bitonic merges are then applied, starting from lists of m items (each merge producing a sorted list of length 2m) up to the final list of length 8m in the last phase. Note that when merging the lists of two consecutive SPEs, the lists can be divided so that each SPE compares and swaps only half of the numbers. Consider the first merging phase: every SPE holds a list of numbers sorted in the opposite order of its neighboring SPE's list, so each number of the first list can be compared with its corresponding number in the second list, putting the smaller number in the first list and the larger number in the second list.
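
This element-wise compare-split step can be sketched in a few lines of plain C++ (illustrative only; in CellSort each half of the loop runs on a different SPE and is SIMDized as above). Since the two lists are sorted in opposite orders, their concatenation is bitonic, and after the loop every element of a is at most every element of b.

    #include <cstddef>
    #include <utility>
    #include <vector>

    // a is sorted ascending, b descending, with equal lengths. Afterwards
    // each half is bitonic and can be merged locally and independently.
    void compare_split(std::vector<int>& a, std::vector<int>& b) {
        for (std::size_t i = 0; i < a.size(); ++i)  // pairs are independent
            if (a[i] > b[i]) std::swap(a[i], b[i]);
    }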

Because each number's compare-and-swap with its corresponding number is independent of the others, the lists can be divided and assigned to different SPEs. To do so, each of the two consecutive lists is split into two equal halves, as shown in Fig. 9. The first half of the first list is assigned to the first SPE, to be compared and swapped with the first half of the second list; likewise, the second half of the first list is assigned to the second SPE, to be compared and swapped with the second half of the second list. In this way both SPEs stay busy with equal workloads, and each SPE has to transfer only half of its list to the other SPE, resulting in fewer communications between SPEs. Fig. 9 illustrates this optimization for the phase in which each list has length 2m.

Fig. 9. Dividing the lists of the SPEs and assigning them to different SPEs, for the phase in which the length of a sorted list is 2m [2].

The last level of bitonic sorting on the Cell processor is the distributed out-of-core sort. Here, a series of in-core sorts first divides all data items into sorted lists of size 8m, each sorted in the opposite order of its neighboring lists. Then an out-of-core bitonic merge is applied to these lists until the final sorted list containing all data items is obtained [2]. The whole process of bitonic sorting is thus a distributed bitonic merge composed of many in-core bitonic merges and single-SPE local sorts. Note that the distributed out-of-core merge involves a huge number of very time-consuming data transfers between the SPEs and main memory. To hide this latency, prefetching is used, as described in Section 2.1.1, by means of the DMA requests available to the SPEs: every SPE can issue a DMA request to get the (i+1)-th block of data items while sorting the i-th block, and can also put the (i-1)-th block back into main memory. In this way, a great deal of the latency is overlapped with the computations of the SPEs. The total computational complexity of the whole sorting process, with p SPEs and a total of n data items, is O((n/p) log^2 n) [2].

CellSort has been evaluated on a 16-SPE Cell system (an IBM QS20 Cell blade), and its results have been compared to the following algorithms on the Cell processor, an Intel Xeon, and an Intel Pentium 4: simple bitonic sort, SIMDized bitonic sort, shell sort, and quicksort. The maximum number of items in a single SPE is 32K, the number of SPEs is 16, and the maximum number of sorted items is 128M. The evaluation results are reported separately for the three levels: single-SPE local sort, distributed in-core sort, and distributed out-of-core sort. The first two plots, shown in Fig. 10, give the results of sorting 32K numbers with the single-SPE local sort [2], reporting the sorting time of SIMDized bitonic sort relative to the other algorithms on the Cell and Intel processors; Fig. 10(a) shows times for integer numbers and Fig. 10(b) for floats. As can be seen, SIMDized bitonic sort achieves the best results.

Fig. 10. Single-SPE local sort: (a) integer numbers, (b) float numbers [2].

The next experiment covers the distributed in-core sort. Again, the results of bitonic sort are compared to those of the other algorithms on the mentioned processors; Fig. 11 summarizes the results. Note that the Y axis is the time of the other algorithms relative to the time of the in-core bitonic sort; Fig. 11(a) shows times for integer numbers and Fig. 11(b) for floats. Again, the in-core bitonic sort on the Cell processor outperforms the other algorithms, even bitonic sort implemented on the Intel Xeon.

Fig. 11. Distributed in-core bitonic sort: (a) integer numbers, (b) float numbers [2].

The last and most comprehensive experiment covers the distributed out-of-core sort, evaluated with 16 SPEs. As in the two previous experiments, it is performed separately for integer and float numbers, with up to 128M data items; the results are shown in Fig. 12(a),(b). As the figures show, the distributed out-of-core bitonic sort on the Cell processor achieves the best results among all combinations of algorithms and processors.

Fig. 12. Distributed out-of-core bitonic sort: (a) integer numbers, (b) float numbers [2].

2.4 List Ranking on Multicore

List ranking is one of the most important building blocks for many problems in computational geometry. Given a linked list of nodes, the list ranking problem determines the rank (distance) of each node relative to the first node. As far as we have seen, there is only one study of list ranking on multicore platforms, by D. Bader et al. [13]; we briefly present it here and refer the reader to [13] for more information.

Assume a list of n nodes, each with a value property and a prefix property, such that the value of each node is 1 and the prefix of each node is the sum of its value and the prefix of its predecessor. Under these assumptions, list ranking in [13] proceeds in four steps: first, the algorithm partitions the input list into s sublists by randomly choosing sublist head nodes. Then, the prefix of each node within its sublist is computed. Next, the prefix of each sublist head is obtained by accumulating the sublist sums along the chain of sublists. Finally, each node computes its final prefix by adding its own prefix to the prefix of the last node in the previous sublist [13].

As in CellSort, prefetching via DMA requests is used to hide the latency of fetching data from memory. The total number of DMA requests is O(n), and the computational complexity of the algorithm per core is O(n/p); thus, on a Cell processor with 8 SPEs, the computational complexity per SPE is O(n/8). However, although prefetching can reduce data-transfer latency, especially in sorting, it cannot help much in list ranking, because the memory location of a node's successor is not known in advance. In this case, prefetching can only fetch the blocks of data around the current node and make predictions, and even these predictions do not help much. As a result, cache misses occur whenever the next node is absent from the currently fetched block of data, and the number of DMA requests, and consequently data transfers, is high.

To solve this problem and hide these latencies, [13] uses multi-threading. Because the SPEs do not support hardware multi-threading, a software-managed multi-threading scheme is used, in which several threads, each handling one sublist, are allocated to each SPE. Whenever a cache miss occurs, meaning that the next node is not inside the fetched block of data, a DMA request is issued and a thread switch is performed; meanwhile, the SPE is kept busy executing the next thread. Threads are assigned to an SPE in round-robin order. Fig. 13 shows the software-managed multi-threading technique. With this technique, as the authors state, there are no stalls provided there are sufficiently many threads.

Fig. 13. The software-managed multi-threading technique: several threads are allocated to each SPE [13].
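
To make the four steps concrete, here is a compact sequential C++ sketch of the sublist-based scheme. The names and structure are ours, not code from [13]; in [13], step 2 runs in parallel across the SPEs, with the software multi-threading above hiding the DMA latency of the next[] accesses. We assume the splitters are distinct and distinct from the head.

    #include <vector>

    // next[v] = successor of node v, or -1 at the tail; head = first node;
    // heads = randomly chosen sublist head nodes (the splitters; step 1).
    // Returns rank[v] = distance of node v from the head.
    std::vector<int> list_rank(const std::vector<int>& next, int head,
                               std::vector<int> heads) {
        const int n = static_cast<int>(next.size());
        heads.insert(heads.begin(), head);      // the list head opens sublist 0
        const int S = static_cast<int>(heads.size());

        std::vector<int> starts(n, -1);         // starts[v] = sublist v opens
        for (int s = 0; s < S; ++s) starts[heads[s]] = s;

        std::vector<int> owner(n, -1), local(n, 0);
        std::vector<int> len(S, 0), succ(S, -1);

        // Step 2 (parallel across SPEs in [13]): walk each sublist, recording
        // each node's local rank, the sublist length, and the next sublist.
        for (int s = 0; s < S; ++s) {
            int v = heads[s], r = 0;
            while (v != -1 && (r == 0 || starts[v] == -1)) {
                owner[v] = s;
                local[v] = r++;
                v = next[v];
            }
            len[s] = r;
            succ[s] = (v == -1) ? -1 : starts[v];
        }

        // Step 3 (sequential): accumulate sublist lengths along the chain of
        // sublists, giving the global rank of each sublist head.
        std::vector<int> base(S, 0);
        for (int s = 0, acc = 0; s != -1; s = succ[s]) {
            base[s] = acc;
            acc += len[s];
        }

        // Step 4: every node offsets its local rank by its sublist's base.
        std::vector<int> rank(n);
        for (int v = 0; v < n; ++v) rank[v] = base[owner[v]] + local[v];
        return rank;
    }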

Experiments were performed on an IBM BladeCenter QS20 with two 3.2 GHz Cell processors, one of which is used for the performance measurements. Fig. 14 shows the running time of the list ranking algorithm on each SPE; the running times differ because the lengths of the lists allocated to the SPEs differ. The experiment is performed in two configurations, with and without software multi-threading. As can be seen, with multi-threading and 64 sublists (8 sublists allocated to each SPE), the total time required for list ranking is significantly less than without multi-threading.

Fig. 14. Results of list ranking with and without multi-threading, and the running times of the SPEs [13].

In another experiment, the proposed list ranking algorithm on the Cell processor is compared to list ranking on other sequential and parallel processors, with a list of 8 million nodes. Fig. 15 shows these comparisons for two types of lists, random and ordered. Fig. 15(a) compares the algorithm on the Cell processor with sequential processors: the Cell achieves a smaller running time than all of them. Fig. 15(b) makes the same comparison against several other parallel processors; although the Cell is not the best in this figure, its list ranking performance is comparable to the other parallel processors.

Fig. 15. Comparison of list ranking on the Cell processor with (a) sequential processors and (b) parallel processors, on 8 million ordered and random nodes [13].

2.5 Graphics Processing Units: Basic and Advanced Problems

As discussed in Section 1, GPUs are efficient parallel processors that can act as co-processors and in some situations even outperform CPUs, especially on computations that can be SIMDized. GPUs were originally designed for graphics and geometric computations, and there is research, both basic and advanced, in image processing and computational geometry implemented on GPUs. Examples include Delaunay triangulation, biomedical image analysis, and motion estimation, which have been discussed and implemented for the GPU in [21], [22], and [23], respectively.

GPU programming needs special environments designed for the purpose; without them, efficient use of the many simple parallel cores of a GPU is not possible. The Compute Unified Device Architecture (CUDA) is one such programming environment, and it has been used to code different algorithms and problems on recent generations of Nvidia GPUs. In the following, we review a GPU sorting algorithm called GPUTeraSort, which uses bitonic sort as its kernel; for more detail, please refer to [24].

2.5.1 Sorting Algorithms on GPU

The idea behind sorting on GPUs is that they are large parallel vector co-processors, while the disadvantage of CPU-based sorting algorithms is significant cache misses on large datasets [24]. Current GPUs have roughly 10x higher main-memory bandwidth and use data parallelism to achieve roughly 10x more operations per second than CPUs [24]. Many researchers have proposed GPU-based sorting algorithms. T. Purcell et al. in [25] proposed a bitonic sort method on the GPU. N. Govindaraju et al. in [26] proposed a sorting algorithm based on a periodic balanced sorting network, using the GPU's texture mapping and blending operations. A more recent work by N. Govindaraju et al. [24], called GPUTeraSort, uses bitonic sort as its sorting algorithm and implements it on the GPU. The implementation handles the sorting of wide keys over huge numbers of records residing in external memory, and tries to exploit the maximum processing power of both the GPU and the CPU.

Most external-memory sorting algorithms have two phases: the first produces a set of locally sorted files, and the second sorts all input files globally [24]. External sorting algorithms must cope with I/O, because they sort sets of data and files too large to fit in main memory. GPUTeraSort has 5 stages [24], sketched in code after the next paragraph:

1. Reader: reads input files into a main-memory buffer. For more speed, the files are spread over parallel disks so that data can be transferred to memory in parallel.
2. Key-Generator: computes (key, pointer) pairs for the records of data in the buffer.
3. Sorter: reads and sorts the key-pointer pairs. This stage is both computation- and memory-intensive, since it must read the pairs from main memory, sort them, and write them back.
4. Reorder: rearranges the buffer according to the sorted key-pointer pairs, producing a sorted output buffer.
5. Writer: writes the output buffer to the parallel disks.

In GPUTeraSort, the sorting stage is performed by the GPU, which sorts at a higher degree of parallelism than the CPU; in particular, this frees the CPU to achieve higher I/O performance [24]. In addition, the GPU accesses main memory with higher bandwidth than the CPU, which provides higher throughput. Fig. 16 compares the total sorting time, as the number of records increases, of GPUTeraSort and several high-performance CPU-based sorting algorithms; the CPU-based algorithms are quicksort implementations evaluated on three multicore processors, while GPUTeraSort runs on an Nvidia 7900 GTX GPU. As the figure shows, GPUTeraSort performs comparably to the CPU-based algorithms, especially when the price of the platform is taken into account [24]. Another comparison, between GPUTeraSort and the other GPU-based algorithms proposed in [25] and [26], is illustrated in Fig. 17.
As can be seen, GPUTeraSort outperforms the other GPU-based algorithms, requiring less sorting time as the number of records increases.
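
The key-pointer organization at the heart of the Key-Generator, Sorter, and Reorder stages can be sketched in a few lines of C++. This is illustrative only: the record layout and names are ours, and in GPUTeraSort the sort itself is a bitonic sort executed on the GPU rather than std::sort.

    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <utility>
    #include <vector>

    struct Record { std::uint32_t key; std::string payload; };

    std::vector<Record> sort_records(const std::vector<Record>& in) {
        // Key-Generator: build compact (key, index) pairs so the sorter
        // moves small fixed-size items instead of whole records.
        std::vector<std::pair<std::uint32_t, std::uint32_t>> kp(in.size());
        for (std::uint32_t i = 0; i < in.size(); ++i) kp[i] = {in[i].key, i};

        // Sorter: in GPUTeraSort this is a bitonic sort run on the GPU.
        std::sort(kp.begin(), kp.end());

        // Reorder: gather the records into their sorted positions.
        std::vector<Record> out;
        out.reserve(in.size());
        for (const auto& p : kp) out.push_back(in[p.second]);
        return out;
    }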

Fig. 16. Comparison of the total sorting time of GPUTeraSort with quicksort implementations on multicore processors, as the number of records increases [24].

Fig. 17. Comparison of the total sorting time of GPUTeraSort with the other GPU-based algorithms proposed in [25] and [26], as the number of records increases [24].

The computational complexity of GPUTeraSort is O(n log^2 n) and its communication complexity is O(n), so that data transfer takes only about 10 percent of the total sorting time. This is shown in Fig. 18(a), which plots the time required for data transfer against the total sorting time as the number of records increases. Fig. 18(b) shows the time required by each stage of GPUTeraSort on three different GPUs: an Nvidia 6800, an Nvidia 6800 Ultra, and an Nvidia 7800 GT.

Fig. 18. (a) Comparison of the data transfer time with the total sorting time of GPUTeraSort [24]; (b) time spent in each stage of GPUTeraSort on three different GPUs.

References

[1] Tom R. Halfhill. Parallel Processing with CUDA. Microprocessor Report.
[2] Bugra Gedik, Rajesh R. Bordawekar, Philip S. Yu. CellSort: High Performance Sorting on the Cell Processor. Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB), Austria.
[3] M. de Berg, M. van Kreveld, M. Overmars, O. Schwarzkopf. Computational Geometry: Algorithms and Applications. Springer.
[4] K. Nyberg. Multi-core + multi-tasking = multi-opportunity? ACM SIGAda Ada Letters, Volume XXVII, Issue 3, pages 79-82.
[5] Victor Pankratius, Christoph Schaefer, Ali Jannesari, Walter F. Tichy. Software engineering for multicore systems: an experience report. IWMSE '08: Proceedings of the 1st International Workshop on Multicore Software Engineering, May.
[6] M. Mustafa Rafique, Ali R. Butt, Dimitrios S. Nikolopoulos. DMA-based prefetching for I/O-intensive workloads on the Cell architecture. Proceedings of the 2008 Conference on Computing Frontiers, pages 23-32, May.
[7] J. Sancho, D. Kerbyson. Analysis of double buffering on two different multicore architectures: Quad-core Opteron and the Cell-BE. IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pages 1-12.
[8] Tyler Sondag, Viswanath Krishnamurthy, Hridesh Rajan. Predictive thread-to-core assignment on a heterogeneous multi-core processor. Proceedings of the 4th Workshop on Programming Languages and Operating Systems.
[9] M. Chu, R. Ravindran, S. Mahlke. Data Access Partitioning for Fine-grain Parallelism on Multicore Architectures. 40th Annual IEEE/ACM International Symposium on Microarchitecture.
[10] James H. Anderson, John M. Calandrino. Parallel task scheduling on multicore platforms. ACM SIGBED Review, Volume 3, Issue 1, special issue: the work-in-progress (WIP) session of RTSS 2005, pages 1-6.
[11] Lixia Liu, Zhiyuan Li, Ahmed H. Sameh. Analyzing memory access intensity in parallel programs on multicore. International Conference on Supercomputing.
[12] T. Chen, Z. Sura, K. O'Brien, K. O'Brien. Optimizing the Use of Static Buffers for DMA on a CELL Chip. Proceedings of the 19th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2006), New Orleans, Louisiana.
[13] D. Bader, V. Agarwal, K. Madduri. On the design and analysis of irregular algorithms on the Cell processor: A case study on list ranking. Proceedings of IEEE IPDPS.
[14] A. Kumar, N. Jayam, A. Srinivasan, G. Senthilkumar, P. Baruah, S. Kapoor, M. Krishna, R. Sarma. Feasibility study of MPI implementation on the heterogeneous multi-core Cell BE architecture. Proceedings of the 19th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 55-56.
[15]
[16] R. Chowdhury, V. Ramachandran. Cache-efficient dynamic programming algorithms for multicores. Proceedings of the 20th Annual Symposium on Parallelism in Algorithms and Architectures (SPAA).
[17] E. Chan, E. Quintana-Orti, G. Quintana-Orti, R. van de Geijn. SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. Proceedings of the 19th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA).
[18] A. Sarje, S. Aluru. Parallel biological sequence alignments on the Cell Broadband Engine. IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pages 1-11.
[19] S. G. Akl. Parallel Sorting Algorithms. Academic Press.
[20] H. Inoue, T. Moriyama, H. Komatsu, T. Nakatani. AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors. Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques (PACT).
[21] G. Rong, Tiow-Seng Tan, Thanh-Tung Cao. Computing two-dimensional Delaunay triangulation using graphics hardware. Proceedings of the 2008 Symposium on Interactive 3D Graphics and Games, pages 89-97.
[22] T. Hartley, U. Catalyurek, A. Ruiz, F. Igual, R. Mayo, M. Ujaldon. Biomedical image analysis on a cooperative cluster of GPUs and multicores. Proceedings of the 22nd Annual International Conference on Supercomputing, pages 15-25.
[23] Wei-Nien Chen, Hsueh-Ming Hang. H.264/AVC motion estimation implementation on Compute Unified Device Architecture (CUDA). IEEE International Conference on Multimedia and Expo.
[24] N. Govindaraju, J. Gray, R. Kumar, D. Manocha. GPUTeraSort: high performance graphics co-processor sorting for large database management. International Conference on Management of Data (SIGMOD).
[25] T. Purcell, C. Donner, M. Cammarano, H. Jensen, P. Hanrahan. Photon mapping on programmable graphics hardware. ACM SIGGRAPH/Eurographics Conference on Graphics Hardware, pages 41-50.
[26] N. Govindaraju, N. Raghuvanshi, D. Manocha. Fast and approximate stream mining of quantiles and frequencies using graphics processors. Proceedings of ACM SIGMOD.


More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS Ferdinando Alessi Annalisa Massini Roberto Basili INGV Introduction The simulation of wave propagation

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367

More information

Parallel Processing SIMD, Vector and GPU s cont.

Parallel Processing SIMD, Vector and GPU s cont. Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP

More information

Performance impact of dynamic parallelism on different clustering algorithms

Performance impact of dynamic parallelism on different clustering algorithms Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu

More information

GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS

GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS Agenda Forming a GPGPU WG 1 st meeting Future meetings Activities Forming a GPGPU WG To raise needs and enhance information sharing A platform for knowledge

More information

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,

More information

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been

More information

Roadrunner. By Diana Lleva Julissa Campos Justina Tandar

Roadrunner. By Diana Lleva Julissa Campos Justina Tandar Roadrunner By Diana Lleva Julissa Campos Justina Tandar Overview Roadrunner background On-Chip Interconnect Number of Cores Memory Hierarchy Pipeline Organization Multithreading Organization Roadrunner

More information

Parallel Architecture. Hwansoo Han

Parallel Architecture. Hwansoo Han Parallel Architecture Hwansoo Han Performance Curve 2 Unicore Limitations Performance scaling stopped due to: Power Wire delay DRAM latency Limitation in ILP 3 Power Consumption (watts) 4 Wire Delay Range

More information

Optimizing Assignment of Threads to SPEs on the Cell BE Processor

Optimizing Assignment of Threads to SPEs on the Cell BE Processor Optimizing Assignment of Threads to SPEs on the Cell BE Processor T. Nagaraju P.K. Baruah Ashok Srinivasan Abstract The Cell is a heterogeneous multicore processor that has attracted much attention in

More information

Using Intel Streaming SIMD Extensions for 3D Geometry Processing

Using Intel Streaming SIMD Extensions for 3D Geometry Processing Using Intel Streaming SIMD Extensions for 3D Geometry Processing Wan-Chun Ma, Chia-Lin Yang Dept. of Computer Science and Information Engineering National Taiwan University firebird@cmlab.csie.ntu.edu.tw,

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters 1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk

More information

Parallel and Distributed Computing

Parallel and Distributed Computing Parallel and Distributed Computing NUMA; OpenCL; MapReduce José Monteiro MSc in Information Systems and Computer Engineering DEA in Computational Engineering Department of Computer Science and Engineering

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

A Recursive Data-Driven Approach to Programming Multicore Systems

A Recursive Data-Driven Approach to Programming Multicore Systems 1 A Recursive Data-Driven Approach to Programming Multicore Systems Rebecca Collins and Luca P. Carloni Technical Report CUCS-046-07 Department of Computer Science Columbia University 1214 Amsterdam Ave,

More information

Accelerating image registration on GPUs

Accelerating image registration on GPUs Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining

More information

Optimizing JPEG2000 Still Image Encoding on the Cell Broadband Engine

Optimizing JPEG2000 Still Image Encoding on the Cell Broadband Engine 37th International Conference on Parallel Processing Optimizing JPEG2000 Still Image Encoding on the Cell Broadband Engine Seunghwa Kang David A. Bader Georgia Institute of Technology, Atlanta, GA 30332

More information

The Pennsylvania State University. The Graduate School. College of Engineering PFFTC: AN IMPROVED FAST FOURIER TRANSFORM

The Pennsylvania State University. The Graduate School. College of Engineering PFFTC: AN IMPROVED FAST FOURIER TRANSFORM The Pennsylvania State University The Graduate School College of Engineering PFFTC: AN IMPROVED FAST FOURIER TRANSFORM FOR THE IBM CELL BROADBAND ENGINE A Thesis in Computer Science and Engineering by

More information

How to Write Fast Code , spring th Lecture, Mar. 31 st

How to Write Fast Code , spring th Lecture, Mar. 31 st How to Write Fast Code 18-645, spring 2008 20 th Lecture, Mar. 31 st Instructor: Markus Püschel TAs: Srinivas Chellappa (Vas) and Frédéric de Mesmay (Fred) Introduction Parallelism: definition Carrying

More information

MULTIMEDIA PROCESSING ON MANY-CORE TECHNOLOGIES USING DISTRIBUTED MULTIMEDIA MIDDLEWARE

MULTIMEDIA PROCESSING ON MANY-CORE TECHNOLOGIES USING DISTRIBUTED MULTIMEDIA MIDDLEWARE MULTIMEDIA PROCESSING ON MANY-CORE TECHNOLOGIES USING DISTRIBUTED MULTIMEDIA MIDDLEWARE Michael Repplinger 1,2, Martin Beyer 1, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken,

More information

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

What Next? Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University. * slides thanks to Kavita Bala & many others

What Next? Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University. * slides thanks to Kavita Bala & many others What Next? Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University * slides thanks to Kavita Bala & many others Final Project Demo Sign-Up: Will be posted outside my office after lecture today.

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

A MATLAB Interface to the GPU

A MATLAB Interface to the GPU Introduction Results, conclusions and further work References Department of Informatics Faculty of Mathematics and Natural Sciences University of Oslo June 2007 Introduction Results, conclusions and further

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

GPU Architecture. Alan Gray EPCC The University of Edinburgh

GPU Architecture. Alan Gray EPCC The University of Edinburgh GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? Architectural reasons for accelerator performance advantages Latest GPU Products From

More information

Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies

Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies John C. Linford John Michalakes Manish Vachharajani Adrian Sandu IMAGe TOY 2009 Workshop 2 Virginia

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

XIV International PhD Workshop OWD 2012, October Optimal structure of face detection algorithm using GPU architecture

XIV International PhD Workshop OWD 2012, October Optimal structure of face detection algorithm using GPU architecture XIV International PhD Workshop OWD 2012, 20 23 October 2012 Optimal structure of face detection algorithm using GPU architecture Dmitry Pertsau, Belarusian State University of Informatics and Radioelectronics

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

Cell Broadband Engine. Spencer Dennis Nicholas Barlow

Cell Broadband Engine. Spencer Dennis Nicholas Barlow Cell Broadband Engine Spencer Dennis Nicholas Barlow The Cell Processor Objective: [to bring] supercomputer power to everyday life Bridge the gap between conventional CPU s and high performance GPU s History

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

More information

A Parallel Access Method for Spatial Data Using GPU

A Parallel Access Method for Spatial Data Using GPU A Parallel Access Method for Spatial Data Using GPU Byoung-Woo Oh Department of Computer Engineering Kumoh National Institute of Technology Gumi, Korea bwoh@kumoh.ac.kr Abstract Spatial access methods

More information

Fast BVH Construction on GPUs

Fast BVH Construction on GPUs Fast BVH Construction on GPUs Published in EUROGRAGHICS, (2009) C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, D. Manocha University of North Carolina at Chapel Hill NVIDIA University of California

More information

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

Exploring GPU Architecture for N2P Image Processing Algorithms

Exploring GPU Architecture for N2P Image Processing Algorithms Exploring GPU Architecture for N2P Image Processing Algorithms Xuyuan Jin(0729183) x.jin@student.tue.nl 1. Introduction It is a trend that computer manufacturers provide multithreaded hardware that strongly

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

GPU for HPC. October 2010

GPU for HPC. October 2010 GPU for HPC Simone Melchionna Jonas Latt Francis Lapique October 2010 EPFL/ EDMX EPFL/EDMX EPFL/DIT simone.melchionna@epfl.ch jonas.latt@epfl.ch francis.lapique@epfl.ch 1 Moore s law: in the old days,

More information

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources This Unit: Putting It All Together CIS 501 Computer Architecture Unit 12: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital Circuits

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs To Use or Not to Use: CPUs Optimization Techniques on GPGPUs D.R.V.L.B. Thambawita Department of Computer Science and Technology Uva Wellassa University Badulla, Sri Lanka Email: vlbthambawita@gmail.com

More information

What does Heterogeneity bring?

What does Heterogeneity bring? What does Heterogeneity bring? Ken Koch Scientific Advisor, CCS-DO, LANL LACSI 2006 Conference October 18, 2006 Some Terminology Homogeneous Of the same or similar nature or kind Uniform in structure or

More information

B. Tech. Project Second Stage Report on

B. Tech. Project Second Stage Report on B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

General introduction: GPUs and the realm of parallel architectures

General introduction: GPUs and the realm of parallel architectures General introduction: GPUs and the realm of parallel architectures GPU Computing Training August 17-19 th 2015 Jan Lemeire (jan.lemeire@vub.ac.be) Graduated as Engineer in 1994 at VUB Worked for 4 years

More information

Comparing Memory Systems for Chip Multiprocessors

Comparing Memory Systems for Chip Multiprocessors Comparing Memory Systems for Chip Multiprocessors Jacob Leverich Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis Computer Systems Laboratory Stanford University

More information

CellSs Making it easier to program the Cell Broadband Engine processor

CellSs Making it easier to program the Cell Broadband Engine processor Perez, Bellens, Badia, and Labarta CellSs Making it easier to program the Cell Broadband Engine processor Presented by: Mujahed Eleyat Outline Motivation Architecture of the cell processor Challenges of

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

When MPPDB Meets GPU:

When MPPDB Meets GPU: When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU

More information

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture?

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture? This Unit: Putting It All Together CIS 371 Computer Organization and Design Unit 15: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital

More information

Algorithm Engineering with PRAM Algorithms

Algorithm Engineering with PRAM Algorithms Algorithm Engineering with PRAM Algorithms Bernard M.E. Moret moret@cs.unm.edu Department of Computer Science University of New Mexico Albuquerque, NM 87131 Rome School on Alg. Eng. p.1/29 Measuring and

More information

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &

More information

1 Motivation for Improving Matrix Multiplication

1 Motivation for Improving Matrix Multiplication CS170 Spring 2007 Lecture 7 Feb 6 1 Motivation for Improving Matrix Multiplication Now we will just consider the best way to implement the usual algorithm for matrix multiplication, the one that take 2n

More information

Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console

Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console Computer Architecture Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Milo Martin & Amir Roth at University of Pennsylvania! Computer Architecture

More information

Accelerating Spark RDD Operations with Local and Remote GPU Devices

Accelerating Spark RDD Operations with Local and Remote GPU Devices Accelerating Spark RDD Operations with Local and Remote GPU Devices Yasuhiro Ohno, Shin Morishima, and Hiroki Matsutani Dept.ofICS,KeioUniversity, 3-14-1 Hiyoshi, Kohoku, Yokohama, Japan 223-8522 Email:

More information