
LITERATURE REVIEW: Parallel Computational Geometry on Multicore Processors
A Review of Basic Algorithms

Hamidreza Zaboli
School of Computer Science
Carleton University
Ottawa, Canada K1S 5B6
hzaboli@connect.carleton.ca

October 13

1 Introduction

Parallel computing has long been one of the central topics of computer science, as well as a means of satisfying the ever-growing demand for computational speed. Although it has been an active research area for decades, it is only now becoming popular and useful for ordinary users. In the past, the area was out of reach for home users because of the high cost of installing a parallel machine; the large space such systems required and the lack of efficient algorithms further confined the field to computer scientists. Today, with the falling cost of producing parallel systems of various architectures, such machines can be found in virtually every PC, and Intel's multicore processors are among the most common parallel processors in current desktops and laptops. The easy availability of these simple parallel processors is leading experts to redesign and revise old scientific algorithms and software architectures that were originally designed for sequential processing. In fact, to keep today's computations efficient, most of the basic algorithms that are optimal for single-core processors must be modified; among them are sorting algorithms, searching algorithms, matrix operations, and graph algorithms. As a consequence of revising the basic algorithms, advanced algorithms in different scientific areas, which use the basic algorithms as building blocks, need to be revised as well. Scientific algorithms in fields such as computational geometry, image processing, and bioinformatics clearly use these basic algorithms to solve larger problems specific to their fields. Therefore, because of the widespread use of parallel processors, especially multicore processors, and in order to keep their use efficient, it is necessary to redesign both basic and advanced algorithms across scientific fields.

In this project, we review basic and advanced problems of computational geometry on parallel machines. The parallel platforms we chose are current multicore platforms: the Cell-BE of IBM, Sony, and Toshiba, and the multicore processors of Intel. In past years, researchers produced a large body of work on parallel computational geometry, mostly targeting expensive, large-scale parallel machines and sometimes virtual or theoretical parallel models such as the PRAM.

Today, with small parallel processing units (multiprocessors and multicores) running in ordinary PCs, it seems important to revise the algorithms of computational geometry to make them optimal for multicore processors. Thus, in this work we start by reviewing current work on these processors; we are especially interested in basic algorithms that are widely used in computational geometry, such as list ranking and sorting. Because the multicore processors in question are still new, only a few works target them directly. In this survey, we review these works and point to related useful work. In addition, we give a brief review of work on basic and advanced computational geometry algorithms on the graphics processing unit (GPU). This is interesting because the GPU is a helpful co-processor that can take over a significant portion of computations, especially graphic and geometric ones. Current GPUs are becoming huge SIMD (single instruction, multiple data) processors inside commodity PCs. Though they were originally designed to perform graphics computations on the pixels of an output image, their strong parallel computation abilities and the increasing number of simple processors inside a GPU have made them interesting parallel machines for researchers. As illustrated in Fig. 1, GPUs are large parallel processors compared to current multicore CPUs: one of the most powerful recent Cell processors (the Cell Broadband Engine) has 8 parallel cores (Fig. 2), which is few compared to an Nvidia GeForce 8 GPU with 128 simple thread processors. Although a single core of a multicore processor is computationally much more powerful than a single GPU core, the number of parallel streams a GPU can process is many times larger than on a multicore processor. Especially when computations can be performed in a SIMDized way, the memory-access workloads of a multicore are time-consuming compared to those of a GPU. Besides their massively parallel architecture, GPUs also have higher bandwidth to main memory; note that even in current multicore processors, data transfer between the CPU and main memory, and the bandwidth available for it, are problematic issues that experts must deal with. These advantages of GPUs over CPUs in parallel computation create a good opportunity to move some problems and computations from CPUs to GPUs, even problems unrelated to graphics or geometry. On the other hand, transferring and solving problems, especially computational geometry problems, on GPUs necessitates revising and assessing some basic algorithms that are widely used in advanced computational geometry problems. In this literature review, we are interested in both basic and advanced computational geometry problems and present a brief review of them.

Fig. 1. The architecture of an Nvidia GeForce 8 GPU. Each group of 8 thread processors shares a local memory of 16KB [1].

Fig. 2. The architecture of a Cell processor with 1 PPE and 8 SPEs [2].

Afterwards, as a perspective for the project, we intend to identify optimizations, modifications, and solutions applicable both to the basic algorithms studied here and to advanced problems of computational geometry on multicore platforms. We expect to find an encouraging variety of computational geometry problems that can be made to run more efficiently on multicore platforms.

2 Literature Review

In this section, we start with a general review of studies on multicore platforms and point to related work. Next, we concentrate on the basic algorithms and problems of interest to us, namely list ranking and sorting on multicore platforms, and present and discuss our selected works in detail. We assume the reader is familiar with the importance of these problems in computational geometry; as a hint, we point out that most graph and tree problems cannot be solved without sorting and list ranking. Please refer to [3] for more information on the use of sorting and list ranking in different computational geometry problems.

2.1 Hardware-Related Optimization Techniques on Multicore Processors

In recent years, with the introduction of multicore processors, researchers have focused on these processors, studied their properties, and tried to determine their impact on the execution of different problems and algorithms. Because multicore architectures are new, researchers have, as is traditional, first tried to characterize them and the particular features that may affect algorithms and the properties of scientific computations, such as computational complexity, communication complexity, execution time, and I/O load. For example, K. Nyberg in [4] and V. Pankratius et al. in [5] introduced the opportunities that the multicore architecture offers a new generation of algorithms and software. K. Nyberg in [4] discussed the old multi-tasking scheme of the Ada programming language and tried to renovate Ada's multi-tasking programming environment with respect to the new multicore architecture, testing the impact of Ada multi-tasking on multicore. General-purpose applications and software engineering aspects of multicore are presented in [5]; that paper considers four case studies of general-purpose applications written in different programming languages (C++ with OpenMP, Java, C#) and reports the results of running them on two multicore platforms manufactured by Intel and Sun Microsystems. In addition to basic concepts of algorithms and software systems on the new parallel architecture, an important area that has caught researchers' attention is how to find and design methods that achieve better performance on multicore.

There are reports on the tricks and techniques applied to algorithms to improve performance on multicore. Most of these approaches try to hide memory-access latency behind DMA requests in order to keep all cores of the processor busy. Generally speaking, these techniques try to use all resources of the machine (the multicore processor as well as memory, I/O, and the GPU) as much as possible. M. Rafique et al. in [6] presented an approach for I/O-intensive workloads on the Cell processor. The paper overlaps the time spent getting/putting data from/to memory with computation using a technique called asynchronous prefetching: the blocks of data that will be processed in the next iterations are requested in advance, while the cores of the Cell work on the current data (which was itself prefetched in earlier iterations). Briefly, the paper studies the following items:

1. the I/O path in the Cell processor,
2. I/O tricks and techniques applicable to the Cell processor, and
3. an evaluation of data prefetching techniques on the Cell processor.

A fairly popular prefetching technique is double buffering. It is discussed in [7], a paper we study in detail in the next section.

A thread-to-core assignment method is proposed in [8]. In parallel execution, the code is divided into parallel parts, each of which is run by a thread, and every thread is assigned to a core. The paper uses the fact that each thread, depending on its section of code, needs certain resources, while each core of the CPU has access at any moment to only a subset of the resources, not all of them. The method assigns threads to cores so that each thread-core pair is the best match at the moment and the core can provide the thread with the resources it requires; fairness and overall throughput are also taken into account. The method further exploits the fact that most codes have several execution phases with approximately similar runtime characteristics; if we can approximately predict similarity among program phases, we can use this information to assign threads to cores optimally [8]. M. Chu et al. in [9] and Anderson in [10] proposed methods that are similar in that they partition data-access requests over cores and their associated local stores, instead of partitioning the execution of the code. L. Liu et al. in [11] show that if the number of cores exceeds a certain threshold, performance degrades due to the bandwidth bottleneck between the CPU and memory. As can be seen in Fig. 3, Intel's multicore processor has 4 cores connected to main memory through a memory controller hub (MCH), which also provides a connection for data transfer between main memory and other devices. As the number and volume of main-memory accesses increase, the MCH becomes a bottleneck [11]. The paper defines a measure called memory access intensity and uses it to determine the mentioned threshold.

Fig. 3. Block diagram of a 4-core Intel multicore processor [11].

2.1.1 Double-Buffering Technique on Multicore

In this section we briefly review double buffering, an efficient technique used by the cores of multicore processors to prefetch data and thereby hide data-transfer latency. Double buffering is a special case of a more general technique called multi-buffering. Both are popular techniques with applications in many areas of computer science and engineering, both hardware- and software-related. Double buffering is interesting because it is the most effective and suitable variant of multi-buffering given the small local stores of the cores of multicore processors [7]; for example, the local store available to each core of the Cell processor is only 256KB. For more information on multi-buffering techniques on multicore, including single, double, and triple buffering, please refer to [12]. Note that using more buffers avoids most of the complexities of double buffering, but it requires more memory, which is undesirable on multicore processors [7]. Double buffering has recently been used in basic algorithms on multicore platforms; for example, B. Gedik et al. in [2] and D. Bader et al. in [13] used it to improve the performance of sorting and list ranking algorithms, in particular to hide memory-access latency. In what follows we briefly review a recent work by J. Sancho et al. [7] that analyzed the impact of double buffering on two multicore processors: the Cell processor of IBM, Sony, and Toshiba, and the Quad-core Opteron of AMD.

Consider the two major operations performed when data must be transferred from the cores of the CPU to main memory and in the reverse direction; we call these put transfers and get transfers, respectively [7]. We place two separate buffers in the local store of each core, and at any moment one buffer is used for computation while the other is used for getting/putting data from/to main memory [7]. This model is shown in Fig. 4. Double buffering is possible and useful for computations in which it is possible to know or predict which data will be needed in the next iterations. Using this method, data transfers are overlapped with computation, and consequently the latency of data transfer is hidden.

Fig. 4. The double-buffering model for the local store of a core: one buffer is designated for computation and the other for getting/putting data from/to memory [7].
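
To make the pattern concrete, the following is a minimal C++ sketch of a double-buffered processing loop. It is illustrative only: std::async and memcpy stand in for the platform's asynchronous transfer primitives (on the Cell, these would be DMA get/put requests issued by an SPE), and the function names and buffer sizes are our own hypothetical choices, not code from [7].

    #include <cstddef>
    #include <cstring>
    #include <future>
    #include <vector>

    void compute(float* block, std::size_t n) {       // toy kernel
        for (std::size_t i = 0; i < n; ++i) block[i] *= 2.0f;
    }

    // Process nblocks blocks of `block` floats each, always fetching block
    // i+1 asynchronously while computing on block i.
    void process_stream(const float* src, float* dst,
                        std::size_t nblocks, std::size_t block) {
        std::vector<float> buf[2] = {std::vector<float>(block),
                                     std::vector<float>(block)};
        // Prefetch block 0 into buffer 0 (a "get" transfer).
        std::future<void> pending = std::async(std::launch::async, [&] {
            std::memcpy(buf[0].data(), src, block * sizeof(float)); });
        for (std::size_t i = 0; i < nblocks; ++i) {
            const std::size_t cur = i % 2, nxt = 1 - cur;
            pending.wait();                         // block i is now resident
            std::future<void> next;
            if (i + 1 < nblocks)                    // start fetching block i+1 ...
                next = std::async(std::launch::async, [&, i, nxt] {
                    std::memcpy(buf[nxt].data(), src + (i + 1) * block,
                                block * sizeof(float)); });
            compute(buf[cur].data(), block);        // ... while computing on block i
            std::memcpy(dst + i * block, buf[cur].data(),
                        block * sizeof(float));     // "put" transfer
            pending = std::move(next);
        }
    }

On the Cell, the same structure appears with DMA tags in place of futures; the key point is that the loop waits only for the buffer it is about to consume.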

Experiments were performed on the two mentioned multicore processors. In the first experiment, on the Quad-core Opteron, the hardware prefetcher is used to prefetch data from main memory into the L1 cache. To do this, the prefetcher watches memory accesses over a set of iterations, predicts the memory blocks or lines most likely to be requested in the next iterations, and prefetches them. For example, if two consecutive accesses hit blocks n and n+1, the hardware prefetcher fetches block n+2. Although the hardware prefetcher works automatically and needs no particular attention, it lacks the flexibility of software prefetching. The Cell processor, on the other hand, has dedicated DMA engines that transfer data between the cores of the Cell processor (called SPEs) and main memory. These DMA engines are under user control and can transfer data while the SPEs are processing other data.

The following plots show the elapsed time and the speedup achieved with the double-buffering technique in comparison to a single buffer. Fig. 5 shows the results of double buffering on 4 Quad-core Opteron processors working in parallel [7]. Fig. 5(a) shows the execution time of double buffering compared to single buffering as the number of cores increases; note that the Y axis is on a logarithmic scale. As can be seen, execution time with both techniques decreases as the number of cores grows. Fig. 5(b) shows the speedup of double buffering over single buffering for two prefetching cases, stride-1 and stride-1K, where the stride is the distance between two consecutive blocks of the data being prefetched from memory.

Fig. 5. (a) Execution time of the double- and single-buffering schemes as the number of cores increases, on 4 Quad-core Opterons; (b) speedup of the double-buffering technique over single buffering with two strides, as computation intensity increases [7].

Finally, Fig. 6 compares the execution times of double buffering on the Cell processor and the Quad-core Opteron as computation intensity increases. As can be seen, the Quad-core Opteron outperforms the Cell processor; as the authors state, this superiority is due to its higher aggregate memory bandwidth and higher peak processing rate [7]. Although the Opteron's advantage shrinks as computation intensity increases, it remains faster than the Cell: at a computation intensity of 1 the Opteron is about 6 times faster, while at a computation intensity of 20 it is only about 2 times faster. Please refer to [7] for more information on computation intensity and how the authors measure it.

Fig. 6. Execution times of the double-buffering technique on the Quad-core Opteron and the Cell processor [7].
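
Before moving on, note that the stride-based prediction the Opteron's prefetcher performs in hardware can also be approximated in software. The sketch below uses the __builtin_prefetch intrinsic available in GCC and Clang; the function and its parameters are our own illustrative choices, not code from [7].

    #include <cstddef>

    // While working on element i, request the cache line `stride` elements
    // ahead, mimicking a stride-1 hardware prefetch stream.
    void scale(float* a, std::size_t n, float c) {
        const std::size_t stride = 16;  // 16 floats = one 64-byte cache line
        for (std::size_t i = 0; i < n; ++i) {
            if (i + stride < n)
                __builtin_prefetch(&a[i + stride], /*rw=*/0, /*locality=*/1);
            a[i] *= c;
        }
    }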

2.2 Basic Problems and Algorithms on Multicore Processors

Up to this point, we have reviewed studies on multicore processors that are not about algorithms but about hardware aspects and techniques that improve the performance and speed of execution. As mentioned earlier, because multicore processors are new, such works are not numerous; indeed, works on specific problems and algorithms for multicore are few, and they address basic methods such as sorting, matrix multiplication, dynamic programming, and list ranking.

Another important issue is the compilers and programming environments needed to implement and evaluate algorithms on multicore. Unfortunately, there is as yet no dedicated compiler for multicore platforms, and current programming and execution are done in traditional environments originally designed for sequential processing. At best, the programming environment is a modified or optimized compiler or programming language that may not extract the highest possible performance from a multicore processor. Some of these environments are MPI, OpenMP, and Cilk. MPI is a message-passing library interface, used from languages such as C and C++, designed for the message-passing parallel programming model; A. Kumar et al. in [14] studied the feasibility of using MPI to program the Cell multicore processor. OpenMP is a set of compiler directives and library routines for C/C++ designed for shared-memory parallel architectures. As the architecture of multicore processors shows, and as recent research and projects have treated them, multicore processors fall mostly into the category of shared-memory architectures; therefore, OpenMP is a natural choice for programming them. In addition, a new extension of C++ called Cilk has been developed for programming multicore platforms; please refer to [15] for more information.

Other instances of basic algorithms on multicore are dynamic programming and matrix operations, discussed in [16] and [17], respectively. The first evaluates three classes of dynamic programming algorithms on multicore (local-dependency dynamic programming, the Gaussian elimination paradigm, and the parenthesis problem), with experiments performed on an 8-core AMD Opteron. The second evaluates linear-algebra matrix operations on multicore. Recently, new work on the sequence alignment problem was carried out on the Cell processor. Although this is not an advanced problem, it sits one level above the basic algorithms mentioned so far, suggesting that researchers are gradually moving from basic algorithms toward complex, specialized algorithms with applications in a particular scientific field. A. Sarje et al. in [18] presented two advanced alignment techniques, spliced alignment and syntenic alignment, implemented on the Cell processor.

These alignment techniques are used in biological applications. Experimental results show speedups of about 4 on the Cell processor compared to serial algorithms on the Cell and on a Pentium 4 processor [18].

2.3 Sorting Algorithms on Multicore Processors

In this section we give a brief review of sorting algorithms on multicore processors. There is earlier work on large parallel machines and multiprocessors, which can be found in [19]; so far, however, only a few works have studied the optimization of sorting algorithms on multicore [2], [20]. In most cases, the routine for designing a sorting algorithm for a multicore processor is as follows. First, a basic sorting algorithm that appears effective and efficient for parallel sorting is chosen. It is then used as the kernel of the whole sorting process and possibly modified for better performance. Finally, optimizations are applied to the whole process for a more effective implementation; multi-buffering and memory-latency hiding are popular ones. In the following, we discuss the CellSort algorithm of B. Gedik et al., which uses a bitonic sorting kernel to build a sorting process on the Cell processor. More information about this algorithm can be found in [2].

2.3.1 CellSort: A Sorting Algorithm on the Cell Processor

CellSort, as its authors describe it, is based on a distributed bitonic merge with a SIMDized bitonic sorting kernel [2]. The sorting process has three levels. The innermost level, called single-SPE local sort, sorts the data items local to one core of the Cell processor using only that core; we refer to the cores of the Cell processor as SPEs. At the second level, the data items stored in the local stores of the SPEs are sorted using a distributed bitonic sort. This level is called in-core sort because the sorting happens entirely inside the Cell processor, between SPEs or within an SPE. The SPEs are connected by a bus called the Element Interconnect Bus (EIB); the architecture of the Cell processor, the SPEs, the PPE, and the EIB is shown in Fig. 2. In practice, we also need to sort data that does not fit in the local stores of the Cell processor, and a sorting algorithm must handle this situation. CellSort does so by moving data back and forth between the Cell processor and main memory; as the authors state, this level is called distributed out-of-core sort. It is very costly and time-consuming because of the data transfers between the Cell processor and main memory.

Single-SPE local sort is a simple bitonic sort applied to the data items in the local memory of an SPE; Fig. 7 shows the scheme [2]. Given a list of n items, bitonic sort starts with lists of size 1 and merges them stage by stage, doubling the list size, until it merges two sorted bitonic lists of size n/2. This bitonic merging process is shown in Fig. 7. The computational complexity of simple bitonic sort for sorting m items is O(m log^2 m).
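
For reference, the following is a generic, sequential C++ sketch of the bitonic sorting network of Fig. 7 for a power-of-two input; this is the textbook formulation, not CellSort's own code. Note that all compare-exchanges at a given distance j are independent of one another, which is what makes the network easy to parallelize and SIMDize.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Sorts a in ascending order; a.size() must be a power of two.
    void bitonic_sort(std::vector<int>& a) {
        const std::size_t n = a.size();
        for (std::size_t k = 2; k <= n; k <<= 1)          // merged run length
            for (std::size_t j = k >> 1; j > 0; j >>= 1)  // compare distance
                for (std::size_t i = 0; i < n; ++i) {     // independent pairs
                    const std::size_t l = i ^ j;
                    if (l > i) {
                        const bool up = (i & k) == 0;     // run direction
                        if ((a[i] > a[l]) == up) std::swap(a[i], a[l]);
                    }
                }
    }
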
One optimization that can be applied to the sorting process is the use of the SIMD instructions provided by the SPEs of the Cell processor. With the SIMD instruction set, two vectors of 4 items each can be compared and sorted with only three SIMD instructions, whereas without them up to 8 operations are needed [2]. Given two vectors of size 4 in the bitonic merging process, the SIMD comparison instruction combined with a selection instruction puts the 4 lower items into one vector and the 4 higher items into the other. These instructions, and the support the Cell processor provides for them, improve the whole sorting process and yield a higher speedup.
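
The following sketch shows this comparison-plus-selection pattern. We use x86 SSE4.1 intrinsics as a stand-in for the Cell SPU equivalents (on the SPU, spu_cmpgt and spu_sel play these roles); the three intrinsic calls mirror the three SIMD instructions mentioned above.

    #include <smmintrin.h>  // SSE4.1: _mm_cmpgt_epi32, _mm_blendv_epi8

    // After the call, a holds the element-wise minima of the two 4-int
    // vectors and b the element-wise maxima: one vector compare-exchange.
    static inline void simd_compare_exchange(__m128i& a, __m128i& b) {
        const __m128i gt = _mm_cmpgt_epi32(a, b);      // 1: a > b, per lane
        const __m128i lo = _mm_blendv_epi8(a, b, gt);  // 2: select smaller
        const __m128i hi = _mm_blendv_epi8(b, a, gt);  // 3: select larger
        a = lo;
        b = hi;
    }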

The SIMD instruction set can be used to compare any two vectors with 4 or more data items; even for comparisons involving fewer than 4 items, the data can be rearranged so that SIMD instructions apply.

Fig. 7. Phases of the bitonic merging process for sorting 8 numbers [2].

For example, consider the last phase of comparisons in Fig. 7, where comparisons are performed on every two consecutive data items. In this case, SIMD instructions cannot directly compare and swap each item with its neighbor, but it is possible to relocate the items so that each item is compared and swapped with its neighbor using SIMD instructions. This is done by shuffling: Fig. 8 shows the shuffling that can be applied to the last stage of Fig. 7. After shuffling, SIMD instructions are used, and finally another shuffle restores the items to their positions.

Fig. 8. Using SIMD instructions for comparing and swapping vectors with fewer than 4 items [2].

After the local sorts, the next level is the distributed in-core sort, in which all SPEs cooperate to sort a vector of size 8m, where m is the number of items in the local store of one SPE. In this level, every SPE first sorts its local data items using the single-SPE local sort, with each two consecutive SPEs sorting in opposite orders (one ascending, the other descending). Bitonic merges are then applied, starting from lists of m items (each merge producing a sorted list of length 2m) up to the final list of length 8m in the last phase. Note that when merging the lists of two consecutive SPEs, the lists can be divided so that each SPE compares and swaps only half of the numbers. Consider the first merging phase: every SPE holds a list of numbers sorted in the opposite order of its neighboring SPE's list, so each number of the first list can be compared with its corresponding number in the second list, putting the smaller number in the first list and the larger number in the second list.
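
This element-wise compare-split step can be sketched in a few lines of plain C++ (illustrative only; in CellSort each half of the loop runs on a different SPE and is SIMDized as above). Since the two lists are sorted in opposite orders, their concatenation is bitonic, and after the loop every element of a is at most every element of b.

    #include <cstddef>
    #include <utility>
    #include <vector>

    // a is sorted ascending, b descending, with equal lengths. Afterwards
    // each half is bitonic and can be merged locally and independently.
    void compare_split(std::vector<int>& a, std::vector<int>& b) {
        for (std::size_t i = 0; i < a.size(); ++i)  // pairs are independent
            if (a[i] > b[i]) std::swap(a[i], b[i]);
    }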

Because each number's compare-and-swap with its corresponding number is independent of the others, the lists can be divided and assigned to different SPEs. To do so, each of the two consecutive lists is split into two equal halves, as shown in Fig. 9. The first half of the first list is assigned to the first SPE, to be compared and swapped with the first half of the second list; likewise, the second half of the first list is assigned to the second SPE, to be compared and swapped with the second half of the second list. In this way both SPEs stay busy with equal workloads, and each SPE has to transfer only half of its list to the other SPE, resulting in fewer communications between SPEs. Fig. 9 illustrates this optimization for the phase in which each list has length 2m.

Fig. 9. Dividing the lists of the SPEs and assigning them to different SPEs, for the phase in which the length of a sorted list is 2m [2].

The last level of bitonic sorting on the Cell processor is the distributed out-of-core sort. Here, a series of in-core sorts first divides all data items into sorted lists of size 8m, each sorted in the opposite order of its neighboring lists. Then an out-of-core bitonic merge is applied to these lists until the final sorted list containing all data items is obtained [2]. The whole process of bitonic sorting is thus a distributed bitonic merge composed of many in-core bitonic merges and single-SPE local sorts. Note that the distributed out-of-core merge involves a huge number of very time-consuming data transfers between the SPEs and main memory. To hide this latency, prefetching is used, as described in Section 2.1.1, by means of the DMA requests available to the SPEs: every SPE can issue a DMA request to get the (i+1)-th block of data items while sorting the i-th block, and can also put the (i-1)-th block back into main memory. In this way, a great deal of the latency is overlapped with the computations of the SPEs. The total computational complexity of the whole sorting process, with p SPEs and a total of n data items, is O((n/p) log^2 n) [2].

CellSort has been evaluated on a 16-SPE Cell system (an IBM QS20 Cell blade), and its results have been compared to the following algorithms on the Cell processor, an Intel Xeon, and an Intel Pentium 4: simple bitonic sort, SIMDized bitonic sort, shell sort, and quicksort. The maximum number of items in a single SPE is 32K, the number of SPEs is 16, and the maximum number of sorted items is 128M. The evaluation results are reported separately for the three levels: single-SPE local sort, distributed in-core sort, and distributed out-of-core sort. The first two plots, shown in Fig. 10, give the results of sorting 32K numbers with the single-SPE local sort [2], reporting the sorting time of SIMDized bitonic sort relative to the other algorithms on the Cell and Intel processors; Fig. 10(a) shows times for integer numbers and Fig. 10(b) for floats. As can be seen, SIMDized bitonic sort achieves the best results.

Fig. 10. Single-SPE local sort: (a) integer numbers, (b) float numbers [2].

The next experiment covers the distributed in-core sort. Again, the results of bitonic sort are compared to those of the other algorithms on the mentioned processors; Fig. 11 summarizes the results. Note that the Y axis is the time of the other algorithms relative to the time of the in-core bitonic sort; Fig. 11(a) shows times for integer numbers and Fig. 11(b) for floats. Again, the in-core bitonic sort on the Cell processor outperforms the other algorithms, even bitonic sort implemented on the Intel Xeon.

Fig. 11. Distributed in-core bitonic sort: (a) integer numbers, (b) float numbers [2].

The last and most comprehensive experiment covers the distributed out-of-core sort, evaluated with 16 SPEs. As in the two previous experiments, it is performed separately for integer and float numbers, with up to 128M data items; the results are shown in Fig. 12(a),(b). As the figures show, the distributed out-of-core bitonic sort on the Cell processor achieves the best results among all combinations of algorithms and processors.

Fig. 12. Distributed out-of-core bitonic sort: (a) integer numbers, (b) float numbers [2].

2.4 List Ranking on Multicore

List ranking is one of the most important building blocks for many problems in computational geometry. Given a linked list of nodes, the list ranking problem determines the rank (distance) of each node relative to the first node. As far as we have seen, there is only one study of list ranking on multicore platforms, by D. Bader et al. [13]; we briefly present it here and refer the reader to [13] for more information.

Assume a list of n nodes, each with a value property and a prefix property, such that the value of each node is 1 and the prefix of each node is the sum of its value and the prefix of its predecessor. Under these assumptions, list ranking in [13] proceeds in four steps: first, the algorithm partitions the input list into s sublists by randomly choosing sublist head nodes. Then, the prefix of each node within its sublist is computed. Next, the prefix of each sublist head is obtained by accumulating the sublist sums along the chain of sublists. Finally, each node computes its final prefix by adding its own prefix to the prefix of the last node in the previous sublist [13].

As in CellSort, prefetching via DMA requests is used to hide the latency of fetching data from memory. The total number of DMA requests is O(n), and the computational complexity of the algorithm per core is O(n/p); thus, on a Cell processor with 8 SPEs, the computational complexity per SPE is O(n/8). However, although prefetching can reduce data-transfer latency, especially in sorting, it cannot help much in list ranking, because the memory location of a node's successor is not known in advance. In this case, prefetching can only fetch the blocks of data around the current node and make predictions, and even these predictions do not help much. As a result, cache misses occur whenever the next node is absent from the currently fetched block of data, and the number of DMA requests, and consequently data transfers, is high.

To solve this problem and hide these latencies, [13] uses multi-threading. Because the SPEs do not support hardware multi-threading, a software-managed multi-threading scheme is used, in which several threads, each handling one sublist, are allocated to each SPE. Whenever a cache miss occurs, meaning that the next node is not inside the fetched block of data, a DMA request is issued and a thread switch is performed; meanwhile, the SPE is kept busy executing the next thread. Threads are assigned to an SPE in round-robin order. Fig. 13 shows the software-managed multi-threading technique. With this technique, as the authors state, there are no stalls provided there are sufficiently many threads.

Fig. 13. The software-managed multi-threading technique: several threads are allocated to each SPE [13].
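
To make the four steps concrete, here is a compact sequential C++ sketch of the sublist-based scheme. The names and structure are ours, not code from [13]; in [13], step 2 runs in parallel across the SPEs, with the software multi-threading above hiding the DMA latency of the next[] accesses. We assume the splitters are distinct and distinct from the head.

    #include <vector>

    // next[v] = successor of node v, or -1 at the tail; head = first node;
    // heads = randomly chosen sublist head nodes (the splitters; step 1).
    // Returns rank[v] = distance of node v from the head.
    std::vector<int> list_rank(const std::vector<int>& next, int head,
                               std::vector<int> heads) {
        const int n = static_cast<int>(next.size());
        heads.insert(heads.begin(), head);      // the list head opens sublist 0
        const int S = static_cast<int>(heads.size());

        std::vector<int> starts(n, -1);         // starts[v] = sublist v opens
        for (int s = 0; s < S; ++s) starts[heads[s]] = s;

        std::vector<int> owner(n, -1), local(n, 0);
        std::vector<int> len(S, 0), succ(S, -1);

        // Step 2 (parallel across SPEs in [13]): walk each sublist, recording
        // each node's local rank, the sublist length, and the next sublist.
        for (int s = 0; s < S; ++s) {
            int v = heads[s], r = 0;
            while (v != -1 && (r == 0 || starts[v] == -1)) {
                owner[v] = s;
                local[v] = r++;
                v = next[v];
            }
            len[s] = r;
            succ[s] = (v == -1) ? -1 : starts[v];
        }

        // Step 3 (sequential): accumulate sublist lengths along the chain of
        // sublists, giving the global rank of each sublist head.
        std::vector<int> base(S, 0);
        for (int s = 0, acc = 0; s != -1; s = succ[s]) {
            base[s] = acc;
            acc += len[s];
        }

        // Step 4: every node offsets its local rank by its sublist's base.
        std::vector<int> rank(n);
        for (int v = 0; v < n; ++v) rank[v] = base[owner[v]] + local[v];
        return rank;
    }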

Experiments were performed on an IBM BladeCenter QS20 with two 3.2 GHz Cell processors, one of which is used for the performance measurements. Fig. 14 shows the running time of the list ranking algorithm on each SPE; the running times differ because the lengths of the lists allocated to the SPEs differ. The experiment is performed in two configurations, with and without software multi-threading. As can be seen, with multi-threading and 64 sublists (8 sublists allocated to each SPE), the total time required for list ranking is significantly less than without multi-threading.

Fig. 14. Results of list ranking with and without multi-threading, and the running times of the SPEs [13].

In another experiment, the proposed list ranking algorithm on the Cell processor is compared to list ranking on other sequential and parallel processors, with a list of 8 million nodes. Fig. 15 shows these comparisons for two types of lists, random and ordered. Fig. 15(a) compares the algorithm on the Cell processor with sequential processors: the Cell achieves a smaller running time than all of them. Fig. 15(b) makes the same comparison against several other parallel processors; although the Cell is not the best in this figure, its list ranking performance is comparable to the other parallel processors.

Fig. 15. Comparison of list ranking on the Cell processor with (a) sequential processors and (b) parallel processors, on 8 million ordered and random nodes [13].

2.5 Graphics Processing Units: Basic and Advanced Problems

As discussed in Section 1, GPUs are efficient parallel processors that can act as co-processors and in some situations even outperform CPUs, especially on computations that can be SIMDized. GPUs were originally designed for graphics and geometric computations, and there is research, both basic and advanced, in image processing and computational geometry implemented on GPUs. Examples include Delaunay triangulation, biomedical image analysis, and motion estimation, which have been discussed and implemented for the GPU in [21], [22], and [23], respectively.

GPU programming needs special environments designed for the purpose; without them, efficient use of the many simple parallel cores of a GPU is not possible. The Compute Unified Device Architecture (CUDA) is one such programming environment, and it has been used to code different algorithms and problems on recent generations of Nvidia GPUs. In the following, we review a GPU sorting algorithm called GPUTeraSort, which uses bitonic sort as its kernel; for more detail, please refer to [24].

2.5.1 Sorting Algorithms on GPU

The idea behind sorting on GPUs is that they are large parallel vector co-processors, while the disadvantage of CPU-based sorting algorithms is significant cache misses on large datasets [24]. Current GPUs have roughly 10x higher main-memory bandwidth and use data parallelism to achieve roughly 10x more operations per second than CPUs [24]. Many researchers have proposed GPU-based sorting algorithms. T. Purcell et al. in [25] proposed a bitonic sort method on the GPU. N. Govindaraju et al. in [26] proposed a sorting algorithm based on a periodic balanced sorting network, using the GPU's texture mapping and blending operations. A more recent work by N. Govindaraju et al. [24], called GPUTeraSort, uses bitonic sort as its sorting algorithm and implements it on the GPU. The implementation handles the sorting of wide keys over huge numbers of records residing in external memory, and tries to exploit the maximum processing power of both the GPU and the CPU.

Most external-memory sorting algorithms have two phases: the first produces a set of locally sorted files, and the second sorts all input files globally [24]. External sorting algorithms must cope with I/O, because they sort sets of data and files too large to fit in main memory. GPUTeraSort has 5 stages [24], sketched in code after the next paragraph:

1. Reader: reads input files into a main-memory buffer. For more speed, the files are spread over parallel disks so that data can be transferred to memory in parallel.
2. Key-Generator: computes (key, pointer) pairs for the records of data in the buffer.
3. Sorter: reads and sorts the key-pointer pairs. This stage is both computation- and memory-intensive, since it must read the pairs from main memory, sort them, and write them back.
4. Reorder: rearranges the buffer according to the sorted key-pointer pairs, producing a sorted output buffer.
5. Writer: writes the output buffer to the parallel disks.

In GPUTeraSort, the sorting stage is performed by the GPU, which sorts at a higher degree of parallelism than the CPU; in particular, this frees the CPU to achieve higher I/O performance [24]. In addition, the GPU accesses main memory with higher bandwidth than the CPU, which provides higher throughput. Fig. 16 compares the total sorting time, as the number of records increases, of GPUTeraSort and several high-performance CPU-based sorting algorithms; the CPU-based algorithms are quicksort implementations evaluated on three multicore processors, while GPUTeraSort runs on an Nvidia 7900 GTX GPU. As the figure shows, GPUTeraSort performs comparably to the CPU-based algorithms, especially when the price of the platform is taken into account [24]. Another comparison, between GPUTeraSort and the other GPU-based algorithms proposed in [25] and [26], is illustrated in Fig. 17.
As can be seen, GPUTeraSort outperforms the other GPU-based algorithms, requiring less sorting time as the number of records increases.
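
The key-pointer organization at the heart of the Key-Generator, Sorter, and Reorder stages can be sketched in a few lines of C++. This is illustrative only: the record layout and names are ours, and in GPUTeraSort the sort itself is a bitonic sort executed on the GPU rather than std::sort.

    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <utility>
    #include <vector>

    struct Record { std::uint32_t key; std::string payload; };

    std::vector<Record> sort_records(const std::vector<Record>& in) {
        // Key-Generator: build compact (key, index) pairs so the sorter
        // moves small fixed-size items instead of whole records.
        std::vector<std::pair<std::uint32_t, std::uint32_t>> kp(in.size());
        for (std::uint32_t i = 0; i < in.size(); ++i) kp[i] = {in[i].key, i};

        // Sorter: in GPUTeraSort this is a bitonic sort run on the GPU.
        std::sort(kp.begin(), kp.end());

        // Reorder: gather the records into their sorted positions.
        std::vector<Record> out;
        out.reserve(in.size());
        for (const auto& p : kp) out.push_back(in[p.second]);
        return out;
    }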

Fig. 16. Comparison of the total sorting time of GPUTeraSort with quicksort implementations on multicore processors, as the number of records increases [24].

Fig. 17. Comparison of the total sorting time of GPUTeraSort with the other GPU-based algorithms proposed in [25] and [26], as the number of records increases [24].

The computational complexity of GPUTeraSort is O(n log^2 n) and its communication complexity is O(n), so that data transfer takes only about 10 percent of the total sorting time. This is shown in Fig. 18(a), which plots the time required for data transfer against the total sorting time as the number of records increases. Fig. 18(b) shows the time required by each stage of GPUTeraSort on three different GPUs: an Nvidia 6800, an Nvidia 6800 Ultra, and an Nvidia 7800 GT.

Fig. 18. (a) Comparison of the data transfer time with the total sorting time of GPUTeraSort [24]; (b) time spent in each stage of GPUTeraSort on three different GPUs.

References

[1] Tom R. Halfhill. Parallel Processing with CUDA. Microprocessor Report.
[2] Bugra Gedik, Rajesh R. Bordawekar, Philip S. Yu. CellSort: High Performance Sorting on the Cell Processor. Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB), Austria.
[3] M. de Berg, M. van Kreveld, M. Overmars, O. Schwarzkopf. Computational Geometry: Algorithms and Applications. Springer.
[4] K. Nyberg. Multi-core + multi-tasking = multi-opportunity? ACM SIGAda Ada Letters, Volume XXVII, Issue 3, pages 79-82.
[5] Victor Pankratius, Christoph Schaefer, Ali Jannesari, Walter F. Tichy. Software engineering for multicore systems: an experience report. IWMSE '08: Proceedings of the 1st International Workshop on Multicore Software Engineering, May.
[6] M. Mustafa Rafique, Ali R. Butt, Dimitrios S. Nikolopoulos. DMA-based prefetching for I/O-intensive workloads on the Cell architecture. Proceedings of the 2008 Conference on Computing Frontiers, pages 23-32, May.
[7] J. Sancho, D. Kerbyson. Analysis of double buffering on two different multicore architectures: Quad-core Opteron and the Cell-BE. IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pages 1-12.
[8] Tyler Sondag, Viswanath Krishnamurthy, Hridesh Rajan. Predictive thread-to-core assignment on a heterogeneous multi-core processor. Proceedings of the 4th Workshop on Programming Languages and Operating Systems.
[9] M. Chu, R. Ravindran, S. Mahlke. Data Access Partitioning for Fine-grain Parallelism on Multicore Architectures. 40th Annual IEEE/ACM International Symposium on Microarchitecture.
[10] James H. Anderson, John M. Calandrino. Parallel task scheduling on multicore platforms. ACM SIGBED Review, Volume 3, Issue 1, special issue: the work-in-progress (WIP) session of RTSS 2005, pages 1-6.
[11] Lixia Liu, Zhiyuan Li, Ahmed H. Sameh. Analyzing memory access intensity in parallel programs on multicore. International Conference on Supercomputing.
[12] T. Chen, Z. Sura, K. O'Brien, K. O'Brien. Optimizing the Use of Static Buffers for DMA on a CELL Chip. Proceedings of the 19th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2006), New Orleans, Louisiana.
[13] D. Bader, V. Agarwal, K. Madduri. On the design and analysis of irregular algorithms on the Cell processor: A case study on list ranking. Proceedings of IEEE IPDPS.
[14] A. Kumar, N. Jayam, A. Srinivasan, G. Senthilkumar, P. Baruah, S. Kapoor, M. Krishna, R. Sarma. Feasibility study of MPI implementation on the heterogeneous multi-core Cell BE architecture. Proceedings of the 19th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 55-56.
[15]
[16] R. Chowdhury, V. Ramachandran. Cache-efficient dynamic programming algorithms for multicores. Proceedings of the 20th Annual Symposium on Parallelism in Algorithms and Architectures (SPAA).
[17] E. Chan, E. Quintana-Orti, G. Quintana-Orti, R. van de Geijn. SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. Proceedings of the 19th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA).
[18] A. Sarje, S. Aluru. Parallel biological sequence alignments on the Cell Broadband Engine. IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pages 1-11.
[19] S. G. Akl. Parallel Sorting Algorithms. Academic Press.
[20] H. Inoue, T. Moriyama, H. Komatsu, T. Nakatani. AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors. Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques (PACT).
[21] G. Rong, Tiow-Seng Tan, Thanh-Tung Cao. Computing two-dimensional Delaunay triangulation using graphics hardware. Proceedings of the 2008 Symposium on Interactive 3D Graphics and Games, pages 89-97.
[22] T. Hartley, U. Catalyurek, A. Ruiz, F. Igual, R. Mayo, M. Ujaldon. Biomedical image analysis on a cooperative cluster of GPUs and multicores. Proceedings of the 22nd Annual International Conference on Supercomputing, pages 15-25.
[23] Wei-Nien Chen, Hsueh-Ming Hang. H.264/AVC motion estimation implementation on Compute Unified Device Architecture (CUDA). IEEE International Conference on Multimedia and Expo.
[24] N. Govindaraju, J. Gray, R. Kumar, D. Manocha. GPUTeraSort: high performance graphics co-processor sorting for large database management. International Conference on Management of Data (SIGMOD).
[25] T. Purcell, C. Donner, M. Cammarano, H. Jensen, P. Hanrahan. Photon mapping on programmable graphics hardware. ACM SIGGRAPH/Eurographics Conference on Graphics Hardware, pages 41-50.
[26] N. Govindaraju, N. Raghuvanshi, D. Manocha. Fast and approximate stream mining of quantiles and frequencies using graphics processors. Proceedings of ACM SIGMOD.


More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS Ferdinando Alessi Annalisa Massini Roberto Basili INGV Introduction The simulation of wave propagation

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367

More information

Parallel Processing SIMD, Vector and GPU s cont.

Parallel Processing SIMD, Vector and GPU s cont. Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP

More information

Performance impact of dynamic parallelism on different clustering algorithms

Performance impact of dynamic parallelism on different clustering algorithms Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu

More information

GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS

GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS Agenda Forming a GPGPU WG 1 st meeting Future meetings Activities Forming a GPGPU WG To raise needs and enhance information sharing A platform for knowledge

More information

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,

More information

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been

More information

Roadrunner. By Diana Lleva Julissa Campos Justina Tandar

Roadrunner. By Diana Lleva Julissa Campos Justina Tandar Roadrunner By Diana Lleva Julissa Campos Justina Tandar Overview Roadrunner background On-Chip Interconnect Number of Cores Memory Hierarchy Pipeline Organization Multithreading Organization Roadrunner

More information

Parallel Architecture. Hwansoo Han

Parallel Architecture. Hwansoo Han Parallel Architecture Hwansoo Han Performance Curve 2 Unicore Limitations Performance scaling stopped due to: Power Wire delay DRAM latency Limitation in ILP 3 Power Consumption (watts) 4 Wire Delay Range

More information

Optimizing Assignment of Threads to SPEs on the Cell BE Processor

Optimizing Assignment of Threads to SPEs on the Cell BE Processor Optimizing Assignment of Threads to SPEs on the Cell BE Processor T. Nagaraju P.K. Baruah Ashok Srinivasan Abstract The Cell is a heterogeneous multicore processor that has attracted much attention in

More information

Using Intel Streaming SIMD Extensions for 3D Geometry Processing

Using Intel Streaming SIMD Extensions for 3D Geometry Processing Using Intel Streaming SIMD Extensions for 3D Geometry Processing Wan-Chun Ma, Chia-Lin Yang Dept. of Computer Science and Information Engineering National Taiwan University firebird@cmlab.csie.ntu.edu.tw,

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters 1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk

More information

Parallel and Distributed Computing

Parallel and Distributed Computing Parallel and Distributed Computing NUMA; OpenCL; MapReduce José Monteiro MSc in Information Systems and Computer Engineering DEA in Computational Engineering Department of Computer Science and Engineering

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

A Recursive Data-Driven Approach to Programming Multicore Systems

A Recursive Data-Driven Approach to Programming Multicore Systems 1 A Recursive Data-Driven Approach to Programming Multicore Systems Rebecca Collins and Luca P. Carloni Technical Report CUCS-046-07 Department of Computer Science Columbia University 1214 Amsterdam Ave,

More information

Accelerating image registration on GPUs

Accelerating image registration on GPUs Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining

More information

Optimizing JPEG2000 Still Image Encoding on the Cell Broadband Engine

Optimizing JPEG2000 Still Image Encoding on the Cell Broadband Engine 37th International Conference on Parallel Processing Optimizing JPEG2000 Still Image Encoding on the Cell Broadband Engine Seunghwa Kang David A. Bader Georgia Institute of Technology, Atlanta, GA 30332

More information

The Pennsylvania State University. The Graduate School. College of Engineering PFFTC: AN IMPROVED FAST FOURIER TRANSFORM

The Pennsylvania State University. The Graduate School. College of Engineering PFFTC: AN IMPROVED FAST FOURIER TRANSFORM The Pennsylvania State University The Graduate School College of Engineering PFFTC: AN IMPROVED FAST FOURIER TRANSFORM FOR THE IBM CELL BROADBAND ENGINE A Thesis in Computer Science and Engineering by

More information

How to Write Fast Code , spring th Lecture, Mar. 31 st

How to Write Fast Code , spring th Lecture, Mar. 31 st How to Write Fast Code 18-645, spring 2008 20 th Lecture, Mar. 31 st Instructor: Markus Püschel TAs: Srinivas Chellappa (Vas) and Frédéric de Mesmay (Fred) Introduction Parallelism: definition Carrying

More information

MULTIMEDIA PROCESSING ON MANY-CORE TECHNOLOGIES USING DISTRIBUTED MULTIMEDIA MIDDLEWARE

MULTIMEDIA PROCESSING ON MANY-CORE TECHNOLOGIES USING DISTRIBUTED MULTIMEDIA MIDDLEWARE MULTIMEDIA PROCESSING ON MANY-CORE TECHNOLOGIES USING DISTRIBUTED MULTIMEDIA MIDDLEWARE Michael Repplinger 1,2, Martin Beyer 1, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken,

More information

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

What Next? Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University. * slides thanks to Kavita Bala & many others

What Next? Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University. * slides thanks to Kavita Bala & many others What Next? Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University * slides thanks to Kavita Bala & many others Final Project Demo Sign-Up: Will be posted outside my office after lecture today.

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

A MATLAB Interface to the GPU

A MATLAB Interface to the GPU Introduction Results, conclusions and further work References Department of Informatics Faculty of Mathematics and Natural Sciences University of Oslo June 2007 Introduction Results, conclusions and further

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

GPU Architecture. Alan Gray EPCC The University of Edinburgh

GPU Architecture. Alan Gray EPCC The University of Edinburgh GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? Architectural reasons for accelerator performance advantages Latest GPU Products From

More information

Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies

Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies John C. Linford John Michalakes Manish Vachharajani Adrian Sandu IMAGe TOY 2009 Workshop 2 Virginia

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

XIV International PhD Workshop OWD 2012, October Optimal structure of face detection algorithm using GPU architecture

XIV International PhD Workshop OWD 2012, October Optimal structure of face detection algorithm using GPU architecture XIV International PhD Workshop OWD 2012, 20 23 October 2012 Optimal structure of face detection algorithm using GPU architecture Dmitry Pertsau, Belarusian State University of Informatics and Radioelectronics

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

Cell Broadband Engine. Spencer Dennis Nicholas Barlow

Cell Broadband Engine. Spencer Dennis Nicholas Barlow Cell Broadband Engine Spencer Dennis Nicholas Barlow The Cell Processor Objective: [to bring] supercomputer power to everyday life Bridge the gap between conventional CPU s and high performance GPU s History

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

More information

A Parallel Access Method for Spatial Data Using GPU

A Parallel Access Method for Spatial Data Using GPU A Parallel Access Method for Spatial Data Using GPU Byoung-Woo Oh Department of Computer Engineering Kumoh National Institute of Technology Gumi, Korea bwoh@kumoh.ac.kr Abstract Spatial access methods

More information

Fast BVH Construction on GPUs

Fast BVH Construction on GPUs Fast BVH Construction on GPUs Published in EUROGRAGHICS, (2009) C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, D. Manocha University of North Carolina at Chapel Hill NVIDIA University of California

More information

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

Exploring GPU Architecture for N2P Image Processing Algorithms

Exploring GPU Architecture for N2P Image Processing Algorithms Exploring GPU Architecture for N2P Image Processing Algorithms Xuyuan Jin(0729183) x.jin@student.tue.nl 1. Introduction It is a trend that computer manufacturers provide multithreaded hardware that strongly

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

GPU for HPC. October 2010

GPU for HPC. October 2010 GPU for HPC Simone Melchionna Jonas Latt Francis Lapique October 2010 EPFL/ EDMX EPFL/EDMX EPFL/DIT simone.melchionna@epfl.ch jonas.latt@epfl.ch francis.lapique@epfl.ch 1 Moore s law: in the old days,

More information

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources This Unit: Putting It All Together CIS 501 Computer Architecture Unit 12: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital Circuits

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs To Use or Not to Use: CPUs Optimization Techniques on GPGPUs D.R.V.L.B. Thambawita Department of Computer Science and Technology Uva Wellassa University Badulla, Sri Lanka Email: vlbthambawita@gmail.com

More information

What does Heterogeneity bring?

What does Heterogeneity bring? What does Heterogeneity bring? Ken Koch Scientific Advisor, CCS-DO, LANL LACSI 2006 Conference October 18, 2006 Some Terminology Homogeneous Of the same or similar nature or kind Uniform in structure or

More information

B. Tech. Project Second Stage Report on

B. Tech. Project Second Stage Report on B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

General introduction: GPUs and the realm of parallel architectures

General introduction: GPUs and the realm of parallel architectures General introduction: GPUs and the realm of parallel architectures GPU Computing Training August 17-19 th 2015 Jan Lemeire (jan.lemeire@vub.ac.be) Graduated as Engineer in 1994 at VUB Worked for 4 years

More information

Comparing Memory Systems for Chip Multiprocessors

Comparing Memory Systems for Chip Multiprocessors Comparing Memory Systems for Chip Multiprocessors Jacob Leverich Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis Computer Systems Laboratory Stanford University

More information

CellSs Making it easier to program the Cell Broadband Engine processor

CellSs Making it easier to program the Cell Broadband Engine processor Perez, Bellens, Badia, and Labarta CellSs Making it easier to program the Cell Broadband Engine processor Presented by: Mujahed Eleyat Outline Motivation Architecture of the cell processor Challenges of

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

When MPPDB Meets GPU:

When MPPDB Meets GPU: When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU

More information

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture?

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture? This Unit: Putting It All Together CIS 371 Computer Organization and Design Unit 15: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital

More information

Algorithm Engineering with PRAM Algorithms

Algorithm Engineering with PRAM Algorithms Algorithm Engineering with PRAM Algorithms Bernard M.E. Moret moret@cs.unm.edu Department of Computer Science University of New Mexico Albuquerque, NM 87131 Rome School on Alg. Eng. p.1/29 Measuring and

More information

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &

More information

1 Motivation for Improving Matrix Multiplication

1 Motivation for Improving Matrix Multiplication CS170 Spring 2007 Lecture 7 Feb 6 1 Motivation for Improving Matrix Multiplication Now we will just consider the best way to implement the usual algorithm for matrix multiplication, the one that take 2n

More information

Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console

Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console Computer Architecture Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Milo Martin & Amir Roth at University of Pennsylvania! Computer Architecture

More information

Accelerating Spark RDD Operations with Local and Remote GPU Devices

Accelerating Spark RDD Operations with Local and Remote GPU Devices Accelerating Spark RDD Operations with Local and Remote GPU Devices Yasuhiro Ohno, Shin Morishima, and Hiroki Matsutani Dept.ofICS,KeioUniversity, 3-14-1 Hiyoshi, Kohoku, Yokohama, Japan 223-8522 Email:

More information