CUDA-based Parallel Implementation of IBM Word Alignment Algorithm for Statistical Machine Translation


Si-Yuan Jing (School of Computer Science, Leshan Normal University, Leshan, China); Gao-Rong Yan (School of Foreign Languages, Leshan Normal University, Leshan, China); Xing-Yuan Chen, Peng Jin, Zhao-Yi Guo (School of Computer Science, Leshan Normal University, Leshan, China)

Abstract: Word alignment is a basic task in natural language processing, and it usually serves as the starting point when building a modern statistical machine translation system. However, the state-of-the-art parallel algorithm for word alignment is still time-consuming. In this work, we explore a parallel implementation of a word alignment algorithm on the Graphics Processing Unit (GPU), which is now widely available in high performance computing. We use the Compute Unified Device Architecture (CUDA) programming model to re-implement a state-of-the-art word alignment algorithm, the IBM Expectation-Maximization (EM) algorithm. A Tesla K40M card with 2880 cores is used for the experiments, and the execution times of the proposed algorithm are compared with a sequential algorithm and a multi-threaded algorithm on an IBM X3850 server, which has two Intel Xeon E7 CPUs (2.0 GHz, 10 cores each). The best experimental results show a 16.8-fold speedup compared to the multi-threaded algorithm and an even larger speedup compared to the sequential algorithm.

Keywords: Word Alignment; GPU; Parallel Computation; Expectation-Maximization Algorithm; CUDA

I. INTRODUCTION

Word alignment is a basic task in natural language processing. Given a training corpus that contains a set of bi-texts (i.e., aligned parallel sentence pairs), the aim of word alignment is to identify translation relationships among the words in each bi-text, thereby obtaining a bipartite graph between the two sides of the bi-text. For example, given an English-Chinese sentence pair whose English side is "I am a teacher from Leshan normal university", word alignment should map each English word to its Chinese counterpart, with words that have no counterpart (such as "from" in this example) aligned to NULL. Word alignment usually serves as the starting point when building a modern statistical machine translation system. Moreover, for a phrase-based statistical machine translation system, it is also the basis of many subsequent tasks, such as phrase extraction, reordering, etc. Beyond statistical machine translation, word alignment can be used in translation lexicon induction, word sense discovery, word sense disambiguation, and so on.

So far, most word alignment algorithms are serial. In 1993, P. F. Brown et al. [1] first proposed an EM algorithm for training lexical statistical machine translation models, which are well known as IBM Models 1-5. Word alignment is a by-product of the IBM models; we refer to this method as the IBM EM algorithm in the remainder of this paper. S. Vogel et al. [2] proposed an HMM-based method to obtain word alignments. These two algorithms are still widely used today. F. J. Och et al. [3] developed a famous word alignment tool named GIZA++, the most popular word alignment tool of the last decade, which implements both the IBM models and the HMM model. Later, Q. Gao et al. [4] developed another tool named MGIZA++, a multi-threaded version of GIZA++.
Besides these works, researchers have also put forward other methods, such as Bayesian alignment [5] and deep learning [6]. Unfortunately, most of these algorithms are serial, and thus time-consuming on a relatively large corpus. Even the performance of MGIZA++ is not satisfactory. To address this problem, this paper explores a parallel implementation of a word alignment algorithm on the GPU. We use the CUDA programming model to re-implement the IBM EM algorithm. The proposed algorithm is evaluated on an Nvidia Tesla K40M card with 2880 cores and compared with a serial EM algorithm and a multi-threaded EM algorithm running on a powerful server. The experimental results show the efficiency of the proposed algorithm.

The rest of this paper is organized as follows. Section 2 introduces the word alignment problem as well as the IBM EM algorithm. Section 3 first reviews the essentials of GPU computing, and then presents the parallel implementation of the IBM EM algorithm in CUDA. Section 4 gives the experimental results and their analysis. Finally, Section 5 concludes.

II. PRELIMINARIES

This section briefly introduces the preliminaries of the word alignment problem as well as the IBM EM algorithm. More details can be found in [1] and [7]. Given a parallel bilingual corpus, denoted by (E, F), E represents a set of source language sentences and F represents a set of target language sentences. Let (e, f) denote the s-th sentence pair in the corpus, where e = (e_1, e_2, ..., e_l) contains l source language words and f = (f_1, f_2, ..., f_m) contains m target language words. The aim of the word alignment task is to find a set of binary relations <e_i, f_j>, where e_i and f_j are translations of each other in the sentence pair. Generally, we model the problem as an alignment function a: j -> i, which represents that a source language

word e_i is aligned to a target language word f_j. The aim is to learn this function.

In IBM Model 1, word alignment is obtained by training a lexical translation model. P. F. Brown uses a generative modeling method and an EM algorithm to estimate the lexical translation model. The basic idea of the IBM EM algorithm is as follows:

(1) Assume a given parallel bilingual corpus contains VE source language words and VF target language words, where VE (VF) is the vocabulary of the source (target) language sentences. Moreover, we assume that the translation probability (denoted t(f|e)) of every word pair <e, f> is uniform in the initial phase.

(2) Based on the current translation model, we calculate the alignment probability of arbitrary word pairs in a sentence pair, denoted P(a|e, f), by formula (1). The meaning of this step is that we apply a known translation model to the data and obtain the unknown alignment probabilities. This is the E step of the EM algorithm.

$$P(a \mid e, f) = \prod_{j=1}^{m} \frac{t(f_j \mid e_{a(j)})}{\sum_{i=0}^{l} t(f_j \mid e_i)} \qquad (1)$$

(3) With the new word alignment probabilities, we re-estimate the translation probabilities of all source language - target language word pairs in the corpus by formula (2). This is the M step of the EM algorithm.

$$t(f \mid e) = \frac{\sum_{(e,f)} c(f \mid e;\, e, f)}{\sum_{f'} \sum_{(e,f)} c(f' \mid e;\, e, f)} \qquad (2)$$

Here, c(f|e; e, f) is a function that counts how many times a specific source language word e is translated to a target language word f in a sentence pair (see formula (3)). δ(x, y) is the Kronecker function: δ(x, y) = 1 if x = y, otherwise δ(x, y) = 0.

$$c(f \mid e;\, e, f) = \frac{t(f \mid e)}{\sum_{i=0}^{l} t(f \mid e_i)} \sum_{j=1}^{m} \delta(f, f_j) \sum_{i=1}^{l} \delta(e, e_i) \qquad (3)$$

(4) Steps (2) and (3) are repeated for several iterations until the algorithm converges to a stable point, yielding the final alignment model and lexical translation model.
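As a concrete reference, the following is a minimal sequential sketch of one EM iteration following formulas (1)-(3); the container types, the function name, and the uniform default probability are illustrative assumptions, not the implementation evaluated later in this paper.

    #include <map>
    #include <string>
    #include <vector>

    using Sentence = std::vector<std::string>;
    using SentPair = std::pair<Sentence, Sentence>;  // <source e, target f>
    using TModel   = std::map<std::pair<std::string, std::string>, double>;

    // One EM iteration of IBM Model 1 over the whole corpus (sequential sketch).
    void em_iteration(const std::vector<SentPair>& corpus, TModel& t,
                      double default_prob) {
        TModel counts;                        // c(f|e) accumulated over the corpus
        std::map<std::string, double> total;  // per-source-word normalization totals
        for (const auto& [e, f] : corpus) {
            for (const auto& fw : f) {
                // E step: the denominator of formula (1) for this target word
                double denom = 0.0;
                for (const auto& ew : e) {
                    auto it = t.find({ew, fw});
                    denom += (it != t.end()) ? it->second : default_prob;
                }
                // Collect fractional counts, formula (3)
                for (const auto& ew : e) {
                    auto it = t.find({ew, fw});
                    double p = (it != t.end()) ? it->second : default_prob;
                    counts[{ew, fw}] += p / denom;
                    total[ew]        += p / denom;
                }
            }
        }
        // M step: re-estimate t(f|e), formula (2)
        for (const auto& [ef, c] : counts)
            t[ef] = c / total[ef.first];
    }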
III. PARALLEL IMPLEMENTATION OF THE IBM EM ALGORITHM

A. Parallel Computing on GPU

Before introducing the parallel design of the IBM EM algorithm, we briefly recall some basics of Nvidia's GPU architecture and the CUDA parallel programming model. Figure 1 shows the architecture of a typical Nvidia GPU [8]. Current GPUs have hundreds (or even thousands) of processing cores, called streaming processors (SPs). The SPs are divided into groups, and the SPs in each group constitute a streaming multiprocessor (SM). A number of SMs form an independent processing unit called a texture/processor cluster (TPC). For example, a GeForce 8800 GPU has 128 SPs organized as 16 SMs in 8 TPCs. Although a modern GPU supports simultaneous execution of millions of operations, it is not easy to achieve high performance without understanding the underlying hardware architecture and the programming model [9].

CUDA is the most popular GPU programming model, proposed by Nvidia. GPUs were initially designed to accelerate image processing and suffered from some restrictions when applied to general purpose computing. CUDA provides a good solution to this problem: it allows software developers to use a CUDA-enabled GPU for general purpose computing. So far, CUDA has been applied to many tasks in high performance scientific computing, such as medical image processing [10], machine learning [11], and natural language processing [12]. A typical CUDA program can be divided into two parts: a sequential host program and one or more parallel device programs (called kernels). The former is executed on the CPU and the latter are executed in parallel on the GPU, according to user-defined launch parameters, including the number of threads and the number of thread blocks. Therefore, the design and optimization of kernels is the key to high performance programming on the GPU. Next, we briefly introduce some key points of CUDA programming.

A kernel executes a scalar sequential program across a set of parallel threads. In CUDA, the thread is the basic unit of program execution, and all threads are organized hierarchically: a number of threads constitute a thread block, and a number of thread blocks form a grid. For example, a thread block can contain up to 512 threads on the G80/GT200 series GPUs. A kernel thus consists of a grid of one or more thread blocks. The threads in a thread block run concurrently and can cooperate among themselves through barrier synchronization and a per-block shared memory space private to that block. This is a key feature of CUDA. CUDA provides a set of APIs for thread synchronization, such as __syncthreads() and cudaDeviceSynchronize() [9]. The former implements barrier synchronization among threads in the same thread block, and the latter provides global barrier synchronization. This gives the programmer a very flexible synchronization mechanism: when an algorithm requires global synchronization, the program can be divided into several independent kernels, with cudaDeviceSynchronize() invoked between each pair of kernels, as the sketch below illustrates.

Single Instruction Multiple Threads (SIMT) is another important feature of CUDA [9][14][15]. SIMT means that threads are executed in bundles (called warps); the threads in a warp share a single multithreaded instruction unit. This design achieves substantial efficiency when executing data-parallel programs. To achieve the best efficiency, kernels should avoid execution divergence, since divergent execution paths within a warp incur a performance penalty; divergence across different warps, however, introduces no performance loss. In modern GPUs, a warp consists of 32 threads; in other words, the number of threads in a thread block should be a multiple of 32, so that no resources are wasted.
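As a concrete illustration (a minimal sketch, not code from this paper), the fragment below shows the thread indexing and grid-stride loop conventions, a block size chosen as a multiple of 32, and cudaDeviceSynchronize() used as a global barrier between two kernels:

    #include <cuda_runtime.h>

    // Each thread starts at its global index and strides by the total grid size.
    __global__ void kernelA(float* data, int n) {
        for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n;
             i += gridDim.x * blockDim.x)
            data[i] *= 2.0f;
    }

    __global__ void kernelB(float* data, int n) {
        for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n;
             i += gridDim.x * blockDim.x)
            data[i] += 1.0f;
    }

    void run(float* d_data, int n) {
        kernelA<<<128, 512>>>(d_data, n);  // 512 threads per block: a multiple of 32
        cudaDeviceSynchronize();           // global barrier between the two kernels
        kernelB<<<128, 512>>>(d_data, n);
        cudaDeviceSynchronize();
    }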

Finally, it should be noted that CUDA relies on fast multi-thread switching to hide the latency of transactions with external memory [14][15]. Unlike CPUs, current GPUs have little cache, so enough threads must be launched to keep the machine fully utilized. For current GPUs, a minimum of around 5,000 threads must be live simultaneously to efficiently utilize the entire device.

Figure 1: Architecture of a typical Nvidia GPU

B. A Specific Hash Table

According to the description in Section II, the lexical translation model is the core of the algorithm and persists through its whole life cycle. Thus, one of the most important problems is how to represent and store the lexical translation model on the GPU. We require a data structure that supports efficient concurrent access. However, current data structures, such as those used in GIZA++ or MGIZA++, do not meet this requirement. We therefore propose a word-alignment-oriented hash table for the CUDA environment (see Figure 2), whose design follows [9]. In the hash table, the Pool variable stores the entries of the translation model. Each entry has four members, Key, Prob, Count and next, which respectively store a word pair <e_i, f_j>, the translation probability of the word pair, the number of times the word pair appears in a sentence, and a pointer to the next entry with the same hash value. Another variable, Buckets, stores pointers to entries. Word pairs that share the same hash value are organized into a linked list, and the pointer in Buckets points to the most recently inserted one (i.e., the head of the linked list). For example, when a new word pair is to be inserted into the hash table, we create a new entry whose Key is set to the word pair and append it to the end of the Pool. Meanwhile, we compute its hash value, set the next member of the new entry to the location of the current head of the linked list for that hash value, and finally let the pointer in Buckets point to the new entry. The proposed data structure has several advantages: (1) it is easy to implement and to copy in both directions between host and device; (2) since CUDA provides a set of atomic operation APIs, it is easy to access the hash table concurrently on the GPU; (3) memory utilization is high.

Figure 2: A specific hash table
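A sketch of this data structure in CUDA C is given below. The concrete field types (integer word IDs, float probabilities) and the atomicCAS-based device-side insertion loop are assumptions made for illustration; the paper does not specify where or how insertion is performed.

    struct Key   { int e, f; };    // a word pair <e_i, f_j>, stored as word IDs
    struct Entry {
        Key   key;     // the word pair
        float prob;    // translation probability t(f|e)
        float count;   // fractional count accumulated during the current iteration
        int   next;    // pool index of the next entry with the same hash, -1 if none
    };

    struct HashTable {
        Entry* pool;       // entry storage (lives in GPU global memory)
        int*   buckets;    // head index of each collision chain, -1 if empty
        int*   poolSize;   // number of entries currently used in the pool
        int    numBuckets; // bucket count, chosen as a prime
    };

    // Insert a word pair: append an entry to the pool, then push it onto the
    // head of its collision chain with an atomicCAS retry loop.
    __device__ int insert_word_pair(HashTable ht, Key k, unsigned hashValue) {
        int idx = atomicAdd(ht.poolSize, 1);
        ht.pool[idx].key   = k;
        ht.pool[idx].prob  = 0.0f;
        ht.pool[idx].count = 0.0f;
        int* head = &ht.buckets[hashValue % ht.numBuckets];
        int  old  = *head;
        while (true) {
            ht.pool[idx].next = old;                // link to the current head
            int prev = atomicCAS(head, old, idx);   // try to become the new head
            if (prev == old) break;                 // success
            old = prev;                             // another thread won; retry
        }
        return idx;
    }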
C. Parallel Design and Pseudo Code

Data parallelism is the most direct way to parallelize the IBM EM algorithm: each thread processes one sentence pair. This method is simple and intuitive; all threads are independent and need not communicate with each other, so shared memory is unnecessary in this algorithm. Although a more suitable solution may exist, the proposed one shows good performance in the experiments.

In this work, the hash table that represents and stores the lexical translation model must stay in the GPU's global memory through the whole life cycle of the algorithm, for two reasons. (1) For a given corpus, we have no idea which word pairs will appear in a sentence pair; in other words, any word pair may appear. Therefore, the whole translation model must stay in memory all the time. (2) A translation model is usually very large (possibly hundreds of megabytes), so no type of memory except global memory can store it.

In this paper, the IBM EM algorithm is divided into two kernels. Kernel 1 is responsible for the expectation calculation; Kernel 2 is responsible for the regularization of the results obtained from Kernel 1. The pseudo code of the algorithm is shown below.

Kernel 1: Expectation calculation
Input:
    tmodel  : the translation model represented by the hash table
    corpus  : a set of parallel sentence pairs
    ftotal  : an array for word-pair counting
    sentnum : the number of sentence pairs
Output: the updated tmodel and ftotal

    tid    = threadIdx.x + blockIdx.x * blockDim.x
    offset = gridDim.x * blockDim.x
    for ( i = tid; i < sentnum; i += offset )
        <e, f> = get_sentence_pair_from_corpus( corpus, i )
        stotal[] = 0    // stotal is an array for counting within a sentence
        for ( j = 0; j < get_sentence_length( e ); j++ )
            for ( k = 0; k < get_sentence_length( f ); k++ )
                location = query_word_pair_from_model( tmodel, e[j], f[k] )
                if ( tmodel.pool[location].prob > DEFAULT_PROB )
                    stotal[j] += tmodel.pool[location].prob
                else
                    stotal[j] += DEFAULT_PROB
                end if
            end for
        end for
        for ( j = 0; j < get_sentence_length( e ); j++ )
            for ( k = 0; k < get_sentence_length( f ); k++ )
                location = query_word_pair_from_model( tmodel, e[j], f[k] )
                if ( tmodel.pool[location].prob > DEFAULT_PROB )
                    val = tmodel.pool[location].prob / stotal[j]
                else
                    val = DEFAULT_PROB / stotal[j]
                end if
                atomicAdd( &tmodel.pool[location].count, val )
                atomicAdd( &ftotal[ f[k] ], val )
            end for
        end for
    end for

Kernel 2: Regularization
Input: tmodel, ftotal
Output: the regularized translation model

    tid    = threadIdx.x + blockIdx.x * blockDim.x
    offset = gridDim.x * blockDim.x
    for ( i = tid; i < tmodel.length; i += offset )
        tmodel.pool[i].prob  = tmodel.pool[i].count / ftotal[ tmodel.pool[i].key.f ]
        tmodel.pool[i].count = 0
    end for
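To show how the pseudo code maps onto real CUDA code, the following is a compilable rendering of Kernel 2 under the HashTable sketch from Section III.B; the function and parameter names are assumptions, not the paper's source code.

    // Kernel 2: each thread normalizes one entry per grid-stride step.
    __global__ void regularize(HashTable tmodel, float* ftotal, int length) {
        int tid    = threadIdx.x + blockIdx.x * blockDim.x;
        int offset = gridDim.x * blockDim.x;
        for (int i = tid; i < length; i += offset) {
            // Normalize the accumulated count by the per-word total,
            // then reset the count for the next EM iteration.
            tmodel.pool[i].prob  = tmodel.pool[i].count / ftotal[tmodel.pool[i].key.f];
            tmodel.pool[i].count = 0.0f;
        }
    }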

In Kernel 1, each thread is responsible for one sentence pair at a time. For example, in the first pass of the loop, the threads in the first warp deal with sentence pairs 1-32, the threads in the second warp deal with sentence pairs 33-64, and so on. For each word pair <e[j], f[k]>, the algorithm first computes its hash value and queries its location in the hash table; it then counts how many times the word pair appears in the sentence pair. Both stotal and ftotal record these counts: the former counts within a sentence pair, while the latter counts over the whole corpus. Finally, we use atomicAdd(), an API provided by CUDA, to accumulate the counts atomically.

Different from Kernel 1, each thread of Kernel 2 deals with one word pair at a time. Because the number of word pairs is much larger than the number of sentence pairs, we can launch many more threads for the regularization of the results, which significantly increases the efficiency of this step.

Kernels 1 and 2 are repeated several times until the lexical translation probabilities converge. The CUDA function cudaDeviceSynchronize() is used for global synchronization between the two kernels.

D. Adaptability

Two adaptive strategies are proposed to increase the efficiency of the algorithm. The first concerns the launch parameters of the kernels. Unlike a CPU program, a GPU program needs to launch enough threads to keep the GPU fully utilized; that is, the more concurrent threads, the higher the efficiency of the algorithm. Because each thread in Kernel 1 deals with one sentence pair, the number of threads need not exceed the number of sentence pairs. We first fix the number of threads per thread block to 512 or 1024, then divide the number of sentence pairs by this number to obtain the number of thread blocks. If the obtained block count exceeds the maximum grid size of the GPU, we clamp it to that maximum. This strategy is also applied to Kernel 2; since Kernel 2 works on word pairs, it launches many more threads than Kernel 1, and the strategy guarantees that it launches as many threads as possible. A sketch of this computation follows.
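The sketch below computes the launch parameters as described; querying cudaGetDeviceProperties is one way to obtain the grid-size limit, though the paper does not specify how the limit is obtained.

    #include <cuda_runtime.h>
    #include <algorithm>

    // Derive the block count from the number of work items and clamp it
    // to the device maximum.
    int pick_num_blocks(int numItems, int threadsPerBlock) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        int maxBlocks = prop.maxGridSize[0];  // maximum grid size in the x dimension
        int blocks = (numItems + threadsPerBlock - 1) / threadsPerBlock;
        return std::min(blocks, maxBlocks);
    }

    // Kernel 1 is sized by sentence pairs, Kernel 2 by word pairs, e.g.:
    //   expectation<<<pick_num_blocks(sentnum, 512), 512>>>(/* ... */);
    //   regularize <<<pick_num_blocks(tmodelLength, 512), 512>>>(/* ... */);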
The second adaptive strategy concerns the proposed hash table. The querying efficiency of the hash table mainly depends on the length of the linked lists: it decreases if the bucket count (which equals the prime used in the hash function) is too small, while too large a bucket count wastes resources. We therefore adopt an adaptive strategy in which the bucket count (i.e., the selected prime) changes with the size of the corpus.

IV. EXPERIMENTS

A. Experiment Design

This section evaluates the efficiency of the proposed algorithm. We compare it with two other state-of-the-art implementations of the IBM EM algorithm: the sequential algorithm in GIZA++ and the multi-threaded algorithm in MGIZA++. The proposed algorithm is tested on a Dell R720 server with an Nvidia Tesla K40M card (2880 cores, 12 GB RAM). The other two programs run on a powerful server, an IBM X3850 equipped with 2 Intel Xeon E7 CPUs (2.0 GHz, 10 cores each) and 128 GB RAM. We focus mainly on the speedup ratio of the proposed algorithm.

The data set used in the experiments is an English-Chinese parallel corpus from [13]. We cut the original corpus into 10 data sets of different sizes. A detailed description of the data sets is given in Table I, which lists, in addition to the size of each data set (i.e., the number of sentence pairs), the number of English words, the number of Chinese words, and the number of word pairs. Note that the set of word pairs is a subset of the Cartesian product of the English words and the Chinese words.

TABLE I. DESCRIPTION OF DATA SETS (columns: Data Sets; Sentence Pairs; English Words; Chinese Words; Word Pairs)

B. Results and Analysis

The experimental results are shown in Table II. The three algorithms are named sequential EM, multi-threaded EM, and CUDA EM, respectively. Each algorithm runs for five iterations, and the execution time of each algorithm on each data set is recorded. The two steps of the IBM EM algorithm (expectation calculation and regularization) are timed separately and labeled 'S1' and 'S2' in the table; the reported times are averages over the five iterations. We also record the cost of data copying between CPU and GPU, listed in the 'Copy' column. The 'Total' columns list the total execution time of each algorithm.

TABLE II. EXECUTION TIME (UNIT: MILLISECONDS) (columns: Data Sets; Sequential EM (x10^3): S1, S2, Total; Multi-Threaded EM (x10^3): S1, S2, Total; CUDA EM: S1, S2, Copy, Total)

From Table II, we can easily see that the performance of the proposed algorithm is far superior to the other two algorithms. To make the results clearer, we further compute the speedup ratio for three aspects: expectation calculation, regularization, and total execution time. The results are shown in Table III. Next, we analyze the results from these three aspects.

TABLE III. SPEEDUP RATIO (columns: Data Sets; CUDA EM vs. Sequential EM: S1, S2, Total; CUDA EM vs. Multi-Threaded EM: S1, S2, Total)

(1) Speedup ratio of the expectation calculation. Compared with the sequential EM, the proposed algorithm achieves a stable, substantial speedup when the number of sentence pairs is higher than 30,000. When the number of sentence pairs is lower than 30,000, the experimental result is easily affected by factors such as the overhead of thread switching. In addition, the proposed algorithm achieves a clear speedup compared with the multi-threaded EM.

(2) Speedup ratio of the regularization. For this step, the acceleration effect of the proposed algorithm is significant. Compared with the sequential EM, the proposed algorithm obtains very large speedups when the number of sentence pairs is sufficiently large; compared with the multi-threaded EM, it reaches a speedup of about 800 times. These results demonstrate the efficiency of the adaptive strategy in our algorithm. As explained in Section III.D, the launch parameters of Kernel 2 (i.e., the number of thread blocks) are calculated from the number of word pairs. Since this number is far larger than the number of sentence pairs used for Kernel 1, Kernel 2 launches far more threads than Kernel 1 and therefore obtains a much higher speedup ratio.

(3) Speedup ratio of the total execution time. From Table III, the best experimental results show a 16.8-fold speedup compared to the multi-threaded algorithm and an even larger speedup compared to the sequential algorithm. These results demonstrate the efficiency of the proposed algorithm.

Next, we analyze the sizeup performance of the proposed algorithm. Sizeup analysis grows the data set by a factor p and measures how much longer the algorithm takes when the data set is p times larger than the original. The formula is shown below, where T_Di is the execution time for a data set of size i*D and T_D1 is the execution time for the base data set of size D:

$$\mathrm{Sizeup}(i) = \frac{T_{Di}}{T_{D1}} \qquad (4)$$

Four indexes of the proposed algorithm (i.e., S1, S2, Copy, and Total) are calculated.

The results are shown in Figure 3. As can be seen from the figure, the sizeup of S1 and S2 varies linearly with the scale of the data, and the sizeup of the data copy varies approximately linearly with the data size. The reason for the latter is that the copying time depends not only on the device but also on the host. Overall, the results show that the proposed algorithm has good sizeup performance.

Figure 3: Sizeup

Finally, we tested the effect of the other adaptive strategy on the algorithm's performance. With this strategy, the proposed algorithm dynamically selects a hash prime according to the size of the data set. We performed this experiment on the largest data set, running the algorithm once with an 8-digit prime and once with a 7-digit prime. The execution time with the 7-digit prime is about three times as long as with the 8-digit prime. The reason is obvious: a smaller prime leads to longer linked lists, which increases the querying time.

V. CONCLUSIONS AND FUTURE WORK

This paper proposes a parallel implementation of the IBM EM algorithm using the CUDA programming model, and all details of the algorithm have been explained. The proposed algorithm is tested on a modern GPU card (an Nvidia Tesla K40M), and the results are compared with two other state-of-the-art implementations of the IBM EM algorithm running on a powerful server. The best experimental results show a 16.8-fold speedup compared to the multi-threaded algorithm and an even larger speedup compared to the sequential algorithm, which demonstrates the efficiency of the proposed algorithm. In the future, we will continue to study two problems: (1) whether there is a better parallel formulation of the IBM EM algorithm; and (2) since GPU memory is limited, how to handle a huge corpus that cannot be directly processed by our algorithm; we plan to explore a CPU-GPU or GPU-cluster solution.

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China and the Scientific Research Fund of Leshan Normal University (Grant Nos. Z1325, Z1411, Z1504, S1511).

REFERENCES

[1] P. F. Brown, S. A. D. Pietra, V. J. D. Pietra, et al., "The mathematics of statistical machine translation: Parameter estimation," Computational Linguistics, Vol. 19, No. 2, 1993.
[2] S. Vogel, H. Ney, and C. Tillmann, "HMM-based word alignment in statistical translation," Proc. of Coling '96, 1996.
[3] F. J. Och and H. Ney, "Improved statistical alignment models," Proc. of ACL '00, 2000.
[4] Q. Gao and S. Vogel, "Parallel implementations of word alignment tool," Proc. of SETQA-NLP '08, 2008.
[5] C. Mermer, M. Saraçlar, and R. Sarikaya, "Improving statistical machine translation using Bayesian word alignment and Gibbs sampling," IEEE Trans. on Audio, Speech and Language Processing, Vol. 21, No. 5, 2013.
[6] T. Songyot and D. Chiang, "Improving word alignment using word similarity," Proc. of EMNLP '14, 2014.
[7] P. Koehn, Statistical Machine Translation, Cambridge University Press.
[8] E. Lindholm, J. Nickolls, and S. Oberman, "Nvidia Tesla: A unified graphics and computing architecture," IEEE Micro, Vol. 28, No. 2, 2008.
[9] D. B. Kirk and W. M. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
[10] H. D. Tagare, A. Barthel, and F. J. Sigworth, "An adaptive expectation-maximization algorithm with GPU implementation for electron cryo-microscopy," Journal of Structural Biology, Vol. 171, No. 3, 2010.
[11] S. Chetlur, C. Woolley, P. Vandermersch, et al., "cuDNN: Efficient primitives for deep learning," arXiv preprint, 2014.
[12] Y. M. Yi, C. Y. Lai, S. Petrov, et al., "Efficient parallel CKY parsing on GPUs," Journal of Logic & Computation, Vol. 24, 2011.
[13] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, et al., "Optimization principles and application performance evaluation of a multithreaded GPU using CUDA," Proc. of PPoPP '08, 2008.
[14] N. Satish, M. Harris, and M. Garland, "Designing efficient sorting algorithms for manycore GPUs," Proc. of IPDPS '09, 2009.
[15] T. Xiao, J. B. Zhu, H. Zhang, et al., "NiuTrans: An open source toolkit for phrase-based and syntax-based machine translation," Proc. of ACL '12, 2012.
