CUDA-based Parallel Implementation of IBM Word Alignment Algorithm for Statistical Machine Translation


Si-Yuan Jing (School of Computer Science, Leshan Normal University, Leshan, China); Gao-Rong Yan (School of Foreign Languages, Leshan Normal University, Leshan, China); Xing-Yuan Chen, Peng Jin, Zhao-Yi Guo (School of Computer Science, Leshan Normal University, Leshan, China)

Abstract: Word alignment is a basic task in natural language processing, and it usually serves as the starting point when building a modern statistical machine translation system. However, the state-of-the-art parallel algorithm for word alignment is still time-consuming. In this work, we explore a parallel implementation of a word alignment algorithm on the Graphics Processing Unit (GPU), which is now widely available in high performance computing. We use the Compute Unified Device Architecture (CUDA) programming model to re-implement a state-of-the-art word alignment algorithm, the IBM Expectation-Maximization (EM) algorithm. A Tesla K40M card with 2880 cores is used for the experiments, and the execution times of the proposed algorithm are compared with a sequential algorithm and a multi-threaded algorithm on an IBM X3850 server, which has two Intel Xeon E7 CPUs (2.0 GHz, 10 cores each). The best experimental results show a 16.8-fold speedup compared to the multi-threaded algorithm and an even larger speedup compared to the sequential algorithm.

Keywords: Word Alignment; GPU; Parallel Computation; Expectation-Maximization Algorithm; CUDA

I. INTRODUCTION

Word alignment is a basic task in natural language processing. Given a training corpus that contains a set of bi-texts (i.e., aligned parallel sentence pairs), the aim of word alignment is to identify translation relationships among the words in each bi-text, thereby obtaining a bipartite graph between the two sides of the bi-text. For example, given an English-Chinese sentence pair whose English side is "I am a teacher from Leshan normal university", word alignment should map each English word to its Chinese counterpart, with words that have no counterpart (such as "from" in this example) aligned to NULL. Word alignment usually serves as the starting point when building a modern statistical machine translation system. Moreover, for a phrase-based statistical machine translation system, it is also the basis of many subsequent tasks, such as phrase extraction, reordering, etc. Beyond statistical machine translation, word alignment can be used in translation lexicon induction, word sense discovery, word sense disambiguation, and so on.

So far, most word alignment algorithms are serial. In 1993, P. F. Brown et al. [1] first proposed an EM algorithm for training lexical statistical machine translation models, which are well known as IBM Models 1-5. Word alignment is a by-product of the IBM models; we refer to this method as the IBM EM algorithm in the remainder of this paper. S. Vogel et al. [2] proposed an HMM-based method to obtain word alignments. These two algorithms are still widely used today. F. J. Och et al. [3] developed a famous word alignment tool named GIZA++, the most popular word alignment tool of the last decade, which implements both the IBM models and the HMM model. Later, Q. Gao et al. [4] developed another tool named MGIZA++, a multi-threaded version of GIZA++.
Besides these works, researchers have also put forward other methods, such as Bayesian alignment [5] and deep learning [6]. Unfortunately, most of these algorithms are serial, and thus time-consuming on a relatively large corpus. Even the performance of MGIZA++ is not satisfactory. To address this problem, this paper explores a parallel implementation of a word alignment algorithm on the GPU. We use the CUDA programming model to re-implement the IBM EM algorithm. The proposed algorithm is evaluated on an Nvidia Tesla K40M card with 2880 cores and compared with a serial EM algorithm and a multi-threaded EM algorithm running on a powerful server. The experimental results show the efficiency of the proposed algorithm.

The rest of this paper is organized as follows. Section 2 introduces the word alignment problem as well as the IBM EM algorithm. Section 3 first reviews the essentials of GPU computing, and then presents the parallel implementation of the IBM EM algorithm in CUDA. Section 4 gives the experimental results and their analysis. Finally, Section 5 concludes.

II. PRELIMINARIES

This section briefly introduces the preliminaries of the word alignment problem as well as the IBM EM algorithm. More details can be found in [1] and [7]. Given a parallel bilingual corpus, denoted by (E, F), E represents a set of source language sentences and F represents a set of target language sentences. Let (e, f) denote the s-th sentence pair in the corpus, where e = (e_1, e_2, ..., e_l) contains l source language words and f = (f_1, f_2, ..., f_m) contains m target language words. The aim of the word alignment task is to find a set of binary relations <e_i, f_j>, where e_i and f_j are translations of each other in the sentence pair. Generally, we model the problem as an alignment function a: j -> i, which represents that a source language

word e_i is aligned to a target language word f_j. The aim is to learn this function.

In IBM Model 1, word alignment is obtained by training a lexical translation model. P. F. Brown uses a generative modeling method and an EM algorithm to estimate the lexical translation model. The basic idea of the IBM EM algorithm is as follows:

(1) Assume a given parallel bilingual corpus contains VE source language words and VF target language words, where VE (VF) is the vocabulary of the source (target) language sentences. Moreover, we assume that the translation probability (denoted t(f|e)) of every word pair <e, f> is uniform in the initial phase.

(2) Based on the current translation model, we calculate the alignment probability of arbitrary word pairs in a sentence pair, denoted P(a|e, f), by formula (1). The meaning of this step is that we apply a known translation model to the data and obtain the unknown alignment probabilities. This is the E step of the EM algorithm.

$$P(a \mid e, f) = \prod_{j=1}^{m} \frac{t(f_j \mid e_{a(j)})}{\sum_{i=0}^{l} t(f_j \mid e_i)} \qquad (1)$$

(3) With the new word alignment probabilities, we re-estimate the translation probabilities of all source language - target language word pairs in the corpus by formula (2). This is the M step of the EM algorithm.

$$t(f \mid e) = \frac{\sum_{(e,f)} c(f \mid e;\, e, f)}{\sum_{f'} \sum_{(e,f)} c(f' \mid e;\, e, f)} \qquad (2)$$

Here, c(f|e; e, f) is a function that counts how many times a specific source language word e is translated to a target language word f in a sentence pair (see formula (3)). δ(x, y) is the Kronecker function: δ(x, y) = 1 if x = y, otherwise δ(x, y) = 0.

$$c(f \mid e;\, e, f) = \frac{t(f \mid e)}{\sum_{i=0}^{l} t(f \mid e_i)} \sum_{j=1}^{m} \delta(f, f_j) \sum_{i=1}^{l} \delta(e, e_i) \qquad (3)$$

(4) Steps (2) and (3) are repeated for several iterations until the algorithm converges to a stable point, yielding the final alignment model and lexical translation model.
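As a concrete reference, the following is a minimal sequential sketch of one EM iteration following formulas (1)-(3); the container types, the function name, and the uniform default probability are illustrative assumptions, not the implementation evaluated later in this paper.

    #include <map>
    #include <string>
    #include <vector>

    using Sentence = std::vector<std::string>;
    using SentPair = std::pair<Sentence, Sentence>;  // <source e, target f>
    using TModel   = std::map<std::pair<std::string, std::string>, double>;

    // One EM iteration of IBM Model 1 over the whole corpus (sequential sketch).
    void em_iteration(const std::vector<SentPair>& corpus, TModel& t,
                      double default_prob) {
        TModel counts;                        // c(f|e) accumulated over the corpus
        std::map<std::string, double> total;  // per-source-word normalization totals
        for (const auto& [e, f] : corpus) {
            for (const auto& fw : f) {
                // E step: the denominator of formula (1) for this target word
                double denom = 0.0;
                for (const auto& ew : e) {
                    auto it = t.find({ew, fw});
                    denom += (it != t.end()) ? it->second : default_prob;
                }
                // Collect fractional counts, formula (3)
                for (const auto& ew : e) {
                    auto it = t.find({ew, fw});
                    double p = (it != t.end()) ? it->second : default_prob;
                    counts[{ew, fw}] += p / denom;
                    total[ew]        += p / denom;
                }
            }
        }
        // M step: re-estimate t(f|e), formula (2)
        for (const auto& [ef, c] : counts)
            t[ef] = c / total[ef.first];
    }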
III. PARALLEL IMPLEMENTATION OF THE IBM EM ALGORITHM

A. Parallel Computing on GPU

Before introducing the parallel design of the IBM EM algorithm, we briefly recall some basics of Nvidia's GPU architecture and the CUDA parallel programming model. Figure 1 shows the architecture of a typical Nvidia GPU [8]. Current GPUs have hundreds (or even thousands) of processing cores, called streaming processors (SPs). The SPs are divided into groups, and the SPs in each group constitute a streaming multiprocessor (SM). A number of SMs form an independent processing unit called a texture/processor cluster (TPC). For example, a GeForce 8800 GPU has 128 SPs organized as 16 SMs in 8 TPCs. Although a modern GPU supports simultaneous execution of millions of operations, it is not easy to achieve high performance without understanding the underlying hardware architecture and the programming model [9].

CUDA is the most popular GPU programming model, proposed by Nvidia. GPUs were initially designed to accelerate image processing and suffered from some restrictions when applied to general purpose computing. CUDA provides a good solution to this problem: it allows software developers to use a CUDA-enabled GPU for general purpose computing. So far, CUDA has been applied to many tasks in high performance scientific computing, such as medical image processing [10], machine learning [11], and natural language processing [12]. A typical CUDA program can be divided into two parts: a sequential host program and one or more parallel device programs (called kernels). The former is executed on the CPU and the latter are executed in parallel on the GPU, according to user-defined launch parameters, including the number of threads and the number of thread blocks. Therefore, the design and optimization of kernels is the key to high performance programming on the GPU. Next, we briefly introduce some key points of CUDA programming.

A kernel executes a scalar sequential program across a set of parallel threads. In CUDA, the thread is the basic unit of program execution, and all threads are organized hierarchically: a number of threads constitute a thread block, and a number of thread blocks form a grid. For example, a thread block can contain up to 512 threads on the G80/GT200 series GPUs. A kernel thus consists of a grid of one or more thread blocks. The threads in a thread block run concurrently and can cooperate among themselves through barrier synchronization and a per-block shared memory space private to that block. This is a key feature of CUDA. CUDA provides a set of APIs for thread synchronization, such as __syncthreads() and cudaDeviceSynchronize() [9]. The former implements barrier synchronization among threads in the same thread block, and the latter provides global barrier synchronization. This gives the programmer a very flexible synchronization mechanism: when an algorithm requires global synchronization, the program can be divided into several independent kernels, with cudaDeviceSynchronize() invoked between each pair of kernels, as the sketch below illustrates.

Single Instruction Multiple Threads (SIMT) is another important feature of CUDA [9][14][15]. SIMT means that threads are executed in bundles (called warps); the threads in a warp share a single multithreaded instruction unit. This design achieves substantial efficiency when executing data-parallel programs. To achieve the best efficiency, kernels should avoid execution divergence, since divergent execution paths within a warp incur a performance penalty; divergence across different warps, however, introduces no performance loss. In modern GPUs, a warp consists of 32 threads; in other words, the number of threads in a thread block should be a multiple of 32, so that no resources are wasted.
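As a concrete illustration (a minimal sketch, not code from this paper), the fragment below shows the thread indexing and grid-stride loop conventions, a block size chosen as a multiple of 32, and cudaDeviceSynchronize() used as a global barrier between two kernels:

    #include <cuda_runtime.h>

    // Each thread starts at its global index and strides by the total grid size.
    __global__ void kernelA(float* data, int n) {
        for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n;
             i += gridDim.x * blockDim.x)
            data[i] *= 2.0f;
    }

    __global__ void kernelB(float* data, int n) {
        for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n;
             i += gridDim.x * blockDim.x)
            data[i] += 1.0f;
    }

    void run(float* d_data, int n) {
        kernelA<<<128, 512>>>(d_data, n);  // 512 threads per block: a multiple of 32
        cudaDeviceSynchronize();           // global barrier between the two kernels
        kernelB<<<128, 512>>>(d_data, n);
        cudaDeviceSynchronize();
    }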

Finally, it should be noted that CUDA relies on fast multi-thread switching to hide the latency of transactions with external memory [14][15]. Unlike CPUs, current GPUs have little cache, so enough threads must be launched to keep the machine fully utilized. For current GPUs, a minimum of around 5,000 threads must be live simultaneously to efficiently utilize the entire device.

Figure 1: Architecture of a typical Nvidia GPU

B. A Specific Hash Table

According to the description in Section II, the lexical translation model is the core of the algorithm and persists through its whole life cycle. Thus, one of the most important problems is how to represent and store the lexical translation model on the GPU. We require a data structure that supports efficient concurrent access. However, current data structures, such as those used in GIZA++ or MGIZA++, do not meet this requirement. We therefore propose a word-alignment-oriented hash table for the CUDA environment (see Figure 2), whose design follows [9]. In the hash table, the Pool variable stores the entries of the translation model. Each entry has four members, Key, Prob, Count and next, which respectively store a word pair <e_i, f_j>, the translation probability of the word pair, the number of times the word pair appears in a sentence, and a pointer to the next entry with the same hash value. Another variable, Buckets, stores pointers to entries. Word pairs that share the same hash value are organized into a linked list, and the pointer in Buckets points to the most recently inserted one (i.e., the head of the linked list). For example, when a new word pair is to be inserted into the hash table, we create a new entry whose Key is set to the word pair and append it to the end of the Pool. Meanwhile, we compute its hash value, set the next member of the new entry to the location of the current head of the linked list for that hash value, and finally let the pointer in Buckets point to the new entry. The proposed data structure has several advantages: (1) it is easy to implement and to copy in both directions between host and device; (2) since CUDA provides a set of atomic operation APIs, it is easy to access the hash table concurrently on the GPU; (3) memory utilization is high.

Figure 2: A specific hash table
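A sketch of this data structure in CUDA C is given below. The concrete field types (integer word IDs, float probabilities) and the atomicCAS-based device-side insertion loop are assumptions made for illustration; the paper does not specify where or how insertion is performed.

    struct Key   { int e, f; };    // a word pair <e_i, f_j>, stored as word IDs
    struct Entry {
        Key   key;     // the word pair
        float prob;    // translation probability t(f|e)
        float count;   // fractional count accumulated during the current iteration
        int   next;    // pool index of the next entry with the same hash, -1 if none
    };

    struct HashTable {
        Entry* pool;       // entry storage (lives in GPU global memory)
        int*   buckets;    // head index of each collision chain, -1 if empty
        int*   poolSize;   // number of entries currently used in the pool
        int    numBuckets; // bucket count, chosen as a prime
    };

    // Insert a word pair: append an entry to the pool, then push it onto the
    // head of its collision chain with an atomicCAS retry loop.
    __device__ int insert_word_pair(HashTable ht, Key k, unsigned hashValue) {
        int idx = atomicAdd(ht.poolSize, 1);
        ht.pool[idx].key   = k;
        ht.pool[idx].prob  = 0.0f;
        ht.pool[idx].count = 0.0f;
        int* head = &ht.buckets[hashValue % ht.numBuckets];
        int  old  = *head;
        while (true) {
            ht.pool[idx].next = old;                // link to the current head
            int prev = atomicCAS(head, old, idx);   // try to become the new head
            if (prev == old) break;                 // success
            old = prev;                             // another thread won; retry
        }
        return idx;
    }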
C. Parallel Design and Pseudo Code

Data parallelism is the most direct way to parallelize the IBM EM algorithm: each thread processes one sentence pair. This method is simple and intuitive; all threads are independent and need not communicate with each other, so shared memory is unnecessary in this algorithm. Although a more suitable solution may exist, the proposed one shows good performance in the experiments.

In this work, the hash table that represents and stores the lexical translation model must stay in the GPU's global memory through the whole life cycle of the algorithm, for two reasons. (1) For a given corpus, we have no idea which word pairs will appear in a sentence pair; in other words, any word pair may appear. Therefore, the whole translation model must stay in memory all the time. (2) A translation model is usually very large (possibly hundreds of megabytes), so no type of memory except global memory can store it.

In this paper, the IBM EM algorithm is divided into two kernels. Kernel 1 is responsible for the expectation calculation; Kernel 2 is responsible for the regularization of the results obtained from Kernel 1. The pseudo code of the algorithm is shown below.

Kernel 1: Expectation calculation
Input:
    tmodel  : the translation model represented by the hash table
    corpus  : a set of parallel sentence pairs
    ftotal  : an array for word-pair counting
    sentnum : the number of sentence pairs
Output: the updated tmodel and ftotal

    tid    = threadIdx.x + blockIdx.x * blockDim.x
    offset = gridDim.x * blockDim.x
    for ( i = tid; i < sentnum; i += offset )
        <e, f> = get_sentence_pair_from_corpus( corpus, i )
        stotal[] = 0    // stotal is an array for counting within a sentence
        for ( j = 0; j < get_sentence_length( e ); j++ )
            for ( k = 0; k < get_sentence_length( f ); k++ )
                location = query_word_pair_from_model( tmodel, e[j], f[k] )
                if ( tmodel.pool[location].prob > DEFAULT_PROB )
                    stotal[j] += tmodel.pool[location].prob
                else
                    stotal[j] += DEFAULT_PROB
                end if
            end for
        end for
        for ( j = 0; j < get_sentence_length( e ); j++ )
            for ( k = 0; k < get_sentence_length( f ); k++ )
                location = query_word_pair_from_model( tmodel, e[j], f[k] )
                if ( tmodel.pool[location].prob > DEFAULT_PROB )
                    val = tmodel.pool[location].prob / stotal[j]
                else
                    val = DEFAULT_PROB / stotal[j]
                end if
                atomicAdd( &tmodel.pool[location].count, val )
                atomicAdd( &ftotal[ f[k] ], val )
            end for
        end for
    end for

Kernel 2: Regularization
Input: tmodel, ftotal
Output: the regularized translation model

    tid    = threadIdx.x + blockIdx.x * blockDim.x
    offset = gridDim.x * blockDim.x
    for ( i = tid; i < tmodel.length; i += offset )
        tmodel.pool[i].prob  = tmodel.pool[i].count / ftotal[ tmodel.pool[i].key.f ]
        tmodel.pool[i].count = 0
    end for
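To show how the pseudo code maps onto real CUDA code, the following is a compilable rendering of Kernel 2 under the HashTable sketch from Section III.B; the function and parameter names are assumptions, not the paper's source code.

    // Kernel 2: each thread normalizes one entry per grid-stride step.
    __global__ void regularize(HashTable tmodel, float* ftotal, int length) {
        int tid    = threadIdx.x + blockIdx.x * blockDim.x;
        int offset = gridDim.x * blockDim.x;
        for (int i = tid; i < length; i += offset) {
            // Normalize the accumulated count by the per-word total,
            // then reset the count for the next EM iteration.
            tmodel.pool[i].prob  = tmodel.pool[i].count / ftotal[tmodel.pool[i].key.f];
            tmodel.pool[i].count = 0.0f;
        }
    }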

In Kernel 1, each thread is responsible for one sentence pair at a time. For example, in the first pass of the loop, the threads in the first warp deal with sentence pairs 1-32, the threads in the second warp deal with sentence pairs 33-64, and so on. For each word pair <e[j], f[k]>, the algorithm first computes its hash value and queries its location in the hash table; it then counts how many times the word pair appears in the sentence pair. Both stotal and ftotal record these counts: the former counts within a sentence pair, while the latter counts over the whole corpus. Finally, we use atomicAdd(), an API provided by CUDA, to accumulate the counts atomically.

Different from Kernel 1, each thread of Kernel 2 deals with one word pair at a time. Because the number of word pairs is much larger than the number of sentence pairs, we can launch many more threads for the regularization of the results, which significantly increases the efficiency of this step.

Kernels 1 and 2 are repeated several times until the lexical translation probabilities converge. The CUDA function cudaDeviceSynchronize() is used for global synchronization between the two kernels.

D. Adaptability

Two adaptive strategies are proposed to increase the efficiency of the algorithm. The first concerns the launch parameters of the kernels. Unlike a CPU program, a GPU program needs to launch enough threads to keep the GPU fully utilized; that is, the more concurrent threads, the higher the efficiency of the algorithm. Because each thread in Kernel 1 deals with one sentence pair, the number of threads need not exceed the number of sentence pairs. We first fix the number of threads per thread block to 512 or 1024, then divide the number of sentence pairs by this number to obtain the number of thread blocks. If the obtained block count exceeds the maximum grid size of the GPU, we clamp it to that maximum. This strategy is also applied to Kernel 2; since Kernel 2 works on word pairs, it launches many more threads than Kernel 1, and the strategy guarantees that it launches as many threads as possible. A sketch of this computation follows.
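The sketch below computes the launch parameters as described; querying cudaGetDeviceProperties is one way to obtain the grid-size limit, though the paper does not specify how the limit is obtained.

    #include <cuda_runtime.h>
    #include <algorithm>

    // Derive the block count from the number of work items and clamp it
    // to the device maximum.
    int pick_num_blocks(int numItems, int threadsPerBlock) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        int maxBlocks = prop.maxGridSize[0];  // maximum grid size in the x dimension
        int blocks = (numItems + threadsPerBlock - 1) / threadsPerBlock;
        return std::min(blocks, maxBlocks);
    }

    // Kernel 1 is sized by sentence pairs, Kernel 2 by word pairs, e.g.:
    //   expectation<<<pick_num_blocks(sentnum, 512), 512>>>(/* ... */);
    //   regularize <<<pick_num_blocks(tmodelLength, 512), 512>>>(/* ... */);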
The second adaptive strategy concerns the proposed hash table. The querying efficiency of the hash table mainly depends on the length of the linked lists: it decreases if the bucket count (which equals the prime used in the hash function) is too small, while too large a bucket count wastes resources. We therefore adopt an adaptive strategy in which the bucket count (i.e., the selected prime) changes with the size of the corpus.

IV. EXPERIMENTS

A. Experiment Design

This section evaluates the efficiency of the proposed algorithm. We compare it with two other state-of-the-art implementations of the IBM EM algorithm: the sequential algorithm in GIZA++ and the multi-threaded algorithm in MGIZA++. The proposed algorithm is tested on a Dell R720 server with an Nvidia Tesla K40M card (2880 cores, 12 GB RAM). The other two programs run on a powerful server, an IBM X3850 equipped with 2 Intel Xeon E7 CPUs (2.0 GHz, 10 cores each) and 128 GB RAM. We focus mainly on the speedup ratio of the proposed algorithm.

The data set used in the experiments is an English-Chinese parallel corpus from [13]. We cut the original corpus into 10 data sets of different sizes. A detailed description of the data sets is given in Table I, which lists, in addition to the size of each data set (i.e., the number of sentence pairs), the number of English words, the number of Chinese words, and the number of word pairs. Note that the set of word pairs is a subset of the Cartesian product of the English words and the Chinese words.

TABLE I. DESCRIPTION OF DATA SETS (columns: Data Sets; Sentence Pairs; English Words; Chinese Words; Word Pairs)

B. Results and Analysis

The experimental results are shown in Table II. The three algorithms are named sequential EM, multi-threaded EM, and CUDA EM, respectively. Each algorithm runs for five iterations, and the execution time of each algorithm on each data set is recorded. The two steps of the IBM EM algorithm (expectation calculation and regularization) are timed separately and labeled 'S1' and 'S2' in the table; the reported times are averages over the five iterations. We also record the cost of data copying between CPU and GPU, listed in the 'Copy' column. The 'Total' columns list the total execution time of each algorithm.

TABLE II. EXECUTION TIME (UNIT: MILLISECONDS) (columns: Data Sets; Sequential EM (x10^3): S1, S2, Total; Multi-Threaded EM (x10^3): S1, S2, Total; CUDA EM: S1, S2, Copy, Total)

From Table II, we can easily see that the performance of the proposed algorithm is far superior to the other two algorithms. To make the results clearer, we further compute the speedup ratio for three aspects: expectation calculation, regularization, and total execution time. The results are shown in Table III. Next, we analyze the results from these three aspects.

TABLE III. SPEEDUP RATIO (columns: Data Sets; CUDA EM vs. Sequential EM: S1, S2, Total; CUDA EM vs. Multi-Threaded EM: S1, S2, Total)

(1) Speedup ratio of the expectation calculation. Compared with the sequential EM, the proposed algorithm achieves a stable, substantial speedup when the number of sentence pairs is higher than 30,000. When the number of sentence pairs is lower than 30,000, the experimental result is easily affected by factors such as the overhead of thread switching. In addition, the proposed algorithm achieves a clear speedup compared with the multi-threaded EM.

(2) Speedup ratio of the regularization. For this step, the acceleration effect of the proposed algorithm is significant. Compared with the sequential EM, the proposed algorithm obtains very large speedups when the number of sentence pairs is sufficiently large; compared with the multi-threaded EM, it reaches a speedup of about 800 times. These results demonstrate the efficiency of the adaptive strategy in our algorithm. As explained in Section III.D, the launch parameters of Kernel 2 (i.e., the number of thread blocks) are calculated from the number of word pairs. Since this number is far larger than the number of sentence pairs used for Kernel 1, Kernel 2 launches far more threads than Kernel 1 and therefore obtains a much higher speedup ratio.

(3) Speedup ratio of the total execution time. From Table III, the best experimental results show a 16.8-fold speedup compared to the multi-threaded algorithm and an even larger speedup compared to the sequential algorithm. These results demonstrate the efficiency of the proposed algorithm.

Next, we analyze the sizeup performance of the proposed algorithm. Sizeup analysis grows the data set by a factor p and measures how much longer the algorithm takes when the data set is p times larger than the original. The formula is shown below, where T_Di is the execution time for a data set of size i*D and T_D1 is the execution time for the base data set of size D:

$$\mathrm{Sizeup}(i) = \frac{T_{Di}}{T_{D1}} \qquad (4)$$

Four indexes of the proposed algorithm (i.e., S1, S2, Copy, and Total) are calculated.

The results are shown in Figure 3. As can be seen from the figure, the sizeup of S1 and S2 varies linearly with the scale of the data, and the sizeup of the data copy varies approximately linearly with the data size. The reason for the latter is that the copying time depends not only on the device but also on the host. Overall, the results show that the proposed algorithm has good sizeup performance.

Figure 3: Sizeup

Finally, we tested the effect of the other adaptive strategy on the algorithm's performance. With this strategy, the proposed algorithm dynamically selects a hash prime according to the size of the data set. We performed this experiment on the largest data set, running the algorithm once with an 8-digit prime and once with a 7-digit prime. The execution time with the 7-digit prime is about three times as long as with the 8-digit prime. The reason is obvious: a smaller prime leads to longer linked lists, which increases the querying time.

V. CONCLUSIONS AND FUTURE WORK

This paper proposes a parallel implementation of the IBM EM algorithm using the CUDA programming model, and all details of the algorithm have been explained. The proposed algorithm is tested on a modern GPU card (an Nvidia Tesla K40M), and the results are compared with two other state-of-the-art implementations of the IBM EM algorithm running on a powerful server. The best experimental results show a 16.8-fold speedup compared to the multi-threaded algorithm and an even larger speedup compared to the sequential algorithm, which demonstrates the efficiency of the proposed algorithm. In the future, we will continue to study two problems: (1) whether there is a better parallel formulation of the IBM EM algorithm; and (2) since GPU memory is limited, how to handle a huge corpus that cannot be directly processed by our algorithm; we plan to explore a CPU-GPU or GPU-cluster solution.

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China and the Scientific Research Fund of Leshan Normal University (Grant Nos. Z1325, Z1411, Z1504, S1511).

REFERENCES

[1] P. F. Brown, S. A. D. Pietra, V. J. D. Pietra, et al., "The mathematics of statistical machine translation: Parameter estimation," Computational Linguistics, Vol. 19, No. 2, 1993.
[2] S. Vogel, H. Ney, and C. Tillmann, "HMM-based word alignment in statistical translation," Proc. of Coling '96, 1996.
[3] F. J. Och and H. Ney, "Improved statistical alignment models," Proc. of ACL '00, 2000.
[4] Q. Gao and S. Vogel, "Parallel implementations of word alignment tool," Proc. of SETQA-NLP '08, 2008.
[5] C. Mermer, M. Saraçlar, and R. Sarikaya, "Improving statistical machine translation using Bayesian word alignment and Gibbs sampling," IEEE Trans. on Audio, Speech and Language Processing, Vol. 21, No. 5, 2013.
[6] T. Songyot and D. Chiang, "Improving word alignment using word similarity," Proc. of EMNLP '14, 2014.
[7] P. Koehn, Statistical Machine Translation, Cambridge University Press.
[8] E. Lindholm, J. Nickolls, and S. Oberman, "Nvidia Tesla: A unified graphics and computing architecture," IEEE Micro, Vol. 28, No. 2, 2008.
[9] D. B. Kirk and W. M. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
[10] H. D. Tagare, A. Barthel, and F. J. Sigworth, "An adaptive expectation-maximization algorithm with GPU implementation for electron cryo-microscopy," Journal of Structural Biology, Vol. 171, No. 3, 2010.
[11] S. Chetlur, C. Woolley, P. Vandermersch, et al., "cuDNN: Efficient primitives for deep learning," arXiv preprint, 2014.
[12] Y. M. Yi, C. Y. Lai, S. Petrov, et al., "Efficient parallel CKY parsing on GPUs," Journal of Logic & Computation, Vol. 24, 2011.
[13] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, et al., "Optimization principles and application performance evaluation of a multithreaded GPU using CUDA," Proc. of PPoPP '08, 2008.
[14] N. Satish, M. Harris, and M. Garland, "Designing efficient sorting algorithms for manycore GPUs," Proc. of IPDPS '09, 2009.
[15] T. Xiao, J. B. Zhu, H. Zhang, et al., "NiuTrans: An open source toolkit for phrase-based and syntax-based machine translation," Proc. of ACL '12, 2012.
