Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, with GPU implementations


Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, with GPU implementations

Akihiko Kasagi, Koji Nakano, and Yasuaki Ito
Department of Information Engineering, Hiroshima University
Kagamiyama 1-4-1, Higashi Hiroshima, Japan

Abstract—The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of computing on CUDA-enabled GPUs. The summed area table (SAT) of a matrix is a data structure frequently used in the area of computer vision which can be obtained by computing the column-wise prefix-sums and then the row-wise prefix-sums. The main contribution of this paper is to introduce the asynchronous Hierarchical Memory Machine (asynchronous HMM), which supports asynchronous execution of CUDA blocks, and to show a global-memory-access-optimal parallel algorithm for computing the SAT on the asynchronous HMM. A straightforward algorithm (2R2W SAT algorithm) on the asynchronous HMM, which computes the prefix-sums in every column using one thread each and then computes the prefix-sums in every row, performs 2 read operations and 2 write operations per element of a matrix. The previously published best algorithm (2R1W SAT algorithm) performs 2 read operations and 1 write operation per element. We present a more efficient algorithm (1R1W SAT algorithm) which performs 1 read operation and 1 write operation per element. Clearly, since every element in a matrix must be read at least once, and all resulting values must be written, our 1R1W SAT algorithm is optimal in terms of the global memory access. We also show a combined algorithm ((1 + r)R1W SAT algorithm) of the 2R1W and 1R1W SAT algorithms that may have better performance. We have implemented several algorithms including the 2R2W, 2R1W, 1R1W, and (1 + r)R1W SAT algorithms on a GeForce GTX 780 Ti. The experimental results show that our (1 + r)R1W SAT algorithm runs faster than any other SAT algorithm for large input matrices. Also, it runs more than 10 times faster than the best SAT algorithm using a single CPU.
Keywords—memory machine models, prefix-sums, summed area table, image processing, GPU, CUDA

I. INTRODUCTION

The GPU (Graphics Processing Unit) is a specialized circuit designed to accelerate computation for building and manipulating images [1], [2], [3]. The latest GPUs are designed for general purpose computing and can perform computation in applications traditionally handled by the CPU. Hence, GPUs have recently attracted the attention of many application developers [1]. NVIDIA provides a parallel computing architecture called CUDA (Compute Unified Device Architecture) [4], the computing engine for NVIDIA GPUs. CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in NVIDIA GPUs. CUDA-enabled GPUs have streaming multiprocessors (SMs), each of which executes multiple threads in parallel. CUDA uses two types of memories in NVIDIA GPUs: the shared memory and the global memory [4]. The efficient usage of the shared memory and the global memory is a key for CUDA developers to accelerate applications using GPUs. In particular, we need to consider bank conflicts of shared memory access and coalescing of global memory access [5], [6], [7]. The address space of the shared memory is mapped onto several physical memory banks. If two or more threads access the same memory bank at the same time, the access requests are processed in turn. Hence, to maximize memory access performance, CUDA threads should access distinct memory banks to avoid bank conflicts. To maximize the bandwidth of the global memory, CUDA threads should perform coalesced access. In our previous paper [8], we introduced two models, the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM), which reflect the essential features of the shared memory and the global memory of NVIDIA GPUs. The outline of the architectures of the DMM and the UMM is illustrated in Figure 1.
In both architectures, a sea of threads (Ts) is connected to the memory banks (MBs) through the memory management unit (MMU). Each thread is a Random Access Machine (RAM) [9], which can execute fundamental operations in a time unit. The MBs constitute a single address space of the memory. The address space is mapped onto the MBs in an interleaved way such that the word of data at address i is stored in the (i mod w)-th bank, where w is the number of MBs. The main difference between the two architectures is the connection of the address line between the MMU and the MBs, which can transfer an address value. In the DMM, the address lines connect the MBs and the MMU separately, while in the UMM a single address line from the MMU is connected to all the MBs. Hence, in the UMM, the same address value is broadcast to every MB, and only the same address of the MBs can be accessed in each time unit. On the other hand, different addresses of the MBs can be accessed in the DMM.

Figure 1. The architectures of the DMM and the UMM with w = 4

Quite recently, we introduced the Hierarchical Memory Machine (HMM) [10], which is a hybrid of the DMM and the UMM. The HMM is a more practical parallel computing model that reflects the hierarchical architecture of CUDA-enabled GPUs. Figure 2 illustrates the architecture of the HMM. The HMM consists of d DMMs and a single UMM. Each DMM has w memory banks and the UMM also has w memory banks. We call the memory banks of each DMM the shared memory and those of the UMM the global memory, after CUDA-enabled GPUs. Each DMM can work independently and can perform computation using its shared memory. Also, all threads of the DMMs work as a single UMM and can access the global memory. While the memory access latency of the shared memory of GPUs is very low, that of the global memory is several hundred clock cycles [4]. Hence, we assume that the latency of the shared memory is 1, and we use a parameter l to denote the latency of the global memory in the HMM.

Figure 2. The architecture of the HMM with d = 2 DMMs and width w = 4

Suppose that a matrix a of size √n × √n is given. The summed area table (SAT) [11] is a matrix b of the same size such that

b[i][j] = Σ_{0 ≤ i' ≤ i, 0 ≤ j' ≤ j} a[i'][j'].

The reader should have no difficulty confirming that the SAT can be obtained by computing the column-wise prefix-sums and then the row-wise prefix-sums, as illustrated in Figure 3. Once we have the summed area table, the sum of any rectangular area of a can be computed by evaluating

Σ_{u < i ≤ d, l < j ≤ r} a[i][j] = b[d][r] − b[u][r] − b[d][l] + b[u][l].

Since the sum of any rectangular area can be computed in O(1) time, the summed area table has a lot of applications in the area of image processing and computer vision [12].
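As a concrete illustration of the two formulas above, the following sketch (NumPy-based; the function names are ours, not from the paper) builds the SAT by column-wise and then row-wise prefix-sums, and answers a rectangular-sum query with four table lookups:

```python
import numpy as np

def summed_area_table(a):
    """SAT b with b[i][j] = sum of a[i'][j'] over i' <= i, j' <= j,
    obtained by column-wise prefix-sums followed by row-wise prefix-sums."""
    return np.cumsum(np.cumsum(a, axis=0), axis=1)

def region_sum(b, u, d, l, r):
    """Sum of a[i][j] over u < i <= d and l < j <= r via four SAT lookups.
    Pass u = -1 or l = -1 to include row 0 or column 0."""
    total = b[d][r]
    if u >= 0:
        total -= b[u][r]
    if l >= 0:
        total -= b[d][l]
    if u >= 0 and l >= 0:
        total += b[u][l]
    return total
```

The O(1) query is the reason the SAT is worth precomputing: after one pass over the matrix, every rectangle costs four reads.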
In our previous paper [13], we presented a parallel algorithm that computes the SAT in O(n/w + nl/p + l log n) time units using p threads on the UMM with width w and latency l. This algorithm is optimal in the sense that any SAT algorithm takes at least Ω(n/w + nl/p + l log n) time units. However, this algorithm repeats pairwise addition, has a large constant factor in the computing time, and is not practically efficient. The main contribution of this paper is to introduce the asynchronous Hierarchical Memory Machine (asynchronous HMM), which supports asynchronous execution of CUDA blocks, and to show a global-memory-access-optimal parallel algorithm for computing the summed area table stored in the global memory of the asynchronous HMM. A straightforward algorithm (2R2W SAT algorithm) on the asynchronous HMM, which computes the column-wise prefix-sums and then the row-wise prefix-sums, performs 2 read operations and 2 write operations per element of a matrix. The best known algorithm (2R1W SAT algorithm) so far performs 2 read operations and 1 write operation per element [14]. We present a more efficient algorithm (1R1W SAT algorithm) on the asynchronous HMM, which performs only 1 read operation and 1 write operation per element. Clearly, since every element in a matrix must be read at least once and all resulting values must be written, our 1R1W SAT algorithm is optimal in terms of the global memory access. We also show a combined algorithm ((1 + r)R1W SAT algorithm) of the 2R1W and 1R1W SAT algorithms that can run faster than any other algorithm for large matrices. Table I shows the total number of memory access operations to the global memory and the shared memory, the number of barrier synchronization steps, and the global memory access cost of each algorithm. The global memory access cost, which is computed from the number of global memory access operations and the number of barrier synchronization steps, approximates the computing time on the HMM. For simplicity, in the table, we omit small terms to focus on dominant terms.
For example, the 2R2W SAT algorithm performs n − √n coalesced write operations, but we simply write n in the corresponding entry.

Figure 3. The summed area table (SAT) of a 9 × 9 matrix and the 2R2W SAT algorithm

Table I
THE PERFORMANCE OF SAT ALGORITHMS ON THE HMM

SAT algorithm | coalesced R/W | stride R/W | shared R/W | barrier sync. steps | global memory access cost
2R2W          | n / n         | n / n      | -          | 1                   | 2n + 2n/w + 2l
4R4W          | 4n / 4n       | -          | n / n      | 3                   | 8n/w + 4l
4R1W          | -             | 4n / n     | -          | 2√n                 | 5n + 2√n·l
2R1W          | 2n / n        | -          | 4n / 4n    | 2k + 2              | 3n/w + (2k + 3)l
1R1W          | n / n         | -          | 2n / n     | 2√(n/w) − 2         | 2n/w + 2√(n/w)·l
1.25R1W       | 1.25n / n     | -          | 2.5n / n   | √(n/w) + 4k + 4     | 2.25n/w + (√(n/w) + 4k + 5)l
(1 + r)R1W    | (1 + r)n / n  | -          | (2 + r)n / (2 + r)n | 2(1 − √r)√(n/w) + 4k + 4 | (2 + r)n/w + (2(1 − √r)√(n/w) + 4k + 5)l

Here k is the depth of recursion, which takes a value no more than 1 from the practical point of view, and r can take any value in the range [0, 1].

We have also implemented several algorithms including the 2R2W, 2R1W, 1R1W, and (1 + r)R1W SAT algorithms on a GeForce GTX 780 Ti. The experimental results show that our (1 + r)R1W SAT algorithm runs faster than any other SAT algorithm for large input matrices. Further, it runs more than 10 times faster than the best SAT algorithm using a single CPU.

II. THE DMM, THE UMM, THE HMM, AND THE ASYNCHRONOUS HMM

We first define the Discrete Memory Machine (DMM) of width w and latency l. Let B[j] = {j, j + w, j + 2w, j + 3w, ...} (0 ≤ j ≤ w − 1) be the set of addresses of the j-th memory bank of the memory. In other words, address i is in the (i mod w)-th memory bank. We assume that addresses in different banks can be accessed in a time unit, but no two addresses in the same bank can be accessed in a time unit. Also, we assume that l time units are necessary to complete an access request, and continuous requests are processed in a pipeline fashion. Thus, it takes k + l − 1 time units to complete memory access requests to k addresses in a particular bank. We assume that threads are partitioned into groups of w threads called warps.
More specifically, p threads T(0), T(1), ..., T(p − 1) are partitioned into p/w warps W(0), W(1), ..., W(p/w − 1) such that W(i) = {T(i·w), T(i·w + 1), ..., T((i + 1)·w − 1)} (0 ≤ i ≤ p/w − 1). Warps are dispatched for memory access in turn, and the threads in a warp try to access the memory at the same time. In other words, W(0), W(1), ..., W(p/w − 1) are dispatched in a round-robin manner if at least one thread in a warp requests memory access. If no thread in a warp needs memory access, that warp is not dispatched for memory access. When W(i) is dispatched, the threads in W(i) send memory access requests, at most one request per thread, to the memory. We also assume that a thread cannot send a new memory access request until its previous memory access request is completed. Hence, if a thread sends a memory access request, it must wait at least l time units to send a new memory access request.

We next define the Unified Memory Machine (UMM) of width w and latency l as follows. Let A[j] = {j·w, j·w + 1, ..., (j + 1)·w − 1} denote the set of addresses in the j-th address group. We assume that addresses in the same address group are processed at the same time. However, if they are in different groups, one time unit is necessary for each of the groups. Also, similarly to the DMM, threads are partitioned into warps and each warp accesses the memory in turn.

Figure 4 shows examples of memory access on the DMM and the UMM. We assume that each memory access request is completed when it reaches the last pipeline stage. Two warps W(0) and W(1) access ⟨7, 5, 15, 0⟩ and ⟨10, 11, 12, 9⟩, respectively. In the DMM, the memory access requests by W(0) are separated into two pipeline stages, because addresses 7 and 15 are in the same bank B[3]. Those by W(1) occupy one stage, because all requests are in distinct banks. Thus, the memory requests occupy three stages, and it takes 3 + 5 − 1 = 7 time units to complete the memory access. In the UMM, the memory access requests by W(0) are destined for three address groups. Hence the memory requests occupy three stages. Similarly, those by W(1) occupy two stages. Hence, it takes 5 + 5 − 1 = 9 time units to complete the memory access.

Figure 4. Examples of memory access on the DMM (banks B[0]–B[3]) and the UMM (address groups A[0]–A[3]) through l-stage pipeline registers

Next, we define the Hierarchical Memory Machine (HMM) [10]. The HMM consists of d DMMs and a single UMM as illustrated in Figure 2. Each DMM has w memory banks and the UMM also has w memory banks. We call the memory banks of each DMM the shared memory and those of the UMM the global memory. Each DMM works independently. Threads are partitioned into warps of w threads, and each warp is dispatched for memory access to the shared memory in turn. Further, each warp of threads in all DMMs can send memory access requests to the global memory. Figure 2 illustrates the architecture of the HMM with d = 2 DMMs. Each DMM and the UMM has w = 4 memory banks. The shared memory of each DMM and the global memory of the UMM correspond to the shared memory of each streaming multiprocessor and the global memory of GPUs. We also assume that the shared memory in each DMM of the HMM can store up to O(w²) numbers. The capacity of the shared memory of the latest CUDA-enabled GPUs is up to 48KBytes and the number of banks is 32 [4].
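The two examples above can be replayed with a short script (our own illustration, with w = 4 and latency l = 5 as in the figure): it counts the pipeline stages each warp occupies on the DMM (requests to the same bank serialize) and on the UMM (each distinct address group costs one stage), reproducing the 3 + 5 − 1 = 7 and 5 + 5 − 1 = 9 time units quoted in the text:

```python
from collections import Counter

w, l = 4, 5  # width (number of banks / address-group size) and latency

def dmm_stages(warp):
    # a warp occupies as many stages as the worst per-bank collision count
    return max(Counter(addr % w for addr in warp).values())

def umm_stages(warp):
    # a warp occupies one stage per distinct address group floor(addr / w)
    return len({addr // w for addr in warp})

w0, w1 = [7, 5, 15, 0], [10, 11, 12, 9]
dmm_time = dmm_stages(w0) + dmm_stages(w1) + l - 1  # pipelined stages
umm_time = umm_stages(w0) + umm_stages(w1) + l - 1
```

The same addresses thus cost 7 time units on the DMM but 9 on the UMM, which is exactly the asymmetry the two models are meant to capture.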
Since an array of 32² double (64-bit) numbers occupies 8KBytes, each shared memory can store at most 6 such matrices. Thus, it is reasonable to assume that a DMM can store O(w²) numbers in the shared memory. For a more realistic model of GPUs, we introduce the asynchronous Hierarchical Memory Machine (asynchronous HMM). In the asynchronous HMM, the DMMs work asynchronously in the sense that some DMMs may work slightly slower or faster than the others. Instead, all threads in all DMMs can execute a barrier synchronization instruction. If a thread in a DMM executes the barrier synchronization instruction, it must wait until all the other threads in all DMMs execute it. Also, after all threads execute the barrier synchronization instruction, all DMMs are reset, that is, the shared memory of every DMM is initialized and all data stored in it are lost. The reader may think that this reset assumption is not reasonable. However, this assumption is mandatory for program scalability over the DMMs in the HMM. More specifically, suppose that a programmer writes a program for the HMM with d DMMs. It may be possible to execute this program on an HMM with d' DMMs such that d' < d. If this is the case, the program of the first d' DMMs is executed until a barrier synchronization instruction is executed. After that, the d' DMMs are reset and the program of the next d' DMMs is executed. The same procedure is repeated until all threads have executed the barrier synchronization instruction. Hence, it makes sense to assume that all DMMs are reset after each barrier synchronization step. The DMMs are responsible for copying the data stored in the shared memory to the global memory before a barrier synchronization if the data are used after the synchronization. Actually, we need to terminate a CUDA kernel call on a GPU when barrier synchronization of all threads is necessary [4]. When a CUDA kernel call is terminated for barrier synchronization, the data stored in the shared memory by a CUDA block are lost.
This is because CUDA blocks are executed on streaming multiprocessors with small shared memory, one by one in turn.

III. THE GLOBAL MEMORY ACCESS COST ON THE HMM AND THE DIAGONAL ARRANGEMENT ON THE DMM

Let C, S, and B be the total number of coalesced global memory access operations, the total number of stride global memory access operations, and the number of barrier synchronization steps performed on the HMM. The global memory access cost is defined to be C/w + S + (B + 1)(l − 1). We will show that the global memory access cost approximates the computing time on the HMM if the computation performed in each DMM is negligible.
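The cost measure can be written as a one-line helper (our own, not from the paper), which is convenient for sanity-checking the lemmas that follow:

```python
def global_memory_access_cost(C, S, B, w, l):
    """HMM global memory access cost: w coalesced accesses share one pipeline
    stage, each stride access occupies its own stage, and every section
    between barriers (B barriers give B + 1 sections) pays the pipeline
    drain of l - 1 time units."""
    return C / w + S + (B + 1) * (l - 1)
```

For instance, the 2R2W algorithm of Section IV with C = S = 2n and B = 1 gives 2n/w + 2n + 2(l − 1), matching Lemma 2.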

Figure 5. Timing chart of coalesced memory access to the global memory with two barrier synchronization steps

Suppose that an algorithm performs n coalesced memory access operations and two barrier synchronization steps. Clearly, by the two barrier synchronization steps, the memory access is partitioned into three sections as illustrated in Figure 5. Let n₀, n₁, and n₂ such that n = n₀ + n₁ + n₂ be the numbers of memory access operations performed in the three sections. Since the w coalesced memory access operations by a warp of w threads occupy one pipeline stage, the three sections take n₀/w + l − 1, n₁/w + l − 1, and n₂/w + l − 1 time units, respectively. Hence, the algorithm runs in n/w + 3(l − 1) time units. Also, if the n memory access operations are stride, each memory access operation occupies one pipeline stage, and the algorithm runs in n + 3(l − 1) time units. In general, if an algorithm performs C coalesced memory access operations, S stride memory access operations, and B barrier synchronization steps, it runs in C/w + S + (B + 1)(l − 1) time units, which is equal to the global memory access cost.

Suppose that we have a matrix of size w × w in the shared memory of a DMM in the HMM. Since a column of the matrix lies in a single bank, column-wise access by w threads has bank conflicts, while row-wise access is conflict-free. In our previous paper [15], we presented a diagonal arrangement of a matrix such that each (i, j) element is arranged in a[i][(i + j) mod w]. Figure 6 illustrates the diagonal arrangement of a 4 × 4 matrix. We can confirm that both a row-wise access to (1, 0), (1, 1), (1, 2), (1, 3) and a column-wise access to (0, 1), (1, 1), (2, 1), (3, 1) are conflict-free. Thus, we have:

Lemma 1: In the diagonal arrangement of a w × w matrix, both a row-wise access and a column-wise access are conflict-free.

The diagonal arrangement is used to compute the SAT of a matrix in a shared memory and the transpose of a matrix in the global memory of the HMM.

IV. 2R2W AND 4R4W SAT ALGORITHMS

Let s_i denote a local register of thread T(i) (0 ≤ i ≤ √n − 1). As illustrated in Figure 3, the summed area table (SAT) of a √n × √n matrix a can be computed by the column-wise prefix-sums and then the row-wise prefix-sums:

[2R2W SAT algorithm]
for i ← 0 to √n − 1 do in parallel // column-wise prefix-sums
  T(i) performs s_i ← a[0][i]
  for j ← 1 to √n − 1 do
    T(i) performs s_i ← s_i + a[j][i]
    T(i) performs a[j][i] ← s_i
barrier synchronization
for i ← 0 to √n − 1 do in parallel // row-wise prefix-sums
  T(i) performs s_i ← a[i][0]
  for j ← 1 to √n − 1 do
    T(i) performs s_i ← s_i + a[i][j]
    T(i) performs a[i][j] ← s_i

Figure 6. Diagonal arrangement of a 4 × 4 matrix

In the computation of the column-wise prefix-sums, a[0][0], a[0][1], ..., a[0][√n − 1] are read. After that, for each j (1 ≤ j ≤ √n − 1), a[j][0], a[j][1], ..., a[j][√n − 1] are read and written. Clearly, memory access to these elements is coalesced. In the computation of the row-wise prefix-sums, a[0][0], a[1][0], ..., a[√n − 1][0] are read. After that, for each j (1 ≤ j ≤ √n − 1), a[0][j], a[1][j], ..., a[√n − 1][j] are read and written. Memory access to these elements is stride. Hence, the 2R2W SAT algorithm performs 2n − √n coalesced memory access operations and 2n − √n stride memory access operations. Since the 2R2W SAT algorithm has one barrier synchronization step, we have:

Lemma 2: The global memory access cost of the 2R2W SAT algorithm is at most 2n/w + 2n + 2(l − 1).

We can avoid stride memory access when we compute the row-wise prefix-sums by transposing the matrix. More specifically, the row-wise prefix-sums can be obtained by transpose, column-wise prefix-sums, and transpose. It has been shown in [16] that the transpose of a matrix of size √n × √n in the global memory of the HMM can be done with 2n coalesced memory access operations and no barrier synchronization step. The idea of the transpose is to partition the matrix into √(n/w) × √(n/w) blocks of w elements each. We can transpose a block efficiently via a matrix with diagonal arrangement in a shared memory of a DMM.
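Lemma 1's conflict-freedom is easy to check mechanically (a small script of ours): under the diagonal arrangement, element (i, j) sits in bank (i + j) mod w, so the w elements of any one row and of any one column fall into w distinct banks:

```python
w = 4  # number of memory banks

def bank(i, j):
    # diagonal arrangement: element (i, j) is stored in a[i][(i + j) % w],
    # and for a row-major w x w array the column index is the bank number
    return (i + j) % w

row_banks = [bank(1, j) for j in range(w)]  # row-wise access by w threads
col_banks = [bank(i, 1) for i in range(w)]  # column-wise access by w threads
```

Both access patterns touch every bank exactly once, so neither serializes.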
First, a block in the global memory is read row-wise and written into a matrix with diagonal arrangement row-wise. After that, the matrix is read column-wise and written into a block in the global memory row-wise. The reader should refer to Figure 7, which illustrates the transpose of a block using a 4 × 4 matrix with diagonal arrangement. By executing this block transpose for all blocks in parallel so that each pair of corresponding blocks is swapped appropriately, the transpose of a √n × √n matrix can be done. The reader should refer to [16] for the details of the transpose.

Figure 7. Transpose of a block using a 4 × 4 matrix with diagonal arrangement

By executing the matrix transpose twice, we can design the 4R4W SAT algorithm as follows:

[4R4W SAT algorithm]
Step 1: Compute the column-wise prefix-sums
Step 2: Transpose
Step 3: Compute the column-wise prefix-sums
Step 4: Transpose

After Steps 1, 2, and 3, barrier synchronization is necessary. Also, each step needs no more than 2n coalesced global memory accesses. Thus, we have:

Lemma 3: The global memory access cost of the 4R4W SAT algorithm is at most 8n/w + 4(l − 1).

V. 2R1W SAT ALGORITHM

The main purpose of this section is to review a SAT algorithm for GPUs shown in [14]. Since this SAT algorithm performs 2n read and n write operations to the global memory, we call it the 2R1W SAT algorithm. The 2R1W SAT algorithm that we will explain is slightly different from that in [14], for easier understanding. Suppose that a √n × √n matrix a is partitioned into n/w blocks of √w × √w elements each. The 2R1W SAT algorithm has three steps as follows:

[2R1W SAT algorithm]
Step 1: Each DMM reads a block from the global memory and writes it into the shared memory. The column-wise sums, the row-wise sums, and the sum of the block are computed. More specifically, for a block a' of size √w × √w,
- column-wise sums: C[i] = Σ_{0 ≤ j ≤ √w − 1} a'[j][i] for all i (0 ≤ i ≤ √w − 1),
- row-wise sums: R[i] = Σ_{0 ≤ j ≤ √w − 1} a'[i][j] for all i (0 ≤ i ≤ √w − 1), and
- sum: S = Σ_{0 ≤ i ≤ √w − 1} R[i]
are computed. The column-wise sums of all blocks excluding the bottom blocks are written into the global memory such that they constitute a matrix of size (√(n/w) − 1) × √n. Similarly, the row-wise sums of all blocks excluding the rightmost blocks constitute a matrix of size √n × (√(n/w) − 1), and the sums constitute a matrix of size (√(n/w) − 1) × (√(n/w) − 1).
The reader should refer to Figure 8, which illustrates the resulting values for the matrix in Figure 3 with √w = 3. Let C, R, and S denote the matrices of the resulting values for the column-wise sums, the row-wise sums, and the sums, respectively. In Figure 8, the sizes of C, R, and S are 2 × 9, 9 × 2, and 2 × 2, respectively.
Step 2: The column-wise prefix-sums of C and the row-wise prefix-sums of R are computed in the same way as in the 2R2W SAT algorithm. If S is no larger than w × w, then we compute the SAT of S using a single DMM. Otherwise, we execute the 2R1W SAT algorithm recursively for S. The reader should refer to Figure 8 for the resulting values of C, R, and S.
Step 3-1: Each DMM reads a block from the global memory and writes it into the shared memory. Let a' denote the block read by a particular DMM. It reads √w elements of C and adds them to the top row of a' so that each of the resulting sums is the sum of all elements above it, inclusive, in the same column. Similarly, it reads √w elements of R and adds them to the leftmost column of a' so that each of the resulting sums is the sum of all elements to the left of it, inclusive, in the same row. Further, an element of S is added to the top left corner of a' so that the resulting sum is the sum of all blocks above and to the left of it, inclusive. The reader should refer to Figure 9, which illustrates these operations for a block. Also, Figure 8 illustrates the resulting values of all blocks.
Step 3-2: Each DMM computes the SAT of the block obtained in Step 3-1, and the resulting values are written into the global memory. Figure 9 illustrates the values of a block before and after this step. The reader should have no difficulty confirming that the block thus obtained stores the SAT of the input matrix correctly.

Figure 9. Step 3 of the 2R1W SAT algorithm

Let us evaluate the global memory access cost. In Step 1, all elements of a are read, and C, R, and S are written. Thus, n elements are read from the global memory and fewer than 2n/√w + n/w elements are written into the global memory.

Figure 8. The 2R1W SAT algorithm executed for the matrix in Figure 3 with √w = 3: the column-wise sums C, the row-wise sums R, and the sums S after Steps 1, 2, and 3

Note that we should write Rᵀ, that is, the transpose of R, into the global memory so that the row-wise prefix-sums computation in Step 2 uses coalesced memory access. If this is the case, the row-wise prefix-sums of R correspond to the column-wise prefix-sums of Rᵀ, which can be computed by coalesced memory access to the global memory. Also, in Step 1, the column-wise sums, the row-wise sums, and the SAT of a block in a shared memory are computed. This computation can be done without bank conflicts by the diagonal arrangement of a block. Each of the d DMMs performs the computation of the column-wise sums for n/(wd) blocks of size √w × √w. Since the memory access is conflict-free, this takes only O(n/(√w·d)) time units, which is so small that it can be hidden by the latency overhead. Step 2 performs the computation of the column-wise prefix-sums and the row-wise prefix-sums for matrices of about n/√w elements each. It also computes the SAT of S, a matrix of about n/w elements, recursively, which performs global memory access to O(n/w) elements. In Step 3-1, all elements of a and the resulting values of C, R, and S are read from the global memory. In Step 3-2, the resulting SAT is written into the global memory. Thus, the 2R1W SAT algorithm performs at most 3n + 8n/√w + O(n/w) coalesced memory access operations. This includes the memory access by the recursive computation for S. Let us evaluate the number of barrier synchronization steps. A barrier synchronization step is necessary after each of Steps 1 and 2 if the SAT of S is computed without recursion. If the SAT of S is computed recursively, two additional barrier synchronization steps are necessary for each level of recursion. Hence, if the 2R1W SAT algorithm involves depth-k recursion, it performs 2k + 2 barrier synchronization steps. Thus, we have:

Lemma 4: The global memory access cost of the 2R1W SAT algorithm with recursion depth k is 3n/w + 8n/(w√w) + O(n/w²) + (2k + 3)(l − 1).
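The three steps of the 2R1W algorithm (without the recursion on S, i.e., for the case k = 0) can be prototyped sequentially as follows; this NumPy sketch is our own, with bw standing in for the block side √w:

```python
import numpy as np

def sat_2r1w(a, bw):
    """Sequential prototype of the three steps of the 2R1W SAT algorithm
    (no recursion on S); bw is the block side, playing the role of sqrt(w)."""
    nb = a.shape[0] // bw                # number of blocks per row/column
    # Step 1: per-block column-wise sums C, row-wise sums R, block sums S
    blocks = a.reshape(nb, bw, nb, bw).swapaxes(1, 2)   # [bi, bj, i, j]
    C = blocks.sum(axis=2)               # column sums of each block
    R = blocks.sum(axis=3)               # row sums of each block
    S = blocks.sum(axis=(2, 3))          # total sum of each block
    # Step 2: prefix-sums of C down block-columns, of R along block-rows,
    # and the SAT of the small matrix S
    Cp = np.cumsum(C, axis=0)
    Rp = np.cumsum(R, axis=1)
    Sp = np.cumsum(np.cumsum(S, axis=0), axis=1)
    # Step 3: fold the sums of the blocks above / to the left into each
    # block's top row, leftmost column and top-left corner (Step 3-1),
    # then compute the SAT of the block locally (Step 3-2)
    out = np.empty_like(a)
    for bi in range(nb):
        for bj in range(nb):
            blk = blocks[bi, bj].copy()
            if bi > 0:
                blk[0, :] += Cp[bi - 1, bj]
            if bj > 0:
                blk[:, 0] += Rp[bi, bj - 1]
            if bi > 0 and bj > 0:
                blk[0, 0] += Sp[bi - 1, bj - 1]
            sat = np.cumsum(np.cumsum(blk, axis=0), axis=1)
            out[bi*bw:(bi+1)*bw, bj*bw:(bj+1)*bw] = sat
    return out
```

The correctness argument mirrors Step 3-1: the blocks above a given block are accounted for by C, the blocks to its left by R, and the blocks above-left by S, so the local block SAT completes the global one.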
Since w = 32 in current GPUs, k = 0 if n ≤ 2^20 and k = 1 if n ≤ 2^30. Thus, k is no more than 1 from the practical point of view.

VI. OUR 1R1W SAT ALGORITHM

The main purpose of this section is to show our novel SAT algorithm called the 1R1W SAT algorithm. This algorithm performs only n + O(n/√w) read operations and n + O(n/√w) write operations to the global memory. Before showing the 1R1W SAT algorithm, we present the 4R1W SAT algorithm. By combining techniques used in the 4R1W and 2R1W SAT algorithms, we can obtain the 1R1W SAT algorithm. Let b be the SAT of an input √n × √n matrix a. Suppose that the values of b[i − 1][j − 1], b[i][j − 1], and b[i − 1][j] are already computed. We can obtain the value of b[i][j] by evaluating the following formula:

b[i][j] = a[i][j] + b[i][j − 1] + b[i − 1][j] − b[i − 1][j − 1]   (1)

Using this formula, the 4R1W SAT algorithm computes the SAT in a diagonal scan order from the top left to the bottom right. More specifically, the 4R1W algorithm has 2√n − 1 stages, and each Stage k (0 ≤ k ≤ 2√n − 2) computes Formula (1) for all i and j such that i + j = k. Figure 10 illustrates the computation performed in Stage 7. Let us evaluate the performance of the 4R1W SAT algorithm. To compute each b[i][j], 3 elements of b and 1 element of a are read. Also, the resulting value is written into b. Thus, 4n read operations and n write operations are performed. Unfortunately, all memory accesses are stride. Further, a barrier synchronization step is necessary after every stage from Stage 0 to Stage 2√n − 3. Thus, we have:

Lemma 5: The global memory access cost of the 4R1W SAT algorithm is 5n + (2√n − 1)l.

Figure 10. Stage 7 of the 4R1W SAT algorithm

We are now in a position to show our new 1R1W SAT algorithm. The idea is to extend the 4R1W SAT algorithm to perform the SAT computation block-wise. In each block-wise computation, a computation similar to the 2R1W SAT algorithm is performed. Again, an input matrix a of size √n × √n is partitioned into √(n/w) × √(n/w) blocks of size √w × √w each. Let A(i, j) (0 ≤ i, j ≤ √(n/w) − 1) denote the block in the i-th row and the j-th column. The 1R1W SAT algorithm has 2√(n/w) − 1 stages. Each Stage k (0 ≤ k ≤ 2√(n/w) − 2) computes the SAT of the blocks A(i, j) with i + j = k. The reader should refer to Figure 11, which illustrates the computation performed in Stage 3. In Stage 3, A(2, 1) and A(1, 2) are computed using the resulting values in A(2, 0), A(1, 1), and A(0, 2). From these resulting values, we can obtain C, R, and S for each of A(2, 1) and A(1, 2). For example, in Figure 11, C for A(2, 1) is (9, 11, 9). This can be obtained from the resulting value 13 at the bottom right corner of A(1, 0) and the resulting values (22, 33, 42) in the bottom row of A(1, 1). More specifically, we can obtain C by computing the pairwise subtraction (22, 33, 42) − (13, 22, 33) = (9, 11, 9). After the values of C, R, and S are obtained, these values are added to A(2, 1) in the same way as Step 3-1 of the 2R1W SAT algorithm. Finally, we compute the SAT of A(2, 1) in the same way as Step 3-2 of the 2R1W SAT algorithm. Let us evaluate the performance of the 1R1W SAT algorithm. In each stage, the values of a block in the global memory are read and the resulting values are written into the global memory. Also, the values necessary to compute C, R, and S are read from and written into the global memory. For each block, 2√w + 1 extra elements are accessed for this task. Since we have n/w blocks, n + (2√w + 1)·n/w elements are read and n + (2√w + 1)·n/w elements are written over all stages.
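Before turning to the cost, Formula (1) and the diagonal scan order of the 4R1W algorithm can be expressed directly (again a sequential sketch of ours; on the real machine, all cells of a stage are updated in parallel between barriers):

```python
import numpy as np

def sat_4r1w(a):
    sn = a.shape[0]  # the matrix is sn x sn
    b = np.zeros_like(a)
    # Stage k handles every (i, j) on the anti-diagonal i + j = k, using
    # Formula (1): b[i][j] = a[i][j] + b[i][j-1] + b[i-1][j] - b[i-1][j-1];
    # all three b-values it needs were produced in earlier stages.
    for k in range(2 * sn - 1):
        for i in range(max(0, k - sn + 1), min(k, sn - 1) + 1):
            j = k - i
            b[i][j] = (a[i][j]
                       + (b[i][j - 1] if j > 0 else 0)
                       + (b[i - 1][j] if i > 0 else 0)
                       - (b[i - 1][j - 1] if i > 0 and j > 0 else 0))
    return b
```

The 1R1W algorithm applies the same wavefront, but to whole blocks rather than single elements, which is what shrinks the number of stages from 2√n − 1 to 2√(n/w) − 1.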
Barrier synchronization is necessary after each of Stages from to n 2 3. Thus,ehave, Theorem 6: The global memory access cost of 4R1W C R S Figure 11. Stage 3 of 1R1W SAT algorithm for a matrix in Figure 3 SAT algorithm is 2 n + O( n n 2 )+(2 2)l. VII. (1 + r)r1w SAT ALGORITHM The main urose of this section is to accelerate the SAT comutation further by combining 1R1W and 2R1W SAT algorithms. The idea of further acceleration is to use 2R1W SAT algorithm in early and late stages of 1R1W SAT algorithms to reduce the latency overhead. Again, suose that a n n matrix a is artitioned into n n blocks of size each. As illustrated in Figure 12, for any fixed arameter r ( < r < 1), e artition blocks into (A) to left triangle, (B) bottom right triangle, and (C) remaining blocks. Clearly, (A) and (B) have r n +( r n 1)+ +1 = r n ( r n +1)=2 ß rn 2 2 blocks each. We first use 2R1W SAT algorithm to comute the SAT of (A). After that, e use 1R1W SAT algorithm for (C). Finally, 2R1W SAT algorithm is used for the comutation of the SAT in (B). Let us evaluate the erformance. Since (A) and (B) have aroximately rn elements in rn blocks each, 2R1W SAT algorithm for (A) and (B) erforms rn + O( rn ) read oerations and rn 2 + O( rn ) rite oerations each. Also, since (C) has (1 r)n elements, 1R1W SAT algorithm for (C) erforms (1 r)n + O( (1 r)n ) read oerations and (1 r)n + O( (1 r)n ) rite oerations. Hence, this SAT algorithm erforms (1 + r)n + O( n ) read oerations and n+o( n ) rite oerations. Thus, e call this SAT algorithm (1 + r)r1w SAT algorithm. Further, 2R1W SAT algorithm for (A) and (B) needs 2+2k barrier synchronization stes each, here k is the deth of the recursion of 2R1W SAT algorithm. Since 1R1W SAT algorithm for (C) has 2 (1 r) n 1 stages, it needs 2 (1 r) n 2 barrier synchronization stes. Also, after the comutation of the SAT for (A) and (B), 1 barrier synchronization ste each

is necessary. In total, (1 + r)R1W SAT algorithm executes 2(1 − √r)√n/w + 4 + 4k barrier synchronization steps. Thus, we have,

Theorem 7: The global memory access cost of (1 + r)R1W SAT algorithm is (2 + r)n + O(n/w) + (2(1 − √r)√n/w + 5 + 4k)l.

Figure 12. Partition of a √n × √n matrix into (A) the top-left triangle (2R1W), (C) the remaining blocks (1R1W), and (B) the bottom-right triangle (2R1W) for (1 + r)R1W SAT algorithm

When r = 0.25, the global memory access cost of 1.25R1W SAT algorithm is 2.25n + (√n/w + 5 + 4k)l time units. Since 2R1W and 1R1W SAT algorithms run in approximately 3n + (3 + 2k)l and 2n + (2√n/w)l time units, respectively, 1.25R1W SAT algorithm may run faster than these algorithms. Further, we can select the best value of r that minimizes the running time of (1 + r)R1W SAT algorithm.

VIII. EXPERIMENTAL RESULTS

We have implemented all SAT algorithms presented so far in this paper on GeForce GTX 780 Ti. Since the number of memory banks and the number of threads in a warp are both 32 [4], we have implemented the SAT algorithms with w = 32. Barrier synchronization of all threads is implemented by invoking separate CUDA kernel calls. For example, one CUDA kernel call is invoked for each of the 2√n − 1 stages of 4R1W SAT algorithm. We have tested several configurations in terms of the number of threads in a CUDA block, and selected the best configuration. For example, in 2R2W, 4R4W, and 4R1W SAT algorithms, CUDA blocks with 64 threads each are invoked. Since each Stage i and each Stage 2√n − 2 − i (0 ≤ i ≤ √n − 1) of 4R1W SAT algorithm computes i + 1 values of the SAT, it uses i + 1 threads in ⌈(i + 1)/64⌉ CUDA blocks. In 2R1W and 1R1W SAT algorithms, a block of the matrix is copied to the shared memory in a streaming multiprocessor, and the column-wise sums, the row-wise sums, and/or the SAT of it are computed. After that, the resulting values are copied to the global memory. For this operation, we use one CUDA block with 128 threads for each block. All 128 threads in a CUDA block are used to copy a block to the shared memory, and 32 threads out of the 128 are used to compute the column-wise sums, the row-wise sums, and/or the SAT.
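The triangular partition of Figure 12 can be sketched by classifying blocks by their anti-diagonal index. This is a small illustration with hypothetical sizes, not code from the paper; `sn_w` denotes √n/w (the matrix width in blocks) and `t` denotes √(rn)/w (the triangle side in blocks):

```python
def partition_blocks(sn_w, t):
    """Split the block indices (i, j) of an sn_w x sn_w grid into the
    top-left triangle (A), the bottom-right triangle (B), and the
    remaining blocks (C), using the block anti-diagonal index i + j."""
    A, B, C = [], [], []
    for i in range(sn_w):
        for j in range(sn_w):
            if i + j < t:                    # first t anti-diagonals
                A.append((i, j))
            elif i + j >= 2 * sn_w - 1 - t:  # last t anti-diagonals
                B.append((i, j))
            else:
                C.append((i, j))
    return A, B, C

# Example: an 8 x 8 grid of blocks with triangle side t = 2,
# i.e. r = (2w / sqrt(n))^2.  (A) and (B) get t(t+1)/2 = 3 blocks each.
A, B, C = partition_blocks(8, 2)
print(len(A), len(B), len(C))  # 3 3 58
```

The count len(A) = t(t + 1)/2 matches the block count √(rn)/w · (√(rn)/w + 1)/2 derived above.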
Table II shows the running time of the SAT algorithms for a double (64-bit) matrix of size from 1K × 1K to 18K × 18K. Since an 18K × 18K 64-bit matrix uses 2.53GBytes, it is hard to store a larger matrix in the global memory of GeForce GTX 780 Ti, whose size is 3GBytes. The running time of the best SAT algorithm for each value of n is highlighted in boldface. Since 4R1W SAT algorithm performs a lot of kernel calls and stride memory accesses, and has a large memory access latency overhead, it needs much more computing time than the other algorithms. Recall that 4R4W SAT algorithm corresponds to 2R2W SAT algorithm with transpose, and 4R4W SAT algorithm performs many more memory access operations than 2R2W SAT algorithm. Since 2R2W SAT algorithm performs stride memory accesses, it is nevertheless much slower than 4R4W SAT algorithm. These experimental results imply that stride memory access imposes a large penalty on the computing time. Recall that 2R1W and 1R1W SAT algorithms are block-based algorithms that perform 3n + O(n/w) and 2n + O(n/w) global memory access operations, respectively. Hence, they are faster than 4R4W SAT algorithm, which performs approximately 8n global memory access operations. Although 1R1W SAT algorithm performs fewer global memory access operations than 2R1W SAT algorithm, it runs slower when n ≤ 6K. The reason is that 1R1W SAT algorithm has a larger latency overhead than 2R1W SAT algorithm, and the latency overhead dominates the bandwidth overhead when the size of the input is small. 1.25R1W SAT algorithm runs faster than both 2R1W and 1R1W SAT algorithms whenever n ≥ 5K. We have evaluated the computing time for all possible values r = (w/√n)², (2w/√n)², … with 0 < r < 1 to find the best value of r that minimizes the running time of (1 + r)R1W SAT algorithm. Table II also shows the values of r (0 < r < 1) that minimize the running time of (1 + r)R1W SAT algorithm. From the table, we can see that (1 + r)R1W SAT algorithm attains the best performance when n ≥ 5K.
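The search over the grid of candidate values r = (w/√n)², (2w/√n)², … can be sketched by evaluating the access-cost formula of Theorem 7. This is a rough cost model only: the latency l, recursion depth k, and matrix size below are hypothetical placeholders, and the choice in the paper is made by measuring actual running times, not by this formula:

```python
from math import sqrt

def cost(r, n, w, l, k):
    # Global memory access cost of (1+r)R1W SAT algorithm (Theorem 7),
    # ignoring the O(n/w) lower-order term.
    return (2 + r) * n + (2 * (1 - sqrt(r)) * sqrt(n) / w + 5 + 4 * k) * l

n, w, l, k = 4096 * 4096, 32, 500.0, 2      # hypothetical parameters
sn_w = int(sqrt(n)) // w                     # sqrt(n)/w: width in blocks
# Candidates r = (t*w/sqrt(n))^2 for integer triangle sides t, 0 < r < 1.
candidates = [(t * w / sqrt(n)) ** 2 for t in range(1, sn_w)]
best_r = min(candidates, key=lambda r: cost(r, n, w, l, k))
```

With these placeholder parameters the bandwidth term dominates, so the model favors a small r; in the paper's experiments the best r likewise shrinks as the matrix grows.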
Also, the value of r that gives the best performance decreases as the size of the matrix increases. This is because the memory bandwidth overhead of 1R1W SAT algorithm dominates the latency overhead for larger matrices, so 1R1W SAT algorithm has better performance than 2R1W SAT algorithm. We can conjecture that 1R1W SAT algorithm could be the best if an input matrix were much larger than 18K × 18K. To see the speed-up factor of the SAT algorithms running on the GPU over a conventional CPU, we have evaluated the performance of several sequential SAT algorithms on Intel Xeon X7460 (2.66GHz). Table II shows the running time of two sequential algorithms as follows: 2R2W(CPU): The column-wise prefix-sums are computed in a raster scan order from the top row to the bottom row. More specifically, a[i + 1][j] ← a[i + 1][j] + a[i][j] is executed in a raster scan order of (i, j). The row-wise

Table II
THE RUNNING TIME OF SAT ALGORITHMS (IN MILLISECONDS) AND THE VALUE OF r THAT MINIMIZES THE RUNNING TIME OF (1 + r)R1W SAT ALGORITHM FOR MATRICES OF SIZES FROM 1K × 1K TO 18K × 18K
(Columns: n = 1K, 2K, 3K, 4K, 5K, 6K, 7K, 8K, 10K, 12K, 14K, 16K, 18K. Rows: 2R2W, 4R4W, 4R1W, 2R1W, 1R1W, 1.25R1W, fastest (1 + r)R1W, r (0 < r < 1), 2R2W(CPU), 4R1W(CPU). The numeric entries are not recoverable from this transcription.)

prefix-sums are also computed in a raster scan order, that is, a[i][j + 1] ← a[i][j + 1] + a[i][j] is executed in a raster scan order of (i, j). 4R1W(CPU): Formula (1) is evaluated in a raster scan order of (i, j). From the table, we can see that 4R1W(CPU) SAT algorithm runs faster than 2R2W(CPU) SAT algorithm because of its memory access locality. Also, (1 + r)R1W SAT algorithm runs more than 10 times faster than 4R1W(CPU) SAT algorithm when n ≥ 5K.

IX. CONCLUSION

The main contribution of this paper is to introduce the asynchronous Hierarchical Memory Machine, which captures the essence of CUDA-enabled GPUs. We have also presented a global-memory-access-optimal parallel algorithm for computing the summed area table on the asynchronous HMM. The experimental results on GeForce GTX 780 Ti showed that our best algorithm, (1 + r)R1W SAT algorithm, runs faster than any other algorithm for an input matrix of size 5K × 5K or larger. It also runs at least 10 times faster than the best sequential algorithm running on a single CPU.

REFERENCES

[1] W. W. Hwu, GPU Computing Gems Emerald Edition. Morgan Kaufmann, 2011.
[2] K. Ogawa, Y. Ito, and K. Nakano, "Efficient Canny edge detection using a GPU," in Proc. of International Conference on Networking and Computing. IEEE CS Press, Nov. 2010.
[3] A. Uchida, Y. Ito, and K. Nakano, "Fast and accurate template matching using pixel rearrangement on the GPU," in Proc. of International Conference on Networking and Computing. IEEE CS Press, Dec. 2011.
[4] NVIDIA Corporation, "NVIDIA CUDA C programming guide version 5.0," 2012.
[5] D. Man, K. Uda, H. Ueyama, Y. Ito, and K. Nakano, "Implementations of a parallel algorithm for computing Euclidean distance map in multicore processors and GPUs," International Journal of Networking and Computing, vol. 1, no. 2, July 2011.
[6] NVIDIA Corporation, "NVIDIA CUDA C best practice guide version 3.1," 2010.
[7] Y. Ito and K. Nakano, "A GPU implementation of dynamic programming for the optimal polygon triangulation," IEICE Transactions on Information and Systems, vol. E96-D, no. 12, Dec. 2013.
[8] K. Nakano, "Simple memory machine models for GPUs," International Journal of Parallel, Emergent and Distributed Systems, vol. 29, no. 1, 2014.
[9] A. V. Aho, J. D. Ullman, and J. E. Hopcroft, Data Structures and Algorithms. Addison Wesley, 1983.
[10] K. Nakano, "The hierarchical memory machine model for GPUs," in Proc. of International Parallel and Distributed Processing Symposium Workshops, May 2013.
[11] F. Crow, "Summed-area tables for texture mapping," in Proc. of the 11th Annual Conference on Computer Graphics and Interactive Techniques, 1984.
[12] A. Lauritzen, "Chapter 8: Summed-area variance shadow maps," in GPU Gems 3. Addison-Wesley, 2007.
[13] K. Nakano, "Optimal parallel algorithms for computing the sum, the prefix-sums, and the summed area table on the memory machine models," IEICE Trans. on Information and Systems, vol. E96-D, no. 12, 2013.
[14] D. Nehab, A. Maximo, R. S. Lima, and H. Hoppe, "GPU-efficient recursive filtering and summed-area tables," ACM Trans. Graph., vol. 30, no. 6, 2011.
[15] K. Nakano, "Simple memory machine models for GPUs," in Proc. of International Parallel and Distributed Processing Symposium Workshops, May 2012.
[16] A. Kasagi, K. Nakano, and Y. Ito, "An optimal offline permutation algorithm on the hierarchical memory machine, with the GPU implementation," in Proc. of International Conference on Parallel Processing. IEEE CS Press, Oct. 2013.

RESEARCH ARTICLE. Simple Memory Machine Models for GPUs

RESEARCH ARTICLE. Simple Memory Machine Models for GPUs The International Journal of Parallel, Emergent and Distributed Systems Vol. 00, No. 00, Month 2011, 1 22 RESEARCH ARTICLE Simle Memory Machine Models for GPUs Koji Nakano a a Deartment of Information

More information

Simple Memory Machine Models for GPUs

Simple Memory Machine Models for GPUs 2012 IEEE 2012 26th IEEE International 26th International Parallel Parallel and Distributed and Distributed Processing Processing Symosium Symosium Workshos Workshos & PhD Forum Simle Memory Machine Models

More information

Sequential Memory Access on the Unified Memory Machine with Application to the Dynamic Programming

Sequential Memory Access on the Unified Memory Machine with Application to the Dynamic Programming Sequential Memory Access on the Unified Memory Machine ith Alication to the Dynamic Programming Koji Nakano Deartment of Information Engineering Hiroshima University Kagamiyama --, Higashi Hiroshima, 79-87

More information

2010 First International Conference on Networking and Computing

2010 First International Conference on Networking and Computing First International Conference on Networking and Comuting Imlementations of Parallel Comutation of Euclidean Distance Ma in Multicore Processors and GPUs Duhu Man, Kenji Uda, Hironobu Ueyama, Yasuaki Ito,

More information

Hardware-Accelerated Formal Verification

Hardware-Accelerated Formal Verification Hardare-Accelerated Formal Verification Hiroaki Yoshida, Satoshi Morishita 3 Masahiro Fujita,. VLSI Design and Education Center (VDEC), University of Tokyo. CREST, Jaan Science and Technology Agency 3.

More information

GPU-accelerated Verification of the Collatz Conjecture

GPU-accelerated Verification of the Collatz Conjecture GPU-accelerated Verification of the Collatz Conjecture Takumi Honda, Yasuaki Ito, and Koji Nakano Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi Hiroshima 739-8527,

More information

PAPER A GPU Implementation of Dynamic Programming for the Optimal Polygon Triangulation

PAPER A GPU Implementation of Dynamic Programming for the Optimal Polygon Triangulation IEICE TRANS.??, VOL.Exx??, NO.xx XXXX x PAPER A GPU Implementation of Dynamic Programming for the Optimal Polygon Triangulation Yasuaki ITO and Koji NAKANO, Members SUMMARY This paper presents a GPU (Graphics

More information

A GPU Implementation of Dynamic Programming for the Optimal Polygon Triangulation

A GPU Implementation of Dynamic Programming for the Optimal Polygon Triangulation 2596 IEICE TRANS. INF. & SYST., VOL.E96 D, NO.12 DECEMBER 2013 PAPER Special Section on Parallel and Distributed Computing and Networking A GPU Implementation of Dynamic Programming for the Optimal Polygon

More information

COMP Parallel Computing. BSP (1) Bulk-Synchronous Processing Model

COMP Parallel Computing. BSP (1) Bulk-Synchronous Processing Model COMP 6 - Parallel Comuting Lecture 6 November, 8 Bulk-Synchronous essing Model Models of arallel comutation Shared-memory model Imlicit communication algorithm design and analysis relatively simle but

More information

Introduction to Parallel Algorithms

Introduction to Parallel Algorithms CS 1762 Fall, 2011 1 Introduction to Parallel Algorithms Introduction to Parallel Algorithms ECE 1762 Algorithms and Data Structures Fall Semester, 2011 1 Preliminaries Since the early 1990s, there has

More information

Lecture 18. Today, we will discuss developing algorithms for a basic model for parallel computing the Parallel Random Access Machine (PRAM) model.

Lecture 18. Today, we will discuss developing algorithms for a basic model for parallel computing the Parallel Random Access Machine (PRAM) model. U.C. Berkeley CS273: Parallel and Distributed Theory Lecture 18 Professor Satish Rao Lecturer: Satish Rao Last revised Scribe so far: Satish Rao (following revious lecture notes quite closely. Lecture

More information

AUTOMATIC GENERATION OF HIGH THROUGHPUT ENERGY EFFICIENT STREAMING ARCHITECTURES FOR ARBITRARY FIXED PERMUTATIONS. Ren Chen and Viktor K.

AUTOMATIC GENERATION OF HIGH THROUGHPUT ENERGY EFFICIENT STREAMING ARCHITECTURES FOR ARBITRARY FIXED PERMUTATIONS. Ren Chen and Viktor K. inuts er clock cycle Streaming ermutation oututs er clock cycle AUTOMATIC GENERATION OF HIGH THROUGHPUT ENERGY EFFICIENT STREAMING ARCHITECTURES FOR ARBITRARY FIXED PERMUTATIONS Ren Chen and Viktor K.

More information

An Efficient Video Program Delivery algorithm in Tree Networks*

An Efficient Video Program Delivery algorithm in Tree Networks* 3rd International Symosium on Parallel Architectures, Algorithms and Programming An Efficient Video Program Delivery algorithm in Tree Networks* Fenghang Yin 1 Hong Shen 1,2,** 1 Deartment of Comuter Science,

More information

Collective communication: theory, practice, and experience

Collective communication: theory, practice, and experience CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Comutat.: Pract. Exer. 2007; 19:1749 1783 Published online 5 July 2007 in Wiley InterScience (www.interscience.wiley.com)..1206 Collective

More information

An Efficient Coding Method for Coding Region-of-Interest Locations in AVS2

An Efficient Coding Method for Coding Region-of-Interest Locations in AVS2 An Efficient Coding Method for Coding Region-of-Interest Locations in AVS2 Mingliang Chen 1, Weiyao Lin 1*, Xiaozhen Zheng 2 1 Deartment of Electronic Engineering, Shanghai Jiao Tong University, China

More information

Complexity Issues on Designing Tridiagonal Solvers on 2-Dimensional Mesh Interconnection Networks

Complexity Issues on Designing Tridiagonal Solvers on 2-Dimensional Mesh Interconnection Networks Journal of Comuting and Information Technology - CIT 8, 2000, 1, 1 12 1 Comlexity Issues on Designing Tridiagonal Solvers on 2-Dimensional Mesh Interconnection Networks Eunice E. Santos Deartment of Electrical

More information

Shuigeng Zhou. May 18, 2016 School of Computer Science Fudan University

Shuigeng Zhou. May 18, 2016 School of Computer Science Fudan University Query Processing Shuigeng Zhou May 18, 2016 School of Comuter Science Fudan University Overview Outline Measures of Query Cost Selection Oeration Sorting Join Oeration Other Oerations Evaluation of Exressions

More information

Collective Communication: Theory, Practice, and Experience. FLAME Working Note #22

Collective Communication: Theory, Practice, and Experience. FLAME Working Note #22 Collective Communication: Theory, Practice, and Exerience FLAME Working Note # Ernie Chan Marcel Heimlich Avi Purkayastha Robert van de Geijn Setember, 6 Abstract We discuss the design and high-erformance

More information

Efficient Parallel Hierarchical Clustering

Efficient Parallel Hierarchical Clustering Efficient Parallel Hierarchical Clustering Manoranjan Dash 1,SimonaPetrutiu, and Peter Scheuermann 1 Deartment of Information Systems, School of Comuter Engineering, Nanyang Technological University, Singaore

More information

Texture Mapping with Vector Graphics: A Nested Mipmapping Solution

Texture Mapping with Vector Graphics: A Nested Mipmapping Solution Texture Maing with Vector Grahics: A Nested Mimaing Solution Wei Zhang Yonggao Yang Song Xing Det. of Comuter Science Det. of Comuter Science Det. of Information Systems Prairie View A&M University Prairie

More information

Modified Bloom filter for high performance hybrid NoSQL systems

Modified Bloom filter for high performance hybrid NoSQL systems odified Bloom filter for high erformance hybrid NoSQL systems A.B.Vavrenyuk, N.P.Vasilyev, V.V.akarov, K.A.atyukhin,..Rovnyagin, A.A.Skitev National Research Nuclear University EPhI (oscow Engineering

More information

Randomized Selection on the Hypercube 1

Randomized Selection on the Hypercube 1 Randomized Selection on the Hyercube 1 Sanguthevar Rajasekaran Det. of Com. and Info. Science and Engg. University of Florida Gainesville, FL 32611 ABSTRACT In this aer we resent randomized algorithms

More information

A Reconfigurable Architecture for Quad MAC VLIW DSP

A Reconfigurable Architecture for Quad MAC VLIW DSP A Reconfigurable Architecture for Quad MAC VLIW DSP Sangwook Kim, Sungchul Yoon, Jaeseuk Oh, Sungho Kang Det. of Electrical & Electronic Engineering, Yonsei University 132 Shinchon-Dong, Seodaemoon-Gu,

More information

Randomized algorithms: Two examples and Yao s Minimax Principle

Randomized algorithms: Two examples and Yao s Minimax Principle Randomized algorithms: Two examles and Yao s Minimax Princile Maximum Satisfiability Consider the roblem Maximum Satisfiability (MAX-SAT). Bring your knowledge u-to-date on the Satisfiability roblem. Maximum

More information

10. Parallel Methods for Data Sorting

10. Parallel Methods for Data Sorting 10. Parallel Methods for Data Sorting 10. Parallel Methods for Data Sorting... 1 10.1. Parallelizing Princiles... 10.. Scaling Parallel Comutations... 10.3. Bubble Sort...3 10.3.1. Sequential Algorithm...3

More information

A New and Efficient Algorithm-Based Fault Tolerance Scheme for A Million Way Parallelism

A New and Efficient Algorithm-Based Fault Tolerance Scheme for A Million Way Parallelism A New and Efficient Algorithm-Based Fault Tolerance Scheme for A Million Way Parallelism Erlin Yao, Mingyu Chen, Rui Wang, Wenli Zhang, Guangming Tan Key Laboratory of Comuter System and Architecture Institute

More information

Directed File Transfer Scheduling

Directed File Transfer Scheduling Directed File Transfer Scheduling Weizhen Mao Deartment of Comuter Science The College of William and Mary Williamsburg, Virginia 387-8795 wm@cs.wm.edu Abstract The file transfer scheduling roblem was

More information

CMSC 425: Lecture 16 Motion Planning: Basic Concepts

CMSC 425: Lecture 16 Motion Planning: Basic Concepts : Lecture 16 Motion lanning: Basic Concets eading: Today s material comes from various sources, including AI Game rogramming Wisdom 2 by S. abin and lanning Algorithms by S. M. LaValle (Chats. 4 and 5).

More information

IMS Network Deployment Cost Optimization Based on Flow-Based Traffic Model

IMS Network Deployment Cost Optimization Based on Flow-Based Traffic Model IMS Network Deloyment Cost Otimization Based on Flow-Based Traffic Model Jie Xiao, Changcheng Huang and James Yan Deartment of Systems and Comuter Engineering, Carleton University, Ottawa, Canada {jiexiao,

More information

Tiling for Performance Tuning on Different Models of GPUs

Tiling for Performance Tuning on Different Models of GPUs Tiling for Performance Tuning on Different Models of GPUs Chang Xu Deartment of Information Engineering Zhejiang Business Technology Institute Ningbo, China colin.xu198@gmail.com Steven R. Kirk, Samantha

More information

Space-efficient Region Filling in Raster Graphics

Space-efficient Region Filling in Raster Graphics "The Visual Comuter: An International Journal of Comuter Grahics" (submitted July 13, 1992; revised December 7, 1992; acceted in Aril 16, 1993) Sace-efficient Region Filling in Raster Grahics Dominik Henrich

More information

A Warp-synchronous Implementation for Multiple-length Multiplication on the GPU

A Warp-synchronous Implementation for Multiple-length Multiplication on the GPU A Warp-synchronous Implementation for Multiple-length Multiplication on the GPU Takumi Honda, Yasuaki Ito, Koji Nakano Department of Information Engineering, Hiroshima University Kagamiyama 1-4-1, Higashi-Hiroshima,

More information

SPITFIRE: Scalable Parallel Algorithms for Test Set Partitioned Fault Simulation

SPITFIRE: Scalable Parallel Algorithms for Test Set Partitioned Fault Simulation To aear in IEEE VLSI Test Symosium, 1997 SITFIRE: Scalable arallel Algorithms for Test Set artitioned Fault Simulation Dili Krishnaswamy y Elizabeth M. Rudnick y Janak H. atel y rithviraj Banerjee z y

More information

AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY

AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY Andrew Lam 1, Steven J.E. Wilton 1, Phili Leong 2, Wayne Luk 3 1 Elec. and Com. Engineering 2 Comuter Science

More information

A Study of Protocols for Low-Latency Video Transport over the Internet

A Study of Protocols for Low-Latency Video Transport over the Internet A Study of Protocols for Low-Latency Video Transort over the Internet Ciro A. Noronha, Ph.D. Cobalt Digital Santa Clara, CA ciro.noronha@cobaltdigital.com Juliana W. Noronha University of California, Davis

More information

Efficient Processing of Top-k Dominating Queries on Multi-Dimensional Data

Efficient Processing of Top-k Dominating Queries on Multi-Dimensional Data Efficient Processing of To-k Dominating Queries on Multi-Dimensional Data Man Lung Yiu Deartment of Comuter Science Aalborg University DK-922 Aalborg, Denmark mly@cs.aau.dk Nikos Mamoulis Deartment of

More information

AUTOMATIC EXTRACTION OF BUILDING OUTLINE FROM HIGH RESOLUTION AERIAL IMAGERY

AUTOMATIC EXTRACTION OF BUILDING OUTLINE FROM HIGH RESOLUTION AERIAL IMAGERY AUTOMATIC EXTRACTION OF BUILDING OUTLINE FROM HIGH RESOLUTION AERIAL IMAGERY Yandong Wang EagleView Technology Cor. 5 Methodist Hill Dr., Rochester, NY 1463, the United States yandong.wang@ictometry.com

More information

A GPU Heterogeneous Cluster Scheduling Model for Preventing Temperature Heat Island

A GPU Heterogeneous Cluster Scheduling Model for Preventing Temperature Heat Island A GPU Heterogeneous Cluster Scheduling Model for Preventing Temerature Heat Island Yun-Peng CAO 1,2,a and Hai-Feng WANG 1,2 1 School of Information Science and Engineering, Linyi University, Linyi Shandong,

More information

Synthesis of FSMs on the Basis of Reusable Hardware Templates

Synthesis of FSMs on the Basis of Reusable Hardware Templates Proceedings of the 6th WSEAS Int. Conf. on Systems Theory & Scientific Comutation, Elounda, Greece, August -3, 006 (09-4) Synthesis of FSMs on the Basis of Reusable Hardware Temlates VALERY SKLYAROV, IOULIIA

More information

OMNI: An Efficient Overlay Multicast. Infrastructure for Real-time Applications

OMNI: An Efficient Overlay Multicast. Infrastructure for Real-time Applications OMNI: An Efficient Overlay Multicast Infrastructure for Real-time Alications Suman Banerjee, Christoher Kommareddy, Koushik Kar, Bobby Bhattacharjee, Samir Khuller Abstract We consider an overlay architecture

More information

Sensitivity Analysis for an Optimal Routing Policy in an Ad Hoc Wireless Network

Sensitivity Analysis for an Optimal Routing Policy in an Ad Hoc Wireless Network 1 Sensitivity Analysis for an Otimal Routing Policy in an Ad Hoc Wireless Network Tara Javidi and Demosthenis Teneketzis Deartment of Electrical Engineering and Comuter Science University of Michigan Ann

More information

Matlab Virtual Reality Simulations for optimizations and rapid prototyping of flexible lines systems

Matlab Virtual Reality Simulations for optimizations and rapid prototyping of flexible lines systems Matlab Virtual Reality Simulations for otimizations and raid rototying of flexible lines systems VAMVU PETRE, BARBU CAMELIA, POP MARIA Deartment of Automation, Comuters, Electrical Engineering and Energetics

More information

GPU-Accelerated Bulk Computation of the Eigenvalue Problem for Many Small Real Non-symmetric Matrices

GPU-Accelerated Bulk Computation of the Eigenvalue Problem for Many Small Real Non-symmetric Matrices GPU-Accelerated Bulk Computation of the Eigenvalue Problem for Many Small Real Non-symmetric Matrices Hiroki Tokura, Takumi Honda, Yasuaki Ito, and Koji Nakano Department of Information Engineering, Hiroshima

More information

CASCH - a Scheduling Algorithm for "High Level"-Synthesis

CASCH - a Scheduling Algorithm for High Level-Synthesis CASCH a Scheduling Algorithm for "High Level"Synthesis P. Gutberlet H. Krämer W. Rosenstiel Comuter Science Research Center at the University of Karlsruhe (FZI) HaidundNeuStr. 1014, 7500 Karlsruhe, F.R.G.

More information

Accelerating Ant Colony Optimization for the Vertex Coloring Problem on the GPU

Accelerating Ant Colony Optimization for the Vertex Coloring Problem on the GPU Accelerating Ant Colony Optimization for the Vertex Coloring Problem on the GPU Ryouhei Murooka, Yasuaki Ito, and Koji Nakano Department of Information Engineering, Hiroshima University Kagamiyama 1-4-1,

More information

Gradient Compared l p -LMS Algorithms for Sparse System Identification

Gradient Compared l p -LMS Algorithms for Sparse System Identification Gradient Comared l -LMS Algorithms for Sarse System Identification Yong Feng,, Jiasong Wu, Rui Zeng, Limin Luo, Huazhong Shu. School of Biological Science and Medical Engineering, Southeast University,

More information

Improved heuristics for the single machine scheduling problem with linear early and quadratic tardy penalties

Improved heuristics for the single machine scheduling problem with linear early and quadratic tardy penalties Imroved heuristics for the single machine scheduling roblem with linear early and quadratic tardy enalties Jorge M. S. Valente* LIAAD INESC Porto LA, Faculdade de Economia, Universidade do Porto Postal

More information

An Efficient VLSI Architecture for Adaptive Rank Order Filter for Image Noise Removal

An Efficient VLSI Architecture for Adaptive Rank Order Filter for Image Noise Removal International Journal of Information and Electronics Engineering, Vol. 1, No. 1, July 011 An Efficient VLSI Architecture for Adative Rank Order Filter for Image Noise Removal M. C Hanumantharaju, M. Ravishankar,

More information

Mitigating the Impact of Decompression Latency in L1 Compressed Data Caches via Prefetching

Mitigating the Impact of Decompression Latency in L1 Compressed Data Caches via Prefetching Mitigating the Imact of Decomression Latency in L1 Comressed Data Caches via Prefetching by Sean Rea A thesis resented to Lakehead University in artial fulfillment of the requirement for the degree of

More information

Parallel Construction of Multidimensional Binary Search Trees. Ibraheem Al-furaih, Srinivas Aluru, Sanjay Goil Sanjay Ranka

Parallel Construction of Multidimensional Binary Search Trees. Ibraheem Al-furaih, Srinivas Aluru, Sanjay Goil Sanjay Ranka Parallel Construction of Multidimensional Binary Search Trees Ibraheem Al-furaih, Srinivas Aluru, Sanjay Goil Sanjay Ranka School of CIS and School of CISE Northeast Parallel Architectures Center Syracuse

More information

A Morphological LiDAR Points Cloud Filtering Method based on GPGPU

A Morphological LiDAR Points Cloud Filtering Method based on GPGPU A Morhological LiDAR Points Cloud Filtering Method based on GPGPU Shuo Li 1, Hui Wang 1, Qiuhe Ma 1 and Xuan Zha 2 1 Zhengzhou Institute of Surveying & Maing, No.66, Longhai Middle Road, Zhengzhou, China

More information

split split (a) (b) split split (c) (d)

split split (a) (b) split split (c) (d) International Journal of Foundations of Comuter Science c World Scientic Publishing Comany ON COST-OPTIMAL MERGE OF TWO INTRANSITIVE SORTED SEQUENCES JIE WU Deartment of Comuter Science and Engineering

More information

Distributed Algorithms

Distributed Algorithms Course Outline With grateful acknowledgement to Christos Karamanolis for much of the material Jeff Magee & Jeff Kramer Models of distributed comuting Synchronous message-assing distributed systems Algorithms

More information

Lecture 2: Fixed-Radius Near Neighbors and Geometric Basics

Lecture 2: Fixed-Radius Near Neighbors and Geometric Basics structure arises in many alications of geometry. The dual structure, called a Delaunay triangulation also has many interesting roerties. Figure 3: Voronoi diagram and Delaunay triangulation. Search: Geometric

More information

AUTOMATIC 3D SURFACE RECONSTRUCTION BY COMBINING STEREOVISION WITH THE SLIT-SCANNER APPROACH

AUTOMATIC 3D SURFACE RECONSTRUCTION BY COMBINING STEREOVISION WITH THE SLIT-SCANNER APPROACH AUTOMATIC 3D SURFACE RECONSTRUCTION BY COMBINING STEREOVISION WITH THE SLIT-SCANNER APPROACH A. Prokos 1, G. Karras 1, E. Petsa 2 1 Deartment of Surveying, National Technical University of Athens (NTUA),

More information

An Indexing Framework for Structured P2P Systems

An Indexing Framework for Structured P2P Systems An Indexing Framework for Structured P2P Systems Adina Crainiceanu Prakash Linga Ashwin Machanavajjhala Johannes Gehrke Carl Lagoze Jayavel Shanmugasundaram Deartment of Comuter Science, Cornell University

More information

CLUSTER ANALYSIS AND NEURAL NETWORK

CLUSTER ANALYSIS AND NEURAL NETWORK LUSTER ANALYSIS AND NEURAL NETWORK P.Dostál,P.Pokorný Deartment of Informatics, Faculty of Business and Management, Brno University of Technology Institute of Mathematics, Faculty of Mechanical Engineering,

More information

Coarse grained gather and scatter operations with applications

Coarse grained gather and scatter operations with applications Available online at www.sciencedirect.com J. Parallel Distrib. Comut. 64 (2004) 1297 1310 www.elsevier.com/locate/jdc Coarse grained gather and scatter oerations with alications Laurence Boxer a,b,, Russ

More information

Fast Distributed Process Creation with the XMOS XS1 Architecture

Fast Distributed Process Creation with the XMOS XS1 Architecture Communicating Process Architectures 20 P.H. Welch et al. (Eds.) IOS Press, 20 c 20 The authors and IOS Press. All rights reserved. Fast Distributed Process Creation with the XMOS XS Architecture James

More information

Distributed Estimation from Relative Measurements in Sensor Networks

Distributed Estimation from Relative Measurements in Sensor Networks Distributed Estimation from Relative Measurements in Sensor Networks #Prabir Barooah and João P. Hesanha Abstract We consider the roblem of estimating vectorvalued variables from noisy relative measurements.

More information

J. Parallel Distrib. Comput.

J. Parallel Distrib. Comput. J. Parallel Distrib. Comut. 71 (2011) 288 301 Contents lists available at ScienceDirect J. Parallel Distrib. Comut. journal homeage: www.elsevier.com/locate/jdc Quality of security adatation in arallel

More information

10 File System Mass Storage Structure Mass Storage Systems Mass Storage Structure Mass Storage Structure FILE SYSTEM 1

10 File System Mass Storage Structure Mass Storage Systems Mass Storage Structure Mass Storage Structure FILE SYSTEM 1 10 File System 1 We will examine this chater in three subtitles: Mass Storage Systems OERATING SYSTEMS FILE SYSTEM 1 File System Interface File System Imlementation 10.1.1 Mass Storage Structure 3 2 10.1

More information

Complexity analysis and performance evaluation of matrix product on multicore architectures

Complexity analysis and performance evaluation of matrix product on multicore architectures Comlexity analysis and erformance evaluation of matrix roduct on multicore architectures Mathias Jacquelin, Loris Marchal and Yves Robert École Normale Suérieure de Lyon, France {Mathias.Jacquelin Loris.Marchal

More information

Submission. Verifying Properties Using Sequential ATPG

Submission. Verifying Properties Using Sequential ATPG Verifying Proerties Using Sequential ATPG Jacob A. Abraham and Vivekananda M. Vedula Comuter Engineering Research Center The University of Texas at Austin Austin, TX 78712 jaa, vivek @cerc.utexas.edu Daniel

More information

Improved Gaussian Mixture Probability Hypothesis Density for Tracking Closely Spaced Targets. Intl. Journal of Electronics and Telecommunications.

An improved algorithm for Hausdorff Voronoi diagram for non-crossing sets. Frank Dehne, Anil Maheshwari and Ryan Taylor, May 26, 2006. Abstract: We present an improved algorithm for building a Hausdorff Voronoi diagram.

Argo Programming Guide. Evangelia Kasapaki, Rasmus Bo Sørensen, February 9, 2015. Copyright 2014 Technical University of Denmark. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International license.

Asynchronous Memory Machine Models with Barrier Synchronization. Koji Nakano, Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi Hiroshima, 739-8527 Japan. Email: nakano@cs.hiroshima-u.ac.jp

Sub-logarithmic Deterministic Selection on Arrays with a Reconfigurable Optical Bus. Yijie Han, Electronic Data Systems, Inc., 750 Tower Dr. CPS, Mail Stop 7121, Troy, MI 48098; Yi Pan, Department of Computer Science.

Modeling Reliable Broadcast of Data Buffer over Wireless Networks. Journal of Information Science and Engineering 3, 799-810 (2016). Short Paper. Department of Computer Engineering, Yeditepe University, Kayışdağı.

Simulating Ocean Currents; Simulating Galaxy Evolution. (a) Cross sections; (b) spatial discretization of a cross section. Model as two-dimensional grids; discretize in space and time; finer spatial and temporal resolution yields greater accuracy.

CS2 Algorithms and Data Structures Note 8: Heapsort and Quicksort. We will see two more sorting algorithms in this lecture. The first, heapsort, is very nice theoretically.
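The excerpt above introduces heapsort but cuts off before the algorithm itself. As a hedged illustration (a minimal sketch, not taken from the cited lecture notes), an in-place heapsort can be written as:

```python
def heapsort(a):
    """Sort list a in place via a max-heap; O(n log n) worst-case comparisons."""
    n = len(a)

    def sift_down(root, end):
        # Push a[root] down until the max-heap property holds on a[root..end].
        while 2 * root + 1 <= end:
            child = 2 * root + 1
            # Pick the larger of the two children.
            if child + 1 <= end and a[child] < a[child + 1]:
                child += 1
            if a[root] < a[child]:
                a[root], a[child] = a[child], a[root]
                root = child
            else:
                return

    # Phase 1: build a max-heap bottom-up.
    for start in range(n // 2 - 1, -1, -1):
        sift_down(start, n - 1)
    # Phase 2: repeatedly move the maximum to the end and restore the heap.
    for end in range(n - 1, 0, -1):
        a[0], a[end] = a[end], a[0]
        sift_down(0, end - 1)
    return a
```

For example, `heapsort([5, 3, 8, 1, 2])` returns `[1, 2, 3, 5, 8]`.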

GEOMETRIC CONSTRAINT SOLVING IN R² AND R³. CHRISTOPH M. HOFFMANN, Department of Computer Sciences, Purdue University, West Lafayette, Indiana 47907-1398, USA, and PAMELA J. VERMEER, Department of Computer Sciences.

A Model-Adaptable MOSFET Parameter Extraction System. Masaki Kondo, Hidetoshi Onodera, Keikichi Tamaru, Department of Electronics, Faculty of Engineering, Kyoto University, Kyoto 66-1, Japan.

A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH. Jin Lu, José M. F. Moura, and Urs Niesen, Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213. jinlu, moura@ece.cmu.edu

Equality-Based Translation Validator for LLVM. Michael Stepp, Ross Tate, and Sorin Lerner, University of California, San Diego. Abstract: We updated our Peggy tool, previously presented.

Communication-Efficient Deterministic Parallel Algorithms for Planar Point Location and 2d Voronoi Diagram. Mohamadou Diallo, Afonso Ferreira and Andrew Rau-Chaplin. LIMOS, IFMA, Campus des Cézeaux.

Stereo Disparity Estimation in Moment Space. Angeline Pang, Faculty of Information Technology, Multimedia University, 63 Cyberjaya, Malaysia, angeline.pang@mmu.edu.my; R. Mukundan, Department of Computer Science.

Theoretical Analysis of Graphcut Textures. Xuejie Qin, Yee-Hong Yang, {xu yang}@cs.ualberta.ca, Department of Computing Science, University of Alberta. Abstract: Since the paper was published in SIGGRAPH 2003, the graphcut

Overlapping Computations, Communications and I/O in Parallel Sorting, by Mark J. Clement and Michael J. Quinn, Brigham Young University and Oregon State University. Appeared in Journal of Parallel and Distributed Computing, July 1995. Abstract: In this paper we present a new parallel sorting algorithm which maximizes the overlap of computations, communications and I/O.

[9] J. J. Dongarra, R. Hempel, A. J. G. Hey, and D. W. Walker, "A Proposal for a User-Level Message Passing Interface in a Distributed-Memory Environment," Tech. Rep. TM-3, Oak Ridge National Laboratory.

Graph Cut Matching in Computer Vision. Toby Collins (s0455374@sms.ed.ac.uk), February 2004. Introduction: Many of the problems that arise in early vision can be naturally expressed in terms of energy minimization.

Interactive Image Segmentation. Fahim Mannan (260 266 294). Abstract: This report presents the project work done based on Boykov and Jolly's interactive graph cuts based N-D image segmentation algorithm ([1]).

Sensitivity of multi-product two-stage economic lotsizing models and their dependency on change-over and product cost ratios. Frank Van den broecke, El-Houssaine Aghezzaf.

The Priority R-Tree: A Practically Efficient and Worst-Case Optimal R-Tree. Lars Arge, Department of Computer Science, Duke University, Box 90129, Durham, NC 27708-0129, USA, large@cs.duke.edu; Mark de Berg.

Leak Detection Modeling and Simulation for Oil Pipeline with Artificial Intelligence Method. ITB J. Eng. Sci. Vol. 39 B, No. 1, 2007, 1-19. Pudjo Sukarno, Kuntjoro Adji Sidarto, Amoranto Trisnobudi.

Layered Switching for Networks on Chip. Zhonghai Lu (zhonghai@kth.se), Ming Liu (mingliu@kth.se), Axel Jantsch (axel@kth.se), Royal Institute of Technology, Sweden.

Accelerating Ant Colony Optimization for the Traveling Salesman Problem on the GPU. Research Article, The International Journal of Parallel, Emergent and Distributed Systems, Vol. 00, No. 00, Month 2011, 1-21.

Construction of Irregular QC-LDPC Codes in Near-Earth Communications. Journal of Communications Vol. 9, No. 7, July 2014. Hui Zhao, Xiaoxiao Bao, Liang Qin, Ruyan Wang, and Hong Zhang, Chongqing University.

A Parallel Algorithm for Constructing Obstacle-Avoiding Rectilinear Steiner Minimal Trees on Multi-Core Systems. Cheng-Yuan Chang and I-Lun Tseng, Department of Computer Science and Engineering, Yuan Ze University.

Folded Structures Satisfying Multiple Conditions. Journal of Information Processing (Oct. 2017), Regular Paper. Erik D. Demaine and Jason S. Ku.

Multicast in Wormhole-Switched Torus Networks using Edge-Disjoint Spanning Trees. Honge Wang, Myricom Inc., 325 N. Santa Anita Ave., Arcadia, CA; Douglas M. Blough.

Pivot Selection for Dimension Reduction Using Annealing by Increasing Resampling. Yasunobu Imamura, Naoya Higuchi, Tetsuji Kuboyama, Kouichi Hirata and Takeshi Shinohara, Kyushu Institute of Technology.

EE678 Application Presentation: Content Based Image Retrieval Using Wavelets. Group Members: Megha Pandey (megha@ee.iitb.ac.in, 02d07006), Gaurav Boob (gb@ee.iitb.ac.in, 02d07008).


Face Recognition Using Legendre Moments. Dr. S. Annadurai, Professor & Head of CSE & IT, and A. Saradha, Research scholar in CSE, Government College of Technology, Coimbatore, Tamilnadu.

A DEA-based Approach for Multi-objective Design of Attribute Acceptance Sampling Plans. Available online at http://ijdea.srbiau.ac.ir. Int. J. Data Envelopment Analysis (ISSN 2345-458X), Vol. 5, No. 2, Year 2017, Article ID IJDEA-00422, 12 pages. Research Article.

To appear in IEEE TKDE. Title: Efficient Skyline and Top-k Retrieval in Subspaces. Keywords: Skyline, Top-k, Subspace, B-tree. Contact Author: Yufei Tao (taoyf@cse.cuhk.edu.hk), Department of Computer Science.

Real Time Compression of Triangle Mesh Connectivity. Stefan Gumhold, Wolfgang Straßer, WSI/GRIS, University of Tübingen. Abstract: In this paper we introduce a new compressed representation for triangle mesh connectivity.

PRO: a Model for Parallel Resource-Optimal Computation. Assefaw Hadish Gebremedhin, Isabelle Guérin Lassous, Jens Gustedt, Jan Arne Telle. Abstract: We present a new parallel computation model that enables the design of resource-optimal parallel algorithms.