Array transposition in CUDA shared memory

Mike Giles

February 19, 2014

Abstract

This short note is inspired by some code written by Jeremy Appleyard for the transposition of data through shared memory. I had some difficulty getting my head around it, and decided it would be helpful to have a few figures to explain it. I have also extended it slightly to cover more general cases.
[Figure 1: Illustration of the array to be written into, and read from.]

1 Objective

As illustrated in the figure above, we want to work with a shared memory array which is mathematically of width I and height 32 (equal to the warp size). We want to effectively transpose some data by writing into it with the threads in the thread block filling it row-wise, working across the first row, then the second, and so on in ascending order of i + jI, and then (after a synchronisation) reading the data out of it column-wise, working up the first column, then the second, and so on in ascending order of j + 32i. Or, vice versa, we might want to fill it by columns and then read it out by rows.

The main application for this is loading (or storing) data which is stored as an Array-of-Structs, each struct of size I. To achieve a coalesced read from device memory using a single warp, the threads can load in contiguous vectors from device memory and fill the shared memory array by rows. Then they can read it back into registers by columns, so that each thread gets its required struct data. The process would be reversed for writing back to device memory. In both cases, the array index in device memory would correspond to i + jI.

The second application arises in the ADI solvers which Jeremy and I are working on. Here, for part of the calculation, in order to maximise coalesced memory transfers it is efficient for threads to work on the array row-wise initially, but there is then a middle section in which it is necessary to work on columns, with a separate warp for each column, before finally reverting to the original thread mapping for the final stage.

The challenge is to come up with a mapping (i, j) → k to an index k in the linear shared memory array so that there are no memory bank conflicts when accessing the data in either direction.
[Figure 2: Shared memory indices when I = 5.]

2 I odd

When I is odd we can define k = i + jI. This naturally gives no bank conflicts when reading row-wise, since each warp gets 32 contiguous addresses and current NVIDIA GPUs have 32 shared memory banks. Furthermore, there are no bank conflicts in each column, because j = 32 is the smallest strictly positive integer such that jI mod 32 = 0, which would lead to a bank conflict with the element at j = 0.
[Figure 3: Shared memory indices when I = 4, padded by 1 after every 32 elements.]

3 I a power of 2

When I is a power of 2, the definition k = i + jI would lead to bank conflicts along each column. The first bank conflict (for I = 4, as in the figure) is when i = 0, j = 8, k = 32. This suggests the idea of padding by 1 after every 32 elements, giving the mapping

k = i + jI + (i + jI)/32,

where the division is interpreted in the integer sense (i.e. discarding the remainder).
[Figure 4: Shared memory indices when I = 6, with padding by 1.]

4 General I

Having handled the two extreme cases, we now consider the general case in which I = P F, where P is a power of 2 and F is odd. In this case k = i + jI leads to the first bank conflict when i = 0 and

jI mod 32 = 0.

If P ≤ 32 then this implies

jF mod (32/P) = 0,

which happens first when j = 32/P, and hence jI = 32F. Thus, the padded definition to avoid conflicts is

k = i + jI + (i + jI)/(32F).

(Note: when I = F, then (i + jI)/(32F) = 0, which correctly gives us back the unpadded version.)

Alternatively, if P > 32 then the first bank conflict occurs when j = 1, and an appropriate padding is

k = i + jI + j.
5 Implementation notes

The simplest approach is to define the (i, j) pairs for loading and for storing, and then compute the k for each. At worst, the padding increases the shared memory requirements by approximately 3%. The computation of k requires 2 additional integer operations, a bit-shift and an addition.

An alternative, when I is even, is to use the mapping k = i + j(I+1). This avoids the 2 additional integer operations, but at the expense of using more shared memory.