Asynchronous Memory Machine Models with Barrier Synchronization


Koji Nakano
Department of Information Engineering, Hiroshima University
Kagamiyama 1-4-1, Higashi Hiroshima, Japan

Abstract

The Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM) are theoretical parallel computing models that capture the essence of the shared memory and the global memory of GPUs. It was assumed that warps (i.e., groups of threads) on the DMM and the UMM work synchronously in a round-robin manner. However, warps work asynchronously in actual GPUs, in the sense that warps may be randomly (or arbitrarily) dispatched for execution. The first contribution of this paper is to introduce an asynchronous version of the DMM and the UMM, in which warps are arbitrarily dispatched. In place of round-robin dispatch, we assume that threads can execute the syncthreads instruction for barrier synchronization. Since the barrier synchronization operation is costly, we should evaluate and minimize the number of barrier synchronization operations performed by parallel algorithms. The second contribution of this paper is to show a parallel algorithm that computes the sum of $n$ numbers in optimal computing time with few barrier synchronization steps. Our parallel algorithm computes the sum of $n$ numbers in $O(n/w + l \log n)$ time units and $O(\frac{\log l}{\log w} + \log\log w)$ barrier synchronization steps using $wl$ threads, both on the asynchronous DMM and on the asynchronous UMM with width $w$ and latency $l$. We also prove that the computing time is optimal because it matches the theoretical lower bound. Quite surprisingly, the number of barrier synchronization steps and the number of threads are independent of $n$. Even if the input size $n$ is quite large, our parallel algorithm computes the sum in optimal time units with a fixed number of syncthreads using a fixed number of threads.

Keywords: parallel computing models, parallel algorithms, contiguous memory access, asynchronous models, GPU, CUDA

I. INTRODUCTION

The research of parallel algorithms has a long history of more than 40 years. Sequential algorithms have been developed mostly on the Random Access Machine (RAM) [1].
In contrast, since there are a variety of connection methods and patterns between processors and memories, many parallel computing models have been presented and many parallel algorithmic techniques have been shown on them. The most well-studied parallel computing model is the Parallel Random Access Machine (PRAM) [2], [3], [4], which consists of processors and a shared memory. Each processor on the PRAM can access any address of the shared memory in a time unit. The PRAM is a good parallel computing model in the sense that the parallelism of each problem can be revealed by the performance of parallel algorithms on the PRAM. However, since the PRAM requires a shared memory that can be accessed by all processors at the same time, it is not feasible.

The GPU (Graphics Processing Unit) is a specialized circuit designed to accelerate computation for building and manipulating images [5], [6], [7], [8], [9]. The latest GPUs are designed for general-purpose computing and can perform computation in applications traditionally handled by the CPU. Hence, GPUs have recently attracted the attention of many application developers [5], [10], [11]. NVIDIA provides a parallel computing architecture called CUDA (Compute Unified Device Architecture) [12], the computing engine for NVIDIA GPUs. CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in NVIDIA GPUs. In many cases, GPUs are more efficient than multicore processors [13], since they have hundreds of processor cores and very high memory bandwidth.

CUDA uses two types of memories in NVIDIA GPUs: the shared memory and the global memory [12]. The shared memory is an extremely fast on-chip memory with a small capacity, measured in kilobytes. The global memory is implemented as an off-chip DRAM and has a large capacity, measured in gigabytes, but its access latency is very long. The efficient usage of the shared memory and the global memory is a key for CUDA developers to accelerate applications using GPUs. In particular, we need to consider the bank conflict of the shared memory access and the coalescing of the global memory access [7], [13], [14].
The address space of the shared memory is mapped into several physical memory banks. If two or more threads access the same memory bank at the same time, the access requests are processed sequentially. Hence, to maximize the memory access performance, CUDA threads should access distinct memory banks to avoid bank conflicts. To maximize the bandwidth between the GPU and the DRAM chips, consecutive addresses of the global memory must be accessed at the same time. Thus, CUDA threads should perform coalesced access when they access the global memory.

In our previous paper [15], we introduced two models, the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM), which reflect the essential features of the shared memory and the global memory of NVIDIA GPUs. The outline of the architectures of the DMM and the UMM is illustrated in Figure 1. In both architectures, a sea of threads (Ts) is connected to the memory banks (MBs) through the memory management unit (MMU). Each thread is a Random Access Machine (RAM) [1], which can execute one of the fundamental operations in a time unit. We do not discuss the architecture of the sea of threads in this paper, but we can imagine that it consists of a set of multi-core processors which can execute many threads in parallel and/or in a time-sharing manner. Threads are executed in SIMD [16] fashion: the processors run the same program and work on different data.

[Figure 1. The architectures of the DMM and the UMM. In each panel, a sea of threads is connected through the MMU to the memory banks (MBs) by address lines and data lines.]

The MBs constitute a single address space of the memory. This address space is mapped to the MBs in an interleaved way such that the word of data at address $i$ is stored in the $(i \bmod w)$-th bank, where $w$ is the number of MBs. The main difference between the two architectures is the connection of the address lines between the MMU and the MBs, which can transfer an address value. In the DMM, the address lines connect the MBs and the MMU separately, while a single address line from the MMU is connected to all the MBs in the UMM. Hence, in the UMM, the same address value is broadcast to every MB, and the same address of the MBs can be accessed in each time unit. On the other hand, different addresses of the MBs can be accessed in the DMM. Since the memory access of the UMM is more restricted than that of the DMM, the UMM is less powerful than the DMM.

The performance of algorithms on the PRAM is usually evaluated using two parameters: the size $n$ of the input and the number $p$ of processors. For example, it is well known that the sum of $n$ numbers can be computed in $O(n/p + \log n)$ time on the PRAM [2].
We will use four parameters, the size $n$ of the input, the number $p$ of threads, the width $w$, and the latency $l$ of the memory, when we evaluate the performance of algorithms on the DMM and on the UMM. The width $w$ is the number of memory banks and the latency $l$ is the number of time units needed to complete a memory access. Hence, the performance of algorithms on the DMM and the UMM is evaluated as a function of $n$ (the size of a problem), $p$ (the number of threads), $w$ (the width of a memory), and $l$ (the latency of a memory). Further, $r$ (the number of local registers used by each thread) may be additionally used. Note that the width $w$ and the latency $l$ depend on the architecture. They are fixed values and cannot be changed. On the other hand, the number $p$ of threads can be changed. Users can choose the optimal value of $p$ to get the best performance. Thus, the computing time of algorithms on the DMM and the UMM can be evaluated without using $p$. For example, in our previous paper [17], we have shown that the prefix-sums of $n$ numbers can be computed in $O(n/w + nl/p + l \log n)$ time units on the DMM and the UMM. To get the best performance, we should choose $p = wl$. In this case, the prefix-sums can be computed in $O(n/w + l \log n)$ time units.

Suppose that we use $p$ threads $T(0), T(1), \ldots, T(p-1)$. Threads on the DMM and the UMM are partitioned into groups of $w$ threads called warps. Let $W(0), W(1), \ldots, W(p/w - 1)$ denote the groups. In our previous paper, it is assumed that threads on the DMM and the UMM work synchronously, in the sense that warps are activated for memory access from $W(0)$ to $W(p/w - 1)$ in turn in a round-robin manner.

The first contribution of this paper is to extend the memory machine models presented in our previous paper [15] to more realistic parallel computing models. More specifically, we assume that threads work asynchronously in the sense that warps are dispatched for memory access arbitrarily. The scheduler arbitrarily selects one of the warps in which at least one thread tries to access the memory, and dispatches it for memory access. In place of round-robin dispatch, we assume that threads can execute a syncthreads instruction for the purpose of barrier synchronization.
In NVIDIA GPUs, the syncthreads() instruction is supported for threads in a block, and it takes 16 clock cycles [12]. Also, to synchronize threads in multiple blocks, we need to separate an algorithm into different kernel calls [12]. Hence, barrier synchronization is costly. In this paper, when we evaluate the performance of a parallel algorithm on the asynchronous DMM and the asynchronous UMM, we also evaluate the number of syncthreads operations performed. Note that parallel algorithms on the asynchronous versions of the DMM and the UMM must work correctly for any worst-case choice of warps by a malicious scheduler. Also, the performance, including the computing time, must be evaluated for the worst-case choice of warps.

The second contribution of this paper is to show an efficient summing algorithm on the asynchronous versions of the DMM and the UMM with width $w$ and latency $l$. We first show that a simple algorithm shown in [17] can compute the sum of $n$ numbers in $O(n/w + l \log n)$ time units and $O(\log n)$ barrier synchronization steps (Algorithm Simple). We then go on to prove that $\Omega(n/w + l \log n)$ time units are necessary to compute the sum of $n$ numbers. Thus, Algorithm Simple is time optimal. We also show that the sum of $n$ numbers can be computed in $O(nl/w + l \log w)$ time units and 0 barrier synchronization steps (Algorithm One-Warp). Although this algorithm does not perform barrier synchronization, it is not time optimal and has a large overhead of factor $l$. Next, we will show that a parallel algorithm based on a $2w$-ary tree can compute the sum of $n$ numbers in $O(\frac{n \log w}{w} + l \log n)$ time units and $O(\frac{\log n}{\log w})$ barrier synchronization steps (Algorithm Tree). By combining Algorithm Simple and Algorithm Tree, we show that the sum of $n$ numbers can be computed in $O(n/w + l \log n)$ time units and $O(\frac{\log n}{\log w} + \log\log w)$ barrier synchronization steps (Algorithm Simple-Tree). Clearly, Algorithm Simple-Tree is time optimal. Finally, we will show that the number of barrier synchronization steps can be reduced to $O(\frac{\log l}{\log w} + \log\log w)$ (Algorithm Hybrid). Quite surprisingly, the number of barrier synchronization steps and the number of threads of Algorithm Hybrid are independent of $n$. Even if the input size $n$ is quite large, our parallel algorithm computes the sum in optimal time units with a fixed number of syncthreads using a fixed number of threads. Table I summarizes the summing algorithms presented in this paper.

This paper is organized as follows. Section II defines the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM) introduced in our previous paper [15] and defines the asynchronous versions of the DMM and the UMM. In Section III, we evaluate the computing time of the contiguous memory access to the memory of the asynchronous DMM and the asynchronous UMM. The contiguous memory access is a key ingredient of parallel algorithm development on the memory machine models.
Using the contiguous access, we show in Section IV that Algorithm Simple can compute the sum of $n$ numbers in $O(n/w + l \log n)$ time units and $O(\log n)$ barrier synchronization steps. We also discuss the lower bound of the time complexity and show two lower bounds, the $\Omega(n/w)$-time bandwidth limitation and the $\Omega(l \log n)$-time reduction limitation. Section V shows Algorithm One-Warp, which computes the sum of $n$ numbers in $O(nl/w + l \log w)$ time units and 0 synchronization steps. Section VI shows a tree-based summing algorithm, Algorithm Tree, which computes the sum of $n$ numbers in $O(\frac{n \log w}{w} + l \log n)$ time units and $O(\frac{\log n}{\log w})$ barrier synchronization steps. Finally, Section VII shows time-optimal summing algorithms. Algorithm Simple-Tree, which is a combination of Algorithm Simple and Algorithm Tree, uses $O(\frac{\log n}{\log w} + \log\log w)$ barrier synchronization steps. By an appropriate precomputation, we show that the number of barrier synchronization steps can be reduced to $O(\frac{\log l}{\log w} + \log\log w)$. Section VIII offers concluding remarks.

II. PARALLEL MEMORY MACHINES: DMM AND UMM

The main purpose of this section is to define the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM) introduced in our previous papers [15], [17].

We first define the Discrete Memory Machine (DMM) of width $w$ and latency $l$. Let $m[i]$ $(i \ge 0)$ denote the memory cell of address $i$ in the memory. Let $B[j] = \{m[j], m[j+w], m[j+2w], m[j+3w], \ldots\}$ $(0 \le j \le w-1)$ denote the $j$-th bank of the memory. Clearly, a memory cell $m[i]$ is in the $(i \bmod w)$-th memory bank. We assume that memory cells in different banks can be accessed in a time unit, but no two memory cells in the same bank can be accessed in a time unit. Also, we assume that $l$ time units are necessary to complete an access request and that continuous requests are processed in a pipeline fashion through the MMU. Thus, it takes $k + l - 1$ time units to complete $k$ access requests to a particular bank.

We assume that $p$ threads are partitioned into $p/w$ groups of $w$ threads called warps. More specifically, $p$ threads are partitioned into $p/w$ warps $W(0), W(1), \ldots, W(p/w - 1)$ such that $W(i) = \{T(i \cdot w), T(i \cdot w + 1), \ldots, T((i+1) \cdot w - 1)\}$ $(0 \le i \le p/w - 1)$.
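The bank mapping and the warp partition defined above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`bank_of`, `bank_cells`, `warps`, `pipelined_time`) are hypothetical and not part of the paper, and $p$ is assumed to be a multiple of $w$.

```python
def bank_of(address: int, w: int) -> int:
    """Bank index of memory cell m[address] under the interleaved mapping."""
    return address % w


def bank_cells(j: int, w: int, count: int) -> list[int]:
    """First `count` addresses stored in bank B[j]: j, j+w, j+2w, ..."""
    return [j + t * w for t in range(count)]


def warps(p: int, w: int) -> list[list[int]]:
    """Partition thread indices 0..p-1 into p/w warps of w threads each."""
    assert p % w == 0, "assumption: p is a multiple of w"
    return [list(range(i * w, (i + 1) * w)) for i in range(p // w)]


def pipelined_time(k: int, l: int) -> int:
    """Time units to complete k pipelined requests to one bank (k + l - 1)."""
    return k + l - 1
```

For instance, with $w = 4$, `bank_of(6, 4)` and `bank_of(10, 4)` both give bank 2, which is the situation that causes a bank conflict on the DMM.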
Warps are dispatched for memory access in turn, and threads in a warp try to access the memory at the same time. We define two assumptions, the synchronous manner and the asynchronous manner, in terms of the dispatching of warps. In the synchronous manner, $W(0), W(1), \ldots, W(p/w - 1)$ are dispatched in a round-robin manner if at least one thread in a warp requests memory access. More specifically, suppose that every thread executes $T$ instructions. In the synchronous manner, warps work equally as follows:

[Synchronous Model]
for $t \leftarrow 0$ to $T - 1$ do
  for $i \leftarrow 0$ to $p/w - 1$ do
    Every thread in $W(i)$ executes an instruction.

On the other hand, in asynchronous operation, one of the warps is dispatched and executed as follows:

[Asynchronous Model]
for $t \leftarrow 0$ to $T - 1$ do
  Arbitrarily select a warp $W(i)$ to be executed.
  Each thread in $W(i)$ executes an instruction.

Note that, in the asynchronous manner, if all threads in a warp $W(i)$ have no instruction to be executed, such a warp $W(i)$ is not selected. For example, if threads in $W(i)$ have just sent memory access requests and are waiting for the completion of the memory access, $W(i)$ is not selected. Such a warp $W(i)$ will be selected after the completion of the memory access.

We also assume that, for the purpose of barrier synchronization, all threads can execute the syncthreads instruction. Suppose that at least one of the threads executes syncthreads. After that, all threads that have executed syncthreads are blocked until all threads execute syncthreads. Once all threads have executed syncthreads, they restart executing instructions.

Table I
PERFORMANCE OF PARALLEL ALGORITHMS FOR COMPUTING THE SUM

Algorithm   | Time units                          | Threads | syncthreads                                 | Time optimality
Simple      | $O(n/w + l \log n)$                 | $n/2$   | $O(\log n)$                                 | optimal
One-Warp    | $O(nl/w + l \log w)$                | $w$     | $0$                                         | overhead of factor $l$
Tree        | $O(\frac{n \log w}{w} + l \log n)$  | $n/2$   | $O(\frac{\log n}{\log w})$                  | overhead of factor $\log w$
Simple-Tree | $O(n/w + l \log n)$                 | $n/2$   | $O(\frac{\log n}{\log w} + \log\log w)$     | optimal
Hybrid      | $O(n/w + l \log n)$                 | $wl$    | $O(\frac{\log l}{\log w} + \log\log w)$     | optimal

[Figure 2. Banks and address groups for $w = 4$: memory banks $B[0], \ldots, B[3]$ of the DMM and address groups $A[0], \ldots, A[3]$ of the UMM.]

We assume that a thread cannot send a new memory access request until the previous memory access request is completed. Hence, if a thread sends a memory access request, it must wait $l$ time units to send a new memory access request.

For the reader's benefit, let us evaluate the time for memory access using Figure 3 on the DMM for $p = 8$, $w = 4$, and $l = 3$. In the figure, $p = 8$ threads are partitioned into $p/w = 2$ warps $W(0) = \{T(0), T(1), T(2), T(3)\}$ and $W(1) = \{T(4), T(5), T(6), T(7)\}$. As illustrated in the figure, the 4 threads in $W(0)$ try to access $m[0]$, $m[1]$, $m[6]$, and $m[10]$, and those in $W(1)$ try to access $m[8]$, $m[9]$, $m[14]$, and $m[15]$. The time for the memory access is evaluated under the assumption that memory accesses are processed by $l$ imaginary pipeline stages with $w$ registers each, as illustrated in the figure. Each pipeline register in the first stage receives memory access requests from threads in an activated warp. The $i$-th $(0 \le i \le w-1)$ pipeline register receives the requests to the $i$-th memory bank. In each time unit, a memory request in a pipeline register is moved to the next one. We assume that the memory access completes when the request reaches the last pipeline register.

Note that the architecture of pipeline registers illustrated in Figure 3 is imaginary, and it is used only for evaluating the computing time. The actual architecture should involve a multistage interconnection network [18], [19] or a sorting network [20], [21] to route memory access requests.
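The pipeline model above can be turned into a small timing calculator. The following Python sketch is an illustration under stated assumptions rather than a definition from the paper: the function names are hypothetical, warps are assumed to be dispatched back to back in the given order, every warp issues at least one request, and the addresses within a warp are distinct. On the DMM a warp occupies the first stage for as many time units as its most heavily requested bank; on the UMM it occupies the first stage once per distinct address group $\lfloor a / w \rfloor$.

```python
from collections import Counter


def dmm_time(warp_requests, w, l):
    """Time units on the DMM for the given per-warp address lists."""
    send = 0
    for addrs in warp_requests:
        # Requests to the same bank are serialized, so the warp needs as many
        # time units as the number of requests to its busiest bank.
        send += max(Counter(a % w for a in addrs).values())
    # The last request needs l - 1 more time units to leave the pipeline.
    return send + l - 1


def umm_time(warp_requests, w, l):
    """Time units on the UMM for the given per-warp address lists."""
    send = 0
    for addrs in warp_requests:
        # One time unit per distinct address group touched by the warp.
        send += len({a // w for a in addrs})
    return send + l - 1


# The example with p = 8, w = 4, l = 3 from the text.
requests = [[0, 1, 6, 10], [8, 9, 14, 15]]
print(dmm_time(requests, 4, 3), umm_time(requests, 4, 3))  # prints: 5 7
```

The printed values agree with the 5 time units (DMM) and 7 time units (UMM) obtained for this example in the text.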
Let us evaluate the time for memory access on the DMM. First, the access requests for $m[0]$, $m[1]$, $m[6]$ are sent to the first stage. Since $m[6]$ and $m[10]$ are in the same bank $B[2]$, their memory requests cannot be sent to the first stage at the same time. Next, the request for $m[10]$ is sent to the first stage. After that, the memory access requests for $m[8]$, $m[9]$, $m[14]$, $m[15]$ are sent at the same time, because they are in different memory banks. Finally, after $l - 1 = 2$ time units, these memory requests are processed. Hence, the DMM takes 5 time units to complete the memory access.

We next define the Unified Memory Machine (UMM) of width $w$ as follows. Let $A[j] = \{m[j \cdot w], m[j \cdot w + 1], \ldots, m[(j+1) \cdot w - 1]\}$ denote the $j$-th address group. We assume that memory cells in the same address group are processed at the same time. However, if they are in different groups, one time unit is necessary for each of the groups. Also, similarly to the DMM, $p$ threads are partitioned into warps and each warp accesses the memory in turn.

Again, let us evaluate the time for memory access using Figure 3 on the UMM for $p = 8$, $w = 4$, and $l = 3$. The memory access requests by $W(0)$ are in three address groups. Thus, three time units are necessary to send them to the first stage. Next, two time units are necessary to send the memory access requests by $W(1)$, because they are in two address groups. After that, it takes $l - 1 = 2$ time units to process the memory access requests. Hence, in total $3 + 2 + 2 = 7$ time units are necessary to complete all memory accesses.

[Figure 3. An example of memory access on the DMM and the UMM by threads $T(0), T(1), \ldots, T(7)$.]

III. CONTIGUOUS MEMORY ACCESS

The main purpose of this section is to show the contiguous memory access on the asynchronous DMM and the asynchronous UMM. The evaluation of the computing time for the contiguous access on the synchronous DMM and the synchronous UMM is not difficult [15], [17]. However, that for the asynchronous versions is more complicated. This section shows that the computing time on the asynchronous DMM and the asynchronous UMM is the same as that on the synchronous versions.

Suppose that an array $a$ of size $n$ $(\ge p)$ is given. We use $p$ threads to access all of the memory cells in $a$ such that each thread accesses $n/p$ memory cells. Note that "accessing" can be "reading from" or "writing in". Let $a[i]$ $(0 \le i \le n-1)$ denote the $i$-th memory cell in $a$. We can consider $a$ to be a 2-dimensional array of size $\frac{n}{p} \times p$ ($n/p$ rows and $p$ columns). Each $a[i][j]$ $(0 \le i \le n/p - 1,\ 0 \le j \le p - 1)$ corresponds to $a[i \cdot p + j]$. The contiguous memory access can be performed as follows:

[Contiguous memory access]
for $i \leftarrow 0$ to $p - 1$ do in parallel
  for $t \leftarrow 0$ to $n/p - 1$ do
    $T(i)$ accesses $a[t][i]$.

Figure 4 illustrates the contiguous memory access for $n = 20$ and $p = 4$.

[Figure 4. Contiguous memory access for $n = 20$ and $p = 4$ by threads $T(0), T(1), T(2), T(3)$.]

Let us evaluate the computing time. Each warp $W(j)$ $(0 \le j \le p/w - 1)$ with $w$ threads accesses the memory cells $a[t][j \cdot w], a[t][j \cdot w + 1], \ldots, a[t][(j+1) \cdot w - 1]$ for each $t$ $(0 \le t \le n/p - 1)$. In other words, each warp $W(j)$ repeatedly accesses $w$ memory cells in the same address group $n/p$ times. We will evaluate the computing time for the following two cases:

Case 1: $p/w < l$

First, one of the $p/w$ warps is randomly dispatched and sends memory access requests. After a warp sends requests, it will not be selected for at least $l$ time units. Thus, all of the $p/w$ warps are dispatched in the first $p/w$ time units.
Each warp takes $l$ time units to complete the memory access. Thus, the second memory access is started at time $l$. Figure 5 illustrates how the contiguous memory access is performed when $p/w < l$. Contiguous memory access requests by the $p/w$ warps are repeatedly sent $n/p$ times, and each round takes $l$ time units. Thus, it takes $O(\frac{n}{p} \cdot l) = O(\frac{nl}{p})$ time units for the contiguous memory access.

Case 2: $p/w \ge l$

Each of the $p/w$ warps sends memory access requests $n/p$ times. Hence, in total they send memory access requests $\frac{p}{w} \cdot \frac{n}{p} = O(\frac{n}{w})$ times. Clearly, if at least $l$ warps have not completed their memory accesses, they can send memory access requests continuously. On the other hand, if no warp sends a memory access request in a time unit, then fewer than $l$ warps still have memory access requests to be sent. Hence, each of these fewer than $l$ warps can send memory access requests at least once in $l$ time units. Since each warp sends memory access requests $n/p$ times, it takes $\frac{n}{p} \cdot l = O(\frac{nl}{p})$ time units for these fewer than $l$ warps to complete the memory access requests. Therefore, the contiguous memory access can be completed in $O(\frac{n}{w} + \frac{nl}{p})$ time units.

Thus, we have,

Lemma 1: The contiguous access to an array of size $n$ can be done in $O(\frac{n}{w} + \frac{nl}{p})$ time units with 0 barrier synchronization steps using $p$ threads on the UMM and the DMM with width $w$ and latency $l$.

IV. A SIMPLE SUMMING ALGORITHM AND THE TIME LOWER BOUND

The main purpose of this section is to show a simple parallel algorithm for computing the sum on the memory machine models. The summing algorithm presented in this section is essentially the same as the one presented in [17] on the synchronous DMM and the synchronous UMM.

Let $a$ be an array of $n = 2^m$ numbers. Let us show an algorithm to compute the sum $a[0] + a[1] + \cdots + a[n-1]$. The algorithm uses a well-known parallel computing technique which repeatedly computes the sums of pairs. We implement this technique to perform contiguous memory access using $n/2$ threads. The details are spelled out as follows:

[Algorithm Simple]
for $t \leftarrow m - 1$ down to $0$ do
begin
  for $i \leftarrow 0$ to $2^t - 1$ do in parallel
    $T(i)$ performs $a[i] \leftarrow a[i] + a[i + 2^t]$
  if ($2^t > w$) syncthreads
end
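Algorithm Simple can be checked with a short sequential simulation. The Python sketch below is an illustration, not the paper's implementation: the function name `algorithm_simple` is hypothetical, the parallel loop over $T(i)$ is emulated by an ordinary loop (which is safe because, within one round, the cells read at $a[i + 2^t]$ and the cells written at $a[i]$ are disjoint), and the barrier is modeled only as a counter.

```python
def algorithm_simple(a, w):
    """Simulate Algorithm Simple; return (sum, number of syncthreads).

    Assumes len(a) is a power of two (n = 2^m), as in the text.
    """
    a = list(a)
    n = len(a)
    assert n >= 1 and n & (n - 1) == 0, "assumption: n is a power of two"
    barriers = 0
    half = n // 2  # plays the role of 2^t in the pseudocode
    while half >= 1:
        # Emulates "for i <- 0 to 2^t - 1 do in parallel".
        for i in range(half):
            a[i] += a[i + half]
        # More than w active threads means more than one warp, so a barrier
        # is needed before the next round.
        if half > w:
            barriers += 1
        half //= 2
    return a[0], barriers
```

For example, `algorithm_simple(range(16), 4)` returns the sum 120 with a single barrier, since only the first round uses more than $w = 4$ threads.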

[Figure 5. Contiguous memory access on a time axis. Case 1 ($p/w < l$): the $p/w$ warps send requests and each access completes after $l$ time units. Case 2 ($p/w \ge l$): an $O(n/w)$ phase in which at least $l$ warps have not completed sending memory access requests, followed by an $O(nl/p)$ phase in which fewer than $l$ warps have not completed sending memory access requests.]

[Figure 6. Illustrating the summing algorithm for $n$ numbers.]

Figure 6 illustrates how the sums of pairs are computed. From the figure, it should be clear that this algorithm computes the sum correctly.

Let us evaluate the computing time. For each $t$ $(0 \le t \le m-1)$, $2^t$ operations $a[i] \leftarrow a[i] + a[i + 2^t]$ are performed. These operations involve the following memory access operations:

reading from $a[0], a[1], \ldots, a[2^t - 1]$,
reading from $a[2^t], a[2^t + 1], \ldots, a[2 \cdot 2^t - 1]$, and
writing in $a[0], a[1], \ldots, a[2^t - 1]$.

Since these memory access operations are contiguous, from Lemma 1 they can be done in $O(\frac{2^t}{w} + \frac{2^t l}{2^t}) = O(\frac{2^t}{w} + l)$ time using $2^t$ threads both on the DMM and on the UMM with width $w$ and latency $l$. Thus, the total computing time is

$\sum_{t=0}^{m-1} O(\frac{2^t}{w} + l) = O(\frac{2^m}{w} + lm) = O(\frac{n}{w} + l \log n).$

The barrier synchronization syncthreads is executed $m - \log w = O(\log n)$ times. Thus, we have,

Lemma 2: Algorithm Simple computes the sum of $n$ numbers in $O(n/w + l \log n)$ time units and $O(\log n)$ barrier synchronization steps using $n/2$ threads on the DMM and on the UMM with width $w$ and latency $l$.

Note that when at most $w$ threads are active ($2^t \le w$), only one warp is used and thus syncthreads is not necessary.

Let us discuss the lower bound of the time necessary to compute the sum on the DMM and the UMM to show that our parallel summing algorithm for Lemma 2 is optimal. We will show two lower bounds, the $\Omega(n/w)$-time bandwidth limitation and the $\Omega(l \log n)$-time reduction limitation.

Since the width of the memory is $w$, at most $w$ numbers in the memory can be read in a time unit. Clearly, all of the $n$ numbers must be read to compute the sum. Hence, $\Omega(n/w)$ time units are necessary to compute the sum. We call the $\Omega(n/w)$-time lower bound the bandwidth limitation.

Each thread can perform a binary operation such as addition in a time unit. If at least one of the two operands of a binary operation is stored in the shared memory, it takes at least $l$ time units to obtain the resulting value. Clearly, the addition operation must be performed $n - 1$ times to compute the sum of $n$ numbers. The computation of the sum using addition is represented by a binary tree with $n$ leaves and $n - 1$ internal nodes. The root of the binary tree corresponds to the sum. From basic graph theory, there exists a path from the root to a leaf which has at least $\log n$ internal nodes. The addition corresponding to each internal node takes $l$ time units. Thus, it takes at least $\Omega(l \log n)$ time to compute the sum, regardless of the number of threads. We call the $\Omega(l \log n)$-time lower bound the reduction limitation.

From the discussion above, we have,

Theorem 3: Both the DMM and the UMM with width $w$ and latency $l$ take at least $\Omega(n/w + l \log n)$ time units to compute the sum of $n$ numbers.

From Theorem 3, Algorithm Simple for Lemma 2 is optimal.

V. A SUMMING ALGORITHM USING ZERO BARRIER SYNCHRONIZATION STEPS

This section shows a summing algorithm using zero barrier synchronization steps. Clearly, if we use a single warp of $w$ threads, then no barrier synchronization is necessary. Let us consider that the input is given in a 2-dimensional array $a$ of size $\frac{n}{w} \times w$ ($n/w$ rows and $w$ columns). First, the sum of each column is computed using a thread. After that, the sum of the column-wise sums is computed using Algorithm Simple (Lemma 2). The details of the algorithm are spelled out as follows:

[Algorithm One-Warp]
for $i \leftarrow 0$ to $w - 1$ do in parallel
  for $t \leftarrow 1$ to $n/w - 1$ do
    $T(i)$ performs $a[0][i] \leftarrow a[0][i] + a[t][i]$
Compute $a[0][0] + a[0][1] + \cdots + a[0][w-1]$ using Algorithm Simple.

The computation of the column-wise sums performs contiguous access. Thus, from Lemma 1 with $p = w$, it takes $O(\frac{n}{w} + \frac{nl}{w}) = O(\frac{nl}{w})$ time. After that, Algorithm Simple computes the sum of $w$ numbers in $O(l \log w)$ time.
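The two phases of Algorithm One-Warp can be sketched as follows. This Python sketch is illustrative only: the function name `algorithm_one_warp` is hypothetical, the input is a flat list viewed as $n/w$ rows by $w$ columns (so $a[t][i]$ is `a[t * w + i]`), and $w$ is assumed to be a power of two that divides $n$ so that the final pairwise reduction applies directly.

```python
def algorithm_one_warp(a, w):
    """Simulate Algorithm One-Warp on a flat list of n numbers."""
    n = len(a)
    assert w >= 1 and w & (w - 1) == 0, "assumption: w is a power of two"
    assert n >= w and n % w == 0, "assumption: w divides n"
    # Phase 1: thread T(i) accumulates column i into row 0. Each step over t
    # is one contiguous access by the single warp.
    row0 = list(a[:w])
    for t in range(1, n // w):
        for i in range(w):
            row0[i] += a[t * w + i]
    # Phase 2: pairwise reduction of the w column sums (Algorithm Simple).
    # At most w/2 threads are active, so no barrier is ever needed.
    half = w // 2
    while half >= 1:
        for i in range(half):
            row0[i] += row0[i + half]
        half //= 2
    return row0[0]
```

For example, `algorithm_one_warp(list(range(20)), 4)` treats the input as 5 rows of 4 columns and returns 190.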
Thus, we have,

Lemma 4: Algorithm One-Warp computes the sum of $n$ numbers in $O(nl/w + l \log w)$ time units and 0 barrier synchronization steps using $w$ threads on the DMM and on the UMM with width $w$ and latency $l$.

Clearly, the computing time has an overhead of factor $l$, and hence Algorithm One-Warp is not time optimal.

VI. A SUMMING ALGORITHM BASED ON A $2w$-ARY TREE

We need to use more than $w$ threads to obtain a time-optimal summing algorithm. However, if we use more than $w$ threads, barrier synchronization is necessary. This section shows a summing algorithm using more than $w$ threads. The goal of the summing algorithm shown in this section is to minimize the number of barrier synchronization steps.

For simplicity, we assume that $n = (2w)^k$ for some integer $k$. We can build a $2w$-ary tree with $n$ leaves, each of which corresponds to an input number. The leaves are partitioned into $\frac{n}{2w}$ groups and each group is connected to a first-level internal node. Thus, we have $\frac{n}{2w}$ first-level internal nodes. The first-level internal nodes are partitioned into $\frac{n}{(2w)^2}$ groups and each group is connected to a second-level internal node. Continuing similarly, we can build a $2w$-ary tree with $k$ levels.

The computation of the sum is performed from the leaves to the root. The sum of each group of the leaves is computed by a warp. The resulting sum is stored in the first-level internal nodes. After that, the sum of each group in the first level is computed by a warp, and the resulting sum is stored in the second-level internal nodes. Continuing similarly, we can obtain the sum.

Let $a_0$ denote the input array, and let $a_1, a_2, \ldots, a_k$ be working space, each of which corresponds to the internal nodes of one level of the tree. Each $a_i$ $(1 \le i \le k)$ can store $\frac{n}{(2w)^i}$ numbers. Algorithm Tree computes the resulting sum in $a_k[0]$ as follows:

[Algorithm Tree]
for $t \leftarrow 1$ to $k$ do
  for $i \leftarrow 0$ to $\frac{n}{(2w)^t} - 1$ do in parallel
  begin
    $W(i)$ computes $a_t[i] \leftarrow a_{t-1}[i \cdot 2w] + a_{t-1}[i \cdot 2w + 1] + \cdots + a_{t-1}[(i+1) \cdot 2w - 1]$ using Algorithm One-Warp.
    syncthreads
  end

Let us evaluate the computing time for each $t$. First, when $t = k$, one warp is used to compute the sum of $2w$ numbers. From Lemma 4, it takes $O(l \log w)$ time units. When $t = k - 1$, $2w$ warps with $w$ threads each are used.
Since each of the warps accesses one row at a time, the contiguous access is performed $O(\log w)$ times. From Lemma 1, each contiguous access takes $O(\frac{(2w)^2}{w} + \frac{(2w)^2 l}{2w^2}) = O(w + l)$ time units. Hence, the computing time for $t = k - 1$ is $O((w + l) \log w)$. Let us consider the general case $t = k - j$ $(0 \le j \le k - 1)$. The contiguous access for $(2w)^{j+1}$ numbers is performed by $(2w)^j$ warps, that is, by $(2w)^j \cdot w$ threads. Thus, the contiguous access takes $O(\frac{(2w)^{j+1}}{w} + \frac{(2w)^{j+1} l}{(2w)^j w}) = O((2w)^j + l)$ time units. Since the contiguous access is repeated $O(\log w)$ times, the total computing time for $t = k - j$ is $O(((2w)^j + l) \log w)$. Hence, the total computing time of Algorithm Tree is:

$\sum_{t=1}^{k} O(((2w)^{k-t} + l) \log w) = O((\frac{(2w)^k}{w} + kl) \log w) = O(\frac{n \log w}{w} + l \log n),$

since $(2w)^k = n$ and $k = \frac{\log n}{\log(2w)}$. Also, Algorithm Tree performs syncthreads $k = \frac{\log n}{\log(2w)}$ times.

[Figure 7. A summing algorithm based on a $2w$-ary tree, showing groups of $2w$ nodes at the first level and the second level.]

Thus, we have,

Lemma 5: The sum of $n$ numbers can be computed in $O(\frac{n \log w}{w} + l \log n)$ time units with $O(\frac{\log n}{\log w})$ barrier synchronization steps using $n/2$ threads on the DMM and on the UMM with width $w$ and latency $l$.

From Theorem 3, the algorithm for this lemma is not time optimal.

VII. A TIME-OPTIMAL ALGORITHM FOR COMPUTING THE SUM USING FEW BARRIER SYNCHRONIZATION STEPS

We can obtain time-optimal summing algorithms at the cost of a few additional barrier synchronization steps. This section is devoted to showing such time-optimal summing algorithms.

Suppose that Algorithm Simple is executed for $t = m - 1, m - 2, \ldots, m - \log\log w$. It should be clear that the interim sums are stored in $a[0], a[1], \ldots, a[\frac{n}{\log w} - 1]$. After that, the sum of these $\frac{n}{\log w}$ numbers is computed by Algorithm Tree. The details are spelled out as follows:

[Algorithm Simple-Tree]
for $t \leftarrow m - 1$ down to $m - \log\log w$ do
begin
  for $i \leftarrow 0$ to $2^t - 1$ do in parallel
    $T(i)$ performs $a[i] \leftarrow a[i] + a[i + 2^t]$
  syncthreads
end
Use Algorithm Tree to compute the sum $a[0] + a[1] + \cdots + a[\frac{n}{\log w} - 1]$.

Let us evaluate the computing time. As we have discussed, Algorithm Simple takes $O(\frac{2^t}{w} + l)$ time units for each $t$. Thus, the execution of Algorithm Simple for $t = m - 1, m - 2, \ldots, m - \log\log w$ takes

$\sum_{t = m - \log\log w}^{m-1} O(\frac{2^t}{w} + l) = O(\frac{2^m}{w} + l \log\log w) = O(\frac{n}{w} + l \log\log w)$

time units. Also, it has $\log\log w$ barrier synchronization steps. After that, Algorithm Tree is executed for an input of size $\frac{n}{\log w}$. From Lemma 5, it takes $O(\frac{n}{\log w} \cdot \frac{\log w}{w} + l \log \frac{n}{\log w}) = O(\frac{n}{w} + l \log n)$ time units. Further, it has $O(\frac{\log(n / \log w)}{\log w}) = O(\frac{\log n}{\log w})$ synchronization steps. Thus, we have,

Theorem 6: Algorithm Simple-Tree computes the sum of $n$ numbers in $O(\frac{n}{w} + l \log n)$ time units with $O(\frac{\log n}{\log w} + \log\log w)$ barrier synchronization steps using $n/2$ threads on the DMM and on the UMM with width $w$ and latency $l$.

We will show that the number of syncthreads can be made independent of the number $n$ of input numbers.
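The level-by-level structure of Algorithm Tree, and the fact that it uses exactly $k$ barrier steps when $n = (2w)^k$, can be illustrated with a short simulation. The Python sketch below is an assumption-laden illustration: the function name `algorithm_tree` is hypothetical, each warp's group sum is computed with Python's built-in `sum` in place of Algorithm One-Warp, and the barrier is modeled only as a counter incremented once per level.

```python
def algorithm_tree(a, w):
    """Simulate Algorithm Tree; return (sum, number of syncthreads).

    Assumes len(a) == (2w)^k for some integer k >= 1, as in the text.
    """
    fan = 2 * w  # each internal node has 2w children
    level = list(a)  # a_0 is the input array
    barriers = 0
    while len(level) > 1:
        assert len(level) % fan == 0, "assumption: n is a power of 2w"
        # Warp W(i) sums group i of 2w consecutive values of a_{t-1}
        # (Algorithm One-Warp in the text) and stores the result in a_t[i].
        level = [
            sum(level[i * fan:(i + 1) * fan])
            for i in range(len(level) // fan)
        ]
        barriers += 1  # one syncthreads per level
    return level[0], barriers
```

For example, with $w = 4$ and $n = 64 = (2w)^2$, `algorithm_tree(list(range(64)), 4)` returns the sum 2016 after 2 barrier steps, matching $k = \log n / \log(2w) = 2$.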
Suppose that $n$ $(\ge wl)$ input numbers are stored in an array $a$ of size $\frac{n}{wl} \times wl$ ($\frac{n}{wl}$ rows and $wl$ columns). First, we assign one thread to each column and compute the column-wise sums. After that, the sum of the column-wise sums is computed using Algorithm Simple-Tree. The details are spelled out as follows:

[Algorithm Hybrid]
for $i \leftarrow 0$ to $wl - 1$ do in parallel
  for $t \leftarrow 1$ to $\frac{n}{wl} - 1$ do
    $T(i)$ performs $a[0][i] \leftarrow a[0][i] + a[t][i]$
Use Algorithm Simple-Tree to compute the sum $a[0][0] + a[0][1] + \cdots + a[0][wl - 1]$.

Let us evaluate the computing time. The computation of the column-wise sums performs contiguous access. Thus, from Lemma 1 with $p = wl$, it takes $O(\frac{n}{w} + \frac{nl}{wl}) = O(\frac{n}{w})$ time units. After that, Algorithm Simple-Tree is executed for $wl$ numbers. From Theorem 6, it takes $O(\frac{wl}{w} + l \log(wl)) = O(l \log n)$ time units using $O(\frac{\log(wl)}{\log w} + \log\log w) = O(\frac{\log l}{\log w} + \log\log w)$ barrier synchronization steps. Thus, we have,

Theorem 7: Algorithm Hybrid computes the sum of $n$ numbers in $O(\frac{n}{w} + l \log n)$ time units with $O(\frac{\log l}{\log w} + \log\log w)$ barrier synchronization steps using $wl$ threads on the DMM and on the UMM with width $w$ and latency $l$.

VIII. CONCLUSION

The main contribution of this paper is to introduce the asynchronous versions of the memory machine models, the DMM and the UMM. We also presented a time-optimal parallel summing algorithm running in $O(\frac{n}{w} + l \log n)$ time units and $O(\frac{\log l}{\log w} + \log\log w)$ barrier synchronization steps. It is an interesting open problem to further reduce the number of barrier synchronization steps of time-optimal parallel summing computation.

REFERENCES

[1] A. V. Aho, J. D. Ullman, and J. E. Hopcroft, Data Structures and Algorithms. Addison Wesley.
[2] A. Gibbons and W. Rytter, Efficient Parallel Algorithms. Cambridge University Press.
[3] A. Grama, G. Karypis, V. Kumar, and A. Gupta, Introduction to Parallel Computing. Addison Wesley.
[4] M. J. Quinn, Parallel Computing: Theory and Practice. McGraw-Hill.
[5] W. W. Hwu, GPU Computing Gems Emerald Edition. Morgan Kaufmann.
[6] Y. Ito, K. Ogawa, and K. Nakano, Fast ellipse detection algorithm using Hough transform on the GPU, in Proc. of International Conference on Networking and Computing, Dec. 2011.
[7] D. Man, K. Uda, Y. Ito, and K. Nakano, A GPU implementation of computing Euclidean distance map with efficient memory access, in Proc. of International Conference on Networking and Computing, Dec. 2011.
[8] A. Uchida, Y. Ito, and K. Nakano, Fast and accurate template matching using pixel rearrangement on the GPU, in Proc. of International Conference on Networking and Computing, Dec. 2011.
[9] K. Ogawa, Y. Ito, and K. Nakano, Efficient Canny edge detection using a GPGPU, in Proc. of International Conference on Networking and Computing, Nov. 2010.
[10] K. Nishida, Y. Ito, and K. Nakano, Accelerating the dynamic programming for the matrix chain product on the GPU, in Proc. of International Conference on Networking and Computing, Dec. 2011.
[11] K. Nishida, Y. Ito, and K. Nakano, Accelerating the dynamic programming for the optimal polygon triangulation on the GPU, in Proc. of International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP, LNCS 7439), Sept. 2012.
[12] NVIDIA Corporation, NVIDIA CUDA C programming guide version 4.0.
[13] D. Man, K. Uda, H. Ueyama, Y. Ito, and K. Nakano, Implementations of a parallel algorithm for computing Euclidean distance map in multicore processors and GPUs, International Journal of Networking and Computing, vol. 1, July.
[14] NVIDIA Corporation, NVIDIA CUDA C best practice guide version 3.1, 2010.
[15] K. Nakano, Simple memory machine models for GPUs, in Proc. of International Parallel and Distributed Processing Symposium Workshops, May 2012.
[16] M. J. Flynn, Some computer organizations and their effectiveness, IEEE Transactions on Computers, vol. C-21.
[17] K. Nakano, An optimal parallel prefix-sums algorithm on the memory machine models for GPUs, in Proc. of International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP, LNCS 7439), Sept. 2012.
[18] A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir, The NYU Ultracomputer: designing an MIMD shared memory parallel computer, IEEE Trans. on Computers, vol. C-32, no. 2, Feb.
[19] D. H. Lawrie, Access and alignment of data in an array processor, IEEE Trans. on Computers, vol. C-24, no. 12, Dec.
[20] S. G. Akl, Parallel Sorting Algorithms. Academic Press.
[21] K. E. Batcher, Sorting networks and their applications, in Proc. AFIPS Spring Joint Comput. Conf., vol. 32, 1968.
