Asynchronous Memory Machine Models with Barrier Synchronization

Koji Nakano
Department of Information Engineering, Hiroshima University
Kagamiyama 1-4-1, Higashi Hiroshima, Japan

Abstract

The Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM) are theoretical parallel computing models that capture the essence of the shared memory and the global memory of GPUs. It was assumed that warps (i.e., groups of threads) on the DMM and the UMM work synchronously in a round-robin manner. However, warps work asynchronously in actual GPUs, in the sense that warps may be randomly (or arbitrarily) dispatched for execution. The first contribution of this paper is to introduce asynchronous versions of the DMM and the UMM, in which warps are arbitrarily dispatched. Instead, we assume that threads can execute the syncthreads instruction for barrier synchronization. Since barrier synchronization is costly, we should evaluate and minimize the number of barrier synchronization operations performed by parallel algorithms. The second contribution of this paper is to show a parallel algorithm that computes the sum of $n$ numbers in optimal computing time and few barrier synchronization steps. Our parallel algorithm computes the sum of $n$ numbers in $O(n/w + l\log n)$ time units and $O(\log l/\log w + \log\log w)$ barrier synchronization steps using $wl$ threads, both on the asynchronous DMM and on the asynchronous UMM with width $w$ and latency $l$. We also prove that the computing time is optimal because it matches the theoretical lower bound. Quite surprisingly, the number of barrier synchronization steps and the number of threads are independent of $n$. Even if the input size $n$ is quite large, our parallel algorithm computes the sum in optimal time units and a fixed number of syncthreads using a fixed number of threads.

Keywords: parallel computing models, parallel algorithms, contiguous memory access, asynchronous models, GPU, CUDA

I. INTRODUCTION

The research of parallel algorithms has a long history of more than 40 years. Sequential algorithms have been developed mostly on the Random Access Machine (RAM) [1].
In contrast, since there are a variety of connection methods and patterns between processors and memories, many parallel computing models have been presented and many parallel algorithmic techniques have been shown on them. The most well-studied parallel computing model is the Parallel Random Access Machine (PRAM) [2], [3], [4], which consists of processors and a shared memory. Each processor on the PRAM can access any address of the shared memory in a time unit. The PRAM is a good parallel computing model in the sense that the parallelism of each problem can be revealed by the performance of parallel algorithms on the PRAM. However, since the PRAM requires a shared memory that can be accessed by all processors at the same time, it is not feasible.

The GPU (Graphics Processing Unit) is a specialized circuit designed to accelerate computation for building and manipulating images [5], [6], [7], [8], [9]. The latest GPUs are designed for general-purpose computing and can perform computation in applications traditionally handled by the CPU. Hence, GPUs have recently attracted the attention of many application developers [5], [10], [11]. NVIDIA provides a parallel computing architecture called CUDA (Compute Unified Device Architecture) [12], the computing engine for NVIDIA GPUs. CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in NVIDIA GPUs. In many cases, GPUs are more efficient than multicore processors [13], since they have hundreds of processor cores and very high memory bandwidth.

CUDA uses two types of memories in the NVIDIA GPUs: the shared memory and the global memory [12]. The shared memory is an extremely fast on-chip memory with lower capacity, on the order of kilobytes. The global memory is implemented as an off-chip DRAM and has large capacity, on the order of gigabytes, but its access latency is very long. The efficient usage of the shared memory and the global memory is a key for CUDA developers to accelerate applications using GPUs. In particular, we need to consider the bank conflicts of the shared memory access and the coalescing of the global memory access [7], [13], [14].
The address space of the shared memory is mapped into several physical memory banks. If two or more threads access the same memory bank at the same time, the access requests are processed sequentially. Hence, to maximize the memory access performance, CUDA threads should access distinct memory banks to avoid bank conflicts. To maximize the bandwidth between the GPU and the DRAM chips, consecutive addresses of the global memory must be accessed at the same time. Thus, CUDA threads should perform coalesced access when they access the global memory.

In our previous paper [15], we introduced two models, the Discrete Memory Machine (DMM) and the
Unified Memory Machine (UMM), which reflect the essential features of the shared memory and the global memory of NVIDIA GPUs. The outline of the architectures of the DMM and the UMM is illustrated in Figure 1. In both architectures, a sea of threads (Ts) is connected to the memory banks (MBs) through the memory management unit (MMU). Each thread is a Random Access Machine (RAM) [1], which can execute one of the fundamental operations in a time unit. We do not discuss the architecture of the sea of threads in this paper, but we can imagine that it consists of a set of multi-core processors which can execute many threads in parallel and/or in a time-sharing manner. Threads are executed in SIMD [16] fashion, and the processors run the same program and work on different data.

[Figure 1. The architectures of the DMM and the UMM. In each, a sea of threads is connected through the MMU to the MBs by address lines and data lines.]

The MBs constitute a single address space of the memory. This single address space is mapped to the MBs in an interleaved way such that the word of data of address $i$ is stored in the $(i \bmod w)$-th bank, where $w$ is the number of MBs. The main difference between the two architectures is the connection of the address line between the MMU and the MBs, which can transfer an address value. In the DMM, the address lines connect the MBs and the MMU separately, while a single address line from the MMU is connected to the MBs in the UMM. Hence, in the UMM, the same address value is broadcast to every MB, and the same address of the MBs can be accessed in each time unit. On the other hand, different addresses of the MBs can be accessed in the DMM. Since the memory access of the UMM is more restricted than that of the DMM, the UMM is less powerful than the DMM.

The performance of algorithms on the PRAM is usually evaluated using two parameters: the size $n$ of the input and the number $p$ of processors. For example, it is well known that the sum of $n$ numbers can be computed in $O(n/p + \log n)$ time on the PRAM [2].
We will use four parameters, the size $n$ of the input, the number $p$ of threads, the width $w$, and the latency $l$ of the memory, when we evaluate the performance of algorithms on the DMM and on the UMM. The width $w$ is the number of memory banks and the latency $l$ is the number of time units needed to complete a memory access. Hence, the performance of algorithms on the DMM and the UMM is evaluated as a function of $n$ (the size of a problem), $p$ (the number of threads), $w$ (the width of a memory), and $l$ (the latency of a memory). Further, $r$ (the number of local registers used by each thread) may be additionally used. Note that the width $w$ and the latency $l$ depend on the architecture. They are fixed values and cannot be changed. On the other hand, the number $p$ of threads can be changed. Users can choose an optimal value of $p$ to get the best performance. Thus, the computing time of algorithms on the DMM and the UMM can be evaluated without using $p$. For example, in our previous paper [17], we have shown that the prefix-sums of $n$ numbers can be computed in $O(n/w + nl/p + l\log n)$ time units on the DMM and the UMM. To get the best performance, we should choose $p = wl$. If this is the case, the prefix-sums can be computed in $O(n/w + l\log n)$ time units.

Suppose that we use $p$ threads $T(0), T(1), \ldots, T(p-1)$. Threads on the DMM and the UMM are partitioned into groups of $w$ threads called warps. Let $W(0), W(1), \ldots, W(p/w - 1)$ denote the groups. In our previous paper, it is assumed that threads on the DMM and the UMM work synchronously in the sense that warps are activated for memory access from $W(0)$ to $W(p/w - 1)$ in turn in a round-robin manner.

The first contribution of this paper is to extend the memory machine models presented in our previous paper [15] to more realistic parallel computing models. More specifically, we assume that threads work asynchronously in the sense that warps are dispatched for memory access arbitrarily. The scheduler arbitrarily selects one of the warps in which at least one thread tries to access the memory, and dispatches it for memory access. Instead, we assume that threads can execute an instruction syncthreads for the purpose of barrier synchronization.
In NVIDIA GPUs, the syncthreads() instruction is supported for threads in a block, and it takes 16 clock cycles [12]. Also, to synchronize threads in multiple blocks, we need to separate the algorithm into different kernel calls [12]. Hence, barrier synchronization is costly. In this paper, when we evaluate the performance of a parallel algorithm on the asynchronous DMM and the asynchronous UMM, we also evaluate the number of syncthreads operations performed. Note that parallel algorithms on the asynchronous versions of the DMM and the UMM must work correctly for any worst choice of warps by a malicious scheduler. Also, the performance, including the computing time, must be evaluated for the case of the worst choice of warps.

The second contribution of this paper is to show an efficient
summing algorithm on the asynchronous versions of the DMM and the UMM with width $w$ and latency $l$. We first show that a simple algorithm shown in [17] can compute the sum of $n$ numbers in $O(n/w + l\log n)$ time units and $O(\log n)$ barrier synchronization steps (Algorithm Simple). We then go on to prove that $\Omega(n/w + l\log n)$ time units are necessary to compute the sum of $n$ numbers. Thus, Algorithm Simple is time optimal. We also show that the sum of $n$ numbers can be computed in $O(nl/w + l\log n)$ time units and 0 barrier synchronization steps (Algorithm One-Warp). Although this algorithm does not perform barrier synchronization, it is not time optimal and has a large overhead of factor $l$. Next, we will show that a parallel algorithm based on a $2w$-ary tree can compute the sum of $n$ numbers in $O((n\log w)/w + l\log n)$ time units and $O(\log n/\log w)$ barrier synchronization steps (Algorithm Tree). By combining Algorithm Simple and Algorithm Tree, we show that the sum of $n$ numbers can be computed in $O(n/w + l\log n)$ time units and $O(\log n/\log w + \log\log w)$ barrier synchronization steps (Algorithm Simple-Tree). Clearly, Algorithm Simple-Tree is time optimal. Finally, we will show that the number of barrier synchronization steps can be reduced to $O(\log l/\log w + \log\log w)$ (Algorithm Hybrid). Quite surprisingly, the number of barrier synchronization steps and the number of threads of Algorithm Hybrid are independent of $n$. Even if the input size $n$ is quite large, our parallel algorithm computes the sum in optimal time units and a fixed number of syncthreads using a fixed number of threads. Table I summarizes the summing algorithms presented in this paper.

This paper is organized as follows. Section II defines the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM) introduced in our previous paper [15] and defines the asynchronous versions of the DMM and the UMM. In Section III, we evaluate the computing time of the contiguous memory access to the memory of the asynchronous DMM and the asynchronous UMM. The contiguous memory access is a key ingredient of parallel algorithm development on the memory machine models.
Using the contiguous access, we show in Section IV that Algorithm Simple can compute the sum of $n$ numbers in $O(n/w + l\log n)$ time units and $O(\log n)$ barrier synchronization steps. We also discuss the lower bound of the time complexity and show two lower bounds, the $\Omega(n/w)$-time bandwidth limitation and the $\Omega(l\log n)$-time reduction limitation. Section V shows Algorithm One-Warp, which computes the sum of $n$ numbers in $O(nl/w + l\log n)$ time units and 0 synchronization steps. Section VI shows a tree-based summing algorithm, Algorithm Tree, which computes the sum of $n$ numbers in $O((n\log w)/w + l\log n)$ time units and $O(\log n/\log w)$ barrier synchronization steps. Finally, Section VII shows time-optimal summing algorithms. Algorithm Simple-Tree, which is a combination of Algorithm Simple and Algorithm Tree, uses $O(\log n/\log w + \log\log w)$ barrier synchronization steps. By an appropriate precomputation, we show that the number of barrier synchronization steps can be reduced to $O(\log l/\log w + \log\log w)$. Section VIII offers concluding remarks.

II. PARALLEL MEMORY MACHINES: DMM AND UMM

The main purpose of this section is to define the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM) introduced in our previous papers [15], [17].

We first define the Discrete Memory Machine (DMM) of width $w$ and latency $l$. Let $m[i]$ ($i \ge 0$) denote the memory cell of address $i$ in the memory. Let $B[j] = \{m[j], m[j+w], m[j+2w], m[j+3w], \ldots\}$ ($0 \le j \le w-1$) denote the $j$-th bank of the memory. Clearly, a memory cell $m[i]$ is in the $(i \bmod w)$-th memory bank. We assume that memory cells in different banks can be accessed in a time unit, but no two memory cells in the same bank can be accessed in a time unit. Also, we assume that $l$ time units are necessary to complete an access request and that continuous requests are processed in a pipeline fashion through the MMU. Thus, it takes $k + l - 1$ time units to complete $k$ access requests to a particular bank.

We assume that $p$ threads are partitioned into $p/w$ groups of $w$ threads called warps. More specifically, the $p$ threads are partitioned into $p/w$ warps $W(0), W(1), \ldots, W(p/w - 1)$ such that $W(i) = \{T(i \cdot w), T(i \cdot w + 1), \ldots, T((i+1) \cdot w - 1)\}$ ($0 \le i \le p/w - 1$).
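The definitions above can be made concrete with a short Python sketch. It is only an illustration: the function names `bank_of`, `warps`, and `bank_access_time` are hypothetical and not taken from the paper, and $w$ is assumed to divide $p$.

```python
def bank_of(i, w):
    """Index j of the bank B[j] that holds memory cell m[i] (interleaved mapping)."""
    return i % w


def warps(p, w):
    """Partition thread indices 0..p-1 into p/w warps of w consecutive threads.

    Assumes w divides p, as in the text.
    """
    return [list(range(j * w, (j + 1) * w)) for j in range(p // w)]


def bank_access_time(k, l):
    """Time units to complete k pipelined requests to a single bank with latency l."""
    return k + l - 1
```

For instance, with $w = 4$ the cells $m[6]$ and $m[10]$ both map to bank 2, so requests to them in the same time unit would conflict.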
Warps are dispatched for memory access in turn, and threads in a warp try to access the memory at the same time. We define two assumptions, the synchronous manner and the asynchronous manner, in terms of the dispatching of warps. In the synchronous manner, $W(0), W(1), \ldots, W(p/w - 1)$ are dispatched in a round-robin manner if at least one thread in a warp requests memory access. More specifically, suppose that every thread executes $T$ instructions. In the synchronous manner, warps work equally as follows:

[Synchronous Model]
for t ← 0 to T − 1 do
  for i ← 0 to p/w − 1 do
    Every thread in W(i) executes an instruction.

On the other hand, in asynchronous operation, one of the warps is dispatched and executed as follows:

[Asynchronous Model]
for t ← 0 to T · p/w − 1 do
  Arbitrarily select a warp W(i) to be executed.
  Each thread in W(i) executes an instruction.

Note that, in the asynchronous manner, if all threads in a warp $W(i)$ have no instruction to be executed, such a warp $W(i)$ is not selected. For example, if threads in $W(i)$ have just sent memory access requests and they are waiting for completion of the memory access, $W(i)$ is not selected. Such a warp $W(i)$ will be selected after the completion of the memory access.

We also assume that, for the purpose of barrier synchronization, all threads can execute the syncthreads instruction. Suppose that at least one of the threads executes syncthreads. After that, all threads that have executed
syncthreads are blocked until all threads execute syncthreads. Once all threads execute syncthreads, they restart executing instructions. We assume that a thread cannot send a new memory access request until the previous memory access request is completed. Hence, if a thread sends a memory access request, it must wait $l$ time units to send a new memory access request.

Table I. Performance of parallel algorithms for computing the sum

Algorithm     Time units                      Threads   syncthreads                          Time optimality
Simple        $O(n/w + l\log n)$              $n/2$     $O(\log n)$                          optimal
One-Warp      $O(nl/w + l\log n)$             $w$       0                                    overhead of factor $l$
Tree          $O((n\log w)/w + l\log n)$      $n/2$     $O(\log n/\log w)$                   overhead of factor $\log w$
Simple-Tree   $O(n/w + l\log n)$              $n/2$     $O(\log n/\log w + \log\log w)$      optimal
Hybrid        $O(n/w + l\log n)$              $wl$      $O(\log l/\log w + \log\log w)$      optimal

[Figure 2. Banks and address groups for $w = 4$: memory banks B[0], ..., B[3] of the DMM and address groups A[0], ..., A[3] of the UMM.]

For the reader's benefit, let us evaluate the time for memory access using Figure 3 on the DMM for $p = 8$, $w = 4$, and $l = 3$. In the figure, $p = 8$ threads are partitioned into $p/w = 2$ warps $W(0) = \{T(0), T(1), T(2), T(3)\}$ and $W(1) = \{T(4), T(5), T(6), T(7)\}$. As illustrated in the figure, the 4 threads in $W(0)$ try to access $m[0]$, $m[1]$, $m[6]$, and $m[10]$, and those in $W(1)$ try to access $m[8]$, $m[9]$, $m[14]$, and $m[15]$. The time for the memory access is evaluated under the assumption that memory accesses are processed by $l$ imaginary pipeline stages with $w$ registers each, as illustrated in the figure. Each pipeline register in the first stage receives memory access requests from threads in an activated warp. The $i$-th ($0 \le i \le w-1$) pipeline register receives the requests to the $i$-th memory bank. In each time unit, a memory request in a pipeline register is moved to the next one. We assume that the memory access completes when the request reaches the last pipeline register.

Note that the architecture of pipeline registers illustrated in Figure 3 is imaginary, and it is used only for evaluating the computing time. The actual architecture should involve a multistage interconnection network [18], [19] or a sorting network [20], [21] to route memory access requests.
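The pipeline accounting described above can be sketched in Python as follows. This is a minimal illustration under the synchronous round-robin dispatch, not the paper's own formulation: the function names are hypothetical, and the UMM rule (one time unit per distinct address group $A[j]$ touched by a warp) is the one stated in the text that follows.

```python
from collections import Counter


def dmm_stages(addresses, w):
    """Time units a warp needs to inject its requests on the DMM.

    Requests to distinct banks go in together; requests to the same
    bank are serialized, so the cost is the largest per-bank count.
    """
    if not addresses:
        return 0
    return max(Counter(a % w for a in addresses).values())


def umm_stages(addresses, w):
    """Time units a warp needs to inject its requests on the UMM.

    One time unit per distinct address group A[j] = {m[j*w], ..., m[(j+1)*w - 1]}.
    """
    return len({a // w for a in addresses})


def access_time(warp_requests, w, l, stages):
    """Total time when warps are dispatched in turn and requests are pipelined.

    The injection costs add up, and the last request needs l - 1 more
    time units to reach the final pipeline stage.
    """
    return sum(stages(req, w) for req in warp_requests) + l - 1


# The access pattern of the running example: p = 8, w = 4, l = 3.
example = [[0, 1, 6, 10], [8, 9, 14, 15]]
print(access_time(example, 4, 3, dmm_stages))  # DMM time units
print(access_time(example, 4, 3, umm_stages))  # UMM time units
```

The sketch separates the per-warp injection cost from the fixed pipeline depth, which is the same decomposition used in the hand evaluation of the example.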
Let us evaluate the time for memory access on the DMM. First, access requests for $m[0]$, $m[1]$, and $m[6]$ are sent to the first stage. Since $m[6]$ and $m[10]$ are in the same bank $B[2]$, their memory requests cannot be sent to the first stage at the same time. Next, the request for $m[10]$ is sent to the first stage. After that, memory access requests for $m[8]$, $m[9]$, $m[14]$, and $m[15]$ are sent at the same time, because they are in different memory banks. Finally, after $l - 1 = 2$ time units, these memory requests are processed. Hence, the DMM takes 5 time units to complete the memory access.

We next define the Unified Memory Machine (UMM) of width $w$ as follows. Let $A[j] = \{m[j \cdot w], m[j \cdot w + 1], \ldots, m[(j+1) \cdot w - 1]\}$ denote the $j$-th address group. We assume that memory cells in the same address group are processed at the same time. However, if they are in different groups, one time unit is necessary for each of the groups. Also, similarly to the DMM, $p$ threads are partitioned into warps and each warp accesses the memory in turn.

Again, let us evaluate the time for memory access using Figure 3 on the UMM for $p = 8$, $w = 4$, and $l = 3$. The memory access requests by $W(0)$ are in three address groups. Thus, three time units are necessary to send them to the first stage. Next, two time units are necessary to send the memory access requests by $W(1)$, because they are in two address groups. After that, it takes $l - 1 = 2$ time units to process the memory access requests. Hence, in total $3 + 2 + 2 = 7$ time units are necessary to complete all memory accesses.

[Figure 3. An example of memory access on the DMM and the UMM by threads T(0), ..., T(7).]

III. CONTIGUOUS MEMORY ACCESS

The main purpose of this section is to show the contiguous memory access on the asynchronous DMM and the asynchronous UMM. The evaluation of the computing time for the contiguous access on the synchronous DMM and the synchronous UMM is not difficult [15], [17]. However, that for the asynchronous versions is more complicated. This section shows that the computing time on the asynchronous DMM and the asynchronous UMM is the same as that on the synchronous versions.

Suppose that an array $a$ of size $n$ ($\ge p$) is given. We use $p$ threads to access all of the memory cells in $a$ such that each thread accesses $n/p$ memory cells. Note that accessing can be reading from or writing in. Let $a[i]$ ($0 \le i \le n-1$) denote the $i$-th memory cell in $a$. We can consider $a$ to be a 2-dimensional array of size $(n/p) \times p$ ($n/p$ rows and $p$ columns). Each $a[i][j]$ ($0 \le i \le n/p - 1$, $0 \le j \le p-1$) corresponds to $a[i \cdot p + j]$. The contiguous memory access can be performed as follows:

[Contiguous memory access]
for i ← 0 to p − 1 do in parallel
  for t ← 0 to n/p − 1 do
    T(i) accesses a[t][i].

Figure 4 illustrates the contiguous memory access for $n = 20$ and $p = 4$.

[Figure 4. Contiguous memory access for $n = 20$ and $p = 4$ by threads T(0), ..., T(3).]

Let us evaluate the computing time. Each warp $W(j)$ ($0 \le j \le p/w - 1$) with $w$ threads accesses memory cells $a[t][j \cdot w], a[t][j \cdot w + 1], \ldots, a[t][(j+1) \cdot w - 1]$ for each $t$ ($0 \le t \le n/p - 1$). In other words, each warp $W(j)$ repeats, $n/p$ times, an access to memory cells that lie in a single address group. We will evaluate the computing time for the following two cases:

Case 1: $p/w < l$. First, one of the warps is randomly dispatched and sends memory access requests. After a warp sends requests, it will not be selected for at least $l$ time units. Thus, all of the $p/w$ warps are dispatched in the first $p/w$ time units.
Each warp takes $l$ time units to complete the memory access. Thus, the second memory access is started at time $l$. Figure 5 illustrates how the contiguous memory access is performed when $p/w < l$. Contiguous memory access requests by the $p/w$ warps are repeatedly sent $n/p$ times, and each round takes $l$ time units. Thus, it takes $O(nl/p)$ time units for the contiguous memory access.

Case 2: $p/w \ge l$. Each of the $p/w$ warps sends memory access requests $n/p$ times. Hence, in total they send memory access requests $(p/w) \cdot (n/p) = O(n/w)$ times. Clearly, if at least $l$ warps have not completed their memory access, they can send memory access requests continuously. On the other hand, if no warp sends a memory access request in a time unit, then fewer than $l$ warps still have memory access requests to be sent. Hence, each of these fewer than $l$ warps can send memory access requests at least once in $l$ time units. Since each warp sends memory access requests $n/p$ times, it takes $l \cdot n/p = O(nl/p)$ time units for these fewer than $l$ warps to complete their memory access requests. Therefore, the contiguous memory access can be completed in $O(n/w + nl/p)$ time units. Thus, we have:

Lemma 1: The contiguous access to an array of size $n$ can be done in $O(n/w + nl/p)$ time units with 0 barrier synchronization steps using $p$ threads on the UMM and the DMM with width $w$ and latency $l$.

IV. A SIMPLE SUMMING ALGORITHM AND THE TIME LOWER BOUND

The main purpose of this section is to show a simple parallel algorithm for computing the sum on the memory machine models. The summing algorithm presented in this section is essentially the same as the one presented in [17] on the synchronous DMM and the synchronous UMM.

Let $a$ be an array of $n = 2^m$ numbers. Let us show an algorithm to compute the sum $a[0] + a[1] + \cdots + a[n-1]$. The algorithm uses a well-known parallel computing technique which repeatedly computes the sums of pairs. We implement this technique to perform contiguous memory access using $n/2$ threads. The details are spelled out as follows:

[Algorithm Simple]
for t ← m − 1 downto 0 do
begin
  for i ← 0 to 2^t − 1 do in parallel
    T(i) performs a[i] ← a[i] + a[i + 2^t]
  if (2^t > w) syncthreads
end
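Algorithm Simple can be emulated sequentially to check its result and to count how often the barrier is reached. The sketch below is an illustration only: the function name `algorithm_simple` is hypothetical, the parallel loop over threads is replaced by an ordinary loop (which is safe here because within one round every thread reads `a[i + 2^t]` and writes `a[i]` for disjoint index ranges), and the barrier is merely counted rather than executed.

```python
def algorithm_simple(a, w):
    """Sequential emulation of Algorithm Simple for n = 2^m numbers.

    Returns (sum, number of barrier steps). A barrier is counted only
    in rounds where 2^t > w, i.e. when more than one warp is active.
    """
    n = len(a)
    m = n.bit_length() - 1
    assert n == 1 << m, "n must be a power of two"
    a = list(a)  # work on a copy
    barriers = 0
    for t in range(m - 1, -1, -1):
        half = 1 << t
        # Threads T(0), ..., T(2^t - 1) each add one pair.
        for i in range(half):
            a[i] += a[i + half]
        if half > w:
            barriers += 1  # stands in for syncthreads
    return a[0], barriers


total, barriers = algorithm_simple(list(range(16)), w=4)
print(total, barriers)  # 120 1
```

With $n = 16$ and $w = 4$, only the first round ($2^t = 8$) uses more than one warp, so a single barrier step is counted.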

[Figure 5. Contiguous memory access when $p/w < l$ (Case 1) and when $p/w \ge l$ (Case 2). In Case 2, a phase of $O(n/w)$ time units in which at least $l$ warps have not completed sending memory access requests is followed by a phase of $O(nl/p)$ time units in which fewer than $l$ warps have not completed sending them.]

[Figure 6. Illustrating the summing algorithm for $n$ numbers.]

Figure 6 illustrates how the sums of pairs are computed. From the figure, it should be clear that this algorithm computes the sum correctly.

Let us evaluate the computing time. For each $t$ ($0 \le t \le m-1$), $2^t$ operations $a[i] \leftarrow a[i] + a[i + 2^t]$ are performed. These operations involve the following memory access operations: reading from $a[0], a[1], \ldots, a[2^t - 1]$; reading from $a[2^t], a[2^t + 1], \ldots, a[2 \cdot 2^t - 1]$; and writing in $a[0], a[1], \ldots, a[2^t - 1]$. Since these memory access operations are contiguous, from Lemma 1 they can be done in $O(2^t/w + 2^t l/2^t) = O(2^t/w + l)$ time using $2^t$ threads, both on the DMM and on the UMM with width $w$ and latency $l$. Thus, the total computing time is

$$\sum_{t=0}^{m-1} O\!\left(\frac{2^t}{w} + l\right) = O\!\left(\frac{2^m}{w} + lm\right) = O\!\left(\frac{n}{w} + l\log n\right).$$

Barrier synchronization syncthreads is executed at most $m - \log w = O(\log n)$ times. Thus, we have:

Lemma 2: Algorithm Simple computes the sum of $n$ numbers in $O(n/w + l\log n)$ time units and $O(\log n)$ barrier synchronization steps using $n/2$ threads on the DMM and on the UMM with width $w$ and latency $l$.

Note that if $n \le w$, then only one warp is used and thus syncthreads is not necessary.

Let us discuss the lower bound of the time necessary to compute the sum on the DMM and the UMM to show that our parallel summing algorithm for Lemma 2 is optimal. We will show two lower bounds, the $\Omega(n/w)$-time bandwidth limitation and the $\Omega(l\log n)$-time reduction limitation.

Since the width of the memory is $w$, at most $w$ numbers in the memory can be read in a time unit. Clearly, all of the $n$ numbers must be read to compute the sum. Hence, $\Omega(n/w)$
time units are necessary to compute the sum. We call the $\Omega(n/w)$-time lower bound the bandwidth limitation.

Each thread can perform a binary operation such as addition in a time unit. If at least one of the two operands of a binary operation is stored in the shared memory, it takes at least $l$ time units to obtain the resulting value. Clearly, the addition operation must be performed $n - 1$ times to compute the sum of $n$ numbers. The computation of the sum using addition is represented by a binary tree with $n$ leaves and $n - 1$ internal nodes. The root of the binary tree corresponds to the sum. From basic graph theory, there exists a path from the root to a leaf which has at least $\log n$ internal nodes. The addition corresponding to each internal node takes $l$ time units. Thus, it takes at least $\Omega(l\log n)$ time to compute the sum, regardless of the number of threads. We call the $\Omega(l\log n)$-time lower bound the reduction limitation. From the discussion above, we have:

Theorem 3: Both the DMM and the UMM with width $w$ and latency $l$ take at least $\Omega(n/w + l\log n)$ time units to compute the sum of $n$ numbers.

From Theorem 3, Algorithm Simple for Lemma 2 is optimal.

V. A SUMMING ALGORITHM USING ZERO BARRIER SYNCHRONIZATION STEPS

This section shows a summing algorithm using zero barrier synchronization steps. Clearly, if we use a single warp of $w$ threads, then no barrier synchronization is necessary. Let us consider that the input is given in a 2-dimensional array $a$ of size $(n/w) \times w$ ($n/w$ rows and $w$ columns). First, the sum of each column is computed using a thread. After that, the sum of the column-wise sums is computed using Algorithm Simple (Lemma 2). The details of the algorithm are spelled out as follows:

[Algorithm One-Warp]
for i ← 0 to w − 1 do in parallel
  for t ← 1 to n/w − 1 do
    T(i) performs a[0][i] ← a[0][i] + a[t][i]
Compute a[0][0] + a[0][1] + ... + a[0][w − 1] using Algorithm Simple.

The computation of the column-wise sums performs contiguous access. Thus, from Lemma 1 with $p = w$, it takes $O(n/w + nl/w) = O(nl/w)$ time. After that, Algorithm Simple computes the sum of $w$ numbers in $O(l\log w)$ time.
Since $\log w \le \log n$, we have:

Lemma 4: Algorithm One-Warp computes the sum of $n$ numbers in $O(nl/w + l\log n)$ time units and 0 barrier synchronization steps using $w$ threads on the DMM and on the UMM with width $w$ and latency $l$.

Clearly, the computing time has an overhead of factor $l$, and hence Algorithm One-Warp is not time optimal.

VI. A SUMMING ALGORITHM BASED ON A 2w-ARY TREE

We need to use more than $w$ threads to obtain a time-optimal summing algorithm. However, if we use more than $w$ threads, barrier synchronization is necessary. This section shows a summing algorithm using more than $w$ threads. The goal of the summing algorithm shown in this section is to minimize the number of barrier synchronization steps.

For simplicity, we assume that $n = (2w)^k$ for some integer $k$. We can build a $2w$-ary tree with $n$ leaves, each of which corresponds to an input number. The leaves are partitioned into $n/(2w)$ groups and each group is connected to a first-level internal node. Thus, we have $n/(2w)$ first-level internal nodes. The first-level internal nodes are partitioned into $n/(2w)^2$ groups and each group is connected to a second-level internal node. Continuing similarly, we can build a $2w$-ary tree with $k$ levels.

The computation of the sum is performed from the leaves to the root. The sum of each group of the leaves is computed by a warp. The resulting sum is stored in the first-level internal nodes. After that, the sum of each group in the first level is computed by a warp, and the resulting sum is stored in the second-level internal nodes. Continuing similarly, we can obtain the sum. Let $a_0$ denote the input array, and let $a_1, a_2, \ldots, a_k$ be working space, each of which corresponds to one level of internal nodes of the tree. Each $a_i$ ($1 \le i \le k$) can store $n/(2w)^i$ numbers. Algorithm Tree computes the resulting sum in $a_k[0]$ as follows:

[Algorithm Tree]
for t ← 1 to k do
  for i ← 0 to n/(2w)^t − 1 do in parallel
  begin
    W(i) computes a_t[i] ← a_{t−1}[i · 2w] + a_{t−1}[i · 2w + 1] + ... + a_{t−1}[(i + 1) · 2w − 1] using Algorithm One-Warp.
    syncthreads
  end

Let us evaluate the computing time for each $t$. First, when $t = k$, one warp is used to compute the sum of $2w$ numbers. From Lemma 4, it takes $O(l\log w)$ time units. When $t = k - 1$, $2w$ warps with $w$ threads each are used.
Since each of the warps accesses one row, the contiguous access is performed $O(\log w)$ times. Thus, from Lemma 1, each contiguous access takes $O\!\left(\frac{(2w)^2}{w} + \frac{(2w)^2 l}{2w^2}\right) = O(w + l)$ time units. Hence, the computing time for $t = k - 1$ is $O((w + l)\log w)$.

Let us consider the general case for $t = k - j$ ($0 \le j \le k - 1$). The contiguous access for $(2w)^{j+1}$ numbers is performed by $(2w)^j$ warps, that is, by $(2w)^j w$ threads. Thus, the contiguous access takes $O\!\left(\frac{(2w)^{j+1}}{w} + \frac{(2w)^{j+1} l}{(2w)^j w}\right) = O((2w)^j + l)$ time units. Since the contiguous access is repeated $O(\log w)$ times, the total computing time for $t = k - j$ is $O(((2w)^j + l)\log w)$. Hence, the total computing time of Algorithm Tree is:

$$\sum_{j=0}^{k-1} O\!\left(\left((2w)^j + l\right)\log w\right) = O\!\left(\left(\frac{(2w)^k}{w} + kl\right)\log w\right) = O\!\left(\frac{n\log w}{w} + l\log n\right),$$

from $k = \log n/\log(2w)$. Also, Algorithm Tree performs syncthreads $k = \log n/\log(2w)$ times.

[Figure 7. A summing algorithm based on a $2w$-ary tree, showing groups of $2w$ nodes at the first level and the second level.]

Thus, we have:

Lemma 5: The sum of $n$ numbers can be computed in $O((n\log w)/w + l\log n)$ time units with $O(\log n/\log w)$ barrier synchronization steps using $n/2$ threads on the DMM and on the UMM with width $w$ and latency $l$.

From Theorem 3, the algorithm for this lemma is not time optimal.

VII. A TIME-OPTIMAL ALGORITHM FOR COMPUTING THE SUM USING FEW BARRIER SYNCHRONIZATION STEPS

We can obtain time-optimal summing algorithms at the cost of a few additional barrier synchronization steps. This section is devoted to showing such time-optimal summing algorithms.

Suppose that Algorithm Simple is executed for $t = m-1, m-2, \ldots, m - \log\log w$. It should be clear that the interim sums are stored in $a[0], a[1], \ldots, a[n/\log w - 1]$. After that, the sum of these numbers is computed by Algorithm Tree. The details are spelled out as follows:

[Algorithm Simple-Tree]
for t ← m − 1 downto m − log log w do
begin
  for i ← 0 to 2^t − 1 do in parallel
    T(i) performs a[i] ← a[i] + a[i + 2^t]
  syncthreads
end
Use Algorithm Tree to compute the sum a[0] + a[1] + ... + a[n/log w − 1].

Let us evaluate the computing time. As we have discussed, Algorithm Simple takes $O(2^t/w + l)$ time units for each $t$. Thus, the execution of Algorithm Simple for $t = m-1, m-2, \ldots, m - \log\log w$ takes

$$\sum_{t=m-\log\log w}^{m-1} O\!\left(\frac{2^t}{w} + l\right) = O\!\left(\frac{2^m}{w} + l\log\log w\right) = O\!\left(\frac{n}{w} + l\log\log w\right)$$

time units. Also, it has $\log\log w$ barrier synchronization steps. After that, Algorithm Tree is executed for an input of size $n/\log w$. From Lemma 5, it takes $O\!\left(\frac{n}{\log w} \cdot \frac{\log w}{w} + l\log\frac{n}{\log w}\right) = O(n/w + l\log n)$ time units. Further, it has $O\!\left(\log\frac{n}{\log w} / \log w\right) = O(\log n/\log w)$ synchronization steps. Thus, we have:

Theorem 6: Algorithm Simple-Tree computes the sum of $n$ numbers in $O(n/w + l\log n)$ time units with $O(\log n/\log w + \log\log w)$ barrier synchronization steps using $n/2$ threads on the DMM and on the UMM with width $w$ and latency $l$.

We will show that the number of syncthreads can be independent of the number $n$ of input numbers.
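The two phases of Algorithm Simple-Tree can be emulated sequentially to check the result and count barrier steps. The sketch below is an illustration under stated assumptions, not the paper's implementation: the name `simple_tree` is hypothetical, $\log\log w$ is taken as the floor of $\log_2 \log_2 w$, barriers are counted rather than executed, and the tree phase groups the interim sums in chunks of at most $2w$ so that it also runs when the interim size is not exactly a power of $2w$ (a relaxation of the assumption $n = (2w)^k$).

```python
def simple_tree(a, w):
    """Sequential emulation of Algorithm Simple-Tree for n = 2^m numbers.

    Returns (sum, number of barrier steps).
    """
    n = len(a)
    m = n.bit_length() - 1
    assert n == 1 << m and w >= 2
    a = list(a)  # work on a copy
    log_w = w.bit_length() - 1             # floor(log2 w)
    r = min(log_w.bit_length() - 1, m)     # floor(log2 log2 w), capped at m
    barriers = 0

    # Phase 1: r rounds of Algorithm Simple, one barrier per round.
    for t in range(m - 1, m - 1 - r, -1):
        half = 1 << t
        for i in range(half):              # threads T(0), ..., T(2^t - 1)
            a[i] += a[i + half]
        barriers += 1

    # Phase 2: Algorithm Tree on the n / 2^r interim sums.
    # Each warp sums one group of (at most) 2w values per level.
    level = a[: n >> r]
    while len(level) > 1:
        level = [sum(level[i:i + 2 * w]) for i in range(0, len(level), 2 * w)]
        barriers += 1
    return level[0], barriers


total, barriers = simple_tree(list(range(128)), w=4)
print(total, barriers)  # 8128 3
```

For $n = 128$ and $w = 4$, one round of Algorithm Simple leaves $64 = 8^2$ interim sums, and the 8-ary tree then needs two levels, giving three barrier steps in total.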
Suppose that the input $n$ ($\ge wl$) numbers are stored in an array $a$ of size $\frac{n}{wl} \times wl$ ($\frac{n}{wl}$ rows and $wl$ columns). First, we assign one thread to each column and compute the column-wise sums. After that, the sum of the column-wise sums is computed using Algorithm Simple-Tree. The details are spelled out as follows:

[Algorithm Hybrid]
for i ← 0 to wl − 1 do in parallel
  for t ← 1 to n/(wl) − 1 do
    T(i) performs a[0][i] ← a[0][i] + a[t][i]
Use Algorithm Simple-Tree to compute the sum a[0][0] + a[0][1] + ... + a[0][wl − 1].

Let us evaluate the computing time. The computation of the column-wise sums performs contiguous access. Thus, from Lemma 1 with $p = wl$, it takes $O\!\left(\frac{n}{w} + \frac{nl}{wl}\right) = O(n/w)$ time units. After that, Algorithm Simple-Tree is executed for $wl$ numbers. From Theorem 6, it
10 takes O( l + l log(l)) = O(l log ) time uits usig O( log(l) log l log +loglog) =O( log +loglog) barrier sychroizatio stes. Thus, e have, Theorem 7: Algorithm Hybrid comutes the sum of umbers i O( log l + l log ) time uits ith O( log + log log ) barrier sychroizatio stes usig l threads o the DMM ad o the UMM ith idth ad latecy l. VIII. CONCLUSION The mai cotributio of this aer is to itroduce the asychroous versio of the memory machie models, the DMM ad the UMM. We also reseted time-otimal arallel summig algorithm ruig i O( + l log ) time uits ad O( log l log +loglog) barrier sychroizatio stes. It is a iterestig oe roblem to further reduce the umber of barrier sychroizatio stes of time-otimal arallel summig comutatio. REFERENCES [1] A. V. Aho, J. D. Ullma, ad J. E. Hocroft, Data Structures ad Algorithms. Addiso Wesley, [2] A. Gibbos ad W. Rytter, Efficiet Parallel Algorithms. Cambridge Uiversity Press, [3] A.Grama,G.Karyis,V.Kumar,adA.Guta,Itroductio to Parallel Comutig. Addiso Wesley, [4] M. J. Qui, Parallel Comutig: Theory ad Practice. McGra-Hill, [5] W. W. Hu, GPU Comutig Gems Emerald Editio. Morga Kaufma, [6] Y. Ito, K. Ogaa, ad K. Nakao, Fast ellise detectio algorithm usig Hough trasform o the GPU, i Proc. of Iteratioal Coferece o Netorkig ad Comutig, Dec. 2011, [12] NVIDIA Cororatio, NVIDIA CUDA C rogrammig guide versio 4.0, [13] D. Ma, K. Uda, H. Ueyama, Y. Ito, ad K. Nakao, Imlemetatios of a arallel algorithm for comutig euclidea distace ma i multicore rocessors ad GPUs, Iteratioal Joural of Netorkig ad Comutig, vol. 1, , July [14] NVIDIA Cororatio, NVIDIA CUDA C best ractice guide versio 3.1, 20. [15] K. Nakao, Simle memory machie models for GPUs, i Proc. of Iteratioal Parallel ad Distributed Processig Symosium Workshos, May 2012, [16] M. J. Fly, Some comuter orgaizatios ad their effectiveess, IEEE Trasactios o Comuters, vol. C-21, , [17] K. Nakao, A otimal arallel refix-sums algorithm o the memory machie models for GPUs, i Proc. 
of Iteratioal Coferece o Algorithms ad Architectures for Parallel Processig (ICA3PP, LNCS 7439), Set. 2012, [18] A. Gottlieb, R. Grishma, C. P. Kruskal, K. P. McAuliffe, L. Rudolh, ad M. Sir, The yu ultracomuter desigig a MIMD shared memory arallel comuter, IEEE Tras. o Comuters, vol. C-32, o. 2, , Feb [19] D. H. Larie, Access ad aligmet of data i a array rocessor, IEEE Tras. o Comuters, vol. C-24, o. 12, , Dec [20] S. G. Akl, Parallel Sortig Algorithms. Academic Press, [21] K. E. Batcher, Sortig etorks ad their alicatios, i Proc. AFIPS Srig Joit Comut. Cof., vol. 32, 1968, [7] D. Ma, K. Uda, Y. Ito, ad K. Nakao, A GPU imlemetatio of comutig euclidea distace ma ith efficiet memory access, i Proc. of Iteratioal Coferece o Netorkig ad Comutig, Dec. 2011, [8] A. Uchida, Y. Ito, ad K. Nakao, Fast ad accurate temlate matchig usig ixel rearragemet o the GPU, i Proc. of Iteratioal Coferece o Netorkig ad Comutig, Dec. 2011, [9] K. Ogaa, Y. Ito, ad K. Nakao, Efficiet cay edge detectio usig a gu, i Proc. of Iteratioal Coferece o Netorkig ad Comutig, Nov. 20, [] K. Nishida, Y. Ito, ad K. Nakao, Acceleratig the dyamic rogrammig for the matrix chai roduct o the GPU, i Proc. of Iteratioal Coferece o Netorkig ad Comutig, Dec. 2011, [11], Acceleratig the dyamic rogrammig for the otial oygo triagulatio o the GPU, i Proc. of Iteratioal Coferece o Algorithms ad Architectures for Parallel Processig (ICA3PP, LNCS 7439), Set. 2012,
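The two phases of Algorithm Hybrid can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions: the function name `hybrid_sum` and the flat row-major input layout are hypothetical, the "parallel" loops run one thread at a time, and the second phase is a plain binary-tree reduction with one barrier per level. It therefore counts about $\log_2(wl)$ barriers and does not reproduce the smaller barrier bound of Theorem 7, which depends on the algorithm behind Theorem 6.

```python
def hybrid_sum(values, w, l):
    """Sequentially simulate Algorithm Hybrid; return (total, barrier_count).

    Assumptions of this sketch (not taken from the paper): the input is a
    flat list in row-major order and its length is a positive multiple of w*l.
    """
    p = w * l                      # number of threads = number of columns
    n = len(values)
    if n < p or n % p != 0:
        raise ValueError("this sketch assumes n is a positive multiple of w*l")
    rows = n // p
    # View the input as an (n / wl) x wl array a[t][i].
    a = [list(values[t * p:(t + 1) * p]) for t in range(rows)]

    # Phase 1: thread T(i) accumulates column i into a[0][i].
    # In round t, threads T(0..p-1) read a[t][0..p-1], a contiguous range.
    for t in range(1, rows):
        for i in range(p):         # these iterations run in parallel on the model
            a[0][i] += a[t][i]

    # Phase 2: binary-tree reduction over the wl column sums.
    barriers = 1                   # one barrier separates phase 1 from phase 2
    m = p
    while m > 1:
        half = (m + 1) // 2
        for i in range(m // 2):    # active threads at this tree level
            a[0][i] += a[0][i + half]
        m = half
        barriers += 1              # one barrier after each tree level
    return a[0][0], barriers
```

For example, `hybrid_sum(list(range(32)), 4, 2)` treats the input as a 4 × 8 array handled by 8 simulated threads. No barrier is needed inside phase 1 because each thread only touches its own column.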


More information

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved. Chapter 1 Itroductio to Computers ad C++ Programmig Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Overview 1.1 Computer Systems 1.2 Programmig ad Problem Solvig 1.3 Itroductio to C++ 1.4 Testig

More information

CSC165H1 Worksheet: Tutorial 8 Algorithm analysis (SOLUTIONS)

CSC165H1 Worksheet: Tutorial 8 Algorithm analysis (SOLUTIONS) CSC165H1, Witer 018 Learig Objectives By the ed of this worksheet, you will: Aalyse the ruig time of fuctios cotaiig ested loops. 1. Nested loop variatios. Each of the followig fuctios takes as iput a

More information

Media Access Protocols. Spring 2018 CS 438 Staff, University of Illinois 1

Media Access Protocols. Spring 2018 CS 438 Staff, University of Illinois 1 Media Access Protocols Sprig 2018 CS 438 Staff, Uiversity of Illiois 1 Where are We? you are here 00010001 11001001 00011101 A midterm is here Sprig 2018 CS 438 Staff, Uiversity of Illiois 2 Multiple Access

More information

BASED ON ITERATIVE ERROR-CORRECTION

BASED ON ITERATIVE ERROR-CORRECTION A COHPARISO OF CRYPTAALYTIC PRICIPLES BASED O ITERATIVE ERROR-CORRECTIO Miodrag J. MihaljeviC ad Jova Dj. GoliC Istitute of Applied Mathematics ad Electroics. Belgrade School of Electrical Egieerig. Uiversity

More information

How do we evaluate algorithms?

How do we evaluate algorithms? F2 Readig referece: chapter 2 + slides Algorithm complexity Big O ad big Ω To calculate ruig time Aalysis of recursive Algorithms Next time: Litterature: slides mostly The first Algorithm desig methods:

More information

On Nonblocking Folded-Clos Networks in Computer Communication Environments

On Nonblocking Folded-Clos Networks in Computer Communication Environments O Noblockig Folded-Clos Networks i Computer Commuicatio Eviromets Xi Yua Departmet of Computer Sciece, Florida State Uiversity, Tallahassee, FL 3306 xyua@cs.fsu.edu Abstract Folded-Clos etworks, also referred

More information

why study sorting? Sorting is a classic subject in computer science. There are three reasons for studying sorting algorithms.

why study sorting? Sorting is a classic subject in computer science. There are three reasons for studying sorting algorithms. Chapter 5 Sortig IST311 - CIS65/506 Clevelad State Uiversity Prof. Victor Matos Adapted from: Itroductio to Java Programmig: Comprehesive Versio, Eighth Editio by Y. Daiel Liag why study sortig? Sortig

More information

Lecturers: Sanjam Garg and Prasad Raghavendra Feb 21, Midterm 1 Solutions

Lecturers: Sanjam Garg and Prasad Raghavendra Feb 21, Midterm 1 Solutions U.C. Berkeley CS170 : Algorithms Midterm 1 Solutios Lecturers: Sajam Garg ad Prasad Raghavedra Feb 1, 017 Midterm 1 Solutios 1. (4 poits) For the directed graph below, fid all the strogly coected compoets

More information

Improvement of the Orthogonal Code Convolution Capabilities Using FPGA Implementation

Improvement of the Orthogonal Code Convolution Capabilities Using FPGA Implementation Improvemet of the Orthogoal Code Covolutio Capabilities Usig FPGA Implemetatio Naima Kaabouch, Member, IEEE, Apara Dhirde, Member, IEEE, Saleh Faruque, Member, IEEE Departmet of Electrical Egieerig, Uiversity

More information