Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, with GPU implementations


Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, with GPU implementations

Akihiko Kasagi, Koji Nakano, and Yasuaki Ito
Department of Information Engineering, Hiroshima University
Kagamiyama 1-4-1, Higashi Hiroshima, Japan

Abstract—The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of computing on CUDA-enabled GPUs. The summed area table (SAT) of a matrix is a data structure frequently used in the area of computer vision which can be obtained by computing the column-wise prefix-sums and then the row-wise prefix-sums. The main contribution of this paper is to introduce the asynchronous Hierarchical Memory Machine (asynchronous HMM), which supports asynchronous execution of CUDA blocks, and to show a global-memory-access-optimal parallel algorithm for computing the SAT on the asynchronous HMM. A straightforward algorithm (2R2W SAT algorithm) on the asynchronous HMM, which computes the prefix-sums in every column using one thread each and then computes the prefix-sums in every row, performs 2 read operations and 2 write operations per element of a matrix. The previously published best algorithm (2R1W SAT algorithm) performs 2 read operations and 1 write operation per element. We present a more efficient algorithm (1R1W SAT algorithm) which performs 1 read operation and 1 write operation per element. Clearly, since every element in a matrix must be read at least once, and all resulting values must be written, our 1R1W SAT algorithm is optimal in terms of the global memory access. We also show a combined algorithm ((1 + r)R1W SAT algorithm) of the 2R1W and 1R1W SAT algorithms that may have better performance. We have implemented several algorithms including the 2R2W, 2R1W, 1R1W, and (1 + r)R1W SAT algorithms on a GeForce GTX 780 Ti. The experimental results show that our (1 + r)R1W SAT algorithm runs faster than any other SAT algorithm for large input matrices. Also, it runs more than 10 times faster than the best SAT algorithm using a single CPU.
Keywords—memory machine models, prefix-sums, summed area table, image processing, GPU, CUDA

I. INTRODUCTION

The GPU (Graphics Processing Unit) is a specialized circuit designed to accelerate computation for building and manipulating images [1], [2], [3]. The latest GPUs are designed for general purpose computing and can perform computation in applications traditionally handled by the CPU. Hence, GPUs have recently attracted the attention of many application developers [1]. NVIDIA provides a parallel computing architecture called CUDA (Compute Unified Device Architecture) [4], the computing engine for NVIDIA GPUs. CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in NVIDIA GPUs. CUDA-enabled GPUs have streaming multiprocessors (SMs), each of which executes multiple threads in parallel. CUDA uses two types of memories in NVIDIA GPUs: the shared memory and the global memory [4]. The efficient usage of the shared memory and the global memory is a key for CUDA developers to accelerate applications using GPUs. In particular, we need to consider bank conflicts of shared memory access and coalescing of global memory access [5], [6], [7]. The address space of the shared memory is mapped onto several physical memory banks. If two or more threads access the same memory bank at the same time, the access requests are processed in turn. Hence, to maximize memory access performance, CUDA threads should access distinct memory banks to avoid bank conflicts. To maximize the bandwidth of the global memory, CUDA threads should perform coalesced access. In our previous paper [8], we introduced two models, the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM), which reflect the essential features of the shared memory and the global memory of NVIDIA GPUs. The outline of the architectures of the DMM and the UMM is illustrated in Figure 1.
In both architectures, a sea of threads (Ts) is connected to the memory banks (MBs) through the memory management unit (MMU). Each thread is a Random Access Machine (RAM) [9], which can execute fundamental operations in a time unit. The MBs constitute a single address space of the memory. The address space is mapped onto the MBs in an interleaved way such that the word of data at address i is stored in the (i mod w)-th bank, where w is the number of MBs. The main difference between the two architectures is the connection of the address line between the MMU and the MBs, which can transfer an address value. In the DMM, the address lines connect the MBs and the MMU separately, while in the UMM a single address line from the MMU is connected to all the MBs. Hence, in the UMM, the same address value is broadcast to every MB, and only the same address of the MBs can be accessed in each time unit. On the other hand, different addresses of the MBs can be accessed in the DMM.

Figure 1. The architectures of the DMM and the UMM with w = 4

Quite recently, we introduced the Hierarchical Memory Machine (HMM) [10], which is a hybrid of the DMM and the UMM. The HMM is a more practical parallel computing model that reflects the hierarchical architecture of CUDA-enabled GPUs. Figure 2 illustrates the architecture of the HMM. The HMM consists of d DMMs and a single UMM. Each DMM has w memory banks and the UMM also has w memory banks. We call the memory banks of each DMM the shared memory and those of the UMM the global memory, after CUDA-enabled GPUs. Each DMM can work independently and can perform computation using its shared memory. Also, all threads of the DMMs work as a single UMM and can access the global memory. While the memory access latency of the shared memory of GPUs is very low, that of the global memory is several hundred clock cycles [4]. Hence, we assume that the latency of the shared memory is 1, and we use a parameter l to denote the latency of the global memory in the HMM.

Figure 2. The architecture of the HMM with d = 2 DMMs and width w = 4

Suppose that a matrix a of size √n × √n is given. The summed area table (SAT) [11] is a matrix b of the same size such that

b[i][j] = Σ_{0 ≤ i' ≤ i, 0 ≤ j' ≤ j} a[i'][j'].

The reader should have no difficulty confirming that the SAT can be obtained by computing the column-wise prefix-sums and then the row-wise prefix-sums, as illustrated in Figure 3. Once we have the summed area table, the sum of any rectangular area of a can be computed by evaluating

Σ_{u < i ≤ d, l < j ≤ r} a[i][j] = b[d][r] − b[u][r] − b[d][l] + b[u][l].

Since the sum of any rectangular area can be computed in O(1) time, the summed area table has a lot of applications in the area of image processing and computer vision [12].
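As a concrete illustration of the two formulas above, the following sketch (NumPy-based; the function names are ours, not from the paper) builds the SAT by column-wise and then row-wise prefix-sums, and answers a rectangular-sum query with four table lookups:

```python
import numpy as np

def summed_area_table(a):
    """SAT b with b[i][j] = sum of a[i'][j'] over i' <= i, j' <= j,
    obtained by column-wise prefix-sums followed by row-wise prefix-sums."""
    return np.cumsum(np.cumsum(a, axis=0), axis=1)

def region_sum(b, u, d, l, r):
    """Sum of a[i][j] over u < i <= d and l < j <= r via four SAT lookups.
    Pass u = -1 or l = -1 to include row 0 or column 0."""
    total = b[d][r]
    if u >= 0:
        total -= b[u][r]
    if l >= 0:
        total -= b[d][l]
    if u >= 0 and l >= 0:
        total += b[u][l]
    return total
```

The O(1) query is the reason the SAT is worth precomputing: after one pass over the matrix, every rectangle costs four reads.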
In our previous paper [13], we presented a parallel algorithm that computes the SAT in O(n/w + nl/p + l log n) time units using p threads on the UMM with width w and latency l. This algorithm is optimal in the sense that any SAT algorithm takes at least Ω(n/w + nl/p + l log n) time units. However, this algorithm repeats pairwise addition, has a large constant factor in the computing time, and is not practically efficient. The main contribution of this paper is to introduce the asynchronous Hierarchical Memory Machine (asynchronous HMM), which supports asynchronous execution of CUDA blocks, and to show a global-memory-access-optimal parallel algorithm for computing the summed area table stored in the global memory of the asynchronous HMM. A straightforward algorithm (2R2W SAT algorithm) on the asynchronous HMM, which computes the column-wise prefix-sums and then the row-wise prefix-sums, performs 2 read operations and 2 write operations per element of a matrix. The best known algorithm (2R1W SAT algorithm) so far performs 2 read operations and 1 write operation per element [14]. We present a more efficient algorithm (1R1W SAT algorithm) on the asynchronous HMM, which performs only 1 read operation and 1 write operation per element. Clearly, since every element in a matrix must be read at least once and all resulting values must be written, our 1R1W SAT algorithm is optimal in terms of the global memory access. We also show a combined algorithm ((1 + r)R1W SAT algorithm) of the 2R1W and 1R1W SAT algorithms that can run faster than any other algorithm for large matrices. Table I shows the total number of memory access operations to the global memory and the shared memory, the number of barrier synchronization steps, and the global memory access cost of each algorithm. The global memory access cost, which is computed from the number of global memory access operations and the number of barrier synchronization steps, approximates the computing time on the HMM. For simplicity, in the table, we omit small terms to focus on dominant terms.
For example, the 2R2W SAT algorithm performs n − √n coalesced write operations, but we simply write n in the corresponding entry.

Figure 3. The summed area table (SAT) of a 9 × 9 matrix and the 2R2W SAT algorithm

Table I
THE PERFORMANCE OF SAT ALGORITHMS ON THE HMM

SAT algorithm | coalesced R/W | stride R/W | shared R/W | barrier sync. steps | global memory access cost
2R2W          | n / n         | n / n      | -          | 1                   | 2n + 2n/w + 2l
4R4W          | 4n / 4n       | -          | n / n      | 3                   | 8n/w + 4l
4R1W          | -             | 4n / n     | -          | 2√n                 | 5n + 2√n·l
2R1W          | 2n / n        | -          | 4n / 4n    | 2k + 2              | 3n/w + (2k + 3)l
1R1W          | n / n         | -          | 2n / n     | 2√(n/w) − 2         | 2n/w + 2√(n/w)·l
1.25R1W       | 1.25n / n     | -          | 2.5n / n   | √(n/w) + 4k + 4     | 2.25n/w + (√(n/w) + 4k + 5)l
(1 + r)R1W    | (1 + r)n / n  | -          | (2 + r)n / (2 + r)n | 2(1 − √r)√(n/w) + 4k + 4 | (2 + r)n/w + (2(1 − √r)√(n/w) + 4k + 5)l

Here k is the depth of recursion, which takes a value no more than 1 from the practical point of view, and r can take any value in the range [0, 1].

We have also implemented several algorithms including the 2R2W, 2R1W, 1R1W, and (1 + r)R1W SAT algorithms on a GeForce GTX 780 Ti. The experimental results show that our (1 + r)R1W SAT algorithm runs faster than any other SAT algorithm for large input matrices. Further, it runs more than 10 times faster than the best SAT algorithm using a single CPU.

II. THE DMM, THE UMM, THE HMM, AND THE ASYNCHRONOUS HMM

We first define the Discrete Memory Machine (DMM) of width w and latency l. Let B[j] = {j, j + w, j + 2w, j + 3w, ...} (0 ≤ j ≤ w − 1) be the set of addresses of the j-th memory bank of the memory. In other words, address i is in the (i mod w)-th memory bank. We assume that addresses in different banks can be accessed in a time unit, but no two addresses in the same bank can be accessed in a time unit. Also, we assume that l time units are necessary to complete an access request, and continuous requests are processed in a pipeline fashion. Thus, it takes k + l − 1 time units to complete memory access requests to k addresses in a particular bank. We assume that threads are partitioned into groups of w threads called warps.
More specifically, p threads T(0), T(1), ..., T(p − 1) are partitioned into p/w warps W(0), W(1), ..., W(p/w − 1) such that W(i) = {T(i·w), T(i·w + 1), ..., T((i + 1)·w − 1)} (0 ≤ i ≤ p/w − 1). Warps are dispatched for memory access in turn, and the threads in a warp try to access the memory at the same time. In other words, W(0), W(1), ..., W(p/w − 1) are dispatched in a round-robin manner if at least one thread in a warp requests memory access. If no thread in a warp needs memory access, that warp is not dispatched for memory access. When W(i) is dispatched, the threads in W(i) send memory access requests, at most one request per thread, to the memory. We also assume that a thread cannot send a new memory access request until its previous memory access request is completed. Hence, if a thread sends a memory access request, it must wait at least l time units to send a new memory access request.

We next define the Unified Memory Machine (UMM) of width w and latency l as follows. Let A[j] = {j·w, j·w + 1, ..., (j + 1)·w − 1} denote the set of addresses in the j-th address group. We assume that addresses in the same address group are processed at the same time. However, if they are in different groups, one time unit is necessary for each of the groups. Also, similarly to the DMM, threads are partitioned into warps and each warp accesses the memory in turn.

Figure 4 shows examples of memory access on the DMM and the UMM. We assume that each memory access request is completed when it reaches the last pipeline stage. Two warps W(0) and W(1) access ⟨7, 5, 15, 0⟩ and ⟨10, 11, 12, 9⟩, respectively. In the DMM, the memory access requests by W(0) are separated into two pipeline stages, because addresses 7 and 15 are in the same bank B[3]. Those by W(1) occupy one stage, because all requests are in distinct banks. Thus, the memory requests occupy three stages, and it takes 3 + 5 − 1 = 7 time units to complete the memory access. In the UMM, the memory access requests by W(0) are destined for three address groups. Hence the memory requests occupy three stages. Similarly, those by W(1) occupy two stages. Hence, it takes 5 + 5 − 1 = 9 time units to complete the memory access.

Figure 4. Examples of memory access on the DMM (banks B[0]–B[3]) and the UMM (address groups A[0]–A[3]) through l-stage pipeline registers

Next, we define the Hierarchical Memory Machine (HMM) [10]. The HMM consists of d DMMs and a single UMM as illustrated in Figure 2. Each DMM has w memory banks and the UMM also has w memory banks. We call the memory banks of each DMM the shared memory and those of the UMM the global memory. Each DMM works independently. Threads are partitioned into warps of w threads, and each warp is dispatched for memory access to the shared memory in turn. Further, each warp of threads in all DMMs can send memory access requests to the global memory. Figure 2 illustrates the architecture of the HMM with d = 2 DMMs. Each DMM and the UMM has w = 4 memory banks. The shared memory of each DMM and the global memory of the UMM correspond to the shared memory of each streaming multiprocessor and the global memory of GPUs. We also assume that the shared memory in each DMM of the HMM can store up to O(w²) numbers. The capacity of the shared memory of the latest CUDA-enabled GPUs is up to 48KBytes and the number of banks is 32 [4].
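The two examples above can be replayed with a short script (our own illustration, with w = 4 and latency l = 5 as in the figure): it counts the pipeline stages each warp occupies on the DMM (requests to the same bank serialize) and on the UMM (each distinct address group costs one stage), reproducing the 3 + 5 − 1 = 7 and 5 + 5 − 1 = 9 time units quoted in the text:

```python
from collections import Counter

w, l = 4, 5  # width (number of banks / address-group size) and latency

def dmm_stages(warp):
    # a warp occupies as many stages as the worst per-bank collision count
    return max(Counter(addr % w for addr in warp).values())

def umm_stages(warp):
    # a warp occupies one stage per distinct address group floor(addr / w)
    return len({addr // w for addr in warp})

w0, w1 = [7, 5, 15, 0], [10, 11, 12, 9]
dmm_time = dmm_stages(w0) + dmm_stages(w1) + l - 1  # pipelined stages
umm_time = umm_stages(w0) + umm_stages(w1) + l - 1
```

The same addresses thus cost 7 time units on the DMM but 9 on the UMM, which is exactly the asymmetry the two models are meant to capture.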
Since an array of 32² double (64-bit) numbers occupies 8KBytes, each shared memory can store at most 6 such matrices. Thus, it is reasonable to assume that a DMM can store O(w²) numbers in the shared memory. For a more realistic model of GPUs, we introduce the asynchronous Hierarchical Memory Machine (asynchronous HMM). In the asynchronous HMM, the DMMs work asynchronously in the sense that some DMMs may work slightly slower or faster than the others. Instead, all threads in all DMMs can execute a barrier synchronization instruction. If a thread in a DMM executes the barrier synchronization instruction, it must wait until all the other threads in all DMMs execute it. Also, after all threads execute the barrier synchronization instruction, all DMMs are reset, that is, the shared memory of every DMM is initialized and all data stored in it are lost. The reader may think that this reset assumption is not reasonable. However, this assumption is mandatory for program scalability over the DMMs in the HMM. More specifically, suppose that a programmer writes a program for the HMM with d DMMs. It may be possible to execute this program on an HMM with d' DMMs such that d' < d. If this is the case, the program of the first d' DMMs is executed until a barrier synchronization instruction is executed. After that, the d' DMMs are reset and the program of the next d' DMMs is executed. The same procedure is repeated until all threads have executed the barrier synchronization instruction. Hence, it makes sense to assume that all DMMs are reset after each barrier synchronization step. The DMMs are responsible for copying the data stored in the shared memory to the global memory before a barrier synchronization if the data are used after the synchronization. Actually, we need to terminate a CUDA kernel call on a GPU when barrier synchronization of all threads is necessary [4]. When a CUDA kernel call is terminated for barrier synchronization, the data stored in the shared memory by a CUDA block are lost.
This is because CUDA blocks are executed on streaming multiprocessors with small shared memory, one by one in turn.

III. THE GLOBAL MEMORY ACCESS COST ON THE HMM AND THE DIAGONAL ARRANGEMENT ON THE DMM

Let C, S, and B be the total number of coalesced global memory access operations, the total number of stride global memory access operations, and the number of barrier synchronization steps performed on the HMM. The global memory access cost is defined to be C/w + S + (B + 1)(l − 1). We will show that the global memory access cost approximates the computing time on the HMM if the computation performed in each DMM is negligible.
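The cost measure can be written as a one-line helper (our own, not from the paper), which is convenient for sanity-checking the lemmas that follow:

```python
def global_memory_access_cost(C, S, B, w, l):
    """HMM global memory access cost: w coalesced accesses share one pipeline
    stage, each stride access occupies its own stage, and every section
    between barriers (B barriers give B + 1 sections) pays the pipeline
    drain of l - 1 time units."""
    return C / w + S + (B + 1) * (l - 1)
```

For instance, the 2R2W algorithm of Section IV with C = S = 2n and B = 1 gives 2n/w + 2n + 2(l − 1), matching Lemma 2.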

Figure 5. Timing chart of coalesced memory access to the global memory with two barrier synchronization steps

Suppose that an algorithm performs n coalesced memory access operations and two barrier synchronization steps. Clearly, by the two barrier synchronization steps, the memory access is partitioned into three sections as illustrated in Figure 5. Let n₀, n₁, and n₂ such that n = n₀ + n₁ + n₂ be the numbers of memory access operations performed in the three sections. Since the w coalesced memory access operations by a warp of w threads occupy one pipeline stage, the three sections take n₀/w + l − 1, n₁/w + l − 1, and n₂/w + l − 1 time units, respectively. Hence, the algorithm runs in n/w + 3(l − 1) time units. Also, if the n memory access operations are stride, each memory access operation occupies one pipeline stage, and the algorithm runs in n + 3(l − 1) time units. In general, if an algorithm performs C coalesced memory access operations, S stride memory access operations, and B barrier synchronization steps, it runs in C/w + S + (B + 1)(l − 1) time units, which is equal to the global memory access cost.

Suppose that we have a matrix of size w × w in the shared memory of a DMM in the HMM. Since a column of the matrix lies in a single bank, column-wise access by w threads has bank conflicts, while row-wise access is conflict-free. In our previous paper [15], we presented a diagonal arrangement of a matrix such that each (i, j) element is arranged in a[i][(i + j) mod w]. Figure 6 illustrates the diagonal arrangement of a 4 × 4 matrix. We can confirm that both a row-wise access to (1, 0), (1, 1), (1, 2), (1, 3) and a column-wise access to (0, 1), (1, 1), (2, 1), (3, 1) are conflict-free. Thus, we have:

Lemma 1: In the diagonal arrangement of a w × w matrix, both a row-wise access and a column-wise access are conflict-free.

The diagonal arrangement is used to compute the SAT of a matrix in a shared memory and the transpose of a matrix in the global memory of the HMM.

IV. 2R2W AND 4R4W SAT ALGORITHMS

Let s_i denote a local register of thread T(i) (0 ≤ i ≤ √n − 1). As illustrated in Figure 3, the summed area table (SAT) of a √n × √n matrix a can be computed by the column-wise prefix-sums and then the row-wise prefix-sums:

[2R2W SAT algorithm]
for i ← 0 to √n − 1 do in parallel // column-wise prefix-sums
  T(i) performs s_i ← a[0][i]
  for j ← 1 to √n − 1 do
    T(i) performs s_i ← s_i + a[j][i]
    T(i) performs a[j][i] ← s_i
barrier synchronization
for i ← 0 to √n − 1 do in parallel // row-wise prefix-sums
  T(i) performs s_i ← a[i][0]
  for j ← 1 to √n − 1 do
    T(i) performs s_i ← s_i + a[i][j]
    T(i) performs a[i][j] ← s_i

Figure 6. Diagonal arrangement of a 4 × 4 matrix

In the computation of the column-wise prefix-sums, a[0][0], a[0][1], ..., a[0][√n − 1] are read. After that, for each j (1 ≤ j ≤ √n − 1), a[j][0], a[j][1], ..., a[j][√n − 1] are read and written. Clearly, memory access to these elements is coalesced. In the computation of the row-wise prefix-sums, a[0][0], a[1][0], ..., a[√n − 1][0] are read. After that, for each j (1 ≤ j ≤ √n − 1), a[0][j], a[1][j], ..., a[√n − 1][j] are read and written. Memory access to these elements is stride. Hence, the 2R2W SAT algorithm performs 2n − √n coalesced memory access operations and 2n − √n stride memory access operations. Since the 2R2W SAT algorithm has one barrier synchronization step, we have:

Lemma 2: The global memory access cost of the 2R2W SAT algorithm is at most 2n/w + 2n + 2(l − 1).

We can avoid stride memory access when we compute the row-wise prefix-sums by transposing the matrix. More specifically, the row-wise prefix-sums can be obtained by transpose, column-wise prefix-sums, and transpose. It has been shown in [16] that the transpose of a matrix of size √n × √n in the global memory of the HMM can be done with 2n coalesced memory access operations and no barrier synchronization step. The idea of the transpose is to partition the matrix into √(n/w) × √(n/w) blocks of w elements each. We can transpose a block efficiently via a matrix with diagonal arrangement in a shared memory of a DMM.
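Lemma 1's conflict-freedom is easy to check mechanically (a small script of ours): under the diagonal arrangement, element (i, j) sits in bank (i + j) mod w, so the w elements of any one row and of any one column fall into w distinct banks:

```python
w = 4  # number of memory banks

def bank(i, j):
    # diagonal arrangement: element (i, j) is stored in a[i][(i + j) % w],
    # and for a row-major w x w array the column index is the bank number
    return (i + j) % w

row_banks = [bank(1, j) for j in range(w)]  # row-wise access by w threads
col_banks = [bank(i, 1) for i in range(w)]  # column-wise access by w threads
```

Both access patterns touch every bank exactly once, so neither serializes.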
First, a block in the global memory is read row-wise and written into a matrix with diagonal arrangement row-wise. After that, the matrix is read column-wise and written into a block in the global memory row-wise. The reader should refer to Figure 7, which illustrates the transpose of a block using a 4 × 4 matrix with diagonal arrangement. By executing this block transpose for all blocks in parallel so that each pair of corresponding blocks is swapped appropriately, the transpose of a √n × √n matrix can be done. The reader should refer to [16] for the details of the transpose.

Figure 7. Transpose of a block using a 4 × 4 matrix with diagonal arrangement

By executing the matrix transpose twice, we can design the 4R4W SAT algorithm as follows:

[4R4W SAT algorithm]
Step 1: Compute the column-wise prefix-sums
Step 2: Transpose
Step 3: Compute the column-wise prefix-sums
Step 4: Transpose

After Steps 1, 2, and 3, barrier synchronization is necessary. Also, each step needs no more than 2n coalesced global memory accesses. Thus, we have:

Lemma 3: The global memory access cost of the 4R4W SAT algorithm is at most 8n/w + 4(l − 1).

V. 2R1W SAT ALGORITHM

The main purpose of this section is to review a SAT algorithm for GPUs shown in [14]. Since this SAT algorithm performs 2n read and n write operations to the global memory, we call it the 2R1W SAT algorithm. The 2R1W SAT algorithm that we will explain is slightly different from that in [14], for easier understanding. Suppose that a √n × √n matrix a is partitioned into n/w blocks of √w × √w elements each. The 2R1W SAT algorithm has three steps as follows:

[2R1W SAT algorithm]
Step 1: Each DMM reads a block from the global memory and writes it into the shared memory. The column-wise sums, the row-wise sums, and the sum of the block are computed. More specifically, for a block a' of size √w × √w,
- column-wise sums: C[i] = Σ_{0 ≤ j ≤ √w − 1} a'[j][i] for all i (0 ≤ i ≤ √w − 1),
- row-wise sums: R[i] = Σ_{0 ≤ j ≤ √w − 1} a'[i][j] for all i (0 ≤ i ≤ √w − 1), and
- sum: S = Σ_{0 ≤ i ≤ √w − 1} R[i]
are computed. The column-wise sums of all blocks excluding the bottom blocks are written into the global memory such that they constitute a matrix of size (√(n/w) − 1) × √n. Similarly, the row-wise sums of all blocks excluding the rightmost blocks constitute a matrix of size √n × (√(n/w) − 1), and the sums constitute a matrix of size (√(n/w) − 1) × (√(n/w) − 1).
The reader should refer to Figure 8, which illustrates the resulting values for the matrix in Figure 3 with √w = 3. Let C, R, and S denote the matrices of the resulting values for the column-wise sums, the row-wise sums, and the sums, respectively. In Figure 8, the sizes of C, R, and S are 2 × 9, 9 × 2, and 2 × 2, respectively.
Step 2: The column-wise prefix-sums of C and the row-wise prefix-sums of R are computed in the same way as in the 2R2W SAT algorithm. If S is no larger than w × w, then we compute the SAT of S using a single DMM. Otherwise, we execute the 2R1W SAT algorithm recursively for S. The reader should refer to Figure 8 for the resulting values of C, R, and S.
Step 3-1: Each DMM reads a block from the global memory and writes it into the shared memory. Let a' denote the block read by a particular DMM. It reads √w elements of C and adds them to the top row of a' so that each of the resulting sums is the sum of all elements above it, inclusive, in the same column. Similarly, it reads √w elements of R and adds them to the leftmost column of a' so that each of the resulting sums is the sum of all elements to the left of it, inclusive, in the same row. Further, an element of S is added to the top left corner of a' so that the resulting sum is the sum of all blocks above and to the left of it, inclusive. The reader should refer to Figure 9, which illustrates these operations for a block. Also, Figure 8 illustrates the resulting values of all blocks.
Step 3-2: Each DMM computes the SAT of the block obtained in Step 3-1, and the resulting values are written into the global memory. Figure 9 illustrates the values of a block before and after this step. The reader should have no difficulty confirming that the block thus obtained stores the SAT of the input matrix correctly.

Figure 9. Step 3 of the 2R1W SAT algorithm

Let us evaluate the global memory access cost. In Step 1, all elements of a are read, and C, R, and S are written. Thus, n elements are read from the global memory and fewer than 2n/√w + n/w elements are written into the global memory.

Figure 8. The 2R1W SAT algorithm executed for the matrix in Figure 3 with √w = 3: the column-wise sums C, the row-wise sums R, and the sums S after Steps 1, 2, and 3

Note that we should write Rᵀ, that is, the transpose of R, into the global memory so that the row-wise prefix-sums computation in Step 2 uses coalesced memory access. If this is the case, the row-wise prefix-sums of R correspond to the column-wise prefix-sums of Rᵀ, which can be computed by coalesced memory access to the global memory. Also, in Step 1, the column-wise sums, the row-wise sums, and the SAT of a block in a shared memory are computed. This computation can be done without bank conflicts by the diagonal arrangement of a block. Each of the d DMMs performs the computation of the column-wise sums for n/(wd) blocks of size √w × √w. Since the memory access is conflict-free, this takes only O(n/(√w·d)) time units, which is so small that it can be hidden by the latency overhead. Step 2 performs the computation of the column-wise prefix-sums and the row-wise prefix-sums for matrices of about n/√w elements each. It also computes the SAT of S, a matrix of about n/w elements, recursively, which performs global memory access to O(n/w) elements. In Step 3-1, all elements of a and the resulting values of C, R, and S are read from the global memory. In Step 3-2, the resulting SAT is written into the global memory. Thus, the 2R1W SAT algorithm performs at most 3n + 8n/√w + O(n/w) coalesced memory access operations. This includes the memory access by the recursive computation for S. Let us evaluate the number of barrier synchronization steps. A barrier synchronization step is necessary after each of Steps 1 and 2 if the SAT of S is computed without recursion. If the SAT of S is computed recursively, two additional barrier synchronization steps are necessary for each level of recursion. Hence, if the 2R1W SAT algorithm involves depth-k recursion, it performs 2k + 2 barrier synchronization steps. Thus, we have:

Lemma 4: The global memory access cost of the 2R1W SAT algorithm with recursion depth k is 3n/w + 8n/(w√w) + O(n/w²) + (2k + 3)(l − 1).
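The three steps of the 2R1W algorithm (without the recursion on S, i.e., for the case k = 0) can be prototyped sequentially as follows; this NumPy sketch is our own, with bw standing in for the block side √w:

```python
import numpy as np

def sat_2r1w(a, bw):
    """Sequential prototype of the three steps of the 2R1W SAT algorithm
    (no recursion on S); bw is the block side, playing the role of sqrt(w)."""
    nb = a.shape[0] // bw                # number of blocks per row/column
    # Step 1: per-block column-wise sums C, row-wise sums R, block sums S
    blocks = a.reshape(nb, bw, nb, bw).swapaxes(1, 2)   # [bi, bj, i, j]
    C = blocks.sum(axis=2)               # column sums of each block
    R = blocks.sum(axis=3)               # row sums of each block
    S = blocks.sum(axis=(2, 3))          # total sum of each block
    # Step 2: prefix-sums of C down block-columns, of R along block-rows,
    # and the SAT of the small matrix S
    Cp = np.cumsum(C, axis=0)
    Rp = np.cumsum(R, axis=1)
    Sp = np.cumsum(np.cumsum(S, axis=0), axis=1)
    # Step 3: fold the sums of the blocks above / to the left into each
    # block's top row, leftmost column and top-left corner (Step 3-1),
    # then compute the SAT of the block locally (Step 3-2)
    out = np.empty_like(a)
    for bi in range(nb):
        for bj in range(nb):
            blk = blocks[bi, bj].copy()
            if bi > 0:
                blk[0, :] += Cp[bi - 1, bj]
            if bj > 0:
                blk[:, 0] += Rp[bi, bj - 1]
            if bi > 0 and bj > 0:
                blk[0, 0] += Sp[bi - 1, bj - 1]
            sat = np.cumsum(np.cumsum(blk, axis=0), axis=1)
            out[bi*bw:(bi+1)*bw, bj*bw:(bj+1)*bw] = sat
    return out
```

The correctness argument mirrors Step 3-1: the blocks above a given block are accounted for by C, the blocks to its left by R, and the blocks above-left by S, so the local block SAT completes the global one.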
Since w = 32 in current GPUs, k = 0 if n ≤ 2^20 and k = 1 if n ≤ 2^30. Thus, k is no more than 1 from the practical point of view.

VI. OUR 1R1W SAT ALGORITHM

The main purpose of this section is to show our novel SAT algorithm called the 1R1W SAT algorithm. This algorithm performs only n + O(n/√w) read operations and n + O(n/√w) write operations to the global memory. Before showing the 1R1W SAT algorithm, we present the 4R1W SAT algorithm. By combining techniques used in the 4R1W and 2R1W SAT algorithms, we can obtain the 1R1W SAT algorithm. Let b be the SAT of an input √n × √n matrix a. Suppose that the values of b[i − 1][j − 1], b[i][j − 1], and b[i − 1][j] are already computed. We can obtain the value of b[i][j] by evaluating the following formula:

b[i][j] = a[i][j] + b[i][j − 1] + b[i − 1][j] − b[i − 1][j − 1]   (1)

Using this formula, the 4R1W SAT algorithm computes the SAT in a diagonal scan order from the top left to the bottom right. More specifically, the 4R1W algorithm has 2√n − 1 stages, and each Stage k (0 ≤ k ≤ 2√n − 2) computes Formula (1) for all i and j such that i + j = k. Figure 10 illustrates the computation performed in Stage 7. Let us evaluate the performance of the 4R1W SAT algorithm. To compute each b[i][j], 3 elements of b and 1 element of a are read. Also, the resulting value is written into b. Thus, 4n read operations and n write operations are performed. Unfortunately, all memory accesses are stride. Further, a barrier synchronization step is necessary after every stage from Stage 0 to Stage 2√n − 3. Thus, we have:

Lemma 5: The global memory access cost of the 4R1W SAT algorithm is 5n + (2√n − 1)l.

Figure 10. Stage 7 of the 4R1W SAT algorithm

We are now in a position to show our new 1R1W SAT algorithm. The idea is to extend the 4R1W SAT algorithm to perform the SAT computation block-wise. In each block-wise computation, a computation similar to the 2R1W SAT algorithm is performed. Again, an input matrix a of size √n × √n is partitioned into √(n/w) × √(n/w) blocks of size √w × √w each. Let A(i, j) (0 ≤ i, j ≤ √(n/w) − 1) denote the block in the i-th row and the j-th column. The 1R1W SAT algorithm has 2√(n/w) − 1 stages. Each Stage k (0 ≤ k ≤ 2√(n/w) − 2) computes the SAT of the blocks A(i, j) with i + j = k. The reader should refer to Figure 11, which illustrates the computation performed in Stage 3. In Stage 3, A(2, 1) and A(1, 2) are computed using the resulting values in A(2, 0), A(1, 1), and A(0, 2). From these resulting values, we can obtain C, R, and S for each of A(2, 1) and A(1, 2). For example, in Figure 11, C for A(2, 1) is (9, 11, 9). This can be obtained from the resulting value 13 at the bottom right corner of A(1, 0) and the resulting values (22, 33, 42) in the bottom row of A(1, 1). More specifically, we can obtain C by computing the pairwise subtraction (22, 33, 42) − (13, 22, 33) = (9, 11, 9). After the values of C, R, and S are obtained, these values are added to A(2, 1) in the same way as Step 3-1 of the 2R1W SAT algorithm. Finally, we compute the SAT of A(2, 1) in the same way as Step 3-2 of the 2R1W SAT algorithm. Let us evaluate the performance of the 1R1W SAT algorithm. In each stage, the values of a block in the global memory are read and the resulting values are written into the global memory. Also, the values necessary to compute C, R, and S are read from and written into the global memory. For each block, 2√w + 1 extra elements are accessed for this task. Since we have n/w blocks, n + (2√w + 1)·n/w elements are read and n + (2√w + 1)·n/w elements are written over all stages.
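Before turning to the cost, Formula (1) and the diagonal scan order of the 4R1W algorithm can be expressed directly (again a sequential sketch of ours; on the real machine, all cells of a stage are updated in parallel between barriers):

```python
import numpy as np

def sat_4r1w(a):
    sn = a.shape[0]  # the matrix is sn x sn
    b = np.zeros_like(a)
    # Stage k handles every (i, j) on the anti-diagonal i + j = k, using
    # Formula (1): b[i][j] = a[i][j] + b[i][j-1] + b[i-1][j] - b[i-1][j-1];
    # all three b-values it needs were produced in earlier stages.
    for k in range(2 * sn - 1):
        for i in range(max(0, k - sn + 1), min(k, sn - 1) + 1):
            j = k - i
            b[i][j] = (a[i][j]
                       + (b[i][j - 1] if j > 0 else 0)
                       + (b[i - 1][j] if i > 0 else 0)
                       - (b[i - 1][j - 1] if i > 0 and j > 0 else 0))
    return b
```

The 1R1W algorithm applies the same wavefront, but to whole blocks rather than single elements, which is what shrinks the number of stages from 2√n − 1 to 2√(n/w) − 1.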
Barrier synchronization is necessary after each of Stages from to n 2 3. Thus,ehave, Theorem 6: The global memory access cost of 4R1W C R S Figure 11. Stage 3 of 1R1W SAT algorithm for a matrix in Figure 3 SAT algorithm is 2 n + O( n n 2 )+(2 2)l. VII. (1 + r)r1w SAT ALGORITHM The main urose of this section is to accelerate the SAT comutation further by combining 1R1W and 2R1W SAT algorithms. The idea of further acceleration is to use 2R1W SAT algorithm in early and late stages of 1R1W SAT algorithms to reduce the latency overhead. Again, suose that a n n matrix a is artitioned into n n blocks of size each. As illustrated in Figure 12, for any fixed arameter r ( < r < 1), e artition blocks into (A) to left triangle, (B) bottom right triangle, and (C) remaining blocks. Clearly, (A) and (B) have r n +( r n 1)+ +1 = r n ( r n +1)=2 ß rn 2 2 blocks each. We first use 2R1W SAT algorithm to comute the SAT of (A). After that, e use 1R1W SAT algorithm for (C). Finally, 2R1W SAT algorithm is used for the comutation of the SAT in (B). Let us evaluate the erformance. Since (A) and (B) have aroximately rn elements in rn blocks each, 2R1W SAT algorithm for (A) and (B) erforms rn + O( rn ) read oerations and rn 2 + O( rn ) rite oerations each. Also, since (C) has (1 r)n elements, 1R1W SAT algorithm for (C) erforms (1 r)n + O( (1 r)n ) read oerations and (1 r)n + O( (1 r)n ) rite oerations. Hence, this SAT algorithm erforms (1 + r)n + O( n ) read oerations and n+o( n ) rite oerations. Thus, e call this SAT algorithm (1 + r)r1w SAT algorithm. Further, 2R1W SAT algorithm for (A) and (B) needs 2+2k barrier synchronization stes each, here k is the deth of the recursion of 2R1W SAT algorithm. Since 1R1W SAT algorithm for (C) has 2 (1 r) n 1 stages, it needs 2 (1 r) n 2 barrier synchronization stes. Also, after the comutation of the SAT for (A) and (B), 1 barrier synchronization ste each

is necessary. In total, (1 + r)R1W SAT algorithm executes 2(1 − √r)√n/w + 4 + 4k barrier synchronization steps. Thus, we have,

Theorem 7: The global memory access cost of (1 + r)R1W SAT algorithm is (2 + r)n + O(n/w) + (2(1 − √r)√n/w + 5 + 4k)l.

Figure 12. Partition of a √n × √n matrix into (A) the top-left triangle (2R1W), (C) the remaining blocks (1R1W), and (B) the bottom-right triangle (2R1W) for (1 + r)R1W SAT algorithm

When r = 0.25, the global memory access cost of 1.25R1W SAT algorithm is 2.25n + (√n/w + 5 + 4k)l time units. Since 2R1W and 1R1W SAT algorithms run in approximately 3n + (3 + 2k)l and 2n + (2√n/w)l time units, respectively, 1.25R1W SAT algorithm may run faster than these algorithms. Further, we can select the best value of r that minimizes the running time of (1 + r)R1W SAT algorithm.

VIII. EXPERIMENTAL RESULTS

We have implemented all SAT algorithms presented so far in this paper on GeForce GTX 780 Ti. Since the number of memory banks and the number of threads in a warp are both 32 [4], we have implemented the SAT algorithms with w = 32. Barrier synchronization of all threads is implemented by invoking separate CUDA kernel calls. For example, one CUDA kernel call is invoked for each of the 2√n − 1 stages of 4R1W SAT algorithm. We have tested several configurations in terms of the number of threads in a CUDA block, and selected the best configuration. For example, in 2R2W, 4R4W, and 4R1W SAT algorithms, CUDA blocks with 64 threads each are invoked. Since each Stage i and each Stage 2√n − 2 − i (0 ≤ i ≤ √n − 1) of 4R1W SAT algorithm computes i + 1 values of the SAT, it uses i + 1 threads in ⌈(i + 1)/64⌉ CUDA blocks. In 2R1W and 1R1W SAT algorithms, a block of the matrix is copied to the shared memory in a streaming multiprocessor, and the column-wise sums, the row-wise sums, and/or the SAT of it are computed. After that, the resulting values are copied to the global memory. For this operation, we use one CUDA block with 128 threads for each block. All 128 threads in a CUDA block are used to copy a block to the shared memory, and 32 threads out of the 128 are used to compute the column-wise sums, the row-wise sums, and/or the SAT.
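The triangular partition of Figure 12 can be sketched by classifying blocks by their anti-diagonal index. This is a small illustration with hypothetical sizes, not code from the paper; `sn_w` denotes √n/w (the matrix width in blocks) and `t` denotes √(rn)/w (the triangle side in blocks):

```python
def partition_blocks(sn_w, t):
    """Split the block indices (i, j) of an sn_w x sn_w grid into the
    top-left triangle (A), the bottom-right triangle (B), and the
    remaining blocks (C), using the block anti-diagonal index i + j."""
    A, B, C = [], [], []
    for i in range(sn_w):
        for j in range(sn_w):
            if i + j < t:                    # first t anti-diagonals
                A.append((i, j))
            elif i + j >= 2 * sn_w - 1 - t:  # last t anti-diagonals
                B.append((i, j))
            else:
                C.append((i, j))
    return A, B, C

# Example: an 8 x 8 grid of blocks with triangle side t = 2,
# i.e. r = (2w / sqrt(n))^2.  (A) and (B) get t(t+1)/2 = 3 blocks each.
A, B, C = partition_blocks(8, 2)
print(len(A), len(B), len(C))  # 3 3 58
```

The count len(A) = t(t + 1)/2 matches the block count √(rn)/w · (√(rn)/w + 1)/2 derived above.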
Table II shows the running time of the SAT algorithms for a double (64-bit) matrix of size from 1K × 1K to 18K × 18K. Since an 18K × 18K 64-bit matrix uses 2.53GBytes, it is hard to store a larger matrix in the global memory of GeForce GTX 780 Ti, whose size is 3GBytes. The running time of the best SAT algorithm for each value of n is highlighted in boldface. Since 4R1W SAT algorithm performs a lot of kernel calls and stride memory accesses, and has a large memory access latency overhead, it needs much more computing time than the other algorithms. Recall that 4R4W SAT algorithm corresponds to 2R2W SAT algorithm with transpose, and 4R4W SAT algorithm performs many more memory access operations than 2R2W SAT algorithm. Since 2R2W SAT algorithm performs stride memory accesses, it is nevertheless much slower than 4R4W SAT algorithm. These experimental results imply that stride memory access imposes a large penalty on the computing time. Recall that 2R1W and 1R1W SAT algorithms are block-based algorithms that perform 3n + O(n/w) and 2n + O(n/w) global memory access operations, respectively. Hence, they are faster than 4R4W SAT algorithm, which performs approximately 8n global memory access operations. Although 1R1W SAT algorithm performs fewer global memory access operations than 2R1W SAT algorithm, it runs slower when n ≤ 6K. The reason is that 1R1W SAT algorithm has a larger latency overhead than 2R1W SAT algorithm, and the latency overhead dominates the bandwidth overhead when the size of the input is small. 1.25R1W SAT algorithm runs faster than both 2R1W and 1R1W SAT algorithms whenever n ≥ 5K. We have evaluated the computing time for all possible values r = (w/√n)², (2w/√n)², … with 0 < r < 1 to find the best value of r that minimizes the running time of (1 + r)R1W SAT algorithm. Table II also shows the values of r (0 < r < 1) that minimize the running time of (1 + r)R1W SAT algorithm. From the table, we can see that (1 + r)R1W SAT algorithm attains the best performance when n ≥ 5K.
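The search over the grid of candidate values r = (w/√n)², (2w/√n)², … can be sketched by evaluating the access-cost formula of Theorem 7. This is a rough cost model only: the latency l, recursion depth k, and matrix size below are hypothetical placeholders, and the choice in the paper is made by measuring actual running times, not by this formula:

```python
from math import sqrt

def cost(r, n, w, l, k):
    # Global memory access cost of (1+r)R1W SAT algorithm (Theorem 7),
    # ignoring the O(n/w) lower-order term.
    return (2 + r) * n + (2 * (1 - sqrt(r)) * sqrt(n) / w + 5 + 4 * k) * l

n, w, l, k = 4096 * 4096, 32, 500.0, 2      # hypothetical parameters
sn_w = int(sqrt(n)) // w                     # sqrt(n)/w: width in blocks
# Candidates r = (t*w/sqrt(n))^2 for integer triangle sides t, 0 < r < 1.
candidates = [(t * w / sqrt(n)) ** 2 for t in range(1, sn_w)]
best_r = min(candidates, key=lambda r: cost(r, n, w, l, k))
```

With these placeholder parameters the bandwidth term dominates, so the model favors a small r; in the paper's experiments the best r likewise shrinks as the matrix grows.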
Also, the value of r that gives the best performance decreases as the size of the matrix increases. This is because the memory bandwidth overhead of 1R1W SAT algorithm dominates the latency overhead for larger matrices, so 1R1W SAT algorithm has better performance than 2R1W SAT algorithm. We can conjecture that 1R1W SAT algorithm could be the best if an input matrix were much larger than 18K × 18K. To see the speed-up factor of the SAT algorithms running on the GPU over a conventional CPU, we have evaluated the performance of several sequential SAT algorithms on Intel Xeon X7460 (2.66GHz). Table II shows the running time of two sequential algorithms as follows: 2R2W(CPU): The column-wise prefix-sums are computed in a raster scan order from the top row to the bottom row. More specifically, a[i + 1][j] ← a[i + 1][j] + a[i][j] is executed in a raster scan order of (i, j). The row-wise

Table II
THE RUNNING TIME OF SAT ALGORITHMS (IN MILLISECONDS) AND THE VALUE OF r THAT MINIMIZES THE RUNNING TIME OF (1 + r)R1W SAT ALGORITHM FOR MATRICES OF SIZES FROM 1K × 1K TO 18K × 18K
(Columns: n = 1K, 2K, 3K, 4K, 5K, 6K, 7K, 8K, 10K, 12K, 14K, 16K, 18K. Rows: 2R2W, 4R4W, 4R1W, 2R1W, 1R1W, 1.25R1W, fastest (1 + r)R1W, r (0 < r < 1), 2R2W(CPU), 4R1W(CPU). The numeric entries are not recoverable from this transcription.)

prefix-sums are also computed in a raster scan order, that is, a[i][j + 1] ← a[i][j + 1] + a[i][j] is executed in a raster scan order of (i, j). 4R1W(CPU): Formula (1) is evaluated in a raster scan order of (i, j). From the table, we can see that 4R1W(CPU) SAT algorithm runs faster than 2R2W(CPU) SAT algorithm because of its memory access locality. Also, (1 + r)R1W SAT algorithm runs more than 10 times faster than 4R1W(CPU) SAT algorithm when n ≥ 5K.

IX. CONCLUSION

The main contribution of this paper is to introduce the asynchronous Hierarchical Memory Machine, which captures the essence of CUDA-enabled GPUs. We have also presented a global-memory-access-optimal parallel algorithm for computing the summed area table on the asynchronous HMM. The experimental results on GeForce GTX 780 Ti showed that our best algorithm, (1 + r)R1W SAT algorithm, runs faster than any other algorithm for an input matrix of size 5K × 5K or larger. It also runs at least 10 times faster than the best sequential algorithm running on a single CPU.

REFERENCES

[1] W. W. Hwu, GPU Computing Gems Emerald Edition. Morgan Kaufmann, 2011.
[2] K. Ogawa, Y. Ito, and K. Nakano, "Efficient Canny edge detection using a GPU," in Proc. of International Conference on Networking and Computing. IEEE CS Press, Nov. 2010.
[3] A. Uchida, Y. Ito, and K. Nakano, "Fast and accurate template matching using pixel rearrangement on the GPU," in Proc. of International Conference on Networking and Computing. IEEE CS Press, Dec. 2011.
[4] NVIDIA Corporation, "NVIDIA CUDA C programming guide version 5.0," 2012.
[5] D. Man, K. Uda, H. Ueyama, Y. Ito, and K. Nakano, "Implementations of a parallel algorithm for computing Euclidean distance map in multicore processors and GPUs," International Journal of Networking and Computing, vol. 1, no. 2, July 2011.
[6] NVIDIA Corporation, "NVIDIA CUDA C best practice guide version 3.1," 2010.
[7] Y. Ito and K. Nakano, "A GPU implementation of dynamic programming for the optimal polygon triangulation," IEICE Transactions on Information and Systems, vol. E96-D, no. 12, Dec. 2013.
[8] K. Nakano, "Simple memory machine models for GPUs," International Journal of Parallel, Emergent and Distributed Systems, vol. 29, no. 1, 2014.
[9] A. V. Aho, J. D. Ullman, and J. E. Hopcroft, Data Structures and Algorithms. Addison Wesley, 1983.
[10] K. Nakano, "The hierarchical memory machine model for GPUs," in Proc. of International Parallel and Distributed Processing Symposium Workshops, May 2013.
[11] F. Crow, "Summed-area tables for texture mapping," in Proc. of the 11th Annual Conference on Computer Graphics and Interactive Techniques, 1984.
[12] A. Lauritzen, "Chapter 8: Summed-area variance shadow maps," in GPU Gems 3. Addison-Wesley, 2007.
[13] K. Nakano, "Optimal parallel algorithms for computing the sum, the prefix-sums, and the summed area table on the memory machine models," IEICE Trans. on Information and Systems, vol. E96-D, no. 12, 2013.
[14] D. Nehab, A. Maximo, R. S. Lima, and H. Hoppe, "GPU-efficient recursive filtering and summed-area tables," ACM Trans. Graph., vol. 30, no. 6, 2011.
[15] K. Nakano, "Simple memory machine models for GPUs," in Proc. of International Parallel and Distributed Processing Symposium Workshops, May 2012.
[16] A. Kasagi, K. Nakano, and Y. Ito, "An optimal offline permutation algorithm on the hierarchical memory machine, with the GPU implementation," in Proc. of International Conference on Parallel Processing. IEEE CS Press, Oct. 2013.

RESEARCH ARTICLE. Simple Memory Machine Models for GPUs

RESEARCH ARTICLE. Simple Memory Machine Models for GPUs The International Journal of Parallel, Emergent and Distributed Systems Vol. 00, No. 00, Month 2011, 1 22 RESEARCH ARTICLE Simle Memory Machine Models for GPUs Koji Nakano a a Deartment of Information

More information

Simple Memory Machine Models for GPUs

Simple Memory Machine Models for GPUs 2012 IEEE 2012 26th IEEE International 26th International Parallel Parallel and Distributed and Distributed Processing Processing Symosium Symosium Workshos Workshos & PhD Forum Simle Memory Machine Models

More information

Sequential Memory Access on the Unified Memory Machine with Application to the Dynamic Programming

Sequential Memory Access on the Unified Memory Machine with Application to the Dynamic Programming Sequential Memory Access on the Unified Memory Machine ith Alication to the Dynamic Programming Koji Nakano Deartment of Information Engineering Hiroshima University Kagamiyama --, Higashi Hiroshima, 79-87

More information

2010 First International Conference on Networking and Computing

2010 First International Conference on Networking and Computing First International Conference on Networking and Comuting Imlementations of Parallel Comutation of Euclidean Distance Ma in Multicore Processors and GPUs Duhu Man, Kenji Uda, Hironobu Ueyama, Yasuaki Ito,

More information

Hardware-Accelerated Formal Verification

Hardware-Accelerated Formal Verification Hardare-Accelerated Formal Verification Hiroaki Yoshida, Satoshi Morishita 3 Masahiro Fujita,. VLSI Design and Education Center (VDEC), University of Tokyo. CREST, Jaan Science and Technology Agency 3.

More information

GPU-accelerated Verification of the Collatz Conjecture

GPU-accelerated Verification of the Collatz Conjecture GPU-accelerated Verification of the Collatz Conjecture Takumi Honda, Yasuaki Ito, and Koji Nakano Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi Hiroshima 739-8527,

More information

PAPER A GPU Implementation of Dynamic Programming for the Optimal Polygon Triangulation

PAPER A GPU Implementation of Dynamic Programming for the Optimal Polygon Triangulation IEICE TRANS.??, VOL.Exx??, NO.xx XXXX x PAPER A GPU Implementation of Dynamic Programming for the Optimal Polygon Triangulation Yasuaki ITO and Koji NAKANO, Members SUMMARY This paper presents a GPU (Graphics

More information

A GPU Implementation of Dynamic Programming for the Optimal Polygon Triangulation

A GPU Implementation of Dynamic Programming for the Optimal Polygon Triangulation 2596 IEICE TRANS. INF. & SYST., VOL.E96 D, NO.12 DECEMBER 2013 PAPER Special Section on Parallel and Distributed Computing and Networking A GPU Implementation of Dynamic Programming for the Optimal Polygon

More information

COMP Parallel Computing. BSP (1) Bulk-Synchronous Processing Model

COMP Parallel Computing. BSP (1) Bulk-Synchronous Processing Model COMP 6 - Parallel Comuting Lecture 6 November, 8 Bulk-Synchronous essing Model Models of arallel comutation Shared-memory model Imlicit communication algorithm design and analysis relatively simle but

More information

Introduction to Parallel Algorithms

Introduction to Parallel Algorithms CS 1762 Fall, 2011 1 Introduction to Parallel Algorithms Introduction to Parallel Algorithms ECE 1762 Algorithms and Data Structures Fall Semester, 2011 1 Preliminaries Since the early 1990s, there has

More information

Lecture 18. Today, we will discuss developing algorithms for a basic model for parallel computing the Parallel Random Access Machine (PRAM) model.

Lecture 18. Today, we will discuss developing algorithms for a basic model for parallel computing the Parallel Random Access Machine (PRAM) model. U.C. Berkeley CS273: Parallel and Distributed Theory Lecture 18 Professor Satish Rao Lecturer: Satish Rao Last revised Scribe so far: Satish Rao (following revious lecture notes quite closely. Lecture

More information

AUTOMATIC GENERATION OF HIGH THROUGHPUT ENERGY EFFICIENT STREAMING ARCHITECTURES FOR ARBITRARY FIXED PERMUTATIONS. Ren Chen and Viktor K.

AUTOMATIC GENERATION OF HIGH THROUGHPUT ENERGY EFFICIENT STREAMING ARCHITECTURES FOR ARBITRARY FIXED PERMUTATIONS. Ren Chen and Viktor K. inuts er clock cycle Streaming ermutation oututs er clock cycle AUTOMATIC GENERATION OF HIGH THROUGHPUT ENERGY EFFICIENT STREAMING ARCHITECTURES FOR ARBITRARY FIXED PERMUTATIONS Ren Chen and Viktor K.

More information

An Efficient Video Program Delivery algorithm in Tree Networks*

An Efficient Video Program Delivery algorithm in Tree Networks* 3rd International Symosium on Parallel Architectures, Algorithms and Programming An Efficient Video Program Delivery algorithm in Tree Networks* Fenghang Yin 1 Hong Shen 1,2,** 1 Deartment of Comuter Science,

More information

Collective communication: theory, practice, and experience

Collective communication: theory, practice, and experience CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Comutat.: Pract. Exer. 2007; 19:1749 1783 Published online 5 July 2007 in Wiley InterScience (www.interscience.wiley.com)..1206 Collective

More information

An Efficient Coding Method for Coding Region-of-Interest Locations in AVS2

An Efficient Coding Method for Coding Region-of-Interest Locations in AVS2 An Efficient Coding Method for Coding Region-of-Interest Locations in AVS2 Mingliang Chen 1, Weiyao Lin 1*, Xiaozhen Zheng 2 1 Deartment of Electronic Engineering, Shanghai Jiao Tong University, China

More information

Complexity Issues on Designing Tridiagonal Solvers on 2-Dimensional Mesh Interconnection Networks

Complexity Issues on Designing Tridiagonal Solvers on 2-Dimensional Mesh Interconnection Networks Journal of Comuting and Information Technology - CIT 8, 2000, 1, 1 12 1 Comlexity Issues on Designing Tridiagonal Solvers on 2-Dimensional Mesh Interconnection Networks Eunice E. Santos Deartment of Electrical

More information

Shuigeng Zhou. May 18, 2016 School of Computer Science Fudan University

Shuigeng Zhou. May 18, 2016 School of Computer Science Fudan University Query Processing Shuigeng Zhou May 18, 2016 School of Comuter Science Fudan University Overview Outline Measures of Query Cost Selection Oeration Sorting Join Oeration Other Oerations Evaluation of Exressions

More information

Collective Communication: Theory, Practice, and Experience. FLAME Working Note #22

Collective Communication: Theory, Practice, and Experience. FLAME Working Note #22 Collective Communication: Theory, Practice, and Exerience FLAME Working Note # Ernie Chan Marcel Heimlich Avi Purkayastha Robert van de Geijn Setember, 6 Abstract We discuss the design and high-erformance

More information

Efficient Parallel Hierarchical Clustering

Efficient Parallel Hierarchical Clustering Efficient Parallel Hierarchical Clustering Manoranjan Dash 1,SimonaPetrutiu, and Peter Scheuermann 1 Deartment of Information Systems, School of Comuter Engineering, Nanyang Technological University, Singaore

More information

Texture Mapping with Vector Graphics: A Nested Mipmapping Solution

Texture Mapping with Vector Graphics: A Nested Mipmapping Solution Texture Maing with Vector Grahics: A Nested Mimaing Solution Wei Zhang Yonggao Yang Song Xing Det. of Comuter Science Det. of Comuter Science Det. of Information Systems Prairie View A&M University Prairie

More information

Modified Bloom filter for high performance hybrid NoSQL systems

Modified Bloom filter for high performance hybrid NoSQL systems odified Bloom filter for high erformance hybrid NoSQL systems A.B.Vavrenyuk, N.P.Vasilyev, V.V.akarov, K.A.atyukhin,..Rovnyagin, A.A.Skitev National Research Nuclear University EPhI (oscow Engineering

More information

Randomized Selection on the Hypercube 1

Randomized Selection on the Hypercube 1 Randomized Selection on the Hyercube 1 Sanguthevar Rajasekaran Det. of Com. and Info. Science and Engg. University of Florida Gainesville, FL 32611 ABSTRACT In this aer we resent randomized algorithms

More information

A Reconfigurable Architecture for Quad MAC VLIW DSP

A Reconfigurable Architecture for Quad MAC VLIW DSP A Reconfigurable Architecture for Quad MAC VLIW DSP Sangwook Kim, Sungchul Yoon, Jaeseuk Oh, Sungho Kang Det. of Electrical & Electronic Engineering, Yonsei University 132 Shinchon-Dong, Seodaemoon-Gu,

More information

Randomized algorithms: Two examples and Yao s Minimax Principle

Randomized algorithms: Two examples and Yao s Minimax Principle Randomized algorithms: Two examles and Yao s Minimax Princile Maximum Satisfiability Consider the roblem Maximum Satisfiability (MAX-SAT). Bring your knowledge u-to-date on the Satisfiability roblem. Maximum

More information

10. Parallel Methods for Data Sorting

10. Parallel Methods for Data Sorting 10. Parallel Methods for Data Sorting 10. Parallel Methods for Data Sorting... 1 10.1. Parallelizing Princiles... 10.. Scaling Parallel Comutations... 10.3. Bubble Sort...3 10.3.1. Sequential Algorithm...3

More information

A New and Efficient Algorithm-Based Fault Tolerance Scheme for A Million Way Parallelism

A New and Efficient Algorithm-Based Fault Tolerance Scheme for A Million Way Parallelism A New and Efficient Algorithm-Based Fault Tolerance Scheme for A Million Way Parallelism Erlin Yao, Mingyu Chen, Rui Wang, Wenli Zhang, Guangming Tan Key Laboratory of Comuter System and Architecture Institute

More information

Directed File Transfer Scheduling

Directed File Transfer Scheduling Directed File Transfer Scheduling Weizhen Mao Deartment of Comuter Science The College of William and Mary Williamsburg, Virginia 387-8795 wm@cs.wm.edu Abstract The file transfer scheduling roblem was

More information

CMSC 425: Lecture 16 Motion Planning: Basic Concepts

CMSC 425: Lecture 16 Motion Planning: Basic Concepts : Lecture 16 Motion lanning: Basic Concets eading: Today s material comes from various sources, including AI Game rogramming Wisdom 2 by S. abin and lanning Algorithms by S. M. LaValle (Chats. 4 and 5).

More information

IMS Network Deployment Cost Optimization Based on Flow-Based Traffic Model

IMS Network Deployment Cost Optimization Based on Flow-Based Traffic Model IMS Network Deloyment Cost Otimization Based on Flow-Based Traffic Model Jie Xiao, Changcheng Huang and James Yan Deartment of Systems and Comuter Engineering, Carleton University, Ottawa, Canada {jiexiao,

More information

Tiling for Performance Tuning on Different Models of GPUs

Tiling for Performance Tuning on Different Models of GPUs Tiling for Performance Tuning on Different Models of GPUs Chang Xu Deartment of Information Engineering Zhejiang Business Technology Institute Ningbo, China colin.xu198@gmail.com Steven R. Kirk, Samantha

More information

Space-efficient Region Filling in Raster Graphics

Space-efficient Region Filling in Raster Graphics "The Visual Comuter: An International Journal of Comuter Grahics" (submitted July 13, 1992; revised December 7, 1992; acceted in Aril 16, 1993) Sace-efficient Region Filling in Raster Grahics Dominik Henrich

More information

A Warp-synchronous Implementation for Multiple-length Multiplication on the GPU

A Warp-synchronous Implementation for Multiple-length Multiplication on the GPU A Warp-synchronous Implementation for Multiple-length Multiplication on the GPU Takumi Honda, Yasuaki Ito, Koji Nakano Department of Information Engineering, Hiroshima University Kagamiyama 1-4-1, Higashi-Hiroshima,

More information

SPITFIRE: Scalable Parallel Algorithms for Test Set Partitioned Fault Simulation

SPITFIRE: Scalable Parallel Algorithms for Test Set Partitioned Fault Simulation To aear in IEEE VLSI Test Symosium, 1997 SITFIRE: Scalable arallel Algorithms for Test Set artitioned Fault Simulation Dili Krishnaswamy y Elizabeth M. Rudnick y Janak H. atel y rithviraj Banerjee z y

More information

AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY

AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY Andrew Lam 1, Steven J.E. Wilton 1, Phili Leong 2, Wayne Luk 3 1 Elec. and Com. Engineering 2 Comuter Science

More information

A Study of Protocols for Low-Latency Video Transport over the Internet

A Study of Protocols for Low-Latency Video Transport over the Internet A Study of Protocols for Low-Latency Video Transort over the Internet Ciro A. Noronha, Ph.D. Cobalt Digital Santa Clara, CA ciro.noronha@cobaltdigital.com Juliana W. Noronha University of California, Davis

More information

Efficient Processing of Top-k Dominating Queries on Multi-Dimensional Data

Efficient Processing of Top-k Dominating Queries on Multi-Dimensional Data Efficient Processing of To-k Dominating Queries on Multi-Dimensional Data Man Lung Yiu Deartment of Comuter Science Aalborg University DK-922 Aalborg, Denmark mly@cs.aau.dk Nikos Mamoulis Deartment of

More information

AUTOMATIC EXTRACTION OF BUILDING OUTLINE FROM HIGH RESOLUTION AERIAL IMAGERY

AUTOMATIC EXTRACTION OF BUILDING OUTLINE FROM HIGH RESOLUTION AERIAL IMAGERY AUTOMATIC EXTRACTION OF BUILDING OUTLINE FROM HIGH RESOLUTION AERIAL IMAGERY Yandong Wang EagleView Technology Cor. 5 Methodist Hill Dr., Rochester, NY 1463, the United States yandong.wang@ictometry.com

More information

A GPU Heterogeneous Cluster Scheduling Model for Preventing Temperature Heat Island

A GPU Heterogeneous Cluster Scheduling Model for Preventing Temperature Heat Island A GPU Heterogeneous Cluster Scheduling Model for Preventing Temerature Heat Island Yun-Peng CAO 1,2,a and Hai-Feng WANG 1,2 1 School of Information Science and Engineering, Linyi University, Linyi Shandong,

More information

Synthesis of FSMs on the Basis of Reusable Hardware Templates

Synthesis of FSMs on the Basis of Reusable Hardware Templates Proceedings of the 6th WSEAS Int. Conf. on Systems Theory & Scientific Comutation, Elounda, Greece, August -3, 006 (09-4) Synthesis of FSMs on the Basis of Reusable Hardware Temlates VALERY SKLYAROV, IOULIIA

More information

OMNI: An Efficient Overlay Multicast. Infrastructure for Real-time Applications

OMNI: An Efficient Overlay Multicast. Infrastructure for Real-time Applications OMNI: An Efficient Overlay Multicast Infrastructure for Real-time Alications Suman Banerjee, Christoher Kommareddy, Koushik Kar, Bobby Bhattacharjee, Samir Khuller Abstract We consider an overlay architecture

More information

Sensitivity Analysis for an Optimal Routing Policy in an Ad Hoc Wireless Network

Sensitivity Analysis for an Optimal Routing Policy in an Ad Hoc Wireless Network 1 Sensitivity Analysis for an Otimal Routing Policy in an Ad Hoc Wireless Network Tara Javidi and Demosthenis Teneketzis Deartment of Electrical Engineering and Comuter Science University of Michigan Ann

More information

Matlab Virtual Reality Simulations for optimizations and rapid prototyping of flexible lines systems

Matlab Virtual Reality Simulations for optimizations and rapid prototyping of flexible lines systems Matlab Virtual Reality Simulations for otimizations and raid rototying of flexible lines systems VAMVU PETRE, BARBU CAMELIA, POP MARIA Deartment of Automation, Comuters, Electrical Engineering and Energetics

More information

GPU-Accelerated Bulk Computation of the Eigenvalue Problem for Many Small Real Non-symmetric Matrices

GPU-Accelerated Bulk Computation of the Eigenvalue Problem for Many Small Real Non-symmetric Matrices GPU-Accelerated Bulk Computation of the Eigenvalue Problem for Many Small Real Non-symmetric Matrices Hiroki Tokura, Takumi Honda, Yasuaki Ito, and Koji Nakano Department of Information Engineering, Hiroshima

More information

CASCH - a Scheduling Algorithm for "High Level"-Synthesis

CASCH - a Scheduling Algorithm for High Level-Synthesis CASCH a Scheduling Algorithm for "High Level"Synthesis P. Gutberlet H. Krämer W. Rosenstiel Comuter Science Research Center at the University of Karlsruhe (FZI) HaidundNeuStr. 1014, 7500 Karlsruhe, F.R.G.

More information

Accelerating Ant Colony Optimization for the Vertex Coloring Problem on the GPU

Accelerating Ant Colony Optimization for the Vertex Coloring Problem on the GPU Accelerating Ant Colony Optimization for the Vertex Coloring Problem on the GPU Ryouhei Murooka, Yasuaki Ito, and Koji Nakano Department of Information Engineering, Hiroshima University Kagamiyama 1-4-1,

More information

Gradient Compared l p -LMS Algorithms for Sparse System Identification

Gradient Compared l p -LMS Algorithms for Sparse System Identification Gradient Comared l -LMS Algorithms for Sarse System Identification Yong Feng,, Jiasong Wu, Rui Zeng, Limin Luo, Huazhong Shu. School of Biological Science and Medical Engineering, Southeast University,

More information

Improved heuristics for the single machine scheduling problem with linear early and quadratic tardy penalties

Improved heuristics for the single machine scheduling problem with linear early and quadratic tardy penalties Imroved heuristics for the single machine scheduling roblem with linear early and quadratic tardy enalties Jorge M. S. Valente* LIAAD INESC Porto LA, Faculdade de Economia, Universidade do Porto Postal

More information

An Efficient VLSI Architecture for Adaptive Rank Order Filter for Image Noise Removal

An Efficient VLSI Architecture for Adaptive Rank Order Filter for Image Noise Removal International Journal of Information and Electronics Engineering, Vol. 1, No. 1, July 011 An Efficient VLSI Architecture for Adative Rank Order Filter for Image Noise Removal M. C Hanumantharaju, M. Ravishankar,

More information

Mitigating the Impact of Decompression Latency in L1 Compressed Data Caches via Prefetching

Mitigating the Impact of Decompression Latency in L1 Compressed Data Caches via Prefetching Mitigating the Imact of Decomression Latency in L1 Comressed Data Caches via Prefetching by Sean Rea A thesis resented to Lakehead University in artial fulfillment of the requirement for the degree of

More information

Parallel Construction of Multidimensional Binary Search Trees. Ibraheem Al-furaih, Srinivas Aluru, Sanjay Goil Sanjay Ranka

Parallel Construction of Multidimensional Binary Search Trees. Ibraheem Al-furaih, Srinivas Aluru, Sanjay Goil Sanjay Ranka Parallel Construction of Multidimensional Binary Search Trees Ibraheem Al-furaih, Srinivas Aluru, Sanjay Goil Sanjay Ranka School of CIS and School of CISE Northeast Parallel Architectures Center Syracuse

More information

A Morphological LiDAR Points Cloud Filtering Method based on GPGPU

A Morphological LiDAR Points Cloud Filtering Method based on GPGPU A Morhological LiDAR Points Cloud Filtering Method based on GPGPU Shuo Li 1, Hui Wang 1, Qiuhe Ma 1 and Xuan Zha 2 1 Zhengzhou Institute of Surveying & Maing, No.66, Longhai Middle Road, Zhengzhou, China

More information

split split (a) (b) split split (c) (d)

split split (a) (b) split split (c) (d) International Journal of Foundations of Comuter Science c World Scientic Publishing Comany ON COST-OPTIMAL MERGE OF TWO INTRANSITIVE SORTED SEQUENCES JIE WU Deartment of Comuter Science and Engineering

More information

Distributed Algorithms

Distributed Algorithms Course Outline With grateful acknowledgement to Christos Karamanolis for much of the material Jeff Magee & Jeff Kramer Models of distributed comuting Synchronous message-assing distributed systems Algorithms

More information

Lecture 2: Fixed-Radius Near Neighbors and Geometric Basics

Lecture 2: Fixed-Radius Near Neighbors and Geometric Basics structure arises in many alications of geometry. The dual structure, called a Delaunay triangulation also has many interesting roerties. Figure 3: Voronoi diagram and Delaunay triangulation. Search: Geometric

More information

AUTOMATIC 3D SURFACE RECONSTRUCTION BY COMBINING STEREOVISION WITH THE SLIT-SCANNER APPROACH

AUTOMATIC 3D SURFACE RECONSTRUCTION BY COMBINING STEREOVISION WITH THE SLIT-SCANNER APPROACH AUTOMATIC 3D SURFACE RECONSTRUCTION BY COMBINING STEREOVISION WITH THE SLIT-SCANNER APPROACH A. Prokos 1, G. Karras 1, E. Petsa 2 1 Deartment of Surveying, National Technical University of Athens (NTUA),

More information

An Indexing Framework for Structured P2P Systems

An Indexing Framework for Structured P2P Systems An Indexing Framework for Structured P2P Systems Adina Crainiceanu Prakash Linga Ashwin Machanavajjhala Johannes Gehrke Carl Lagoze Jayavel Shanmugasundaram Deartment of Comuter Science, Cornell University

More information

CLUSTER ANALYSIS AND NEURAL NETWORK

CLUSTER ANALYSIS AND NEURAL NETWORK LUSTER ANALYSIS AND NEURAL NETWORK P.Dostál,P.Pokorný Deartment of Informatics, Faculty of Business and Management, Brno University of Technology Institute of Mathematics, Faculty of Mechanical Engineering,

More information

Coarse grained gather and scatter operations with applications

Coarse grained gather and scatter operations with applications Available online at www.sciencedirect.com J. Parallel Distrib. Comut. 64 (2004) 1297 1310 www.elsevier.com/locate/jdc Coarse grained gather and scatter oerations with alications Laurence Boxer a,b,, Russ

More information

Fast Distributed Process Creation with the XMOS XS1 Architecture

Fast Distributed Process Creation with the XMOS XS1 Architecture Communicating Process Architectures 20 P.H. Welch et al. (Eds.) IOS Press, 20 c 20 The authors and IOS Press. All rights reserved. Fast Distributed Process Creation with the XMOS XS Architecture James

More information

Distributed Estimation from Relative Measurements in Sensor Networks

Distributed Estimation from Relative Measurements in Sensor Networks Distributed Estimation from Relative Measurements in Sensor Networks #Prabir Barooah and João P. Hesanha Abstract We consider the roblem of estimating vectorvalued variables from noisy relative measurements.

More information

J. Parallel Distrib. Comput.

J. Parallel Distrib. Comput. J. Parallel Distrib. Comut. 71 (2011) 288 301 Contents lists available at ScienceDirect J. Parallel Distrib. Comut. journal homeage: www.elsevier.com/locate/jdc Quality of security adatation in arallel

More information

10 File System Mass Storage Structure Mass Storage Systems Mass Storage Structure Mass Storage Structure FILE SYSTEM 1

10 File System Mass Storage Structure Mass Storage Systems Mass Storage Structure Mass Storage Structure FILE SYSTEM 1 10 File System 1 We will examine this chater in three subtitles: Mass Storage Systems OERATING SYSTEMS FILE SYSTEM 1 File System Interface File System Imlementation 10.1.1 Mass Storage Structure 3 2 10.1

More information

Complexity analysis and performance evaluation of matrix product on multicore architectures

Complexity analysis and performance evaluation of matrix product on multicore architectures Comlexity analysis and erformance evaluation of matrix roduct on multicore architectures Mathias Jacquelin, Loris Marchal and Yves Robert École Normale Suérieure de Lyon, France {Mathias.Jacquelin Loris.Marchal

More information

Submission. Verifying Properties Using Sequential ATPG

Submission. Verifying Properties Using Sequential ATPG Verifying Proerties Using Sequential ATPG Jacob A. Abraham and Vivekananda M. Vedula Comuter Engineering Research Center The University of Texas at Austin Austin, TX 78712 jaa, vivek @cerc.utexas.edu Daniel

More information

Improved Gaussian Mixture Probability Hypothesis Density for Tracking Closely Spaced Targets. Intl. Journal of Electronics and Telecommunications.

An improved algorithm for Hausdorff Voronoi diagram for non-crossing sets. Frank Dehne, Anil Maheshwari and Ryan Taylor, May 26, 2006. Abstract: We present an improved algorithm for building a Hausdorff Voronoi diagram.

Argo Programming Guide. Evangelia Kasapaki, Rasmus Bo Sørensen, February 9, 2015. Copyright 2014 Technical University of Denmark. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International license.

Asynchronous Memory Machine Models with Barrier Synchronization. Koji Nakano, Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi Hiroshima, 739-8527 Japan. Email: nakano@cs.hiroshima-u.ac.jp

Sub-logarithmic Deterministic Selection on Arrays with a Reconfigurable Optical Bus. Yijie Han, Electronic Data Systems, Inc., 750 Tower Dr. CPS, Mail Stop 7121, Troy, MI 48098; Yi Pan, Department of Computer Science.

Modeling Reliable Broadcast of Data Buffer over Wireless Networks. Journal of Information Science and Engineering 3, 799-810 (2016). Short Paper. Department of Computer Engineering, Yeditepe University, Kayışdağı.

Simulating Ocean Currents; Simulating Galaxy Evolution. (a) Cross sections; (b) spatial discretization of a cross section. Model as two-dimensional grids; discretize in space and time; finer spatial and temporal resolution yields greater accuracy.

CS2 Algorithms and Data Structures Note 8: Heapsort and Quicksort. We will see two more sorting algorithms in this lecture. The first, heapsort, is very nice theoretically.
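The excerpt above introduces heapsort but cuts off before the algorithm itself. As a hedged illustration (a minimal sketch, not taken from the cited lecture notes), an in-place heapsort can be written as:

```python
def heapsort(a):
    """Sort list a in place via a max-heap; O(n log n) worst-case comparisons."""
    n = len(a)

    def sift_down(root, end):
        # Push a[root] down until the max-heap property holds on a[root..end].
        while 2 * root + 1 <= end:
            child = 2 * root + 1
            # Pick the larger of the two children.
            if child + 1 <= end and a[child] < a[child + 1]:
                child += 1
            if a[root] < a[child]:
                a[root], a[child] = a[child], a[root]
                root = child
            else:
                return

    # Phase 1: build a max-heap bottom-up.
    for start in range(n // 2 - 1, -1, -1):
        sift_down(start, n - 1)
    # Phase 2: repeatedly move the maximum to the end and restore the heap.
    for end in range(n - 1, 0, -1):
        a[0], a[end] = a[end], a[0]
        sift_down(0, end - 1)
    return a
```

For example, `heapsort([5, 3, 8, 1, 2])` returns `[1, 2, 3, 5, 8]`.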

GEOMETRIC CONSTRAINT SOLVING IN R² AND R³. CHRISTOPH M. HOFFMANN, Department of Computer Sciences, Purdue University, West Lafayette, Indiana 47907-1398, USA, and PAMELA J. VERMEER, Department of Computer Sciences.

A Model-Adaptable MOSFET Parameter Extraction System. Masaki Kondo, Hidetoshi Onodera, Keikichi Tamaru, Department of Electronics, Faculty of Engineering, Kyoto University, Kyoto 66-1, Japan.

A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH. Jin Lu, José M. F. Moura, and Urs Niesen, Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213. jinlu, moura@ece.cmu.edu

Equality-Based Translation Validator for LLVM. Michael Stepp, Ross Tate, and Sorin Lerner, University of California, San Diego. Abstract: We updated our Peggy tool, previously presented.

Communication-Efficient Deterministic Parallel Algorithms for Planar Point Location and 2d Voronoi Diagram. Mohamadou Diallo, Afonso Ferreira and Andrew Rau-Chaplin. LIMOS, IFMA, Campus des Cézeaux.

Stereo Disparity Estimation in Moment Space. Angeline Pang, Faculty of Information Technology, Multimedia University, 63 Cyberjaya, Malaysia, angeline.pang@mmu.edu.my; R. Mukundan, Department of Computer Science.

Theoretical Analysis of Graphcut Textures. Xuejie Qin, Yee-Hong Yang, {xu yang}@cs.ualberta.ca, Department of Computing Science, University of Alberta. Abstract: Since the paper was published in SIGGRAPH 2003, the graphcut

Overlapping Computations, Communications and I/O in Parallel Sorting, by Mark J. Clement and Michael J. Quinn, Brigham Young University and Oregon State University. Appeared in Journal of Parallel and Distributed Computing, July 1995. Abstract: In this paper we present a new parallel sorting algorithm which maximizes the overlap of computations, communications and I/O.

[9] J. J. Dongarra, R. Hempel, A. J. G. Hey, and D. W. Walker, "A Proposal for a User-Level Message Passing Interface in a Distributed-Memory Environment," Tech. Rep. TM-3, Oak Ridge National Laboratory.

Graph Cut Matching in Computer Vision. Toby Collins (s0455374@sms.ed.ac.uk), February 2004. Introduction: Many of the problems that arise in early vision can be naturally expressed in terms of energy minimization.

Interactive Image Segmentation. Fahim Mannan (260 266 294). Abstract: This report presents the project work done based on Boykov and Jolly's interactive graph cuts based N-D image segmentation algorithm ([1]).

Sensitivity of multi-product two-stage economic lotsizing models and their dependency on change-over and product cost ratios. Frank Van den broecke, El-Houssaine Aghezzaf.

The Priority R-Tree: A Practically Efficient and Worst-Case Optimal R-Tree. Lars Arge, Department of Computer Science, Duke University, Box 90129, Durham, NC 27708-0129, USA, large@cs.duke.edu; Mark de Berg.

Leak Detection Modeling and Simulation for Oil Pipeline with Artificial Intelligence Method. ITB J. Eng. Sci. Vol. 39 B, No. 1, 2007, 1-19. Pudjo Sukarno, Kuntjoro Adji Sidarto, Amoranto Trisnobudi.

Layered Switching for Networks on Chip. Zhonghai Lu (zhonghai@kth.se), Ming Liu (mingliu@kth.se), Axel Jantsch (axel@kth.se), Royal Institute of Technology, Sweden.

Accelerating Ant Colony Optimization for the Traveling Salesman Problem on the GPU. Research Article, The International Journal of Parallel, Emergent and Distributed Systems, Vol. 00, No. 00, Month 2011, 1-21.

Construction of Irregular QC-LDPC Codes in Near-Earth Communications. Journal of Communications Vol. 9, No. 7, July 2014. Hui Zhao, Xiaoxiao Bao, Liang Qin, Ruyan Wang, and Hong Zhang, Chongqing University.

A Parallel Algorithm for Constructing Obstacle-Avoiding Rectilinear Steiner Minimal Trees on Multi-Core Systems. Cheng-Yuan Chang and I-Lun Tseng, Department of Computer Science and Engineering, Yuan Ze University.

Folded Structures Satisfying Multiple Conditions. Journal of Information Processing (Oct. 2017), Regular Paper. Erik D. Demaine and Jason S. Ku.

Multicast in Wormhole-Switched Torus Networks using Edge-Disjoint Spanning Trees. Honge Wang, Myricom Inc., 325 N. Santa Anita Ave., Arcadia, CA; Douglas M. Blough.

Pivot Selection for Dimension Reduction Using Annealing by Increasing Resampling. Yasunobu Imamura, Naoya Higuchi, Tetsuji Kuboyama, Kouichi Hirata and Takeshi Shinohara, Kyushu Institute of Technology.

EE678 Application Presentation: Content Based Image Retrieval Using Wavelets. Group Members: Megha Pandey (megha@ee.iitb.ac.in, 02d07006), Gaurav Boob (gb@ee.iitb.ac.in, 02d07008).


Face Recognition Using Legendre Moments. Dr. S. Annadurai, Professor & Head of CSE & IT, and A. Saradha, Research scholar in CSE, Government College of Technology, Coimbatore, Tamilnadu.

A DEA-based Approach for Multi-objective Design of Attribute Acceptance Sampling Plans. Available online at http://ijdea.srbiau.ac.ir. Int. J. Data Envelopment Analysis (ISSN 2345-458X), Vol. 5, No. 2, Year 2017, Article ID IJDEA-00422, 12 pages. Research Article.

To appear in IEEE TKDE. Title: Efficient Skyline and Top-k Retrieval in Subspaces. Keywords: Skyline, Top-k, Subspace, B-tree. Contact Author: Yufei Tao (taoyf@cse.cuhk.edu.hk), Department of Computer Science.

Real Time Compression of Triangle Mesh Connectivity. Stefan Gumhold, Wolfgang Straßer, WSI/GRIS, University of Tübingen. Abstract: In this paper we introduce a new compressed representation for triangle mesh connectivity.

PRO: a Model for Parallel Resource-Optimal Computation. Assefaw Hadish Gebremedhin, Isabelle Guérin Lassous, Jens Gustedt, Jan Arne Telle. Abstract: We present a new parallel computation model that enables the design of resource-optimal parallel algorithms.