[KV99] M. Kaul and R. Vemuri. Integrated Block-Processing and Design-Space Exploration in Temporal Partitioning for RTR Architectures.


[KV99] M. Kaul and R. Vemuri. Integrated Block-Processing and Design-Space Exploration in Temporal Partitioning for RTR Architectures. Reconfigurable Architectures Workshop Proceedings, Puerto Rico, April.
[OCS98] M. Oskin, F. Chong, and T. Sherwood. Active Pages: A Model of Computation for Intelligent Memory. In Proceedings of the 25th International Symposium on Computer Architecture (ISCA), June.
[PK87] C. Polychronopoulos and D. Kuck. Guided-Self-Scheduling: A Practical Scheduling Scheme for Parallel Computers. ACM Transactions on Computers, 12(36), December.
[RSP98] N. Ramasubramanian, R. Subramanian, and S. Pande. Automatic Analysis of Loops to Exploit Operator Parallelism on Reconfigurable Systems. In Proceedings of the 11th International Workshop on Languages and Compilers for Parallel Computing. Springer-Verlag, August.
[SCC+99] B. Schott, S. Crago, C. Chen, J. Czarnaski, M. French, I. Hom, T. Tho, and T. Valenti. Reconfigurable Architectures for Systems Level Applications of Adaptive Computing. Submitted for publication (see slide_index.htm).
[T95] C. Tseng. Compiler Optimizations for Eliminating Barrier Synchronization. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, July.
[WL91] M. E. Wolf and M. S. Lam. A Data Locality Optimizing Algorithm. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June.
[WTS+97] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal. Baring it all to Software: Raw Machines. IEEE Computer, September.
[ZAM98] P. Zhong, P. Ashar, S. Malik, and M. Martonosi. Using Reconfigurable Computing Techniques to Accelerate Problems in the CAD Domain: A Case Study with Boolean Satisfiability. In Design Automation Conference, June 1998.

5 Conclusions

This paper has presented an overview of compilation techniques in DEFACTO, a design environment for implementing applications for configurable architectures. The DEFACTO system uniquely combines parallelizing compiler technology with synthesis to automate the effective mapping of applications to reconfigurable computing platforms. The focus of this paper has been on space-sensitive optimizations for managing logic and data requirements on CCUs. We describe functional reuse, data reuse and value reuse. We are currently implementing these techniques in the DEFACTO system-level compiler, using SUIF 1.0 as an implementation foundation for our system.

Acknowledgements. This research has been supported by DARPA contract F and a Hughes Space and Communications Company Fellowship.

Table 3 (continued).
Task # | Task Precedence | Communication To/From
2.2 | 1.2, 2.1 | { X: 11 <= x1 <= 20, 1 <= x2 <= 40 } from Task 2.1
    |          | { X: 10 <= x1 <= 10, 1 <= x2 <= 40 } from Task 1.2
    |          | { X: 20 <= x1 <= 20, 1 <= x2 <= 40 } to Task 3.2
3.1 | None     | { X: 21 <= x1 <= 30, 1 <= x2 <= 40 } to Task 3.2
3.2 | 2.2, 3.1 | { X: 21 <= x1 <= 30, 1 <= x2 <= 40 } from Task 3.1
    |          | { X: 20 <= x1 <= 20, 1 <= x2 <= 40 } from Task 2.2
    |          | { X: 30 <= x1 <= 30, 1 <= x2 <= 40 } to Task 4.2
4.1 | None     | { X: 31 <= x1 <= 40, 1 <= x2 <= 40 } to Task 4.2
4.2 | 3.2, 4.1 | { X: 31 <= x1 <= 40, 1 <= x2 <= 40 } from Task 4.1
    |          | { X: 30 <= x1 <= 30, 1 <= x2 <= 40 } from Task 3.2

6 References

[A97] S. Amarasinghe. Parallelizing Compiler Techniques Based on Linear Inequalities. Ph.D. thesis, Dept. of Electrical Engineering, Stanford University, January.
[AL93] J. Anderson and M. Lam. Global Optimizations for Parallelism and Locality on Scalable Parallel Machines. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 93). ACM Press, July.
[ANN] Annapolis Micro Systems, Inc.
[BAK96] D. Buell, J. Arnold, and W. Kleinfelder. Splash 2: FPGAs in a Custom Computing Machine. In IEEE Symposium on FPGAs for Custom Computing Machines. Computer Society Press, April.
[BDD+] K. Bondalapati, P. Diniz, P. Duncan, J. Granacki, M. Hall, R. Jain, and H. Ziegler. DEFACTO: A Design Environment for Adaptive Computing. In Proceedings of the Reconfigurable Architectures Workshop, April.
[BLA99] R. Barua, W. Lee, S. Amarasinghe, and A. Agarwal. Maps: A Compiler-Managed Memory System for Raw Machines. In Proceedings of the Twenty-Sixth International Symposium on Computer Architecture (ISCA-26), Atlanta, GA, June.
[ECF96] C. Ebeling, D. Cronquist, and P. Franklin. RaPiD - Reconfigurable Pipelined Datapath. In Proceedings of the 6th International Workshop on Field-Programmable Logic and Applications.
[FOW87] J. Ferrante, K. Ottenstein, and J. Warren. The Program Dependence Graph and Its Use in Optimization. ACM Transactions on Programming Languages and Systems, 9(3), July.
[GP92] M. Girkar and C. Polychronopoulos. Automatic Detection of Task Parallelism in Sequential Programs. IEEE Transactions on Parallel and Distributed Systems, 3(2), March.
[GS97] M. Gokhale and J. Stone. NAPA C: Compiling for a Hybrid RISC/FPGA Architecture. In IEEE Symposium on FPGAs for Custom Computing Machines. Computer Society Press, April.
[GSB+99] S. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe, R. Taylor, and R. Laufer. PipeRench: A Co-Processor for Streaming Multimedia Acceleration. In Proceedings of the 26th International Symposium on Computer Architecture (ISCA), May.
[GSS96] M. Gupta, E. Schonberg, and H. Srinivasan. A unified framework for optimizing communication in data parallel programs. Technical Report RC 19872(87937) 12/14/94, IBM Research. To appear in IEEE Transactions on Parallel and Distributed Systems.
[HAA96] M. Hall, J. M. Anderson, S. P. Amarasinghe, B. R. Murphy, S. Liao, E. Bugnion, and M. Lam. Maximizing Multiprocessor Performance with the SUIF Compiler. IEEE Computer, December 1996 (special issue on shared-memory multiprocessors).
[HAM95] M. Hall, S. Amarasinghe, B. Murphy, S. Liao, and M. Lam. Detecting Coarse-Grain Parallelism Using an Interprocedural Parallelizing Compiler. Proceedings of Supercomputing, December.
[HMA95] M. Hall, B. Murphy, S. Amarasinghe, S. Liao, and M. Lam. Interprocedural Analysis for Parallelization. Eighth Workshop on Languages and Compilers for Parallel Computers, August.
[HW97] J. Hauser and J. Wawrzynek. Garp: A MIPS Processor with a Reconfigurable Coprocessor. In IEEE Symposium on FPGAs for Custom Computing Machines. Computer Society Press, April.
[KHN97] R. Kress, R. Hartenstein, and U. Nageldinger. An Operating System for Custom Computing Machines based on the Xputer Paradigm. In Proceedings of the 7th International Workshop on Field-Programmable Logic and Applications.
[KN95] K. Kennedy and N. Nedeljkovic. Combining dependence and data-flow analyses to optimize communication. In International Parallel Processing Symposium. IEEE, 1995.

Now, by considering reaching definitions information to match a read to its possible writers, we compute the communication and precedence between tasks assigned to different FPGAs by calculating the three dependence intersections for each pair of tasks that has possibly related definitions and uses. The final task precedences and communication are summarized in Table 3.

TABLE 1.
Array | Access     | Task 2.1                        | Task 2.2
X     | Must Read  | 11 <= x1 <= 20, 0 <= x2 <= 40   | 10 <= x1 <= 20, 1 <= x2 <= 40
X     | Must Write | 11 <= x1 <= 20, 1 <= x2 <= 40   | 11 <= x1 <= 20, 1 <= x2 <= 40

TABLE 2.
Array     | Access     | Task 1                          | Task 2
X (FPGA1) | Must Read  | 1 <= x1 <= 10, 0 <= x2 <= 40    | 1 <= x1 <= 10, 1 <= x2 <= 40
X (FPGA1) | Must Write | 1 <= x1 <= 10, 1 <= x2 <= 40    | 1 <= x1 <= 10, 1 <= x2 <= 40
X (FPGA3) | Must Read  | 21 <= x1 <= 30, 0 <= x2 <= 40   | 20 <= x1 <= 30, 1 <= x2 <= 40
X (FPGA3) | Must Write | 21 <= x1 <= 30, 1 <= x2 <= 40   | 21 <= x1 <= 30, 1 <= x2 <= 40

Once the precedence and the specific data to be communicated between tasks have been calculated, a difficult scheduling task is left to be performed. The producer/consumer rates must be calculated in order to determine what, if any, data buffering on an FPGA needs to occur so as to avoid writebacks to a local FPGA memory or even more costly writebacks to shared global memory (while still taking into account the available area on the FPGA). The communication analysis must also consider "may" reads and writes; since communication is not guaranteed, additional control is required in these cases. Computation execution times must be estimated and combined with timing estimates for the communication to generate the final schedule.

4 Related Work

Work completed on the SPLASH [BAK96] and NAPA [GS97] projects has resulted in tools that support various phases of compiling for FPGA-based architectures. The SPLASH compiler translates only SIMD-style code, and the design partitioning among the available FPGAs is performed manually most of the time to achieve good performance.
NAPA C targets only the NAPA architecture and relies heavily on user-supplied hints to partition the computation and the data. DEFACTO differs from these projects in several ways. DEFACTO integrates the compiler with the synthesis tool into one end-to-end design environment. The user application is automatically partitioned among the system components without assistance from the user. Finally, DEFACTO targets a number of different FPGA architectures.

The Reconfigurable Architecture Workstation (RAW) [WTS+97], the BRASS [HW97] project, and PipeRench [GSB+99] are architecture projects with substantial compiler development as well. The RAW compiler incorporates optimizations to exploit instruction-level parallelism, and to partition and map computation onto the array of tiles, while the BRASS compiler exploits instruction-level parallelism and data-flow graph mapping by using a library of components geared for their architecture. Their compiler techniques focus on generating configurations by mapping dataflow graph nodes to library modules. PipeRench focuses exclusively on fine-grained pipelined computations.

Pande et al. [RSP98] have developed heuristic techniques for scheduling loops onto reconfigurable architectures. The Program Dependence Graph (PDG) [FOW87] is analyzed to determine cut-sets (and corresponding configurations) which reduce the reconfiguration cost. Their scope is limited in terms of identifying the opportunities for mapping computations onto the configurable logic, and focuses on exploiting fine-grained operator parallelism within loop nests.

To target run-time reconfigurable systems, a group at the University of Cincinnati has developed an automated temporal partitioning and design space exploration methodology [KV99]. They also borrow parallel compiler technology and combine it with an integer linear programming based solver. Their system targets a single FPGA and is aimed at exploiting fine-grained parallelism.
The DEFACTO system-level compiler is distinguished from these previous approaches in several ways: (1) it automatically derives fine and coarse-grained parallelism and the associated communication and data partitioning; (2) it targets a board-level system, and not just a single FPGA; and (3) it leverages memory hierarchy optimizations to optimize data accesses.

Table 3.
Task # | Task Precedence | Communication To/From
1.1 | None | { X: 1 <= x1 <= 10, 1 <= x2 <= 40 } to Task 1.2
1.2 | 1.1  | { X: 1 <= x1 <= 10, 1 <= x2 <= 40 } from Task 1.1
    |      | { X: 10 <= x1 <= 10, 1 <= x2 <= 40 } to Task 2.2
2.1 | None | { X: 11 <= x1 <= 20, 1 <= x2 <= 40 } to Task 2.2
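The "from" entries that Table 3 lists for Task 2.2 can be reproduced by attributing each element the task reads to the most recent task that wrote it, in the spirit of the reaching-definitions matching described for this analysis. The Python below is a rough sketch under stated assumptions, not the DEFACTO implementation: the helper names are hypothetical, the region bounds are the must-read and must-write entries of Tables 1 and 2, and the writer ordering (later tasks first) is an assumption made for illustration.

```python
def region(rows, cols):
    """Expand inclusive (low, high) bounds into a set of (x1, x2) elements."""
    return {(r, c)
            for r in range(rows[0], rows[1] + 1)
            for c in range(cols[0], cols[1] + 1)}


def bounds(cells):
    """Summarize a non-empty, rectangular set of elements by its bounds."""
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    return (min(rows), max(rows)), (min(cols), max(cols))


# Must-write regions of array X (Tables 1 and 2), most recent writer first.
# The ordering is an assumption of this sketch.
writers = [
    ("2.1", region((11, 20), (1, 40))),
    ("1.2", region((1, 10), (1, 40))),
    ("1.1", region((1, 10), (1, 40))),
]


def match_read_to_writers(read_cells, ordered_writers):
    """Attribute every element that is read to the latest task that wrote it."""
    remaining = set(read_cells)
    sources = {}
    for task, written in ordered_writers:
        supplied = remaining & written
        if supplied:
            sources[task] = bounds(supplied)
            remaining -= supplied
    return sources


# Task 2.2 must read 10 <= x1 <= 20, 1 <= x2 <= 40 (Table 1).
sources_2_2 = match_read_to_writers(region((10, 20), (1, 40)), writers)
```

Under these assumptions, Task 2.1 supplies rows 11 through 20 and Task 1.2 supplies row 10, while Task 1.1 supplies nothing because Task 1.2 overwrites the same region.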

In addition to tiling for memory capacity, an additional decision must be made as to which of the local memories a datum will reside in. Local data distribution, that is, data placement across the local memories of a single FPGA, is essential to exploiting the full communication bandwidth between an FPGA and its local memory. This memory interleaving provides high bandwidth for memory accesses; an approach to exploiting this automatically is discussed in [BLA99]. As an example, let us assume there are four memories associated with each FPGA, as in Figure 5. Each memory bank has its own channel. If an instruction requires four input operands and each operand is placed in a different memory, all operands can be retrieved simultaneously. Similarly, multiple iterations of the same inner loop can be executed in parallel by fetching input operands from different memories, using a technique called modulo unrolling [BLA99]. Writing back to local memories can also be executed in parallel when necessary.

3.4 Sequence, Control and Communication

With the data and computation partitioning decisions described in Section 3.2 and using the array data-flow analysis from Section 3.1, the system can sequence the computations and insert the required communication and control. We refer to each individual FPGA computation as a task. Inter-task communication and task sequencing are determined by examining the results of array data-flow analysis. While the sequencing and control are generated using global scheduling analysis, the implementation of the schedule, as communication and control code, is distributed across the computation FPGAs, the controller FPGA and the GPP. Just as on a multiprocessor, the sequencing of the tasks mapped on a distributed set of FPGAs must preserve sequential semantics, realized by taking data dependences into account. Data access functions provide further information for pipelining tasks. Matching producer/consumer rates results in decisions about how to buffer one task's output to the following task's input.
Conditional communication also poses difficulties, and shared resources must be carefully managed to avoid contention. These issues all suggest a careful global scheduling of tasks, communications and memory accesses, beyond the decisions made in previous phases of the compilation. In this section, we focus on a subset of these issues, namely sequencing tasks and determining their communication requirements.

In a multiprocessor system, communication analysis is a set of techniques used to track the flow of data between processors [KN95][GSS96]. We view each FPGA/CCU as a single processor, but we must also go a step further and treat each task running on the same FPGA as a separate process. By so doing, we can avoid returning computed results to local memories if they are just going to be accessed subsequently by another FPGA computation; instead, the results can remain on the FPGA and the appropriate outputs of the producer computation can be mapped to the corresponding inputs of the consumer. Communication analysis must identify all data to be communicated between processes. Thus, communication analysis must determine additional communication requirements beyond the intra-task, inter-FPGA communication derived in the previous phases of compilation: (1) communication between tasks and between FPGAs, such as the analysis performed to eliminate barriers in [T95]; and (2) communication between tasks on the same FPGA.

Communication analysis involves analyzing data dependences between tasks after partitioning across FPGAs, related to the inter-iteration dependence equations presented in Section 3.2. Dependence information suggests a task precedence list, and captures the specific locations within an array at each communication event. (The data and computation partitioning phase inserts communication hints as well, for intra-task communication and data reorganization.)

Consider the code in Figure 4(b) for FPGA 2. The two loop nests are identified as task one and task two and are assigned to FPGA number 2 in our system.
The compiler uses the loop bounds to first determine the regions of each array that are read and written by each task. This information is summarized in Table 1. The indices x1 and x2 represent the row and column dimensions for that array. First we identify whether a true dependence exists between the two tasks by taking the intersection of the set of variables written in task one and the set of variables read in task two. For our example, the intersection set for a true dependence is {X: 11 <= x1 <= 20, 1 <= x2 <= 40}. Since the set is non-empty, we say that some or all of task one must execute before task two. The members of a non-empty intersection set must be explicitly communicated between the two tasks. A similar analysis is employed for anti and output dependences for this task pair. For FPGA 2, these sets are {X: 11 <= x1 <= 20, 1 <= x2 <= 40} and {X: 11 <= x1 <= 20, 1 <= x2 <= 40} respectively. In this case, the anti and output dependences yield the same information as the true dependence.

Notice that the annotations added to the FPGA code by the data and computation partitioning phase indicate that some communication must occur with other FPGAs. The communication between FPGA 1 and 2 and between FPGA 2 and 3 must be analyzed in a manner similar to what was described above. Figure 4 shows the code that executes on FPGAs 1, 2 and 3. (The tasks are renumbered for clarity of explanation. Task 1.1 indicates task 1 running on FPGA 1.) Table 2 shows the corresponding access information. We determine the task precedence internal to each FPGA first, just as we did above for FPGA 2. For FPGA 1, task 1.2 is dependent on task 1.1, and similarly for FPGA 3.
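The three intersections for the task pair on FPGA 2 can be reproduced with a short sketch. The Python below is only an illustration and not the DEFACTO implementation: the `Region` alias and the `intersect` and `dependences` helpers are hypothetical names, and the regions are the must-read and must-write bounds given in Table 1.

```python
from typing import Dict, Optional, Tuple

# A region holds inclusive (low, high) bounds for x1 (rows) and x2 (columns).
Region = Tuple[Tuple[int, int], Tuple[int, int]]


def intersect(a: Region, b: Region) -> Optional[Region]:
    """Return the overlap of two rectangular regions, or None if it is empty."""
    overlap = []
    for (lo_a, hi_a), (lo_b, hi_b) in zip(a, b):
        lo, hi = max(lo_a, lo_b), min(hi_a, hi_b)
        if lo > hi:
            return None
        overlap.append((lo, hi))
    return (overlap[0], overlap[1])


def dependences(first: Dict[str, Region],
                second: Dict[str, Region]) -> Dict[str, Optional[Region]]:
    """Intersections that order task `first` before task `second`."""
    return {
        "true": intersect(first["write"], second["read"]),
        "anti": intersect(first["read"], second["write"]),
        "output": intersect(first["write"], second["write"]),
    }


# Regions of array X accessed by the two tasks on FPGA 2 (Table 1).
task_2_1 = {"read": ((11, 20), (0, 40)), "write": ((11, 20), (1, 40))}
task_2_2 = {"read": ((10, 20), (1, 40)), "write": ((11, 20), (1, 40))}

deps = dependences(task_2_1, task_2_2)
```

All three entries of `deps` come out as the region 11 <= x1 <= 20, 1 <= x2 <= 40, which agrees with the sets quoted in the text.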

between FPGAs. The shaded rectangle represents a block which includes 10*10 loop iterations. Now nearest-neighbor communication at the block boundary is necessary to synchronize loop iterations, but this is much cheaper than data-reorganization communication. Although each FPGA accesses columns of array X within a block, all the necessary columns of array X are already present in its local memory. Moreover, parallelism is available in the diagonal direction. This example shows the tradeoff between parallelism and communication overhead.

(a) Code to run on FPGA 1
    1.1) for i=1 to 10
           for j=1 to 40
             X[i, j] = X[i, j] + X[i, j-1]
    1.2) for jj=1 to 40 by 10 {
           for j=jj to jj+9 {
             for i=1 to 10
               X[i, j] = X[i, j] }   // boundary case
           // send signal to FPGA#2
         }

(b) Code to run on FPGA 2
    2.1) for i=11 to 20
           for j=1 to 40
             X[i, j] = X[i, j] + X[i, j-1]
    2.2) for jj=1 to 40 by 10 {
           // receive signal from FPGA#1
           for j=jj to jj+9 {
             for i=11 to 20
               X[i, j] = X[i, j] + X[i-1, j] }
           // send signal to FPGA#3
         }

(c) Code to run on FPGA 3
    3.1) for i=21 to 30
           for j=1 to 40
             X[i, j] = X[i, j] + X[i, j-1]
    3.2) for jj=1 to 40 by 10 {
           // receive signal from FPGA#2
           for j=jj to jj+9 {
             for i=21 to 30
               X[i, j] = X[i, j] + X[i-1, j] }
           // send signal to FPGA#4
         }

Figure 4. Tasks for FPGA 1, 2 and 3

3.3 Data Placement and Tiling

The data partitioning resulting from the previous phase does not take memory capacity into account. With relatively small local memories associated with each FPGA, some applications may require more memory than is available. Tiling, which was used in the previous example to reduce communication costs, can also be used to address this problem [WL91]. By tiling a loop nest, a portion (block) of the loop iteration space is assigned to each FPGA. Accordingly, the required data is reduced to the size of the block. The block size is determined at run-time depending on the actual number of FPGAs available and the local memory size. Tiling thus enhances data reuse (locality). As shown in Figure 5, tiling also avoids extremely expensive communication between local memory and shared memory (which may even be off board).

[Figure 3. Data and Computation Partitioning]

[Figure 5. Data placement: an FPGA with its local memories, each on its own channel, and shared memory]
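The tiled doacross schedule of Figure 4 can be checked against the sequential second loop nest with a small simulation. This is a minimal Python sketch under stated assumptions rather than the code DEFACTO generates: the function names are hypothetical, the send/receive signals are modeled as wavefront steps, and row 0 is treated as a boundary row so that X[i-1][j] is always defined (Figure 4 instead special-cases FPGA 1).

```python
N_ROWS, N_COLS = 40, 40   # iteration space of the second loop nest
N_FPGAS, TILE = 4, 10     # four FPGAs, column blocks of width 10


def make_array():
    """Build a test array; row 0 and column 0 hold assumed boundary values."""
    return [[(3 * i + 5 * j) % 7 for j in range(N_COLS + 1)]
            for i in range(N_ROWS + 1)]


def sequential(x):
    """Second loop nest in its original order: j outer, i inner."""
    for j in range(1, N_COLS + 1):
        for i in range(1, N_ROWS + 1):
            x[i][j] += x[i - 1][j]
    return x


def tiled_doacross(x):
    """Each FPGA owns a band of rows and sweeps the column blocks in order.

    FPGA f may start block b only after FPGA f-1 has finished block b, so
    the pairs (f, b) with f + b == step form one diagonal wavefront.
    """
    rows_per_fpga = N_ROWS // N_FPGAS
    n_blocks = N_COLS // TILE
    for step in range(N_FPGAS + n_blocks - 1):
        for fpga in range(N_FPGAS):       # concurrent in hardware
            block = step - fpga
            if not 0 <= block < n_blocks:
                continue
            first_row = fpga * rows_per_fpga + 1
            first_col = block * TILE + 1
            for j in range(first_col, first_col + TILE):
                for i in range(first_row, first_row + rows_per_fpga):
                    x[i][j] += x[i - 1][j]
    return x


same_result = tiled_doacross(make_array()) == sequential(make_array())
```

With four FPGAs and four column blocks, the wavefront needs seven steps, and only the last row of each band has to be passed to the neighboring FPGA at each block boundary.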

(a)
    1) for i=1 to 40
         for j=1 to 40
           X[i, j] = X[i, j] + X[i, j-1]
    2) for j=1 to 40
         for i=1 to 40
           X[i, j] = X[i, j] + X[i-1, j]

(b)
    1) forall i=1 to 40
         for j=1 to 40
           X[i, j] = X[i, j] + X[i, j-1]
    2) forall j=1 to 40
         for i=1 to 40
           X[i, j] = X[i, j] + X[i-1, j]

Figure 2. Parallelization Example

Figure 2 shows an example of the results of parallelization analysis. There are two loop nests in the original sequential application specification in Figure 2(a). The parallelization analysis will identify that only the outer loops in both loop nests are parallelizable; the inner loops carry true dependences across different iterations (Figure 2(b)). In other words, for each inner loop, data written to an element of array X is read in the next iteration. If we consider each loop nest separately, the iterations of its outer loop can be executed safely in parallel, with no communication of values across iterations (often referred to as a doall loop).

3.2 Data and Computation Partitioning

After identifying the loop-level parallelism, the next phase of compilation distributes both the computation and data across all system components. We perform data and computation partitioning [AL93] to determine the computation and data that are allocated to the same FPGA and its local memory. If a loop nest is considered in isolation, it is most desirable to exploit doall loops such as the outer loops in Figure 2(b), where processors can execute independently without communication; less desirable are doacross loops, which require communication for synchronization between iterations. However, the choice of parallelism is a global decision which must take overall communication cost into account. Sometimes doacross loops may require less communication overall, by avoiding communication between loop nests.

The compiler partitions the data and computation using a linear algebra framework, restricted to the domain of affine array subscript expressions, i.e., linear functions of loop indices. The framework attempts to exploit the coarsest granularity of parallelism.
For example, when multiple loops are parallelizable in a loop nest, the coarsest granularity is obtained by parallelizing the outermost loop. Starting with a solution that has the coarsest granularity of parallelism, a greedy algorithm is used to derive the least-cost solution, possibly trading some degree of parallelism to eliminate communication. The greedy algorithm uses weighted cost estimates that account for execution time percentage, benefit of parallelization and communication costs.

Consider the example in Figure 2, assuming arrays are stored in row-major order. Initially, analysis attempts to distribute iterations of the outer loops across the FPGAs to get the coarsest granularity of parallelism. With this selection, each FPGA accesses rows of X in the first loop nest and columns of X in the second loop nest. Figure 3 illustrates this partitioning choice. A rectangle represents an array element and a circle represents a loop iteration, assuming four FPGAs. Figure 3(a) and (b) show how array elements are partitioned across FPGA local memories. Linked array elements are assigned to the same FPGA. In the first loop nest (Figure 3(a)), the first dimension of array X is distributed across FPGAs and the second dimension is allocated to the same FPGA. In the second loop nest (Figure 3(b)), the first dimension is local and the second dimension is distributed. Figure 3(c) and (d) show how loop iterations are allocated to each FPGA, that is, the computation partitioning. All the linked loop iterations are executed on the same FPGA and the shaded region represents parallel execution. In both loop nests (Figure 3(c)(d)), the inner loop is executed on the same FPGA and the outer loop is distributed across FPGAs.

This solution is problematic: it requires very expensive data reorganization communication, because an FPGA accesses a different section of array X in each of the two loop nests. A simple way to avoid this communication is to give up the doall parallelism available to the second loop nest. Instead, doacross parallelism is exploited.
To reduce the communication overhead, tiles of the outer loop are executed on each FPGA [WL91]. Figure 4 shows the result of this alternative partitioning. Figure 4(b) shows the computation that is assigned to FPGA 1, assuming 4 FPGAs numbered 0 through 3. Because the first outer loop is a doall loop, FPGA 1 executes loop iterations 11 through 20. Figure 3(e) shows computation partitioning of the second loop nest, which is exploiting doacross parallelism

3 Parallelization and Locality Management

The previously described performance of system components has led us to adopt in our compiler algorithms several goals used in compiling to distributed-memory multiprocessors. First, identifying computations that can execute in parallel, both at the fine grain and at the coarser grain, is critical to obtaining performance improvements in an adaptive computing system. Second, however, communication (including non-local memory accesses) can be very expensive relative to the computation rate; thus, it may be necessary to sacrifice some parallelism to avoid communication, particularly between non-neighbors. Third, even local memory accesses can be somewhat costly relative to the computation rate, so it is desirable to exploit temporal reuse within the FPGA.

In this section, we discuss how these goals can be met by applying existing parallelizing compiler technology. Additional issues, not addressed by parallelizing compilers, arise in compiling to adaptive computing systems. Specific issues that will be discussed in this section include explicit placement of data in memories and communication between tasks within an FPGA (on-chip communication). Many other issues important to compiling to adaptive computing systems, beyond the scope of this paper, are being addressed by DEFACTO. These include customizing hardware to a computation, such as fine-grain application-specific pipelining and variable-precision arithmetic, and space-sensitive optimizations to avoid reconfiguration.

3.1 Parallelization Analysis

The analyses described in this paper augment an automatic parallelization system that is part of the Stanford SUIF compiler [HAA96][HAM95][HMA95]. The system parallelizes loops whose iterations can be executed in parallel on different processors. To meet this criterion, the memory locations accessed by each iteration of a loop (and thus by each processor) must be independent of locations written by other iterations (and other processors). The compiler uses an interprocedural array data-flow analysis to determine which loops access independent memory locations [HMA95].
The analysis computes data-flow values for each program region, where a region is either a basic block, a loop body, a loop, a procedure call, or a procedure body. The data-flow value at each region includes the following four component sets:

Read describes the portions of arrays that may be read inside the program region.
MustRead describes the portions of arrays that must be read inside the program region.
Write describes the portions of arrays that may be written inside the program region.
MustWrite describes the portions of arrays that must be written inside the program region.

At the program region corresponding to loop L, the portions of arrays described by each of the four component sets are parameterized by the loop index variable i (where, for clarity of presentation, i is assumed to be normalized to start at 1 and step by 1). In the tests below, performed at loop L, the notation $\mathit{Write}_L|_i^{i_1}$ refers to replacing i with some other index $i_1$ in the iteration space I.

$$\mathit{TrueDependence}_L:\quad \exists\, i_1, i_2 \in I,\ i_1 < i_2,\quad \mathit{Write}_L|_i^{i_1} \cap \mathit{Read}_L|_i^{i_2} \neq \emptyset$$

$$\mathit{AntiDependence}_L:\quad \exists\, i_1, i_2 \in I,\ i_1 > i_2,\quad \mathit{Write}_L|_i^{i_1} \cap \mathit{Read}_L|_i^{i_2} \neq \emptyset$$

$$\mathit{OutputDependence}_L:\quad \exists\, i_1, i_2 \in I,\ i_1 < i_2,\quad \mathit{Write}_L|_i^{i_1} \cap \mathit{Write}_L|_i^{i_2} \neq \emptyset$$
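To make the three tests at loop L concrete, they can be evaluated by brute force on the first loop nest of Figure 2. The Python below is a hedged sketch, not the SUIF analysis, which works on symbolic array summaries rather than enumerated sets: the function name, the small bound `N`, and the use of explicit element sets are all assumptions made for illustration.

```python
from itertools import product


def carried_dependences(iterations, read, write):
    """Evaluate the true, anti and output tests over all iteration pairs."""
    found = {"true": False, "anti": False, "output": False}
    for i1, i2 in product(iterations, repeat=2):
        if i1 < i2 and write(i1) & read(i2):
            found["true"] = True
        if i1 > i2 and write(i1) & read(i2):
            found["anti"] = True
        if i1 < i2 and write(i1) & write(i2):
            found["output"] = True
    return found


N = 8                      # small loop bound, chosen only to keep this quick
INDICES = range(1, N + 1)

# Outer loop i of X[i, j] = X[i, j] + X[i, j-1]; each set summarizes all j.
outer = carried_dependences(
    INDICES,
    read=lambda i: {(i, j) for j in INDICES} | {(i, j - 1) for j in INDICES},
    write=lambda i: {(i, j) for j in INDICES},
)

# Inner loop j of the same nest, for one fixed row.
ROW = 1
inner = carried_dependences(
    INDICES,
    read=lambda j: {(ROW, j), (ROW, j - 1)},
    write=lambda j: {(ROW, j)},
)
```

In this sketch the outer loop reports no dependence of any kind, so its iterations are independent, while the inner loop reports a true dependence only, because iteration j+1 reads the element written by iteration j.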

[Figure 1. ACS Architecture: high-level view of the Wildstar board, showing PE0, PE1 and PE2, SRAMs on 32b channels, a 64b communication channel, and a crossbar to shared memory]

shared system memory and general-purpose processor (GPP) are directly connected to the CCU (this connection varies significantly across boards). The GPP is responsible for orchestrating the execution of the CCUs by managing the flow of control and data (from local memory to shared system memory) in the application execution.

We chose a board-level architecture as a target for DEFACTO because it can be assembled with commodity parts, and commercial systems composed of such components are already available. We expect the compilation approaches to form a solid foundation for a multi-board system, and also to be applicable to system-on-a-chip devices that have multiple independent memory banks, each with configurable logic.

The Wildstar board in Figure 1 consists of 3 FPGAs, all Xilinx Virtex parts with up to a million gates per chip. One FPGA serves as a controller; the other two FPGAs are connected by a 64-bit channel, and each has two local SRAMs with 32-bit dedicated channels. All three FPGAs share additional system memory. PEs 1 and 2 are indirectly connected to PE0, depicted by the dotted lines in the figure.

We also consider a research prototype board-level system, the SLAAC2 board [SCC+99]. Similar to the Wildstar board, SLAAC2 has two computation FPGAs and one controller FPGA. The computation FPGAs each have 4 local 256x18-bit SRAM memories and a 72-bit interconnect between the FPGAs. Larger systems based on SLAAC2 can be configured by connecting these boards with Myrinet interconnect.

In some sense, these systems are all small-scale distributed-memory multiprocessors, which can be configured into larger, hierarchical distributed-memory multiprocessors with multi-board systems. The FPGAs can be thought of as independent processors, each of which can be executing multiple independent computations in parallel.
Each FPGA has its own local SRAM memory(ies), and the application must explicitly manage placement of data in memory and communication from one FPGA or its associated memory to another FPGA. Each has an external system memory that is shared, which can be thought of as a secondary store.

The analogy to distributed-memory multiprocessors becomes even more obvious when we consider the relative performance of system components. These boards have a system clock, typically around 100 MHz, that governs the rate of communication and memory access. From FPGA to memory, there are multi-bit dedicated channels; data can be transferred to/from memory at the rate of a bit per clock per channel pin. As an example, the SLAAC2 board has an FPGA-to-local-memory bandwidth of 7.2 Gbits/sec. Similarly, multi-bit dedicated channels connecting one FPGA to another can transfer data at rates on the order of a few Gbits/sec (also 7.2 Gbits/sec for the SLAAC2 board). These are fairly accurate estimates of the best-case communication and memory bandwidth available in these systems. Communication between nonadjacent FPGAs and access to non-local memories have a much lower bandwidth, difficult to estimate, as they must pass through intervening FPGAs; such remote accesses also occupy space on the FPGAs and introduce contention with local communication (accesses) and other remote communications (accesses).

It is instructive to compare the nearest-neighbor communication and local memory rates with the computation rate, but a realistic estimate of computation rate is much more difficult. FPGA cycle time varies with the complexity of the internal circuitry, but it is safe to assume that it is clocked no faster than the board clock. The amount of logic used as interconnect also varies depending on the quality of the layout and the complexity of the design. However, if we assume several gates per bit operation and only a small fraction of the chip used for performing these operations, we still see computation rates two or three orders of magnitude higher than the best communication or memory access rate.
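The 7.2 Gbits/sec figures quoted for the SLAAC2 board are consistent with simple arithmetic. The Python sketch below is illustrative only: it assumes the clock is exactly 100 MHz (the text says "typically around 100 MHz"), that every channel pin carries one bit per clock, and that the four 18-bit local SRAMs of an FPGA are accessed together; the function name is hypothetical.

```python
def channel_bandwidth_gbits(width_bits: int, clock_mhz: float) -> float:
    """Peak bandwidth in Gbits/sec at one bit per clock per channel pin."""
    return width_bits * clock_mhz * 1e6 / 1e9


CLOCK_MHZ = 100.0  # assumed exact for this estimate

# Four local SRAMs per computation FPGA, each 18 bits wide.
local_memory_gbits = channel_bandwidth_gbits(4 * 18, CLOCK_MHZ)

# 72-bit interconnect between the two computation FPGAs.
fpga_to_fpga_gbits = channel_bandwidth_gbits(72, CLOCK_MHZ)
```

Both values evaluate to 7.2 Gbits/sec, since 72 bits at 100 MHz is 7.2 x 10^9 bits per second. These are best-case peak rates, as the text notes.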
To summarize, the above examination of system components points to several features of adaptive computing systems that are mirrored in distributed-memory multiprocessors. First, access to local memory is significantly faster than remote accesses. Second, nearest-neighbor communication delivers high bandwidth, while communication between non-neighbors is much slower. Third, when logic is effectively mapped to the FPGAs, the computation rate is significantly higher than either the local memory access rate or the communication rate.

Parallelization and Locality Analysis for Adaptive Computing Systems

Byoungro So, Heidi Ziegler, and Mary Hall
Information Sciences Institute, University of Southern California, Marina del Rey, CA

Abstract

This paper presents a strategy for compiling to adaptive computing architectures, systems that incorporate configurable logic devices such as FPGAs. As compared to conventional instruction set architectures, adaptive computing systems offer the opportunity to customize the logic according to the requirements of each application. In this paper, we focus on a particular aspect of customizing the logic: exploiting parallelism. Significant performance improvements can be realized by tailoring the parallelism of a computation, both at the fine and coarse grain, and optimizing the memory accesses and the interconnection between operations. This paper demonstrates that adaptive computing systems have many of the same performance tradeoff issues as distributed-memory multiprocessors. We describe how compilation techniques developed for managing parallelism and data locality in distributed-memory multiprocessors can be used in adaptive computing systems.

1 Introduction

Adaptive computing architectures incorporating configurable computing units (CCUs) (e.g., FPGAs) can offer significant performance advantages over conventional processors, as they can be tailored to the particular computational needs of a given application (e.g., template-based matching). Unfortunately, mapping programs to configurable architectures is extremely cumbersome, demanding that software developers also assume the role of hardware designers. The absence of general-purpose, high-level programming tools for adaptive computing applications has hampered the widespread adoption of this technology; currently, this area is only accessible to a very small collection of specially trained individuals. To address this lack of tools, we are developing DEFACTO, an end-to-end design environment for developing an application in a high-level language such as C, and (semi-)automatically mapping the application to configurable architectures.
DEFACTO combines parallelizing compiler technology and synthesis tools in a single unified system [BDD+99]. This paper focuses on some of the compilation techniques used in DEFACTO for managing parallelism and data locality.

Inherent features of configurable logic influence its use in accelerating application performance. First, configurable logic is slower than conventional logic. Second, if the logic required by an application exceeds the capacity of the configurable devices, an additional run-time cost associated with reconfiguring the logic must be taken into account. Because of these additional overheads, the only way to exploit configurable architectures to improve the performance of an application is to significantly reduce the operations that need to be performed through customized logic, or to exploit parallelism in the application. Thus, a compiler for a configurable architecture has as its main responsibility identifying computations that are well-suited for execution in configurable logic. For this purpose, it can draw on the large body of previous work on compiling to shared-memory and distributed-memory multiprocessors, including locality and communication management.

This paper points out the similarities between a board-level adaptive computing system and a distributed-memory multiprocessor. To our knowledge, this is the first paper to demonstrate how parallelizing compiler technology for identifying coarse-grain loop-level parallelism and managing locality can be applied in this new domain.

The remainder of the paper is organized into three sections and a conclusion. The next section presents an overview of DEFACTO and background on adaptive computing architectures. In Section 3, we demonstrate how parallelizing compiler technology can be applied to an example program excerpt to map the application to a board-level adaptive computing system, with localized data accesses and infrequent communication. Section 4 presents related work.
2 Adaptive Computing Architectures

An adaptive computing architecture is a computing system that incorporates configurable logic devices such as FPGAs, usually in combination with conventional logic and memory. Adaptive computing architectures have been proposed at scales ranging from small systems-on-a-chip [WTS+97,HW97,GSB+99], through board-level systems such as those from Annapolis Micro Systems, up to large-scale multi-board systems [BAK96]. The compilation techniques designed for DEFACTO focus on targeting board-level systems similar to the Wildstar board from Annapolis Micro Systems, depicted in Figure 1. Such systems consist of multiple interconnected CCUs; each can access its own local memory. A larger

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr

More information

Parallel matrix-vector multiplication

Parallel matrix-vector multiplication Appendx A Parallel matrx-vector multplcaton The reduced transton matrx of the three-dmensonal cage model for gel electrophoress, descrbed n secton 3.2, becomes excessvely large for polymer lengths more

More information

Assembler. Building a Modern Computer From First Principles.

Assembler. Building a Modern Computer From First Principles. Assembler Buldng a Modern Computer From Frst Prncples www.nand2tetrs.org Elements of Computng Systems, Nsan & Schocken, MIT Press, www.nand2tetrs.org, Chapter 6: Assembler slde Where we are at: Human Thought

More information

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz Compler Desgn Sprng 2014 Regster Allocaton Sample Exercses and Solutons Prof. Pedro C. Dnz USC / Informaton Scences Insttute 4676 Admralty Way, Sute 1001 Marna del Rey, Calforna 90292 pedro@s.edu Regster

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Some materal adapted from Mohamed Youns, UMBC CMSC 611 Spr 2003 course sldes Some materal adapted from Hennessy & Patterson / 2003 Elsever Scence Performance = 1 Executon tme Speedup = Performance (B)

More information

Improving High Level Synthesis Optimization Opportunity Through Polyhedral Transformations

Improving High Level Synthesis Optimization Opportunity Through Polyhedral Transformations Improvng Hgh Level Synthess Optmzaton Opportunty Through Polyhedral Transformatons We Zuo 2,5, Yun Lang 1, Peng L 1, Kyle Rupnow 3, Demng Chen 2,3 and Jason Cong 1,4 1 Center for Energy-Effcent Computng

More information

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 A mathematcal programmng approach to the analyss, desgn and

More information

Loop Permutation. Loop Transformations for Parallelism & Locality. Legality of Loop Interchange. Loop Interchange (cont)

Loop Permutation. Loop Transformations for Parallelism & Locality. Legality of Loop Interchange. Loop Interchange (cont) Loop Transformatons for Parallelsm & Localty Prevously Data dependences and loops Loop transformatons Parallelzaton Loop nterchange Today Loop nterchange Loop transformatons and transformaton frameworks

More information

The Codesign Challenge

The Codesign Challenge ECE 4530 Codesgn Challenge Fall 2007 Hardware/Software Codesgn The Codesgn Challenge Objectves In the codesgn challenge, your task s to accelerate a gven software reference mplementaton as fast as possble.

More information

Configuration Management in Multi-Context Reconfigurable Systems for Simultaneous Performance and Power Optimizations*

Configuration Management in Multi-Context Reconfigurable Systems for Simultaneous Performance and Power Optimizations* Confguraton Management n Mult-Context Reconfgurable Systems for Smultaneous Performance and Power Optmzatons* Rafael Maestre, Mlagros Fernandez Departamento de Arqutectura de Computadores y Automátca Unversdad

More information

Loop Transformations for Parallelism & Locality. Review. Scalar Expansion. Scalar Expansion: Motivation

Loop Transformations for Parallelism & Locality. Review. Scalar Expansion. Scalar Expansion: Motivation Loop Transformatons for Parallelsm & Localty Last week Data dependences and loops Loop transformatons Parallelzaton Loop nterchange Today Scalar expanson for removng false dependences Loop nterchange Loop

More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

Wishing you all a Total Quality New Year!

Wishing you all a Total Quality New Year! Total Qualty Management and Sx Sgma Post Graduate Program 214-15 Sesson 4 Vnay Kumar Kalakband Assstant Professor Operatons & Systems Area 1 Wshng you all a Total Qualty New Year! Hope you acheve Sx sgma

More information

Loop Transformations, Dependences, and Parallelization

Loop Transformations, Dependences, and Parallelization Loop Transformatons, Dependences, and Parallelzaton Announcements Mdterm s Frday from 3-4:15 n ths room Today Semester long project Data dependence recap Parallelsm and storage tradeoff Scalar expanson

More information

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points;

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points; Subspace clusterng Clusterng Fundamental to all clusterng technques s the choce of dstance measure between data ponts; D q ( ) ( ) 2 x x = x x, j k = 1 k jk Squared Eucldean dstance Assumpton: All features

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

Related-Mode Attacks on CTR Encryption Mode

Related-Mode Attacks on CTR Encryption Mode Internatonal Journal of Network Securty, Vol.4, No.3, PP.282 287, May 2007 282 Related-Mode Attacks on CTR Encrypton Mode Dayn Wang, Dongda Ln, and Wenlng Wu (Correspondng author: Dayn Wang) Key Laboratory

More information

UB at GeoCLEF Department of Geography Abstract

UB at GeoCLEF Department of Geography   Abstract UB at GeoCLEF 2006 Mguel E. Ruz (1), Stuart Shapro (2), June Abbas (1), Slva B. Southwck (1) and Davd Mark (3) State Unversty of New York at Buffalo (1) Department of Lbrary and Informaton Studes (2) Department

More information

Analysis of Continuous Beams in General

Analysis of Continuous Beams in General Analyss of Contnuous Beams n General Contnuous beams consdered here are prsmatc, rgdly connected to each beam segment and supported at varous ponts along the beam. onts are selected at ponts of support,

More information

Outline. Digital Systems. C.2: Gates, Truth Tables and Logic Equations. Truth Tables. Logic Gates 9/8/2011

Outline. Digital Systems. C.2: Gates, Truth Tables and Logic Equations. Truth Tables. Logic Gates 9/8/2011 9/8/2 2 Outlne Appendx C: The Bascs of Logc Desgn TDT4255 Computer Desgn Case Study: TDT4255 Communcaton Module Lecture 2 Magnus Jahre 3 4 Dgtal Systems C.2: Gates, Truth Tables and Logc Equatons All sgnals

More information

Repeater Insertion for Two-Terminal Nets in Three-Dimensional Integrated Circuits

Repeater Insertion for Two-Terminal Nets in Three-Dimensional Integrated Circuits Repeater Inserton for Two-Termnal Nets n Three-Dmensonal Integrated Crcuts Hu Xu, Vasls F. Pavlds, and Govann De Mchel LSI - EPFL, CH-5, Swtzerland, {hu.xu,vasleos.pavlds,govann.demchel}@epfl.ch Abstract.

More information

Conditional Speculative Decimal Addition*

Conditional Speculative Decimal Addition* Condtonal Speculatve Decmal Addton Alvaro Vazquez and Elsardo Antelo Dep. of Electronc and Computer Engneerng Unv. of Santago de Compostela, Span Ths work was supported n part by Xunta de Galca under grant

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

Module Management Tool in Software Development Organizations

Module Management Tool in Software Development Organizations Journal of Computer Scence (5): 8-, 7 ISSN 59-66 7 Scence Publcatons Management Tool n Software Development Organzatons Ahmad A. Al-Rababah and Mohammad A. Al-Rababah Faculty of IT, Al-Ahlyyah Amman Unversty,

More information

Simulation Based Analysis of FAST TCP using OMNET++

Simulation Based Analysis of FAST TCP using OMNET++ Smulaton Based Analyss of FAST TCP usng OMNET++ Umar ul Hassan 04030038@lums.edu.pk Md Term Report CS678 Topcs n Internet Research Sprng, 2006 Introducton Internet traffc s doublng roughly every 3 months

More information

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique //00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009.

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009. Farrukh Jabeen Algorthms 51 Assgnment #2 Due Date: June 15, 29. Assgnment # 2 Chapter 3 Dscrete Fourer Transforms Implement the FFT for the DFT. Descrbed n sectons 3.1 and 3.2. Delverables: 1. Concse descrpton

More information

Parallel Inverse Halftoning by Look-Up Table (LUT) Partitioning

Parallel Inverse Halftoning by Look-Up Table (LUT) Partitioning Parallel Inverse Halftonng by Look-Up Table (LUT) Parttonng Umar F. Sddq and Sadq M. Sat umar@ccse.kfupm.edu.sa, sadq@kfupm.edu.sa KFUPM Box: Department of Computer Engneerng, Kng Fahd Unversty of Petroleum

More information

Distributed Resource Scheduling in Grid Computing Using Fuzzy Approach

Distributed Resource Scheduling in Grid Computing Using Fuzzy Approach Dstrbuted Resource Schedulng n Grd Computng Usng Fuzzy Approach Shahram Amn, Mohammad Ahmad Computer Engneerng Department Islamc Azad Unversty branch Mahallat, Iran Islamc Azad Unversty branch khomen,

More information

A RECONFIGURABLE ARCHITECTURE FOR MULTI-GIGABIT SPEED CONTENT-BASED ROUTING. James Moscola, Young H. Cho, John W. Lockwood

A RECONFIGURABLE ARCHITECTURE FOR MULTI-GIGABIT SPEED CONTENT-BASED ROUTING. James Moscola, Young H. Cho, John W. Lockwood A RECONFIGURABLE ARCHITECTURE FOR MULTI-GIGABIT SPEED CONTENT-BASED ROUTING James Moscola, Young H. Cho, John W. Lockwood Dept. of Computer Scence and Engneerng Washngton Unversty, St. Lous, MO {jmm5,

More information

Data Representation in Digital Design, a Single Conversion Equation and a Formal Languages Approach

Data Representation in Digital Design, a Single Conversion Equation and a Formal Languages Approach Data Representaton n Dgtal Desgn, a Sngle Converson Equaton and a Formal Languages Approach Hassan Farhat Unversty of Nebraska at Omaha Abstract- In the study of data representaton n dgtal desgn and computer

More information

Classifying Acoustic Transient Signals Using Artificial Intelligence

Classifying Acoustic Transient Signals Using Artificial Intelligence Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)

More information

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 An Iteratve Soluton Approach to Process Plant Layout usng Mxed

More information

Algorithmic Transformation Techniques for Efficient Exploration of Alternative Application Instances

Algorithmic Transformation Techniques for Efficient Exploration of Alternative Application Instances In: Proc. 0th Int. Symposum on Hardware/Software Codesgn (CODES 02), Estes Park, Colorado, USA, May 6 8, 2002 Algorthmc Transformaton Technques for Effcent Exploraton of Alternatve Applcaton Instances

More information

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto

More information

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following.

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following. Complex Numbers The last topc n ths secton s not really related to most of what we ve done n ths chapter, although t s somewhat related to the radcals secton as we wll see. We also won t need the materal

More information

Polyhedral Compilation Foundations

Polyhedral Compilation Foundations Polyhedral Complaton Foundatons Lous-Noël Pouchet pouchet@cse.oho-state.edu Dept. of Computer Scence and Engneerng, the Oho State Unversty Feb 8, 200 888., Class # Introducton: Polyhedral Complaton Foundatons

More information

Performance Study of Parallel Programming on Cloud Computing Environments Using MapReduce

Performance Study of Parallel Programming on Cloud Computing Environments Using MapReduce Performance Study of Parallel Programmng on Cloud Computng Envronments Usng MapReduce Wen-Chung Shh, Shan-Shyong Tseng Department of Informaton Scence and Applcatons Asa Unversty Tachung, 41354, Tawan

More information

CACHE MEMORY DESIGN FOR INTERNET PROCESSORS

CACHE MEMORY DESIGN FOR INTERNET PROCESSORS CACHE MEMORY DESIGN FOR INTERNET PROCESSORS WE EVALUATE A SERIES OF THREE PROGRESSIVELY MORE AGGRESSIVE ROUTING-TABLE CACHE DESIGNS AND DEMONSTRATE THAT THE INCORPORATION OF HARDWARE CACHES INTO INTERNET

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introducton 1.1 Parallel Processng There s a contnual demand for greater computatonal speed from a computer system than s currently possble (.e. sequental systems). Areas need great computatonal

More information

CHAPTER 4 PARALLEL PREFIX ADDER

CHAPTER 4 PARALLEL PREFIX ADDER 93 CHAPTER 4 PARALLEL PREFIX ADDER 4.1 INTRODUCTION VLSI Integer adders fnd applcatons n Arthmetc and Logc Unts (ALUs), mcroprocessors and memory addressng unts. Speed of the adder often decdes the mnmum

More information

Programming in Fortran 90 : 2017/2018

Programming in Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Exercse 1 : Evaluaton of functon dependng on nput Wrte a program who evaluate the functon f (x,y) for any two user specfed values

More information

AADL : about scheduling analysis

AADL : about scheduling analysis AADL : about schedulng analyss Schedulng analyss, what s t? Embedded real-tme crtcal systems have temporal constrants to meet (e.g. deadlne). Many systems are bult wth operatng systems provdng multtaskng

More information

Cache Performance 3/28/17. Agenda. Cache Abstraction and Metrics. Direct-Mapped Cache: Placement and Access

Cache Performance 3/28/17. Agenda. Cache Abstraction and Metrics. Direct-Mapped Cache: Placement and Access Agenda Cache Performance Samra Khan March 28, 217 Revew from last lecture Cache access Assocatvty Replacement Cache Performance Cache Abstracton and Metrcs Address Tag Store (s the address n the cache?

More information

Floating-Point Division Algorithms for an x86 Microprocessor with a Rectangular Multiplier

Floating-Point Division Algorithms for an x86 Microprocessor with a Rectangular Multiplier Floatng-Pont Dvson Algorthms for an x86 Mcroprocessor wth a Rectangular Multpler Mchael J. Schulte Dmtr Tan Carl E. Lemonds Unversty of Wsconsn Advanced Mcro Devces Advanced Mcro Devces Schulte@engr.wsc.edu

More information

EFFICIENT SYNCHRONOUS PARALLEL DISCRETE EVENT SIMULATION

EFFICIENT SYNCHRONOUS PARALLEL DISCRETE EVENT SIMULATION EFFICIENT SYNCHRONOUS PARALLEL DISCRETE EVENT SIMULATION WITH THE ARMEN ARCHITECTURE C. Beaumont, B. Potter J.M. Flloque LIBr I.U.T. de Brest and LIBr Unversté de Bretagne Occdentale Télécom Bretagne BP

More information

CMPS 10 Introduction to Computer Science Lecture Notes

CMPS 10 Introduction to Computer Science Lecture Notes CPS 0 Introducton to Computer Scence Lecture Notes Chapter : Algorthm Desgn How should we present algorthms? Natural languages lke Englsh, Spansh, or French whch are rch n nterpretaton and meanng are not

More information

Virtual Memory. Background. No. 10. Virtual Memory: concept. Logical Memory Space (review) Demand Paging(1) Virtual Memory

Virtual Memory. Background. No. 10. Virtual Memory: concept. Logical Memory Space (review) Demand Paging(1) Virtual Memory Background EECS. Operatng System Fundamentals No. Vrtual Memory Prof. Hu Jang Department of Electrcal Engneerng and Computer Scence, York Unversty Memory-management methods normally requres the entre process

More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

Motivation. EE 457 Unit 4. Throughput vs. Latency. Performance Depends on View Point?! Computer System Performance. An individual user wants to:

Motivation. EE 457 Unit 4. Throughput vs. Latency. Performance Depends on View Point?! Computer System Performance. An individual user wants to: 4.1 4.2 Motvaton EE 457 Unt 4 Computer System Performance An ndvdual user wants to: Mnmze sngle program executon tme A datacenter owner wants to: Maxmze number of Mnmze ( ) http://e-tellgentnternetmarketng.com/webste/frustrated-computer-user-2/

More information

Smoothing Spline ANOVA for variable screening

Smoothing Spline ANOVA for variable screening Smoothng Splne ANOVA for varable screenng a useful tool for metamodels tranng and mult-objectve optmzaton L. Rcco, E. Rgon, A. Turco Outlne RSM Introducton Possble couplng Test case MOO MOO wth Game Theory

More information

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints Australan Journal of Basc and Appled Scences, 2(4): 1204-1208, 2008 ISSN 1991-8178 Sum of Lnear and Fractonal Multobjectve Programmng Problem under Fuzzy Rules Constrants 1 2 Sanjay Jan and Kalash Lachhwan

More information

Concurrent Apriori Data Mining Algorithms

Concurrent Apriori Data Mining Algorithms Concurrent Apror Data Mnng Algorthms Vassl Halatchev Department of Electrcal Engneerng and Computer Scence York Unversty, Toronto October 8, 2015 Outlne Why t s mportant Introducton to Assocaton Rule Mnng

More information

Lecture 5: Multilayer Perceptrons

Lecture 5: Multilayer Perceptrons Lecture 5: Multlayer Perceptrons Roger Grosse 1 Introducton So far, we ve only talked about lnear models: lnear regresson and lnear bnary classfers. We noted that there are functons that can t be represented

More information

TN348: Openlab Module - Colocalization

TN348: Openlab Module - Colocalization TN348: Openlab Module - Colocalzaton Topc The Colocalzaton module provdes the faclty to vsualze and quantfy colocalzaton between pars of mages. The Colocalzaton wndow contans a prevew of the two mages

More information

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Learning the Kernel Parameters in Kernel Minimum Distance Classifier Learnng the Kernel Parameters n Kernel Mnmum Dstance Classfer Daoqang Zhang 1,, Songcan Chen and Zh-Hua Zhou 1* 1 Natonal Laboratory for Novel Software Technology Nanjng Unversty, Nanjng 193, Chna Department

More information

Evaluation of Parallel Processing Systems through Queuing Model

Evaluation of Parallel Processing Systems through Queuing Model ISSN 2278-309 Vkas Shnde, Internatonal Journal of Advanced Volume Trends 4, n Computer No.2, March Scence - and Aprl Engneerng, 205 4(2), March - Aprl 205, 36-43 Internatonal Journal of Advanced Trends

More information

Vectorization in the Polyhedral Model

Vectorization in the Polyhedral Model Vectorzaton n the Polyhedral Model Lous-Noël Pouchet pouchet@cse.oho-state.edu Dept. of Computer Scence and Engneerng, the Oho State Unversty October 200 888. Introducton: Overvew Vectorzaton: Detecton

More information

Topology Design using LS-TaSC Version 2 and LS-DYNA

Topology Design using LS-TaSC Version 2 and LS-DYNA Topology Desgn usng LS-TaSC Verson 2 and LS-DYNA Wllem Roux Lvermore Software Technology Corporaton, Lvermore, CA, USA Abstract Ths paper gves an overvew of LS-TaSC verson 2, a topology optmzaton tool

More information

Hierarchical clustering for gene expression data analysis

Hierarchical clustering for gene expression data analysis Herarchcal clusterng for gene expresson data analyss Gorgo Valentn e-mal: valentn@ds.unm.t Clusterng of Mcroarray Data. Clusterng of gene expresson profles (rows) => dscovery of co-regulated and functonally

More information

A New Approach For the Ranking of Fuzzy Sets With Different Heights

A New Approach For the Ranking of Fuzzy Sets With Different Heights New pproach For the ankng of Fuzzy Sets Wth Dfferent Heghts Pushpnder Sngh School of Mathematcs Computer pplcatons Thapar Unversty, Patala-7 00 Inda pushpndersnl@gmalcom STCT ankng of fuzzy sets plays

More information

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters Proper Choce of Data Used for the Estmaton of Datum Transformaton Parameters Hakan S. KUTOGLU, Turkey Key words: Coordnate systems; transformaton; estmaton, relablty. SUMMARY Advances n technologes and

More information

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices Internatonal Mathematcal Forum, Vol 7, 2012, no 52, 2549-2554 An Applcaton of the Dulmage-Mendelsohn Decomposton to Sparse Null Space Bases of Full Row Rank Matrces Mostafa Khorramzadeh Department of Mathematcal

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

Assembler. Shimon Schocken. Spring Elements of Computing Systems 1 Assembler (Ch. 6) Compiler. abstract interface.

Assembler. Shimon Schocken. Spring Elements of Computing Systems 1 Assembler (Ch. 6) Compiler. abstract interface. IDC Herzlya Shmon Schocken Assembler Shmon Schocken Sprng 2005 Elements of Computng Systems 1 Assembler (Ch. 6) Where we are at: Human Thought Abstract desgn Chapters 9, 12 abstract nterface H.L. Language

More information

Reducing Frame Rate for Object Tracking

Reducing Frame Rate for Object Tracking Reducng Frame Rate for Object Trackng Pavel Korshunov 1 and We Tsang Oo 2 1 Natonal Unversty of Sngapore, Sngapore 11977, pavelkor@comp.nus.edu.sg 2 Natonal Unversty of Sngapore, Sngapore 11977, oowt@comp.nus.edu.sg

More information

VRT012 User s guide V0.1. Address: Žirmūnų g. 27, Vilnius LT-09105, Phone: (370-5) , Fax: (370-5) ,

VRT012 User s guide V0.1. Address: Žirmūnų g. 27, Vilnius LT-09105, Phone: (370-5) , Fax: (370-5) , VRT012 User s gude V0.1 Thank you for purchasng our product. We hope ths user-frendly devce wll be helpful n realsng your deas and brngng comfort to your lfe. Please take few mnutes to read ths manual

More information

Lecture 15: Memory Hierarchy Optimizations. I. Caches: A Quick Review II. Iteration Space & Loop Transformations III.

Lecture 15: Memory Hierarchy Optimizations. I. Caches: A Quick Review II. Iteration Space & Loop Transformations III. Lecture 15: Memory Herarchy Optmzatons I. Caches: A Quck Revew II. Iteraton Space & Loop Transformatons III. Types of Reuse ALSU 7.4.2-7.4.3, 11.2-11.5.1 15-745: Memory Herarchy Optmzatons Phllp B. Gbbons

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

Solving two-person zero-sum game by Matlab

Solving two-person zero-sum game by Matlab Appled Mechancs and Materals Onlne: 2011-02-02 ISSN: 1662-7482, Vols. 50-51, pp 262-265 do:10.4028/www.scentfc.net/amm.50-51.262 2011 Trans Tech Publcatons, Swtzerland Solvng two-person zero-sum game by

More information

LLVM passes and Intro to Loop Transformation Frameworks

LLVM passes and Intro to Loop Transformation Frameworks LLVM passes and Intro to Loop Transformaton Frameworks Announcements Ths class s recorded and wll be n D2L panapto. No quz Monday after sprng break. Wll be dong md-semester class feedback. Today LLVM passes

More information
