Delivering Acceleration: The Potential for Increased HPC Application Performance Using Reconfigurable Logic

David Caliga
SRC Computers, Inc.
0 N. Nevada Ave
Colorado Springs, CO
david.caliga@srccomp.com

David Peter Barker
SUPERsmith
PO Box, Salinas, CA
dbarker@supersmith.com

ABSTRACT
SRC Computers, Inc. has integrated adaptive computing into its SRC-6 high-end server, incorporating reconfigurable processors as peers to the microprocessors. Performance improvements resulting from reconfigurable computing can provide orders of magnitude speedups for a wide variety of algorithms. Reconfigurable logic in Field Programmable Gate Arrays (FPGAs) has shown great advantage to date in special purpose applications and specialty hardware. SRC Computers is working to bring this technology into the general purpose HPC world via an advanced system interconnect and enhanced compiler technology.

Categories and Subject Descriptors
Compiler technology

General Terms
Algorithms, Performance, Design, Experimentation

Keywords
Reconfigurable computing, FPGA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SC2001 November 2001, Denver (c) 2001 ACM 1-58113-293-X/01/0011 $5.00

1. INTRODUCTION
The SRC-6 system is a unique architecture capable of supporting a combination of up to Intel microprocessors and Multi-Adaptive Processors (MAP) on a common shared memory. A patented SRC crossbar switch design supports the connection of these processors with up to memory ports with each port containing banks of SRAM (see Figure 1). The architectural design of the memory subsystem provides flat access latency from any microprocessor or reconfigurable processor in the SRC-6 system. This combination will offer in excess of TFlops of theoretical peak [ Pentium processors and 8 MAPs on b floating-point data]. SRC Computers is pushing forward to harness this into sustained performance. This paper explains how SRC Computers, Inc. has made advances in the reconfigurable computing field by incorporating FPGA technology at many levels in the SRC-6 system architecture. Also described is SRC's patented FPGA-based MAP that user applications can utilize to deliver algorithm-specific computational acceleration.

2. RECONFIGURABLE COMPUTING DEFINED
In simplest terms, reconfigurable computing, based on FPGA technology, could be defined as the capability of reprogramming hardware to execute logic that is designed and optimized for a specific user's algorithms. Associated compiling technology can provide a transparent method of integrating the computational capability of FPGA technology and microprocessors into a single application executable code. The use of such integrated compiling technologies enables reconfigurable architectures to extend beyond the FPGAs and the glueware that attaches them to the host computer. Automatic compilation of applications onto reconfigurable architectures generates the logic for both the specific hardware configuration and the execution management of the FPGA resources. The compilation environment must also be extensible to a wide range of user applications on a single system. SRC Computers has taken into account all of these factors in developing its unique reconfigurable computing capability for the SRC-6 system.

Figure 1. Overview of SRC-6 System Architecture

3. MULTI-ADAPTIVE PROCESSOR (MAP)
The heart of the SRC-6 system is its Multi-Adaptive Processing feature provided by MAP units (see Figure 2). These units utilize hardware-implemented functions, which can greatly accelerate application algorithms over compiler implemented instruction sets for microprocessors. Major architectural characteristics of MAP include:

- Each MAP executes independently of the general-purpose processors, including loading and storing needed data, after being provided with a list of commands to execute.
- The control of the MAP is done by the application via a Command List (COMLIST). The COMLIST contains a list of controlling instructions for the Control Chip. Examples of functions performed by these instructions are Direct Memory Access (DMA) reads and writes and the execution synchronization with the User Array.
- MAP units have access to Common Memory (CM). Common Memory addresses specified by commands or generated by user applications are virtual addresses. Addresses are translated, virtual to physical, by Translation Look-aside Buffer (TLB) entries within the DMA logic of each MAP.
- The User Array portion of a MAP unit is configured for algorithmic requirements. This logic can read and write on-board memory through multiple ports and can interact with the control logic and DMA Engine.
- By means of chain ports, MAP units can communicate between themselves without using any memory bandwidth. Thus a particular MAP can send partial results to another unit, or similarly, can receive such partial results from another MAP.
- Full system interrupt and semaphore capability is available between MAP units or the microprocessors.
- Multi-processor applications can easily utilize the MAP by identifying the relevant portions of the parallel computational algorithms. The application can utilize N microprocessors and M MAPs.

4. RECONFIGURATION LOGIC PERFORMANCE POTENTIAL
The attraction of using FPGAs in SRC's MAP is the ability to generate algorithm specific logic, which has the potential of orders of magnitude speedups for computationally intensive algorithms. There have been many demonstrations showing this magnitude of speedup in algorithms for genetic sequencing, encryption/de-encryption, string searches and integer forms of image and signal processing. The performance improvement of these algorithms can come from any or all of the following:

- Memory bandwidth improvements: Six ports to memory
- Data flow parallelism: Bit-sized data allows for multiple parallel processing streams per bits of data read from memory or algorithms that may need multiple or -bit input values
- Computational block level re-scheduling: Re-schedule independent computations to be concurrent in time
- Instruction Set Architecture (ISA) effectiveness: Create operations that are right-sized in bits relative to the type of data, i.e. -bit or -bit integer operations

The following examples show potential performance improvement contributors of an FPGA over that of a microprocessor.

Figure 2. MAP Block Diagram

Example 1: This example shows a performance comparison between an FPGA and a 900-MHz microprocessor. Perform an integer add operation on each bits of a data stream. The MAP reads in bits of data from each memory bank. The algorithm will use memory banks to read data from the input stream and the logic will create parallel computation streams.

Feature | Speedup over Microprocessor | Derivation
Clock rate | 00/900 or /9 | 00 for the FPGA and 900 for the microprocessor
Memory bandwidth | | memory banks to read data from input stream
Data flow parallelism | | parallel computation streams
Block level re-scheduling | | None
ISA effectiveness | 0 | Number of instructions on the microprocessor required to perform the equivalent integer operation on the -bit data value
Total | | /9 * * * * 0 =

Algorithm Pseudo Code Segment
Loop over -bit values in -bit data value (isrc) and add -bit value to input value. Store resulting -bit value (ires).

      len =                         //-bit value
      ipos = 9                      //start at position 9
      Do 00 j = ,
         ibgn = (j-)* +
         mvbits (isrc,ibgn,len,itemp,ipos)
         ires = idest + t_b(j)      //-bit value stored in ires
         mvbits (ires,ipos,len,idest,ibgn)
 00   Continue

[Logic flow diagram: a loop over J; address generators read each memory bank, and I_Add units add the t_b bit fields in parallel.]
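The bit-field loop of Example 1 can be sketched in software as follows. This is only an illustration under stated assumptions: the field width, word width and wrap-around behaviour are not recoverable from the text, so `field_bits`, `add_packed` and the 8-bit fields in the usage line are hypothetical choices, not the paper's values. The shift-and-mask steps play the role that the two `mvbits` calls play in the pseudo code.

```python
def add_packed(word: int, addends, field_bits: int) -> int:
    """Add one small integer to each bit field packed inside `word`.

    Assumptions (not from the paper): fields are packed from the low bits
    upward, and each sum wraps around within its own field.
    """
    mask = (1 << field_bits) - 1
    result = 0
    for j, addend in enumerate(addends):
        shift = j * field_bits
        field = (word >> shift) & mask              # extract field j
        result |= ((field + addend) & mask) << shift  # store the sum back in place
    return result


if __name__ == "__main__":
    # Four hypothetical 8-bit fields 0x04, 0x03, 0x02, 0x01, each incremented by one.
    print(hex(add_packed(0x04030201, [1, 1, 1, 1], 8)))
```

On a microprocessor each field costs several extract, add and insert instructions in sequence; in FPGA logic every field can have its own right-sized adder operating in the same clock, which is the ISA-effectiveness and data-flow-parallelism contribution the example describes.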

Example 2: The following example shows a comparison between an FPGA and the Alpha EV. Perform a -bit floating-point convolution filter on a set of vector data. The convolution filter will be points. The filter will be stored in a set of registers in the FPGA chip. Two values of the input data will be read every clock. The output computation rate will generate two output values every clock.

Operation | FPGA | DEC Alpha EV
Read input data values | values every clock | value every clock
Operations every clock | 8 Multiply-Adds | Mult and Add

Feature | Speedup over Alpha EV | Derivation
Clock rate | 00/800 or /8 | 00 for FPGA and 800 for Alpha EV
Memory bandwidth | | values read every clock
Data flow parallelism | | N/A
Block level re-scheduling | | Convolution filter is points
ISA effectiveness | / | Number of instructions on the FPGA relative to the microprocessor per clock
Total Speedup | | /8****/=

A segment of the filter code is shown here:

Filter Pseudo Code

C     Loop over output points
      do n = 1, nout
         sum = 0.
C        Loop over filter coefficients
         do j = 1, ncoef
            sum = sum + data_in(n+j-1) * filter(j)
         end do
         data_out(n) = sum
      end do

[Logic flow diagram: a loop over the filter coefficients (j = 0, j = j + 1 while j < ncoef) in which an address generator feeds registers R j and R j+1 into paired pmult and padd units.]

Example 3: This example also shows a comparison between an FPGA and the Pentium, using integer data. Integer arithmetic function units take much less space within an FPGA chip than the floating-point unit. Let's change the previous example to -bit integer data. The convolution filter will be points. The filter will be stored in a set of registers in the FPGA chip. The input/output data will each be striped across two memory banks. Four values of the input data will be read every clock. The output computation rate will generate two output values every clock.

Operation | FPGA | Pentium
Read input data values | values every clock | value every clock
Operations every clock | Multiply-Adds | Mult or Add

Feature | Speedup over Pentium | Derivation
Clock rate | 00/700 or /7 | 00 for the FPGA and 700 for the Pentium
Memory bandwidth | | values read every clock
Data flow parallelism | | N/A
Block level re-scheduling | | Convolution filter is points
ISA effectiveness | | Number of instructions on the FPGA relative to the microprocessor per clock
Total Speedup | 0. | /7****=0.

4.1 Amdahl's Law
A generally used measure of performance or speedup of an algorithm for various architectures is Amdahl's Law. It has been used for vector and parallel systems to show the benefit of optimizing portions of code. Amdahl's Law points out that, given the percent of time spent in a portion of code, the overall application may get only a marginal overall speedup even though the algorithm was made to execute much faster. We can apply the same principle to algorithms moved into MAP. Let's examine the previous examples and look at the application speedup given various percentages of time spent in the algorithm. These examples have shown application speeds for several types of integer and floating-point problems. The benefits of using FPGAs are obviously for algorithms that have high performance speedups and high percentages of computation time.

[Chart: Application Speedup versus Percent of time spent in algorithm, one curve per MAP Algorithm Speedup of the three examples.]
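The relationship described by Amdahl's Law can be written as a short calculation. The sketch below is illustrative only: the function name `application_speedup` and the sample fractions and speedups are assumptions chosen for the demonstration, not values taken from the examples above.

```python
def application_speedup(fraction_in_algorithm: float, algorithm_speedup: float) -> float:
    """Amdahl's Law: overall speedup when only part of the run time is accelerated.

    fraction_in_algorithm: share of the original run time spent in the
        accelerated algorithm, between 0 and 1.
    algorithm_speedup: how many times faster that algorithm runs on the MAP.
    """
    if not 0.0 <= fraction_in_algorithm <= 1.0:
        raise ValueError("fraction_in_algorithm must be between 0 and 1")
    if algorithm_speedup <= 0.0:
        raise ValueError("algorithm_speedup must be positive")
    # Time left over: the untouched part plus the shortened accelerated part.
    remaining_time = (1.0 - fraction_in_algorithm) + fraction_in_algorithm / algorithm_speedup
    return 1.0 / remaining_time


if __name__ == "__main__":
    # Hypothetical sample points, chosen only to show the shape of the curve.
    for fraction in (0.50, 0.90, 0.99):
        for speedup in (10.0, 100.0):
            print(f"{fraction:4.0%} of time, {speedup:5.0f}x algorithm -> "
                  f"{application_speedup(fraction, speedup):6.2f}x application")
```

Even an unbounded algorithm speedup cannot push the application speedup past 1 / (1 - fraction), which is why the text stresses both a high algorithm speedup and a high percentage of time spent in the algorithm.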

5. PROGRAMMING FLOW VS. DATA FLOW
Algorithms that will be moved into the MAP should be thought of as data flow problems from the hardware logic perspective. The following example will look at the CAXPY algorithm [def: A*X(j) + Y(j) = Z(j)]. The traditional way for a programmer to think of CAXPY is as a loop over the number of elements in arrays X and Y. However, in hardware, it is thought of as data flow through logic. The logic will read values for X and Y every clock and send the values through the logic definition to the set of pmults and padds and create a value for Z every clock. The compiler, as part of its analysis of code segments, creates a data flow graph and performs dependency analysis. This information can then be used to create an algorithm data flow that will be put into hardware logic for the FPGAs.

[Diagram: Algorithm Programming Flow beside Algorithm Data Flow. The programming flow is a loop over J (while J < N) that reads X(j) and Y(j) from memory and writes Z(j). The data flow uses address generators (address0, stride, count, end_of_data_flag) to stream X r(j), X i(j), Y r(j) and Y i(j) from memory banks, together with A r and A i, through four pmult units and padd units (one with a negated input), producing Z r(j) and Z i(j), which an address generator writes to a memory bank.]
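The data flow view of CAXPY can be sketched as a function that is applied to one (X, Y) pair per step. This is a minimal software model under stated assumptions: the names `caxpy_element` and `caxpy_stream` are hypothetical, and the grouping into four real multiplies and four real adds follows the diagram above rather than any confirmed SRC implementation.

```python
def caxpy_element(a, x, y):
    """Compute one complex Z = A*X + Y from real and imaginary parts.

    Each argument is a (real, imaginary) pair. The arithmetic uses four real
    multiplies and four real adds, one of them with a negated input.
    """
    ar, ai = a
    xr, xi = x
    yr, yi = y
    zr = (ar * xr - ai * xi) + yr   # two multiplies, negated add, then add Y r
    zi = (ar * xi + ai * xr) + yi   # two multiplies, add, then add Y i
    return zr, zi


def caxpy_stream(a, xs, ys):
    """Model the data flow: one X value and one Y value consumed per step."""
    return [caxpy_element(a, x, y) for x, y in zip(xs, ys)]


if __name__ == "__main__":
    a = (2.0, 3.0)
    xs = [(1.0, -1.0), (0.0, 1.0)]
    ys = [(0.5, 4.0), (1.0, 1.0)]
    print(caxpy_stream(a, xs, ys))
```

In software the list comprehension still runs as a loop. In hardware the body of `caxpy_element` becomes fixed logic, and a new pair enters it every clock while earlier pairs are still in flight.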

The arithmetic function units instantiated in the FPGAs have a pipeline design. This means that a new data value can be input into the function unit every clock. There is a latency associated with each function unit before a final result is produced from the input data value. After the first data value comes out, subsequent output values will come out every clock. Assume that the latency for a pmult and a padd are both 0 clocks. Figure 3 shows how data flows through a pipelined function unit. In addition, the figure shows a function unit that has multiple phases. Floating-point units often have pre- and postprocessing for format conversion, i.e., conversion from IEEE Floating-point representation into an internal representation.

Going back to the CAXPY example, Table 1 shows how many clocks it will take for data samples to pass through stages of the hardware logic. The processing time of hardware logic is similar to that of very long vectors on vector processor systems. The time in clocks to process a set of data, of length Nelems, through CAXPY would be:

Time (clocks) = Nelems + Latency = Nelems + clocks

[Figure 3 panels: input data entering a function unit with three phases ( clk, 8 clks, clk) and output data leaving it, drawn at successive elapsed clocks.]

Figure 3. Data Flow through Pipelined Function Unit
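The timing rule Time = Nelems + Latency can be checked with a small clock-by-clock model of a pipelined function unit. The sketch below is an illustration under assumptions: the helper names and the latency of 10 clocks used in the usage line are hypothetical, chosen only for the demonstration.

```python
from collections import deque

_BUBBLE = object()  # marks a pipeline stage that holds no data


def pipeline_clocks(nelems: int, latency: int) -> int:
    """Closed form from the text: Time (clocks) = Nelems + Latency."""
    return nelems + latency


def simulate_pipeline(values, latency, func):
    """Accept one input per clock; each result emerges `latency` clocks later.

    Returns the outputs in order and the clock on which the last one emerged.
    """
    if latency < 1:
        raise ValueError("latency must be at least one clock")
    values = list(values)
    stages = deque([_BUBBLE] * latency)   # one slot per clock of latency
    pending = iter(values)
    outputs = []
    clocks = 0
    while len(outputs) < len(values):
        clocks += 1
        leaving = stages.pop()            # value completing the last stage
        if leaving is not _BUBBLE:
            outputs.append(leaving)
        entering = next(pending, _BUBBLE)  # new input, or a bubble when drained
        stages.appendleft(func(entering) if entering is not _BUBBLE else _BUBBLE)
    return outputs, clocks


if __name__ == "__main__":
    outputs, clocks = simulate_pipeline(range(1000), 10, lambda v: 2 * v)
    print(clocks, pipeline_clocks(1000, 10))
```

For long streams the latency term becomes negligible next to Nelems, which is the sense in which the text compares the behaviour to very long vectors on a vector processor.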

Table 1. Data Sample Processing Latency
[Table residue: data flow logic stages (address generators with address0, stride, count and end_of_data_flag; memory banks; pmult and padd units) listed against elapsed clock.]

6. DESIGNING ALGORITHMS FOR MAP
One of the major issues that have hindered the use of FPGAs to date for general-purpose scientific algorithms has been the lack of an ability to use floating-point arithmetic. This impediment has come from two related factors. The first has been the actual size, or gate count, of FPGAs. Until recently, the size of FPGAs has been under M gates. The advent of multi-million gate FPGAs has dramatically changed the type of logic that can be loaded into a single FPGA chip. The second limitation has been the size of floating-point (- or -bit) functional units for FPGAs. The number of gates required to define IEEE-754 compliant function units is much greater than that for integer function units. Table 2 shows the approximate number of function units that can be defined in an SRC MAP III.

Another consideration is the amount of the algorithm logic that will fit in a single FPGA chip. The MAP environment has the capability to extend the algorithm across multiple FPGA chips and multiple MAPs. The MAP User Array interconnect provides the ability to transfer 24 bytes per clock. A transfer can be configured dependent upon the data type being transferred. A transfer could send three 8-byte elements, six 4-byte elements, twenty-four 1-byte elements, etc. The transfer between MAPs via the Chain Ports can send bytes per clock. This MAP interconnect does not use any switch bandwidth.

Table 2. Function Count for MAP III

Function Unit Type | Function | Function Count
Integer, b | Multiply | 0
Integer, b | Add | 000
Integer, b | Multiply |
Integer, b | Add | 70
Floating Point, b | Multiply | 8
Floating Point, b | Add | 08
Floating Point, b | Multiply |
Floating Point, b | Add | 8

7. SRC COMPILER TECHNOLOGY AND TOOLS
SRC is extending a well-known vendor's FORTRAN and C compilers to target compilation of computationally intensive portions of applications for FPGAs. Part of the compilation process is the generation of a data flow graph (DFG) and the dependency analysis of the application code. The DFG and dependency analysis are used to define the layout of programmable logic in the FPGAs and data layout in on-board memory. Two execution-time components will be generated. The first is the hardware logic defined from the data and control flow analysis. The second is the definition of a Command List, or COMLIST, for the execution control of the User Array environment and issuance of direct memory access (DMA) instructions.

Current hardware design can be accomplished through the use of Hardware Definition Languages (HDL). An HDL is a high level language similar in concept to a software high level language such as C. The process of converting the HDL into a hardware design consisting of wires, transistors, etc. is called synthesis. Synthesis is somewhat analogous to normal software compilation. There are a great number of tools and products available that work with various HDLs. The compilation strategy for the MAP is to leverage the ready availability of synthesis tools for HDLs (see Figure 4). The basic premise is that compilation for the MAP shall consist of translating the software language of the user (i.e. C or FORTRAN) into an HDL representation. This HDL representation can readily be processed by existing hardware design toolsets into the bitstream used to configure the User Logic (U_Logic).

The chosen HDL target language is VERILOG. The VERILOG language allows representation of hardware design at various levels of abstraction. At the lowest level, it is possible to write VERILOG code that represents individual gates and wires. At the opposite extreme, one can write so-called behavioral code, which has a high degree of abstraction. In fact, it is possible to write behavioral code that cannot be synthesized into hardware. Such code is useful for the purposes of simulation. It is not the intent of the compiling strategy that the synthesis of VERILOG shall include the synthesis of floating-point operations etc.

One of the primary constraints for the compiler is compile time. The user of the compiler will have expectations that the compilation should be done quickly. This presents a significant challenge in the place and route process. Since the compiler will be using existing commercial tools, the compiling system must function in such a way as to make the job of these tools as easy as possible. To facilitate the synthesis process, the compiler-generated VERILOG code shall consist solely of instantiation and hookup of an existing set of pre-defined VERILOG modules. A VERILOG module is similar in concept to a software subprogram, or subroutine. Importantly, experienced hardware designers at a very low design level can create these pre-defined modules, in a pre-placed manner. This makes them efficient in terms of speed and resource usage, and, since they are relationally placed, the place and route portion of the final synthesis can proceed faster. The number and type of pre-defined modules is relatively small. Just as normal software compilation builds the behavior of the user's program from a small set of basic op-codes, so too can the U_Logic configuration be built up from a finite set of pre-defined modules, such as integer/floating-point add, multiply, divide, etc.

The development of domain-specific modules (e.g. Convolve, FFT) that can be used as building blocks in the design of FPGA-based programs is a necessity. These modules provide a bridge between the algorithm developer and the hardware logic designer. The module will have a counterpart in an FPGA macro library. These macros will be incorporated into the DFG and used in the hardware definition language (HDL). Examples of potential libraries for the generation of macros are signal and image processing and linear algebra routines. In addition, application or user-specific modules can be defined and added to an FPGA macro library set.

The ability of the user to optionally manipulate the compiler-generated logic is extremely important. This manipulation step must be available for those customers that want to get the last drop of algorithmic performance out of the hardware. The DFG generated by the high-level language (HLL) compiler will be an optional output. This graph will be modifiable in a DFG Editor to provide potentially higher levels of optimization.

Figure 4. Diagram of the SRC Compiler Process

The on-board memory of the MAP will be used like a software-controlled cache for a more conventional microprocessor. An added bonus is that this cache can have different user-defined caching strategies. The data access and compute strategies will maximize temporal locality with respect to the use of resident, on-board data and will overlap computation on this data with streaming in the working set of data for the next phase of the computation. We will generate both the Common Memory DMA instructions and the on-board data layout from the data-flow/dependence analysis. Furthermore, the provision for rapid prototyping of the algorithm in MAP is a desirable tool for the optimization process. We are investigating potential tools for DFG translation for input to commercial packages; e.g., MATLAB/Simulink could provide easy access to prototyping and optimizations that a wide variety of people use today.

The available resources of the U_Logic FPGA chips in the MAP are a finite resource. It will be a simple matter to compile user code into a final U_Logic configuration that exceeds the physical resources available in the chips. There will obviously have to be some feedback from various portions of the synthesis tools to the compilation system. The compilation system will also have to apply various heuristics in regard to precisely what portions of the user code should be targeted for the MAP in the first place, and what sort of transformations might be applied to the user code to allow for optimal performance. The feedback from the synthesis tools and the compilation heuristics can only be determined from direct experience. Initial versions of the compiler will necessarily omit this functionality. Adding this functionality based on the learned experience represents a significant portion of the overall effort required to create a mature compiler.

8. ALGORITHM STUDIES
The benefit of using reconfigurable computing has been shown in many domain areas[][][].
Several algorithms will be reviewed to show the performance improvements over RISC and microprocessor-based systems. The definition of the hardware logic for these algorithms started with DFGs from our HLL compiler. The MAP compiler generates an equivalent DFG that it uses for the generation of HDL. Algorithm performance on MAP is from hardware simulation of the logic. In addition, optimization techniques and experiences will be discussed relative to compiler generated HDL. The SRC-developed compiler is in a prototype stage of HDL generation. The compiler has demonstrated the ability to import pre-defined function unit modules and generate logic with correct results. Development on the compiler is proceeding in a directed, methodical manner. It does not yet fully compile the discussed cases. It can be expected that compilation of the following cases will be achieved by the time of the presentation of this paper at the SC2001 conference.

8.1 Case 1: Convolution Zero Phase Filter
This algorithm has been discussed briefly earlier in the paper. Let's examine the b floating-point convolution problem with a -point zero phase filter. The critical challenge for the compiler is to know how to take a simple convolution code and define a method for the scheduling of operations to take advantage of the opportunity to use a large number of multiply/adds relative to a traditional processor implementation. The MAP can read up to 8b of data from on-board memory every clock. The existing compiler's HDL implementation would utilize only a single input data element every clock. The performance of this level of optimization would be Flops/clock. The MAP logic can consume the two b input data values every clock. If the logic can process only one element every clock, then the logic has to stall for a clock in order to consume the second input data value. Therefore, the processing rate can only create an output data element every other clock.

In order to optimize performance relative to the reading of the input data, hardware logic was developed that loaded the data appropriately into two sets of shift registers for the even and odd data elements. Two processing streams were defined to process the even and odd input data elements. Figure 5 shows how the input data is loaded into the shift registers for the even/odd input data values. The shift registers are loaded up from the read of the first thirty-two b data values from memory.

[Figure 5 diagram: input data split into odd (o) and even (e) elements, filling the Shift Register for Odds and the Shift Register for Evens in the fill direction.]

Figure 5. Data Streams Going into Shift Registers

After the shift registers are full, the multiplication of the data values with the filter coefficients can start. The process is pipelined so that the next input values will go into the shift registers. The generation of the output points is shown in Figure 6 as the contribution of the two shift registers. This strategy is similar to the loop unrolling that the compiler can often generate. We are in the process of developing heuristics for the compiler so that it can automatically generate this level of optimization strategy. The computational performance on MAP for this convolution algorithm will be flops/clock. This compares to flops/clock on the Alpha EV.

[Figure 6 diagram: per clock, the contents of the shift register of odd numbered values and the shift register of even numbered values, and the output point each produces.]

Figure 6. Processing Streams for Data
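The two-stream strategy can be modelled in software as an unroll-by-two of the output loop, with one stream producing the even-numbered output points and the other the odd-numbered ones. The sketch below is an illustration only: the function names are hypothetical, and it models the scheduling idea rather than the shift-register hardware itself.

```python
def convolve_reference(data, coeffs):
    """Plain convolution filter: one output point per step."""
    nout = len(data) - len(coeffs) + 1
    return [sum(data[n + j] * coeffs[j] for j in range(len(coeffs)))
            for n in range(nout)]


def convolve_two_streams(data, coeffs):
    """Two output points per step, one from each processing stream.

    Each pass of the outer loop stands for one clock in which two new input
    values have arrived, so the even and odd streams each finish one output.
    """
    ncoef = len(coeffs)
    nout = len(data) - ncoef + 1
    out = [0] * nout
    for n in range(0, nout, 2):
        # Even stream: window starting at an even-numbered input.
        out[n] = sum(data[n + j] * coeffs[j] for j in range(ncoef))
        # Odd stream: window starting at the following odd-numbered input.
        if n + 1 < nout:
            out[n + 1] = sum(data[n + 1 + j] * coeffs[j] for j in range(ncoef))
    return out


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7]
    coeffs = [1, 0, -1]
    print(convolve_reference(data, coeffs))
    print(convolve_two_streams(data, coeffs))
```

Both functions perform the same multiply-adds. The difference is only in scheduling: the second form exposes two independent sums per step, which hardware can evaluate concurrently so that no input value has to wait for a stalled clock.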

8.2 Case 2: Routine SCORE from MAXSEGS
MAXSEGS is a Smith-Waterman implementation of a genetic sequence alignment algorithm. The focal point of this analysis has been on the routine SCORE. The routine, developed by Alex Ropelewski[] of Pittsburgh Supercomputing Center, is a fully vectorized version of the Waterman-Eggert dynamic programming algorithm. The computationally intensive portion of SCORE is shown in the following code segment.

      do 00 k=,ls+ls-,
         i=min(k,ls)
         j=max(,(k+-ls))
         do 0 index = start(k),end(k),ls
            left = index-(ls+)
            diag = (index-(ls+))-
            up = index-
            extgvi(i)= max(0,extgvi(i)+gap, simils(left)+gap+newgap)
            extgvj(j)= max(0,extgvj(j)+gap, simils(up)+gap+newgap)
            simils(index)=max(simils(diag)+ wt(s(j),s(i)), extgvi(i),extgvj(j),0)
            i=i-
            j=j+
 0       continue
 00   continue

The goal for any algorithm in the MAP is to create a data flow that will maximize the use of input data and computational results within a single pass of the logic flow. The challenge with this code is to create a data flow that would maximize the use of -bit data coming from the input arrays EXTGVI, EXTGVJ and SIMILS. As explained in Ropelewski et al., the key lies in the diagonal access patterns of the algorithm. Arrays are accessed across diagonals of the matrix, as in a modified version of the Ropelewski schematic representation[], and values are computed in the following order: (,) then (,), (,) then (,), (,), (,) and continuing on until the value in (,) is computed. Values of start and end shown in the code segment are the starting and ending diagonal values (,) and (,), respectively.

The nested loops are unrolled by a factor of two to utilize the two -bit values read from on-board memory in the MAP. A macro can be made for the inner loop computation of EXTGVI, EXTGVJ and SIMILS. The logic flow for the computational macro is shown in Figure 7. The hardware implementation of SCORE is pipelined (see Figure 8). The execution of the SIMILS macro can take a new value every clock and generate the values for SIMILS, EXTGVI and EXTGVJ every clock. The latency for a value in the macro is three clocks.

The layout of the data in on-board memory can dramatically affect the performance of the logic in MAP. The algorithm logic needs to read and write data to SIMILS with each computation of the inner loop. The first optimization approach was to replicate the data in SIMILS into four memory banks. This allows for reads and writes of four values with each read or write operation every clock. The approach is very effective in getting the necessary values for the computation. Because the overall algorithm logic is pipelined, there is a conflict of reading and writing SIMILS values to the memory banks at the same clock. In order to sustain the maximum processing rate, we had to alter the logic so that the read and write operations would occur at alternating clock cycles. The logic will process four updated values of SIMILS every two clocks.

A second approach for memory allocation for SIMILS was to use the Block RAM available within the FPGAs of MAP. There is .7 Mb available in 8b units. The logic used the feature that the RAM can be defined as dual-ported. This feature is exactly what we needed for the reading and writing of the SIMILS values. The computation can now generate four updated values of SIMILS every clock.

The performance of the original code on a microprocessor takes clocks to process a single updated value for SIMILS. The MAP has a performance improvement of over a 700-MHz Pentium microprocessor.
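The diagonal access pattern can be illustrated with a short traversal that visits a score matrix one anti-diagonal at a time. This is a sketch under assumptions: the function name is hypothetical, and the starting indices and unit steps (i counting down from min(k, nrows), j counting up from max(1, k + 1 - nrows)) are a plausible reading of the garbled loop bounds rather than values confirmed by the source.

```python
def antidiagonal_order(nrows: int, ncols: int):
    """Yield the (i, j) cells of an nrows x ncols matrix, 1-based,
    grouped into one list per anti-diagonal."""
    for k in range(1, nrows + ncols):
        i = min(k, nrows)
        j = max(1, k + 1 - nrows)
        diagonal = []
        while i >= 1 and j <= ncols:
            diagonal.append((i, j))
            i -= 1   # move up one row
            j += 1   # move right one column
        yield diagonal


if __name__ == "__main__":
    for cells in antidiagonal_order(3, 4):
        print(cells)
```

Every cell's left, up and diagonal neighbours lie on an earlier anti-diagonal, so all cells within one anti-diagonal are independent of each other. That independence is what allows the inner loop to be vectorized on a conventional machine and pipelined or unrolled in MAP logic.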

[Figure 7 diagram: the compute macro for pass = i, j takes sim_diag, wt, extgvi, extgvj, sim_left, sim_up, gap and newgap, applies "Mux: Max of" selections, and produces sim_index, extgvi and extgvj.]

Figure 7. Compute Macro Logic

[Figure 8 diagram: loops over i and j read the extgvi and extgvj values and the SIMILS values simils(left), simils(up) and simils(diag) from Block RAM, generate the SIMILS addresses, run the compute macro for four passes per iteration, and write the updated simils(index), extgvj and extgvi values back to Block RAM until j reaches jend and i reaches iend.]

Figure 8. Data Flow Representation of SCORE

8.3 Case 3: Routine P7Viterbi from application HMMER
HMMER is a popular application for performing protein sequence analysis. The application profiles hidden Markov models (HMMs). There are several companies that have developed specialty ASICs that perform key computational algorithms in HMMER. We profiled several executions of HMMER (hmmcalibrate) and they pointed out that over 99.% of the time was spent in the routine P7Viterbi. This case will focus on the optimization steps taken to move P7Viterbi into the MAP. Figure 9 shows the sequence of tests made to determine the score for the alignment matching process. [] The letters in the diagram correspond to the various states in the algorithm definition that follows, shown in Figure 10.

[Figure 9 diagram: the algorithm states, labeled by letter, and the comparisons between them.]

Figure 9. Diagram of the Algorithm State Comparisons

Core of Algorithm Code

   for (i = 1; i <= L; i++) {
      mmx[i][0] = imx[i][0] = dmx[i][0] = -INFTY;
      for (k = 1; k <= M; k++) {
         /* match state */
         mmx[i][k] = -INFTY;
         if ((sc = mmx[i-1][k-1] + tsc[k-1][tmm]) > mmx[i][k]) mmx[i][k] = sc;
         if ((sc = imx[i-1][k-1] + tsc[k-1][tim]) > mmx[i][k]) mmx[i][k] = sc;
         if ((sc = xmx[i-1][xmb] + bsc[k]) > mmx[i][k])        mmx[i][k] = sc;
         if ((sc = dmx[i-1][k-1] + tsc[k-1][tdm]) > mmx[i][k]) mmx[i][k] = sc;
         if (msc[idsq][k] != -INFTY) mmx[i][k] += msc[idsq][k];
         else                        mmx[i][k] = -INFTY;
         /* delete state */
         dmx[i][k] = -INFTY;
         if ((sc = mmx[i][k-1] + tsc[k-1][tmd]) > dmx[i][k]) dmx[i][k] = sc;
         if ((sc = dmx[i][k-1] + tsc[k-1][tdd]) > dmx[i][k]) dmx[i][k] = sc;
         /* insert state */
         if (k < hmm->M) {
            imx[i][k] = -INFTY;
            if ((sc = mmx[i-1][k] + tsc[k][tmi]) > imx[i][k]) imx[i][k] = sc;
            if ((sc = imx[i-1][k] + tsc[k][tii]) > imx[i][k]) imx[i][k] = sc;
            if (isc[idsq][k] != -INFTY) imx[i][k] += isc[idsq][k];
            else                        imx[i][k] = -INFTY;
         }
      }
      /* N state */
      xmx[i][xmn] = -INFTY;
      if ((sc = xmx[i-1][xmn] + xsc[xtn][loop]) > -INFTY) xmx[i][xmn] = sc;
      /* E state */
      xmx[i][xme] = -INFTY;
      for (k = 1; k <= hmm->M; k++)
         if ((sc = mmx[i][k] + esc[k]) > xmx[i][xme]) xmx[i][xme] = sc;
      /* J state */
      xmx[i][xmj] = -INFTY;
      if ((sc = xmx[i-1][xmj] + xsc[xtj][lp]) > -INFTY)    xmx[i][xmj] = sc;
      if ((sc = xmx[i][xme] + xsc[xte][lp]) > xmx[i][xmj]) xmx[i][xmj] = sc;
      /* B state */
      xmx[i][xmb] = -INFTY;
      if ((sc = xmx[i][xmn] + xsc[xtn][mv]) > -INFTY)      xmx[i][xmb] = sc;
      if ((sc = xmx[i][xmj] + xsc[xtj][mv]) > xmx[i][xmb]) xmx[i][xmb] = sc;
      /* C state */
      xmx[i][xmc] = -INFTY;
      if ((sc = xmx[i-1][xmc] + xsc[xtc][lp]) > -INFTY)    xmx[i][xmc] = sc;
      if ((sc = xmx[i][xme] + xsc[xte][mv]) > xmx[i][xmc]) xmx[i][xmc] = sc;
   }
   /* T state */
   sc = xmx[L][xmc] + xsc[xtc][mv];

[Generalized Flow Graph: for each I, a K loop reads the input values (mmx, dmx, imx, msc, isc, bsc, tsc), computes the Match, Delete and Insert States and writes mmx, dmx and imx while K < M; the I loop then computes the N, E, J, B and C States and writes xmx while I < L.]

Figure 10. HMMER Algorithm Definition

The DFG produced by our standard compiler was the starting point for the logic definition. Unfortunately, this starting point did not take advantage of the MAP's ability to shift the logic for the Insert and Delete States to be concurrent with the Match State. Therefore, we manually performed this level of optimization. These are optimizations that we see the compiler eventually being able to do.

The data type for all variables in the algorithm is b integers. The first approach to the definition of the hardware logic was to take advantage of data parallelism by reading in two b values from on-board memory every clock for the computational inputs m, im, dm, msc, bsc, isc, and tsc. The algorithm uses previously computed points in vectors mm and im. The inner loop will be unrolled by two. The logic has the potential of easily using the two values read in per clock. The dependency upon previously computed values meant that we could not pipeline the algorithm. Therefore, the performance of the algorithm is gated by the performance of the logic for the inner loop.

Upon further investigation into the options for the logic, we determined that we could take the cascaded sets of greater-than tests and put them into a single MAX and MUX construct. This allows us to reduce the latency through this portion of the logic by a factor of two. Figure shows the two sets of logic. The Delete and Insert States also use a similar form of logic. The performance of the algorithm on the MAP was 0 clocks for each pass through the inner loop and clocks for each pass through the logic at the end of the outer loop.

[Figure panels: "Original Logic Definition, 0 clock latency" (cascaded greater-than comparators and muxes) and "Optimized Logic, -clock latency" (a single MAX construct followed by a mux). Inputs in both panels: mm[i-1][k-1] + tsc[k-1][tmm], im[i-1][k-1] + tsc[k-1][tim], m[i-1][xmb] + bsc[k], dm[i-1][k-1] + tsc[k-1][tdm], msc[dsq[i]][k], and -INFTY. Output: mm[i][k].]

Figure. Logic Definition for Match State for the kth Value
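The restructuring of the cascaded greater-than tests can be illustrated in software terms. The sketch below is not the MAP logic itself: it only contrasts a chain of four dependent compare-and-select steps with a tree in which two comparisons are independent of each other and can therefore be evaluated side by side in hardware. The names `NEG_INFTY`, `match_cascaded`, and `match_tree` are hypothetical, and the four arguments stand for the already-summed candidate scores of the match state.

```c
#include <limits.h>

/* Hypothetical stand-in for the paper's -INFTY sentinel. */
#define NEG_INFTY (INT_MIN / 2)

static int max2(int a, int b) { return a > b ? a : b; }

/* Cascaded form: each test depends on the result of the previous one,
 * so the four compare-and-select steps form a serial chain. */
int match_cascaded(int from_m, int from_i, int from_b, int from_d)
{
    int best = NEG_INFTY;
    if (from_m > best) best = from_m;
    if (from_i > best) best = from_i;
    if (from_b > best) best = from_b;
    if (from_d > best) best = from_d;
    return best;
}

/* Tree form: the first two comparisons do not depend on each other,
 * so the dependency chain is shorter. The final max2 against the
 * sentinel keeps the result identical to the cascaded form. */
int match_tree(int from_m, int from_i, int from_b, int from_d)
{
    int left = max2(from_m, from_i);
    int right = max2(from_b, from_d);
    return max2(max2(left, right), NEG_INFTY);
}
```

Both functions return the same value for every input; only the depth of the dependency chain differs, which is the property the hardware construct exploits.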

Now that we have the basic logic definition of the algorithm, we need to investigate the memory access of the mm, dm, and im variables. A simple analysis shows that the backward-looking aspect of the algorithm can use values in the unrolled inner loop and the unrolled outer loop. This pattern is shown in Figure. Note that the I+1 and I+2 loops are shifted in time in order to consume values computed previously.

[Figure: timeline of the match, delete, and insert state computations on mm, im, and dm for successive loop indices (i, k), listing the values read in and written out at each step. Unavailable inputs are marked "Value = -INFTY", and the later loops are offset in time relative to the first.]

Figure. Memory Access Patterns in the Inner and Outer Loops
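The backward-looking dependencies described above can be made concrete with a small software sketch. Each (i, k) cell needs only values from row i-1 and from column k-1 of the current row, so two rows of storage are enough to reproduce the result of a full-matrix evaluation. The code below uses a simplified three-state recurrence with made-up transition penalties (`T_MM` and friends), a flat `emit` array, and a start condition of `mm[0][0] = 0`; all of these are assumptions for illustration and not the paper's hardware design or the full HMMER recurrence.

```c
#include <stdlib.h>

#define NEG_INFTY (-1000000) /* hypothetical stand-in for the paper's -INFTY */

/* Toy transition penalties; the real algorithm reads these from tsc. */
enum { T_MM = -1, T_IM = -2, T_DM = -2, T_MD = -3, T_DD = -1, T_MI = -3, T_II = -1 };

static int max3(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* One (i, k) update from its seven backward-looking inputs. */
static void cell(int m_diag, int i_diag, int d_diag, int m_left, int d_left,
                 int m_up, int i_up, int e, int *m, int *d, int *ins)
{
    int best = max3(m_diag + T_MM, i_diag + T_IM, d_diag + T_DM);
    *m = best > NEG_INFTY ? best + e : NEG_INFTY;
    *d = max3(NEG_INFTY, m_left + T_MD, d_left + T_DD);
    *ins = max3(NEG_INFTY, m_up + T_MI, i_up + T_II);
}

/* Reference version: stores every row of mm, im, and dm. */
int viterbi_full(const int *emit, int L, int M)
{
    int W = M + 1, n = (L + 1) * W, i, k, score;
    int *mm = malloc(3 * (size_t)n * sizeof *mm);
    int *im, *dm;
    if (!mm) return NEG_INFTY;
    im = mm + n;
    dm = im + n;
    for (k = 0; k < n; k++) mm[k] = im[k] = dm[k] = NEG_INFTY;
    mm[0] = 0;
    for (i = 1; i <= L; i++)
        for (k = 1; k <= M; k++)
            cell(mm[(i-1)*W + k-1], im[(i-1)*W + k-1], dm[(i-1)*W + k-1],
                 mm[i*W + k-1], dm[i*W + k-1], mm[(i-1)*W + k], im[(i-1)*W + k],
                 emit[i*W + k], &mm[i*W + k], &dm[i*W + k], &im[i*W + k]);
    score = mm[L*W + M];
    free(mm);
    return score;
}

/* Rolling version: keeps only the previous and the current row. */
int viterbi_rolling(const int *emit, int L, int M)
{
    int W = M + 1, i, k, score;
    int *buf = malloc(6 * (size_t)W * sizeof *buf);
    int *pm, *pi, *pd, *cm, *ci, *cd, *t;
    if (!buf) return NEG_INFTY;
    pm = buf; pi = pm + W; pd = pi + W;
    cm = pd + W; ci = cm + W; cd = ci + W;
    for (k = 0; k < W; k++) pm[k] = pi[k] = pd[k] = NEG_INFTY;
    pm[0] = 0;
    for (i = 1; i <= L; i++) {
        cm[0] = ci[0] = cd[0] = NEG_INFTY;
        for (k = 1; k <= M; k++)
            cell(pm[k-1], pi[k-1], pd[k-1], cm[k-1], cd[k-1], pm[k], pi[k],
                 emit[i*W + k], &cm[k], &cd[k], &ci[k]);
        /* The current row becomes the previous row for iteration i+1. */
        t = pm; pm = cm; cm = t;
        t = pi; pi = ci; ci = t;
        t = pd; pd = cd; cd = t;
    }
    score = pm[M];
    free(buf);
    return score;
}
```

In hardware the same observation lets an unrolled copy of the loop consume the previous row's values as they are produced, instead of round-tripping them through memory.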

The memory access pattern analysis shows that additional execution parallelism can be obtained through unrolling of the outer loop. The memory access pattern is similar to the inner loop problem. The algorithm uses previously computed vectors in I-1. The implementation used a set of temporary values stored in registers for the values of m[i-1], dm[i-1], and im[i-1]. Analysis of the problem showed that we need to delay loop I+1 by at least clocks. This would provide the [I][K]th value required in the compute of the I+1 loop for the Match, Delete, and Insert States. In addition, the mm, dm, and im values do not have to be stored in memory because the values are completely consumed by the logic. Not needing to write these values to memory decreases the latency of the inner loop to 9 clocks and removes any need to consider potential bank conflicts to on-board memory.

We have the ability to completely unroll the outer loop. However, for large values of L, the outer loop count, we could easily exhaust the logic building blocks (CLBs or slices) in the FPGA. The final definition of the logic and the place and route of the FPGA determined that we could unroll the outer loop by twenty. We use a common guideline of not using more than 80% of the available resources in order to get an efficient placed-and-routed algorithm.

The performance of the algorithm was compared with that of a 700-MHz Pentium microprocessor. The MAP functional units and logic were operating at 00 MHz. The average amount of time spent in the inner loop was measured as the number of clocks to do the nested loops divided by the product of the loop counts (L * M). The average was taken over 000 runs of the routine. The following table shows three data cases and the type of algorithm speedups that were achieved. Another feature of the MAP is that it has two FPGAs available for computation. The algorithm was unrolled across both FPGAs and the additional speedup is shown in Table. Given the percentage of time spent in the routine, 99.%, the application speedup achieved was over 0.
The price-performance of the MAP exceeded that of the microprocessor on this problem by a factor greater than.

Table. Speedup of Algorithm with MAP
Columns: Problem Size (L, M) | Avg. Time Spent in Inner Loop per Iteration (Clocks): Microprocessor, MAP | Parallelism (# Loops, # FPGAs) | Algorithm Speed-up
, 7. /. 0/ 0/ 8 8, 8. /.9 0 / 9 0 /, 9. /.9 0 / 0 / 7
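The measurement and the step from algorithm speedup to application speedup can be written down as two small formulas. The sketch below assumes the application-level figure follows Amdahl's law, which is a standard way to relate the two but is not stated as the authors' method. The function names are hypothetical, and the paper's measured values are not reproduced here.

```c
#include <assert.h>

/* Average clocks per inner-loop iteration: the total clocks for the
 * nested loops divided by the product of the loop counts (L * M). */
double avg_clocks_per_iteration(double total_clocks, int L, int M)
{
    assert(L > 0 && M > 0);
    return total_clocks / ((double)L * (double)M);
}

/* Amdahl's law (assumed here): overall speedup when a fraction p of the
 * original run time is accelerated by a factor s and the rest is unchanged. */
double application_speedup(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

Under this model the application speedup approaches the algorithm speedup only when nearly all of the run time is spent in the accelerated routine, which is why the fraction of time spent in the routine matters as much as the raw algorithm speedup.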

9. EXTENDING COMPILER HEURISTICS FOR FPGAS

Memory access promises to be the most challenging area for optimal expression of an algorithm on an FPGA. Fortunately, there is a rich history of processor and memory architectures that can be gleaned for various approaches. In this regard, the ability to reconfigure is the best aspect of the FPGA, since completely different schemes can be used by the compiler in different contexts. There is also a well-understood history of compiler transformations that can be applied to user code in order to take advantage of various hardware specifics. The key to optimal compiler-generated performance for the FPGA would seem to be a balanced set of applied heuristics at the memory interface and code transformation levels, all the while operating within the bounds of the FPGA resource limits. The FPGA brings an interesting slant to compilation, as constraints that are fixed in a traditional processor are now merely another set of variables.

10. CONCLUSIONS

The potential of getting orders of magnitude speedups from reconfigurable computing has been shown in a variety of application domain areas at the component or single board level. The question of applicability to a generalized set of computationally intensive scientific algorithms is currently being addressed by SRC Computers. Broad-based acceptance of reconfigurable computing will depend on the ability to demonstrate the following: a price/performance improvement over existing processors, and ease of achieving this improvement. SRC recognizes that it is critical to have compiler technology and tools available to programmers that will minimize the effort to get the desired performance from MAP.

11. SUMMARY

This paper has shown that major steps are taking place to make the first-level compiler optimizations for FPGA technology and to provide adequate performance gains. There will still be a place for handcrafted hardware logic for time-critical algorithms. However, the optimization capabilities of the compiler will evolve just as they have for vector and cache-based computing systems.
The promise of increased performance with MAP has been established; it will only improve with new generations of FPGA chips.

12. REFERENCES

[1] Buell, Duncan; Arnold, Jeffrey; and Kleinfelder, Walter. Splash: FPGAs in a Custom Computing Machine. IEEE Computer Society Press, 0 Los Vasqueros Circle, PO Box 0, Los Alamitos, CA 9070-, (99).
[2] DeHon, Andre. Comparing Computing Machines. In Configurable Computing: Technology and Applications, volume of Proceedings of SPIE. SPIE, (November 998).
[3] Eddy, Sean. HMMER User's Guide. nul/min.html. (April, 99) 8.
[4] Ropelewski, Alexander J.; Nicholas Jr., Hugh B.; and Deerfield II, David W. The Journal of Supercomputing, :, (997) 7-.
[5] Scott, Stephen D.; Seth, Sharad; and Samal, Ashok. A synthesizable VHDL coding of a genetic algorithm. Technical Report UNL-CSE , University of Nebraska-Lincoln, (997). g97sscott.
