Architectures for high-performance FPGA implementations of neural models

Size: px

Start display at page:

Download "Architectures for high-performance FPGA implementations of neural models"

Betty Reed
5 years ago
Views:

1 INSTITUTE OFPHYSICS PUBLISHING JOURNAL OFNEURALENGINEERING J. Neurl Eng. 3 (2006) 2 34 doi:0.088/ /3//003 Architectures for high-performnce FPGA implementtions of neurl models Rndll K Weinstein nd Roert H Lee,2 Lortory for Neuroengineering, School of Electricl nd Computer Engineering, Georgi Institute of Technology, Atlnt, GA 30332, USA 2 Wllce H Coulter Deprtment of Biomedicl Engineering, Emory University, Atlnt, GA 30322, USA E-mil: rlee2@emory.edu Received 7 August 2005 Accepted for puliction 2 Novemer 2005 Pulished 9 Decemer 2005 Online t stcks.iop.org/jne/3/2 Astrct As the complexity of neurl models continues to increse (lrger popultions, vried ionic conductnces, more detiled morphologies, etc) trditionl softwre-sed models hve difficulty scling to rech the performnce levels desired. This pper descries the use of FPGAs, or field progrmmle gte rrys, to esily implement wide vriety of neurl models with the performnce of custom nlogue circuits or computer clusters, the reconfigurility of softwre, nd t cost rivlling personl computers. FPGAs rech this level of performnce y enling the design of neurl models s prllel processed dt pths. These rchitectures provide for wide rnge of single-comprtment, multi-comprtment nd popultion models to e redily converted to FPGA implementtions. Generlized rchitectures re descried for the efficient modelling of first-order, nonliner differentil eqution in throughput mximizing or ltency minimizing dt-pth configurtions. The homogeneity of popultion nd multicomprtment models is exploited to form deep pipelines for improved performnce. Limittions of FPGA rchitectures nd future reserch res re explored.. Introduction Tody s neurl circuit modeller is often constrined y the limited performnce of conventionl personl computers. While computing power might doule every 8 months ccording to Moore s lw, neurl modellers often chnge the computtionl requirements y orders of mgnitude. Popultion models cn lwys simulte more neurons or morphologicl models cn lwys simulte more comprtments. The electrophysiologist cn lwys simulte dditionl conductnces, while the systems iologist cn dd dditionl cellulr processes to model. In generl, current neurl circuit modellers mke sustntil trdeoffs y limiting complexity to reduce processing requirements. FPGAs, or field progrmmle gte rrys, offer high performnce, configurle, sclle pltform for enling current nd next genertion neurl models. FPGAs re specilized integrted circuits tht re designed with reconfigurle computtionl primitives cple of implementing ritrry clcultions nd logic. They re widely used in consumer nd industril products for ccelerting processor intensive lgorithms. Engineers designing networking, rdr, wireless communictions, video processing, erospce, militry, nd test equipment, medicl imging nd other computtionlly intensive pplictions often utilize FPGAs s hrdwre coprocessors (Prdeep et l 2005,Xioynget l 2004), DSP processors (Kmlizd et l 2003, Wlkeet l 2000) or strem processors (Krishnmurthy et l 2002,Leeet l 999). FPGAs hve een less commonly used in io-relted fields, with severl exceptions. Protein (Oliver et l 2005)nd DNA (Brown et l 2004) sequencing re strting to use FPGAs to reduce processing time. Rel-time processing, registrtion nd other imge nlyses from confocl microscopy re enled y FPGAs (Budge et l 2004, Rest et l 2004). Most modelling pplictions on FPGAs hve een limited to studying neurl networks consisting of reduced neurons (Blke et l 997, Botros nd Adul-Aziz 994, Chngjin nd Hmmerstrom 2003, Omondi nd Rjpkse 2002). More complex models hve een implemented in nlogue VLSI (Brgg et l 2002, Jung et l 200). Recently, in our lortory, severl implementtions of conductnce-sed neurl models /06/0002+4$ IOP Pulishing Ltd Printed in the UK 2

2 R K Weinstein nd R H Lee hve emerged including Hodgkin nd Huxley s (952) nd Booth nd Rinzel s (995) models (Grs et l 2004), nd ten-comprtment motoneuron model (Weinstein nd Lee 2005). However, ech of these designs ws somewht specific to the model eing implemented. This pper expnds on those efforts y descriing generlized lgorithms nd rchitectures tht provide migrtion pth for current softwre-sed neurl models into FPGA-sed implementtions. 2. Methods 2.. FPGA hrdwre All designs were trgeted towrds Xilinx XtremeDSP Development Kit-II sed on Nlltech BenONE crrier ord consisting of one DIME-II module slot nd populted with Nlltech BenADDA expnsion ord. The BenADDA is preconfigured with Xilinx Virtex-II FPGA (XC2V3000-4fg676), two 4 it 65 MS s nlogue-to-digitl converters (ADC), two 4 it 60 MS s digitl-to-nlogue converters (DAC) nd megyte of on ord SRAM. All model designs were constructed using System Genertor (ver. 6.3i), n dd-on toolkit for Mthworks Simulink (ver. 6.2). System Genertor provides lirry of locks tht cn e converted into n HDL (hrdwre description lnguge) for synthesis. For simple locks such s multiplexors, logic gtes nd registers, the tool does direct trnsltion into the HDL. For more complex structures such s memories, multipliers nd dders, the tool relies on the Xilinx CORE Genertor (CoreGen). CoreGen comines user specified design constrints (it width, depth, etc) with timing constrints (ltency) nd re constrints (prllel versus seril). The Xilinx ISE Foundtion 6.3i pckge ws utilized for synthesis, plce nd routing, nd genertion of itstrem for progrmming. For certin results, Synplicity s Simplify Pro (ver. 7.0), n lterntive 3rd-prty synthesis tool ws employed for comprison. System Genertor provides unique interfce for FPGA digitl design through its rich lirry of synthesizle locks. Clocking is implicitly defined through the setting of smple periods (of ritrry units) for ech lock. Reset sttes re esily defined through the interfce. Additionlly, utomted support for uses nd explicit declrtion of fixed-point type formt simplify wht would tke considerle mount of effort when progrmming in trditionl HDL. System Genertor comines oth n interfce helpful to the trditionl hrdwre designer while hiding underlying detils to the neurl modeller FPGA uilding locks The vst mjority of processing within the dt pth involves four locks: ddsu, mult, cmult nd lookup tles. The former three encompss the rithmetic opertions (ddition, sutrction, multipliction, multipliction y constnt) nd the ltter is for ll other opertions. Division is sent from this list s there is yet to e high-speed, low ltency implementtion. The disdvntge is miniml s divisions cn generlly e mpped to multipliction of the inverse of the denomintor. (The denomintor is very often constnt or prmeter in neurl models, mking for trivil implementtion.) Ech lock, when mpped into the FPGA, cn tke on numer of different forms, ech with its own trdeoffs. Ech lock cn e chrcterized y its throughput, ltency, re, it width nd sign formt. The throughput is mesure of the numer of opertions tht cn e performed y the unit in given mount of time. The ltency of n opertion is the dely represented s mximum numer of cycles the lock requires to propgte n input to n output. When lock is pipelined (i.e. roken into multiple suopertions ech of one cycle durtion) or cple of completing one opertion per cycle regrdless of ltency, there is often negtive correltion etween throughput nd ltency. Resources within n FPGA re utilized in vrying wys depending on the prmeters of the lock. Additionl pipeline stges require dditionl registers per lock, while wider dt pth requires more logic. Different rchitectures cn sve re while scrificing performnce, such s in the cse of sequentil multiplier using single ccumultor to sum prtil products (Yo nd Swrtzlnder 993). Optimlly, the dt pth will exploit s much prllelism in the model s possile. In the simple eqution + cd, the multipliction of nd cn occur in prllel to the multipliction of c nd d. Then the ddition of the two terms cn follow. The metric used for determining prllelism is not simply the numer of opertions. Insted, the ltency of the opertion hs to e considered. In the eqution y = + c + d + ke + f where k is constnt, the multipliction of nd c, the constnt multipliction, nd pirwise sum re done in prllel (see figure ). Additionl rithmetic steps re done ised towrds the pths of lesser ltency. If ech of the ddition opertions hs ltency of one cycle, then the skewing of the ddition steps wy from the multipliers enles lnced tree of three to five cycles of ltency per pth. If the numer of opertions ws the metric for lncing expression trees, the dditions cn e relnced shifting the ltency per pth rnge to etween 2 cycles ( y) nd 6cycles(k, e y) Lookup tles Often, the neurl models require computtions tht re difficult, either resource hevy or too slow, for use in neurl model simultion. These cn e trigonometric functions, exponentils, squre roots or ny other expression with difficult-to-evlute closed-form solution. When the difficultto-evlute expression is function of single input, lookup tle provides n re efficient nd high performnce wy to estimte the output of the function. For n exmple, the stedy stte of gting vrile cn e estimted vi sigmoid or Boltzmnn s eqution s defined y: m = +exp ( ) () V θ σ where θ is the hlf ctivtion voltge of the gte, σ is mesure of the slope of the sigmoid nd V is the memrne potentil. This eqution is difficult to solve directly for two resons. 22

3 FPGA neurl model rchitectures () () Figure. Exmple expression trees for the eqution + c + d + ke + f where k is constnt. Two opernd multiply opertions re shown with two units of dely while constnt multiplies hve four units of dely. All dders re given one unit dely. Additionl pipeline registers cn e dded ritrrily to ech opertion. () The tree is lnced with respect to the numer of opertions per pth. () The tree is lnced with respect to the numer of cycles of ltency per pth. First, the exponentil hs no simple closed-form solution tht is efficient in n FPGA. Second, the inverse function is s difficult s division, which is generlly solved itertively, not directly like multipliction. Division y σ cn e simplified y refrming the expression s multipliction y new prmeter, /σ. Since eqution (2) is difficult to evlute directly in hrdwre, it is good cndidte for lookup tle. A ROM indexed y V cn produce suitle estimtion of m, thus removing the need for ny of the rithmetic opertions within the eqution. A simple mpping is required to convert the V input to n ddress for the ROM. For the generl cse: 2 n dder(x) = (x min(x)) (2) min(x) +mx(x) where n is the numer of its of ddressility in the lookup tle. The implementtion requires sutrction lock nd multipliction y constnt to perform the liner trnsform. The output of this multipliction should e set to sturte to void overflows when ddressing the tle. This mpping cn e shred for multiple lookup tles tht use the sme input. In Virtex-II FPGA, lookup tles re chosen to utilize SelectRAM, or lock RAM within System Genertor. A XC2V3000 FPGA contins 96 SelectRAM of which ech contins 8 kits of configurle RAM. Ech SelectRAM cn e configured s its, k 8 its, 2k 9 its, 4k 4 its, 8k 2 its or 6k it. Prtil SelectRAMs re wsted, so ech lookup tle cn expnd to fill the full RAM. In generl, lookup tles fit within the k 8 it configurtion Arithmetic performnce Generting optiml models often requires trdeoffs etween pipeline depth nd clock rte. In generl, deeper pipeline enles high clock rtes. In cses where the logic is fully utilizing the pipeline, deeper pipeline should trnslte to higher performnce, up until the point re on the FPGA ecomes limited. In the cse where the overll execution ltency is to e minimized independent of the numer of clock cycles, the clock rte is no longer the determinte of performnce, ut rther the timing delys through the pipeline. This dely is clculted s the product of the clock period (T ) nd one plus the pipeline depth ( + d). This is conservtive estimte s it ssumes tht the opertion uses the entirety of the clock period prior to nd fter the internl pipeline. In order to determine the influence of ltency on opertion throughput nd re, ech opertion ws synthesized nd mpped in the trget Xilinx FPGA, nd is shown in figure 2. Disy chined logic locks (ritrry length chin of six locks) were utilized to otin verge performnce results per opertion. To mitigte ny performnce hit (y wire or logic dely) cused y interfcing the inputs nd outputs of the opertions, doule-uffered registers were used to mp directly to input nd output pins. All opertions were set t 4 it 2 s complement signed numers with the fixed point set t the 3th it. This llows rnge of ± with pproximtely four deciml plces of ccurcy. Synthesis ws performed with the stndrd Xilinx Synthesis Technology (XST) uilt into ISE. Following plce nd route nd genertion of itstrem, re utiliztion ws ssessed nd the criticl pth from the log files ws noted. The inverse of the criticl pth period ws used s the pek clock frequency nd is shown in figure 2(). The re per opertion ws split into two components, registers, or flip-flops, nd look-up tles (LUTs). Ech vlue plotted (see figure 2()) is the verge re per opertion; the se overhed (from the doule-uffering of dt converters) ws sutrcted from totl utiliztion of registers nd LUTs nd the result divided y six to otin the verge. System Genertor provides n optimiztion flg, Pipeline to Gretest Extent Possile which ws set for the dder opertions ut not set for the constnt multiplier nd stndrd multiplier tests. We did performnce comprisons for ech opertion with this flg set nd unset nd found negligile differences. When set, however, the tool restricts the rnge of ltencies tht re progrmmle, in this cse, limiting to t lest three cycles for multiplictions nd one cycle for dditions. Tle summrizes the results for optimizing to two different design gols (pek throughput nd minimum ltency) post synthesis nd plce & route. Are is depicted s the numer of slices utilized. In the Virtex-II rchitecture, the logic fric is roken up into CLBs, or Configurle Logic Blocks. Essentilly, ech CLB cn output 8 its of informtion. Within ech CLB, there exist four slices nd two tristte uffers. Ech slice contins two it registers, two it 23

4 R K Weinstein nd R H Lee () () Numer of LUTs Numer of FFs Cycles per Opertion 4 5 (c) (d) Clock Freq (MHz) Op Dely (ns) Cycles per Opertion Adder Cmult Mult (logic) Mult (HW) 5 Figure 2. Are versus performnce trdeoff. (), () Performnce nd (c), (d) re utiliztion of sic rithmetic opertions. The enchmrks were performed with 4 it opernds nd 4 it output. The inputs nd outputs were doule uffered nd ssigned to externl pins on the FPGA. All performnce results re derived post synthesis nd plce & route. () Clock frequency is found s the mximum operting frequency when synthesized t for ech cycle ltency nd opertion. () The opertion dely here is shown s the dely (in ns) for n output to djust following chnge in n input. (c), (d) Are utiliztion is split etween flip-flops nd lookup tles, wherey two of ech mke slice. The enlrged mrkers designte the proposed delys for ech lock to minimize the re versus performnce trdeoff. These plots lso demonstrte the performnce nd re dvntge of using dders nd hrdwre emedded multipliers versus constnt multipliers nd logic-sed multipliers. When throughput is to e mximized ove ll else, then dders nd logic-sed multipliers re preferred. Tle. Pek performnce of opertions. Trget Mx throughput Min ltency Adder Depth 2 cycles 0 cycles Frequency 2.2 MHz Dely 9.5 ns 2.9 ns Are 4 slices 6 slices Cmult Depth 4 cycles 0 cycles Frequency 20.9 MHz Dely 33. ns 2.4 ns Are 98 slices 59 slices Mult Depth 5 cycles 0 cycles Frequency 69.7 MHz Dely 29.5 ns 0.5 ns Are 35 slices 0 slices The pek throughput of the multiplier is chieved when implementing in logic while the minimum ltency is chieved using n emedded multiplier lock. lookup tles (LUTs), nd dedicted rithmetic crry chins nd SOP (sum of products) logic. Ech LUT cn e configured s one 4 it ddressle lookup tle, one 6 it RAM or 6 it shift register. (The XC2V3000 hs 3584 CLB s.) These slices my e prtilly utilized in this design ut my e shred etween multiple opertions in resource-constrined design. The slice count in tle shows the sum of ll fully nd prtilly utilized slices. These dt show tht performnce cn e optimized for either throughput or ltency depending upon the requirements, nd tht different design choices will hve lrge impct on overll model performnce. When the resources nd model rchitecture llow for deep pipeline, ech opertion cn e hevily pipelined mximizing throughput, where the frequency is the mximum operting frequency sed on the criticl pth period. When pipelining is not desired (see Single-Cycle Architecture), pek performnce is chieved with no cycle ltency nd miniml dely per opertion. The lock option flg Pipeline to Gretest Extent Possile hd no noticele effect on performnce under these conditions nd ws disled. While we hve not repeted this study for dt widths greter thn 4 its, it is expected tht similr trends will follow. Throughput performnce is generlly mximized when using lrger pipelines. It is expected tht the optiml throughput would e found when ltency is set to higher vlues s the it width increses. Are ecomes especilly constrined when dt pths ecome wider. The Resources Estimtion Tool, included s prt of System Genertor is n invlule resource for the FPGA modeller when implementing re-constrined designs (Shi et l 2004). We cn utilize forml pproch for defining ech opertion s discrete-time trnsfer function sed on the cycle ltency from input to output. Bsed on the timing results depicted in figure 2, ech opertion cn e represented y the following expressions, where x, y re intermedite vlues, sttes or prmeters, k is constnt, nd p is the stge in the execution pipeline. Add(x[p],y[p]) = x[p ] + y[p ] Su(x[p],y[p]) = x[p ] y[p ] CMult(k, x[p]) = kx[p 4] Mult(x[p],y[p]) = x[p 2] y[p 2]. These mppings redefine rithmetic opertions for n FPGA implementtion tking into ccount opertionl dely. The multiplier delys re vlid for it widths 8 (the size of the uilt in multipliers). Alterntive mppings re defined for ddition nd sutrction sed on prticulr rchitecture nd design constrints. (3) 24

5 FPGA neurl model rchitectures 2.5. Externl interfcing nd dt collection FPGA models execute t extremely high throughputs. Roughly speking, the performnce level (defined s simulted time/execution time) is the time step multiplied y the numer of models nd the FPGA clock frequency then divided y the pipeline depth. This numer is mximized only when the FPGA is completely utilized. All of the models presented here re intended s exmples nd s such re not very complex. Consequently, they ll hve on-chip execution times of less thn ms (even if they were ll plced on the chip simultneously). Thus, for the ske of convenience, the dt presented here re generlly from emultion of the FPGA directly in Simulink. 3. The se rchitecture 3.. FitzHugh Ngumo For the ese of presenttion we will present the rchitecture s n exmple implementtion of simple neuron model. However, the ides cn e pplied to ny neuron model. The FitzHugh Ngumo model (FitzHugh 96, Ngumo et l 962) is reduced, dimensionless representtion of the Hodgkin nd Huxley model (Hodgkin nd Huxley 952). This model mkes the following ssumptions: () the ctivtion gte of the sodium chnnel hs extremely fst kinetics nd therefore reches the stedy-stte vlue instntneously, nd (2) the potssium chnnel gte hs similr, ut reverse kinetics (timescle nd gte chrcteristics) to the inctivting gte of the sodium chnnel. The Hodgkin nd Huxley equtions centred round four ordinry differentil equtions of voltge (V m ), sodium ctivtion (m), sodium inctivtion (h) nd potssium ctivtion (n) cn e reduced to simple potentil stte nd recovery stte. When non-dimensionlized, the following coupled differentil equtions emerge: du = u dt 3 u3 w + I (4) dw = ε( 0 + u w) (5) dt where u is the potentil of the system nd w is the recovery stte. The prmeters ε, 0 nd modulte the shpe of the spike nd I is the input (in dimensionless current) to the system. This model, despite eing drsticlly simplified, hs chrcteristics tht mke it stereotypicl of neurl circuit models. Ech eqution is first-order ordinry differentil eqution with eqution (4) hving nonliner term. The equtions re coupled nd cnnot e solved nlyticlly. This system enles demonstrtion of our techniques for FPGA model development in n esy-to-understnd exmple Generting the dt pth Differentil equtions of the stndrd form cn e split into two terms: the differentil term or the time vrying stte vrile, nd the intermedite clcultion, or eqution for the rte of chnge of the stte vrile. In equtions (4) nd (5), the left-hnd side is the differentil term nd the right-hnd side is denoted y the intermedite term. It is the intermedite term tht will e converted into dt pth for clcultion. Defining the functions f nd g from equtions (4) nd (5), respectively, s follows: f(u,w)= u 3 u3 w + I g(u, w) = ε( 0 + u w) mkes cler the delinetion etween the stte nd the intermedite; the generted dt pth ecomes independent of ny prticulr numericl solving techniques. First- nd higher order, fixed time step nd vrile time step solvers cn e implemented round the dt pth without ny modifiction to the design. Additionlly this isoltes the dt pth from ny prticulr simultion or protocol requirements, such s strting, stopping nd resetting. As will e descried lter in this pper, this seprted dt pth provides generl cse for rpid mpping into vrious rchitectures including popultion nd multicomprtment modelling ODE to difference eqution Ech eqution defined in the continuous time domin must e mpped to discrete time for numericl nlysis. Simultion on generl-purpose computer cn utilize the following discrete time representtion y mens of forwrd-euler integrtion, where n is the itertion step: u[n +]= u[n]+ t ( u[n] 3 u[n]3 w[n]+i ) w[n +]= w[n]+ t ε( 0 + u w). The ove equtions cn redily e clculted in generlpurpose processor where instructions re executed sequentilly for ech itertion of the loop. These equtions re not explicit with respect to processing in n FPGA, where ech opertion hs timing requirements tht must e considered. Additionlly, there is no explicit prllelism defined. Utilizing the commuttive nd ssocitive property of ddition nd sutrction, we reorder the rithmetic opertions to enle prllelism sed on opertion ltency s descried in the expression tree in figure. f(u,w)= [ ((u w) + I) ( (u u) ( 3 u))] g(u, w) = [ε (( 0 w) + ( u))]. In generl, multiplictions should e performed s erly s possile in the clcultion s the ltency is gretest. This will skew the resulting expression tree to force fster dditions nd constnt multiplictions outside the mximum ltency pth, when possile. To formlly define the ove expressions in the opertion spce of the FPGA, we cn use the mppings defined in eqution (3) to the redefined functions f nd g nd otin: f(u[p],w[p]) = Su(Add(Su(u[p],w[p]), I[p]),..., Mult(Mult(u[p],w[p]), CMult(/3,u[p]))) g(u[p],w[p]) = Mult(ε[p], Add(Su( 0 [p],w[p]),..., Mult( [p],u[p]))). 25

6 R K Weinstein nd R H Lee () u[p] 2 w[p] Fix_2_8 Fix_2_8 xlddsu z -- Su z -2 () Mult z -2 () Fix_2_7 3 I UFix 8 Fix 9 UFix_8_6 Fix 7 xlddsu z -+ Add Fix_2_8 z -2 () Mult xlddsu z -- Su Fix_0_7 f(u[p],w[p]) Mult2 UFix over 3 () 0 2 w[p] 3 5 u[p] UFix_8_5 Fix_2_8 UFix_8_5 Fix_2_8 xlddsu z -- Su z -2 () Mult Fix 8 Fix_7_3 Fix_2_8 xlddsu z -+ Add 4 e UFix_8_7 z -2 () Mult Fix_9_8 g(u[p],w[p]) Figure 3. Simulink lock digrm of intermedite clcultions () f(u,w)nd () g(u, w). Ech opertion is identified y the lel on the lock. All fixed-point dt types re lelled on the wires interconnecting the locks, where the first portion (Fix/UFix) defines the vlue to e signed or unsigned, respectively. The second term is the numer of totl its in the representtion nd the lst term defines the numer of frctionl its, or the numer of its to the left of the deciml plce. Dt ports for the susystem re the numered ovl locks, where the inputs re prmeters or sttes nd the output is the intermedite clcultion. Delys through the system re expressed in the z domin where the superscript is the ltency in cycles from output ck to input. Evluting these mppings yields equtions (6) nd (7). This representtion is useful to descrie the overll dely through the dt pth, in this exmple, six cycles. Since there is skew etween the shortest ltency pths nd the longest ltency pths (from one cycle to six cycles), dditionl pipeline registers cn e dded in the shorter delys without incresing the pipeline depth, possily enling incresed throughput with fster clock rte. Additions nd sutrctions re set to hve no ltency, ut this cn e redily chnged y reindexing the prmeters nd sttes y the dditionl dely. [ ] ((u[p ] w[p ]) + I[p ]) f(u[p],w[p]) = ((u[p 6] u[p 6]) (/3u[p 4])) (6) g(u[p],w[p]) [ ( )] (0 [p 3] w[p 3]) + = ε[p 3]. (7) ( [p 6] u[p 6]) These equtions define the required cycle ltency for the input to rech the output synchronously. Once implemented, this dt pth cn e used for well-performing implementtions of single-comprtment models, multi-comprtment models, popultion models, etc. The dt pth cn e constructed identiclly in ll of the ove cses. Figure 3 shows the System Genertor dt pth corresponding to equtions (6) nd (7). For these different rchitectures, only the implementtion of the stte vriles will chnge s is expounded upon in the following sections. 4. Single model, single unit cse 4.. Multi-cycle rchitecture The generl model simultor will execute single version of model ccording to set protocol. The modeller will often run simultions to mnully tune prmeters, trying to replicte prticulr ehviour. Sometimes this model will hve prticulr performnce requirements such s reltime execution. For these cses, we developed generl rchitecture for running series of differentil equtions with ritrry dely s multi-cycle processor. We employed two synchronous clock domins, one providing slower outer loop to interfce with the outside world nd fster inner loop for the dt pth processing. In this wy, it is very much like multi-cycle computer rchitecture, overclocking the internl pipeline in hidden fshion from the outside. The originl equtions for the stte vriles, u nd w cn e expressed s difference equtions round the functions f nd g. We use forwrd-euler integrtion s numericl solver due to its ese of implementtion. Additionlly, numericl ccurcy cn e improved with smller step sizes, resonle trdeoff given the high performnce of n FPGA. This rchitecture is esily extendle to higher order ODE solvers including Rung Kutt, predictor corrector, or even vrile time-step solvers if desired. du = f(u,w) u[n +]= u[n]+ t f(u[n],w[n]) (8) dt 26

7 FPGA neurl model rchitectures () () f(u,w) Fix_0_7 UFix_8_ ts f(u,w) Fix_0_7 u[n] Fix_2_8 UFix_8_ ts Fix_0_9 z -2 () Mult xlusmp 7 Up Smple xldsmp 7 z - Down Smple Fix_2_8 Fix_0_9 () Mult xlddsu Fix_2_8 + Add 2 u[p] xlddsu Fix_2_8 + xlregister d z - Fix_2_8 q Add Register u[n] Figure 4. Simulink lock digrm of stte clcultions for single unit models. () The multi-cycle pproch to single unit forwrd-euler integrtion. The intermedite clcultion is scled y the time constnt nd dded to the previous stte. The output of the stte, u[n], is clocked once for every seven cycles of u[p]. The Up Smple lock is set to copy the slower clocked smples cross ll seven fst clock cycles while the Down Smple lock is implemented s register t the slower rte. () By removing the delys within the pipeline, up nd down smpling ecomes unnecessry, requiring only register to store stte etween itertions. There is only one clock rte in this scenrio. dw = g(u, w) w[n +]= w[n]+ t g(u[n],w[n]). dt (9) Equtions (8) nd (9) representing the stte clcultion cn e comined with the dt pth formultion in equtions (6) nd (7) to generte the full expression for the model. The dt pth needs to e modified to include the time-step multipliction dding nother two to four delys to the dt pth. The totl dely for the dt pth is incresed to seven clock cycles. The trnsltion of the stte eqution into hrdwre is only single register clocked t the slower clock rte. Equtions (0) nd () re the comined dt pth nd stte expression where n is the itertion of the outer, slower clock nd p is the itertion of the pipeline, or fster clock. u[n +,p] = u[n, p]+ [ ] ((u[n, p 2] w[n, p 2]) + I[n, p 2]) t ((u[n, p 7] u[n, p 7]) (/3u[n, p 5])) (0) w[n +,p] = w[n, p]+ [ t ε[n, p 3] ( (0 [n, p 3] w[n, p 3]) + ( [n, p 6] u[n, p 6]) )]. () In this rchitecture, simplifiction emerges enled y the slower clocked stte. All execution pths within the dt pth must settle y the time the next vlue is clocked into the stte register. Therefore, for pths of fewer cycles thn the criticl pth (in this cse, the pth with the longest ltency), there is no difference if the inputs rrive erlier thn required. Therefore, ll pths cn receive their input on the previous outer cycle nd ltch the new vlue t the following outer cycle. All timing requirements of the pipeline with respect to insertion dely in the pipeline ecome unnecessry. The new equtions follow from the simplifiction: [ ] ((u[n] w[n]) + I[n]) u[n +]= u[n]+ t ((u[n] u[n]) (/3u[n])) [ ( )] (0 [n] w[n]) + w[n +]= w[n]+ t ε[n]. ( [n] u[n]) The forwrd-euler method is reltively simple to implement within System Genertor, tking only four locks s shown in figure 4(). The stte t ech itertion is stored in single register clocked t the output rte. We use comintion of n up smple nd down smple lock (the clock multiplier/divider is set to the mximum dely through the dt pth, including ny clcultion within the stte). The up smple lock is configured to copy the vlue from ech input period to ll corresponding output periods. This is not necessry for hrdwre genertion, i.e. it does not trnslte directly to hrdwre, ut does llow the softwre to verify correct clocking of dt through the system. The down smple lock is configured to copy the lst frme of the input to the output, which in hrdwre is register clocked t the output rte. The reminder of the stte consists of multiplier lock t the output of the dt pth to scle the function y the time step nd n dder to dd to the previous vlue. The previous vlue is t the output of the up smple lock. More complex integrtion lgorithms cn e implemented s n extension to this se cse. The multipliction y the time step cn lso e incorported within the dt pth depending on the prticulr integrtion lgorithm possily sving multipliction step. This cn e shown for the function g(u, w) where the constnt multipliction of ε cn e sustituted y the constnt multipliction of the product ε t. This is only possile in the trivil cse of Euler integrtion with fixed step sizes. 27

8 R K Weinstein nd R H Lee Tle 2. Single model rchitecture design comprison. Frequency Output frequency Ltency Are (MHz) (MHz) (ns) (slices) Single-cycle Multi-cycle Single-cycle rchitecture The multi-cycle rchitecture enles reduction in stte registers while still utilizing fully-pipelined dt pth. It provides mens of integrting pipeline of ritrry depth into n integrtion stte susystem producing only one output per time step. There re three drwcks with this pproch: wsted re, reduced performnce nd high power consumption. First, re is not utilized efficiently s pipeline delys within the intermedite clcultions use registers tht could e used for dditionl logic or extr dt storge. Ech dely requires numer of registers equl to the it width of tht clcultion. Significnt re svings cn e relized y removing those extr delys. Second, performnce is reduced for two resons: () extr delys contriute n dditionl time dely in the form of setup nd hold time. The register setup time, or the minimum mount of dely etween the dt ecoming stle prior to the edge of the clock signl, for n XC2V3000 speed grde 4 FPGA is 370 ps. The hold time, or the minimum durtion following the clock edge for the dt to e redy to red is 90 ps. The sum of those vlues constitutes window round the clock edge where dt must e stle nd is wsted within the smple period. Long pipelines cn ccumulte this 460 ps ded-time in the period for ech dely in the pth. (2) The overll dely through the pipeline is equl to the product of the depth nd cycle period. Ignoring setup nd hold times, the dely through the pipeline is idelly the sum of ll the rithmetic comintoril logic delys. If the totl logic cn e eqully distriuted (dely-wise) etween n ritrry numer of registers, then ltency/throughput is independent of the pipeline depth. Prcticlly, the period is function of the longest comintoril pth etween registers. Therefore, shorter comintionl pths must execute within lrger clock periods, reducing efficiency. As the numer of pipeline registers increses, the more difficult it is to mintin symmetry etween comintoril pth delys. Third, power consumption nd clock frequency re directly proportionl for given model design. The power consumption of device is not generlly n issue for the modeller, ut does constrin the design of the device itself; the pek clock frequency of n FPGA is prtilly constrined y the limits of het dissiption s power consumption increses. Achieving the sme or incresed throughput t slower clock rte is generlly preferred. Therefore, to utilize miniml re nd power while chieving pek performnce requires the reduction of the pipeline depth to the minimum chievle. The mjority of neurl models cn e modified to execute in single cycle per time step itertion y chnging ll ltencies to zero. (Note tht one register lwys remins due to integrtion, resulting in the single-cycle designtion.) For multiplier locks, the Pipeline to Gretest Extent Possile flg must e unchecked. This chnge cn e mde to ll rithmetic locks. Then the up-smple nd down-smple locks cn e chnged to single register to complete the single-cycle design pproch. The results of comprison etween the two design pproches re shown in tle 2 with the re utiliztion nd performnce results determined post synthesis nd plce & route trgeting n XC2V3000-4fg676 FPGA. The toplevel design depicted in figure 5 ws used for testing nd synthesizing the single-cycle nd multi-cycle models. In the single-cycle version, the entire dt pth is executed within ech clock cycle nd requires only 7 ns to complete. When delys re distriuted in the multi-cycle, totl of 7 for the longest pths, the ltency is tripled. The mximum clock frequency is incresed y lmost 2.5 times, not enough to compenste for the dditionl cycles required per itertion. The mximum frequency supported y the XtremeDSP-II Development Bord is 20 MHz cpping the pek multi-cycle model throughput. In this exmple, the single-cycle model enles 87% improvement in performnce with 34% reduction in re over the multi-cycle model. In generl, the single-cycle method is preferred over the multi-cycle method when ll the locks within the dt pth cn e set to zero ltency. When tht is not the cse, the multi-cycle method is suitle fllck technique. 5. Multiple model, single unit cse Deep pipelines llow for multiple simultneous processes executing within the dt pth, where the numer of processes equls the depth of the pipeline. In other words, if the dt pth requires ltency of ten cycles until the first output ppers, ten simultneous models cn e executed without loss of performnce. The dt pth would produce the ten models interleved t the inner, fster clock rte. In contrst, within the single-model, multi-cycle rchitecture, the pipeline produces n output t the outer, slower clock frequency. Ultimtely, the throughput of ech model does not chnge, ech output for the sme model will e t the slower frequency, ut the ggregte ndwidth of the system will sustntilly improve. Two scenrios re common cndidtes for multiple model, single unit simultion: () there re set numer of models you re interested in simulting, for exmple, ll the neurons in prticulr nuclei or circuit or (2) when the numer of simultneous simultions is flexile nd more is etter, such s in utomted prmeter serching or popultion modelling. The first scenrio pplies more constrints to the model nd produces very deterministic output. Only the ville re on the FPGA limits the second scenrio. Within these rchitectures, chnge in the numer of models simulted requires strightforwrd modifiction to the model design. 5.. Pipelining the dt pth The structure of the rithmetic opertions in the dt pth in the multiple model cse is identicl to tht of the single 28

9 FPGA neurl model rchitectures u[p].5 I UFix_8_6 w[p] I Fix_0_7 f(u[p],w[p]) f(u,w) f(u,w) dudt Fix_2_8 u[n] Fix_2_8 u[p] xlregister z - Fix_2_8 xlregister d z - Fix_2_8 Fix_2_ d q q xlcst force Reinterpret Register Register DAC doule u Scope k =2 0 UFix_8_5 0 w e UFix_8_5 w[p] UFix_8_7 e u[p] Fix_9_8 g(u[p],w[p]) Fix_2_8 w[n] g(u,w) Fix_2_8 w[p] dwdt xlregister z - Fix_2_8 xlregister d z - Fix_2_8 Fix_2_ d q q xlcst force Reinterpret Register2 Register3 DAC2 doule g(u,w) System Genertor Figure 5. Top-level view of the FitzHugh Ngumo model. Prmeters re defined in Xilinx constnt locks (on the left) moving into the intermedite clcultion. The sttes re evluted next, with feedck pths to the inputs of the intermedite clcultions. The outputs, u nd w re doule-uffered for performnce nd scled for nlogue output vi the on-ord dt converters (on the right). model cse. The differences lie in distriuting ltencies nd mnging the prmeters in the pipeline. Ltencies, or dditionl clock cycle delys, re inserted to mximize throughput of the system. As the longest comintoril pth etween ny two registers is the sole determinte of clock frequency nd therefore model throughput, cre must e tken to distriute the dely s uniform s possile throughout the pipeline. The following lgorithm descries n pproch to distriuting the ltency within the dt pth:. The trget numer of simultneous models will set the depth of the susequent pipeline generted. 2. The longest, weighted rithmetic pth is isolted nd delys re judiciously dded such tht the totl dely does not exceed the depth of the pipeline. The longest, weighted rithmetic pth is defined s strting from single endpoint (prmeter of the system or stte) nd terminting with the completion of the differentil of the stte. () Delys re dded in n initil pss providing rtio of 4:2: cycles (see eqution (3)) of ltency to constnt multipliers, hrdwre multipliers nd dditions/sutrctions, respectively. Nonhrdwre emedded multipliers hve similr dely requirements to constnt multipliers. () Add n dditionl dely for ech opernd of multiplier tht is greter thn 8 its. (c) Tles, oth ROM (red-only memory) nd singlend multi-port RAM (rndom ccess memory) locks require one unit of dely regrdless of it-width or ddressility when fit into one SelectRAM. Additionl glue logic is required when more thn one SelectRAM is required which could enefit from n dditionl dely. (d) When the numer of delys in the pipeline exceeds the numer of simultneous models, remove delys evenly throughout the pth until the numer of delys equls the numer of models. 3. Repet on ll other pths tking cre to never use more ltency cycles thn the criticl pth. Adding extr delys such tht ll pths re lnced is not necessry nd will wste FPGA resources. This lgorithm post processes the expression trees mking up the dt pth solving the intermedite clcultion, s defined ove, with the timing informtion required to interfce with the stte solvers nd prmeter susystems. The leves of these expression trees re the prmeters nd the stte inputs. Ech lef hs n insertion dely ssocited with it defined s the sum of the delys long the pth from the lef to the root of the tree, including ny clcultion tht my occur within the stte solver (ex. multipliction y the time constnt). For exmple, in figure (), k,,, c, d, e nd f re mpped to k[p 5], [p 3], [p 4], c[p 4], d[p 4], e[p 5] nd f [p 4], respectively. Synchroniztion of the pths within the expression trees is therefore ccomplished y providing delyed version of sttes nd prmeters to the leves consistent with the insertion dely of the prticulr lef node Pipelining the sttes Executing n models simultneously requires the continuous storge of n sets of informtion within pipeline tht is n stges deep. In the previous work (Grs et l 2004), ll informtion ws stored within the pipeline vi dely locks, requiring creful synchroniztion of the expressions. Sttes were implemented s simple dely lock with n cycles of ltency. This rchitecture recognizes the sttes nd prmeters s forming sis set of model informtion. All intermedites 29

10 R K Weinstein nd R H Lee u stte u intermedite z - z - z - z - z - z - z - z - z - z over 3 z -2 Mult z -2 Mult2 () () xlddsu z -- Su I[n,p-4] z -2 Mult xlddsu z -+ Add du/dt =f(u,w)= ((u-w)+i)-/3*u^3 () xlddsu z -- Su ts f(u[n,p-2],w[n,p-2]) z -2 Mult3 () u[n,p] xlddsu + u[n+,p] u integrte w intermedite 3 [n,p-7] 2 0[n,p-6] z -2 Mult4 xlddsu z -- Su2 () xlddsu z -+ Add2 4 e[n,p-4] z -2 () Mult ts 2 g(u[n,p-2],w[n,p-2]) z -2 Mult6 () xlddsu w[n+,p] + w integrte dw/dt =g(u,w)= e*((0-w)+*u) w[n,p] w stte z - z - z - z - z - z - z - z - z - z Pipeline Stge Figure 6. Simulink lock digrm of multiple unit model. Ten itertions of the FitzHugh Ngumo model re run simultneously through the constructed ten stge pipeline. All locks re depicted to e in scle with respect to cycle ltency. Since the dt pth requires only seven cycles per itertion, the stte register chin must e t lest seven stges within this rchitecture. It ws chosen to e ten for this exmple. cn e shown to e function of the sttes nd prmeters. Therefore, only the sttes must e explicitly stored within ech time step. The delys of the previous work re replced with chin of n registers. This hs two enefits: first, ech register cn now contin n initil vlue for the stte tht cn e unique for ech model simulted. By convention, the til of the chin (lst output) contins the initil stte of the first model nd the hed register of the chin stores the stte for the lst model. Second, the outputs of the n registers represent the set of ll delyed versions of the stte. These outputs re now ccessile to e routed ck into the expression trees t the dely required for the prticulr opertion. For exmple, if chnge on prticulr stte lef of the expression tree hs n insertion dely of six cycles, then the input of tht stte should e tpped six registers deep in the stte pipeline. This llows the stte informtion to follow cycle y cycle the intermedite logic within the expression tree. This lgorithm pplied to the FitzHugh Ngumo model is shown in figure Shred versus unique prmeters When executing set of models, prmeters cn e either sttic cross ll models or ritrrily vrying cross ll models. In the simple cse where prmeter is sttic, it cn e represented s System Genertor constnt lock nd hs no prticulr timing requirements. Unique prmeters per model cn redily e exploited through the use of circulr uffer of length n counting through ech prmeter per cycle. These prmeters must e synchronously ville reltive to the trget model t the correct insertion point within the model. Given n input I, delyed y d cycles, s represented in the difference eqution s I[p d], within pipeline of depth n, the circulr uffer must e initited with prmeters forwrd rotted y n d steps in order to mintin synchroniztion. Therefore, fter n d increments of the pipeline, the prmeter I will e inserted into the pipeline such tht d cycles lter, the output for tht model will pper t the root of the expression tree. In System Genertor, this circulr uffer is implemented s count limited counter rnging from 0 to n ddressing ROM with the prmeter vlues pre-initilized. The prerottion of the circulr uffer cn e ccomplished in two wys, either through chnge in the initil vlue of the counter or y rottion in the initil vlues in the ROM. The former method requires dedicted counter per unique prmeter set in the model. Therefore, the ltter method is preferred s rottion in the ROM requires no extr resources llowing one counter to e shred for ll prmeter tles. This pproch ws used for the current input I, for the trces illustrted in figure 7. The ROM mcro in System Genertor ecomes synthesized s synchronous memory such tht the output is registered. Rotting the uffer y n d indices compenstes for this one-cycle ltency. When n is reltively smll (n < 5), it is dvisle to use distriuted RAM resources when ville. In distriuted RAM, ech slice cn store up to 32 its of dt. When n is lrge or when logic resources re limited, these prmeter tles cn e kept in lock RAM. The decision to use lock RAM or distriuted RAM lrgely depends on the prticulr limiting resource constrint in model. 30

11 FPGA neurl model rchitectures Figure 7. Output trces from cycle-ccurte, fixed-point simultion within Simulink of ten concurrently executing Fitzhugh Ngumo models. The trces were generted from the model shown in figure 6. All prmeters were kept constnt with the exception of I, which ws vried from 0.99 to 2.97 with step of 0.22 illustrted from top trce to ottom trce. The first 3000 points re shown. Note tht this simultion would execute in pproximtely 0.25 ms on the FPGA, nd 50 such designs could run concurrently for totl of 500 models. 6. Coupled, multiple unit cse Coupled, multiple unit models re strightforwrd extension to the isolted multiple model, single unit cse. Coupling cn occur etween comprtments in morphologiclly complex neuron model, s electricl connection etween neurons in the form of gp junctions, or s chemicl connections, or synpses within popultion. This cse dels exclusively with homogenous popultions of neurons or neurl comprtments with some form of coupling or reltime interction. A prticulr exmple of ten-comprtment motoneuron designed in this mnner is descried in the previous work within this lortory (Weinstein nd Lee 2005). A coupling is defined s stte vrile from one unit cting s term or fctor of n intermedite clcultion of different unit. A unit cn e coupled to ll other units of model in the cse of fully interconnected popultion model. In contrst, unit might only e coupled to its neighours in the cse of liner multi-comprtment chin. These two cses re considered the generl cses of unit coupling, where there is regulrity etween the connections. When modelling non-generl cses where few connections re creted, it is often simpler to mp the scenrio ck into generl cse when possile. This my not turn out to e the most optiml implementtion. For exmple, in model of 20 neurons such tht ech neuron is coupled to nother to form pirs of hlf-centre oscilltors, ech neuron will tke input from only one other neuron nd output to only one neuron. If djcent neurons within the pipeline re coupled, then odd neurons will couple to the next neuron nd even neurons will couple to the previous neuron. The following descries two such pproches to generting this coupling logic. The first pproch is literl conversion of the coupling lgorithm into System Genertor locks. An even odd test cn e ccomplished y using the LSB (lest significnt it) of the prmeter set ddress counter s select line into 2: multiplexer. The counter will go from 0 to 9 in this exmple, toggling the LSB t ech cycle. The inputs of the multiplexer will e the voltge sttes from the djcent units. Following the convention where the first model is t the til of the register chin, when the multiplexer select line is 0, the unit is odd nd requires the stte from the following unit. The stte used will e tpped from the stte register chin t the point of the insertion dely of the lef node plus one. When the multiplexer select line is one, then the unit is even, nd the previous unit s stte is used, which will e tpped t the insertion dely minus one. Any prmeters cting on the coupled stte cn e processed s usul with 20 element ROM. The second pproch generlizes the coupling nd removes the multiplexer from the implementtion. This pproch provides for two inputs to ech model, one from the previous unit nd one from the next unit. Very often neurl models hve n intensity prmeter in the form of mximl current or conductnce. These prmeters cn e set to zero for the cses where there is no coupling nd the proper vlue when there is coupling. Two prmeter tles will e required, one for the even units nd one for the odd. This pproch, while wsteful in resources, simplifies design s only the stndrd rithmetic locks re required. Models requiring full interconnection etween ll units re resonle nd strightforwrd to design ut re generlly resource constrining. In the cse of synpses for fully interconnected popultion of n neurons, given recurrent connections, the logic requirements include n 2 synpses with n implementtions of the synptic mechnism including t lest n stte solvers, n prmeter tles of n depth hold the synptic weights, nd n dders in tree with log 2 (n) levelstosum the synptic input. As n ecomes lrger, the synpses tke on n ever-incresing percentge of FPGA resources, quickly limiting the scle of popultion models. Future work is needed to consider lterntive design pproches for incresing the size of neurl popultion models. 7. Discussion This pper serves to provide the methods for implementing stereotypicl neurl circuit model elements in n FPGA. While designing model in itself is strightforwrd, there re some key limittions nd res of future work to mke it s esy to use s softwre simultion. This discussion documents the chllenges of converting floting-point clcultions to fixed-point representtions, interfcing the model to externl systems, nd the limits of sclility within n FPGA. 7.. Precision determintion System Genertor provides no mens for hndling rel or floting-point numers. Insted, ll prmeters, sttes nd intermedite clcultions utilize fixed-point numers, defined y sign, numer of totl its nd numer of frctionl its. Before executing in hrdwre, it is necessry to set ech opertion to hve sufficient precision to void overflows, 3

12 R K Weinstein nd R H Lee Tle 3. Clcultions to determine rnge vlues per opertion type. R = R R 2 High rnge vlue (H ) Low rnge vlue (L ) Addition + H + H 2 L + L 2 Sutrction mx(h L 2, L H 2 ) min(h L 2, L H 2 ) Multipliction mx(l L 2, L H 2, H L 2, H H 2 ) min(l L 2, L H 2, H L 2, H H 2 ) Division / mx(l /L 2, L /H 2, H /L 2, H /H 2 ) min(l /L 2, L /H 2, H /L 2, H /H 2 ) The division ssumes tht the rnge of the inputs does not cross zero. underflows or functionl mismtches due to quntiztion errors. Excessive precision should e voided s re utiliztion nd performnce will suffer. Optiml precision for ll opertions is difficult if not n impossile gol nd is still n ctive re of reserch. On the other end of the scle, the numer of its required to gurntee full precision continuously increses fter ech opertion. (Full precision here mens precision sed on the rgument precisions rther thn the rguments themselves.) Full precision for multipliction, s implemented in System Genertor, is the sum of the numer of its of ech opernd. An ddition hs full precision when the numer of integer its of the output is the mximum of the integer its of ech opernd. Similrly, the numer of frctionl its of the output is the mximum of the numer of frctionl its of ech opernd. This is worst-cse, pessimistic pproch to determining precision through the pipeline y ssuming tht ll the full rnge of vlues re vlid for ech opertion nd tht mximum frctionl precision is lwys necessry. As n exmple, 7 it multiply (7 it unsigned or 8 it signed inputs) requires just one emedded multiplier. A 34 it multiply requires 4 emedded multipliers to clculte the prtil products nd dders to sum the two prtil products. When moving to 5 it multiplier, the resources jump to 9 emedded multipliers (Shi et l 2004). Excess precision will require dditionl ltency to mintin the sme throughput nd wste logic resources tht could e used for dditionl prllel opertions. We hve found severl techniques to reduce the required precision ut hve yet to report on generl lgorithm for determining the optiml precision. First, we cn reduce the numer of integer its per opertion y ounding the prmeters nd sttes within prcticl rnges. For exmple, for prticulr neuron firing, the memrne potentil might rnge from mv to mv requiring one sign it, seven integer its nd numer of frctionl its to chieve the desired resolution. Those signed seven integer its llow rnge of 27 to 26. Full precision uses the full rnge of the fixed-point representtion, ut in relity, only the usle rnge is required. We define S to denote signed numer (versus n unsigned, positive only vlue) nd R i = (L i, H i ), where L i nd H i re the low nd high vlues of the usle rnge of the ith lef node. When L i nd H i re of different signs or oth negtive, the numer is signed nd requires n extr it to denote the sign. Formlly, the sign, S i, nd the numer of integer its, Z i, given rnge R i re determined y:, L i < 0 S i =, H i < 0 0, otherwise Given : mx int = mx L i, H i { log2 (mxint) + mxint > 0 Z i = 0, otherwise. (2) Using the rnge nottion, R i, insted of the full precision to represent ech vlue, the numer of integer its, Z i, required throughout the pipeline is generlly reduced. We reclculte these rnge vlues t the output of ech opertion using the expressions in tle 3. Without further priori knowledge of the rnge of intermedite vlues within the pipeline, this provides n pproch to determining the numer of integer its required throughout the dt pth, voiding ny overflow conditions. Underflow conditions re more difficult to optimize round. An underflow occurs when the frction precision of n output of n opertion is truncted from full precision such tht smll numer is represented s zero. Underflows my or my not cuse ny discrepncies in simultion, depending on where in the dt pth they occur. In the cse of Hodgkin nd Huxley style potssium chnnel with n 4 kinetics, given n ctivting gte with f frctionl its of precision, the output vi two cscding multiplictions will require 4f frctionl its when utilizing full precision. Relisticlly, fourth-order kinetics requires no dditionl precision over first-order expression, in which cse n underflow would e cceptle. An dder tree comining ll currents clculted per chnnel would lso e possile cndidte for llowing underflows s very smll currents re diminished y lrger trnsients or lekge pths. Other clcultions, such s scling y the time step when performing integrtion, must e free from underflows to mintin functionl ehviour. The remining clss of errors, quntiztion errors, is sustntilly more difficult to reson through intuitively nd requires more rigorous pproch. This cn e through error nlysis techniques such s propgting reltive nd solute error through the dt pth. The inherent feedck in the system vi the integrtion steps mkes this nlysis exceedingly difficult. Alterntively, simultions cn help isolte the differences etween floting- nd fixed-point representtions of the model. There is still considerle work to e done in investigting wys of nlyzing fixed-point neurl models to minimize quntiztion error. 32

Digital Signal Processing: A Hardware-Based Approach

Digital Signal Processing: A Hardware-Based Approach Digitl Signl Processing: A Hrdwre-Bsed Approch Roert Esposito Electricl nd Computer Engineering Temple University troduction Teching Digitl Signl Processing (DSP) hs included the utilition of simultion