ECE 4530 Codesign Challenge Fall 2007
Hardware/Software Codesign

The Codesign Challenge

Objectives

In the codesign challenge, your task is to accelerate a given software reference implementation as much as possible. You can use any of the previously discussed techniques to accelerate the implementation: use software optimization, build a coprocessor, optimize the hardware/software communication. The constraints of your implementation are

1. that it must be completed by 11/26/2007 at 5:00PM.
2. that it must run correctly on the Spartan 3E starter kit.
3. that it follows the given testing procedure to demonstrate the performance of your implementation.

The quality of your design will be evaluated using the following criteria:

1. the resulting clock cycle count of your implementation, with a clock cycle corresponding to one tick of an OPB Timer module clocked at 50 MHz.
2. the area of your design, expressed in slices of the Spartan3E FPGA.
3. the time when you turned in the solution (before the deadline, but earlier is better).

The clock cycle count is a first-order criterion, the area is a second-order criterion, and the design time is a third-order criterion. Faster (but correct) designs will always win. For clock cycle counts that lie within 1% of each other, area will be used as the distinguishing factor. For example, given four designs A, B, C, and D as shown below, the ranking would be as follows, from best to worst: D, B, C, A. In case the area as well as the cycle count are within 1% of each other, the time of posting the solution will be used to resolve the ranking of the two designs.

[Figure: designs A, B, C, and D plotted by area (slices) against cycle count; the cycle count axis marks a value n and a band "< 1% of n".]

Thus, all designs will be strictly ranked according to these criteria. It is in your interest to try and find the highest possible performance that can still be accommodated on a Spartan3 board, and to find that solution as quickly as possible.

P. Schaumont, Virginia Tech
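The three-level ranking rule above can be read as a pairwise comparison. The following is only a sketch of one possible reading, not the official grading procedure: the type `design_t`, the function names, and the choice to measure "within 1%" relative to the larger of the two values are all assumptions made for illustration.

```c
#include <math.h>

/* Hypothetical record for one submitted design (names are illustrative). */
typedef struct {
    double cycles;   /* measured clock cycle count */
    double slices;   /* area in slices */
    double turn_in;  /* turn-in time; a smaller value means an earlier post */
} design_t;

/* Assumption: "within 1%" is taken relative to the larger value. */
static int within_one_percent(double a, double b)
{
    double larger = (a > b) ? a : b;
    return fabs(a - b) <= 0.01 * larger;
}

/* Returns <0 if p ranks better than q, >0 if worse, 0 if tied. */
int design_compare(const design_t *p, const design_t *q)
{
    if (!within_one_percent(p->cycles, q->cycles))
        return (p->cycles < q->cycles) ? -1 : 1;   /* first order: speed */
    if (!within_one_percent(p->slices, q->slices))
        return (p->slices < q->slices) ? -1 : 1;   /* second order: area */
    if (p->turn_in != q->turn_in)
        return (p->turn_in < q->turn_in) ? -1 : 1; /* third order: time */
    return 0;
}
```

Because "within 1%" is not a transitive relation, a comparison like this is meant for pairs of designs and should not be passed to a general-purpose sort such as `qsort` without further thought.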
Assignment: Coordinate Rotation Digital Computer (CORDIC)

The task is to implement a CORDIC algorithm as efficiently as possible. CORDIC is often used in digital hardware to implement trigonometric functions. The CORDIC kernel implements a vector rotation operation. In a two-dimensional space, a vector rotation starts from a vector $(x, y)$ and rotates it over an angle $\phi$ as follows:

$$x' = x\cos\phi - y\sin\phi$$
$$y' = y\cos\phi + x\sin\phi$$

This can be rearranged to:

$$x' = \cos\phi\,[\,x - y\tan\phi\,]$$
$$y' = \cos\phi\,[\,y + x\tan\phi\,]$$

An efficient implementation of this formula is possible by restricting the rotation to angles for which $\tan\phi = \pm 2^{-i}$. Thus, we should ensure that the tangent of the angle is a power of two. Under that condition, the above rotation formulas require only shift operations to implement the multiplication with $\tan\phi$. We call the rotation over such an angle an elementary rotation. An arbitrary angle can now be approximated as a sequence of elementary rotations, much in the same way as the individual bits in a bitvector can express weights to approximate an integer number.

This idea is illustrated in the figure above. We need to implement a rotation over angle $\beta$. We start with an initial vector $v_0$ at (1,0). The first elementary rotation is over an angle $\tan^{-1}(0.5)$. This rotates $v_0$ counter-clockwise to $v_1$, using the rotation formulas given
above. The next elementary rotation would be over an angle $\tan^{-1}(0.25)$. Again, this would be a counter-clockwise rotation, such that we decrease the error between the desired rotation angle $\beta$ and the approximation in terms of elementary rotations. $v_1$ now moves to the position $v_2$. The next rotation, over $\tan^{-1}(0.125)$, would be clockwise, since $v_2$ has moved beyond the desired rotation $\beta$. By using increasingly smaller elementary rotations, we would obtain an increasingly better approximation.

Therefore, we can express the rotation formulas above using a set of difference equations.

$$x_{i+1} = K_i\,[\,x_i - y_i \cdot d_i \cdot 2^{-i}\,]$$
$$y_{i+1} = K_i\,[\,y_i + x_i \cdot d_i \cdot 2^{-i}\,]$$

with

$$K_i = \cos(\tan^{-1} 2^{-i}) = \frac{1}{\sqrt{1 + 2^{-2i}}}, \qquad d_i = \pm 1$$

At each iteration, a smaller rotation angle is selected, and a decision to rotate forward or backward is made ($d_i = \pm 1$) such that we obtain a better approximation of the actual rotation angle in terms of elementary rotations. Note that the constants in these formulas only depend on elementary rotations, and as such they can be evaluated upfront and stored as constants.

In CORDIC implementations, the $K_i$ factors are not applied at each rotation, but rather they are collected into a single scaling factor $A$. For a large number of (increasingly smaller) elementary rotations, $A$ converges to 1.647 and is given by

$$A = \lim_{n \to \infty} \prod_{i=0}^{n-1} \sqrt{1 + 2^{-2i}}$$

To find how well the target rotation angle is approximated by elementary rotations, we can also include an angle accumulator into the iterations, defined by

$$z_{i+1} = z_i - d_i \tan^{-1}(2^{-i})$$

This angle accumulator expresses the difference between the target angle and the series of elementary rotations.
CORDIC algorithms are used in two possible modes of operation.

In the rotation mode, we start with a desired rotation angle and rotate a given vector over that angle. At each iteration, the decision to rotate counter-clockwise or clockwise is made based on the sign of the angle accumulator. The objective is to drive the angle accumulator to zero. The result of the rotation mode is a given vector rotated over a given angle.

In the vector mode, we start with a given vector and rotate that vector until the vector is aligned with the X axis. At each iteration, the decision to rotate counter-clockwise or clockwise is made by the sign of the Y component of the vector. The objective is to drive the Y component to zero. The result of the vector mode is the angle of a given vector.

CORDIC implementation on Spartan 3E Starter Kit

The codesign challenge is described by the following initial architecture.

[Figure: initial architecture with a MicroBlaze, a DDR Controller, an OPB Timer, and a DDR RAM holding target_angle[65536], result_x[65536], and result_y[65536].]

In a DDR RAM, three 64 KWord arrays are stored. The objective is to rotate a unit vector (1,0) over all the angles expressed in target_angle[ ], and store the result of each rotation in result_x[ ] and result_y[ ]. The performance of your design is measured as the time it takes to complete this set of rotations (including reading from/writing to DDR). To accelerate the design, you can modify the hardware as needed (add coprocessors, develop efficient data transfer techniques, etc).
[Figure: test program flow. start; prepare_angle(); timer_on, reference_cordic() (which calls golden_cordic()), timer_off, giving the reference cycles; timer_on, your_cordic(), timer_off, giving the cordic cycles; check_result() (which calls golden_cordic()); print cycles; print errors. Speedup = reference cycles / cordic cycles.]

Your design will be tested using a test program (running on MicroBlaze) as described above. Initially, the MicroBlaze will generate 64K random target angles. Next, it will collect the execution timing for 64K rotations on two cordic functions. The first is a reference implementation in software (reference_cordic). The second is your implementation (your_cordic). The ratio of the two cycle counts determines the relative speedup obtained by your implementation. Note that this method of speedup measurement is relatively independent of the compiler optimization level, since the -O2 flag will benefit the reference implementation as well. Finally, your design results are verified against the golden reference. For a valid solution, zero errors are required (i.e. if your solution shows a single error, it is automatically moved to the lowest rank of all designs returned by the class).

The CORDIC reference algorithm is implemented using fixed-point arithmetic and is expressed using integers. A fixed-point data type <32,28> is used. In this data type, the value 1 is expressed as (1 << 28). The scaling factor allows expression of fractional values. For example, 0.75 is expressed as:

0.75 = 0.5 + 0.25 = (1 << 27) + (1 << 26) = 201,326,592 (all values in <32,28> format)

For the verification process described above to succeed, your accelerated CORDIC implementation must have the same bit-accuracy as the reference CORDIC implementation.
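The <32,28> format can be tried out on a host machine with a few helper functions. The following is a sketch under stated assumptions: the names `to_fix`, `from_fix`, `fix_step`, and `FIX_ONE` are invented here, the conversion truncates toward zero, and the right shift of a negative value is assumed to be an arithmetic shift (C leaves this implementation-defined). It is not `golden_cordic`; a bit-accurate design has to match whatever rounding the reference actually uses.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 28
#define FIX_ONE   ((int32_t)1 << FRAC_BITS)   /* the value 1 in <32,28> */

/* Convert between double and <32,28>; valid for values in [-8, 8). */
static inline int32_t to_fix(double v)    { return (int32_t)(v * FIX_ONE); }
static inline double  from_fix(int32_t f) { return (double)f / FIX_ONE; }

/* One elementary rotation by atan(2^-i) in direction d (+1 or -1).
   The multiplication by 2^-i becomes a right shift by i bits.
   The scaling factor is not applied here. */
void fix_step(int32_t *x, int32_t *y, int i, int d)
{
    assert(i >= 0 && i < 32 && (d == 1 || d == -1));
    int32_t xs = *x >> i;   /* x * 2^-i (arithmetic shift assumed) */
    int32_t ys = *y >> i;   /* y * 2^-i */
    int32_t xn = (d > 0) ? *x - ys : *x + ys;
    int32_t yn = (d > 0) ? *y + xs : *y - xs;
    *x = xn;
    *y = yn;
}
```

With these helpers, `to_fix(0.75)` evaluates to 201326592, the same integer as `(1 << 27) + (1 << 26)`.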
How to start

On Blackboard, download the baseline reference implementation. This design will run directly on your Spartan kit. Start by studying the reference implementation software. This reference implementation uses calls to golden_cordic in order to implement the your_cordic function. Eventually, you need to make your_cordic as fast as possible.

It is highly recommended to construct a cosimulation model of your design using GEZEL. While you can develop coprocessor hardware directly in VHDL, it will require you to take care of many details at once. Going through cosimulation first enables you to test your idea before taking it to the board.

Also, when developing hardware, initially test your ideas on small designs, such as 100 rotations (rather than 64K). When the low-level components work fine, next verify how well the design scales up to 64K rotations.

Also, carefully consider tradeoffs. You can move part of the golden_cordic function to hardware, or move the complete golden_cordic to hardware. You can use a memory-mapped interface, or use an FSL interface. You can write VHDL or GEZEL code (if HDLs are unfamiliar to you, please stick to GEZEL). You can implement the golden_cordic in hardware as a completely unrolled function, or design it in hardware as an FSMD, using multiple control steps. You can send arguments serially or in parallel. You can provide arguments with a processor (MicroBlaze) or through DMA.

There are obviously more implementation alternatives than the allocated design time allows you to explore. Thus, you will have to think before you implement, and experiment to find the largest acceleration as quickly as possible. Always focus on the bottleneck in the overall system. Remember the earlier examples we discussed. Hardware parallelism is useless unless the data pipes into that hardware have sufficient bandwidth. Also, make use of your homework assignments/solutions to see examples of how a memory-mapped interface or an FSL interface can be created.
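As one illustration of the memory-mapped alternative, the software side of a coprocessor could look like the sketch below. Everything about the register map is hypothetical: the register order, the rule that writing the angle starts a rotation, and the done flag in bit 0 of a status register are assumptions made for this example and do not describe any existing core.

```c
#include <stdint.h>

/* Hypothetical register map, as word offsets from the coprocessor base. */
enum { REG_ANGLE = 0, REG_STATUS = 1, REG_X = 2, REG_Y = 3 };
#define STATUS_DONE 0x1u   /* assumed: bit 0 is set when the result is ready */

/* Send one angle to the coprocessor and read back the rotated vector. */
void coproc_rotate(volatile uint32_t *regs, int32_t angle,
                   int32_t *x, int32_t *y)
{
    regs[REG_ANGLE] = (uint32_t)angle;        /* assumed to start the rotation */
    while ((regs[REG_STATUS] & STATUS_DONE) == 0) {
        /* busy-wait until the hardware reports completion */
    }
    *x = (int32_t)regs[REG_X];
    *y = (int32_t)regs[REG_Y];
}
```

On the board, `regs` would point at the base address assigned to the coprocessor in the system address map. Note that this style spends several bus transactions per rotation (one write, at least one status read, two result reads), so the interface rather than the CORDIC datapath may well become the bottleneck.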
What to turn in

By the deadline, post the following information on Blackboard.

1. A short report (no more than 4 pages) that summarizes the main characteristics of your design. Your report must at least contain the following table.

   Area of the baseline design (slices)
   Performance of the baseline design (cycles)
   Area of the optimized design (slices)
   Performance of the optimized design (cycles)

   In addition, you are encouraged to discuss trade-offs you made, to provide a block diagram of the resulting system, to describe the architectural features of the hardware coprocessor you made, and so on. Also include a screenshot of the design as it executes, such as shown below.

2. If you developed a cosimulation model in GEZEL, also provide the cosimulation model (C driver and FDL file).

3. The optimized implementation in XPS. Before posting the design on Blackboard, make sure you run Project->Clean All Generated Files. Then, zip the project directory and post it on Blackboard.
Grading

Your design will be graded based on the numbers you report, in combination with the cosimulation model and the XPS project you will turn in. The cosimulation model and the XPS project may be run to verify the correctness of the statements you make in the report.

The ranking criteria described above will be used. Having a working solution is not sufficient to obtain a full grade. Having a speed improvement of, for example, 3 times, is not sufficient to obtain a full grade. The full grade will go to the design with the highest performance. All other designs will be strictly ranked in relation to the best one.

This strict ranking rule is introduced based on the observation that, under free market conditions, better designs have a better chance to make it into a product. However, don't let this rule spoil the fun. This is your chance to explore new ideas and to try out what you have learned in this class!

We will discuss the design in detail in the class of November 12, and partly in the class of November 14.