The Codesign Challenge

ECE 4530 Codesign Challenge, Fall 2007, Hardware/Software Codesign
P. Schaumont, Virginia Tech

Objectives

In the codesign challenge, your task is to make a given software reference implementation run as fast as possible. You can use any of the previously discussed techniques to accelerate the implementation: software optimization, a coprocessor, or optimized hardware/software communication.

The constraints on your implementation are:

1. It must be completed by 11/26/2007 at 5:00 PM.
2. It must run correctly on the Spartan 3E starter kit.
3. It must follow the given testing procedure to demonstrate the performance of your implementation.

The quality of your design will be evaluated using the following criteria:

1. The resulting clock cycle count of your implementation, with a clock cycle corresponding to one tick of an OPB Timer module clocked at 50 MHz.
2. The area of your design, expressed in slices of the Spartan 3E FPGA.
3. The time when you turned in the solution (before the deadline, but earlier is better).

The clock cycle count is a first-order criterion, the area is a second-order criterion, and the design time is a third-order criterion. Faster (but correct) designs will always win. For clock cycle counts that lie within 1% of each other, area will be used as the distinguishing factor. For example, given four designs A, B, C, and D as shown below, the ranking would be as follows, from best to worst: D, B, C, A. In case the area as well as the cycle count are within 1% of each other, the time of posting the solution will be used to resolve the ranking of the two designs.

[Figure: designs A, B, C, and D plotted by area (slices) against cycle count, with a marked cycle-count interval of less than 1% of n.]

Thus, all designs will be strictly ranked according to these criteria. It is in your interest to try to find the highest possible performance that can still be accommodated on a Spartan3 board, and to find that solution as quickly as possible.
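The pairwise ranking rule above can be sketched in Python as follows. This is only an illustration: the dictionary keys `cycles`, `slices`, and `submitted`, the function names, and the reading of "within 1%" as a gap of at most 1% of the larger value are assumptions, not part of the assignment.

```python
def within_one_percent(a, b):
    """Assumed reading: the gap is at most 1% of the larger value."""
    return abs(a - b) * 100 <= max(a, b)


def better_design(p, q):
    """Return the better of two correct designs under the three ranked criteria."""
    # First-order criterion: clock cycle count.
    if not within_one_percent(p["cycles"], q["cycles"]):
        return p if p["cycles"] < q["cycles"] else q
    # Second-order criterion: area in slices.
    if not within_one_percent(p["slices"], q["slices"]):
        return p if p["slices"] < q["slices"] else q
    # Third-order criterion: the earlier submission wins
    # ("submitted" is a hypothetical timestamp field).
    return p if p["submitted"] <= q["submitted"] else q
```

For instance, a design with 1,005,000 cycles and 2,000 slices would beat one with 1,000,000 cycles and 3,000 slices under this reading, because the cycle counts are within 1% of each other and the area then decides.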

Assignment: Coordinate Rotation Digital Computer (CORDIC)

The task is to implement a CORDIC algorithm as efficiently as possible. CORDIC is often used in digital hardware to implement trigonometric functions. The CORDIC kernel implements a vector rotation operation. In a two-dimensional space, a vector rotation starts from a vector (x, y) and rotates it over an angle φ as follows:

$$x' = x\cos\phi - y\sin\phi$$
$$y' = y\cos\phi + x\sin\phi$$

This can be rearranged to:

$$x' = \cos\phi\,[\,x - y\tan\phi\,]$$
$$y' = \cos\phi\,[\,y + x\tan\phi\,]$$

An efficient implementation of this formula is possible by restricting the rotation to angles for which $\tan\phi = \pm 2^{-i}$. Thus, we should ensure that the tangent of the angle is a power of two. Under that condition, the above rotation formulas require only shift operations to implement the multiplication with $\tan\phi$. We call the rotation over such an angle an elementary rotation. An arbitrary angle can now be approximated as a sequence of elementary rotations, in much the same way as the individual bits in a bitvector carry weights that together approximate a number.

To illustrate, suppose we need to implement a rotation over an angle β. We start with an initial vector v0 at (1,0). The first elementary rotation is over an angle $\tan^{-1}(0.5)$. This rotates v0 counter-clockwise to v1, using the rotation formulas given above. The next elementary rotation would be over an angle $\tan^{-1}(0.25)$. Again, this would be a counter-clockwise rotation, such that we decrease the error between the desired rotation angle β and its approximation in terms of elementary rotations. v1 now moves to the position v2. The next rotation, over $\tan^{-1}(0.125)$, would be clockwise, since v2 has moved beyond the desired rotation β. By using increasingly smaller elementary rotations, we obtain an increasingly better approximation.

Therefore, we can express the rotation formulas above using a set of difference equations:

$$x_{i+1} = K_i\,[\,x_i - y_i \cdot d_i \cdot 2^{-i}\,]$$
$$y_{i+1} = K_i\,[\,y_i + x_i \cdot d_i \cdot 2^{-i}\,]$$

with

$$K_i = \cos(\tan^{-1} 2^{-i}) = \frac{1}{\sqrt{1 + 2^{-2i}}}, \qquad d_i = \pm 1$$

At each iteration, a smaller rotation angle is selected, and a decision to rotate forward or backward is made ($d_i = \pm 1$) such that we obtain a better approximation of the actual rotation angle in terms of elementary rotations. Note that the constants in these formulas only depend on the elementary rotations, and as such they can be evaluated upfront and stored as constants. In CORDIC implementations, the $K_i$ factors are not applied at each rotation, but rather they are collected into a single scaling factor A. For a large number of (increasingly smaller) elementary rotations, A converges to 1.647 and is given by

$$A = \lim_{n \to \infty} \prod_{i=0}^{n} \sqrt{1 + 2^{-2i}}$$

To find how well the target rotation angle is approximated by elementary rotations, we can also include an angle accumulator into the iterations, defined by

$$z_{i+1} = z_i - d_i \tan^{-1}(2^{-i})$$

This angle accumulator expresses the difference between the target angle and the series of elementary rotations.
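The difference equations, the scaling factor A, and the angle accumulator can be tried out with a small floating-point model. The sketch below is an illustration only, not the course's reference code: the function names, the iteration count of 32, and the choice to start the iterations at i = 0 (which is what makes A approach 1.647) are assumptions. It skips the per-step factors and divides by the collected gain once at the end.

```python
import math


def cordic_gain(n):
    """Collected scaling factor: product of sqrt(1 + 2^(-2i)) for i = 0..n-1."""
    gain = 1.0
    for i in range(n):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return gain


def cordic_rotate(beta, n=32):
    """Rotate the unit vector (1, 0) over beta radians using n elementary rotations.

    Returns (x, y, z), where z is the residual angle left in the accumulator.
    Converges for |beta| below roughly 1.74 rad, the sum of all elementary angles.
    """
    x, y, z = 1.0, 0.0, beta
    for i in range(n):
        step = 2.0 ** -i
        # Rotate counter-clockwise while the remaining angle is non-negative.
        d = 1.0 if z >= 0 else -1.0
        x, y = x - d * y * step, y + d * x * step
        z -= d * math.atan(step)
    # Apply the collected scaling once instead of K_i at every step.
    gain = cordic_gain(n)
    return x / gain, y / gain, z
```

Calling `cordic_rotate(0.6)` should return values close to `(math.cos(0.6), math.sin(0.6))` with a residual angle near zero, and `cordic_gain(32)` should be close to 1.647.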

CORDIC algorithms are used in two possible modes of operation.

In the rotation mode, we start with a desired rotation angle and rotate a given vector over that angle. At each iteration, the decision to rotate counter-clockwise or clockwise is made based on the sign of the angle accumulator. The objective is to drive the angle accumulator to zero. The result of the rotation mode is a given vector rotated over a given angle.

In the vector mode, we start with a given vector and rotate that vector until it is aligned with the X axis. At each iteration, the decision to rotate counter-clockwise or clockwise is made based on the sign of the Y component of the vector. The objective is to drive the Y component to zero. The result of the vector mode is the angle of a given vector.

CORDIC implementation on Spartan 3E Starter Kit

The codesign challenge is described by the following initial architecture.

[Figure: initial architecture. A MicroBlaze processor with an OPB Timer, and a DDR controller connected to a DDR RAM that holds target_angle[65536], result_x[65536], and result_y[65536].]

Three 64 KWord arrays are stored in a DDR RAM. The objective is to rotate a unit vector (1,0) over all the angles expressed in target_angle[ ], and to store the result of each rotation in result_x[ ] and result_y[ ]. The performance of your design is measured as the time it takes to complete this set of rotations (including reading from and writing to DDR). To accelerate the design, you can modify the hardware as needed (add coprocessors, develop efficient data transfer techniques, etc.).
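The challenge itself only needs the rotation mode, but the vector mode described earlier in this section is easy to model for comparison. The following floating-point sketch is an illustration under assumptions: the function name, the iteration count, and the iteration start at i = 0 are not taken from the assignment, and the input is assumed to have a positive X component so that the elementary angles can reach it.

```python
import math


def cordic_vector(x, y, n=32):
    """Vector mode: rotate (x, y) onto the X axis and return (magnitude, angle).

    Assumes x > 0. The angle is accumulated in z, starting from zero.
    """
    z = 0.0
    gain = 1.0
    for i in range(n):
        step = 2.0 ** -i
        # Rotate clockwise (d = -1) while y is non-negative, otherwise counter-clockwise.
        d = -1.0 if y >= 0 else 1.0
        x, y = x - d * y * step, y + d * x * step
        z -= d * math.atan(step)
        gain *= math.sqrt(1.0 + step * step)
    # After the iterations y is close to zero and x holds the scaled magnitude.
    return x / gain, z
```

For the vector (3, 4), this should return a magnitude close to 5 and an angle close to `math.atan2(4, 3)`.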

[Figure: test program flow. After start, prepare_angle() runs. The timer is switched on, reference_cordic() runs (calling golden_cordic()), and the timer is switched off, giving the reference cycles. The timer is switched on again, your_cordic() runs, and the timer is switched off, giving the cordic cycles. Then check_result() runs (calling golden_cordic()), and the program prints the cycles and the errors. Speedup = reference cycles / cordic cycles.]

Your design will be tested using a test program (running on the MicroBlaze) as described above. Initially, the MicroBlaze will generate 64K random target angles. Next, it will collect the execution timing for 64K rotations on two CORDIC functions. The first is a reference implementation in software (reference_cordic). The second is your accelerated implementation (your_cordic). The ratio of the two cycle counts determines the relative speedup obtained by your implementation. Note that this method of speedup measurement is relatively independent of the compiler optimization level, since the -O2 flag will benefit the reference implementation as well. Finally, your design results are verified against the golden reference. For a valid solution, zero errors are required (i.e. if your solution shows a single error, it is automatically moved to the lowest rank of all designs returned by the class).

The CORDIC reference algorithm is implemented using fixed-point arithmetic and is expressed using integers. A fixed-point data type <32,28> is used. In this data type, the value 1 is expressed as (1 << 28). The scaling factor allows expression of fractional values. For example, 0.75 is expressed as:

0.75 = 0.5 + 0.25, which in <32,28> is (1 << 27) + (1 << 26) = 201,326,592

For the verification process described above to succeed, your accelerated CORDIC implementation must have the same bit accuracy as the reference CORDIC implementation.
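The <32,28> encoding and a shift-and-add rotation can be modelled with plain Python integers. This is a minimal sketch under assumptions, and it is not expected to be bit-identical to golden_cordic: the helper names, the iteration count of 28, the rounding of the arctangent table, and the trick of starting from a vector pre-scaled by 1/A are all choices made here for illustration.

```python
import math

FRAC_BITS = 28            # <32,28>: 28 fractional bits
ONE = 1 << FRAC_BITS      # fixed-point encoding of the value 1
ITERATIONS = 28           # assumed: one elementary rotation per fractional bit


def to_fixed(value):
    """Encode a real value in <32,28> (rounded to the nearest step)."""
    return int(round(value * ONE))


def to_float(raw):
    """Decode a <32,28> integer back to a float."""
    return raw / ONE


# Elementary angles atan(2^-i), stored upfront as fixed-point constants.
ATAN_TABLE = [to_fixed(math.atan(2.0 ** -i)) for i in range(ITERATIONS)]

# Collected scaling factor A for the chosen number of iterations.
GAIN = 1.0
for _i in range(ITERATIONS):
    GAIN *= math.sqrt(1.0 + 4.0 ** -_i)

# Starting from (1/A, 0) removes the need for a final division.
X_START = to_fixed(1.0 / GAIN)


def cordic_rotate_fixed(angle_raw):
    """Rotate (1, 0) over a <32,28> angle using only shifts and additions."""
    x, y, z = X_START, 0, angle_raw
    for i in range(ITERATIONS):
        # Python's >> on negative integers is an arithmetic (flooring) shift.
        if z >= 0:
            x, y, z = x - (y >> i), y + (x >> i), z - ATAN_TABLE[i]
        else:
            x, y, z = x + (y >> i), y - (x >> i), z + ATAN_TABLE[i]
    return x, y
```

As a usage example, `to_fixed(0.75)` gives 201326592, and `cordic_rotate_fixed(to_fixed(0.5))` should decode to values close to the cosine and sine of 0.5 rad.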

How to start

On Blackboard, download the baseline reference implementation. This design will run directly on your Spartan kit. Start by studying the reference implementation software. This reference implementation uses calls to golden_cordic in order to implement the your_cordic function. Eventually, you need to make your_cordic as fast as possible.

It is highly recommended to construct a cosimulation model of your design using GEZEL. While you can develop coprocessor hardware directly in VHDL, it will require you to take care of many details at once. Going through cosimulation first enables you to test your idea before taking it to the board. Also, when developing hardware, initially test your ideas on small designs, such as 100 rotations (rather than 64K). When the low-level components work fine, verify how well the design scales up to 64K rotations.

Also, carefully consider tradeoffs. You can move part of the golden_cordic function to hardware, or move the complete golden_cordic to hardware. You can use a memory-mapped interface, or use an FSL interface. You can write VHDL or GEZEL code (if HDLs are unfamiliar to you, please stick to GEZEL). You can implement golden_cordic in hardware as a completely unrolled function, or design it in hardware as an FSMD, using multiple control steps. You can send arguments serially or in parallel. You can provide arguments with a processor (MicroBlaze) or through DMA.

There are obviously more implementation alternatives than you can explore in the allocated design time. Thus, you will have to think before you implement, and experiment to find the largest acceleration as quickly as possible. Always focus on the bottleneck in the overall system. Remember the earlier examples we discussed. Hardware parallelism is useless unless the data pipes into that hardware have sufficient bandwidth. Also, make use of your homework assignments and solutions to see examples of how a memory-mapped interface or an FSL interface can be created.
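The advice to test on a small run first can be mirrored in a software model before going to the board. The sketch below is hypothetical: `make_angles` and `count_errors` are names invented here, the angle range of plus or minus 1.0 in <32,28> is an assumption (the assignment does not state the range of the random angles), and `golden` and `candidate` stand for any two functions that map an angle to a result pair.

```python
import random


def make_angles(count, seed=0, frac_bits=28):
    """Generate a reproducible list of random fixed-point test angles."""
    rng = random.Random(seed)
    limit = 1 << frac_bits  # assumed range: -1.0 .. +1.0 in <32,28>
    return [rng.randrange(-limit, limit + 1) for _ in range(count)]


def count_errors(golden, candidate, angles):
    """Count the angles for which the candidate is not bit-identical to the golden model."""
    return sum(1 for angle in angles if golden(angle) != candidate(angle))
```

A run such as `count_errors(golden, candidate, make_angles(100))` must return 0 before it is worth scaling up to 64K rotations, since a single mismatch invalidates the solution.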

What to turn in

By the deadline, post the following information on Blackboard.

1. A short report (no more than 4 pages) that summarizes the main characteristics of your design. Your report must at least contain the following table:

   Area of the baseline design (slices)
   Performance of the baseline design (cycles)
   Area of the optimized design (slices)
   Performance of the optimized design (cycles)

   In addition, you are encouraged to discuss the trade-offs you made, to provide a block diagram of the resulting system, to describe the architectural features of the hardware coprocessor you made, and so on. Also include a screenshot of the design as it executes.

   [Figure: example screenshot of the design as it executes.]

2. If you developed a cosimulation model in GEZEL, also provide the cosimulation model (C driver and FDL file).

3. The optimized implementation in XPS. Before posting the design on Blackboard, make sure you run Project->Clean All Generated Files. Then, zip the project directory and post it on Blackboard.

Grading

Your design will be graded based on the numbers you report, in combination with the cosimulation model and the XPS project you turn in. The cosimulation model and the XPS project may be run to verify the correctness of the statements you make in the report. The ranking criteria described above will be used.

Having a working solution is not sufficient to obtain a full grade. Having a speed improvement of, for example, 3 times is not sufficient to obtain a full grade. The full grade will go to the design with the highest performance. All other designs will be ranked strictly in relation to the best one. This strict ranking rule is introduced based on the observation that, under free market conditions, better designs have a better chance to make it into a product.

However, don't let this rule spoil the fun. This is your chance to explore new ideas and to try out what you have learned in this class!

We will discuss the design in detail in the class of November 12, and partly in the class of November 14.