
FPGA ACCELERATION OF THE LINPACK BENCHMARK USING HANDEL-C AND THE CELOXICA FLOATING POINT LIBRARY

Kieron Turkington, Konstantinos Masselos, George A. Constantinides
Department of Electrical and Electronic Engineering, Imperial College London, UK
{kjt01, k.masselos,

Philip Leong
Department of Computing, Imperial College London, UK
philip.leong@imperial.ac.uk

ABSTRACT

In this paper the latest FPGAs are compared to a high end microprocessor with respect to sustained performance for a popular floating point CPU performance benchmark, namely LINPACK 1000. The FPGA hardware has been developed in Handel-C and uses the Celoxica Floating Point Library functions to perform both double and single precision arithmetic. A set of translation and optimization steps has been applied to transform a sequential C description of the LINPACK benchmark, based on a monolithic memory model, into a parallel Handel-C description that utilizes the plurality of memory resources available on a realistic reconfigurable computing platform. These transformations allow high levels of parallelism to be exploited within a limited memory bandwidth. The experimental results show that the latest generation of FPGAs, programmed using Handel-C, can achieve a sustained double precision floating point performance up to 6.8 times greater than that of the microprocessor while operating at a clock frequency that is 30 times lower.

1. INTRODUCTION

The dynamic range and accuracy demanded by many engineering and scientific applications have led to the widespread use of floating point arithmetic on conventional computers [1]. In recent years the continued growth in FPGA densities has produced devices capable of implementing ever increasing numbers of both single and double precision floating point cores. As a result, significant research effort has recently been expended on investigating the floating point potential of FPGAs [2, 3, 4, 5, 6].
Significantly, the work presented in [1] and [7] shows that FPGAs can compete with commodity microprocessors in terms of both peak and sustained floating point performance. While microprocessors and FPGAs offer comparable performance, a large gap remains between the design flows for the two platforms. FPGA programming is still performed to a large extent using conventional register-transfer level (RTL) design flows, while software for microprocessors is written in behavioral languages. The authors of [1] predict that FPGAs could have an order of magnitude advantage over microprocessors for floating point performance within ten years. However, for FPGAs to be considered in the future as a competitive platform for scientific computing, efficient, software-like design flows must be developed. These design flows must allow the full processing power of the FPGAs to be harnessed in a design time comparable to that for a microprocessor. A number of commercial products are already available that allow FPGAs to be programmed at higher levels of abstraction [8, 9]. While these tools can bring FPGA programming closer to software development, they can often reduce the ultimate performance of the final design compared to that of hand coded VHDL or Verilog. Little work has been done to investigate the current floating point potential of FPGAs when higher level design techniques and tools are used. In this paper the floating point performance of the latest FPGAs (Stratix II and Virtex 4), programmed at a level above RTL (using Handel-C), is compared to that of a commodity microprocessor (a 3GHz Intel Pentium 4). The performance of both platforms is measured for the LINPACK 1000 benchmark [10], a popular floating point benchmark widely used to test the performance of microprocessors. The currently released Celoxica floating point library (PDK v4.0 [11]) allows only single precision algorithms to be written in Handel-C.
However, we also present results from a soon-to-be-released new library that allows both single and double precision floating point cores to be instantiated. The single precision performance of this new library is compared to that of the current library, and the performance of both libraries is compared to that of the microprocessor. The rest of the paper is organized as follows. Section 2 gives a brief description of the LINPACK 1000 benchmark. Section 3 describes the general platform this work targets and the partitioning of the benchmark. Section 4 describes how the critical sections of the LINPACK benchmark are implemented on an FPGA and presents a four stage process for converting from C to

Handel-C. In Section 5 the results for this implementation are presented and Section 6 contains the conclusions drawn from this work.

2. LINPACK 1000

The LINPACK family of benchmarks solves a system of linear equations using LU decomposition and is typical of many matrix-based scientific computations. This work focuses on the LINPACK 1000 benchmark, which is most commonly used to evaluate general purpose microprocessors. LINPACK 1000 solves a random linear system of order 1000, measures the average floating point performance achieved and reports the residual error produced. The pseudo-code for the operation of the LINPACK 1000 benchmark is given in Fig. 1. It first generates a random 1000x1000 element matrix, A, and 1000 element vector, b. The elements in A and b are all floating point numbers. The dgefa subroutine simplifies the linear system by performing LU decomposition (by Gaussian elimination with partial pivoting) on A. A brief description of this method can be found in [12]. The dgesl subroutine then solves the simplified LU version of the original system to produce the vector x. The benchmark measures the time taken to complete the dgefa and dgesl subroutines and calculates the average FLOPS achieved.

begin
  -- generate random linear system Ax = b --
  matgen(&A[ ][ ], &b[ ]);
  -- start timer --
  t1 = second();
  -- apply LU decomposition --
  dgefa(&A[ ][ ], &ipvt[ ]);
  -- solve simplified system --
  x[ ] = dgesl(&A[ ][ ], &b[ ], ipvt[ ]);
  -- store time taken --
  t1 = second() - t1;
  -- calc average FLOPS --
  ops = 2/3*(1000)^3 + (1000)^2;
  flops = ops / t1;
  -- regenerate original A[ ] & b[ ] --
  matgen(&A[ ][ ], &b[ ]);
  -- calculate b = Ax - b --
  b[ ] = -b[ ];
  dmxpy(A[ ][ ], &b[ ], x[ ], 1000);
  -- find residual and normalise --
  resid = max(b[ ]);
  residn = norm(resid);
  print(t1, flops, resid, residn);
end;

Fig. 1. Pseudo code for LINPACK 1000
The remainder of the algorithm serves to estimate the normalized residual error in the result vector x. Both single and double precision versions of the benchmark were profiled for an Intel Pentium 4 processor (3GHz, 1Mbyte L2 cache) running Windows XP, using GCC (running through Cygwin). The entire single precision benchmark takes 1.03 seconds to complete, with an average floating point performance of 647 MFLOPS across the dgefa and dgesl subroutines. The double precision benchmark takes 1.61 seconds, achieving 411 MFLOPS. 90% of the execution time for LINPACK 1000 is spent in the dgefa subroutine. dgefa in turn calls a smaller function, named daxpy, repeatedly, with the result that the latter alone accounts for 88% of the total execution time for the benchmark.

3. TARGET PLATFORM

Fig. 2. Target platform: a host microprocessor system linked over a PCI bus to a coprocessor comprising an FPGA, a PCI controller and SRAM banks

This work targets the general platform shown in Fig. 2. It consists of a host microprocessor system linked to an FPGA based coprocessor via a PCI interface. The coprocessor features a high density FPGA, a PCI interface/controller and on-board memory resources. The memory on the coprocessor is assumed to be SRAM, with up to 6 banks providing a maximum of 16Mbytes. The data width of each bank is assumed to be 32 bits. Although this does not match any existing development boards, the features specified are realistic and similar to those found on a number of boards, including the Celoxica RC300 [13] and RC2000 [14]. When using a target platform with both microprocessor and FPGA resources it is typical to partition the target algorithm between the two. Frequently reused code with the potential for parallelism (usually loops) is targeted to the FPGA while the rest of the code is implemented on the microprocessor. Due to their lower clock speeds, FPGAs only provide acceleration when their resources are used to implement sections of the algorithm in parallel.
Therefore only the most reused sections of code should be targeted to the FPGA, to maximize parallelism. However, communication between the host and coprocessor should be minimized so

that the PCI bus does not become a bottleneck in the system. This leads to a tradeoff between maximizing the potential for parallelism and minimizing communication. The benchmark was partitioned so that the entire dgefa (LU decomposition) subroutine is implemented on the FPGA. This limits the communication to two block transfers, one at the start of dgefa and the other at the end. The pseudo code for dgefa is given in Fig. 3. dgefa contains a nested loop with three levels (the third is in the daxpy function) and so provides reasonable opportunities for parallelism. Since it accounts for 90% of the execution time of the benchmark it offers the potential for good acceleration of the complete algorithm.

4. OPTIMIZATION OF THE DGEFA ROUTINE

This work uses an ANSI C version of the LINPACK benchmark as its starting point. The code that has been partitioned to the FPGA (the dgefa subroutine) must be converted into Handel-C and optimized to maximize parallelism. Before any optimizations are considered, some simple translation steps are performed to convert the ANSI C syntax into valid Handel-C. These include the removal of side effects and the replacement of floating point arithmetic with Handel-C floating point macro procedures. On top of this, pointer-based array access is converted into direct array addressing. This simplifies memory optimization in later stages. To complete the translation to Handel-C, word lengths are assigned to all integer variables in the code. While this final step is normally a complex optimization, it is included as a translation step as Handel-C code will often not compile without it. The result of these translation steps is valid Handel-C code that can be compiled to hardware.
However, the hardware generated will be very inefficient and so the following four optimization stages are applied to exploit the parallelism and data reuse available in the algorithm.

4.1 Data Reuse Exploitation

The dgefa subroutine accesses the 1000x1000 element A matrix for almost every operation. Due to its large size, A must be stored in the off-chip SRAM on the coprocessor. There is limited bandwidth for accessing the off-chip SRAM (up to 6 ports), which could limit the parallelism that can be exploited. However, it may be possible to reduce accesses to A by buffering and reusing data on the FPGA. This may allow more parallelism in the final system. The pseudo code for the dgefa subroutine is shown in Fig. 3.

dgefa(*A[ ][ ], *ipvt[ ])
begin
  for (k = 0 : 998)  -- loop 1 --
    -- find the pivot index of column k --
    piv = idamax(A[ ][k], k) + k;
    ipvt[k] = piv;
    if (A[piv][k] != 0)
      -- swap matrix elements if k != piv --
      swap(&A[piv][k], &A[k][k]);
      t = -1/(A[k][k]);
      -- scale column k by t --
      dscal(&A[ ][k], t, k);
      for (j = (k+1) : 999)  -- loop 2 --
        -- swap matrix elements if k != piv --
        swap(&A[piv][j], &A[k][j]);
        t = A[piv][j];
        -- A[ ][j] = t*A[ ][k] + A[ ][j] --
        daxpy(&A[ ][j], A[ ][k], t, k);
      end for;  -- end loop 2 --
    end if;
  end for;  -- end loop 1 --
  ipvt[999] = 999;
end;

Fig. 3. Pseudo code for dgefa subroutine

The daxpy function requires data from two matrix columns as its inputs, denoted as A[ ][j] and A[ ][k] (where k and j are the column numbers). For a given iteration of loop 1 the same A[ ][k] column is used in each iteration of loop 2, meaning that, if the correct A[ ][k] data is buffered on the FPGA at the start of loop 1, it can be reused for every iteration of loop 2. More importantly, the A[ ][k] data used in iteration k+1 of loop 1 is calculated as a result column in iteration k. If this column of data is stored on the FPGA as it is generated then only the A[ ][j] data must be read from the external memory, instead of both the A[ ][k] and A[ ][j] data.
This effectively halves the number of reads from the external memory and halves the bandwidth required. The A[ ][k] column for the very first iteration of loop 1 must still be read from the external memory, but this now represents a small one-off initialization cost. This data reuse can be implemented using two new on chip buffers, each capable of storing one column of matrix data. One buffer, called k_col, stores the A[ ][k] data for the current iteration of loop 1. The second buffer, called k_next, stores the A[ ][k] data needed for the next iteration as it is generated.

4.2 Loop Pipelining

Loop pipelining is a well known optimization that allows parallelism to be exploited in loops without increasing the arithmetic logic resources required (though there may be a control overhead). The three functions called in the dgefa subroutine, idamax, dscal and daxpy, all contain a further loop level. They each also contain calls to pipelined floating point macros from the Celoxica floating point library. There are no loop carried dependencies on

any of these inner loops so the iterations of each loop can be pipelined into the stages of the floating point units. The outer loops of the dgefa subroutine could have been pipelined instead, using the methods presented in [15] for example, but the dependencies at these loop levels make them less favorable. Since all of the floating point cores are internally pipelined, pipelining the inner loops does not increase the resource usage. However, it does place extra strain on the external memory. The pipelined daxpy function must complete one read and one write to the external SRAM for every iteration of its inner loop. A new inner loop iteration can be initiated every cycle provided these memory operations can complete in one cycle. Each bank of the external SRAM has only a single port, so the A matrix must be partitioned across multiple banks in such a way that the same bank is never required to perform a read and a write on the same cycle. This can be achieved by placing the odd and even rows of the matrix in different SRAM banks. As the pipeline works down each column it reads from alternating odd and even rows and writes to alternating odd and even rows. The memory controller is scheduled so that when an odd row is read an even row is written (and vice versa). There is sufficient slack in the schedule to allow a write to be delayed by a cycle when necessary to avoid conflict.

Processor 0: idamax[0] dscal[0] daxpy[1] daxpy[2] daxpy[3] daxpy[4] daxpy[5] daxpy[6]
Processor 1:                   idamax[1] dscal[1] daxpy[2] daxpy[3] daxpy[4] daxpy[5]
Processor 2:                                      idamax[2] dscal[2] daxpy[3] daxpy[4]

Fig. 4. Schedule for 3 parallel processing units

4.3 Outer Loop Parallelization

Simply pipelining the inner loops does not yield sufficient parallelism to outperform the Pentium 4, so the outer loops of the dgefa subroutine must somehow be executed in parallel.
However, the required memory bandwidth must not scale with the parallelism achieved, as there are (at most) six ports to the external SRAM, two of which are already used on almost every cycle. Fortunately, there is a loop carried dependency in loop 1 (Fig. 3) such that all of the input data used by iteration n+1 is generated in iteration n. This means that, if multiple loop 1 iterations are run in parallel with their start times offset correctly, then data can effectively flow from the output of one iteration to the input of the next without going through the external memory. This allows arbitrary outer loop parallelism with only two ports to the external memory. Fig. 4 shows how the first three iterations of loop 1 can be scheduled across three parallel loop processing pipelines. The number of pipelines can be varied using a #define in the Handel-C source code. In Fig. 4 the notation idamax[k] represents the execution of the idamax function on column k of the matrix. The same notation is used for the dscal and daxpy functions. All of the rest of the code in the dgefa subroutine has been moved into one of these three functions to simplify the system. idamax, dscal and daxpy all contain loops iterating over the same bounds so they take the same number of clock cycles to run (+/- 2%). This allows instances of the three functions to be scheduled as if they were individual instructions. idamax[k] operates on the data generated in daxpy[k]. As the output data from daxpy[k] is generated it is sent to idamax[k] through global variables, allowing the two functions to run in parallel. daxpy[k] runs in parallel with the execution of daxpy[k+1] for the previous outer loop iteration, but uses the data generated by daxpy[k].

Fig. 5. Structure of dgefa coprocessor
Hence, each pipeline must include a buffer (capable of storing a column of matrix data) to store the data generated by the previous pipeline in the chain until it is ready to be used by the operations in the daxpy function. As can be seen in Fig. 4, idamax and dscal are never required by more than one pipeline at once. This will allow them to be shared in the following optimization stage. This means that the k_next buffer (introduced in Section 4.1) can also be shared, since only idamax and dscal access it. Each pipeline must have its own k_col buffer however, since

each pipeline uses a different A[ ][k] column as its input. Only the last pipeline in the chain must write its A[ ][j] output data to the external memory. However, the A[ ][k] data stored in the k_col buffer in each pipeline must be written to the external memory at some point, since these columns form part of the final result matrix. Fortunately the final pipeline in the chain has sufficient slots to output all the necessary data, so only a single write port to the external memory is required. Since there are now multiple write sources, an additional function must be included to arbitrate over the memory bus. The pipelines communicate with this function via global variables, allowing it to run in parallel with the normal operation of the system.

Fig. 5. Single precision speedup over Pentium 4 using Altera Stratix II devices

4.4 Low Level Optimization

At this point the code still has a software-like structure, with a single main process/function calling a number of functions and macro-procedures. In Handel-C this can often lead to large, complex state machines with long routing and large fan-out. This in turn leads to a design with a low clock frequency. To avoid this problem the code is restructured. Each functional block in the system is redefined as a main function, giving it a separate state machine. Each block is altered so that it loops repeatedly over its contents, with a while delay loop at the start of the function to hold it until it is called by another block. Function calls within each block are replaced with single bit go signals that break the while delay loop in the functional block being called. Global variables allow data communication between blocks. This gives the final system the structure shown in Fig. 5. The blocks labeled pipex each contain the hardware to implement the daxpy function and the dgefa loops around it. As a final step a number of general low level optimizations are applied.
Complex arithmetic and control expressions are split over multiple clock cycles. Extra pipeline registers are inserted on the read ports of the memories and in any potentially long communication paths (mainly the global variables used for inter-block communication). Instruction level parallelism in each block is exploited, and resource sharing is examined and declared where appropriate.

5. RESULTS

Fig. 5 shows the single precision speedup achieved over a 3GHz Pentium 4 when designs with various numbers of pipelines are targeted to Altera Stratix II devices. Fig. 6 shows the double precision speedup, again using Stratix II devices. In each case the smallest device that would accommodate the design was targeted, using the fastest speed grade available. The Handel-C source files were compiled to EDIF using the Celoxica DK suite v4.0. The syntheses were performed using Altera Quartus II v4.2. Designs using floating point units from both the current Celoxica floating point library and a new pre-release library were tested for the single precision case. The current floating point library does not offer double precision, so no comparison can be made in this case. Equivalent designs were also targeted to Xilinx Virtex 4 LX devices. The results were comparable to those for Stratix II, but the largest Xilinx device holds fewer pipelines than the largest Altera device. As a result the ultimate speedup using Virtex 4 is not as high.

Fig. 6. Double precision speedup over Pentium 4 using Altera Stratix II devices

The first plot in Fig. 5 and Fig. 6 shows the speedup for just the dgefa subroutine (including an estimate of the time taken to transfer data across the PCI bus). The second plot estimates the overall speedup for the complete benchmark, assuming that the code not targeted to the FPGA is implemented on the 3GHz Pentium 4. Using the current library, the largest Stratix II device (EP2S180) can implement the dgefa subroutine 7 times faster than the 3GHz Pentium 4, at a maximum frequency of 98MHz. This improves to an 8.5 times speedup at 94MHz when the new library is used. The smaller size of the improved library components allows up to 48 parallel pipelines, compared to only 36 with the current library. These results translate to an overall benchmark speedup of 4.5 times for the current library and 4.9 times for the new library. The maximum speedup achieved for the double precision benchmark is 4.4 times, with a speedup of 6.8 times for the dgefa subroutine. When more than 16 single precision or 8 double precision pipelines are synthesized, an EP2S130 or EP2S180 device must be used. This causes a drop in the maximum clock frequency that can be achieved, for two reasons. Firstly, the version of Quartus II used does not allow the fastest speed grade of these devices (-3) to be targeted, so slower (-4) devices were used. Secondly, for the larger designs the physical synthesis optimizations in Quartus II had to be turned off, as they caused the fitter to reach the limit of its addressable memory. These problems can be overcome if a newer 64 bit version of Quartus II is used. In this case the results indicated by the dashed lines should be reachable, giving potentially a 10 times speedup for single precision and a 9 times speedup for double precision.

6. CONCLUSIONS

An optimized Handel-C implementation of a coprocessor for both single and double precision versions of the LINPACK 1000 benchmark was produced through a series of optimizations that identify opportunities to exploit data reuse, loop pipelining, coarse grained parallelism, parallel memory assignment, fine grained parallelism and pipelining. When targeted to the latest generation of FPGAs this coprocessor can execute parts of the double precision benchmark up to 6.8 times faster than a 3GHz Pentium 4 processor and accelerate the execution of the complete benchmark by a factor of 4.4. This suggests that it is currently possible to accelerate scientific computing algorithms using FPGAs, even when using high level design methods and tools. However, it has also been shown that significant effort is required to convert standard ANSI C algorithms into efficient, parallel Handel-C. As a result the development times for hardware are still much longer than those for software, even when high level languages are used. Furthermore, an awareness of how specific high level code will be synthesized in hardware is still required to produce efficient designs. This prevents software designers from migrating seamlessly to hardware. It seems that further design tools, possibly automating optimizations similar to those presented here, are still required to make hardware design as quick and efficient as software design.

7. REFERENCES

[1] K. Underwood, "FPGAs vs. CPUs: Trends in Peak Floating Point Performance," Proc. Int. Symp. Field Programmable Gate Arrays, February 2004.
[2] P. Belanovic, M. Leeser, "A library of parameterized floating point modules and their use," Proc. Int. Conf. Field Programmable Logic and Applications, September 2002.
[3] J. Dido, et al., "A flexible floating point format for optimizing data paths and operators in FPGA based DSPs," Proc. Int. Symp. Field Programmable Gate Arrays, February 2002.
[4] X. Wang, B. E. Nelson, "Tradeoffs of designing floating point division and square root on Virtex FPGAs," Proc. IEEE Symp. Field-Programmable Custom Computing Machines, April 2003.
[5] E. Roesler, B. Nelson, "Novel optimizations for hardware floating point units in a modern FPGA architecture," Proc. Int. Conf. Field Programmable Logic and Applications, September 2002.
[6] W. Ligon, et al., "A re-evaluation of the practicality of floating point on FPGAs," Proc. IEEE Symp. Field-Programmable Custom Computing Machines, April 1998.
[7] K. Underwood, K. Hemmert, "Closing the gap: CPU and FPGA trends in sustainable floating point BLAS performance," Proc. IEEE Symp. Field-Programmable Custom Computing Machines, April 2004.
[8]
[9]
[10] J. Dongarra, P. Luszczek, A. Petitet, "The LINPACK benchmark: Past, Present and Future."
[11] Celoxica, PDK 4.1 Software Product Description.
[12] ctures/chap02.pdf
[13]
[14]
[15] H. Rong, Z. Tang, R. Govindarajan, A. Douillet, G. Gao, "Single-Dimension Software Pipelining for Multi-Dimensional Loops," Proc. IEEE Int. Symp. Code Generation and Optimization, March 2004.

ACKNOWLEDGEMENTS

This work was supported by the EPSRC (EP/C549481/1).


More information

Advanced Topics in Computer Architecture

Advanced Topics in Computer Architecture Advanced Topics in Computer Architecture Lecture 7 Data Level Parallelism: Vector Processors Marenglen Biba Department of Computer Science University of New York Tirana Cray I m certainly not inventing

More information

RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch

RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC Zoltan Baruch Computer Science Department, Technical University of Cluj-Napoca, 26-28, Bariţiu St., 3400 Cluj-Napoca,

More information

High-Performance Linear Algebra Processor using FPGA

High-Performance Linear Algebra Processor using FPGA High-Performance Linear Algebra Processor using FPGA J. R. Johnson P. Nagvajara C. Nwankpa 1 Extended Abstract With recent advances in FPGA (Field Programmable Gate Array) technology it is now feasible

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

FPGAs: FAST TRACK TO DSP

FPGAs: FAST TRACK TO DSP FPGAs: FAST TRACK TO DSP Revised February 2009 ABSRACT: Given the prevalence of digital signal processing in a variety of industry segments, several implementation solutions are available depending on

More information

Code modernization of Polyhedron benchmark suite

Code modernization of Polyhedron benchmark suite Code modernization of Polyhedron benchmark suite Manel Fernández Intel HPC Software Workshop Series 2016 HPC Code Modernization for Intel Xeon and Xeon Phi February 18 th 2016, Barcelona Approaches for

More information

High Speed Pipelined Architecture for Adaptive Median Filter

High Speed Pipelined Architecture for Adaptive Median Filter Abstract High Speed Pipelined Architecture for Adaptive Median Filter D.Dhanasekaran, and **Dr.K.Boopathy Bagan *Assistant Professor, SVCE, Pennalur,Sriperumbudur-602105. **Professor, Madras Institute

More information

Sparse LU Decomposition using FPGA

Sparse LU Decomposition using FPGA Sparse LU Decomposition using FPGA Jeremy Johnson 1, Timothy Chagnon 1, Petya Vachranukunkiet 2, Prawat Nagvajara 2, and Chika Nwankpa 2 CS 1 and ECE 2 Departments Drexel University, Philadelphia, PA jjohnson@cs.drexel.edu,tchagnon@drexel.edu,pv29@drexel.edu,

More information

How Much Logic Should Go in an FPGA Logic Block?

How Much Logic Should Go in an FPGA Logic Block? How Much Logic Should Go in an FPGA Logic Block? Vaughn Betz and Jonathan Rose Department of Electrical and Computer Engineering, University of Toronto Toronto, Ontario, Canada M5S 3G4 {vaughn, jayar}@eecgutorontoca

More information

ECE 636. Reconfigurable Computing. Lecture 2. Field Programmable Gate Arrays I

ECE 636. Reconfigurable Computing. Lecture 2. Field Programmable Gate Arrays I ECE 636 Reconfigurable Computing Lecture 2 Field Programmable Gate Arrays I Overview Anti-fuse and EEPROM-based devices Contemporary SRAM devices - Wiring - Embedded New trends - Single-driver wiring -

More information

CHAPTER 3 METHODOLOGY. 3.1 Analysis of the Conventional High Speed 8-bits x 8-bits Wallace Tree Multiplier

CHAPTER 3 METHODOLOGY. 3.1 Analysis of the Conventional High Speed 8-bits x 8-bits Wallace Tree Multiplier CHAPTER 3 METHODOLOGY 3.1 Analysis of the Conventional High Speed 8-bits x 8-bits Wallace Tree Multiplier The design analysis starts with the analysis of the elementary algorithm for multiplication by

More information

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institution of Technology, Delhi

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institution of Technology, Delhi Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institution of Technology, Delhi Lecture - 34 Compilers for Embedded Systems Today, we shall look at the compilers, which

More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

SDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center

SDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center SDAccel Environment The Xilinx SDAccel Development Environment Bringing The Best Performance/Watt to the Data Center Introduction Data center operators constantly seek more server performance. Currently

More information

Using Intel Streaming SIMD Extensions for 3D Geometry Processing

Using Intel Streaming SIMD Extensions for 3D Geometry Processing Using Intel Streaming SIMD Extensions for 3D Geometry Processing Wan-Chun Ma, Chia-Lin Yang Dept. of Computer Science and Information Engineering National Taiwan University firebird@cmlab.csie.ntu.edu.tw,

More information

Early Models in Silicon with SystemC synthesis

Early Models in Silicon with SystemC synthesis Early Models in Silicon with SystemC synthesis Agility Compiler summary C-based design & synthesis for SystemC Pure, standard compliant SystemC/ C++ Most widely used C-synthesis technology Structural SystemC

More information

Introduction to reconfigurable systems

Introduction to reconfigurable systems Introduction to reconfigurable systems Reconfigurable system (RS)= any system whose sub-system configurations can be changed or modified after fabrication Reconfigurable computing (RC) is commonly used

More information

Customising Parallelism and Caching for Machine Learning

Customising Parallelism and Caching for Machine Learning Customising Parallelism and Caching for Machine Learning Andreas Fidjeland and Wayne Luk Department of Computing Imperial College London akf99,wl @doc.ic.ac.uk Abstract Inductive logic programming is an

More information

Lecture 13: March 25

Lecture 13: March 25 CISC 879 Software Support for Multicore Architectures Spring 2007 Lecture 13: March 25 Lecturer: John Cavazos Scribe: Ying Yu 13.1. Bryan Youse-Optimization of Sparse Matrix-Vector Multiplication on Emerging

More information

LECTURE 10: Improving Memory Access: Direct and Spatial caches

LECTURE 10: Improving Memory Access: Direct and Spatial caches EECS 318 CAD Computer Aided Design LECTURE 10: Improving Memory Access: Direct and Spatial caches Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses

More information

Stratix II vs. Virtex-4 Performance Comparison

Stratix II vs. Virtex-4 Performance Comparison White Paper Stratix II vs. Virtex-4 Performance Comparison Altera Stratix II devices use a new and innovative logic structure called the adaptive logic module () to make Stratix II devices the industry

More information

GPU vs FPGA : A comparative analysis for non-standard precision

GPU vs FPGA : A comparative analysis for non-standard precision GPU vs FPGA : A comparative analysis for non-standard precision Umar Ibrahim Minhas, Samuel Bayliss, and George A. Constantinides Department of Electrical and Electronic Engineering Imperial College London

More information

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved. LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E

More information

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction

More information

Custom computing systems

Custom computing systems Custom computing systems difference engine: Charles Babbage 1832 - compute maths tables digital orrery: MIT 1985 - special-purpose engine, found pluto motion chaotic Splash2: Supercomputing esearch Center

More information

Is There A Tradeoff Between Programmability and Performance?

Is There A Tradeoff Between Programmability and Performance? Is There A Tradeoff Between Programmability and Performance? Robert Halstead Jason Villarreal Jacquard Computing, Inc. Roger Moussalli Walid Najjar Abstract While the computational power of Field Programmable

More information

4DM4 Lab. #1 A: Introduction to VHDL and FPGAs B: An Unbuffered Crossbar Switch (posted Thursday, Sept 19, 2013)

4DM4 Lab. #1 A: Introduction to VHDL and FPGAs B: An Unbuffered Crossbar Switch (posted Thursday, Sept 19, 2013) 1 4DM4 Lab. #1 A: Introduction to VHDL and FPGAs B: An Unbuffered Crossbar Switch (posted Thursday, Sept 19, 2013) Lab #1: ITB Room 157, Thurs. and Fridays, 2:30-5:20, EOW Demos to TA: Thurs, Fri, Sept.

More information

Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints

Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints Amit Kulkarni, Tom Davidson, Karel Heyse, and Dirk Stroobandt ELIS department, Computer Systems Lab, Ghent

More information

Verilog for High Performance

Verilog for High Performance Verilog for High Performance Course Description This course provides all necessary theoretical and practical know-how to write synthesizable HDL code through Verilog standard language. The course goes

More information

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors G. Chen 1, M. Kandemir 1, I. Kolcu 2, and A. Choudhary 3 1 Pennsylvania State University, PA 16802, USA 2 UMIST,

More information

ISSN Vol.05,Issue.09, September-2017, Pages:

ISSN Vol.05,Issue.09, September-2017, Pages: WWW.IJITECH.ORG ISSN 2321-8665 Vol.05,Issue.09, September-2017, Pages:1693-1697 AJJAM PUSHPA 1, C. H. RAMA MOHAN 2 1 PG Scholar, Dept of ECE(DECS), Shirdi Sai Institute of Science and Technology, Anantapuramu,

More information

A Novel Design Framework for the Design of Reconfigurable Systems based on NoCs

A Novel Design Framework for the Design of Reconfigurable Systems based on NoCs Politecnico di Milano & EPFL A Novel Design Framework for the Design of Reconfigurable Systems based on NoCs Vincenzo Rana, Ivan Beretta, Donatella Sciuto Donatella Sciuto sciuto@elet.polimi.it Introduction

More information

The Nios II Family of Configurable Soft-core Processors

The Nios II Family of Configurable Soft-core Processors The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture

More information

9. Building Memory Subsystems Using SOPC Builder

9. Building Memory Subsystems Using SOPC Builder 9. Building Memory Subsystems Using SOPC Builder QII54006-6.0.0 Introduction Most systems generated with SOPC Builder require memory. For example, embedded processor systems require memory for software

More information

Tradeoff Analysis and Architecture Design of a Hybrid Hardware/Software Sorter

Tradeoff Analysis and Architecture Design of a Hybrid Hardware/Software Sorter Tradeoff Analysis and Architecture Design of a Hybrid Hardware/Software Sorter M. Bednara, O. Beyer, J. Teich, R. Wanka Paderborn University D-33095 Paderborn, Germany bednara,beyer,teich @date.upb.de,

More information

The LINPACK Benchmark on a Multi-Core Multi-FPGA System. Emanuel Caldeira Ramalho

The LINPACK Benchmark on a Multi-Core Multi-FPGA System. Emanuel Caldeira Ramalho The LINPACK Benchmark on a Multi-Core Multi-FPGA System by Emanuel Caldeira Ramalho A thesis submitted in conformity with the requirements for the degree of Masters of Applied Science Graduate Department

More information

Memory Access Optimization and RAM Inference for Pipeline Vectorization

Memory Access Optimization and RAM Inference for Pipeline Vectorization Memory Access Optimization and Inference for Pipeline Vectorization M. Weinhardt and W. Luk Springer-Verlag Berlin Heildelberg 1999. This paper was first published in Field-Programmable Logic and Applications,

More information

Chapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture

Chapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture An Introduction to Parallel Programming Peter Pacheco Chapter 2 Parallel Hardware and Parallel Software 1 The Von Neuman Architecture Control unit: responsible for deciding which instruction in a program

More information

A Reconfigurable Multifunction Computing Cache Architecture

A Reconfigurable Multifunction Computing Cache Architecture IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 9, NO. 4, AUGUST 2001 509 A Reconfigurable Multifunction Computing Cache Architecture Huesung Kim, Student Member, IEEE, Arun K. Somani,

More information

Data Reuse Exploration under Area Constraints for Low Power Reconfigurable Systems

Data Reuse Exploration under Area Constraints for Low Power Reconfigurable Systems Data Reuse Exploration under Area Constraints for Low Power Reconfigurable Systems Qiang Liu, George A. Constantinides, Konstantinos Masselos and Peter Y.K. Cheung Department of Electrical and Electronics

More information

Laboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication

Laboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication Laboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication Introduction All processors offer some form of instructions to add, subtract, and manipulate data.

More information

Parallel FIR Filters. Chapter 5

Parallel FIR Filters. Chapter 5 Chapter 5 Parallel FIR Filters This chapter describes the implementation of high-performance, parallel, full-precision FIR filters using the DSP48 slice in a Virtex-4 device. ecause the Virtex-4 architecture

More information

ARITHMETIC operations based on residue number systems

ARITHMETIC operations based on residue number systems IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 53, NO. 2, FEBRUARY 2006 133 Improved Memoryless RNS Forward Converter Based on the Periodicity of Residues A. B. Premkumar, Senior Member,

More information

EE/CSCI 451 Midterm 1

EE/CSCI 451 Midterm 1 EE/CSCI 451 Midterm 1 Spring 2018 Instructor: Xuehai Qian Friday: 02/26/2018 Problem # Topic Points Score 1 Definitions 20 2 Memory System Performance 10 3 Cache Performance 10 4 Shared Memory Programming

More information

White Paper Assessing FPGA DSP Benchmarks at 40 nm

White Paper Assessing FPGA DSP Benchmarks at 40 nm White Paper Assessing FPGA DSP Benchmarks at 40 nm Introduction Benchmarking the performance of algorithms, devices, and programming methodologies is a well-worn topic among developers and research of

More information

8. Migrating Stratix II Device Resources to HardCopy II Devices

8. Migrating Stratix II Device Resources to HardCopy II Devices 8. Migrating Stratix II Device Resources to HardCopy II Devices H51024-1.3 Introduction Altera HardCopy II devices and Stratix II devices are both manufactured on a 1.2-V, 90-nm process technology and

More information

Abstract A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE

Abstract A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE Reiner W. Hartenstein, Rainer Kress, Helmut Reinig University of Kaiserslautern Erwin-Schrödinger-Straße, D-67663 Kaiserslautern, Germany

More information

Advanced optimizations of cache performance ( 2.2)

Advanced optimizations of cache performance ( 2.2) Advanced optimizations of cache performance ( 2.2) 30 1. Small and Simple Caches to reduce hit time Critical timing path: address tag memory, then compare tags, then select set Lower associativity Direct-mapped

More information

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Zhining Huang, Sharad Malik Electrical Engineering Department

More information

ERROR MODELLING OF DUAL FIXED-POINT ARITHMETIC AND ITS APPLICATION IN FIELD PROGRAMMABLE LOGIC

ERROR MODELLING OF DUAL FIXED-POINT ARITHMETIC AND ITS APPLICATION IN FIELD PROGRAMMABLE LOGIC ERROR MODELLING OF DUAL FIXED-POINT ARITHMETIC AND ITS APPLICATION IN FIELD PROGRAMMABLE LOGIC Chun Te Ewe, Peter Y. K. Cheung and George A. Constantinides Department of Electrical & Electronic Engineering,

More information

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Stanley Bak Abstract Network algorithms are deployed on large networks, and proper algorithm evaluation is necessary to avoid

More information

Fast evaluation of nonlinear functions using FPGAs

Fast evaluation of nonlinear functions using FPGAs Adv. Radio Sci., 6, 233 237, 2008 Author(s) 2008. This work is distributed under the Creative Commons Attribution 3.0 License. Advances in Radio Science Fast evaluation of nonlinear functions using FPGAs

More information

A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors

A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors Murali Jayapala 1, Francisco Barat 1, Pieter Op de Beeck 1, Francky Catthoor 2, Geert Deconinck 1 and Henk Corporaal

More information

Lecture 41: Introduction to Reconfigurable Computing

Lecture 41: Introduction to Reconfigurable Computing inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture 41: Introduction to Reconfigurable Computing Michael Le, Sp07 Head TA April 30, 2007 Slides Courtesy of Hayden So, Sp06 CS61c Head TA Following

More information

Power Solutions for Leading-Edge FPGAs. Vaughn Betz & Paul Ekas

Power Solutions for Leading-Edge FPGAs. Vaughn Betz & Paul Ekas Power Solutions for Leading-Edge FPGAs Vaughn Betz & Paul Ekas Agenda 90 nm Power Overview Stratix II : Power Optimization Without Sacrificing Performance Technical Features & Competitive Results Dynamic

More information

Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing

Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing Walter Stechele, Stephan Herrmann, Andreas Herkersdorf Technische Universität München 80290 München Germany Walter.Stechele@ei.tum.de

More information

The Xilinx XC6200 chip, the software tools and the board development tools

The Xilinx XC6200 chip, the software tools and the board development tools The Xilinx XC6200 chip, the software tools and the board development tools What is an FPGA? Field Programmable Gate Array Fully programmable alternative to a customized chip Used to implement functions

More information

Hardware Oriented Security

Hardware Oriented Security 1 / 20 Hardware Oriented Security SRC-7 Programming Basics and Pipelining Miaoqing Huang University of Arkansas Fall 2014 2 / 20 Outline Basics of SRC-7 Programming Pipelining 3 / 20 Framework of Program

More information

Stratix vs. Virtex-II Pro FPGA Performance Analysis

Stratix vs. Virtex-II Pro FPGA Performance Analysis White Paper Stratix vs. Virtex-II Pro FPGA Performance Analysis The Stratix TM and Stratix II architecture provides outstanding performance for the high performance design segment, providing clear performance

More information

A Configurable Multi-Ported Register File Architecture for Soft Processor Cores

A Configurable Multi-Ported Register File Architecture for Soft Processor Cores A Configurable Multi-Ported Register File Architecture for Soft Processor Cores Mazen A. R. Saghir and Rawan Naous Department of Electrical and Computer Engineering American University of Beirut P.O. Box

More information

Intel HLS Compiler: Fast Design, Coding, and Hardware

Intel HLS Compiler: Fast Design, Coding, and Hardware white paper Intel HLS Compiler Intel HLS Compiler: Fast Design, Coding, and Hardware The Modern FPGA Workflow Authors Melissa Sussmann HLS Product Manager Intel Corporation Tom Hill OpenCL Product Manager

More information

A High Performance Bus Communication Architecture through Bus Splitting

A High Performance Bus Communication Architecture through Bus Splitting A High Performance Communication Architecture through Splitting Ruibing Lu and Cheng-Kok Koh School of Electrical and Computer Engineering Purdue University,West Lafayette, IN, 797, USA {lur, chengkok}@ecn.purdue.edu

More information

An Overview of a Compiler for Mapping MATLAB Programs onto FPGAs

An Overview of a Compiler for Mapping MATLAB Programs onto FPGAs An Overview of a Compiler for Mapping MATLAB Programs onto FPGAs P. Banerjee Department of Electrical and Computer Engineering Northwestern University 2145 Sheridan Road, Evanston, IL-60208 banerjee@ece.northwestern.edu

More information

A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination

A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination 1 1 A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination David M. Koppelman Department of Electrical & Computer Engineering Louisiana State University, Baton Rouge koppel@ee.lsu.edu

More information

The Power of Streams on the SRC MAP. Wim Bohm Colorado State University. RSS!2006 Copyright 2006 SRC Computers, Inc. ALL RIGHTS RESERVED.

The Power of Streams on the SRC MAP. Wim Bohm Colorado State University. RSS!2006 Copyright 2006 SRC Computers, Inc. ALL RIGHTS RESERVED. The Power of Streams on the SRC MAP Wim Bohm Colorado State University RSS!2006 Copyright 2006 SRC Computers, Inc. ALL RIGHTS RSRV. MAP C Pure C runs on the MAP Generated code: circuits Basic blocks in

More information

An Implementation of Double precision Floating point Adder & Subtractor Using Verilog

An Implementation of Double precision Floating point Adder & Subtractor Using Verilog IOSR Journal of Electrical and Electronics Engineering (IOSR-JEEE) e-issn: 2278-1676,p-ISSN: 2320-3331, Volume 9, Issue 4 Ver. III (Jul Aug. 2014), PP 01-05 An Implementation of Double precision Floating

More information

V8-uRISC 8-bit RISC Microprocessor AllianceCORE Facts Core Specifics VAutomation, Inc. Supported Devices/Resources Remaining I/O CLBs

V8-uRISC 8-bit RISC Microprocessor AllianceCORE Facts Core Specifics VAutomation, Inc. Supported Devices/Resources Remaining I/O CLBs V8-uRISC 8-bit RISC Microprocessor February 8, 1998 Product Specification VAutomation, Inc. 20 Trafalgar Square Nashua, NH 03063 Phone: +1 603-882-2282 Fax: +1 603-882-1587 E-mail: sales@vautomation.com

More information

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System HU WEI, CHEN TIANZHOU, SHI QINGSONG, JIANG NING College of Computer Science Zhejiang University College of Computer

More information

Simulation & Synthesis of FPGA Based & Resource Efficient Matrix Coprocessor Architecture

Simulation & Synthesis of FPGA Based & Resource Efficient Matrix Coprocessor Architecture Simulation & Synthesis of FPGA Based & Resource Efficient Matrix Coprocessor Architecture Jai Prakash Mishra 1, Mukesh Maheshwari 2 1 M.Tech Scholar, Electronics & Communication Engineering, JNU Jaipur,

More information

Development of efficient computational kernels and linear algebra routines for out-of-order superscalar processors

Development of efficient computational kernels and linear algebra routines for out-of-order superscalar processors Future Generation Computer Systems 21 (2005) 743 748 Development of efficient computational kernels and linear algebra routines for out-of-order superscalar processors O. Bessonov a,,d.fougère b, B. Roux

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS /$ IEEE

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS /$ IEEE IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 Exploration of Heterogeneous FPGAs for Mapping Linear Projection Designs Christos-S. Bouganis, Member, IEEE, Iosifina Pournara, and Peter

More information

ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests

ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests Mingxing Tan 1 2, Gai Liu 1, Ritchie Zhao 1, Steve Dai 1, Zhiru Zhang 1 1 Computer Systems Laboratory, Electrical and Computer

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Lecture 20: Memory Hierarchy Main Memory and Enhancing its Performance. Grinch-Like Stuff

Lecture 20: Memory Hierarchy Main Memory and Enhancing its Performance. Grinch-Like Stuff Lecture 20: ory Hierarchy Main ory and Enhancing its Performance Professor Alvin R. Lebeck Computer Science 220 Fall 1999 HW #4 Due November 12 Projects Finish reading Chapter 5 Grinch-Like Stuff CPS 220

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

FPGA Implementation of Multiplier for Floating- Point Numbers Based on IEEE Standard

FPGA Implementation of Multiplier for Floating- Point Numbers Based on IEEE Standard FPGA Implementation of Multiplier for Floating- Point Numbers Based on IEEE 754-2008 Standard M. Shyamsi, M. I. Ibrahimy, S. M. A. Motakabber and M. R. Ahsan Dept. of Electrical and Computer Engineering

More information

The S6000 Family of Processors

The S6000 Family of Processors The S6000 Family of Processors Today s Design Challenges The advent of software configurable processors In recent years, the widespread adoption of digital technologies has revolutionized the way in which

More information