FPGA ACCELERATION OF THE LINPACK BENCHMARK USING HANDEL-C AND THE CELOXICA FLOATING POINT LIBRARY
Kieron Turkington, Konstantinos Masselos, George A. Constantinides
Department of Electrical and Electronic Engineering, Imperial College London, UK
{kjt01, k.masselos,

Philip Leong
Department of Computing, Imperial College London, UK
philip.leong@imperial.ac.uk

ABSTRACT

In this paper the latest FPGAs are compared to a high-end microprocessor with respect to sustained performance on a popular floating point CPU benchmark, namely LINPACK 1000. The FPGA hardware has been developed in Handel-C and uses the Celoxica Floating Point Library functions to perform both double and single precision arithmetic. A set of translation and optimization steps has been applied to transform a sequential C description of the LINPACK benchmark, based on a monolithic memory model, into a parallel Handel-C description that exploits the multiple memory resources available on a realistic reconfigurable computing platform. These transformations allow high levels of parallelism to be exploited within a limited memory bandwidth. The experimental results show that the latest generation of FPGAs, programmed using Handel-C, can achieve a sustained double precision floating point performance up to 6.8 times greater than that of the microprocessor while operating at a clock frequency 30 times lower.

1. INTRODUCTION

The dynamic range and accuracy demanded by many engineering and scientific applications have led to the widespread use of floating point arithmetic on conventional computers [1]. In recent years the continued growth in FPGA densities has produced devices capable of implementing ever increasing numbers of both single and double precision floating point cores. As a result, significant research effort has recently been expended on investigating the floating point potential of FPGAs [2, 3, 4, 5, 6].
Significantly, the work presented in [1] and [7] shows that FPGAs can compete with commodity microprocessors in terms of both peak and sustained floating point performance. While microprocessors and FPGAs offer comparable performance, a large gap remains between the design flows for the two platforms: FPGA programming is still performed largely through conventional register-transfer level (RTL) design flows, while software for microprocessors is written in behavioral languages. The authors of [1] predict that FPGAs could hold an order of magnitude advantage over microprocessors for floating point performance within ten years. However, for FPGAs to be considered in the future as a competitive platform for scientific computing, efficient, software-like design flows must be developed. These design flows must allow the full processing power of the FPGAs to be harnessed in a design time comparable to that for a microprocessor. A number of commercial products already allow FPGAs to be programmed at higher levels of abstraction [8, 9]. While these tools can bring FPGA programming closer to software development, they often reduce the ultimate performance of the final design compared to hand-coded VHDL or Verilog. Little work has been done to investigate the current floating point potential of FPGAs when higher level design techniques and tools are used. In this paper the floating point performance of the latest FPGAs (Stratix II and Virtex 4), programmed at a level above RTL (using Handel-C), is compared to that of a commodity microprocessor (a 3 GHz Intel Pentium 4). The performance of both platforms is measured on the LINPACK 1000 benchmark [10], a popular floating point benchmark widely used to test the performance of microprocessors. The currently released Celoxica floating point library (PDK v4.0 [11]) allows only single precision algorithms to be written in Handel-C.
However, we also present results from a soon-to-be-released library that allows both single and double precision floating point cores to be instantiated. The single precision performance of this new library is compared to that of the current library, and the performance of both libraries is compared to that of the microprocessor. The rest of the paper is organized as follows. Section 2 gives a brief description of the LINPACK 1000 benchmark. Section 3 describes the general platform this work targets and the partitioning of the benchmark. Section 4 describes how the critical sections of the LINPACK benchmark are implemented on an FPGA and presents a four-stage process for converting from C to
Handel-C. In Section 5 the results for this implementation are presented and Section 6 contains the conclusions drawn from this work.

2. LINPACK 1000

The LINPACK family of benchmarks solves a system of linear equations using LU decomposition and is typical of many matrix-based scientific computations. This work focuses on the LINPACK 1000 benchmark, the variant most commonly used to evaluate general purpose microprocessors. LINPACK 1000 solves a random linear system of order 1000, measures the average floating point performance achieved and reports the residual error produced. The pseudo code for the operation of the LINPACK 1000 benchmark is given in Fig. 1.

begin
    -- generate random linear system Ax = b --
    matgen(&A[ ][ ], &b[ ]);
    -- start timer --
    t1 = second();
    -- apply LU decomposition --
    dgefa(&A[ ][ ], &ipvt[ ]);
    -- solve simplified system --
    x[ ] = dgesl(&A[ ][ ], &b[ ], ipvt[ ]);
    -- store time taken --
    t1 = second() - t1;
    -- calc average FLOPS --
    ops = 2/3*(1000)^3 + 2*(1000)^2;
    flops = ops / t1;
    -- regenerate original A[ ] & b[ ] --
    matgen(&A[ ][ ], &b[ ]);
    -- calculate b = Ax - b --
    b[ ] = -b[ ];
    dmxpy(A[ ][ ], &b[ ], x[ ], 1000);
    -- find residual and normalise --
    resid = max(b[ ]);
    residn = norm(resid);
    print(t1, flops, resid, residn);
end;

Fig. 1. Pseudo code for LINPACK 1000

The benchmark first generates a random 1000x1000 element matrix, A, and a 1000 element vector, b. The elements in A and b are all floating point numbers. The dgefa subroutine simplifies the linear system by performing LU decomposition (by Gaussian elimination with partial pivoting) on A. A brief description of this method can be found in [12]. The dgesl subroutine then solves the simplified LU version of the original system to produce the vector x. The benchmark measures the time taken to complete the dgefa and dgesl subroutines and calculates the average FLOPS achieved.
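The dgefa/dgesl pair at the heart of Fig. 1 can be illustrated with a small self-contained C sketch: LU decomposition by Gaussian elimination with partial pivoting, the two triangular solves, and the nominal LINPACK operation count (2n^3/3 + 2n^2) used to convert the measured time into FLOPS. This is an illustrative reduction to a fixed 3x3 system, not the benchmark source; only the routine names follow the pseudo code.

```c
#include <math.h>

#define N 3   /* illustrative order; the benchmark uses n = 1000 */

/* dgefa: factorise A in place by Gaussian elimination with partial
   pivoting, storing negated multipliers below the diagonal and the
   pivot row for each column in ipvt (LINPACK's storage convention). */
static void dgefa_sketch(double a[N][N], int ipvt[N]) {
    for (int k = 0; k < N - 1; k++) {
        int piv = k;                       /* idamax: largest |a[i][k]| */
        for (int i = k + 1; i < N; i++)
            if (fabs(a[i][k]) > fabs(a[piv][k])) piv = i;
        ipvt[k] = piv;
        if (a[piv][k] == 0.0) continue;    /* singular column, skip */
        for (int j = k; j < N; j++) {      /* swap rows k and piv */
            double t = a[piv][j]; a[piv][j] = a[k][j]; a[k][j] = t;
        }
        double t = -1.0 / a[k][k];
        for (int i = k + 1; i < N; i++)    /* dscal: form multipliers */
            a[i][k] *= t;
        for (int j = k + 1; j < N; j++)    /* daxpy: update each column */
            for (int i = k + 1; i < N; i++)
                a[i][j] += a[k][j] * a[i][k];
    }
    ipvt[N - 1] = N - 1;
}

/* dgesl: solve A*x = b using the factors from dgefa; x overwrites b. */
static void dgesl_sketch(double a[N][N], const int ipvt[N], double b[N]) {
    for (int k = 0; k < N - 1; k++) {      /* forward elimination on b */
        double t = b[ipvt[k]]; b[ipvt[k]] = b[k]; b[k] = t;
        for (int i = k + 1; i < N; i++)
            b[i] += t * a[i][k];
    }
    for (int k = N - 1; k >= 0; k--) {     /* back substitution */
        b[k] /= a[k][k];
        for (int i = 0; i < k; i++)
            b[i] -= b[k] * a[i][k];
    }
}

/* Nominal operation count LINPACK charges for an order-n solve. */
static double linpack_mflops(double n, double seconds) {
    double ops = (2.0 / 3.0) * n * n * n + 2.0 * n * n;
    return ops / seconds / 1.0e6;
}
```

For the profiled single precision run described below, linpack_mflops(1000, 1.03) evaluates to roughly 649, consistent with the reported 647 MFLOPS (the benchmark times only the dgefa and dgesl subroutines internally).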
The remainder of the algorithm serves to estimate the normalized residual error in the result vector x.

Both single and double precision versions of the benchmark were profiled on an Intel Pentium 4 processor (3 GHz, 1 Mbyte L2 cache) running Windows XP, compiled with GCC (running under Cygwin). The entire single precision benchmark takes 1.03 seconds to complete, with an average floating point performance of 647 MFLOPS across the dgefa and dgesl subroutines. The double precision benchmark takes 1.61 seconds, achieving 411 MFLOPS. 90% of the execution time for LINPACK 1000 is spent in the dgefa subroutine. dgefa in turn calls a smaller function, named daxpy, many times, with the result that the latter alone accounts for 88% of the total execution time of the benchmark.

3. TARGET PLATFORM

[Fig. 2. Target platform: a host microprocessor system connected over a PCI bus to a coprocessor comprising an FPGA, a PCI interface/controller and SRAM banks.]

This work targets the general platform shown in Fig. 2. It consists of a host microprocessor system linked to an FPGA based coprocessor via a PCI interface. The coprocessor features a high density FPGA, a PCI interface/controller and on-board memory resources. The memory on the coprocessor is assumed to be SRAM, with up to 6 banks providing a maximum of 16 Mbytes. The data width of each bank is assumed to be 32 bits. Although this does not match any existing development board, the features specified are realistic and similar to those found on a number of boards, including the Celoxica RC300 [13] and RC2000 [14]. When using a target platform with both microprocessor and FPGA resources it is typical to partition the target algorithm between the two. Frequently reused code with the potential for parallelism (usually loops) is targeted to the FPGA while the rest of the code is implemented on the microprocessor. Due to their lower clock speeds, FPGAs only provide acceleration when their resources are used to implement sections of the algorithm in parallel.
Therefore only the most reused sections of code should be targeted to the FPGA, to maximize parallelism. However, communication between the host and coprocessor should be minimized so
that the PCI bus does not become a bottleneck in the system. This leads to a tradeoff between maximizing the potential for parallelism and minimizing communication. The benchmark was partitioned so that the entire dgefa (LU decomposition) subroutine is implemented on the FPGA. This limits the communication to two block transfers, one at the start of dgefa and the other at the end. The pseudo code for dgefa is given in Fig. 3. dgefa contains a nested loop with three levels (the third is in the daxpy function) and so provides reasonable opportunities for parallelism. Since it accounts for 90% of the execution time of the benchmark it offers the potential for good acceleration of the complete algorithm.

4. OPTIMIZATION OF THE DGEFA ROUTINE

This work uses an ANSI C version of the LINPACK benchmark as its starting point. The code partitioned to the FPGA (the dgefa subroutine) must be converted into Handel-C and optimized to maximize parallelism. Before any optimizations are considered, some simple translation steps are performed to convert the ANSI C syntax into valid Handel-C. These include the removal of side effects and the replacement of floating point arithmetic with Handel-C floating point macro procedures. In addition, pointer-based array access is converted into direct array addressing, which simplifies memory optimization in later stages. To complete the translation to Handel-C, word lengths are assigned to all integer variables in the code. While this final step is normally a complex optimization, it is included as a translation step because Handel-C code will often not compile without it. The result of these translation steps is valid Handel-C code that can be compiled to hardware.
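The effect of these translation steps on the benchmark's hot inner loop can be illustrated in plain C (the Handel-C macro calls and word-length annotations are omitted). The pointer-walking version below is typical of the ANSI C source; the second version removes the side effects from expressions and uses direct array addressing. Both loops are illustrative sketches, not the benchmark's actual code.

```c
/* Inner loop of daxpy (y = t*x + y) as it might appear in the ANSI C
   source: pointer arithmetic with side effects inside expressions. */
static void daxpy_ptr(int n, double t, const double *x, double *y) {
    while (n-- > 0)
        *y++ += t * *x++;          /* three side effects in one statement */
}

/* After the translation steps: no side effects within expressions and
   direct array addressing, ready for conversion to Handel-C syntax. */
static void daxpy_indexed(int n, double t, const double x[], double y[]) {
    for (int i = 0; i < n; i = i + 1) {
        y[i] = y[i] + t * x[i];    /* single assignment per statement */
    }
}
```

The two versions are functionally identical; the indexed form simply makes every memory access and update explicit, which is what the later memory optimization stages rely on.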
However, the hardware generated will be very inefficient, so the following four optimization stages are applied to exploit the parallelism and data reuse available in the algorithm.

4.1 Data Reuse Exploitation

The dgefa subroutine accesses the 1000x1000 element A matrix for almost every operation. Due to its large size, A must be stored in the off-chip SRAM on the coprocessor. There is limited bandwidth for accessing the off-chip SRAM (up to 6 ports), which could limit the parallelism that can be exploited. However, it may be possible to reduce accesses to A by buffering and reusing data on the FPGA, allowing more parallelism in the final system. The pseudo code for the dgefa subroutine is shown in Fig. 3.

dgefa(*A[ ][ ], *ipvt[ ])
begin
    for (k = 0 : 998)                  -- loop 1 --
        -- find the pivot index of column k --
        piv = idamax(A[ ][k], k) + k;
        ipvt[k] = piv;
        if (A[piv][k] != 0)
            -- swap matrix elements if k != piv --
            swap(&A[piv][k], &A[k][k]);
            t = -1/(A[k][k]);
            -- scale column k by t --
            dscal(&A[ ][k], t, k);
            for (j = (k+1) : 999)      -- loop 2 --
                -- swap matrix elements if k != piv --
                swap(&A[piv][j], &A[k][j]);
                t = A[piv][j];
                -- A[ ][j] = t*A[ ][k] + A[ ][j] --
                daxpy(&A[ ][j], A[ ][k], t, k);
            end for;                   -- end loop 2 --
        end if;
    end for;                           -- end loop 1 --
    ipvt[999] = 999;
end;

Fig. 3. Pseudo code for dgefa subroutine

The daxpy function requires data from two matrix columns as its inputs, denoted A[ ][j] and A[ ][k] (where j and k are the column numbers). For a given iteration of loop 1 the same A[ ][k] column is used in each iteration of loop 2, meaning that, if the correct A[ ][k] data is buffered on the FPGA at the start of loop 1, it can be reused for every iteration of loop 2. More importantly, the A[ ][k] data used in iteration k+1 of loop 1 is calculated as a result column in iteration k. If this column of data is stored on the FPGA as it is generated then only the A[ ][j] data must be read from the external memory, instead of both the A[ ][k] and A[ ][j] data.
This effectively halves the number of reads from the external memory and halves the bandwidth required. The A[ ][k] column for the very first iteration of loop 1 must still be read from the external memory, but this now represents a small one-off initialization cost. This data reuse can be implemented using two new on-chip buffers, each capable of storing one column of matrix data. One buffer, called k_col, stores the A[ ][k] data for the current iteration of loop 1. The second buffer, called k_next, stores the A[ ][k] data needed for the next iteration as it is generated.

4.2 Loop Pipelining

Loop pipelining is a well known optimization that allows parallelism to be exploited in loops without increasing the arithmetic logic resources required (although there may be a control overhead). The three functions called in the dgefa subroutine, idamax, dscal and daxpy, all contain a further loop level. Each also contains calls to pipelined floating point macros from the Celoxica floating point library. There are no loop carried dependencies in
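The buffering scheme described above can be sketched in C. The function below performs one daxpy column update, reading the A[ ][k] operand from the on-chip k_col buffer rather than from external memory, and capturing the column that will become the next A[ ][k] into k_next as it is produced (when j equals k+1). The buffer names follow the text; the code is an illustrative model of the data movement, not the Handel-C implementation.

```c
/* One loop-2 iteration: A[ ][j] += t * A[ ][k], with A[ ][k] served from
   the on-chip k_col buffer.  When this j column is the one that will act
   as A[ ][k] in the next loop-1 iteration, it is copied into k_next as
   it is generated, so it never has to be re-read from the SRAM. */
static void daxpy_with_reuse(int rows, double t,
                             const double k_col[],  /* on-chip buffer  */
                             double a_j[],          /* external column */
                             double k_next[],       /* on-chip buffer  */
                             int capture_next)      /* j == k+1 ?      */
{
    for (int i = 0; i < rows; i++) {
        a_j[i] = a_j[i] + t * k_col[i];  /* 1 external read, 1 write */
        if (capture_next)
            k_next[i] = a_j[i];          /* on-chip copy, no SRAM access */
    }
}
```

Each inner-loop iteration now touches the external SRAM exactly twice (one read of A[i][j], one write back), rather than three times, which is where the halving of the read bandwidth comes from.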
any of these inner loops, so the iterations of each loop can be pipelined into the stages of the floating point units. The outer loops of the dgefa subroutine could have been pipelined instead, using the methods presented in [15] for example, but the dependencies at these loop levels make them less favorable. Since all of the floating point cores are internally pipelined, pipelining the inner loops does not increase the resource usage. However, it does place extra strain on the external memory. The pipelined daxpy function must complete one read and one write to the external SRAM for every iteration of its inner loop. A new inner loop iteration can be initiated every cycle provided these memory operations can complete in one cycle. Each bank of the external SRAM has only a single port, so the A matrix must be partitioned across multiple banks in such a way that the same bank is never required to perform a read and a write on the same cycle. This can be achieved by placing the odd and even rows of the matrix in different SRAM banks. As the pipeline works down each column it reads from alternating odd and even rows and writes to alternating odd and even rows. The memory controller is scheduled so that when an odd row is read an even row is written (and vice versa). There is sufficient slack in the schedule to allow a write to be delayed by a cycle when necessary to avoid conflict.

4.3 Outer Loop Parallelization

Simply pipelining the inner loops does not yield sufficient parallelism to outperform the Pentium 4, so the outer loops of the dgefa subroutine must somehow be executed in parallel.

Processor 0:  idamax[0]  dscal[0]   daxpy[1]   daxpy[2]   daxpy[3]   daxpy[4]  daxpy[5]  daxpy[6]
Processor 1:                        idamax[1]  dscal[1]   daxpy[2]   daxpy[3]  daxpy[4]  daxpy[5]
Processor 2:                                              idamax[2]  dscal[2]  daxpy[3]  daxpy[4]

Fig. 4. Schedule for 3 parallel processing units
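The schedule of Fig. 4 can be captured by a small C model: each pipeline starts two slots after its predecessor, runs idamax then dscal on its own column, and then issues one daxpy per slot on successive columns. The two-slot offset is read off the figure; the function is a scheduling illustration under that assumption, not part of the design itself.

```c
#include <stdio.h>
#include <string.h>

/* Task executed by pipeline p in schedule slot s, per Fig. 4. */
static void slot_task(int p, int s, char out[32]) {
    if (s < 2 * p)
        strcpy(out, "idle");                  /* pipeline not started yet */
    else if (s == 2 * p)
        sprintf(out, "idamax[%d]", p);        /* pivot search on column p */
    else if (s == 2 * p + 1)
        sprintf(out, "dscal[%d]", p);         /* scale column p */
    else
        sprintf(out, "daxpy[%d]", s - p - 1); /* one column update per slot */
}
```

For example, pipeline 1 in slot 4 performs daxpy[2], one slot after pipeline 0 has produced that column's input data, which is exactly the offset that lets data flow between pipelines without an external memory round trip.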
However, the required memory bandwidth must not scale with the parallelism achieved, as there are (at most) six ports to the external SRAM, two of which are already used on almost every cycle. Fortunately, there is a loop carried dependency in loop 1 (Fig. 3) such that all of the input data used by iteration n+1 is generated in iteration n. This means that, if multiple loop 1 iterations are run in parallel with their start times offset correctly, data can effectively flow from the output of one iteration to the input of the next without going through the external memory. This allows arbitrary outer loop parallelism with only two ports to the external memory.

[Fig. 5. Structure of dgefa coprocessor]

Fig. 4 shows how the first three iterations of loop 1 can be scheduled across three parallel loop processing pipelines. The number of pipelines can be varied using a #define in the Handel-C source code. In Fig. 4 the notation idamax[k] represents the execution of the idamax function on column k of the matrix; the same notation is used for the dscal and daxpy functions. All of the remaining code in the dgefa subroutine has been moved into one of these three functions to simplify the system. idamax, dscal and daxpy all contain loops iterating over the same bounds, so they take the same number of clock cycles to run (+/- 2%). This allows instances of the three functions to be scheduled as if they were individual instructions. idamax[k] operates on the data generated in daxpy[k]. As the output data from daxpy[k] is generated it is sent to idamax[k] through global variables, allowing the two functions to run in parallel. daxpy[k] in one pipeline runs in parallel with daxpy[k+1] for the previous outer loop iteration, but uses the data generated by the previous iteration's daxpy[k].
Hence, each pipeline must include a buffer (capable of storing a column of matrix data) to hold the data generated by the previous pipeline in the chain until it is ready to be used by the operations in the daxpy function. As can be seen in Fig. 4, idamax and dscal are never required by more than one pipeline at once, which allows them to be shared in the following optimization stage. It also means that the k_next buffer (introduced in Section 4.1) can be shared, since only idamax and dscal access it. Each pipeline must have its own k_col buffer, however, since
each pipeline uses a different A[ ][k] column as its input.

[Fig. 5. Single precision speedup over Pentium 4 using Altera Stratix II devices]

Only the last pipeline in the chain must write its A[ ][j] output data to the external memory. However, the A[ ][k] data stored in the k_col buffer of each pipeline must be written to the external memory at some point, since these columns form part of the final result matrix. Fortunately the final pipeline in the chain has sufficient free slots to output all the necessary data, so only a single write port to the external memory is required. Since there are now multiple write sources, an additional function must be included to arbitrate over the memory bus. The pipelines communicate with this function via global variables, allowing it to run in parallel with the normal operation of the system.

4.4 Low Level Optimization

At this point the code still has a software-like structure, with a single main process/function calling a number of functions and macro procedures. In Handel-C this can often lead to large, complex state machines with long routing and large fan-out, which in turn leads to a design with a low clock frequency. To avoid this problem the code is restructured. Each functional block in the system is redefined as a main function, giving it a separate state machine. Each block is altered so that it loops repeatedly over its contents, with a while delay loop at the start of the function to hold it until it is called by another block. Function calls within each block are replaced with single-bit go signals that break the while delay loop in the functional block being called. Global variables allow data communication between blocks. This gives the final system the structure shown in Fig. 5. The blocks labeled pipex each contain the hardware to implement the daxpy function and the dgefa loops around it. As a final step a number of general low level optimizations are applied.
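The restructuring described above can be caricatured in C: each block becomes its own looping state machine that idles until a one-bit go signal releases it, with global variables carrying the data between blocks. The block and variable names below are illustrative, not taken from the paper's source.

```c
/* Single-bit start signal and global data channel between two blocks. */
static int go_scale = 0;
static double chan_in = 0.0;
static double chan_out = 0.0;

static int scale_state = 0;   /* 0: waiting on go, 1: doing the work */

/* One clock step of a "scale" block: the while-delay loop of the text
   becomes state 0, broken only when another block raises go_scale. */
static void scale_block_step(void) {
    switch (scale_state) {
    case 0:
        if (go_scale) { go_scale = 0; scale_state = 1; }  /* go received */
        break;
    case 1:
        chan_out = 2.0 * chan_in;  /* the block's work for this call */
        scale_state = 0;           /* loop back and wait for the next go */
        break;
    }
}

/* A caller raises go instead of making a function call. */
static void call_scale(double value) {
    chan_in = value;
    go_scale = 1;
}
```

The benefit in hardware is that each block's state machine stays small and local: the caller's logic only drives a one-bit signal, rather than being folded into one large state machine with long routing and high fan-out.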
Complex arithmetic and control expressions are split over multiple clock cycles. Extra pipeline registers are inserted on the read ports of the memories and in any potentially long communication paths (mainly the global variables used for inter-block communication). Instruction level parallelism in each block is exploited, and resource sharing is examined and declared where appropriate.

5. RESULTS

Fig. 5 shows the single precision speedup achieved over a 3 GHz Pentium 4 when designs with various numbers of pipelines are targeted to Altera Stratix II devices. Fig. 6 shows the double precision speedup, again using Stratix II devices. In each case the smallest device that would accommodate the design was targeted, using the fastest speed grade available. The Handel-C source files were compiled to EDIF using Celoxica DK suite v4.0, and synthesis was performed using Altera Quartus II v4.2. Designs using floating point units from both the current Celoxica floating point library and the new pre-release library were tested for the single precision case. The current floating point library does not offer double precision, so no comparison can be made in that case. Equivalent designs were also targeted to Xilinx Virtex 4 LX devices. The results were comparable to those for Stratix II, but the largest Xilinx device holds fewer pipelines than the largest Altera device, so the ultimate speedup using Virtex 4 is not as high.
[Fig. 6. Double precision speedup over Pentium 4 using Altera Stratix II devices]

The first plot in Fig. 5 and Fig. 6 shows the speedup for just the dgefa subroutine (including an estimate of the time taken to transfer data across the PCI bus). The second plot estimates the overall speedup for the complete benchmark, assuming that the code not targeted to the FPGA is implemented on the 3 GHz Pentium 4. Using the current library, the largest Stratix II device (EP2S180) can implement the dgefa subroutine 7 times faster than the 3 GHz Pentium 4, at a maximum frequency of 98 MHz. This improves to an 8.5 times speedup at 94 MHz when the new library is used: the smaller size of the improved library components allows up to 48 parallel pipelines, compared to only 36 with the current library. These results translate to an overall benchmark speedup of 4.5 times for the current library and 4.9 times for the new library. The maximum speedup achieved for the double precision benchmark is 4.4 times, with a speedup of 6.8 times for the dgefa subroutine. When more than 16 single precision or 8 double precision pipelines are synthesized, an EP2S130 or EP2S180 device must be used. This causes a drop in the maximum clock frequency that can be achieved, for two reasons. Firstly, the version of Quartus II used does not allow the fastest speed grade of these devices (-3) to be targeted, so slower (-4) devices were targeted instead. Secondly, for the larger designs the physical synthesis optimizations in Quartus II had to be turned off, as they caused the fitter to reach the limit of its addressable memory. These problems can be overcome if a newer 64-bit version of Quartus II is used. In that case the results indicated by the dashed lines should be reachable, giving potentially a 10 times speedup for single precision and a 9 times speedup for double precision.

6. CONCLUSIONS

An optimized Handel-C implementation of a coprocessor for both single and double precision versions of the LINPACK 1000 benchmark was produced through a series of optimizations that exploit data reuse, loop pipelining, coarse grained parallelism, parallel memory assignment and fine grained parallelism. When targeted to the latest generation of FPGAs this coprocessor can execute parts of the double precision benchmark up to 6.8 times faster than a 3 GHz Pentium 4 processor and accelerate the execution of the complete benchmark by a factor of 4.4. This suggests that it is currently possible to accelerate scientific computing algorithms using FPGAs, even when using high level design methods and tools. However, it has also been shown that significant effort is required to convert standard ANSI C algorithms into efficient, parallel Handel-C. As a result, development times for hardware are still much longer than those for software, even when high level languages are used. Furthermore, an awareness of how specific high level code will be synthesized in hardware is still required to produce efficient designs, which prevents software designers from migrating seamlessly to hardware. It seems that further design tools, possibly automating optimizations similar to those presented here, are still required to make hardware design as quick and efficient as software design.
7. REFERENCES

[1] K. Underwood, "FPGAs vs. CPUs: Trends in Peak Floating Point Performance," Proc. Int. Symp. Field Programmable Gate Arrays, February.
[2] P. Belanovic, M. Leeser, "A library of parameterized floating point modules and their use," Proc. Int. Conf. Field Programmable Logic and Applications, September.
[3] J. Dido, et al., "A flexible floating point format for optimizing data paths and operators in FPGA based DSPs," Proc. Int. Symp. Field Programmable Gate Arrays, February.
[4] X. Wang, B. E. Nelson, "Tradeoffs of designing floating point division and square root on Virtex FPGAs," Proc. IEEE Symp. Field-Programmable Custom Computing Machines, April.
[5] E. Roesler, B. Nelson, "Novel optimizations for hardware floating point units in a modern FPGA architecture," Proc. Int. Conf. Field Programmable Logic and Applications, September.
[6] W. Ligon, et al., "A re-evaluation of the practicality of floating point on FPGAs," Proc. IEEE Symp. Field-Programmable Custom Computing Machines, April.
[7] K. Underwood, K. Hemmert, "Closing the gap: CPU and FPGA trends in sustainable floating point BLAS performance," Proc. IEEE Symp. Field-Programmable Custom Computing Machines, April.
[8]
[9]
[10] J. Dongarra, P. Luszczek, A. Petitet, "The LINPACK benchmark: Past, Present and Future," l.pdf
[11] ENGSPDPDKPDK_4.1_Software_Product_Description pdf
[12] ctures/chap02.pdf
[13]
[14]
[15] H. Rong, Z. Tang, R. Govindarajan, A. Douillet, G. Gao, "Single-Dimension Software Pipelining for Multi-Dimensional Loops," Proc. IEEE Int. Symp. Code Generation and Optimization, March 2004.

ACKNOWLEDGEMENTS

This work was supported by the EPSRC (EP/C549481/1).
More informationHigh Speed Pipelined Architecture for Adaptive Median Filter
Abstract High Speed Pipelined Architecture for Adaptive Median Filter D.Dhanasekaran, and **Dr.K.Boopathy Bagan *Assistant Professor, SVCE, Pennalur,Sriperumbudur-602105. **Professor, Madras Institute
More informationSparse LU Decomposition using FPGA
Sparse LU Decomposition using FPGA Jeremy Johnson 1, Timothy Chagnon 1, Petya Vachranukunkiet 2, Prawat Nagvajara 2, and Chika Nwankpa 2 CS 1 and ECE 2 Departments Drexel University, Philadelphia, PA jjohnson@cs.drexel.edu,tchagnon@drexel.edu,pv29@drexel.edu,
More informationHow Much Logic Should Go in an FPGA Logic Block?
How Much Logic Should Go in an FPGA Logic Block? Vaughn Betz and Jonathan Rose Department of Electrical and Computer Engineering, University of Toronto Toronto, Ontario, Canada M5S 3G4 {vaughn, jayar}@eecgutorontoca
More informationECE 636. Reconfigurable Computing. Lecture 2. Field Programmable Gate Arrays I
ECE 636 Reconfigurable Computing Lecture 2 Field Programmable Gate Arrays I Overview Anti-fuse and EEPROM-based devices Contemporary SRAM devices - Wiring - Embedded New trends - Single-driver wiring -
More informationCHAPTER 3 METHODOLOGY. 3.1 Analysis of the Conventional High Speed 8-bits x 8-bits Wallace Tree Multiplier
CHAPTER 3 METHODOLOGY 3.1 Analysis of the Conventional High Speed 8-bits x 8-bits Wallace Tree Multiplier The design analysis starts with the analysis of the elementary algorithm for multiplication by
More informationEmbedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institution of Technology, Delhi
Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institution of Technology, Delhi Lecture - 34 Compilers for Embedded Systems Today, we shall look at the compilers, which
More informationA Lost Cycles Analysis for Performance Prediction using High-Level Synthesis
A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,
More informationSDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center
SDAccel Environment The Xilinx SDAccel Development Environment Bringing The Best Performance/Watt to the Data Center Introduction Data center operators constantly seek more server performance. Currently
More informationUsing Intel Streaming SIMD Extensions for 3D Geometry Processing
Using Intel Streaming SIMD Extensions for 3D Geometry Processing Wan-Chun Ma, Chia-Lin Yang Dept. of Computer Science and Information Engineering National Taiwan University firebird@cmlab.csie.ntu.edu.tw,
More informationEarly Models in Silicon with SystemC synthesis
Early Models in Silicon with SystemC synthesis Agility Compiler summary C-based design & synthesis for SystemC Pure, standard compliant SystemC/ C++ Most widely used C-synthesis technology Structural SystemC
More informationIntroduction to reconfigurable systems
Introduction to reconfigurable systems Reconfigurable system (RS)= any system whose sub-system configurations can be changed or modified after fabrication Reconfigurable computing (RC) is commonly used
More informationCustomising Parallelism and Caching for Machine Learning
Customising Parallelism and Caching for Machine Learning Andreas Fidjeland and Wayne Luk Department of Computing Imperial College London akf99,wl @doc.ic.ac.uk Abstract Inductive logic programming is an
More informationLecture 13: March 25
CISC 879 Software Support for Multicore Architectures Spring 2007 Lecture 13: March 25 Lecturer: John Cavazos Scribe: Ying Yu 13.1. Bryan Youse-Optimization of Sparse Matrix-Vector Multiplication on Emerging
More informationLECTURE 10: Improving Memory Access: Direct and Spatial caches
EECS 318 CAD Computer Aided Design LECTURE 10: Improving Memory Access: Direct and Spatial caches Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses
More informationStratix II vs. Virtex-4 Performance Comparison
White Paper Stratix II vs. Virtex-4 Performance Comparison Altera Stratix II devices use a new and innovative logic structure called the adaptive logic module () to make Stratix II devices the industry
More informationGPU vs FPGA : A comparative analysis for non-standard precision
GPU vs FPGA : A comparative analysis for non-standard precision Umar Ibrahim Minhas, Samuel Bayliss, and George A. Constantinides Department of Electrical and Electronic Engineering Imperial College London
More informationLRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.
LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E
More informationEnergy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package
High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction
More informationCustom computing systems
Custom computing systems difference engine: Charles Babbage 1832 - compute maths tables digital orrery: MIT 1985 - special-purpose engine, found pluto motion chaotic Splash2: Supercomputing esearch Center
More informationIs There A Tradeoff Between Programmability and Performance?
Is There A Tradeoff Between Programmability and Performance? Robert Halstead Jason Villarreal Jacquard Computing, Inc. Roger Moussalli Walid Najjar Abstract While the computational power of Field Programmable
More information4DM4 Lab. #1 A: Introduction to VHDL and FPGAs B: An Unbuffered Crossbar Switch (posted Thursday, Sept 19, 2013)
1 4DM4 Lab. #1 A: Introduction to VHDL and FPGAs B: An Unbuffered Crossbar Switch (posted Thursday, Sept 19, 2013) Lab #1: ITB Room 157, Thurs. and Fridays, 2:30-5:20, EOW Demos to TA: Thurs, Fri, Sept.
More informationImproving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints
Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints Amit Kulkarni, Tom Davidson, Karel Heyse, and Dirk Stroobandt ELIS department, Computer Systems Lab, Ghent
More informationVerilog for High Performance
Verilog for High Performance Course Description This course provides all necessary theoretical and practical know-how to write synthesizable HDL code through Verilog standard language. The course goes
More informationExploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors
Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors G. Chen 1, M. Kandemir 1, I. Kolcu 2, and A. Choudhary 3 1 Pennsylvania State University, PA 16802, USA 2 UMIST,
More informationISSN Vol.05,Issue.09, September-2017, Pages:
WWW.IJITECH.ORG ISSN 2321-8665 Vol.05,Issue.09, September-2017, Pages:1693-1697 AJJAM PUSHPA 1, C. H. RAMA MOHAN 2 1 PG Scholar, Dept of ECE(DECS), Shirdi Sai Institute of Science and Technology, Anantapuramu,
More informationA Novel Design Framework for the Design of Reconfigurable Systems based on NoCs
Politecnico di Milano & EPFL A Novel Design Framework for the Design of Reconfigurable Systems based on NoCs Vincenzo Rana, Ivan Beretta, Donatella Sciuto Donatella Sciuto sciuto@elet.polimi.it Introduction
More informationThe Nios II Family of Configurable Soft-core Processors
The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture
More information9. Building Memory Subsystems Using SOPC Builder
9. Building Memory Subsystems Using SOPC Builder QII54006-6.0.0 Introduction Most systems generated with SOPC Builder require memory. For example, embedded processor systems require memory for software
More informationTradeoff Analysis and Architecture Design of a Hybrid Hardware/Software Sorter
Tradeoff Analysis and Architecture Design of a Hybrid Hardware/Software Sorter M. Bednara, O. Beyer, J. Teich, R. Wanka Paderborn University D-33095 Paderborn, Germany bednara,beyer,teich @date.upb.de,
More informationThe LINPACK Benchmark on a Multi-Core Multi-FPGA System. Emanuel Caldeira Ramalho
The LINPACK Benchmark on a Multi-Core Multi-FPGA System by Emanuel Caldeira Ramalho A thesis submitted in conformity with the requirements for the degree of Masters of Applied Science Graduate Department
More informationMemory Access Optimization and RAM Inference for Pipeline Vectorization
Memory Access Optimization and Inference for Pipeline Vectorization M. Weinhardt and W. Luk Springer-Verlag Berlin Heildelberg 1999. This paper was first published in Field-Programmable Logic and Applications,
More informationChapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture
An Introduction to Parallel Programming Peter Pacheco Chapter 2 Parallel Hardware and Parallel Software 1 The Von Neuman Architecture Control unit: responsible for deciding which instruction in a program
More informationA Reconfigurable Multifunction Computing Cache Architecture
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 9, NO. 4, AUGUST 2001 509 A Reconfigurable Multifunction Computing Cache Architecture Huesung Kim, Student Member, IEEE, Arun K. Somani,
More informationData Reuse Exploration under Area Constraints for Low Power Reconfigurable Systems
Data Reuse Exploration under Area Constraints for Low Power Reconfigurable Systems Qiang Liu, George A. Constantinides, Konstantinos Masselos and Peter Y.K. Cheung Department of Electrical and Electronics
More informationLaboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication
Laboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication Introduction All processors offer some form of instructions to add, subtract, and manipulate data.
More informationParallel FIR Filters. Chapter 5
Chapter 5 Parallel FIR Filters This chapter describes the implementation of high-performance, parallel, full-precision FIR filters using the DSP48 slice in a Virtex-4 device. ecause the Virtex-4 architecture
More informationARITHMETIC operations based on residue number systems
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 53, NO. 2, FEBRUARY 2006 133 Improved Memoryless RNS Forward Converter Based on the Periodicity of Residues A. B. Premkumar, Senior Member,
More informationEE/CSCI 451 Midterm 1
EE/CSCI 451 Midterm 1 Spring 2018 Instructor: Xuehai Qian Friday: 02/26/2018 Problem # Topic Points Score 1 Definitions 20 2 Memory System Performance 10 3 Cache Performance 10 4 Shared Memory Programming
More informationWhite Paper Assessing FPGA DSP Benchmarks at 40 nm
White Paper Assessing FPGA DSP Benchmarks at 40 nm Introduction Benchmarking the performance of algorithms, devices, and programming methodologies is a well-worn topic among developers and research of
More information8. Migrating Stratix II Device Resources to HardCopy II Devices
8. Migrating Stratix II Device Resources to HardCopy II Devices H51024-1.3 Introduction Altera HardCopy II devices and Stratix II devices are both manufactured on a 1.2-V, 90-nm process technology and
More informationAbstract A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE
A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE Reiner W. Hartenstein, Rainer Kress, Helmut Reinig University of Kaiserslautern Erwin-Schrödinger-Straße, D-67663 Kaiserslautern, Germany
More informationAdvanced optimizations of cache performance ( 2.2)
Advanced optimizations of cache performance ( 2.2) 30 1. Small and Simple Caches to reduce hit time Critical timing path: address tag memory, then compare tags, then select set Lower associativity Direct-mapped
More informationManaging Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks
Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Zhining Huang, Sharad Malik Electrical Engineering Department
More informationERROR MODELLING OF DUAL FIXED-POINT ARITHMETIC AND ITS APPLICATION IN FIELD PROGRAMMABLE LOGIC
ERROR MODELLING OF DUAL FIXED-POINT ARITHMETIC AND ITS APPLICATION IN FIELD PROGRAMMABLE LOGIC Chun Te Ewe, Peter Y. K. Cheung and George A. Constantinides Department of Electrical & Electronic Engineering,
More informationLarge-Scale Network Simulation Scalability and an FPGA-based Network Simulator
Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Stanley Bak Abstract Network algorithms are deployed on large networks, and proper algorithm evaluation is necessary to avoid
More informationFast evaluation of nonlinear functions using FPGAs
Adv. Radio Sci., 6, 233 237, 2008 Author(s) 2008. This work is distributed under the Creative Commons Attribution 3.0 License. Advances in Radio Science Fast evaluation of nonlinear functions using FPGAs
More informationA Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors
A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors Murali Jayapala 1, Francisco Barat 1, Pieter Op de Beeck 1, Francky Catthoor 2, Geert Deconinck 1 and Henk Corporaal
More informationLecture 41: Introduction to Reconfigurable Computing
inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture 41: Introduction to Reconfigurable Computing Michael Le, Sp07 Head TA April 30, 2007 Slides Courtesy of Hayden So, Sp06 CS61c Head TA Following
More informationPower Solutions for Leading-Edge FPGAs. Vaughn Betz & Paul Ekas
Power Solutions for Leading-Edge FPGAs Vaughn Betz & Paul Ekas Agenda 90 nm Power Overview Stratix II : Power Optimization Without Sacrificing Performance Technical Features & Competitive Results Dynamic
More informationTowards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing
Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing Walter Stechele, Stephan Herrmann, Andreas Herkersdorf Technische Universität München 80290 München Germany Walter.Stechele@ei.tum.de
More informationThe Xilinx XC6200 chip, the software tools and the board development tools
The Xilinx XC6200 chip, the software tools and the board development tools What is an FPGA? Field Programmable Gate Array Fully programmable alternative to a customized chip Used to implement functions
More informationHardware Oriented Security
1 / 20 Hardware Oriented Security SRC-7 Programming Basics and Pipelining Miaoqing Huang University of Arkansas Fall 2014 2 / 20 Outline Basics of SRC-7 Programming Pipelining 3 / 20 Framework of Program
More informationStratix vs. Virtex-II Pro FPGA Performance Analysis
White Paper Stratix vs. Virtex-II Pro FPGA Performance Analysis The Stratix TM and Stratix II architecture provides outstanding performance for the high performance design segment, providing clear performance
More informationA Configurable Multi-Ported Register File Architecture for Soft Processor Cores
A Configurable Multi-Ported Register File Architecture for Soft Processor Cores Mazen A. R. Saghir and Rawan Naous Department of Electrical and Computer Engineering American University of Beirut P.O. Box
More informationIntel HLS Compiler: Fast Design, Coding, and Hardware
white paper Intel HLS Compiler Intel HLS Compiler: Fast Design, Coding, and Hardware The Modern FPGA Workflow Authors Melissa Sussmann HLS Product Manager Intel Corporation Tom Hill OpenCL Product Manager
More informationA High Performance Bus Communication Architecture through Bus Splitting
A High Performance Communication Architecture through Splitting Ruibing Lu and Cheng-Kok Koh School of Electrical and Computer Engineering Purdue University,West Lafayette, IN, 797, USA {lur, chengkok}@ecn.purdue.edu
More informationAn Overview of a Compiler for Mapping MATLAB Programs onto FPGAs
An Overview of a Compiler for Mapping MATLAB Programs onto FPGAs P. Banerjee Department of Electrical and Computer Engineering Northwestern University 2145 Sheridan Road, Evanston, IL-60208 banerjee@ece.northwestern.edu
More informationA Multiprocessor Memory Processor for Efficient Sharing And Access Coordination
1 1 A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination David M. Koppelman Department of Electrical & Computer Engineering Louisiana State University, Baton Rouge koppel@ee.lsu.edu
More informationThe Power of Streams on the SRC MAP. Wim Bohm Colorado State University. RSS!2006 Copyright 2006 SRC Computers, Inc. ALL RIGHTS RESERVED.
The Power of Streams on the SRC MAP Wim Bohm Colorado State University RSS!2006 Copyright 2006 SRC Computers, Inc. ALL RIGHTS RSRV. MAP C Pure C runs on the MAP Generated code: circuits Basic blocks in
More informationAn Implementation of Double precision Floating point Adder & Subtractor Using Verilog
IOSR Journal of Electrical and Electronics Engineering (IOSR-JEEE) e-issn: 2278-1676,p-ISSN: 2320-3331, Volume 9, Issue 4 Ver. III (Jul Aug. 2014), PP 01-05 An Implementation of Double precision Floating
More informationV8-uRISC 8-bit RISC Microprocessor AllianceCORE Facts Core Specifics VAutomation, Inc. Supported Devices/Resources Remaining I/O CLBs
V8-uRISC 8-bit RISC Microprocessor February 8, 1998 Product Specification VAutomation, Inc. 20 Trafalgar Square Nashua, NH 03063 Phone: +1 603-882-2282 Fax: +1 603-882-1587 E-mail: sales@vautomation.com
More informationA Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System
A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System HU WEI, CHEN TIANZHOU, SHI QINGSONG, JIANG NING College of Computer Science Zhejiang University College of Computer
More informationSimulation & Synthesis of FPGA Based & Resource Efficient Matrix Coprocessor Architecture
Simulation & Synthesis of FPGA Based & Resource Efficient Matrix Coprocessor Architecture Jai Prakash Mishra 1, Mukesh Maheshwari 2 1 M.Tech Scholar, Electronics & Communication Engineering, JNU Jaipur,
More informationDevelopment of efficient computational kernels and linear algebra routines for out-of-order superscalar processors
Future Generation Computer Systems 21 (2005) 743 748 Development of efficient computational kernels and linear algebra routines for out-of-order superscalar processors O. Bessonov a,,d.fougère b, B. Roux
More informationAn Introduction to Parallel Programming
An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe
More informationIEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS /$ IEEE
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 Exploration of Heterogeneous FPGAs for Mapping Linear Projection Designs Christos-S. Bouganis, Member, IEEE, Iosifina Pournara, and Peter
More informationElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests
ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests Mingxing Tan 1 2, Gai Liu 1, Ritchie Zhao 1, Steve Dai 1, Zhiru Zhang 1 1 Computer Systems Laboratory, Electrical and Computer
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationLecture 20: Memory Hierarchy Main Memory and Enhancing its Performance. Grinch-Like Stuff
Lecture 20: ory Hierarchy Main ory and Enhancing its Performance Professor Alvin R. Lebeck Computer Science 220 Fall 1999 HW #4 Due November 12 Projects Finish reading Chapter 5 Grinch-Like Stuff CPS 220
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationFPGA Implementation of Multiplier for Floating- Point Numbers Based on IEEE Standard
FPGA Implementation of Multiplier for Floating- Point Numbers Based on IEEE 754-2008 Standard M. Shyamsi, M. I. Ibrahimy, S. M. A. Motakabber and M. R. Ahsan Dept. of Electrical and Computer Engineering
More informationThe S6000 Family of Processors
The S6000 Family of Processors Today s Design Challenges The advent of software configurable processors In recent years, the widespread adoption of digital technologies has revolutionized the way in which
More information