FPGA ACCELERATION OF THE LINPACK BENCHMARK USING HANDEL-C AND THE CELOXICA FLOATING POINT LIBRARY
Kieron Turkington, Konstantinos Masselos, George A. Constantinides
Department of Electrical and Electronic Engineering, Imperial College London, UK
{kjt01, k.masselos,

Philip Leong
Department of Computing, Imperial College London, UK
philip.leong@imperial.ac.uk

ABSTRACT

In this paper the latest FPGAs are compared to a high-end microprocessor with respect to sustained performance on a popular floating point CPU benchmark, namely LINPACK 1000. The FPGA hardware has been developed in Handel-C and uses the Celoxica Floating Point Library functions to perform both double and single precision arithmetic. A set of translation and optimization steps has been applied to transform a sequential C description of the LINPACK benchmark, based on a monolithic memory model, into a parallel Handel-C description that exploits the multiple memory resources available on a realistic reconfigurable computing platform. These transformations allow high levels of parallelism to be exploited within a limited memory bandwidth. The experimental results show that the latest generation of FPGAs, programmed using Handel-C, can achieve a sustained double precision floating point performance up to 6.8 times greater than that of the microprocessor while operating at a clock frequency 30 times lower.

1. INTRODUCTION

The dynamic range and accuracy demanded by many engineering and scientific applications have led to the widespread use of floating point arithmetic on conventional computers [1]. In recent years the continued growth in FPGA densities has produced devices capable of implementing ever increasing numbers of both single and double precision floating point cores. As a result, significant research effort has recently been expended on investigating the floating point potential of FPGAs [2, 3, 4, 5, 6].
Significantly, the work presented in [1] and [7] shows that FPGAs can compete with commodity microprocessors in terms of both peak and sustained floating point performance. While microprocessors and FPGAs offer comparable performance, a large gap remains between the design flows for the two platforms: FPGA programming is still performed largely through conventional register-transfer level (RTL) design flows, while software for microprocessors is written in behavioral languages. The authors of [1] predict that FPGAs could hold an order of magnitude advantage over microprocessors for floating point performance within ten years. However, for FPGAs to be considered in the future as a competitive platform for scientific computing, efficient, software-like design flows must be developed. These design flows must allow the full processing power of the FPGAs to be harnessed in a design time comparable to that for a microprocessor. A number of commercial products already allow FPGAs to be programmed at higher levels of abstraction [8, 9]. While these tools can bring FPGA programming closer to software development, they often reduce the ultimate performance of the final design compared to hand-coded VHDL or Verilog. Little work has been done to investigate the current floating point potential of FPGAs when higher level design techniques and tools are used. In this paper the floating point performance of the latest FPGAs (Stratix II and Virtex 4), programmed at a level above RTL (using Handel-C), is compared to that of a commodity microprocessor (a 3 GHz Intel Pentium 4). The performance of both platforms is measured on the LINPACK 1000 benchmark [10], a popular floating point benchmark widely used to test the performance of microprocessors. The currently released Celoxica floating point library (PDK v4.0 [11]) allows only single precision algorithms to be written in Handel-C.
However, we also present results from a soon-to-be-released library that allows both single and double precision floating point cores to be instantiated. The single precision performance of this new library is compared to that of the current library, and the performance of both libraries is compared to that of the microprocessor. The rest of the paper is organized as follows. Section 2 gives a brief description of the LINPACK 1000 benchmark. Section 3 describes the general platform this work targets and the partitioning of the benchmark. Section 4 describes how the critical sections of the LINPACK benchmark are implemented on an FPGA and presents a four-stage process for converting from C to
Handel-C. In Section 5 the results for this implementation are presented and Section 6 contains the conclusions drawn from this work.

2. LINPACK 1000

The LINPACK family of benchmarks solves a system of linear equations using LU decomposition and is typical of many matrix-based scientific computations. This work focuses on the LINPACK 1000 benchmark, the variant most commonly used to evaluate general purpose microprocessors. LINPACK 1000 solves a random linear system of order 1000, measures the average floating point performance achieved and reports the residual error produced. The pseudo code for the operation of the LINPACK 1000 benchmark is given in Fig. 1.

begin
    -- generate random linear system Ax = b --
    matgen(&A[ ][ ], &b[ ]);
    -- start timer --
    t1 = second();
    -- apply LU decomposition --
    dgefa(&A[ ][ ], &ipvt[ ]);
    -- solve simplified system --
    x[ ] = dgesl(&A[ ][ ], &b[ ], ipvt[ ]);
    -- store time taken --
    t1 = second() - t1;
    -- calc average FLOPS --
    ops = 2/3*(1000)^3 + 2*(1000)^2;
    flops = ops / t1;
    -- regenerate original A[ ] & b[ ] --
    matgen(&A[ ][ ], &b[ ]);
    -- calculate b = Ax - b --
    b[ ] = -b[ ];
    dmxpy(A[ ][ ], &b[ ], x[ ], 1000);
    -- find residual and normalise --
    resid = max(b[ ]);
    residn = norm(resid);
    print(t1, flops, resid, residn);
end;

Fig. 1. Pseudo code for LINPACK 1000

The benchmark first generates a random 1000x1000 element matrix, A, and a 1000 element vector, b. The elements in A and b are all floating point numbers. The dgefa subroutine simplifies the linear system by performing LU decomposition (by Gaussian elimination with partial pivoting) on A. A brief description of this method can be found in [12]. The dgesl subroutine then solves the simplified LU version of the original system to produce the vector x. The benchmark measures the time taken to complete the dgefa and dgesl subroutines and calculates the average FLOPS achieved.
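The dgefa/dgesl pair at the heart of Fig. 1 can be illustrated with a small self-contained C sketch: LU decomposition by Gaussian elimination with partial pivoting, the two triangular solves, and the nominal LINPACK operation count (2n^3/3 + 2n^2) used to convert the measured time into FLOPS. This is an illustrative reduction to a fixed 3x3 system, not the benchmark source; only the routine names follow the pseudo code.

```c
#include <math.h>

#define N 3   /* illustrative order; the benchmark uses n = 1000 */

/* dgefa: factorise A in place by Gaussian elimination with partial
   pivoting, storing negated multipliers below the diagonal and the
   pivot row for each column in ipvt (LINPACK's storage convention). */
static void dgefa_sketch(double a[N][N], int ipvt[N]) {
    for (int k = 0; k < N - 1; k++) {
        int piv = k;                       /* idamax: largest |a[i][k]| */
        for (int i = k + 1; i < N; i++)
            if (fabs(a[i][k]) > fabs(a[piv][k])) piv = i;
        ipvt[k] = piv;
        if (a[piv][k] == 0.0) continue;    /* singular column, skip */
        for (int j = k; j < N; j++) {      /* swap rows k and piv */
            double t = a[piv][j]; a[piv][j] = a[k][j]; a[k][j] = t;
        }
        double t = -1.0 / a[k][k];
        for (int i = k + 1; i < N; i++)    /* dscal: form multipliers */
            a[i][k] *= t;
        for (int j = k + 1; j < N; j++)    /* daxpy: update each column */
            for (int i = k + 1; i < N; i++)
                a[i][j] += a[k][j] * a[i][k];
    }
    ipvt[N - 1] = N - 1;
}

/* dgesl: solve A*x = b using the factors from dgefa; x overwrites b. */
static void dgesl_sketch(double a[N][N], const int ipvt[N], double b[N]) {
    for (int k = 0; k < N - 1; k++) {      /* forward elimination on b */
        double t = b[ipvt[k]]; b[ipvt[k]] = b[k]; b[k] = t;
        for (int i = k + 1; i < N; i++)
            b[i] += t * a[i][k];
    }
    for (int k = N - 1; k >= 0; k--) {     /* back substitution */
        b[k] /= a[k][k];
        for (int i = 0; i < k; i++)
            b[i] -= b[k] * a[i][k];
    }
}

/* Nominal operation count LINPACK charges for an order-n solve. */
static double linpack_mflops(double n, double seconds) {
    double ops = (2.0 / 3.0) * n * n * n + 2.0 * n * n;
    return ops / seconds / 1.0e6;
}
```

For the profiled single precision run described below, linpack_mflops(1000, 1.03) evaluates to roughly 649, consistent with the reported 647 MFLOPS (the benchmark times only the dgefa and dgesl subroutines internally).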
The remainder of the algorithm serves to estimate the normalized residual error in the result vector x.

Both single and double precision versions of the benchmark were profiled on an Intel Pentium 4 processor (3 GHz, 1 Mbyte L2 cache) running Windows XP, compiled with GCC (running under Cygwin). The entire single precision benchmark takes 1.03 seconds to complete, with an average floating point performance of 647 MFLOPS across the dgefa and dgesl subroutines. The double precision benchmark takes 1.61 seconds, achieving 411 MFLOPS. 90% of the execution time for LINPACK 1000 is spent in the dgefa subroutine. dgefa in turn calls a smaller function, named daxpy, many times, with the result that the latter alone accounts for 88% of the total execution time of the benchmark.

3. TARGET PLATFORM

[Fig. 2. Target platform: a host microprocessor system connected over a PCI bus to a coprocessor comprising an FPGA, a PCI interface/controller and SRAM banks.]

This work targets the general platform shown in Fig. 2. It consists of a host microprocessor system linked to an FPGA based coprocessor via a PCI interface. The coprocessor features a high density FPGA, a PCI interface/controller and on-board memory resources. The memory on the coprocessor is assumed to be SRAM, with up to 6 banks providing a maximum of 16 Mbytes. The data width of each bank is assumed to be 32 bits. Although this does not match any existing development board, the features specified are realistic and similar to those found on a number of boards, including the Celoxica RC300 [13] and RC2000 [14]. When using a target platform with both microprocessor and FPGA resources it is typical to partition the target algorithm between the two. Frequently reused code with the potential for parallelism (usually loops) is targeted to the FPGA while the rest of the code is implemented on the microprocessor. Due to their lower clock speeds, FPGAs only provide acceleration when their resources are used to implement sections of the algorithm in parallel.
Therefore only the most reused sections of code should be targeted to the FPGA, to maximize parallelism. However, communication between the host and coprocessor should be minimized so
that the PCI bus does not become a bottleneck in the system. This leads to a tradeoff between maximizing the potential for parallelism and minimizing communication. The benchmark was partitioned so that the entire dgefa (LU decomposition) subroutine is implemented on the FPGA. This limits the communication to two block transfers, one at the start of dgefa and the other at the end. The pseudo code for dgefa is given in Fig. 3. dgefa contains a nested loop with three levels (the third is in the daxpy function) and so provides reasonable opportunities for parallelism. Since it accounts for 90% of the execution time of the benchmark it offers the potential for good acceleration of the complete algorithm.

4. OPTIMIZATION OF THE DGEFA ROUTINE

This work uses an ANSI C version of the LINPACK benchmark as its starting point. The code partitioned to the FPGA (the dgefa subroutine) must be converted into Handel-C and optimized to maximize parallelism. Before any optimizations are considered, some simple translation steps are performed to convert the ANSI C syntax into valid Handel-C. These include the removal of side effects and the replacement of floating point arithmetic with Handel-C floating point macro procedures. In addition, pointer-based array access is converted into direct array addressing, which simplifies memory optimization in later stages. To complete the translation to Handel-C, word lengths are assigned to all integer variables in the code. While this final step is normally a complex optimization, it is included as a translation step because Handel-C code will often not compile without it. The result of these translation steps is valid Handel-C code that can be compiled to hardware.
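The effect of these translation steps on the benchmark's hot inner loop can be illustrated in plain C (the Handel-C macro calls and word-length annotations are omitted). The pointer-walking version below is typical of the ANSI C source; the second version removes the side effects from expressions and uses direct array addressing. Both loops are illustrative sketches, not the benchmark's actual code.

```c
/* Inner loop of daxpy (y = t*x + y) as it might appear in the ANSI C
   source: pointer arithmetic with side effects inside expressions. */
static void daxpy_ptr(int n, double t, const double *x, double *y) {
    while (n-- > 0)
        *y++ += t * *x++;          /* three side effects in one statement */
}

/* After the translation steps: no side effects within expressions and
   direct array addressing, ready for conversion to Handel-C syntax. */
static void daxpy_indexed(int n, double t, const double x[], double y[]) {
    for (int i = 0; i < n; i = i + 1) {
        y[i] = y[i] + t * x[i];    /* single assignment per statement */
    }
}
```

The two versions are functionally identical; the indexed form simply makes every memory access and update explicit, which is what the later memory optimization stages rely on.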
However, the hardware generated will be very inefficient, so the following four optimization stages are applied to exploit the parallelism and data reuse available in the algorithm.

4.1 Data Reuse Exploitation

The dgefa subroutine accesses the 1000x1000 element A matrix for almost every operation. Due to its large size, A must be stored in the off-chip SRAM on the coprocessor. There is limited bandwidth for accessing the off-chip SRAM (up to 6 ports), which could limit the parallelism that can be exploited. However, it may be possible to reduce accesses to A by buffering and reusing data on the FPGA, allowing more parallelism in the final system. The pseudo code for the dgefa subroutine is shown in Fig. 3.

dgefa(*A[ ][ ], *ipvt[ ])
begin
    for (k = 0 : 998)                  -- loop 1 --
        -- find the pivot index of column k --
        piv = idamax(A[ ][k], k) + k;
        ipvt[k] = piv;
        if (A[piv][k] != 0)
            -- swap matrix elements if k != piv --
            swap(&A[piv][k], &A[k][k]);
            t = -1/(A[k][k]);
            -- scale column k by t --
            dscal(&A[ ][k], t, k);
            for (j = (k+1) : 999)      -- loop 2 --
                -- swap matrix elements if k != piv --
                swap(&A[piv][j], &A[k][j]);
                t = A[piv][j];
                -- A[ ][j] = t*A[ ][k] + A[ ][j] --
                daxpy(&A[ ][j], A[ ][k], t, k);
            end for;                   -- end loop 2 --
        end if;
    end for;                           -- end loop 1 --
    ipvt[999] = 999;
end;

Fig. 3. Pseudo code for dgefa subroutine

The daxpy function requires data from two matrix columns as its inputs, denoted A[ ][j] and A[ ][k] (where j and k are the column numbers). For a given iteration of loop 1 the same A[ ][k] column is used in each iteration of loop 2, meaning that, if the correct A[ ][k] data is buffered on the FPGA at the start of loop 1, it can be reused for every iteration of loop 2. More importantly, the A[ ][k] data used in iteration k+1 of loop 1 is calculated as a result column in iteration k. If this column of data is stored on the FPGA as it is generated then only the A[ ][j] data must be read from the external memory, instead of both the A[ ][k] and A[ ][j] data.
This effectively halves the number of reads from the external memory and halves the bandwidth required. The A[ ][k] column for the very first iteration of loop 1 must still be read from the external memory, but this now represents a small one-off initialization cost. This data reuse can be implemented using two new on-chip buffers, each capable of storing one column of matrix data. One buffer, called k_col, stores the A[ ][k] data for the current iteration of loop 1. The second buffer, called k_next, stores the A[ ][k] data needed for the next iteration as it is generated.

4.2 Loop Pipelining

Loop pipelining is a well known optimization that allows parallelism to be exploited in loops without increasing the arithmetic logic resources required (although there may be a control overhead). The three functions called in the dgefa subroutine, idamax, dscal and daxpy, all contain a further loop level. Each also contains calls to pipelined floating point macros from the Celoxica floating point library. There are no loop carried dependencies in
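The buffering scheme described above can be sketched in C. The function below performs one daxpy column update, reading the A[ ][k] operand from the on-chip k_col buffer rather than from external memory, and capturing the column that will become the next A[ ][k] into k_next as it is produced (when j equals k+1). The buffer names follow the text; the code is an illustrative model of the data movement, not the Handel-C implementation.

```c
/* One loop-2 iteration: A[ ][j] += t * A[ ][k], with A[ ][k] served from
   the on-chip k_col buffer.  When this j column is the one that will act
   as A[ ][k] in the next loop-1 iteration, it is copied into k_next as
   it is generated, so it never has to be re-read from the SRAM. */
static void daxpy_with_reuse(int rows, double t,
                             const double k_col[],  /* on-chip buffer  */
                             double a_j[],          /* external column */
                             double k_next[],       /* on-chip buffer  */
                             int capture_next)      /* j == k+1 ?      */
{
    for (int i = 0; i < rows; i++) {
        a_j[i] = a_j[i] + t * k_col[i];  /* 1 external read, 1 write */
        if (capture_next)
            k_next[i] = a_j[i];          /* on-chip copy, no SRAM access */
    }
}
```

Each inner-loop iteration now touches the external SRAM exactly twice (one read of A[i][j], one write back), rather than three times, which is where the halving of the read bandwidth comes from.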
any of these inner loops, so the iterations of each loop can be pipelined into the stages of the floating point units. The outer loops of the dgefa subroutine could have been pipelined instead, using the methods presented in [15] for example, but the dependencies at these loop levels make them less favorable. Since all of the floating point cores are internally pipelined, pipelining the inner loops does not increase the resource usage. However, it does place extra strain on the external memory. The pipelined daxpy function must complete one read and one write to the external SRAM for every iteration of its inner loop. A new inner loop iteration can be initiated every cycle provided these memory operations can complete in one cycle. Each bank of the external SRAM has only a single port, so the A matrix must be partitioned across multiple banks in such a way that the same bank is never required to perform a read and a write on the same cycle. This can be achieved by placing the odd and even rows of the matrix in different SRAM banks. As the pipeline works down each column it reads from alternating odd and even rows and writes to alternating odd and even rows. The memory controller is scheduled so that when an odd row is read an even row is written (and vice versa). There is sufficient slack in the schedule to allow a write to be delayed by a cycle when necessary to avoid conflict.

4.3 Outer Loop Parallelization

Simply pipelining the inner loops does not yield sufficient parallelism to outperform the Pentium 4, so the outer loops of the dgefa subroutine must somehow be executed in parallel.

Processor 0:  idamax[0]  dscal[0]   daxpy[1]   daxpy[2]   daxpy[3]   daxpy[4]  daxpy[5]  daxpy[6]
Processor 1:                        idamax[1]  dscal[1]   daxpy[2]   daxpy[3]  daxpy[4]  daxpy[5]
Processor 2:                                              idamax[2]  dscal[2]  daxpy[3]  daxpy[4]

Fig. 4. Schedule for 3 parallel processing units
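The schedule of Fig. 4 can be captured by a small C model: each pipeline starts two slots after its predecessor, runs idamax then dscal on its own column, and then issues one daxpy per slot on successive columns. The two-slot offset is read off the figure; the function is a scheduling illustration under that assumption, not part of the design itself.

```c
#include <stdio.h>
#include <string.h>

/* Task executed by pipeline p in schedule slot s, per Fig. 4. */
static void slot_task(int p, int s, char out[32]) {
    if (s < 2 * p)
        strcpy(out, "idle");                  /* pipeline not started yet */
    else if (s == 2 * p)
        sprintf(out, "idamax[%d]", p);        /* pivot search on column p */
    else if (s == 2 * p + 1)
        sprintf(out, "dscal[%d]", p);         /* scale column p */
    else
        sprintf(out, "daxpy[%d]", s - p - 1); /* one column update per slot */
}
```

For example, pipeline 1 in slot 4 performs daxpy[2], one slot after pipeline 0 has produced that column's input data, which is exactly the offset that lets data flow between pipelines without an external memory round trip.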
However, the required memory bandwidth must not scale with the parallelism achieved, as there are (at most) six ports to the external SRAM, two of which are already used on almost every cycle. Fortunately, there is a loop carried dependency in loop 1 (Fig. 3) such that all of the input data used by iteration n+1 is generated in iteration n. This means that, if multiple loop 1 iterations are run in parallel with their start times offset correctly, data can effectively flow from the output of one iteration to the input of the next without going through the external memory. This allows arbitrary outer loop parallelism with only two ports to the external memory.

[Fig. 5. Structure of dgefa coprocessor]

Fig. 4 shows how the first three iterations of loop 1 can be scheduled across three parallel loop processing pipelines. The number of pipelines can be varied using a #define in the Handel-C source code. In Fig. 4 the notation idamax[k] represents the execution of the idamax function on column k of the matrix; the same notation is used for the dscal and daxpy functions. All of the remaining code in the dgefa subroutine has been moved into one of these three functions to simplify the system. idamax, dscal and daxpy all contain loops iterating over the same bounds, so they take the same number of clock cycles to run (+/- 2%). This allows instances of the three functions to be scheduled as if they were individual instructions. idamax[k] operates on the data generated in daxpy[k]. As the output data from daxpy[k] is generated it is sent to idamax[k] through global variables, allowing the two functions to run in parallel. daxpy[k] in one pipeline runs in parallel with daxpy[k+1] for the previous outer loop iteration, but uses the data generated by the previous iteration's daxpy[k].
Hence, each pipeline must include a buffer (capable of storing a column of matrix data) to hold the data generated by the previous pipeline in the chain until it is ready to be used by the operations in the daxpy function. As can be seen in Fig. 4, idamax and dscal are never required by more than one pipeline at once, which allows them to be shared in the following optimization stage. It also means that the k_next buffer (introduced in Section 4.1) can be shared, since only idamax and dscal access it. Each pipeline must have its own k_col buffer, however, since
each pipeline uses a different A[ ][k] column as its input.

[Fig. 5. Single precision speedup over Pentium 4 using Altera Stratix II devices]

Only the last pipeline in the chain must write its A[ ][j] output data to the external memory. However, the A[ ][k] data stored in the k_col buffer of each pipeline must be written to the external memory at some point, since these columns form part of the final result matrix. Fortunately the final pipeline in the chain has sufficient free slots to output all the necessary data, so only a single write port to the external memory is required. Since there are now multiple write sources, an additional function must be included to arbitrate over the memory bus. The pipelines communicate with this function via global variables, allowing it to run in parallel with the normal operation of the system.

4.4 Low Level Optimization

At this point the code still has a software-like structure, with a single main process/function calling a number of functions and macro procedures. In Handel-C this can often lead to large, complex state machines with long routing and large fan-out, which in turn leads to a design with a low clock frequency. To avoid this problem the code is restructured. Each functional block in the system is redefined as a main function, giving it a separate state machine. Each block is altered so that it loops repeatedly over its contents, with a while delay loop at the start of the function to hold it until it is called by another block. Function calls within each block are replaced with single-bit go signals that break the while delay loop in the functional block being called. Global variables allow data communication between blocks. This gives the final system the structure shown in Fig. 5. The blocks labeled pipex each contain the hardware to implement the daxpy function and the dgefa loops around it. As a final step a number of general low level optimizations are applied.
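The restructuring described above can be caricatured in C: each block becomes its own looping state machine that idles until a one-bit go signal releases it, with global variables carrying the data between blocks. The block and variable names below are illustrative, not taken from the paper's source.

```c
/* Single-bit start signal and global data channel between two blocks. */
static int go_scale = 0;
static double chan_in = 0.0;
static double chan_out = 0.0;

static int scale_state = 0;   /* 0: waiting on go, 1: doing the work */

/* One clock step of a "scale" block: the while-delay loop of the text
   becomes state 0, broken only when another block raises go_scale. */
static void scale_block_step(void) {
    switch (scale_state) {
    case 0:
        if (go_scale) { go_scale = 0; scale_state = 1; }  /* go received */
        break;
    case 1:
        chan_out = 2.0 * chan_in;  /* the block's work for this call */
        scale_state = 0;           /* loop back and wait for the next go */
        break;
    }
}

/* A caller raises go instead of making a function call. */
static void call_scale(double value) {
    chan_in = value;
    go_scale = 1;
}
```

The benefit in hardware is that each block's state machine stays small and local: the caller's logic only drives a one-bit signal, rather than being folded into one large state machine with long routing and high fan-out.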
Complex arithmetic and control expressions are split over multiple clock cycles. Extra pipeline registers are inserted on the read ports of the memories and in any potentially long communication paths (mainly the global variables used for inter-block communication). Instruction level parallelism in each block is exploited, and resource sharing is examined and declared where appropriate.

5. RESULTS

Fig. 5 shows the single precision speedup achieved over a 3 GHz Pentium 4 when designs with various numbers of pipelines are targeted to Altera Stratix II devices. Fig. 6 shows the double precision speedup, again using Stratix II devices. In each case the smallest device that would accommodate the design was targeted, using the fastest speed grade available. The Handel-C source files were compiled to EDIF using Celoxica DK suite v4.0, and synthesis was performed using Altera Quartus II v4.2. Designs using floating point units from both the current Celoxica floating point library and the new pre-release library were tested for the single precision case. The current floating point library does not offer double precision, so no comparison can be made in that case. Equivalent designs were also targeted to Xilinx Virtex 4 LX devices. The results were comparable to those for Stratix II, but the largest Xilinx device holds fewer pipelines than the largest Altera device, so the ultimate speedup using Virtex 4 is not as high.
[Fig. 6. Double precision speedup over Pentium 4 using Altera Stratix II devices]

The first plot in Fig. 5 and Fig. 6 shows the speedup for just the dgefa subroutine (including an estimate of the time taken to transfer data across the PCI bus). The second plot estimates the overall speedup for the complete benchmark, assuming that the code not targeted to the FPGA is implemented on the 3 GHz Pentium 4. Using the current library, the largest Stratix II device (EP2S180) can implement the dgefa subroutine 7 times faster than the 3 GHz Pentium 4, at a maximum frequency of 98 MHz. This improves to an 8.5 times speedup at 94 MHz when the new library is used: the smaller size of the improved library components allows up to 48 parallel pipelines, compared to only 36 with the current library. These results translate to an overall benchmark speedup of 4.5 times for the current library and 4.9 times for the new library. The maximum speedup achieved for the double precision benchmark is 4.4 times, with a speedup of 6.8 times for the dgefa subroutine. When more than 16 single precision or 8 double precision pipelines are synthesized, an EP2S130 or EP2S180 device must be used. This causes a drop in the maximum clock frequency that can be achieved, for two reasons. Firstly, the version of Quartus II used does not allow the fastest speed grade of these devices (-3) to be targeted, so slower (-4) devices were targeted instead. Secondly, for the larger designs the physical synthesis optimizations in Quartus II had to be turned off, as they caused the fitter to reach the limit of its addressable memory. These problems can be overcome if a newer 64-bit version of Quartus II is used. In that case the results indicated by the dashed lines should be reachable, giving potentially a 10 times speedup for single precision and a 9 times speedup for double precision.

6. CONCLUSIONS

An optimized Handel-C implementation of a coprocessor for both single and double precision versions of the LINPACK 1000 benchmark was produced through a series of optimizations that exploit data reuse, loop pipelining, coarse grained parallelism, parallel memory assignment and fine grained parallelism. When targeted to the latest generation of FPGAs this coprocessor can execute parts of the double precision benchmark up to 6.8 times faster than a 3 GHz Pentium 4 processor and accelerate the execution of the complete benchmark by a factor of 4.4. This suggests that it is currently possible to accelerate scientific computing algorithms using FPGAs, even when using high level design methods and tools. However, it has also been shown that significant effort is required to convert standard ANSI C algorithms into efficient, parallel Handel-C. As a result, development times for hardware are still much longer than those for software, even when high level languages are used. Furthermore, an awareness of how specific high level code will be synthesized in hardware is still required to produce efficient designs, which prevents software designers from migrating seamlessly to hardware. It seems that further design tools, possibly automating optimizations similar to those presented here, are still required to make hardware design as quick and efficient as software design.
7. REFERENCES

[1] K. Underwood, "FPGAs vs. CPUs: Trends in Peak Floating Point Performance," Proc. Int. Symp. Field Programmable Gate Arrays, February.
[2] P. Belanovic, M. Leeser, "A library of parameterized floating point modules and their use," Proc. Int. Conf. Field Programmable Logic and Applications, September.
[3] J. Dido, et al., "A flexible floating point format for optimizing data paths and operators in FPGA based DSPs," Proc. Int. Symp. Field Programmable Gate Arrays, February.
[4] X. Wang, B. E. Nelson, "Tradeoffs of designing floating point division and square root on Virtex FPGAs," Proc. IEEE Symp. Field-Programmable Custom Computing Machines, April.
[5] E. Roesler, B. Nelson, "Novel optimizations for hardware floating point units in a modern FPGA architecture," Proc. Int. Conf. Field Programmable Logic and Applications, September.
[6] W. Ligon, et al., "A re-evaluation of the practicality of floating point on FPGAs," Proc. IEEE Symp. Field-Programmable Custom Computing Machines, April.
[7] K. Underwood, K. Hemmert, "Closing the gap: CPU and FPGA trends in sustainable floating point BLAS performance," Proc. IEEE Symp. Field-Programmable Custom Computing Machines, April.
[8]
[9]
[10] J. Dongarra, P. Luszczek, A. Petitet, "The LINPACK benchmark: Past, Present and Future," l.pdf
[11] ENGSPDPDKPDK_4.1_Software_Product_Description pdf
[12] ctures/chap02.pdf
[13]
[14]
[15] H. Rong, Z. Tang, R. Govindarajan, A. Douillet, G. Gao, "Single-Dimension Software Pipelining for Multi-Dimensional Loops," Proc. IEEE Int. Symp. Code Generation and Optimization, March 2004.

ACKNOWLEDGEMENTS

This work was supported by the EPSRC (EP/C549481/1).
More informationHigh Speed Pipelined Architecture for Adaptive Median Filter
Abstract High Speed Pipelined Architecture for Adaptive Median Filter D.Dhanasekaran, and **Dr.K.Boopathy Bagan *Assistant Professor, SVCE, Pennalur,Sriperumbudur-602105. **Professor, Madras Institute
More informationSparse LU Decomposition using FPGA
Sparse LU Decomposition using FPGA Jeremy Johnson 1, Timothy Chagnon 1, Petya Vachranukunkiet 2, Prawat Nagvajara 2, and Chika Nwankpa 2 CS 1 and ECE 2 Departments Drexel University, Philadelphia, PA jjohnson@cs.drexel.edu,tchagnon@drexel.edu,pv29@drexel.edu,
More informationHow Much Logic Should Go in an FPGA Logic Block?
How Much Logic Should Go in an FPGA Logic Block? Vaughn Betz and Jonathan Rose Department of Electrical and Computer Engineering, University of Toronto Toronto, Ontario, Canada M5S 3G4 {vaughn, jayar}@eecgutorontoca
More informationECE 636. Reconfigurable Computing. Lecture 2. Field Programmable Gate Arrays I
ECE 636 Reconfigurable Computing Lecture 2 Field Programmable Gate Arrays I Overview Anti-fuse and EEPROM-based devices Contemporary SRAM devices - Wiring - Embedded New trends - Single-driver wiring -
More informationCHAPTER 3 METHODOLOGY. 3.1 Analysis of the Conventional High Speed 8-bits x 8-bits Wallace Tree Multiplier
CHAPTER 3 METHODOLOGY 3.1 Analysis of the Conventional High Speed 8-bits x 8-bits Wallace Tree Multiplier The design analysis starts with the analysis of the elementary algorithm for multiplication by
More informationEmbedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institution of Technology, Delhi
Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institution of Technology, Delhi Lecture - 34 Compilers for Embedded Systems Today, we shall look at the compilers, which
More informationA Lost Cycles Analysis for Performance Prediction using High-Level Synthesis
A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,
More informationSDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center
SDAccel Environment The Xilinx SDAccel Development Environment Bringing The Best Performance/Watt to the Data Center Introduction Data center operators constantly seek more server performance. Currently
More informationUsing Intel Streaming SIMD Extensions for 3D Geometry Processing
Using Intel Streaming SIMD Extensions for 3D Geometry Processing Wan-Chun Ma, Chia-Lin Yang Dept. of Computer Science and Information Engineering National Taiwan University firebird@cmlab.csie.ntu.edu.tw,
More informationEarly Models in Silicon with SystemC synthesis
Early Models in Silicon with SystemC synthesis Agility Compiler summary C-based design & synthesis for SystemC Pure, standard compliant SystemC/ C++ Most widely used C-synthesis technology Structural SystemC
More informationIntroduction to reconfigurable systems
Introduction to reconfigurable systems Reconfigurable system (RS)= any system whose sub-system configurations can be changed or modified after fabrication Reconfigurable computing (RC) is commonly used
More informationCustomising Parallelism and Caching for Machine Learning
Customising Parallelism and Caching for Machine Learning Andreas Fidjeland and Wayne Luk Department of Computing Imperial College London akf99,wl @doc.ic.ac.uk Abstract Inductive logic programming is an
More informationLecture 13: March 25
CISC 879 Software Support for Multicore Architectures Spring 2007 Lecture 13: March 25 Lecturer: John Cavazos Scribe: Ying Yu 13.1. Bryan Youse-Optimization of Sparse Matrix-Vector Multiplication on Emerging
More informationLECTURE 10: Improving Memory Access: Direct and Spatial caches
EECS 318 CAD Computer Aided Design LECTURE 10: Improving Memory Access: Direct and Spatial caches Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses
More informationStratix II vs. Virtex-4 Performance Comparison
White Paper Stratix II vs. Virtex-4 Performance Comparison Altera Stratix II devices use a new and innovative logic structure called the adaptive logic module () to make Stratix II devices the industry
More informationGPU vs FPGA : A comparative analysis for non-standard precision
GPU vs FPGA : A comparative analysis for non-standard precision Umar Ibrahim Minhas, Samuel Bayliss, and George A. Constantinides Department of Electrical and Electronic Engineering Imperial College London
More informationLRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.
LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E
More informationEnergy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package
High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction
More informationCustom computing systems
Custom computing systems difference engine: Charles Babbage 1832 - compute maths tables digital orrery: MIT 1985 - special-purpose engine, found pluto motion chaotic Splash2: Supercomputing esearch Center
More informationIs There A Tradeoff Between Programmability and Performance?
Is There A Tradeoff Between Programmability and Performance? Robert Halstead Jason Villarreal Jacquard Computing, Inc. Roger Moussalli Walid Najjar Abstract While the computational power of Field Programmable
More information4DM4 Lab. #1 A: Introduction to VHDL and FPGAs B: An Unbuffered Crossbar Switch (posted Thursday, Sept 19, 2013)
1 4DM4 Lab. #1 A: Introduction to VHDL and FPGAs B: An Unbuffered Crossbar Switch (posted Thursday, Sept 19, 2013) Lab #1: ITB Room 157, Thurs. and Fridays, 2:30-5:20, EOW Demos to TA: Thurs, Fri, Sept.
More informationImproving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints
Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints Amit Kulkarni, Tom Davidson, Karel Heyse, and Dirk Stroobandt ELIS department, Computer Systems Lab, Ghent
More informationVerilog for High Performance
Verilog for High Performance Course Description This course provides all necessary theoretical and practical know-how to write synthesizable HDL code through Verilog standard language. The course goes
More informationExploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors
Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors G. Chen 1, M. Kandemir 1, I. Kolcu 2, and A. Choudhary 3 1 Pennsylvania State University, PA 16802, USA 2 UMIST,
More informationISSN Vol.05,Issue.09, September-2017, Pages:
WWW.IJITECH.ORG ISSN 2321-8665 Vol.05,Issue.09, September-2017, Pages:1693-1697 AJJAM PUSHPA 1, C. H. RAMA MOHAN 2 1 PG Scholar, Dept of ECE(DECS), Shirdi Sai Institute of Science and Technology, Anantapuramu,
More informationA Novel Design Framework for the Design of Reconfigurable Systems based on NoCs
Politecnico di Milano & EPFL A Novel Design Framework for the Design of Reconfigurable Systems based on NoCs Vincenzo Rana, Ivan Beretta, Donatella Sciuto Donatella Sciuto sciuto@elet.polimi.it Introduction
More informationThe Nios II Family of Configurable Soft-core Processors
The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture
More information9. Building Memory Subsystems Using SOPC Builder
9. Building Memory Subsystems Using SOPC Builder QII54006-6.0.0 Introduction Most systems generated with SOPC Builder require memory. For example, embedded processor systems require memory for software
More informationTradeoff Analysis and Architecture Design of a Hybrid Hardware/Software Sorter
Tradeoff Analysis and Architecture Design of a Hybrid Hardware/Software Sorter M. Bednara, O. Beyer, J. Teich, R. Wanka Paderborn University D-33095 Paderborn, Germany bednara,beyer,teich @date.upb.de,
More informationThe LINPACK Benchmark on a Multi-Core Multi-FPGA System. Emanuel Caldeira Ramalho
The LINPACK Benchmark on a Multi-Core Multi-FPGA System by Emanuel Caldeira Ramalho A thesis submitted in conformity with the requirements for the degree of Masters of Applied Science Graduate Department
More informationMemory Access Optimization and RAM Inference for Pipeline Vectorization
Memory Access Optimization and Inference for Pipeline Vectorization M. Weinhardt and W. Luk Springer-Verlag Berlin Heildelberg 1999. This paper was first published in Field-Programmable Logic and Applications,
More informationChapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture
An Introduction to Parallel Programming Peter Pacheco Chapter 2 Parallel Hardware and Parallel Software 1 The Von Neuman Architecture Control unit: responsible for deciding which instruction in a program
More informationA Reconfigurable Multifunction Computing Cache Architecture
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 9, NO. 4, AUGUST 2001 509 A Reconfigurable Multifunction Computing Cache Architecture Huesung Kim, Student Member, IEEE, Arun K. Somani,
More informationData Reuse Exploration under Area Constraints for Low Power Reconfigurable Systems
Data Reuse Exploration under Area Constraints for Low Power Reconfigurable Systems Qiang Liu, George A. Constantinides, Konstantinos Masselos and Peter Y.K. Cheung Department of Electrical and Electronics
More informationLaboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication
Laboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication Introduction All processors offer some form of instructions to add, subtract, and manipulate data.
More informationParallel FIR Filters. Chapter 5
Chapter 5 Parallel FIR Filters This chapter describes the implementation of high-performance, parallel, full-precision FIR filters using the DSP48 slice in a Virtex-4 device. ecause the Virtex-4 architecture
More informationARITHMETIC operations based on residue number systems
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 53, NO. 2, FEBRUARY 2006 133 Improved Memoryless RNS Forward Converter Based on the Periodicity of Residues A. B. Premkumar, Senior Member,
More informationEE/CSCI 451 Midterm 1
EE/CSCI 451 Midterm 1 Spring 2018 Instructor: Xuehai Qian Friday: 02/26/2018 Problem # Topic Points Score 1 Definitions 20 2 Memory System Performance 10 3 Cache Performance 10 4 Shared Memory Programming
More informationWhite Paper Assessing FPGA DSP Benchmarks at 40 nm
White Paper Assessing FPGA DSP Benchmarks at 40 nm Introduction Benchmarking the performance of algorithms, devices, and programming methodologies is a well-worn topic among developers and research of
More information8. Migrating Stratix II Device Resources to HardCopy II Devices
8. Migrating Stratix II Device Resources to HardCopy II Devices H51024-1.3 Introduction Altera HardCopy II devices and Stratix II devices are both manufactured on a 1.2-V, 90-nm process technology and
More informationAbstract A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE
A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE Reiner W. Hartenstein, Rainer Kress, Helmut Reinig University of Kaiserslautern Erwin-Schrödinger-Straße, D-67663 Kaiserslautern, Germany
More informationAdvanced optimizations of cache performance ( 2.2)
Advanced optimizations of cache performance ( 2.2) 30 1. Small and Simple Caches to reduce hit time Critical timing path: address tag memory, then compare tags, then select set Lower associativity Direct-mapped
More informationManaging Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks
Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Zhining Huang, Sharad Malik Electrical Engineering Department
More informationERROR MODELLING OF DUAL FIXED-POINT ARITHMETIC AND ITS APPLICATION IN FIELD PROGRAMMABLE LOGIC
ERROR MODELLING OF DUAL FIXED-POINT ARITHMETIC AND ITS APPLICATION IN FIELD PROGRAMMABLE LOGIC Chun Te Ewe, Peter Y. K. Cheung and George A. Constantinides Department of Electrical & Electronic Engineering,
More informationLarge-Scale Network Simulation Scalability and an FPGA-based Network Simulator
Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Stanley Bak Abstract Network algorithms are deployed on large networks, and proper algorithm evaluation is necessary to avoid
More informationFast evaluation of nonlinear functions using FPGAs
Adv. Radio Sci., 6, 233 237, 2008 Author(s) 2008. This work is distributed under the Creative Commons Attribution 3.0 License. Advances in Radio Science Fast evaluation of nonlinear functions using FPGAs
More informationA Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors
A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors Murali Jayapala 1, Francisco Barat 1, Pieter Op de Beeck 1, Francky Catthoor 2, Geert Deconinck 1 and Henk Corporaal
More informationLecture 41: Introduction to Reconfigurable Computing
inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture 41: Introduction to Reconfigurable Computing Michael Le, Sp07 Head TA April 30, 2007 Slides Courtesy of Hayden So, Sp06 CS61c Head TA Following
More informationPower Solutions for Leading-Edge FPGAs. Vaughn Betz & Paul Ekas
Power Solutions for Leading-Edge FPGAs Vaughn Betz & Paul Ekas Agenda 90 nm Power Overview Stratix II : Power Optimization Without Sacrificing Performance Technical Features & Competitive Results Dynamic
More informationTowards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing
Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing Walter Stechele, Stephan Herrmann, Andreas Herkersdorf Technische Universität München 80290 München Germany Walter.Stechele@ei.tum.de
More informationThe Xilinx XC6200 chip, the software tools and the board development tools
The Xilinx XC6200 chip, the software tools and the board development tools What is an FPGA? Field Programmable Gate Array Fully programmable alternative to a customized chip Used to implement functions
More informationHardware Oriented Security
1 / 20 Hardware Oriented Security SRC-7 Programming Basics and Pipelining Miaoqing Huang University of Arkansas Fall 2014 2 / 20 Outline Basics of SRC-7 Programming Pipelining 3 / 20 Framework of Program
More informationStratix vs. Virtex-II Pro FPGA Performance Analysis
White Paper Stratix vs. Virtex-II Pro FPGA Performance Analysis The Stratix TM and Stratix II architecture provides outstanding performance for the high performance design segment, providing clear performance
More informationA Configurable Multi-Ported Register File Architecture for Soft Processor Cores
A Configurable Multi-Ported Register File Architecture for Soft Processor Cores Mazen A. R. Saghir and Rawan Naous Department of Electrical and Computer Engineering American University of Beirut P.O. Box
More informationIntel HLS Compiler: Fast Design, Coding, and Hardware
white paper Intel HLS Compiler Intel HLS Compiler: Fast Design, Coding, and Hardware The Modern FPGA Workflow Authors Melissa Sussmann HLS Product Manager Intel Corporation Tom Hill OpenCL Product Manager
More informationA High Performance Bus Communication Architecture through Bus Splitting
A High Performance Communication Architecture through Splitting Ruibing Lu and Cheng-Kok Koh School of Electrical and Computer Engineering Purdue University,West Lafayette, IN, 797, USA {lur, chengkok}@ecn.purdue.edu
More informationAn Overview of a Compiler for Mapping MATLAB Programs onto FPGAs
An Overview of a Compiler for Mapping MATLAB Programs onto FPGAs P. Banerjee Department of Electrical and Computer Engineering Northwestern University 2145 Sheridan Road, Evanston, IL-60208 banerjee@ece.northwestern.edu
More informationA Multiprocessor Memory Processor for Efficient Sharing And Access Coordination
1 1 A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination David M. Koppelman Department of Electrical & Computer Engineering Louisiana State University, Baton Rouge koppel@ee.lsu.edu
More informationThe Power of Streams on the SRC MAP. Wim Bohm Colorado State University. RSS!2006 Copyright 2006 SRC Computers, Inc. ALL RIGHTS RESERVED.
The Power of Streams on the SRC MAP Wim Bohm Colorado State University RSS!2006 Copyright 2006 SRC Computers, Inc. ALL RIGHTS RSRV. MAP C Pure C runs on the MAP Generated code: circuits Basic blocks in
More informationAn Implementation of Double precision Floating point Adder & Subtractor Using Verilog
IOSR Journal of Electrical and Electronics Engineering (IOSR-JEEE) e-issn: 2278-1676,p-ISSN: 2320-3331, Volume 9, Issue 4 Ver. III (Jul Aug. 2014), PP 01-05 An Implementation of Double precision Floating
More informationV8-uRISC 8-bit RISC Microprocessor AllianceCORE Facts Core Specifics VAutomation, Inc. Supported Devices/Resources Remaining I/O CLBs
V8-uRISC 8-bit RISC Microprocessor February 8, 1998 Product Specification VAutomation, Inc. 20 Trafalgar Square Nashua, NH 03063 Phone: +1 603-882-2282 Fax: +1 603-882-1587 E-mail: sales@vautomation.com
More informationA Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System
A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System HU WEI, CHEN TIANZHOU, SHI QINGSONG, JIANG NING College of Computer Science Zhejiang University College of Computer
More informationSimulation & Synthesis of FPGA Based & Resource Efficient Matrix Coprocessor Architecture
Simulation & Synthesis of FPGA Based & Resource Efficient Matrix Coprocessor Architecture Jai Prakash Mishra 1, Mukesh Maheshwari 2 1 M.Tech Scholar, Electronics & Communication Engineering, JNU Jaipur,
More informationDevelopment of efficient computational kernels and linear algebra routines for out-of-order superscalar processors
Future Generation Computer Systems 21 (2005) 743 748 Development of efficient computational kernels and linear algebra routines for out-of-order superscalar processors O. Bessonov a,,d.fougère b, B. Roux
More informationAn Introduction to Parallel Programming
An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe
More informationIEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS /$ IEEE
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 Exploration of Heterogeneous FPGAs for Mapping Linear Projection Designs Christos-S. Bouganis, Member, IEEE, Iosifina Pournara, and Peter
More informationElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests
ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests Mingxing Tan 1 2, Gai Liu 1, Ritchie Zhao 1, Steve Dai 1, Zhiru Zhang 1 1 Computer Systems Laboratory, Electrical and Computer
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationLecture 20: Memory Hierarchy Main Memory and Enhancing its Performance. Grinch-Like Stuff
Lecture 20: ory Hierarchy Main ory and Enhancing its Performance Professor Alvin R. Lebeck Computer Science 220 Fall 1999 HW #4 Due November 12 Projects Finish reading Chapter 5 Grinch-Like Stuff CPS 220
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationFPGA Implementation of Multiplier for Floating- Point Numbers Based on IEEE Standard
FPGA Implementation of Multiplier for Floating- Point Numbers Based on IEEE 754-2008 Standard M. Shyamsi, M. I. Ibrahimy, S. M. A. Motakabber and M. R. Ahsan Dept. of Electrical and Computer Engineering
More informationThe S6000 Family of Processors
The S6000 Family of Processors Today s Design Challenges The advent of software configurable processors In recent years, the widespread adoption of digital technologies has revolutionized the way in which
More information