A Reconfigurable Accelerator for 32-Bit Arithmetic

Size: px

Start display at page:

Download "A Reconfigurable Accelerator for 32-Bit Arithmetic"

Hugh Bell
5 years ago
Views:

1 A Reconfigurable Accelerator for 32-Bit Arithetic Reiner W. Hartenstein, Rainer Kress, Helut Reinig University of Kaiserslautern Erwin-Schrödinger-Straße, D Kaiserslautern, Gerny Fax: , eil: Abstract This paper introduces the MoM-3 (Map oriented Machine 3) architecture as an accelerator for applications with a large aount of 32-bit arithetic. The data nipulations in the inner loop of the application can be configured as a coplex hardware operator into special fieldprograble devices. Several coplex address generators are prograed to ipleent the control structures of the algorith, so that data is accessed copletely under hardware control, without requiring eory cycles for address coputations. The MoM-3 supports pipelining on stateent and expression level. The copiler takes C as input language and produces the configuration code for the field-prograble devices and the address generators without further user interaction. Acceleration factors in the range of 12 to 62 have been obtained, copared to a high-end Sun SPARC workstation. 1. Introduction Many coputation-intensive algoriths take too uch execution tie, even on a well-equipped odern workstation. This opens a rket for hardware accelerators of all kinds. If the rket share is big enough, ASIC-based accelerators are designed. JPEG and MPEG ige and video copression can be based on full custo circuits fro C-Cube [3] or Zoran [12], for exaple. But such ASICs are expensive and suffer fro a long developent tie. If a user doesn t require the highest perfornce possible for a specific algorith, but a speed-up for several different algoriths, he ight be better off with a kind of custo coputing chine. A lot of coercial and noncoercial solutions have already been presented, like Splash-2 [1], DEC Perle-1 [2], Virtual Coputers [4], and CHS2x4 Custo Coputer [10]. All of these are based on coercially available FPGA technology, especially the SRAM-based XILINX devices [11]. All of the currently available FPGAs are configurable at bit-level. Therefore they provide only odest perfornce and capacity when configuring 32-bit datapaths which are standard for current icrocoputers. As a consequence, ost custo coputing chines do not provide operators which tch the word length of standard high level language data types. Especially arithetic operators of a significant word length consue too uch area and suffer fro large delays, king it difficult to accelerate coputation-intensive algoriths on ost custo coputing chines. The MoM-3 design ais at the following goals: 1. progrability fro standard high level languages like C, without requiring hardware expertise, 2. acceleration of scientific algoriths that process a large aount of data. Since ny of these algoriths do a large aount of arithetic, the MoM-3 needs a coputational device that supports arithetic operations on standard data types like 8- bit, 16-bit and 32-bit integers. The device ust be reconfigurable, because different algoriths shall be accelerated by the sae MoM-3 hardware. For lack of coercially available devices that fulfil these requireents, we developed our own, the so-called rdpa [6]. The MoM-3 kes use of another characteristic of scientific algoriths, to achieve further speed-up. Since ny of these algoriths are organized in loops, which copute indices to large data arrays, the MoM-3 contains several address generators to copute the appropriate data address sequences copletely under hardware control. In the following section, this paper introduces the hardware structure of the MoM-3. Afterwards, the features of the C copiler for the MoM-3 are outlined. The fourth section explains the way algoriths are run on the MoM-3 by eans of an exaple. The final sections provide a perfornce coparison and conclude the paper. 2. The MoM-3 Hardware The MoM-3 architecture is based on the Xputer chine paradig [7]. MoM is an acrony for Map oriented Machine, because the data eory is organized in 1 (of 11) and technical work on a noncoercial basis. Copyright and all rights therein are intained by the authors or by other copyright infortion will adhere to the ters and constraints invoked by each author's copyright. These works y not be reposted without

2 Data Sequencer GAG GAG GAG ctrl ralu SW M3C eory addresses Data Meory Figure 1. Block Diagra of an Xputer data two diensions like a p. Instead of the von Neunnlike tight coupling of the instruction set to the data nipulations perfored, an Xputer shows only a loose coupling between the sequencing echanis and the ALU. That s why an Xputer efficiently supports a reconfigurable ALU (ralu). The ralu contains coplex operators which produce a nuber of results fro a nuber of input data (figure 1). All input and output data to the coplex operators is stored in so-called scan windows (SW). A scan window is a prograing odel of a sliding window, which oves across data eory under control of a data address generator. All data in the scan windows can be accessed in parallel by the ralu operator. The ralu operators are activated every tie a new set of input data is available in the scan window. This so-called data sequencing echanis is deterinistic, because the input data is addressed by Generic Address Generators (GAG). They copute a deterinistic sequence of data addresses fro a set of algorith-dependent paraeters. An Xputer Data Sequencer contains several Generic Address Generators running in parallel, to be able to efficiently cope with ultiple data sources and destinations for one set of coplex operators. In the MoM-3, the Data Sequencer is distributed across several coputational odules (C-Modules), as can be seen in figure 2. Each C-Module consists of a Generic Address Generator (GAG), an ralu subnet and at least two egabytes of local eory. All C-Modules can operate in parallel when each Generic Address Generator accesses data in its local eory only. The MoM-3 includes up to seven C-Modules. Apart fro the local eory access, two eans of global counication are available. First, the ralu subnets can exchange internal results with their neighbours by use of the ralu interconnect, without disturbing parallel activity. Second, the Generic Address Generators can access data eory and ralu subnets on other C-Modules using the global MoMbus. This can be done only sequentially, of course. The global MoMbus is used by the MoM-3 controller (M3C) as well, to reconfigure the address generators and the ralu whenever necessary. The MoM-3 controller is the coordinating instance of the Data Sequencer, which ensures a well-defined parallel activity of the Generic Address Generators by selecting the appropriate paraeter Host CPU VMEbus MoM-3 MoMbus <-> VMEbus Interface e. MoMbus Figure 2. The MoM-3 Hardware sets for configuration. Via the MoMbus-VMEbus interface, the host CPU has access to all eory areas of the MoM-3 and vice versa. The Generic Address Generators have DMA capability to the host s in eory, reducing the tie to download a part of an application for execution on the MoM-3. The host CPU is responsible for all disk I/O, user interaction and eory allocation, so that the MoM-3 copletely uses the functionality of the host s operating syste. The printed circuit boards with the MoM-3 controller and the C-Modules reside in their own 19 inch rack with the MoMbus backplane. The only connection to the host coputer is through the bus interface. That way, the MoM-3 could be interfaced to a different host with inil effort. The part of the bus interface that is injected into the host s backplane would have to be redesigned according to the host s bus specifications. All other parts could rein the sae. 2.1 The Data Sequencer GAG C-Module 1 A D e. ralu ralu interconnect ralu interconnect ralu C-Module interconnect 7 A D GAG e. ralu The MoM-3 Data Sequencer consists of up to seven Generic Address Generators and the MoM-3 controller. The Generic Address Generators operate in a two-stage pipeline. The first stage coputes handle positions for the scan windows (Handle Position Generator). Each Handle Position Generator controls one scan window. A handle position consists of two 16-bit values for the two diensions of the data eory. The sequence of handle positions describes how the corresponding scan window is oved across the data eory (figure 3). Such a sequence of handle positions is called scan pattern. A Handle Position Generator can produce a scan pattern corresponding to four nested loops at the ost. It is prograed by specifying a set of paraeters, such like starting position, 2 (of 11) and technical work on a noncoercial basis. Copyright and all rights therein are intained by the authors or by other copyright infortion will adhere to the ters and constraints invoked by each author's copyright. These works y not be reposted without

3 Generic Address Generator Handle Position Generator Control Logic handle position sequence Meory Address Generator eory address sequence = handle position read 1,1 read -1,0 read 0,1 write 0,0 write 0,1 = no access -1,0: r Figure 3. Generic Address Generator 0,1: r/w 0,0: w 1,1: r increent value, and end position of a loop, each both for the x and y diension of the data eory. The second pipeline stage coputes a sequence of offsets to the handle positions, to obtain the effective eory addresses for the data transfers between the scan window / ralu and the data eory. Therefore this stage is called Meory Address Generator. The range of offsets y be 32 to +31 in both diensions of the data eory. The sequence of offsets y be prograed arbitrarily, but at ost 250 references to the data eory can be de fro one handle position. The offset sequence y be varied at the beginning and at the end of the loops of the Handle Address Generator. This allows to perfor extra eory accesses to fill a pipeline in the ralu at the beginning of a loop, as well as additional write operations to flush a pipeline at the end. The address parts of the two diensions y cobined to a real linear eory address in four ways: one row of two-diensional eory y consue an address space of 10, 12, 14, or 16 bits. This allows to adjust the size of the data eory to the size of the processed data, to reduce wastage of address space. The software environent contains a eory nageent odule, which kes use of the unused area, fro the end of a data row to the next appropriate power-of-two boundary. If the algorith uses another array of data, which fits into the reining space, such two data arrays are placed in eory, one after the other. After the concatenation of the two address parts to a linear address, a 32-bit base address is added, to obtain the effective eory address. The base address typically is the starting address of the data array referenced by this Generic Address Generator, as deterined by the eory nageent software. The Meory Address Generator itself is a three stage pipeline. The first stage perfors the offset calculations and the cobination of address parts to a linear address. The second stage adds the 32-bit base address, and the third stage handles the bus protocol for data transfers. The Generic Address Generators run in parallel. They synchronize their scan patterns through the activations of the ralu. All scan patterns proceed until either they detect an explicit synchronization point in the offset sequence of the Meory Address Generator, or they are blocked in a write operation to eory, waiting for the ralu to copute the desired result. The hardware controlled generation of long address sequences allows to access data in eory every 120 ns, using 70 ns eory devices and a conventional, noninterleaved eory organization. A further speed-up of eory accesses could be obtained by interleaved eory banks and by the introduction of static RAM caches, like in conventional coputers. With caches, it would ke sense to have a prograble cache controller in the MoM-3. Since the hardware generated address sequences are deterinistic, a copiler could copute a cache update strategy at copile tie, which would result in the iniu nuber of cache isses. This should provide better perfornce than the probabilistic cache update strategies usually applied. 2.2 Reconfigurable ALU The coputations in each ralu subnet of the MoM-3 are perfored by an rdpa array (reconfigurable datapath architecture). The rdpa array consists of eight rdpa chips in a two by four organization. Each rdpa chip contains an array of sixteen (four by four) datapath units (DPU). Each datapath unit can ipleent any unary or binary operator fro C language on integer or fixed-point data types up to 32 bits length. Multiplication and division are perfored by eans of a icroprogra, whereas all other operators can be executed directly. Additionally, operators like ultiplication with accuulation are available in the library of configurable operators. Each datapath unit has two input registers, one at the north and one at the west, and two output registers, one at the south and one at the east (figure 4). All registers can be used for interediate results throughout the coputation. The registers in the DPUs in fact are a distributed ipleentation of the scan window of the Xputer paradig. The datapath units can read 32-bit interediate results fro their neighbours to the west and to the north and pass 32-bit results to their neighbours in south and/or east direction. A global I/O bus allows input and output data to be written directly to a DPU without passing the fro neighbour to neighbour. The DPUs operate data-driven. The operator is applied each tie, new input data is availa- 3 (of 11) and technical work on a noncoercial basis. Copyright and all rights therein are intained by the authors or by other copyright infortion will adhere to the ters and constraints invoked by each author's copyright. These works y not be reposted without

4 west I/O bus register ALU register north the ralu controller, supports pipelining on operator level (e.g. floating point operations are ipleented as a pipeline across several datapath units), pipeline chaining [8], and pipelining of loop bodies, as shown in the exaple in section 4.2. That way, the coplex operators of subsequent loop iterations are coputed in parallel in a pipeline. Each of the loop iterations is finished to a different degree, depending on its progress in the pipeline. The whole rdpa array is configured across one serial link of the DPU in the upper left corner of the array. Each configuration word is preceeded by a two-diensional DPU address in relative fort, and the address for the icroprogra storage. As long as both DPU address coponents are different fro zero, the configuration word is passed on to a neighbouring DPU, with the appropriate DPU address coponent decreented by one. If the local interconnect to the east is busy, a configuration word y be routed to the south first (if the DPU address coponent in vertical direction is different fro zero), which results in an autotic load balancing for the slow serial links. The relative address fort allows to use the sae configuration data for rdpa arrays of arbitrary size. Configuration data words are rked by a special configuration bit set by the MoM-3 controller. That way the rdpa array y be reconfigured partially during runtie, while other areas of the array are still busy with coputations. Config- a) b) op icrocontrol prograble FIFO global I/O bus status configuration MoMbus to GAGs, Host, Main Meory rdpa address generation unit register file rdpa control unit rdpa register register op op op op op op south east Figure 4. Datapath unit: a) sybol; b) block diagra ble either fro the bus or fro a neighbouring DPU. This decentralized control siplifies pipelining of operations, when each operation takes a different tie to execute. The array of DPUs expands across rdpa chip boundaries in a way that is copletely transparent to the copiler software. To overcoe pinout restrictions, the neighbour to neighbour connections are reduced to serial links with appropriate converters at the chip boundaries (figure 5). Although pipelining drastically reduces the effect of the conversion fro 32 bits to 2 bits, this still y turn out a bottle neck in the current rdpa prototype. With state of the art IC technology and packages larger than 144 pins, we could build an rdpa with 66 MHz serial links (instead of 33 MHz) and four bits wide counications channels. This fourfold speed-up would overcoe the shortfalls of the current prototype. In addition to the rdpa array, a MoM-3 ralu subnet (figure 6) contains an ralu controller circuit which interfaces to the MoMbus. It provides a register file as a kind of cache for frequently used data, and controls the data transfers on the global I/O bus. The rdpa, in cobination with parallel to serial converter I/O bus chip 1 serial op op op link serial to parallel converter PCB chip 2 Figure 5. Expansion of nearest neighbour interconnect across chip boundaries op op op op op op op op op op op op op op op op op op Figure 6. One subnet of the reconfigurable datadriven ALU 4 (of 11) and technical work on a noncoercial basis. Copyright and all rights therein are intained by the authors or by other copyright infortion will adhere to the ters and constraints invoked by each author's copyright. These works y not be reposted without

5 uration data is passed along with higher priority than coputational data, each tie before a datapath unit starts with the coputations for the next application of its operator. That way, both configuration and coputation are slowed down, if part of the rdpa array is busy, but both are sure to continue without deadlocks. For additional details on the rdpa see [6]. Although this publication refers to an older version of the rdpa, the in structure still is the sae, only the ipleentation has iproved. 3. MoM-3 C Copiler The C copiler for the MoM-3 takes an alost coplete subset of ANSI C as input. Only constructs, which would require a dynaic eory nageent to be run on the MoM-3 are excluded. These are pointers, operating syste calls and recursive functions. Since the host s operating syste takes care of eory nageent and I/O, the software parts written for execution on the MoM-3 do not need such constructs. There are no extensions to C language or copiler directives required to produce configuration code for the MoM-3. The copiler coputes the paraeter sets for the Generic Address Generators, the configuration code for the rdpa arrays, and the reconfiguration instructions for the MoM-3 controller, without further user interaction. First, the copiler perfors a data and control flow analysis. The data structure obtained allows restructurations to perfor parallelizations like those done by copilers for supercoputers. These include vectorization, loop unrolling, and parallelizations on expression level. The next step perfors a re-ordering of data accesses to obtain access sequences, which can be pped well to the paraeters of the Generic Address Generators. Therefore, the copiler generates a so-called data p, which describes the way the input data has to be distributed in the data eory to obtain optiized hardware generated address sequences. After a final autotic partitioning, data nipulations are translated to a ralu description, and the control structures of the algorith are transfored to Data Sequencer code. An assebler for the Data Sequencer translates the Data Sequencer code to paraeter sets for the Generic Address Generators and a reconfiguration schee for the MoM-3 controller. The ralu description is parsed by the ALE-X assebler (arithetic and logic expression language for Xputers). It generates a pping of operators to DPUs, erges DPU operators where possible, and coputes a conflictfree I/O schedule, which tches the operators speed, to keep the DPUs as busy as possible. The separation of the ralu code generation fro the rest of the copiler allows to use other devices than the rdpa to build a MoM-3 ralu, without having to rewrite the C copiler with all its optiizations. For a new ralu, only an assebler generating ralu configuration code fro ALE-X specifications has to be written. This is not too difficult, because ALE-X describes only arithetic and logic expressions on the scan windows contents. A ore detailed description of the MoM-3 prograing environent can be found in [13]. The ost iportant benefit of the MoM-3 C copiler copared to developent environents of other custo coputing chines is the fully autotic code generation. The prograer neither has to be a hardware expert, to guide hardware synthesis with appropriate constraints or copiler directives, nor has he to interfere with different stages of the copilation to get reasonable results. 4. Exaple: two diensional FIR filter The operation of the MoM-3 and the way an algorith is adapted to the non-von Neunn architecture of the MoM-3 is illustrated with a two-diensional FIR filter. The two diensional FIR (finite ipulse response) filter is one whose ipulse response processes only a finite nuber of nonzero saples. The equation for a general two diensional FIR filter is y n = k ij x n i, j. (1) i j 4.1 Straightforward Ipleentation In this exaple a two diensional filtering of second order is shown, that is indices i and j have a range of zero to two. Given the order of a filter and its weights k ij, the ost efficient ipleentation on a von Neunn coputer is to unroll the two loops of the filter kernel, to create a large expression coputing the new value of y n (see figure 7). The weights k ij now are constants, so that the copiler can optiize the ultiplications by replacing the with a tching sequence of additions and shift operations. The sae source code produces an efficient ipleentation on the MoM-3 as well. The C copiler vectorizes the expression and perfors loop unrolling to the extend of the capacity of the rdpa array. With ultiplication and accuulation being a valid operator, the expression to copute y[n][] takes three by four DPUs. The ultiplications and sution of the rows are done in a three by three array of ultipliers and ultiplier-accuulators. A colun of three routers/adders cobines the sus of rows to the coplete result (figure 8). Taking the chip boundaries into account (with the costly parallel-to-serial converters) eight of these expressions can be coputed in parallel, each in its own rdpa chip. The configuration for the whole rdpa array siply 5 (of 11) and technical work on a noncoercial basis. Copyright and all rights therein are intained by the authors or by other copyright infortion will adhere to the ters and constraints invoked by each author's copyright. These works y not be reposted without

6 For (n = 0; n < xn; n++) { For ( = 0; < x; ++) { y[n][] = k00*x[n][] + k01*x[n][+1] + k02*x[n][+2] + k10*x[n+1][] + k11*x[n+1][+1] + k12*x[n+1][+2] + k20*x[n+2][] + k21*x[n+2][+1] + k22*x[n+2][+2]; } } Figure 7. Source code of a 3 * 3 two-diensional FIR filter next handle position scan window rdpa array Figure 9. Scan window and ralu configuration of a vectorized 3 * 3 2-D FIR filter is a concatenation of eight of these operators in a two by four array, leaving always a spare row to adjust to the chip boundaries. The scan window de of the DPU registers now consists of ten words in x direction and three words in y direction, instead of the three by three words of a single filter operator. During vectorization, the copiler perfors a partial unrolling of the inner loop, so that eight consecutive loop iterations are executed in parallel, and the loop index x is increented by eight. The nuber of vectorizable loop iterations is taken fro the capacity of the rdpa array, which can ipleent eight 3 * 3 FIR filter operators in parallel. Furtherore, the C copiler knows fro the hardware configuration file the nuber of available C- Modules and splits the input array into a corresponding nuber of stripes. The required overlap of two can be coputed fro the data dependency analysis. That way all C-Modules y operate in parallel to speed-up coputation by a factor of seven, at the ost. The ALE-X assebler detects, that ost of the input values to the eight parallel expressions are used several ties in different ultiplications, and stores these values in the register file for quick access. Furtherore, the two input values in x direction, that overlap with the next loop iteration are read fro the register file as long as the pipeline is full throughout the inner loop. The copiler generates a Meory Address Generator configuration that first fills the pipeline by reading ten coluns of three input values. Within the inner loop, only eight coluns have to be read fro eory. The overlapping two coluns are taken fro the register file. The only rearrangeents to the data are the distribution of the stripes onto the C-Modules. Therefore the data p is quite straightforward. The input values are stored row by row, each strip of rows in a separate C-Module eory. All Generic Address Generators copute the sae address sequence (figure 10). Scanning the eory with a three by ten window row by row, at each scan window position ten input values are read fro eory and eight values are written as results. All values required at overlapping scan window positions are fetched fro the regis- GAG 0: 148 * 160 * (24 r + 8 w) A D ralu 0: 8 * 3 by 3 2-D FIR a a a = ultiply by k ij = ultiply & accuulate = accuulate sus of rows GAG 6: 148 * 160 * (24 r + 8 w) A D ralu 6: 8 * 3 by 3 2-D FIR Figure 8. ralu operator for a single 3 * 3 2-D FIR filter operation Figure 10.Copiler generated scan pattern for a 3 * 3 2-D FIR filter 6 (of 11) and technical work on a noncoercial basis. Copyright and all rights therein are intained by the authors or by other copyright infortion will adhere to the ters and constraints invoked by each author's copyright. These works y not be reposted without

7 ter file for subsequent accesses. The row by row scan pattern for the scan windows exactly tches the two nested for-loops of the algorith in figure 7, with the loop index of the outer loop being increented by eight. Knowing this structure of the copiler output, a perfornce estition is quite siple. Because eight filter operations are perfored in parallel, the coputation tie for the ultiplication is not the bottleneck, but the eory and register file access ties to provide the large nuber of input data. For each group of eight result values, 3 * 8 = 24 inputs have to be fetched fro eory, each eory access taking 120 ns. Two coluns of inputs are reused in the next loop iteration and can be fetched fro the register file in 60 ns each. Two of the ten input coluns are used in the coputation of two results, requiring 2 * 3 = 6 additional register file accesses. Six input coluns are used, each in three result coputations, totalling in another 6 * 3 * 2 = 36 register file accesses. The total tie to copute eight result values and store the in eory is 24 * 120 ns + 8 * 120 ns + 6 * 60 ns + 6 * 60 ns + 36 * 60 ns = 6720 ns. A single row includes = 1278 positions. Eight positions are coputed in parallel, resulting in 1278 / 8 = 160 iterations of the inner loop. One stripe takes ( * 2) / 7 = 148 rows, which is equal to the nuber of iterations of the outer loop. Using seven C-Modules in parallel, it takes 148 * 160 * 6720 ns = s to filter a 1280 by 1024 pixel ige. The filter weights y be chosen arbitrarily, because the execution tie is bound by eory I/O speed. On the MoM-3, the result would be coputed in the sae tie, regardless whether a pixel is 8 bits wide or 32 bits, because even 32- bit ultiplications would be faster than eory access tie in this exaple. The corresponding perfornce figures for von Neunn coputers in table 1 were based on the sae C source. All variables were declared as registers, the ones with the highest reference count first. Using GNU gcc copiler, all optiizations were turned on, including the exact specification of the icroprocessor type. 4.2 Manually Optiized Ipleentation bit bit Figure 11.Data p for an optiized 2-D FIR filter Fro the algorith description in figure 7, the MoM-3 C copiler cannot ke use of all optiization possibilities. If the input data is only 8 bits wide (a typical case for ost grayscale iges), four values could be transferred in a single 32 bit eory access, reducing I/O tie by a factor of four. The data p for a packed representation of a 1280 by 1024 pixel ige, using eight bits per pixel, can be seen in figure 11. Instead of using the register file, values which are required in subsequent loop iterations could be passed to the next DPU using the neighbour to neighbour interconnect. Since these transfers can be done in parallel, they are uch faster than register file accesses. Furtherore, if the eight FIR filter operators would copute the results for eight rows in parallel, instead of eight results in a single row, the neighbour to neighbour connections could be utilized to a far larger extend. Instead of six overlapping positions in the next loop iteration, twenty positions would overlap, so that fewer inputs would have to be read fro eory and ore inputs would be passed on fro DPU to DPU. The straightforward copiler ipleentation doesn t ke use of the fact, that the next row of results requires to read again ost of the inputs, which have been read during the coputation of the previous row of results. Coputing eight rows of results in parallel, the reused inputs can be fetched fro the register file instead of the eory. The ralu configuration and the scan window corresponding to these nual optiizations of the filtering algorith can be found in figure 12. In order to achieve such optiizations, nual changes to the input source are required. The passing of the input values to the next DPU for the following loop iteration has to be specified explicitly. Since the copiler perfors a vectorization, proceeding fro the inner loops to the outer loops, the parallelization of the row coputations has to 7 (of 11) and technical work on a noncoercial basis. Copyright and all rights therein are intained by the authors or by other copyright infortion will adhere to the ters and constraints invoked by each author's copyright. These works y not be reposted without

8 next handle position GAG 0: 19 * 320 * (10 r + 8 w) A D ralu 0: 8 * 3 by 3 2-D FIR GAG 6: 19 * 320 * (10 r + 8 w) A D ralu 6: 8 * 3 by 3 2-D FIR scan window rdpa array Figure 12.Manually optiized scan window and ralu configuration be done explicitly. Packed transfers of four 8-bit inputs in one 32-bit eory access cannot be specified solely in the C source. The unpacking of 32 bit values can be done in a single DPU, which accepts a 32 bit value and four ties passes an eight bit value to the next DPU for coputation. But this has to be done in the ALE-X assebler source. A description of that operation sequence in C would not pass unchanged through the copiler s transfortion and optiizations steps, so that the ALE-X assebler would not recognize an unpack operator in the resulting description and erge it to a single DPU. An ralu operator (figure 13) to copute a single result value consists of three unpacking DPUs ( u ), a three by three array of ultiply-accuulate DPUs ( and ), and three DPUs to accuulate the sus of rows ( a ). The last of these DPUs perfors the packing of four result values to a 32 bit word as well ( ap ). The partial sus of rows are transferred fro west to east to the next DPU. The operation is pipelined on expression u u u Figure 13.rALU operator for a 3 by 3 filter operation (nually optiized) a ap u a ap = unpack 32 bit = ultiply by k ij = ultiply & accuulate = accuulate sus of rows = accuulate & pack Figure 14.Scan Pattern for a 2-D FIR Filter level, because each ultiply-accuulate DPU accepts the next input operand (of the subsequent loop iteration) after it has calculated its partial su and passed the result to the next DPU to the east. That way, the whole rdpa array can be kept busy, with filter operators of consecutive loop iterations finished to different degrees in the pipeline. One filter operator fits into a single rdpa chip, so that the parallel-to-serial converters do not ipose a delay on the operation. Eight of these operators fit into the rdpa array of a single C-Module. The scan window oves one position to the right in the inner loop, but eight positions downwards in the outer loop, because eight rows are processed in parallel (figure 14). That way the overlapping inputs in each row can be transferred in parallel to the next DPU to the south, where they are required as inputs for the following iteration. The overlapping inputs of the adjacent rows of the eight operators are fetched fro eory only once and then taken fro the register file in half of the tie. One colun of eight result values requires the following tie for I/O: ten values have to be read fro eory, and eight written back. Furtherore, two input values are fetched a second tie fro the register file and six values are fetched twice fro the register file for reuse in another filter operator. The total I/O tie is 3000 / 4 = 750 ns. The division by four stes fro the fact, that every eory or register access transfers the values for four iterations. One pipeline stage of the ultiplication and accuulation takes 420 ns in the average case, because the ultiplications with constants are resolved to shift and add operations. The parallel-to-serial converters, taking 600 ns, do not account, because each filter operator is local to one rdpa chip. Although optiized, this ipleentation is still I/O bound. The optiized algorith takes 148 / 8 = 19 iterations of the outer loop. The inner loop contains = 1278 iterations, each producing a colun of eight results. 8 (of 11) and technical work on a noncoercial basis. Copyright and all rights therein are intained by the authors or by other copyright infortion will adhere to the ters and constraints invoked by each author's copyright. These works y not be reposted without

9 For (n = 0; n < xn; n++) { x00=x[n][0]; x01=x[n][1]; x10=x[n+1][0]; x11=x[n+1][1]; x20=x[n+2][0]; x21=x[n+2][1]; For ( = 0; < x; += 3) { x02 = x[n][+2]; x12 = x[n+1][+2]; x22 = x[n+2][+2]; y[n][] = k00*x00+k01*x01+k02*x02 + k10*x10+k11*x11+k12*x12 + k20*x20+k21*x21+k22*x22; x00 = x[n][+3]; x10 = x[n+1][+3]; x20 = x[n+2][+3]; y[n][+1] = k00*x01+k01*x02+k02*x00 + k10*x11+k11*x12+k12*x10 + k20*x21+k21*x22+k22*x20; x01 = x[n][+4]; x11 = x[n+1][+4]; x21 = x[n+2][+4]; y[n][+2] = k00*x02+k01*x00+k02*x01 + k10*x12+k11*x10+k12*x11 + k20*x22+k21*x20+k22*x21; } } Figure 15.Manually Optiized C Source Code The total tie to filter a 1280 by bit grayscale ige is 19 * 1278 * 750 ns = 0,018 s. For a fair coparison, the C source for the von-neunn coputers is nually optiized as well. Instead of copying the input values to the next position, three iterations of the inner loop are unrolled and a cyclic buffering schee is applied. This eliinates the copy tie for the icroprocessors, because they cannot rely on a register to register interconnect to do the copy operations in parallel. The resulting source code can be seen in figure 15. The x array is declared as unsigned character of course, to allow the copiler to do the sae optiizations on ultiplications as on the MoM-3. The xnm variables and the loop counters are declared as register variables, where the order of declarations takes into account, how often each variable is referenced. In this source code, there is no explicit unrolling of the outer loop like in the optiized MoM-3 source, because a standard icroprocessor cannot process ultiple FIR filter operators in parallel. Table 1 reveals, that the nual optiizations iprove perfornce on the conventional coputers by ore than a factor of two. 5. Perfornce Evaluation The perfornce figures for soe exaple algoriths are given in table 1. The first colun lists the algoriths. MergeSort perfors a sorting of 256k = bit integer nubers. On the MoM-3, initially six C-Modules operate in parallel, but the last erging steps are done in the eory of a single C-Module. The CPU ties for the conventional coputers are easured only for the tie to sort the nubers in eory. The two-diensional FIR filter algoriths are easured for different kernel sizes. For all ipleentations, the weights of the kernel are constants copiled into the code, allowing the copilers to replace ultiplications with optiized shift and add sequences. Input to the filter is a 1280 by 1024 pixel grayscale ige using 8 bits per pixel. Only the tie to filter the ige in eory is counted, excluding disk I/O operations. The first group of perfornce figures are easured on the output of the copilers, with all optiizations turned on. The second group of figures are easured on nually optiized code, that takes into account that the sae values are ultiplied to different weights in successing steps of the filter algorith (see section 4.2). This Algoriths 68020, 16 MHz [s] Sparc 10, 50 MHz [s] MoM-3, 33 MHz [s] MergeSort a x3 2-D FIR b x5 2-D FIR b x7 2-D FIR b x3 2-D FIR c x5 2-D FIR c x7 2-D FIR c JPEG d Table 1. CPU tie required (in seconds) a. sort Bit integers in eory b * 1024 grayscale ige, 8 bits per pixel, in eory, copiler optiizations only c * 1024 grayscale ige, 8 bits per pixel, in eory, nually optiized d. IJG cjpeg copression, 704 * 576 RGB ige, 24 bits per pixel, excluding disk I/O 9 (of 11) and technical work on a noncoercial basis. Copyright and all rights therein are intained by the authors or by other copyright infortion will adhere to the ters and constraints invoked by each author's copyright. These works y not be reposted without

10 allows to reduce eory cycles by storing these values internally. Multiplications and accuulations are done in the sae DPU. The additional cost of the addition is weighed out far by the reduction of the required rdpa array size. Since all DPUs fit into a single chip in the 3 * 3 case, the coparably high cost (in execution tie) of the serial links can be avoided. The MoM-3 uses all seven C- Modules in parallel on overlapping strips of the input ige to speed up the filter operation. The JPEG copression algorith is based on the IJG progra cjpeg [9], inserting tie easuring code around the copression Algoriths MoM-3 vs , 16 MHz MoM-3 vs. Sparc 10/51 odern MoM-3 vs. Sparc 10/51 a MergeSort b 3x3 2-D FIR c d 5x5 2-D FIR c d 7x7 2-D FIR c d 3x3 2-D FIR e d 5x5 2-D FIR e f 7x7 2-D FIR e f JPEG Table 2. Speed-up Factors 23.6 d a. Sparc 10/51 vs. MoM-3 in Sparc s technology b. critical factor is eory access tie: 60 ns vs 120 ns DRAM access, 30 ns vs. 60 ns SRAM (register file) access, in the first stage of the algorith, the critical factor is the speed of the serial links, which changes to eory access tie for new technology, therefore ore than a factor of two additional speed-up c. copiler optiizations only d. critical factor is solely eory access tie: additional speed-up of two e. nually optiized f. critical factor is speed of serial links: 4 bit vs 2 bit with larger packages, and 66 MHz vs 33 MHz serial link clock, but only an additional speed-up of approxitely three, because with new technology critical factor is eory access tie kernel, after the input ige is read. To exclude disk output, output is redirected to /dev/null. A 704 by 576 RGB colour ige using 24 bits per pixel is used as input for the tie easureents. The JPEG copression algorith is adapted to the MoM-3 so that seven C-Modules operate in parallel. A ore detailed description of the MoM-3 ipleentation of this algorith can be found in [5]. The second colun of table 1 gives the CPU ties in seconds for the ELTEC host, which runs an MC68020 at 16 MHz. The third colun describes the perfornce of a SPARCstation 10/51 running at 50 MHz. The CPU ties of the conventional coputers were all based on copilations (GNU gcc) with all optiizations turned on, producing output specially adapted to exactly the kind of processor inside the coputer. The last colun indicates the tie to run the sae algorith on the MoM-3 prototype (33 MHz version). Table 2 illustrates the speed-up factors obtained copared to the hosting ELTEC workstation and a odern SUN Sparc 10/51. The last colun takes into account, that our prototype is built with an inferior technology copared to a odern Sparcstation. The notes at the botto of the table explain how the extra speed-up factors would be achieved. The rather inor speed-up for the MergeSort application stes fro the fact, that the sorting is not really coputation intensive and can only be parallelized and pipelined at the first stages. Most of the execution tie is spent in the erging process, where ost of the rdpa hardware is idle. The whole algorith is doinated by eory access tie, a reason why the Sparc 10 provides coparably sll speed-ups over the MC68020, as well. 6. Conclusions The MoM-3, a configurable accelerator has been presented, which provides acceleration factors of three orders of gnitude for its outdated host coputer, and speedups in the range of 12 to 62 when copared to a state of the art workstation. High acceleration factors can still be intained, when working fro a high level input language. The copilation is done fro an alost coplete subset of standard C language. Good results are achieved, without requiring user interaction or special copiler directives to generate code for field-prograble devices and the controlling hardware of the MoM-3. The custo designed rdpa circuit provides a coarse grain field-prograble hardware, especially suited for 32-bit datapaths for pipelined arithetic. A new sequencing paradig supports ultiple address generators and a loose coupling between sequencing echanis and ALU. The address generation under hardware control leaves all eory accesses free for data transfers, providing de facto a higher 10 (of 11) and technical work on a noncoercial basis. Copyright and all rights therein are intained by the authors or by other copyright infortion will adhere to the ters and constraints invoked by each author's copyright. These works y not be reposted without

11 eory bandwidth fro the sae eory hardware. The loose coupling of the data sequencing paradig allows to fully exploit the benefits of reconfigurable coputing devices. The cobination of hardwired address generators and configurable devices is the key to speed up both data nipulations and data accesses - and still intain a prograing environent failiar to a conventional software developer. The MoM-3 prototype is currently being built. The address generators, the MoM-3 controller and the ralu control chip have just returned fro fabrication and are now under test. All three are integrated circuits based on 1.0 CMOS standard cells, a technology de available by the EUROCHIP project of the CEC. The rdpa circuits will be subitted to fabrication in a 0.7 CMOS standard cell process soon. All MoM-3 perfornce figures were obtained fro siulations of the copleted circuit designs, which were integrated into a siulation environent, describing the whole MoM-3 at functional level. [12] A. Razavi, I. Shenberg, D. Seltz, D. Fronczak: A High Perfornce JPEG Ige Copression Chip Set for Multiedia Applications; Proceedings of the SPIE - The International Society for Optical Engineering, vol.1903, p , 1993 [13] K. Schidt: A Restructuring Copiler for Xputers; Ph. D. Thesis, University of Kaiserslautern, 1994 References [1] J. M. Arnold, D. A. Buell, E. G. Davis: Splash-2; 4th Annual ACM Syposiu on Parallel Algoriths and Architectures (SPPA 92), pp , ACM Press, San Diego, CA, June/July 1992 [2] P. Bertin, D. Roncin, J. Vuillein: Introduction to Prograble Active Meories; DIGITAL PRL Research Report 3, DIGITAL Paris Research Laboratory, June 1989 [3] D. Bursky: Codec copresses iges in real tie; Electronic Design, vol. 41, no. 20, pp , Oct [4] S. Casseln: Virtual Coputing and The Virtual Coputer; IEEE Workshop on FPGAs for Custo Coputing Machines, FCCM'93, IEEE Coputer Society Press, Napa, CA, pp , April 1993 [5] R. W. Hartenstein, J. Becker, R. Kress, H. Reinig, K. Schidt: A Reconfigurable Machine for Applications in Ige and Video Copression; European Syposiu on Advanced Services and Networks / Conference on Copression Techniques and Standards for Ige and Video Counications, Asterda, March 1995 [6] R. W. Hartenstein, R. Kress, H. Reinig: A Reconfigurable Data-Driven ALU for Xputers; Proceedings of the IEEE Workshop on FPGAs for Custo Coputing Machines, FCCM'94, Napa, CA, April 1994 [7] R. W. Hartenstein, A. G. Hirschbiel, K. Schidt, M. Weber: A Novel Paradig of Parallel Coputation and its Use to Ipleent Siple High-Perfornce Hardware; Future Generation Systes 7 (1991/92), p , Elsevier Science Publishers, North-Holland, 1992 [8] Jaap Hollenberg: The CRAY-2 Coputer Syste; Supercoputer 8/9, pp , July/Septeber 1985 [9] T. G. Lane: cjpeg software, release 5; Independent JPEG Group (IJG), Sept [10] N.N.: The CHS2x4: The World s First Custo Coputer; Algotronix Ltd., 1991 [11] N.N.: The XC4000 Data Book; Xilinx, Inc., (of 11) and technical work on a noncoercial basis. Copyright and all rights therein are intained by the authors or by other copyright infortion will adhere to the ters and constraints invoked by each author's copyright. These works y not be reposted without

Abstract A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE

Abstract A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE Reiner W. Hartenstein, Rainer Kress, Helmut Reinig University of Kaiserslautern Erwin-Schrödinger-Straße, D-67663 Kaiserslautern, Germany