DATA PARALLEL FPGA WORKLOADS: SOFTWARE VERSUS HARDWARE. Peter Yiannacouras, J. Gregory Steffan, and Jonathan Rose

Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, 10 King's College Road, Toronto, ON
{yiannac, steffan, jayar}@eecg.utoronto.ca

ABSTRACT

Commercial soft processors are unable to effectively exploit the data parallelism present in many embedded systems workloads, requiring FPGA designers to exploit it (laboriously) with manual hardware design. Recent research [1, 2] has demonstrated that soft processors augmented with support for vector instructions provide significant improvements in performance and scalability for data-parallel workloads. These soft vector processors provide a software environment for quickly encoding data-parallel computation, but their competitiveness with manual hardware design in terms of area and performance remains unknown. In this work, using an FPGA platform equipped with DDR memory and executing data-parallel EEMBC embedded benchmarks, we measure the area/performance gaps between (i) a scalar soft processor, (ii) our improved soft vector processor, and (iii) custom FPGA hardware. We demonstrate that the 432x wall-clock performance gap between scalar executed C and custom hardware can be reduced significantly, to 17x, using our improved soft vector processor, while silicon efficiency is improved by 3x in terms of area-delay product. We modified the architecture to mitigate three key advantages we observed in custom hardware: loop overhead, data delivery, and exact resource usage. Combined, these improvements increase performance by 3x and reduce area by almost half, significantly reducing the need for designers to resort to more challenging custom hardware implementations.

1. INTRODUCTION

The designer of an FPGA-based embedded system often faces a difficult choice between designing custom hardware by hand using a hardware-description language (HDL) mapped directly to the FPGA fabric, or writing software in a high-level language such as C that targets a soft processor: a processor implemented in the programmable FPGA fabric and programmed using traditional sequential programming languages and software compilers. The performance of a soft processor is often sufficient for parts of the design, allowing embedded systems designers to use soft processors to reduce their time to market and exploit single-chip advantages without requiring specialized FPGAs with hard processors; however, the performance and area of current commercial soft processors are still significantly inferior to those of a custom hardware solution, meaning designers must spend more time implementing hardware to meet their design constraints. As a result, we are motivated to improve soft processors to reduce FPGA design time. Recent advances [3-5] have indeed expanded the applicability of soft processors by improving on current commercial offerings. In particular, recent work has proposed extending soft processors with vector processing capabilities [1, 2] as a means of scaling performance for data-parallel workloads. Vector processing allows a single instruction to command multiple datapaths called vector lanes; on an FPGA the number of vector lanes can be configured by the designer, allowing them to use more FPGA resources to scale up performance. However, the impact of soft vector processors depends on their ability to lure FPGA designers into software design by providing good enough performance and area to reduce the amount of manual hardware design needed.
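For concreteness, the following is a minimal, hypothetical C sketch (our own illustration, not one of the measured benchmarks) of the kind of data-parallel loop targeted throughout this paper. Every iteration is independent, so a vector unit can apply a single instruction across all of its lanes at once, whereas a scalar soft processor must step through the elements one at a time.

    #include <stdio.h>
    #include <stdint.h>

    /* Saturating blend of two 8-bit buffers, element by element. Each
     * iteration is independent of the others, so a soft vector processor can
     * execute the loop body as vector operations spread across its L lanes,
     * while a scalar soft processor handles one element per iteration. */
    void blend(const uint8_t *a, const uint8_t *b, uint8_t *out, int n)
    {
        for (int i = 0; i < n; i++) {
            int sum = a[i] + b[i];                      /* widen to avoid overflow */
            out[i] = (sum > 255) ? 255 : (uint8_t)sum;  /* saturate to 8 bits */
        }
    }

    int main(void)
    {
        uint8_t a[4] = {10, 200, 255, 0}, b[4] = {20, 100, 1, 0}, out[4];
        blend(a, b, out, 4);
        printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  /* 30 255 255 0 */
        return 0;
    }

A hand-vectorized version of such a loop, expressed with vector instructions, is what the soft vector processor evaluated below executes.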
Thus, it is crucial to understand the performance and area gap between soft vector processors and custom hardware.

1.1. Measuring, Understanding, and Reducing the Gap

We measure the area and performance gap using several data-parallel benchmarks (primarily from the EEMBC [6] industry-standard embedded benchmark suites) on three platforms executing: (i) out-of-the-box C on a scalar soft processor; (ii) hand-vectorized assembly on many configurations of the soft vector processor called VESPA (Vector Extended Soft Processor Architecture) [1]; and (iii) custom hardware hand-designed in Verilog. Our goal is to use this measurement to quantify the competitiveness of recent soft vector processors and to further improve them by leveraging our insights into the causes of the performance/area gap as well as the circuit structures used to implement the benchmarks in hardware. Specifically, we identify the following key advantages of custom hardware over VESPA, and we improve VESPA to reduce the impact of each.

Loop Overhead. Loop control in custom hardware is generally implemented using a finite state machine (FSM) that executes in parallel with the loop computation, while in VESPA the control datapath must complete an instruction before the vector lanes can issue the following instruction, and vice versa. We reduce this advantage by decoupling the control datapath from the vector lanes and exploiting instruction-level parallelism.

Data Delivery. High-performance custom hardware can often achieve near-perfect delivery of data to its functional units with no cycles wasted. In contrast, for soft processors including VESPA, data flows from memory through caches to registers and eventually to functional units. We improve data delivery in VESPA in two ways: (i) by tuning the cache design, and (ii) by supporting prefetching.

Exact Resource Usage. A custom hardware implementation contains exactly the resources required by the application: functional units support only the required operations, and datapath bit-widths exactly match those required. In contrast, soft processors such as VESPA are general-purpose and hence support a full instruction set (ISA) and the corresponding maximum bit-widths. We improve VESPA via support for subsetting the instruction set and reducing datapath bit-widths to match the application.

In this work we demonstrate that these improvements, when combined, provide a 3x performance improvement over the original VESPA and significantly broaden its design space. We also show that the performance gap between a scalar soft processor and custom hardware is 432x, and that our fastest VESPA implementation reduces this gap to 17x while providing a performance-per-unit-area up to 3x that of the scalar processor. While the remaining gap is still large, these improvements allow soft vector processors to better compete with custom hardware, allowing designers to more often implement a software-programmable solution rather than having to design custom hardware.

1.2. Related Work

The most closely related work is by Yu et al. [2], who demonstrated the potential of vector processing as a simple-to-use and scalable accelerator for soft processors, potentially scaling better than Altera's C2H [5] behavioral synthesis tool for three benchmarks. However, that work modeled a vector processor optimistically, including the use of an on-chip one-cycle (latency) memory system; we compare a real vector processor to manual hardware design. Hardt and Camposano [7] compared hardware circuits synthesized to 2µm CMOS against software on a SPARC processor, with cycle performance estimated from static code analysis. They found that hardware outperforms the processor by factors ranging between 24x and 44x for scalar workloads. Our work performs a similar comparison, but between FPGA hardware and soft vector processors, while including the effects of clock frequency and memory latency. More recent work [8] has compared FPGAs to hard microprocessors but does not compare against soft vector processors.

[Fig. 1. VESPA processor block diagram: a scalar MIPS core and a vector coprocessor of lanes 1..L share an instruction cache; a memory crossbar connects the lanes through an arbiter to the data cache and prefetcher, backed by DDR memory.]
1.3. Contributions

In this paper we make the following contributions: (i) using an FPGA platform with DDR memory, we quantify and analyze the area/performance gaps for industry-standard benchmarks between a scalar soft processor, a parameterized vector soft processor, and hand-designed hardware implementations; (ii) we improve VESPA by targeting key advantages of hardware implementations, specifically by reducing loop overhead, tuning the cache design, supporting data prefetching, and eliminating unused hardware; (iii) we show that our improved VESPA provides a powerful design space, spanning 5x in area and 11x in performance, with the fastest VESPA reducing the 432x scalar soft processor performance gap to 17x while improving performance per area by up to 3x.

2. VESPA

In our previous work on VESPA (Vector Extended Soft Processor Architecture) we implemented a parameterized vector processor in Verilog and explored its potential for scalability and customization. The following summarizes the VESPA architecture and parameters (old and new); further details can be found in [1]. Figure 1 shows a block diagram of the VESPA processor, which consists of a scalar MIPS-based processor automatically generated using the SPREE system [3], coupled with a parameterized vector coprocessor based on the VIRAM [9] vector instruction set. The scalar SPREE processor is a 3-stage pipeline with full forwarding and a 1-bit branch history table.

Table 1. Configurable parameters for VESPA.

    Parameter                 Symbol   Values
    Vector Lanes              L        1, 2, 4, 8, 16, ...
    Vector Lane Bit-Width     W        1, 2, 3, 4, ..., 32
    Maximum Vector Length     MVL      2, 4, 8, 16, ...
    Memory Crossbar Lanes     M        1, 2, 4, 8, ..., L
    Each Vector Instruction   -        on/off
    DCache Depth (KB)         DD       4, 8, ...
    DCache Line Size (B)      DW       16, 32, 64, ...
    DCache Miss Prefetch      DPK      1, 2, 3, ...
    Vector Miss Prefetch      DPV      1, 2, 3, ...

The parameters of the VESPA system are listed in Table 1. The vector coprocessor consists of L parallel vector lanes, where each lane can perform operations on a single element in a pipelined fashion. The width W of each vector lane datapath is 32 bits by default, but can be reduced for applications that require less than the full 32-bit width. MVL determines the maximum vector length supported in hardware and is set to 64 for this study. The scalar processor and vector coprocessor share a single instruction stream fed by an instruction cache. The scalar processor and vector coprocessor are both in-order pipelines, but can execute out-of-order with respect to each other, except for memory operations, which are serialized to maintain sequential consistency. Both share a direct-mapped data cache with parameterized depth DD and cache line size DW. A crossbar routes each byte in a cache line to/from M of the L vector lanes in a given cycle. A full crossbar (M=L) can significantly reduce the clock frequency of the design when L is large; in such cases M can be reduced to restore the clock rate and save area, but more cycles will be spent moving data between the cache lines and vector lanes (a first-order sketch of this trade-off appears at the end of this section). The data cache is equipped with a hardware prefetcher configured with the parameters DPK and DPV, described in a later section. Beyond our previous work, we compare VESPA configurations to hardware for the first time, we add configurable caches and data prefetching, we explore the complete design space with our new robust design rather than individually for each parameter, and we make other architectural improvements (see Section 5.2).
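As a rough illustration of the L, M, and MVL parameters, the sketch below estimates how many data-movement cycles a single vector memory instruction might spend when the crossbar serves only M of the L lanes per cycle. This is a simplified first-order model of our own (the real cost also depends on cache line size, element width, alignment, and misses), not the exact VESPA pipeline behavior.

    #include <stdio.h>

    /* First-order estimate: with vl elements to transfer and a crossbar that
     * can serve M of the L lanes each cycle, a vector memory instruction
     * needs roughly ceil(active/M) crossbar passes per group of L elements.
     * Assumes every element hits in the data cache. */
    static int xbar_cycles(int vl, int L, int M)
    {
        int cycles = 0;
        for (int done = 0; done < vl; done += L) {        /* one wave per L elements */
            int active = (vl - done < L) ? (vl - done) : L;
            cycles += (active + M - 1) / M;               /* ceil(active / M) passes */
        }
        return cycles;
    }

    int main(void)
    {
        /* MVL = 64 as in this study; compare a full crossbar to a halved one. */
        printf("L=16, M=16: %d cycles\n", xbar_cycles(64, 16, 16));  /* 4 */
        printf("L=16, M=8:  %d cycles\n", xbar_cycles(64, 16, 8));   /* 8 */
        return 0;
    }

Halving M shrinks the crossbar and helps the clock rate, at the cost of roughly doubling the data-movement cycles in this simple model, which is the trade-off described above.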
3. MEASUREMENT METHODOLOGY

Our goal is to measure the area/performance gap between scalar soft processors, soft vector processors, and hardware, as well as to investigate techniques for reducing that gap. In this section we describe the components of our infrastructure necessary to execute, verify, and evaluate the FPGA designs: our hardware platform, verification process, CAD tool measurement methodology, benchmarks, and compiler. We also discuss how the hardware implementations of our benchmarks were created.

Soft Processor Platform. We use the Transmogrifier 4 (TM4) [10] to host the complete soft processor systems. The platform has four Altera Stratix EP1S80F1508C6 devices, each with access to two 1GB PC3200 CL3 DDR SDRAM DIMMs clocked at 133 MHz (266 MHz DDR). We synthesize our processor systems onto one of the four Stratix I FPGAs connected to one of the DIMMs and clock the processor at 50 MHz. All instances of VESPA are fully tested in hardware using the built-in checksum values encoded into each benchmark; debugging is guided by comparing traces of all writes to the scalar and vector register files. Note that because the Stratix I FPGAs on the TM4 are dated, we use this platform only for measuring benchmark cycle counts. For area and clock frequency measurements we use the CAD flow described below to target a faster Stratix III FPGA (a device unavailable to us in hardware), which achieves a clock speed of 130 MHz. While this faster clock speed would increase the memory latency observed by the processor, we believe it would not significantly impact our results: the memory latency in our current system is already exaggerated by the fact that our DDR controller is hand-made and suffers many inefficiencies, including the use of a closed-page policy.

FPGA CAD Tools. A key benefit of FPGA-based systems research is that we can obtain high-quality measurements, including the area and clock frequency measurements provided by FPGA CAD tools. We use Altera's Quartus II 8.0 CAD software with register retiming and duplication enabled and with aggressive timing constraints; through experimentation we found that these settings provide the best area, delay, and runtime trade-off. We perform eight such runs for each hardware design to average out the nondeterminism in the CAD algorithms. We approximate the relative silicon area of each Stratix III tile by adjusting the values supplied to us by Altera [11] for the Stratix II. We report the silicon area consumed by a design in units of equivalent ALMs: the silicon area of a single ALM (Adaptive Logic Module, the basic programmable logic unit in the Stratix III) including its routing. For soft processors, the areas we report include everything except the memory controller and host communication hardware.

Benchmarks. The six benchmarks that we measure are listed in Table 2: five are from the industry-standard EEMBC collection [6], and one (IMGBLEND) was hand-made. All except IP CHECKSUM were hand-vectorized and provided by Kozyrakis and the Berkeley VIRAM project [9]. For the top four benchmarks we execute the largest dataset with the EEMBC test harness uncompromised. We also manually extracted and vectorized the IP CHECKSUM kernel from the Networking suite of EEMBC and execute it on 10 4KB input packets. Note that cycle counts are collected from a complete execution on our hardware platform as described above, and the vectorized code is never modified to support any specific vector configuration.

Table 2. Benchmark applications. [The EEMBC dataset numbers, input/output sizes, and largest-vector-element widths were not recoverable from this transcription.]

    Benchmark     Description           Source        EEMBC Suite    % VIRAM ISA Used
    AUTCOR        auto correlation      EEMBC/VIRAM   Telecom        9.6%
    CONVEN        convolution encoder   EEMBC/VIRAM   Telecom        5.9%
    RGBCMYK       rgb filter            EEMBC/VIRAM   Digital Ent.   5.9%
    RGBYIQ        rgb filter            EEMBC/VIRAM   Digital Ent.   8.1%
    IP CHECKSUM   checksum              EEMBC         Networking     8.1%
    IMGBLEND      combine two images    VIRAM         (none)         7.4%

Compilation Framework. Benchmarks are built using a MIPS port of GNU gcc with the -O3 optimization level. Initial experiments with this version of gcc's auto-vectorization showed that it failed to vectorize key loops in our benchmarks, preventing us from automatically generating vectorized code. Instead, we ported the GNU assembler to support VIRAM vector instructions, allowing us to vectorize manually in assembly.

Area-Delay Product. A system designer may care more about area than performance, or vice versa, depending on the constraints of the design at hand. However, it is important to understand the overall performance-per-area of candidate designs, motivating us to measure the area-delay product, as is traditionally done for digital circuits. We use the aforementioned equivalent ALMs for area and the wall-clock time of benchmark execution as the delay, combining the cycle counts reported by real hardware with the maximum clock frequency reported by the CAD tools (a small sketch of these calculations appears at the end of this section).

3.1. Designing Custom Hardware Circuits

We model the performance of our hardware circuits optimistically while using area and clock frequencies from a real FPGA hardware design, achieved by manually converting each benchmark into a Verilog hardware circuit. While there are infinite variations of such hardware designs, we attempted to implement designs that maximize performance while simplifying the process with the following assumptions. All input/output data starts/ends in memory and is transferred uninterrupted at the full rate of our DRAM device. We also idealize the control logic, assuming it can make decisions in a single cycle and accounts for negligible area. Finally, we do not allow any value- or value-range-specific optimization in either the software or the hardware. To summarize, we build only the datapath of the circuit under optimistic assumptions about the control logic and the transfer of data.

[Table 3. Hardware circuit area and performance: ALMs, DSPs, M9Ks, clock (MHz), and cycle counts for AUTCOR, CONVEN, RGBCMYK, RGBYIQ, IP CHECKSUM, IMGBLEND, and the unrolled AUTCOR and CONVEN variants; the numeric values were not recoverable from this transcription.]

The resulting hardware circuits are tested in simulation using test vectors, and area and clock frequency are measured using the previously described CAD flow. For each hardware circuit we compute the total number of cycles for execution as the sum of the pipeline latency plus the cycles spent transferring data, since the circuit computation is done in parallel with this transfer. Overall, we believe the hardware circuits are optimistic and certainly overcome the manual vectorization advantage in software. As a result of forbidding value and value-range optimizations, we do not perform loop unrolling of non-vectorized loops, nor the equivalent in hardware. For example, benchmarks such as AUTCOR operate repeatedly on the same data set, with the actual computation dependent on a parameter input which varies from 0 to 15. In hardware we could unroll that loop, performing all 16 operations simultaneously. The benefit of unrolling a loop would be relatively small for VESPA, which is an in-order single-issue processor, while hardware could readily exploit the exposed instruction-level parallelism (ILP). The last two rows in Table 3 show the impact of unrolling in hardware for AUTCOR and CONVEN, the only two benchmarks where unrolling is useful in hardware. The unrolled circuits are not used in our results, but the performance impact can be large: in the case of AUTCOR, execution completes in 7.4x fewer cycles, although clock frequency is reduced and circuit area increases substantially.
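To make the two evaluation formulas above concrete, the following sketch (our own illustration, using made-up numbers rather than values from Table 3 or Table 4) computes a hardware circuit's cycle count as pipeline latency plus data-transfer cycles, and the area-delay product as equivalent ALMs multiplied by the wall-clock time derived from measured cycles and the CAD-reported clock frequency.

    #include <stdio.h>

    /* Cycle model for the custom hardware circuits: computation overlaps with
     * the streaming of data, so total cycles = pipeline latency + cycles spent
     * transferring data to/from DRAM. */
    static double hw_cycles(double pipeline_latency, double transfer_cycles)
    {
        return pipeline_latency + transfer_cycles;
    }

    /* Area-delay product: equivalent ALMs (area) times wall-clock execution
     * time (cycles / fmax), the silicon-efficiency metric used in this paper. */
    static double area_delay(double equiv_alms, double cycles, double fmax_hz)
    {
        return equiv_alms * (cycles / fmax_hz);
    }

    int main(void)
    {
        /* Illustrative numbers only; not taken from the paper's tables. */
        double cycles = hw_cycles(20.0, 100000.0);
        printf("cycles = %.0f\n", cycles);
        printf("area-delay = %.6f ALM*s\n", area_delay(5000.0, cycles, 130e6));
        return 0;
    }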
4. COMPARING TO HARDWARE

In this section we compare the area and performance of the following three implementations of our benchmarks, created via different design entry methods: (i) out-of-the-box C code executed on the MIPS-based SPREE scalar processor; (ii) hand-vectorized assembly language executed on many variations of our VESPA soft vector processor; and (iii) hardware designed in Verilog at the register transfer level as described in Section 3.1.

Table 4 shows the area advantage and speedup of the hardware implementations versus the scalar SPREE processor in the first row, and versus the slowest, the least area-delay, and the fastest configurations of VESPA in the remaining three rows. The limited number of multipliers in the Stratix 1S80 on the TM4 prevents us from evaluating soft vector processors with more than 16 lanes, but we expect further performance scaling on larger Stratix III based hardware platforms [1]. Focusing on the first row of the table, we observe that the scalar processor executing out-of-the-box C code is on average 6.7x larger than the hardware circuits and performs 432x slower. Not exploiting the available data parallelism is the primary cause of the under-performance. The area of the scalar processor is larger than each of the hardware implementations, suggesting that despite its time-multiplexed resources, general-purpose overheads cause the processor to be larger than the spatially executed hardware. In an extreme case, CONVEN with its 1-bit datapath is 64x smaller than the scalar processor. With respect to the hardware circuits, VESPA is 13x to 64x larger and 192x to 17x slower. A more quantitative analysis follows in a subsequent section, but it is clear that vector processing extensions to soft processors are motivated, since the 432x scalar processor performance gap can be reduced down to 17x. Such a massive performance boost could help convert many components of an FPGA system into software executing on a soft vector processor rather than laboriously designed custom hardware.

Figure 2 shows the area-performance design space of many near-Pareto-optimal VESPA processors normalized against hardware. We observe that the VESPA design space is quite large, spanning 5x in area and 11x in performance, with the 16-lane VESPAs providing the best performance at the cost of additional area. The figure identifies the number of lanes in each configuration, which is the most dominant parameter in determining area and performance, but the memory crossbar size M, the data cache depth DD, the data cache line size DW, and the data prefetcher DPV are also varied. These parameters will be discussed in a subsequent section; here they are used to show the fine-grained trade-offs within VESPA. The trade-offs are significant because VESPA could be a potentially large component in an FPGA system.

[Fig. 2. Area-performance design space of VESPA processors normalized against hardware; axes: hardware speed advantage versus hardware area advantage, series: 1-, 2-, 4-, 8-, and 16-lane configurations.]

4.1. VESPA vs. Scalar

Looking at the first two rows of Table 4, we can compare the scalar processor with a VESPA processor that has only a single lane and an identical cache organization. The VESPA processors are at least 2x larger than the scalar processor since they comprise both a scalar processor and a vector coprocessor. The hand-vectorized assembly executed on VESPA gains more than 2x average performance over the scalar out-of-the-box C code on scalar SPREE, even though there is no data-parallel execution on the single-lane version of VESPA. This is partly due to a number of advantages of VESPA:
(a) more efficient pipeline execution with few dependencies;
(b) the large vector register file can store and manipulate arrays without having to access the cache or memory;
(c) amortization of loop control instructions (sketched after this list);
(d) direct support for fixed-point operations, predication, and built-in min/max/absolute instructions in the VIRAM instruction set;
(e) simultaneous execution in the scalar processor and vector coprocessor; and
(f) manual vectorization in assembly versus the C-compiled scalar output from GCC.
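As a rough illustration of advantage (c), the strip-mined loop below (a C sketch of our own; the measured benchmarks were actually vectorized in VIRAM assembly) shows how operating on up to MVL = 64 elements per vector operation amortizes the loop bookkeeping (index update, bounds test, branch) over 64 iterations' worth of work.

    #include <stdio.h>
    #include <stdint.h>

    #define MVL 64  /* maximum vector length supported in hardware in this study */

    /* Hypothetical vector primitive standing in for a single vector
     * instruction: adds len (<= MVL) elements of b into a. On a soft vector
     * processor one such operation is a single instruction commanding all
     * vector lanes. */
    static void vadd_u16(uint16_t *a, const uint16_t *b, int len)
    {
        for (int i = 0; i < len; i++)
            a[i] += b[i];
    }

    /* Strip-mined loop: the bookkeeping (index update, comparison, branch)
     * runs once per MVL elements instead of once per element, which is the
     * loop-overhead amortization described in advantage (c). */
    void accumulate(uint16_t *a, const uint16_t *b, int n)
    {
        for (int i = 0; i < n; i += MVL) {
            int len = (n - i < MVL) ? (n - i) : MVL;  /* last, possibly partial, strip */
            vadd_u16(&a[i], &b[i], len);
        }
    }

    int main(void)
    {
        uint16_t a[100] = {0}, b[100];
        for (int i = 0; i < 100; i++) b[i] = (uint16_t)i;
        accumulate(a, b, 100);
        printf("%d\n", a[99]);  /* 99 */
        return 0;
    }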
Determining the exact contribution of each advantage is beyond the scope of this work; we instead perform some qualitative analysis. Closer inspection of CONVEN revealed the cause of the 9x performance boost seen on the single-lane VESPA to be the repeated operations performed on a single array: in VESPA the large vector register file can store large array chunks and manipulate them without storing and re-reading them from the cache, as the scalar processor must. The other benchmarks are less affected because of their streaming, low-reuse nature. The loop-overhead amortization gained by performing 64 loop iterations (MVL=64) at once benefits all benchmarks. The more powerful VIRAM instruction set with fixed-point support further reduced the loop bodies of AUTCOR and RGBCMYK. Finally, the disassembled scalar GCC output did not appear significantly less efficient than the vectorized assembly for any of the benchmarks, leading us to infer that manual assembly optimization was not a disproportionately significant advantage for VESPA.

4.2. VESPA vs. HW

By focusing only on loops, we can decompose the performance difference between VESPA and hardware into the following categories: (i) the clock frequency; (ii) the number of loop iterations executed concurrently, called iteration-level parallelism; and (iii) the number of cycles required to execute a single loop iteration. For each of these components, the hardware advantage over the fastest VESPA configuration (see the last row of Table 4) is shown in Table 5.

[Table 4. Area and performance advantage of hardware over various processors: per-benchmark and geometric-mean area ratios (A_processor/A_hw) and wall-clock-time ratios (T_processor/T_hw) for the scalar processor and for the slowest, least area-delay, and fastest VESPA configurations, along with their L, M, DD (KB), DW (B), DPV settings and clock (MHz); the numeric values were not recoverable from this transcription.]

Table 5. Hardware advantages over fastest VESPA.

    Benchmark     Clock   Iteration Parallelism   Cycles per Iteration
    autcor        2.6x    1x                      9.1x
    conven        3.9x    1x                      6.1x
    rgbcmyk       3.7x    0.375x                  13.8x
    rgbyiq        2.2x    0.375x                  19.0x
    ip checksum   3.7x    0.5x                    4.8x
    imgblend      3.6x    1x                      4.4x
    GEOMEAN       3.2x    0.64x                   8.2x

The second column shows that the hardware circuits have clock speeds between 2.2x and 4x faster than the best performing VESPA. This 3.2x average clock advantage can be improved upon through further circuit design effort in VESPA. The third column of Table 5 shows that the iteration-level parallelism exploited by the hardware is less than or equal to that exploited by VESPA, which is 16 for all benchmarks since there are 16 lanes. But in the hardware circuits we matched the parallelism to the memory bandwidth; for example, the IP CHECKSUM benchmark operates on a stream of 16-bit elements, meaning that in a given DRAM access only 8 elements can be retrieved from memory. The circuit is hence designed to have only 8-way parallelism, while VESPA wastes cycles gathering data for its 16 lanes. The last column shows the speedup of a single iteration in hardware over VESPA and is calculated from the measured overall speedups in the last row of Table 4 divided by the aforementioned clock and iteration-parallelism advantages. This component represents the inefficiencies inherent in our VESPA design as well as in any processor-style architecture: VESPA currently can sustain only one vector instruction in flight, while known techniques such as vector chaining could be used to overlap the execution of multiple instructions through a multi-ported vector register file and multiple functional units. The hardware circuit has the benefit of creating as many functional units as necessary and can feed them data without the scaling limitations of a centralized register file. Further improvements to VESPA's cycles per iteration are motivated, since it remains the largest component and will further expose fundamental limitations of processor architectures. VESPA's vector extensions reduce the iteration-parallelism hardware advantage from 10.3x for a scalar soft processor to 0.64x, proving that VESPA has greatly increased iteration parallelism and leaving cycles per iteration as a key target for further reducing the performance gap.

4.3. Area-Delay Product Gap

Figure 3 shows the area-delay of the scalar and VESPA processors relative to that of hardware, averaged across our benchmark set and plotted against area.

[Fig. 3. Area-delay product versus area of VESPA processors normalized against hardware; axes: hardware area-delay advantage versus hardware area advantage, series: scalar and 1-, 2-, 4-, 8-, and 16-lane configurations.]

The figure demonstrates that VESPA can provide up to a 3.25x decrease in area-delay versus the scalar SPREE processor. Note that VESPA includes the same scalar SPREE processor; thus, adding the vector extensions significantly increases the performance-per-area of this processor. The VESPA processor with the least area-delay product is still 892x worse than the hardware, but it is surprisingly not the VESPA design with the highest performance: instead it is the 8-lane, full-memory-crossbar vector processor with a 16KB cache, 64B line size, and data prefetching, listed in the second-last row of Table 4. While this area-delay gap is enormous, a significant part of it is due to area, which in many cases may be well worth the general-purpose computing provided by the processor. Specifically, the processor can be used to time-multiplex different computations versus instantiating a circuit for each computation.
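Restating the decomposition of Section 4.2 as a formula (our own notation), with T the wall-clock time, f the clock frequency, P the number of loop iterations executed concurrently, and C the cycles per loop iteration, the hardware's overall wall-clock advantage over the fastest VESPA is the product of the three geometric means in Table 5:

    \[
    \frac{T_{\text{VESPA}}}{T_{\text{HW}}}
      = \underbrace{\frac{f_{\text{HW}}}{f_{\text{VESPA}}}}_{\text{clock}}
        \times \underbrace{\frac{P_{\text{HW}}}{P_{\text{VESPA}}}}_{\text{iteration parallelism}}
        \times \underbrace{\frac{C_{\text{VESPA}}}{C_{\text{HW}}}}_{\text{cycles per iteration}}
      \approx 3.2 \times 0.64 \times 8.2 \approx 17.
    \]

This is consistent with the 17x wall-clock gap reported for the fastest VESPA configuration.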

[Fig. 4. Performance gained with the improved VESPA architecture.]

5. REDUCING THE PERFORMANCE GAP

In this section we examine the performance advantages that hardware circuits have over VESPA and describe the architectural modifications we use to mitigate them: we examine different cache designs, the decoupling of certain pipelines within VESPA, and data prefetching. These techniques directly tackle the cycles-per-iteration component highlighted in our earlier results, which already include these improvements. Figure 4 shows the accumulated performance gains from the three improvements, measured in cycle speedup since the clock frequency did not change significantly. On average, the cache tuning, decoupling, and prefetching combine to increase performance by 3x over the previous VESPA, reducing its 50x performance gap with hardware to the 17x reported in Table 4.

5.1. Cache Design

Hardware circuits typically benefit from near-perfect delivery of data from the DRAM to the pipeline functional units, while for most processors data passes through levels of caches, then the register file, and finally to the functional units. Although we maintained this framework, we accommodated VESPA by tuning the cache, specifically the cache line size, so that ideally all vector lanes can be satisfied with a single cache line request. The data cache line was parameterized and expanded from 16 bytes to 64 bytes, accompanied by a corresponding growth in capacity to keep the FPGA block RAMs fully utilized. Our experiments show that this improved cache design results in a 2x average performance gain, as seen in Figure 4, due almost entirely to the expanded cache line rather than the capacity [12]. This performance gain comes with a 2x growth in VESPA area, due primarily to the larger vector memory crossbar (seen in Figure 1), which grows with the cache line size. The crossbar is necessary even without a cache, and since the cache storage is less than 6% of the area and is shared with the scalar processor, we are not motivated to investigate a no-cache solution.

5.2. Zero-Overhead Loops

When comparing the hardware circuits to the vectorized loops, one glaring difference is the absence of the many control instructions required to manage a loop: in hardware, a finite state machine (FSM) manages the loop in parallel with the computation. We modified VESPA by decoupling the three pipelines, allowing vector, vector control, and scalar instructions to execute simultaneously and out-of-order with respect to each other. As long as the number of cycles needed to compute the vector operations is greater than the cycles needed for the vector control and scalar operations, the loop will have no overhead. While our previous work already decoupled the scalar pipeline, in this work we decouple the execution of the vector and vector control pipelines. The impact on performance for a 16-lane VESPA with a 16KB data cache and 64B line size is shown in Figure 4: the technique improves performance by up to 15%, and 7% on average, while the area cost is negligible.

5.3. Data Prefetching

Another advantage of custom hardware is that it can overlap computation with memory accesses. We can do the same in VESPA by supporting hardware data prefetching, where a cache miss translates into a request for the missing cache line as well as additional cache lines that are predicted to be accessed soon. Because of the predictable memory access patterns in our benchmarks, simple sequential prefetching that loads the next DPK cache lines is effective, reducing the time spent servicing misses to just 4% of execution time [12]. Using the DPV parameter instead instructs VESPA to prefetch only for vector memory instructions with low strides, and to prefetch either a constant number of elements or a multiple of the current vector length in elements into the cache. All of these methods yield very similar results. Figure 4 shows the 42% performance boost of our best overall prefetching configuration, which loads 8 times the current vector length in elements into the cache (a sketch of this heuristic follows at the end of this section). By using the vector length to determine the number of cache lines to prefetch, we guarantee no more than one miss per vector instruction regardless of the length of the vector. The cost of the prefetcher is less than 2% of the area, due primarily to buffering dirty cache lines evicted by prefetched lines.
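The following is a minimal C sketch of our own of the vector-length-based prefetch heuristic described above (the real prefetcher is hardware inside VESPA's cache controller; alignment and the buffering of evicted dirty lines are omitted, and treating DPV as the vector-length multiplier is our assumption): on a vector-load miss it requests the missing line plus enough subsequent lines to cover DPV times the current vector length in elements.

    #include <stdio.h>
    #include <stdint.h>

    #define LINE_BYTES 64  /* data cache line size DW in the fastest configuration */

    /* Stub standing in for a cache-line request to the memory system. */
    static void fetch_line(uint32_t line_number)
    {
        printf("fetch line %u\n", (unsigned)line_number);
    }

    /* On a miss caused by a low-stride vector load, fetch the missing line
     * plus enough following lines to cover dpv * vl elements, so the rest of
     * the vector instruction (and typically the next one) hits in the cache. */
    static void on_vector_miss(uint32_t miss_addr, int vl, int elem_bytes, int dpv)
    {
        uint32_t line = miss_addr / LINE_BYTES;
        uint32_t bytes_ahead = (uint32_t)(dpv * vl * elem_bytes);
        uint32_t lines_ahead = (bytes_ahead + LINE_BYTES - 1) / LINE_BYTES;  /* ceil */

        fetch_line(line);                          /* the demand miss itself */
        for (uint32_t i = 1; i <= lines_ahead; i++)
            fetch_line(line + i);                  /* sequential prefetches */
    }

    int main(void)
    {
        on_vector_miss(0x1000, 64, 2, 8);  /* 64 16-bit elements, multiplier 8 */
        return 0;
    }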
6. REDUCING THE AREA GAP

In hardware, we implement only the functional units required by the application and match them to the bit-widths of the data operands. VESPA is equipped with parameters that allow it to perform similar application-specific customizations. The vector lane width W can be used to reduce the datapath for benchmarks which do not require 32-bit processing. For example, CONVEN requires only a 1-bit datapath (see Table 2), and its implementation in hardware gains a large area advantage over VESPA because of it. Using the W parameter we can reduce the lane width to 1 bit and cut VESPA's area by half; vector state, control logic, the 32-bit address space, and the scalar processor limit further reduction. Note that our previous work [1] limited the lane width to multiples of 8. VESPA also supports the individual disabling of each vector instruction, which automatically eliminates hardware support for that instruction. This feature allows us to subset the instruction set to that used by the application, as shown in Table 2. Figure 5 shows the effect of instruction set subsetting, as well as the combined effect of subsetting and width reduction, on the set of Pareto-optimal points in our VESPA design space. We see that, compared to the full VESPA processor, the area is significantly reduced, in the best case by 45%, and some performance is even gained from the higher clock speeds, which reach as high as 153 MHz on the smaller customized VESPA processors. The points move closer to the origin as VESPA sheds general-purpose overheads and begins to resemble a dedicated hardware part. It is interesting to note that after trimming this area, the 16-lane VESPA with full memory crossbar, prefetching, and 64B line size has the smallest area-delay product, which is 561x worse than hardware: a substantial improvement over the 892x of the full-size 8-lane VESPA discussed earlier, and 5.15x better than the scalar soft processor.

[Fig. 5. Effect of instruction set subsetting and width reduction on the area and speed gap of VESPA processors versus hardware; axes: hardware speed advantage versus hardware area advantage, series: full, subsetted, and subsetted + width-reduced configurations.]

7. CONCLUSIONS

Our comparisons have demonstrated that C code executing on a scalar soft processor performs on average 432x slower and is 6.7x larger in area than custom FPGA hardware. The VESPA soft vector processor now provides a large design space of vector processors that, relative to hardware, ranges from 192x slower and 13x larger to 17x slower and 64x larger. This large space allows a designer to choose the area/performance of a system component without laborious hardware design, and can drastically reduce the 432x scalar soft processor performance gap to 17x for data-parallel workloads. In addition, VESPA is shown to have a 3x better area-delay product than our scalar soft processor. Finally, by eliminating hardware in VESPA that is not used by the application, we can reduce the area of VESPA by up to 45%, resulting in an area-delay product 5.15x better than that of a scalar soft processor. In summary, the quantified gap and improved soft vector processor can significantly reduce the need for embedded designers to resort to more challenging manual hardware design.

8. REFERENCES

[1] P. Yiannacouras, J. G. Steffan, and J. Rose, "VESPA: Portable, scalable, and flexible FPGA-based vector processors," in CASES '08: International Conference on Compilers, Architecture and Synthesis for Embedded Systems. ACM, 2008.
[2] J. Yu, G. Lemieux, and C. Eagleston, "Vector processing as a soft-core CPU accelerator," in Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM, 2008.
[3] P. Yiannacouras, J. G. Steffan, and J. Rose, "Application-specific customization of soft processor microarchitecture," in FPGA '06: Proceedings of the International Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM Press, 2006.
[4] R. Dimond, O. Mencer, and W. Luk, "CUSTARD - a customisable threaded FPGA soft processor and tools," in International Conference on Field Programmable Logic (FPL), August 2005.
[5] D. Lau, O. Pritchard, and P. Molson (Altera Santa Cruz), "Automated generation of hardware accelerators with direct memory access from ANSI/ISO standard C functions," in Field-Programmable Custom Computing Machines, 2006.
[6] The Embedded Microprocessor Benchmark Consortium, "EEMBC."
[7] W. Hardt and R. Camposano, "Trade-offs in HW/SW codesign," in Workshop on Hardware/Software Codesign. ACM.
[8] Z. Guo, W. Najjar, F. Vahid, and K. Vissers, "A quantitative analysis of the speedup factors of FPGAs over processors," in Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM, 2004.
[9] C. Kozyrakis and D. Patterson, "Scalable vector processors for embedded systems," IEEE Micro, vol. 23, no. 6, 2003.
[10] J. Fender, J. Rose, and D. R. Galloway, "The Transmogrifier-4: An FPGA-based hardware development system with multi-gigabyte memory capacity and high host and memory bandwidth," in IEEE International Conference on Field Programmable Technology, 2005.
[11] R. Cliff, Altera Corporation, private communication.
[12] P. Yiannacouras, J. G. Steffan, and J. Rose, "Improving memory systems for soft vector processors," in WoSPS '08: Workshop on Soft Processor Systems, 2008.


More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

Robust PIM-SM Multicasting using Anycast RP in Wireless Ad Hoc Networks

Robust PIM-SM Multicasting using Anycast RP in Wireless Ad Hoc Networks Robust PIM-SM Multicasting using Anycast RP in Wireless A Hoc Networks Jaewon Kang, John Sucec, Vikram Kaul, Sunil Samtani an Mariusz A. Fecko Applie Research, Telcoria Technologies One Telcoria Drive,

More information

Solution Representation for Job Shop Scheduling Problems in Ant Colony Optimisation

Solution Representation for Job Shop Scheduling Problems in Ant Colony Optimisation Solution Representation for Job Shop Scheuling Problems in Ant Colony Optimisation James Montgomery, Carole Faya 2, an Sana Petrovic 2 Faculty of Information & Communication Technologies, Swinburne University

More information

Professor Lee, Yong Surk. References 고성능마이크로프로세서구조의개요. Topics Microprocessor & microcontroller

Professor Lee, Yong Surk. References 고성능마이크로프로세서구조의개요. Topics Microprocessor & microcontroller 이강좌는 C & S Technology 사의지원으로제작되었으며 copyright가없으므로비영리적인목적에한하여누구든지복사, 배포가가능합니다. 연구실홈페이지에는고성능마이크로프로세서에관련된많은강좌가있으며누구나무료로다운로드받을수있습니다. Professor Lee, Yong Surk 1973 : B.S., Electrical Eng., Yonsei niv. 1981

More information

Fast Fractal Image Compression using PSO Based Optimization Techniques

Fast Fractal Image Compression using PSO Based Optimization Techniques Fast Fractal Compression using PSO Base Optimization Techniques A.Krishnamoorthy Visiting faculty Department Of ECE University College of Engineering panruti rishpci89@gmail.com S.Buvaneswari Visiting

More information

Adaptive Load Balancing based on IP Fast Reroute to Avoid Congestion Hot-spots

Adaptive Load Balancing based on IP Fast Reroute to Avoid Congestion Hot-spots Aaptive Loa Balancing base on IP Fast Reroute to Avoi Congestion Hot-spots Masaki Hara an Takuya Yoshihiro Faculty of Systems Engineering, Wakayama University 930 Sakaeani, Wakayama, 640-8510, Japan Email:

More information

Design Principles for Practical Self-Routing Nonblocking Switching Networks with O(N log N) Bit-Complexity

Design Principles for Practical Self-Routing Nonblocking Switching Networks with O(N log N) Bit-Complexity IEEE TRANSACTIONS ON COMPUTERS, VOL. 46, NO. 10, OCTOBER 1997 1 Design Principles for Practical Self-Routing Nonblocking Switching Networks with O(N log N) Bit-Complexity Te H. Szymanski, Member, IEEE

More information

Using Vector and Raster-Based Techniques in Categorical Map Generalization

Using Vector and Raster-Based Techniques in Categorical Map Generalization Thir ICA Workshop on Progress in Automate Map Generalization, Ottawa, 12-14 August 1999 1 Using Vector an Raster-Base Techniques in Categorical Map Generalization Beat Peter an Robert Weibel Department

More information

A Parameterized Automatic Cache Generator for FPGAs

A Parameterized Automatic Cache Generator for FPGAs A Parameterized Automatic Cache Generator for FPGAs Peter Yiannacouras and Jonathan Rose Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto Toronto, Ontario,

More information

Architecture Design of Mobile Access Coordinated Wireless Sensor Networks

Architecture Design of Mobile Access Coordinated Wireless Sensor Networks Architecture Design of Mobile Access Coorinate Wireless Sensor Networks Mai Abelhakim 1 Leonar E. Lightfoot Jian Ren 1 Tongtong Li 1 1 Department of Electrical & Computer Engineering, Michigan State University,

More information

Coordinating Distributed Algorithms for Feature Extraction Offloading in Multi-Camera Visual Sensor Networks

Coordinating Distributed Algorithms for Feature Extraction Offloading in Multi-Camera Visual Sensor Networks Coorinating Distribute Algorithms for Feature Extraction Offloaing in Multi-Camera Visual Sensor Networks Emil Eriksson, György Dán, Viktoria Foor School of Electrical Engineering, KTH Royal Institute

More information

On the Placement of Internet Taps in Wireless Neighborhood Networks

On the Placement of Internet Taps in Wireless Neighborhood Networks 1 On the Placement of Internet Taps in Wireless Neighborhoo Networks Lili Qiu, Ranveer Chanra, Kamal Jain, Mohamma Mahian Abstract Recently there has emerge a novel application of wireless technology that

More information

Questions? Post on piazza, or Radhika (radhika at eecs.berkeley) or Sameer (sa at berkeley)!

Questions? Post on piazza, or  Radhika (radhika at eecs.berkeley) or Sameer (sa at berkeley)! EE122 Fall 2013 HW3 Instructions Recor your answers in a file calle hw3.pf. Make sure to write your name an SID at the top of your assignment. For each problem, clearly inicate your final answer, bol an

More information

Backpressure-based Packet-by-Packet Adaptive Routing in Communication Networks

Backpressure-based Packet-by-Packet Adaptive Routing in Communication Networks 1 Backpressure-base Packet-by-Packet Aaptive Routing in Communication Networks Eleftheria Athanasopoulou, Loc Bui, Tianxiong Ji, R. Srikant, an Alexaner Stolyar Abstract Backpressure-base aaptive routing

More information

Considering bounds for approximation of 2 M to 3 N

Considering bounds for approximation of 2 M to 3 N Consiering bouns for approximation of to (version. Abstract: Estimating bouns of best approximations of to is iscusse. In the first part I evelop a powerseries, which shoul give practicable limits for

More information

Skyline Community Search in Multi-valued Networks

Skyline Community Search in Multi-valued Networks Syline Community Search in Multi-value Networs Rong-Hua Li Beijing Institute of Technology Beijing, China lironghuascut@gmail.com Jeffrey Xu Yu Chinese University of Hong Kong Hong Kong, China yu@se.cuh.eu.h

More information

Impact of cache interferences on usual numerical dense loop. nests. O. Temam C. Fricker W. Jalby. University of Leiden INRIA University of Versailles

Impact of cache interferences on usual numerical dense loop. nests. O. Temam C. Fricker W. Jalby. University of Leiden INRIA University of Versailles Impact of cache interferences on usual numerical ense loop nests O. Temam C. Fricker W. Jalby University of Leien INRIA University of Versailles Niels Bohrweg 1 Domaine e Voluceau MASI 2333 CA Leien 78153

More information

Application-Specific Customization of Soft Processor Microarchitecture

Application-Specific Customization of Soft Processor Microarchitecture Application-Specific Customization of Soft Processor Microarchitecture Peter Yiannacouras, J. Gregory Steffan, and Jonathan Rose The Edward S. Rogers Sr. Department of Electrical and Computer Engineering

More information

Finite Automata Implementations Considering CPU Cache J. Holub

Finite Automata Implementations Considering CPU Cache J. Holub Finite Automata Implementations Consiering CPU Cache J. Holub The finite automata are mathematical moels for finite state systems. More general finite automaton is the noneterministic finite automaton

More information

Exploring Context with Deep Structured models for Semantic Segmentation

Exploring Context with Deep Structured models for Semantic Segmentation 1 Exploring Context with Deep Structure moels for Semantic Segmentation Guosheng Lin, Chunhua Shen, Anton van en Hengel, Ian Rei between an image patch an a large backgroun image region. Explicitly moeling

More information

A Plane Tracker for AEC-automation Applications

A Plane Tracker for AEC-automation Applications A Plane Tracker for AEC-automation Applications Chen Feng *, an Vineet R. Kamat Department of Civil an Environmental Engineering, University of Michigan, Ann Arbor, USA * Corresponing author (cforrest@umich.eu)

More information

Frequency Domain Parameter Estimation of a Synchronous Generator Using Bi-objective Genetic Algorithms

Frequency Domain Parameter Estimation of a Synchronous Generator Using Bi-objective Genetic Algorithms Proceeings of the 7th WSEAS International Conference on Simulation, Moelling an Optimization, Beijing, China, September 15-17, 2007 429 Frequenc Domain Parameter Estimation of a Snchronous Generator Using

More information

Top-down Connectivity Policy Framework for Mobile Peer-to-Peer Applications

Top-down Connectivity Policy Framework for Mobile Peer-to-Peer Applications Top-own Connectivity Policy Framework for Mobile Peer-to-Peer Applications Otso Kassinen Mika Ylianttila Junzhao Sun Jussi Ala-Kurikka MeiaTeam Department of Electrical an Information Engineering University

More information

Research Article REALFLOW: Reliable Real-Time Flooding-Based Routing Protocol for Industrial Wireless Sensor Networks

Research Article REALFLOW: Reliable Real-Time Flooding-Based Routing Protocol for Industrial Wireless Sensor Networks Hinawi Publishing Corporation International Journal of Distribute Sensor Networks Volume 2014, Article ID 936379, 17 pages http://x.oi.org/10.1155/2014/936379 Research Article REALFLOW: Reliable Real-Time

More information

Shift-map Image Registration

Shift-map Image Registration Shift-map Image Registration Svärm, Linus; Stranmark, Petter Unpublishe: 2010-01-01 Link to publication Citation for publishe version (APA): Svärm, L., & Stranmark, P. (2010). Shift-map Image Registration.

More information

Higher Level Programming Abstractions for FPGAs using OpenCL

Higher Level Programming Abstractions for FPGAs using OpenCL Higher Level Programming Abstractions for FPGAs using OpenCL Desh Singh Supervising Principal Engineer Altera Corporation Toronto Technology Center ! Technology scaling favors programmability CPUs."#/0$*12'$-*

More information

Algorithm for Intermodal Optimal Multidestination Tour with Dynamic Travel Times

Algorithm for Intermodal Optimal Multidestination Tour with Dynamic Travel Times Algorithm for Intermoal Optimal Multiestination Tour with Dynamic Travel Times Neema Nassir, Alireza Khani, Mark Hickman, an Hyunsoo Noh This paper presents an efficient algorithm that fins the intermoal

More information

Optimizing the quality of scalable video streams on P2P Networks

Optimizing the quality of scalable video streams on P2P Networks Optimizing the quality of scalable vieo streams on PP Networks Paper #7 ASTRACT The volume of multimeia ata, incluing vieo, serve through Peer-to-Peer (PP) networks is growing rapily Unfortunately, high

More information

An In Depth Look at VOLK

An In Depth Look at VOLK An In Depth Look at VOLK The Vector-Optimize Library of Kernels Nathan West U.S. Naval Research Laboratory 26 August 2015 (U) 26 August 2015 1 / 19 A brief look at VOLK organization VOLK is a sub-project

More information

An Adaptive Routing Algorithm for Communication Networks using Back Pressure Technique

An Adaptive Routing Algorithm for Communication Networks using Back Pressure Technique International OPEN ACCESS Journal Of Moern Engineering Research (IJMER) An Aaptive Routing Algorithm for Communication Networks using Back Pressure Technique Khasimpeera Mohamme 1, K. Kalpana 2 1 M. Tech

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Design Space Exploration Using Parameterized Cores

Design Space Exploration Using Parameterized Cores RESEARCH CENTRE FOR INTEGRATED MICROSYSTEMS UNIVERSITY OF WINDSOR Design Space Exploration Using Parameterized Cores Ian D. L. Anderson M.A.Sc. Candidate March 31, 2006 Supervisor: Dr. M. Khalid 1 OUTLINE

More information

Lab work #8. Congestion control

Lab work #8. Congestion control TEORÍA DE REDES DE TELECOMUNICACIONES Grao en Ingeniería Telemática Grao en Ingeniería en Sistemas e Telecomunicación Curso 2015-2016 Lab work #8. Congestion control (1 session) Author: Pablo Pavón Mariño

More information

A Highly Scalable Parallel Boundary Element Method for Capacitance Extraction

A Highly Scalable Parallel Boundary Element Method for Capacitance Extraction A Highly Scalable Parallel Bounary Element Metho for Capacitance Extraction The MIT Faculty has mae this article openly available. Please share how this access benefits you. Your story matters. Citation

More information

Overlap Interval Partition Join

Overlap Interval Partition Join Overlap Interval Partition Join Anton Dignös Department of Computer Science University of Zürich, Switzerlan aignoes@ifi.uzh.ch Michael H. Böhlen Department of Computer Science University of Zürich, Switzerlan

More information

Particle Swarm Optimization Based on Smoothing Approach for Solving a Class of Bi-Level Multiobjective Programming Problem

Particle Swarm Optimization Based on Smoothing Approach for Solving a Class of Bi-Level Multiobjective Programming Problem BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 17, No 3 Sofia 017 Print ISSN: 1311-970; Online ISSN: 1314-4081 DOI: 10.1515/cait-017-0030 Particle Swarm Optimization Base

More information

On Effectively Determining the Downlink-to-uplink Sub-frame Width Ratio for Mobile WiMAX Networks Using Spline Extrapolation

On Effectively Determining the Downlink-to-uplink Sub-frame Width Ratio for Mobile WiMAX Networks Using Spline Extrapolation On Effectively Determining the Downlink-to-uplink Sub-frame With Ratio for Mobile WiMAX Networks Using Spline Extrapolation Panagiotis Sarigianniis, Member, IEEE, Member Malamati Louta, Member, IEEE, Member

More information