DATA PARALLEL FPGA WORKLOADS: SOFTWARE VERSUS HARDWARE. Peter Yiannacouras, J. Gregory Steffan, and Jonathan Rose

Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, 10 King's College Road, Toronto, ON
{yiannac, steffan, jayar}@eecg.utoronto.ca

ABSTRACT

Commercial soft processors are unable to effectively exploit the data parallelism present in many embedded systems workloads, requiring FPGA designers to exploit it (laboriously) with manual hardware design. Recent research [1, 2] has demonstrated that soft processors augmented with support for vector instructions provide significant improvements in performance and scalability for data-parallel workloads. These soft vector processors provide a software environment for quickly encoding data-parallel computation, but their competitiveness with manual hardware design in terms of area and performance remains unknown. In this work, using an FPGA platform equipped with DDR memory and executing data-parallel EEMBC embedded benchmarks, we measure the area/performance gaps between (i) a scalar soft processor, (ii) our improved soft vector processor, and (iii) custom FPGA hardware. We demonstrate that the 432x wall-clock performance gap between scalar executed C and custom hardware can be reduced significantly, to 17x, using our improved soft vector processor, while silicon efficiency is improved by 3x in terms of area-delay product. We modified the architecture to mitigate three key advantages we observed in custom hardware: loop overhead, data delivery, and exact resource usage. Combined, these improvements increase performance by 3x and reduce area by almost half, significantly reducing the need for designers to resort to more challenging custom hardware implementations.

1. INTRODUCTION

The designer of an FPGA-based embedded system often faces a difficult choice between designing custom hardware by hand using a hardware-description language (HDL) mapped directly to the FPGA fabric, or writing software in a high-level language such as C that targets a soft processor: a processor implemented in the programmable FPGA fabric and programmed using traditional sequential programming languages and software compilers. The performance of a soft processor is often sufficient for parts of the design, allowing embedded systems designers to use soft processors to reduce their time to market and exploit single-chip advantages without requiring specialized FPGAs with hard processors; however, the performance and area of current commercial soft processors are still significantly inferior to those of a custom hardware solution, meaning designers must spend more time implementing hardware to meet their design constraints. As a result, we are motivated to improve soft processors to reduce FPGA design time. Recent advances [3-5] have indeed expanded the applicability of soft processors by improving on current commercial offerings. In particular, recent work has proposed extending soft processors with vector processing capabilities [1, 2] as a means of scaling performance for data-parallel workloads. Vector processing allows a single instruction to command multiple datapaths called vector lanes; on an FPGA the number of vector lanes can be configured by the designer, allowing them to use more FPGA resources to scale up performance. However, the impact of soft vector processors depends on their ability to lure FPGA designers into software design by providing good enough performance and area to reduce the amount of manual hardware design needed.
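For concreteness, the following is a minimal, hypothetical C sketch (our own illustration, not one of the measured benchmarks) of the kind of data-parallel loop targeted throughout this paper. Every iteration is independent, so a vector unit can apply a single instruction across all of its lanes at once, whereas a scalar soft processor must step through the elements one at a time.

    #include <stdio.h>
    #include <stdint.h>

    /* Saturating blend of two 8-bit buffers, element by element. Each
     * iteration is independent of the others, so a soft vector processor can
     * execute the loop body as vector operations spread across its L lanes,
     * while a scalar soft processor handles one element per iteration. */
    void blend(const uint8_t *a, const uint8_t *b, uint8_t *out, int n)
    {
        for (int i = 0; i < n; i++) {
            int sum = a[i] + b[i];                      /* widen to avoid overflow */
            out[i] = (sum > 255) ? 255 : (uint8_t)sum;  /* saturate to 8 bits */
        }
    }

    int main(void)
    {
        uint8_t a[4] = {10, 200, 255, 0}, b[4] = {20, 100, 1, 0}, out[4];
        blend(a, b, out, 4);
        printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  /* 30 255 255 0 */
        return 0;
    }

A hand-vectorized version of such a loop, expressed with vector instructions, is what the soft vector processor evaluated below executes.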
Thus, it is crucial to understand the performance and area gap between soft vector processors and custom hardware.

1.1. Measuring, Understanding, and Reducing the Gap

We measure the area and performance gap using several data-parallel benchmarks (primarily from the EEMBC [6] industry-standard embedded benchmark suites) on three platforms executing: (i) out-of-the-box C on a scalar soft processor; (ii) hand-vectorized assembly on many configurations of the soft vector processor called VESPA (Vector Extended Soft Processor Architecture) [1]; and (iii) custom hardware hand-designed in Verilog. Our goal is to use this measurement to quantify the competitiveness of recent soft vector processors and to further improve them by leveraging our insights into the causes of the performance/area gap as well as the circuit structures used to implement the benchmarks in hardware. Specifically, we identify the following key advantages of custom hardware over VESPA, and we improve VESPA to reduce the impact of each.

Loop Overhead. Loop control in custom hardware is generally implemented using a finite state machine (FSM) that executes in parallel with the loop computation, while in VESPA the control datapath must complete an instruction before the vector lanes can issue the following instruction, and vice versa. We reduce this advantage by decoupling the control datapath from the vector lanes and exploiting instruction-level parallelism.

Data Delivery. High-performance custom hardware can often achieve near-perfect delivery of data to its functional units with no cycles wasted. In contrast, for soft processors including VESPA, data flows from memory through caches to registers and eventually to functional units. We improve data delivery in VESPA in two ways: (i) by tuning the cache design, and (ii) by supporting prefetching.

Exact Resource Usage. A custom hardware implementation contains exactly the resources required by the application: functional units support only the required operations, and datapath bit-widths exactly match those required. In contrast, soft processors such as VESPA are general-purpose and hence support a full instruction set (ISA) and the corresponding maximum bit-widths. We improve VESPA via support for subsetting the instruction set and reducing datapath bit-widths to match the application.

In this work we demonstrate that these improvements, when combined, provide a 3x performance improvement over the original VESPA and significantly broaden its design space. We also show that the performance gap between a scalar soft processor and custom hardware is 432x, and that our fastest VESPA implementation reduces this gap to 17x while providing a performance-per-unit-area up to 3x that of the scalar processor. While the remaining gap is still large, these improvements allow soft vector processors to better compete with custom hardware, allowing designers to more often implement a software-programmable solution rather than having to design custom hardware.

1.2. Related Work

The most closely related work is by Yu et al. [2], who demonstrated the potential of vector processing as a simple-to-use and scalable accelerator for soft processors, potentially scaling better than Altera's C2H [5] behavioral synthesis tool for three benchmarks. However, that work modeled a vector processor optimistically, including the use of an on-chip one-cycle (latency) memory system; we compare a real vector processor to manual hardware design. Hardt and Camposano [7] compared hardware circuits synthesized to 2µm CMOS against software on a SPARC processor, with cycle performance estimated from static code analysis. They found that hardware outperforms the processor by factors ranging between 24x and 44x for scalar workloads. Our work performs a similar comparison, but between FPGA hardware and soft vector processors, while including the effects of clock frequency and memory latency. More recent work [8] has compared FPGAs to hard microprocessors but does not compare against soft vector processors.

[Fig. 1. VESPA processor block diagram: a scalar MIPS core and a vector coprocessor of lanes 1..L share an instruction cache; a memory crossbar connects the lanes through an arbiter to the data cache and prefetcher, backed by DDR memory.]
1.3. Contributions

In this paper we make the following contributions: (i) using an FPGA platform with DDR memory, we quantify and analyze the area/performance gaps for industry-standard benchmarks between a scalar soft processor, a parameterized vector soft processor, and hand-designed hardware implementations; (ii) we improve VESPA by targeting key advantages of hardware implementations, specifically by reducing loop overhead, tuning the cache design, supporting data prefetching, and eliminating unused hardware; (iii) we show that our improved VESPA provides a powerful design space, spanning 5x in area and 11x in performance, with the fastest VESPA reducing the 432x scalar soft processor performance gap to 17x while improving performance per area by up to 3x.

2. VESPA

In our previous work on VESPA (Vector Extended Soft Processor Architecture) we implemented a parameterized vector processor in Verilog and explored its potential for scalability and customization. The following summarizes the VESPA architecture and parameters (old and new); further details can be found in [1]. Figure 1 shows a block diagram of the VESPA processor, which consists of a scalar MIPS-based processor automatically generated using the SPREE system [3], coupled with a parameterized vector coprocessor based on the VIRAM [9] vector instruction set. The scalar SPREE processor is a 3-stage pipeline with full forwarding and a 1-bit branch history table.

Table 1. Configurable parameters for VESPA.

    Parameter                 Symbol   Values
    Vector Lanes              L        1, 2, 4, 8, 16, ...
    Vector Lane Bit-Width     W        1, 2, 3, 4, ..., 32
    Maximum Vector Length     MVL      2, 4, 8, 16, ...
    Memory Crossbar Lanes     M        1, 2, 4, 8, ..., L
    Each Vector Instruction   -        on/off
    DCache Depth (KB)         DD       4, 8, ...
    DCache Line Size (B)      DW       16, 32, 64, ...
    DCache Miss Prefetch      DPK      1, 2, 3, ...
    Vector Miss Prefetch      DPV      1, 2, 3, ...

The parameters of the VESPA system are listed in Table 1. The vector coprocessor consists of L parallel vector lanes, where each lane can perform operations on a single element in a pipelined fashion. The width W of each vector lane datapath is 32 bits by default, but can be reduced for applications that require less than the full 32-bit width. MVL determines the maximum vector length supported in hardware and is set to 64 for this study. The scalar processor and vector coprocessor share a single instruction stream fed by an instruction cache. The scalar processor and vector coprocessor are both in-order pipelines, but can execute out-of-order with respect to each other, except for memory operations, which are serialized to maintain sequential consistency. Both share a direct-mapped data cache with parameterized depth DD and cache line size DW. A crossbar routes each byte in a cache line to/from M of the L vector lanes in a given cycle. A full crossbar (M=L) can significantly reduce the clock frequency of the design when L is large; in such cases M can be reduced to restore the clock rate and save area, but more cycles will be spent moving data between the cache lines and vector lanes (a first-order sketch of this trade-off appears at the end of this section). The data cache is equipped with a hardware prefetcher configured with the parameters DPK and DPV, described in a later section. Beyond our previous work, we compare VESPA configurations to hardware for the first time, we add configurable caches and data prefetching, we explore the complete design space with our new robust design rather than individually for each parameter, and we make other architectural improvements (see Section 5.2).
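As a rough illustration of the L, M, and MVL parameters, the sketch below estimates how many data-movement cycles a single vector memory instruction might spend when the crossbar serves only M of the L lanes per cycle. This is a simplified first-order model of our own (the real cost also depends on cache line size, element width, alignment, and misses), not the exact VESPA pipeline behavior.

    #include <stdio.h>

    /* First-order estimate: with vl elements to transfer and a crossbar that
     * can serve M of the L lanes each cycle, a vector memory instruction
     * needs roughly ceil(active/M) crossbar passes per group of L elements.
     * Assumes every element hits in the data cache. */
    static int xbar_cycles(int vl, int L, int M)
    {
        int cycles = 0;
        for (int done = 0; done < vl; done += L) {        /* one wave per L elements */
            int active = (vl - done < L) ? (vl - done) : L;
            cycles += (active + M - 1) / M;               /* ceil(active / M) passes */
        }
        return cycles;
    }

    int main(void)
    {
        /* MVL = 64 as in this study; compare a full crossbar to a halved one. */
        printf("L=16, M=16: %d cycles\n", xbar_cycles(64, 16, 16));  /* 4 */
        printf("L=16, M=8:  %d cycles\n", xbar_cycles(64, 16, 8));   /* 8 */
        return 0;
    }

Halving M shrinks the crossbar and helps the clock rate, at the cost of roughly doubling the data-movement cycles in this simple model, which is the trade-off described above.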
3. MEASUREMENT METHODOLOGY

Our goal is to measure the area/performance gap between scalar soft processors, soft vector processors, and hardware, as well as to investigate techniques for reducing that gap. In this section we describe the components of our infrastructure necessary to execute, verify, and evaluate the FPGA designs: our hardware platform, verification process, CAD tool measurement methodology, benchmarks, and compiler. We also discuss how the hardware implementations of our benchmarks were created.

Soft Processor Platform. We use the Transmogrifier 4 (TM4) [10] to host the complete soft processor systems. The platform has four Altera Stratix EP1S80F1508C6 devices, each with access to two 1GB PC3200 CL3 DDR SDRAM DIMMs clocked at 133 MHz (266 MHz DDR). We synthesize our processor systems onto one of the four Stratix I FPGAs connected to one of the DIMMs and clock the processor at 50 MHz. All instances of VESPA are fully tested in hardware using the built-in checksum values encoded into each benchmark; debugging is guided by comparing traces of all writes to the scalar and vector register files. Note that because the Stratix I FPGAs on the TM4 are dated, we use this platform only for measuring benchmark cycle counts. For area and clock frequency measurements we use the CAD flow described below to target a faster Stratix III FPGA (a device unavailable to us in hardware), which achieves a clock speed of 130 MHz. While this faster clock speed would increase the memory latency observed by the processor, we believe it would not significantly impact our results: the memory latency in our current system is already exaggerated by the fact that our DDR controller is hand-made and suffers many inefficiencies, including the use of a closed-page policy.

FPGA CAD Tools. A key benefit of FPGA-based systems research is that we can obtain high-quality measurements, including the area and clock frequency measurements provided by FPGA CAD tools. We use Altera's Quartus II 8.0 CAD software with register retiming and duplication enabled and with aggressive timing constraints; through experimentation we found that these settings provide the best area, delay, and runtime trade-off. We perform eight such runs for each hardware design to average out the nondeterminism in the CAD algorithms. We approximate the relative silicon area of each Stratix III tile by adjusting the values supplied to us by Altera [11] for the Stratix II. We report the silicon area consumed by a design in units of equivalent ALMs: the silicon area of a single ALM (Adaptive Logic Module, the basic programmable logic unit in the Stratix III) including its routing. For soft processors, the areas we report include everything except the memory controller and host communication hardware.

Benchmarks. The six benchmarks that we measure are listed in Table 2: five are from the industry-standard EEMBC collection [6], and one (IMGBLEND) was hand-made. All except IP CHECKSUM were hand-vectorized and provided by Kozyrakis and the Berkeley VIRAM project [9]. For the top four benchmarks we execute the largest dataset with the EEMBC test harness uncompromised. We also manually extracted and vectorized the IP CHECKSUM kernel from the Networking suite of EEMBC and execute it on 10 4KB input packets. Note that cycle counts are collected from a complete execution on our hardware platform as described above, and the vectorized code is never modified to support any specific vector configuration.

Table 2. Benchmark applications. [The EEMBC dataset numbers, input/output sizes, and largest-vector-element widths were not recoverable from this transcription.]

    Benchmark     Description           Source        EEMBC Suite    % VIRAM ISA Used
    AUTCOR        auto correlation      EEMBC/VIRAM   Telecom        9.6%
    CONVEN        convolution encoder   EEMBC/VIRAM   Telecom        5.9%
    RGBCMYK       rgb filter            EEMBC/VIRAM   Digital Ent.   5.9%
    RGBYIQ        rgb filter            EEMBC/VIRAM   Digital Ent.   8.1%
    IP CHECKSUM   checksum              EEMBC         Networking     8.1%
    IMGBLEND      combine two images    VIRAM         (none)         7.4%

Compilation Framework. Benchmarks are built using a MIPS port of GNU gcc with the -O3 optimization level. Initial experiments with this version of gcc's auto-vectorization showed that it failed to vectorize key loops in our benchmarks, preventing us from automatically generating vectorized code. Instead, we ported the GNU assembler to support VIRAM vector instructions, allowing us to vectorize manually in assembly.

Area-Delay Product. A system designer may care more about area than performance, or vice versa, depending on the constraints of the design at hand. However, it is important to understand the overall performance-per-area of candidate designs, motivating us to measure the area-delay product, as is traditionally done for digital circuits. We use the aforementioned equivalent ALMs for area and the wall-clock time of benchmark execution as the delay, combining the cycle counts reported by real hardware with the maximum clock frequency reported by the CAD tools (a small sketch of these calculations appears at the end of this section).

3.1. Designing Custom Hardware Circuits

We model the performance of our hardware circuits optimistically while using area and clock frequencies from a real FPGA hardware design, achieved by manually converting each benchmark into a Verilog hardware circuit. While there are infinite variations of such hardware designs, we attempted to implement designs that maximize performance while simplifying the process with the following assumptions. All input/output data starts/ends in memory and is transferred uninterrupted at the full rate of our DRAM device. We also idealize the control logic, assuming it can make decisions in a single cycle and accounts for negligible area. Finally, we do not allow any value- or value-range-specific optimization in either the software or the hardware. To summarize, we build only the datapath of the circuit under optimistic assumptions about the control logic and the transfer of data.

[Table 3. Hardware circuit area and performance: ALMs, DSPs, M9Ks, clock (MHz), and cycle counts for AUTCOR, CONVEN, RGBCMYK, RGBYIQ, IP CHECKSUM, IMGBLEND, and the unrolled AUTCOR and CONVEN variants; the numeric values were not recoverable from this transcription.]

The resulting hardware circuits are tested in simulation using test vectors, and area and clock frequency are measured using the previously described CAD flow. For each hardware circuit we compute the total number of cycles for execution as the sum of the pipeline latency plus the cycles spent transferring data, since the circuit computation is done in parallel with this transfer. Overall, we believe the hardware circuits are optimistic and certainly overcome the manual vectorization advantage in software. As a result of forbidding value and value-range optimizations, we do not perform loop unrolling of non-vectorized loops, nor the equivalent in hardware. For example, benchmarks such as AUTCOR operate repeatedly on the same data set, with the actual computation dependent on a parameter input which varies from 0 to 15. In hardware we could unroll that loop, performing all 16 operations simultaneously. The benefit of unrolling a loop would be relatively small for VESPA, which is an in-order single-issue processor, while hardware could readily exploit the exposed instruction-level parallelism (ILP). The last two rows in Table 3 show the impact of unrolling in hardware for AUTCOR and CONVEN, the only two benchmarks where unrolling is useful in hardware. The unrolled circuits are not used in our results, but the performance impact can be large: in the case of AUTCOR, execution completes in 7.4x fewer cycles, although clock frequency is reduced and circuit area increases substantially.
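To make the two evaluation formulas above concrete, the following sketch (our own illustration, using made-up numbers rather than values from Table 3 or Table 4) computes a hardware circuit's cycle count as pipeline latency plus data-transfer cycles, and the area-delay product as equivalent ALMs multiplied by the wall-clock time derived from measured cycles and the CAD-reported clock frequency.

    #include <stdio.h>

    /* Cycle model for the custom hardware circuits: computation overlaps with
     * the streaming of data, so total cycles = pipeline latency + cycles spent
     * transferring data to/from DRAM. */
    static double hw_cycles(double pipeline_latency, double transfer_cycles)
    {
        return pipeline_latency + transfer_cycles;
    }

    /* Area-delay product: equivalent ALMs (area) times wall-clock execution
     * time (cycles / fmax), the silicon-efficiency metric used in this paper. */
    static double area_delay(double equiv_alms, double cycles, double fmax_hz)
    {
        return equiv_alms * (cycles / fmax_hz);
    }

    int main(void)
    {
        /* Illustrative numbers only; not taken from the paper's tables. */
        double cycles = hw_cycles(20.0, 100000.0);
        printf("cycles = %.0f\n", cycles);
        printf("area-delay = %.6f ALM*s\n", area_delay(5000.0, cycles, 130e6));
        return 0;
    }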
4. COMPARING TO HARDWARE

In this section we compare the area and performance of the following three implementations of our benchmarks, created via different design entry methods: (i) out-of-the-box C code executed on the MIPS-based SPREE scalar processor; (ii) hand-vectorized assembly language executed on many variations of our VESPA soft vector processor; and (iii) hardware designed in Verilog at the register transfer level as described in Section 3.1.

Table 4 shows the area advantage and speedup of the hardware implementations versus the scalar SPREE processor in the first row, and versus the slowest, the least area-delay, and the fastest configurations of VESPA in the remaining three rows. The limited number of multipliers in the Stratix 1S80 on the TM4 prevents us from evaluating soft vector processors with more than 16 lanes, but we expect further performance scaling on larger Stratix III based hardware platforms [1]. Focusing on the first row of the table, we observe that the scalar processor executing out-of-the-box C code is on average 6.7x larger than the hardware circuits and performs 432x slower. Not exploiting the available data parallelism is the primary cause of the under-performance. The area of the scalar processor is larger than each of the hardware implementations, suggesting that despite its time-multiplexed resources, general-purpose overheads cause the processor to be larger than the spatially executed hardware. In an extreme case, CONVEN with its 1-bit datapath is 64x smaller than the scalar processor. With respect to the hardware circuits, VESPA is 13x to 64x larger and 192x to 17x slower. A more quantitative analysis follows in a subsequent section, but it is clear that vector processing extensions to soft processors are motivated, since the 432x scalar processor performance gap can be reduced down to 17x. Such a massive performance boost could help convert many components of an FPGA system into software executing on a soft vector processor rather than laboriously designed custom hardware.

Figure 2 shows the area-performance design space of many near-Pareto-optimal VESPA processors normalized against hardware. We observe that the VESPA design space is quite large, spanning 5x in area and 11x in performance, with the 16-lane VESPAs providing the best performance at the cost of additional area. The figure identifies the number of lanes in each configuration, which is the most dominant parameter in determining area and performance, but the memory crossbar size M, the data cache depth DD, the data cache line size DW, and the data prefetcher DPV are also varied. These parameters will be discussed in a subsequent section; here they are used to show the fine-grained trade-offs within VESPA. The trade-offs are significant because VESPA could be a potentially large component in an FPGA system.

[Fig. 2. Area-performance design space of VESPA processors normalized against hardware; axes: hardware speed advantage versus hardware area advantage, series: 1-, 2-, 4-, 8-, and 16-lane configurations.]

4.1. VESPA vs. Scalar

Looking at the first two rows of Table 4, we can compare the scalar processor with a VESPA processor that has only a single lane and an identical cache organization. The VESPA processors are at least 2x larger than the scalar processor since they comprise both a scalar processor and a vector coprocessor. The hand-vectorized assembly executed on VESPA gains more than 2x average performance over the scalar out-of-the-box C code on scalar SPREE, even though there is no data-parallel execution on the single-lane version of VESPA. This is partly due to a number of advantages of VESPA:
(a) more efficient pipeline execution with few dependencies;
(b) the large vector register file can store and manipulate arrays without having to access the cache or memory;
(c) amortization of loop control instructions (sketched after this list);
(d) direct support for fixed-point operations, predication, and built-in min/max/absolute instructions in the VIRAM instruction set;
(e) simultaneous execution in the scalar processor and vector coprocessor; and
(f) manual vectorization in assembly versus the C-compiled scalar output from GCC.
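As a rough illustration of advantage (c), the strip-mined loop below (a C sketch of our own; the measured benchmarks were actually vectorized in VIRAM assembly) shows how operating on up to MVL = 64 elements per vector operation amortizes the loop bookkeeping (index update, bounds test, branch) over 64 iterations' worth of work.

    #include <stdio.h>
    #include <stdint.h>

    #define MVL 64  /* maximum vector length supported in hardware in this study */

    /* Hypothetical vector primitive standing in for a single vector
     * instruction: adds len (<= MVL) elements of b into a. On a soft vector
     * processor one such operation is a single instruction commanding all
     * vector lanes. */
    static void vadd_u16(uint16_t *a, const uint16_t *b, int len)
    {
        for (int i = 0; i < len; i++)
            a[i] += b[i];
    }

    /* Strip-mined loop: the bookkeeping (index update, comparison, branch)
     * runs once per MVL elements instead of once per element, which is the
     * loop-overhead amortization described in advantage (c). */
    void accumulate(uint16_t *a, const uint16_t *b, int n)
    {
        for (int i = 0; i < n; i += MVL) {
            int len = (n - i < MVL) ? (n - i) : MVL;  /* last, possibly partial, strip */
            vadd_u16(&a[i], &b[i], len);
        }
    }

    int main(void)
    {
        uint16_t a[100] = {0}, b[100];
        for (int i = 0; i < 100; i++) b[i] = (uint16_t)i;
        accumulate(a, b, 100);
        printf("%d\n", a[99]);  /* 99 */
        return 0;
    }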
Determining the exact contribution of each advantage is beyond the scope of this work; we instead perform some qualitative analysis. Closer inspection of CONVEN revealed the cause of the 9x performance boost seen on the single-lane VESPA to be the repeated operations performed on a single array: in VESPA the large vector register file can store large array chunks and manipulate them without storing and re-reading them from the cache, as the scalar processor must. The other benchmarks are less affected because of their streaming, low-reuse nature. The loop-overhead amortization gained by performing 64 loop iterations (MVL=64) at once benefits all benchmarks. The more powerful VIRAM instruction set with fixed-point support further reduced the loop bodies of AUTCOR and RGBCMYK. Finally, the disassembled scalar GCC output did not appear significantly less efficient than the vectorized assembly for any of the benchmarks, leading us to infer that manual assembly optimization was not a disproportionately significant advantage for VESPA.

4.2. VESPA vs. HW

By focusing only on loops, we can decompose the performance difference between VESPA and hardware into the following categories: (i) the clock frequency; (ii) the number of loop iterations executed concurrently, called iteration-level parallelism; and (iii) the number of cycles required to execute a single loop iteration. For each of these components, the hardware advantage over the fastest VESPA configuration (see the last row of Table 4) is shown in Table 5.

[Table 4. Area and performance advantage of hardware over various processors: per-benchmark and geometric-mean area ratios (A_processor/A_hw) and wall-clock-time ratios (T_processor/T_hw) for the scalar processor and for the slowest, least area-delay, and fastest VESPA configurations, along with their L, M, DD (KB), DW (B), DPV settings and clock (MHz); the numeric values were not recoverable from this transcription.]

Table 5. Hardware advantages over fastest VESPA.

    Benchmark     Clock   Iteration Parallelism   Cycles per Iteration
    autcor        2.6x    1x                      9.1x
    conven        3.9x    1x                      6.1x
    rgbcmyk       3.7x    0.375x                  13.8x
    rgbyiq        2.2x    0.375x                  19.0x
    ip checksum   3.7x    0.5x                    4.8x
    imgblend      3.6x    1x                      4.4x
    GEOMEAN       3.2x    0.64x                   8.2x

The second column shows that the hardware circuits have clock speeds between 2.2x and 4x faster than the best performing VESPA. This 3.2x average clock advantage can be improved upon through further circuit design effort in VESPA. The third column of Table 5 shows that the iteration-level parallelism exploited by the hardware is less than or equal to that exploited by VESPA, which is 16 for all benchmarks since there are 16 lanes. But in the hardware circuits we matched the parallelism to the memory bandwidth; for example, the IP CHECKSUM benchmark operates on a stream of 16-bit elements, meaning that in a given DRAM access only 8 elements can be retrieved from memory. The circuit is hence designed to have only 8-way parallelism, while VESPA wastes cycles gathering data for its 16 lanes. The last column shows the speedup of a single iteration in hardware over VESPA and is calculated from the measured overall speedups in the last row of Table 4 divided by the aforementioned clock and iteration-parallelism advantages. This component represents the inefficiencies inherent in our VESPA design as well as in any processor-style architecture: VESPA currently can sustain only one vector instruction in flight, while known techniques such as vector chaining could be used to overlap the execution of multiple instructions through a multi-ported vector register file and multiple functional units. The hardware circuit has the benefit of creating as many functional units as necessary and can feed them data without the scaling limitations of a centralized register file. Further improvements to VESPA's cycles per iteration are motivated, since it remains the largest component and will further expose fundamental limitations of processor architectures. VESPA's vector extensions reduce the iteration-parallelism hardware advantage from 10.3x for a scalar soft processor to 0.64x, proving that VESPA has greatly increased iteration parallelism and leaving cycles per iteration as a key target for further reducing the performance gap.

4.3. Area-Delay Product Gap

Figure 3 shows the area-delay of the scalar and VESPA processors relative to that of hardware, averaged across our benchmark set and plotted against area.

[Fig. 3. Area-delay product versus area of VESPA processors normalized against hardware; axes: hardware area-delay advantage versus hardware area advantage, series: scalar and 1-, 2-, 4-, 8-, and 16-lane configurations.]

The figure demonstrates that VESPA can provide up to a 3.25x decrease in area-delay versus the scalar SPREE processor. Note that VESPA includes the same scalar SPREE processor; thus, adding the vector extensions significantly increases the performance-per-area of this processor. The VESPA processor with the least area-delay product is still 892x worse than the hardware, but it is surprisingly not the VESPA design with the highest performance: instead it is the 8-lane, full-memory-crossbar vector processor with a 16KB cache, 64B line size, and data prefetching, listed in the second-last row of Table 4. While this area-delay gap is enormous, a significant part of it is due to area, which in many cases may be well worth the general-purpose computing provided by the processor. Specifically, the processor can be used to time-multiplex different computations versus instantiating a circuit for each computation.
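Restating the decomposition of Section 4.2 as a formula (our own notation), with T the wall-clock time, f the clock frequency, P the number of loop iterations executed concurrently, and C the cycles per loop iteration, the hardware's overall wall-clock advantage over the fastest VESPA is the product of the three geometric means in Table 5:

    \[
    \frac{T_{\text{VESPA}}}{T_{\text{HW}}}
      = \underbrace{\frac{f_{\text{HW}}}{f_{\text{VESPA}}}}_{\text{clock}}
        \times \underbrace{\frac{P_{\text{HW}}}{P_{\text{VESPA}}}}_{\text{iteration parallelism}}
        \times \underbrace{\frac{C_{\text{VESPA}}}{C_{\text{HW}}}}_{\text{cycles per iteration}}
      \approx 3.2 \times 0.64 \times 8.2 \approx 17.
    \]

This is consistent with the 17x wall-clock gap reported for the fastest VESPA configuration.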

[Fig. 4. Performance gained with the improved VESPA architecture.]

5. REDUCING THE PERFORMANCE GAP

In this section we examine the performance advantages that hardware circuits have over VESPA and describe the architectural modifications we use to mitigate them: we examine different cache designs, the decoupling of certain pipelines within VESPA, and data prefetching. These techniques directly tackle the cycles-per-iteration component highlighted in our earlier results, which already include these improvements. Figure 4 shows the accumulated performance gains from the three improvements, measured in cycle speedup since the clock frequency did not change significantly. On average, the cache tuning, decoupling, and prefetching combine to increase performance by 3x over the previous VESPA, reducing its 50x performance gap with hardware to the 17x reported in Table 4.

5.1. Cache Design

Hardware circuits typically benefit from near-perfect delivery of data from the DRAM to the pipeline functional units, while for most processors data passes through levels of caches, then the register file, and finally to the functional units. Although we maintained this framework, we accommodated VESPA by tuning the cache, specifically the cache line size, so that ideally all vector lanes can be satisfied with a single cache line request. The data cache line was parameterized and expanded from 16 bytes to 64 bytes, accompanied by a corresponding growth in capacity to keep the FPGA block RAMs fully utilized. Our experiments show that this improved cache design results in a 2x average performance gain, as seen in Figure 4, due almost entirely to the expanded cache line rather than the capacity [12]. This performance gain comes with a 2x growth in VESPA area, due primarily to the larger vector memory crossbar (seen in Figure 1), which grows with the cache line size. The crossbar is necessary even without a cache, and since the cache storage is less than 6% of the area and is shared with the scalar processor, we are not motivated to investigate a no-cache solution.

5.2. Zero-Overhead Loops

When comparing the hardware circuits to the vectorized loops, one glaring difference is the absence of the many control instructions required to manage a loop: in hardware, a finite state machine (FSM) manages the loop in parallel with the computation. We modified VESPA by decoupling the three pipelines, allowing vector, vector control, and scalar instructions to execute simultaneously and out-of-order with respect to each other. As long as the number of cycles needed to compute the vector operations is greater than the cycles needed for the vector control and scalar operations, the loop will have no overhead. While our previous work already decoupled the scalar pipeline, in this work we decouple the execution of the vector and vector control pipelines. The impact on performance for a 16-lane VESPA with a 16KB data cache and 64B line size is shown in Figure 4: the technique improves performance by up to 15%, and 7% on average, while the area cost is negligible.

5.3. Data Prefetching

Another advantage of custom hardware is that it can overlap computation with memory accesses. We can do the same in VESPA by supporting hardware data prefetching, where a cache miss translates into a request for the missing cache line as well as additional cache lines that are predicted to be accessed soon. Because of the predictable memory access patterns in our benchmarks, simple sequential prefetching that loads the next DPK cache lines is effective, reducing the time spent servicing misses to just 4% of execution time [12]. Using the DPV parameter instead instructs VESPA to prefetch only for vector memory instructions with low strides, and to prefetch either a constant number of elements or a multiple of the current vector length in elements into the cache. All of these methods yield very similar results. Figure 4 shows the 42% performance boost of our best overall prefetching configuration, which loads 8 times the current vector length in elements into the cache (a sketch of this heuristic follows at the end of this section). By using the vector length to determine the number of cache lines to prefetch, we guarantee no more than one miss per vector instruction regardless of the length of the vector. The cost of the prefetcher is less than 2% of the area, due primarily to buffering dirty cache lines evicted by prefetched lines.
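The following is a minimal C sketch of our own of the vector-length-based prefetch heuristic described above (the real prefetcher is hardware inside VESPA's cache controller; alignment and the buffering of evicted dirty lines are omitted, and treating DPV as the vector-length multiplier is our assumption): on a vector-load miss it requests the missing line plus enough subsequent lines to cover DPV times the current vector length in elements.

    #include <stdio.h>
    #include <stdint.h>

    #define LINE_BYTES 64  /* data cache line size DW in the fastest configuration */

    /* Stub standing in for a cache-line request to the memory system. */
    static void fetch_line(uint32_t line_number)
    {
        printf("fetch line %u\n", (unsigned)line_number);
    }

    /* On a miss caused by a low-stride vector load, fetch the missing line
     * plus enough following lines to cover dpv * vl elements, so the rest of
     * the vector instruction (and typically the next one) hits in the cache. */
    static void on_vector_miss(uint32_t miss_addr, int vl, int elem_bytes, int dpv)
    {
        uint32_t line = miss_addr / LINE_BYTES;
        uint32_t bytes_ahead = (uint32_t)(dpv * vl * elem_bytes);
        uint32_t lines_ahead = (bytes_ahead + LINE_BYTES - 1) / LINE_BYTES;  /* ceil */

        fetch_line(line);                          /* the demand miss itself */
        for (uint32_t i = 1; i <= lines_ahead; i++)
            fetch_line(line + i);                  /* sequential prefetches */
    }

    int main(void)
    {
        on_vector_miss(0x1000, 64, 2, 8);  /* 64 16-bit elements, multiplier 8 */
        return 0;
    }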
6. REDUCING THE AREA GAP

In hardware, we implement only the functional units required by the application and match them to the bit-widths of the data operands. VESPA is equipped with parameters that allow it to perform similar application-specific customizations. The vector lane width W can be used to reduce the datapath for benchmarks which do not require 32-bit processing. For example, CONVEN requires only a 1-bit datapath (see Table 2), and its implementation in hardware gains a large area advantage over VESPA because of it. Using the W parameter we can reduce the lane width to 1 bit and cut VESPA's area by half; vector state, control logic, the 32-bit address space, and the scalar processor limit further reduction. Note that our previous work [1] limited the lane width to multiples of 8. VESPA also supports the individual disabling of each vector instruction, which automatically eliminates hardware support for that instruction. This feature allows us to subset the instruction set to that used by the application, as shown in Table 2. Figure 5 shows the effect of instruction set subsetting, as well as the combined effect of subsetting and width reduction, on the set of Pareto-optimal points in our VESPA design space. We see that, compared to the full VESPA processor, the area is significantly reduced, in the best case by 45%, and some performance is even gained from the higher clock speeds, which reach as high as 153 MHz on the smaller customized VESPA processors. The points move closer to the origin as VESPA sheds general-purpose overheads and begins to resemble a dedicated hardware part. It is interesting to note that after trimming this area, the 16-lane VESPA with full memory crossbar, prefetching, and 64B line size has the smallest area-delay product, which is 561x worse than hardware: a substantial improvement over the 892x of the full-size 8-lane VESPA discussed earlier, and 5.15x better than the scalar soft processor.

[Fig. 5. Effect of instruction set subsetting and width reduction on the area and speed gap of VESPA processors versus hardware; axes: hardware speed advantage versus hardware area advantage, series: full, subsetted, and subsetted + width-reduced configurations.]

7. CONCLUSIONS

Our comparisons have demonstrated that C code executing on a scalar soft processor performs on average 432x slower and is 6.7x larger in area than custom FPGA hardware. The VESPA soft vector processor now provides a large design space of vector processors that, relative to hardware, ranges from 192x slower and 13x larger to 17x slower and 64x larger. This large space allows a designer to choose the area/performance of a system component without laborious hardware design, and can drastically reduce the 432x scalar soft processor performance gap to 17x for data-parallel workloads. In addition, VESPA is shown to have a 3x better area-delay product than our scalar soft processor. Finally, by eliminating hardware in VESPA that is not used by the application, we can reduce the area of VESPA by up to 45%, resulting in an area-delay product 5.15x better than that of a scalar soft processor. In summary, the quantified gap and improved soft vector processor can significantly reduce the need for embedded designers to resort to more challenging manual hardware design.

8. REFERENCES

[1] P. Yiannacouras, J. G. Steffan, and J. Rose, "VESPA: Portable, scalable, and flexible FPGA-based vector processors," in CASES '08: International Conference on Compilers, Architecture and Synthesis for Embedded Systems. ACM, 2008.
[2] J. Yu, G. Lemieux, and C. Eagleston, "Vector processing as a soft-core CPU accelerator," in Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM, 2008.
[3] P. Yiannacouras, J. G. Steffan, and J. Rose, "Application-specific customization of soft processor microarchitecture," in FPGA '06: Proceedings of the International Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM Press, 2006.
[4] R. Dimond, O. Mencer, and W. Luk, "CUSTARD - a customisable threaded FPGA soft processor and tools," in International Conference on Field Programmable Logic (FPL), August 2005.
[5] D. Lau, O. Pritchard, and P. Molson (Altera Santa Cruz), "Automated generation of hardware accelerators with direct memory access from ANSI/ISO standard C functions," in Field-Programmable Custom Computing Machines, 2006.
[6] The Embedded Microprocessor Benchmark Consortium, "EEMBC."
[7] W. Hardt and R. Camposano, "Trade-offs in HW/SW codesign," in Workshop on Hardware/Software Codesign. ACM.
[8] Z. Guo, W. Najjar, F. Vahid, and K. Vissers, "A quantitative analysis of the speedup factors of FPGAs over processors," in Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM, 2004.
[9] C. Kozyrakis and D. Patterson, "Scalable vector processors for embedded systems," IEEE Micro, vol. 23, no. 6, 2003.
[10] J. Fender, J. Rose, and D. R. Galloway, "The Transmogrifier-4: An FPGA-based hardware development system with multi-gigabyte memory capacity and high host and memory bandwidth," in IEEE International Conference on Field Programmable Technology, 2005.
[11] R. Cliff, Altera Corporation, private communication.
[12] P. Yiannacouras, J. G. Steffan, and J. Rose, "Improving memory systems for soft vector processors," in WoSPS '08: Workshop on Soft Processor Systems, 2008.


More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

Robust PIM-SM Multicasting using Anycast RP in Wireless Ad Hoc Networks

Robust PIM-SM Multicasting using Anycast RP in Wireless Ad Hoc Networks Robust PIM-SM Multicasting using Anycast RP in Wireless A Hoc Networks Jaewon Kang, John Sucec, Vikram Kaul, Sunil Samtani an Mariusz A. Fecko Applie Research, Telcoria Technologies One Telcoria Drive,

More information

Solution Representation for Job Shop Scheduling Problems in Ant Colony Optimisation

Solution Representation for Job Shop Scheduling Problems in Ant Colony Optimisation Solution Representation for Job Shop Scheuling Problems in Ant Colony Optimisation James Montgomery, Carole Faya 2, an Sana Petrovic 2 Faculty of Information & Communication Technologies, Swinburne University

More information

Professor Lee, Yong Surk. References 고성능마이크로프로세서구조의개요. Topics Microprocessor & microcontroller

Professor Lee, Yong Surk. References 고성능마이크로프로세서구조의개요. Topics Microprocessor & microcontroller 이강좌는 C & S Technology 사의지원으로제작되었으며 copyright가없으므로비영리적인목적에한하여누구든지복사, 배포가가능합니다. 연구실홈페이지에는고성능마이크로프로세서에관련된많은강좌가있으며누구나무료로다운로드받을수있습니다. Professor Lee, Yong Surk 1973 : B.S., Electrical Eng., Yonsei niv. 1981

More information

Fast Fractal Image Compression using PSO Based Optimization Techniques

Fast Fractal Image Compression using PSO Based Optimization Techniques Fast Fractal Compression using PSO Base Optimization Techniques A.Krishnamoorthy Visiting faculty Department Of ECE University College of Engineering panruti rishpci89@gmail.com S.Buvaneswari Visiting

More information

Adaptive Load Balancing based on IP Fast Reroute to Avoid Congestion Hot-spots

Adaptive Load Balancing based on IP Fast Reroute to Avoid Congestion Hot-spots Aaptive Loa Balancing base on IP Fast Reroute to Avoi Congestion Hot-spots Masaki Hara an Takuya Yoshihiro Faculty of Systems Engineering, Wakayama University 930 Sakaeani, Wakayama, 640-8510, Japan Email:

More information

Design Principles for Practical Self-Routing Nonblocking Switching Networks with O(N log N) Bit-Complexity

Design Principles for Practical Self-Routing Nonblocking Switching Networks with O(N log N) Bit-Complexity IEEE TRANSACTIONS ON COMPUTERS, VOL. 46, NO. 10, OCTOBER 1997 1 Design Principles for Practical Self-Routing Nonblocking Switching Networks with O(N log N) Bit-Complexity Te H. Szymanski, Member, IEEE

More information

Using Vector and Raster-Based Techniques in Categorical Map Generalization

Using Vector and Raster-Based Techniques in Categorical Map Generalization Thir ICA Workshop on Progress in Automate Map Generalization, Ottawa, 12-14 August 1999 1 Using Vector an Raster-Base Techniques in Categorical Map Generalization Beat Peter an Robert Weibel Department

More information

A Parameterized Automatic Cache Generator for FPGAs

A Parameterized Automatic Cache Generator for FPGAs A Parameterized Automatic Cache Generator for FPGAs Peter Yiannacouras and Jonathan Rose Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto Toronto, Ontario,

More information

Architecture Design of Mobile Access Coordinated Wireless Sensor Networks

Architecture Design of Mobile Access Coordinated Wireless Sensor Networks Architecture Design of Mobile Access Coorinate Wireless Sensor Networks Mai Abelhakim 1 Leonar E. Lightfoot Jian Ren 1 Tongtong Li 1 1 Department of Electrical & Computer Engineering, Michigan State University,

More information

Coordinating Distributed Algorithms for Feature Extraction Offloading in Multi-Camera Visual Sensor Networks

Coordinating Distributed Algorithms for Feature Extraction Offloading in Multi-Camera Visual Sensor Networks Coorinating Distribute Algorithms for Feature Extraction Offloaing in Multi-Camera Visual Sensor Networks Emil Eriksson, György Dán, Viktoria Foor School of Electrical Engineering, KTH Royal Institute

More information

On the Placement of Internet Taps in Wireless Neighborhood Networks

On the Placement of Internet Taps in Wireless Neighborhood Networks 1 On the Placement of Internet Taps in Wireless Neighborhoo Networks Lili Qiu, Ranveer Chanra, Kamal Jain, Mohamma Mahian Abstract Recently there has emerge a novel application of wireless technology that

More information

Questions? Post on piazza, or Radhika (radhika at eecs.berkeley) or Sameer (sa at berkeley)!

Questions? Post on piazza, or  Radhika (radhika at eecs.berkeley) or Sameer (sa at berkeley)! EE122 Fall 2013 HW3 Instructions Recor your answers in a file calle hw3.pf. Make sure to write your name an SID at the top of your assignment. For each problem, clearly inicate your final answer, bol an

More information

Backpressure-based Packet-by-Packet Adaptive Routing in Communication Networks

Backpressure-based Packet-by-Packet Adaptive Routing in Communication Networks 1 Backpressure-base Packet-by-Packet Aaptive Routing in Communication Networks Eleftheria Athanasopoulou, Loc Bui, Tianxiong Ji, R. Srikant, an Alexaner Stolyar Abstract Backpressure-base aaptive routing

More information

Considering bounds for approximation of 2 M to 3 N

Considering bounds for approximation of 2 M to 3 N Consiering bouns for approximation of to (version. Abstract: Estimating bouns of best approximations of to is iscusse. In the first part I evelop a powerseries, which shoul give practicable limits for

More information

Skyline Community Search in Multi-valued Networks

Skyline Community Search in Multi-valued Networks Syline Community Search in Multi-value Networs Rong-Hua Li Beijing Institute of Technology Beijing, China lironghuascut@gmail.com Jeffrey Xu Yu Chinese University of Hong Kong Hong Kong, China yu@se.cuh.eu.h

More information

Impact of cache interferences on usual numerical dense loop. nests. O. Temam C. Fricker W. Jalby. University of Leiden INRIA University of Versailles

Impact of cache interferences on usual numerical dense loop. nests. O. Temam C. Fricker W. Jalby. University of Leiden INRIA University of Versailles Impact of cache interferences on usual numerical ense loop nests O. Temam C. Fricker W. Jalby University of Leien INRIA University of Versailles Niels Bohrweg 1 Domaine e Voluceau MASI 2333 CA Leien 78153

More information

Application-Specific Customization of Soft Processor Microarchitecture

Application-Specific Customization of Soft Processor Microarchitecture Application-Specific Customization of Soft Processor Microarchitecture Peter Yiannacouras, J. Gregory Steffan, and Jonathan Rose The Edward S. Rogers Sr. Department of Electrical and Computer Engineering

More information

Finite Automata Implementations Considering CPU Cache J. Holub

Finite Automata Implementations Considering CPU Cache J. Holub Finite Automata Implementations Consiering CPU Cache J. Holub The finite automata are mathematical moels for finite state systems. More general finite automaton is the noneterministic finite automaton

More information

Exploring Context with Deep Structured models for Semantic Segmentation

Exploring Context with Deep Structured models for Semantic Segmentation 1 Exploring Context with Deep Structure moels for Semantic Segmentation Guosheng Lin, Chunhua Shen, Anton van en Hengel, Ian Rei between an image patch an a large backgroun image region. Explicitly moeling

More information

A Plane Tracker for AEC-automation Applications

A Plane Tracker for AEC-automation Applications A Plane Tracker for AEC-automation Applications Chen Feng *, an Vineet R. Kamat Department of Civil an Environmental Engineering, University of Michigan, Ann Arbor, USA * Corresponing author (cforrest@umich.eu)

More information

Frequency Domain Parameter Estimation of a Synchronous Generator Using Bi-objective Genetic Algorithms

Frequency Domain Parameter Estimation of a Synchronous Generator Using Bi-objective Genetic Algorithms Proceeings of the 7th WSEAS International Conference on Simulation, Moelling an Optimization, Beijing, China, September 15-17, 2007 429 Frequenc Domain Parameter Estimation of a Snchronous Generator Using

More information

Top-down Connectivity Policy Framework for Mobile Peer-to-Peer Applications

Top-down Connectivity Policy Framework for Mobile Peer-to-Peer Applications Top-own Connectivity Policy Framework for Mobile Peer-to-Peer Applications Otso Kassinen Mika Ylianttila Junzhao Sun Jussi Ala-Kurikka MeiaTeam Department of Electrical an Information Engineering University

More information

Research Article REALFLOW: Reliable Real-Time Flooding-Based Routing Protocol for Industrial Wireless Sensor Networks

Research Article REALFLOW: Reliable Real-Time Flooding-Based Routing Protocol for Industrial Wireless Sensor Networks Hinawi Publishing Corporation International Journal of Distribute Sensor Networks Volume 2014, Article ID 936379, 17 pages http://x.oi.org/10.1155/2014/936379 Research Article REALFLOW: Reliable Real-Time

More information

Shift-map Image Registration

Shift-map Image Registration Shift-map Image Registration Svärm, Linus; Stranmark, Petter Unpublishe: 2010-01-01 Link to publication Citation for publishe version (APA): Svärm, L., & Stranmark, P. (2010). Shift-map Image Registration.

More information

Higher Level Programming Abstractions for FPGAs using OpenCL

Higher Level Programming Abstractions for FPGAs using OpenCL Higher Level Programming Abstractions for FPGAs using OpenCL Desh Singh Supervising Principal Engineer Altera Corporation Toronto Technology Center ! Technology scaling favors programmability CPUs."#/0$*12'$-*

More information

Algorithm for Intermodal Optimal Multidestination Tour with Dynamic Travel Times

Algorithm for Intermodal Optimal Multidestination Tour with Dynamic Travel Times Algorithm for Intermoal Optimal Multiestination Tour with Dynamic Travel Times Neema Nassir, Alireza Khani, Mark Hickman, an Hyunsoo Noh This paper presents an efficient algorithm that fins the intermoal

More information

Optimizing the quality of scalable video streams on P2P Networks

Optimizing the quality of scalable video streams on P2P Networks Optimizing the quality of scalable vieo streams on PP Networks Paper #7 ASTRACT The volume of multimeia ata, incluing vieo, serve through Peer-to-Peer (PP) networks is growing rapily Unfortunately, high

More information

An In Depth Look at VOLK

An In Depth Look at VOLK An In Depth Look at VOLK The Vector-Optimize Library of Kernels Nathan West U.S. Naval Research Laboratory 26 August 2015 (U) 26 August 2015 1 / 19 A brief look at VOLK organization VOLK is a sub-project

More information

An Adaptive Routing Algorithm for Communication Networks using Back Pressure Technique

An Adaptive Routing Algorithm for Communication Networks using Back Pressure Technique International OPEN ACCESS Journal Of Moern Engineering Research (IJMER) An Aaptive Routing Algorithm for Communication Networks using Back Pressure Technique Khasimpeera Mohamme 1, K. Kalpana 2 1 M. Tech

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Design Space Exploration Using Parameterized Cores

Design Space Exploration Using Parameterized Cores RESEARCH CENTRE FOR INTEGRATED MICROSYSTEMS UNIVERSITY OF WINDSOR Design Space Exploration Using Parameterized Cores Ian D. L. Anderson M.A.Sc. Candidate March 31, 2006 Supervisor: Dr. M. Khalid 1 OUTLINE

More information

Lab work #8. Congestion control

Lab work #8. Congestion control TEORÍA DE REDES DE TELECOMUNICACIONES Grao en Ingeniería Telemática Grao en Ingeniería en Sistemas e Telecomunicación Curso 2015-2016 Lab work #8. Congestion control (1 session) Author: Pablo Pavón Mariño

More information

A Highly Scalable Parallel Boundary Element Method for Capacitance Extraction

A Highly Scalable Parallel Boundary Element Method for Capacitance Extraction A Highly Scalable Parallel Bounary Element Metho for Capacitance Extraction The MIT Faculty has mae this article openly available. Please share how this access benefits you. Your story matters. Citation

More information

Overlap Interval Partition Join

Overlap Interval Partition Join Overlap Interval Partition Join Anton Dignös Department of Computer Science University of Zürich, Switzerlan aignoes@ifi.uzh.ch Michael H. Böhlen Department of Computer Science University of Zürich, Switzerlan

More information

Particle Swarm Optimization Based on Smoothing Approach for Solving a Class of Bi-Level Multiobjective Programming Problem

Particle Swarm Optimization Based on Smoothing Approach for Solving a Class of Bi-Level Multiobjective Programming Problem BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 17, No 3 Sofia 017 Print ISSN: 1311-970; Online ISSN: 1314-4081 DOI: 10.1515/cait-017-0030 Particle Swarm Optimization Base

More information

On Effectively Determining the Downlink-to-uplink Sub-frame Width Ratio for Mobile WiMAX Networks Using Spline Extrapolation

On Effectively Determining the Downlink-to-uplink Sub-frame Width Ratio for Mobile WiMAX Networks Using Spline Extrapolation On Effectively Determining the Downlink-to-uplink Sub-frame With Ratio for Mobile WiMAX Networks Using Spline Extrapolation Panagiotis Sarigianniis, Member, IEEE, Member Malamati Louta, Member, IEEE, Member

More information