An Optimal Microarchitecture for Stencil Computation Acceleration Based on Non-Uniform Partitioning of Data Reuse Buffers

An Optimal Microarchitecture for Stencil Computation Acceleration Based on Non-Uniform Partitioning of Data Reuse Buffers

Jason Cong*, Peng Li, Bingjun Xiao*, and Peng Zhang
Computer Science Dept. & Electrical Engineering Dept., University of California, Los Angeles
*{cong, xiao}@cs.ucla.edu

ABSTRACT

High-level synthesis (HLS) tools have made significant progress in compiling high-level descriptions of computation into highly pipelined register-transfer level (RTL) specifications. The high-throughput computation raises a high data demand. To prevent data accesses from becoming the bottleneck, on-chip memories are used as data reuse buffers to reduce off-chip accesses. Memory partitioning is also explored to increase memory bandwidth by scheduling multiple simultaneous memory accesses to different memory banks. Prior work on memory partitioning of data reuse buffers is limited to uniform partitioning. In this paper, we perform an early-stage exploration of non-uniform memory partitioning. We use stencil computation, a popular communication-intensive application domain, as a case study to show the potential benefits of non-uniform memory partitioning. Our novel method always achieves the minimum memory size and the minimum number of memory banks, which cannot be guaranteed by any prior work. We develop a generalized microarchitecture to decouple stencil accesses from computation, and an automated design flow to integrate our microarchitecture with the HLS-generated computation kernel for a complete accelerator.

1. INTRODUCTION

Accelerator-centric architectures can bring large energy-efficiency gains by offloading computation from general-purpose CPU cores to application-specific accelerators [1]. The engineering cost of designing massive heterogeneous accelerators is high, but it can be much reduced by raising the abstraction level beyond RTL to C via high-level synthesis (HLS) [2]. Though HLS tools are good at scheduling computation to the timing slots of clock cycles, they do not optimize data accesses. This weakness motivates recent work on data reuse [3, 4] and memory partitioning [5-9] in HLS.

External memory bandwidth is a dominant bottleneck for system performance and power consumption. Data reuse is an efficient technique of using on-chip memories to reduce external memory accesses. When an application contains a data array with multiple references, we can allocate a reuse buffer and keep each array element in the buffer from its first access until its last access. Then each array element needs to be fetched from external memory only once, and the off-chip traffic is reduced to the minimum. Loop transformation can be applied to improve data locality and reduce the size of the data reuse buffer [3, 4]. When the innermost loop of an application is fully pipelined, an accelerator needs to perform multiple load operations from the same reuse buffer every clock cycle. To avoid contention on memory ports, memory partitioning of the reuse buffer is required.
Since the transistor count of memory control logic is proportional to the number of memory banks after partitioning, the optimization goal of memory partitioning is to minimize the number of memory banks. The constraint is that the multiple array elements to be loaded every clock cycle are always stored in different memory banks. The work in [5, 6] provides solid frameworks for memory partitioning. Further optimizations, including memory access rescheduling [7] and multi-dimensional arrays [8], have also been proposed. But none of them can guarantee the optimal solution for a given case in terms of the number of banks. The reason is that their optimization space is limited to uniform memory partitioning, i.e., all the memory banks have to be of the same size. This is an unnecessary constraint inherited from commodity HLS tools, e.g., [10]. Other work partitions different fields of a single data structure into multiple memory banks for data parallelism based on profiling results [9]. It is orthogonal to the problems on multiple array references addressed in [5-8] and in this paper.

In this paper, we go beyond the limitation of uniform memory partitioning and propose a novel method based on non-uniform memory partitioning. We found that after removal of this limitation, we can achieve fewer memory banks than the optimal solutions in prior work [5-8]. As an early-stage exploration of non-uniform memory partitioning, in this paper we focus on stencil computation, a popular communication-intensive application domain. We develop a microarchitecture with novel memory-system structures that achieve the theoretical minimum number of memory banks for any stencil access pattern. Experimental results show that we can substantially reduce the usage of various resources, including BRAMs, logic slices, and DSPs, compared to prior work [8], along with slightly improved timing.

2. PRELIMINARY

2.1 Stencil Computation

Stencil computation comprises an important class of kernels in many application domains, such as image processing, constituent kernels in multigrid methods, and partial differential equation solvers. These kernels often contribute most of the workload in these applications. Even in the recent publications on memory partitioning [7, 8], which were developed for general applications, all the benchmarks used are in fact stencil computations. The data elements accessed in stencil computation lie on a large multi-dimensional grid which usually exceeds on-chip memory capacity.

The computation is iterated as a stencil window slides over the grid. In each iteration, the computation kernel accesses all the data points in the stencil window to calculate an output. Both the grid shape and the stencil window can be arbitrary, as specified by the given stencil application. A precise definition of stencil computation can be found in Appendix 9.1. Fig. 1 shows an example stencil computation in the kernel DENOISE in medical imaging [11].

Figure 1: Example C code of a typical stencil computation (5-point stencil window in the kernel DENOISE in medical imaging [11]).

Figure 2: Iteration domain of the example stencil computation in Fig. 1.

Its grid shape is a rectangle, and its stencil window contains 5 points, as shown in Fig. 2. Five data elements need to be accessed in each iteration. In addition, many data elements will be repeatedly accessed among these iterations. For example, A[2][2] will be accessed five times, when (i, j) ∈ {(1, 2), (2, 1), (2, 2), (2, 3), (3, 2)}. This leads to high on-chip memory port contention and off-chip traffic, especially when the stencil window is large (e.g., after loop fusion of stencil applications for computation reduction as proposed in [12]). Therefore, during the hardware development of a stencil application, a large portion of engineering effort is spent on data reuse and memory partitioning optimization.

2.2 Microarchitecture for Stencil Accesses

Figure 3: The overall architecture of our microarchitecture for stencil computation. It decouples stencil accesses from computation.

In this work we develop a microarchitecture to decouple the stencil access patterns from the computation, as shown in Fig. 3. The microarchitecture contains multiple memory systems, each optimized for one data array with stencil accesses. Since there are no reuse opportunities among different data arrays, the memory systems for different arrays are independent of each other. Each memory system receives a single data stream which iterates over a multi-dimensional grid without any repeated external access. Each memory system contains data reuse buffers, memory controllers, and interconnects that have been customized for the access patterns of the data array used in the target stencil computation. A memory system sends data to the computation kernel via one data port per array reference in the original user code. If all the data are consumed by the computation kernel, the memory system will immediately prepare the data used in the next cycle to feed into the fully pipelined computation kernel.

Figure 4: Example code of the computation kernel where all the memory accesses are offloaded to our microarchitecture. The keyword volatile in the code informs HLS tools of potential data change after each access. The pragma pipeline is used in Xilinx Vivado HLS [10] to pipeline the innermost loop.

With our microarchitecture, the C code of the computation kernel can be simplified to Fig. 4, where users no longer need to optimize memory accesses; this task is offloaded to our microarchitecture. Users can assume that each data access gets the same data from our microarchitecture as the original load operation would. The C code of the computation kernel can then be compiled by HLS tools into a fully pipelined hardware implementation with the most efficient resource usage.
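As a concrete illustration of the two code shapes in Fig. 1 and Fig. 4, here is a minimal C sketch. The 5-point window, the volatile keyword, and the pipeline pragma are given above, and the 768x1024 grid matches Fig. 2; the averaging expression, the 0.2f coefficient, and the function names are illustrative assumptions rather than the paper's exact listings.

    /* Sketch of Fig. 1: original kernel with five references to A.
     * The computation itself is a placeholder; only the 5-point
     * window A[i][j], A[i][j-1], A[i][j+1], A[i-1][j], A[i+1][j]
     * is given by the paper. */
    void denoise_orig(float A[768][1024], float B[768][1024]) {
        for (int i = 1; i < 767; i++)
            for (int j = 1; j < 1023; j++)
                B[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i][j+1]
                                  + A[i-1][j] + A[i+1][j]);
    }

    /* Sketch of Fig. 4: transformed kernel. Each array reference
     * becomes one data port fed by a data filter of our memory
     * system; volatile tells the HLS tool that the value changes
     * on every access, and the pragma requests a pipelined loop. */
    void denoise_kernel(volatile float *a_c, volatile float *a_w,
                        volatile float *a_e, volatile float *a_n,
                        volatile float *a_s, volatile float *b) {
        for (int i = 1; i < 767; i++)
            for (int j = 1; j < 1023; j++) {
    #pragma HLS pipeline II=1
                *b = 0.2f * (*a_c + *a_w + *a_e + *a_n + *a_s);
            }
    }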
2.3 Targeting Optimal Design

The design target of our microarchitecture has three aspects:

1. Full pipelining. As the stencil window slides every clock cycle, the microarchitecture must send out all the data in the stencil window to the computation kernel and get all the data ready for the consecutive accesses in the next cycle.

2. Minimum data reuse buffer size. When a data element is fully reused for each access, it stays in on-chip memories from its first access until its last access. Meanwhile, other elements in the array are loaded from external memory every clock cycle. Therefore, the theoretical minimum size of the reuse buffer for a data array is equal to the maximum lifetime of an element in the array. In the example in Fig. 2, A[2][2] is accessed by the array reference A[i+1][j] for the first time, and is accessed 2048 cycles later by the reference A[i-1][j] for the last time. Therefore, the minimum size of the data reuse buffer for array A will be 2048.

3. Minimum number of reuse buffer banks. Suppose the stencil window of an input array contains n points, i.e., there are n data references to the array. Then in each clock cycle, n data elements need to be read out, either from reuse buffers or from external memory. Suppose we use dual-port memories to implement the reuse buffers. One port of a buffer bank is occupied by the replacement of an expired data element with a new element from external memory every cycle. There is only one port left in each memory bank for us to read the n elements needed by the stencil window. Suppose one of the n elements happens to be the new element from external memory under a certain smart data reuse mechanism. There are still n-1 data elements to read, and we need at least n-1 memory banks. In the example of Fig. 2, n = 5 indicates that we need at least four memory banks; a worked instance of both bounds is given below.
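For the DENOISE example, both optimality targets can be checked by hand; a worked sketch follows. It assumes the row-major input stream of Fig. 2 with row width W = 1024; the offsets, the value 2048, and the bound of four banks are all stated above.

    % Minimum reuse buffer size: lifetime of an element between its
    % first access by A[i+1][j] (offset (1,0)) and its last access
    % by A[i-1][j] (offset (-1,0)), linearized row-major:
    \bar{r} = \bigl(1 - (-1)\bigr) \cdot W + (0 - 0) = 2 \cdot 1024 = 2048
    % Minimum number of banks: n = 5 references, with one element per
    % cycle servable directly from the incoming off-chip stream:
    \#\text{banks} \ge n - 1 = 5 - 1 = 4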

As the stencil window slides, the five data elements in the window should never be in the same bank. This is a tough constraint to satisfy, and prior work [5-8] had to use more banks to eliminate bank conflicts in difficult cases.

Figure 5: Even for the constant stencil window in Fig. 2, the number of banks varies as the row size of the data grid changes [5].

Fig. 5 shows that the number of banks in [5] ranges from five to eight as the row size of the data grid changes, even though the stencil window keeps the constant shape in Fig. 2. The technologies proposed in [7, 8] can keep the number of banks consistently at five in the case of the stencil window shown in Fig. 2.

Figure 6: Example stencil windows where more banks are needed than the number of array references in [7, 8]. (a) 4-point stencil in BICUBIC [13]. (b) 4-point stencil in RICIAN [14]. (c) 19-point stencil in SEGMENTATION_3D [8].

However, when the stencil window changes to some other shape in other applications, e.g., the ones shown in Fig. 6, the methods in [7, 8] will need 5, 5, and 20 banks respectively, which are larger than the minimum values. In this work we present a generalized microarchitecture that can simultaneously achieve these optimal design targets for any application that falls into the category of stencil computation.

3. METHODOLOGY

3.1 Overview

Figure 7: The example circuit structure of our memory system generated for array A in the stencil computation of Fig. 1.

The internal structure of a memory system in our microarchitecture is illustrated by the example in Fig. 7, which is generated for the stencil computation in Fig. 1. Suppose the stencil window contains n points (n = 5 in the example of Fig. 1). Our memory system will then contain n-1 data reuse FIFOs, as well as n data path splitters and n data filters connected together in the way shown in Fig. 7. The data reuse FIFOs provide the same storage as conventional data reuse buffers, and the data path splitters and filters work as memory controllers and data interconnects. In contrast to conventional cyclic memory partitioning [5, 7, 8], which uses uniform buffer sizes, the sizes of the reuse buffers in our design are non-uniform. They are customized to the shape of the stencil window in the target application; a software sketch of this chain is given below.
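To make the chain topology of Fig. 7 concrete, the following is a minimal cycle-level software model of one splitter stage, assuming simple bounded ring buffers; the type and function names are illustrative, not the paper's RTL. Data enters at splitter s0 and flows through FIFO_0 .. FIFO_{n-2}; each splitter hands every element both to the next FIFO and to its own data filter.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct {            /* bounded ring buffer = reuse FIFO */
        float *buf;
        size_t cap, head, tail, cnt;
    } Fifo;

    static bool fifo_peek(const Fifo *f, float *v) {
        if (f->cnt == 0) return false;
        *v = f->buf[f->head];
        return true;
    }
    static void fifo_pop(Fifo *f)  { f->head = (f->head + 1) % f->cap; f->cnt--; }
    static void fifo_push(Fifo *f, float v) {
        f->buf[f->tail] = v; f->tail = (f->tail + 1) % f->cap; f->cnt++;
    }

    /* One cycle of data path splitter s_x: it fires only when its
     * precedent FIFO has data, its successive FIFO (NULL for the
     * last splitter) has room, and filter_x can accept an element;
     * it then forwards the element downstream and to filter_x. */
    bool splitter_step(Fifo *up, Fifo *down, bool filter_ready,
                       float *to_filter) {
        float v;
        if (!fifo_peek(up, &v)) return false;             /* upstream empty  */
        if (down && down->cnt == down->cap) return false; /* downstream full */
        if (!filter_ready) return false;                  /* filter stalled  */
        fifo_pop(up);
        if (down) fifo_push(down, v);
        *to_filter = v;
        return true;
    }

For splitter s0, up models the external input stream; calling splitter_step once per module per clock cycle reproduces the autonomous, back-pressured behavior described in Section 3.3.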
3.2 Denotations

To better explain the working principle of our memory system, Table 1 lists the denotations used in the following sections. The precise definitions are given in Appendix 9.1.

Denotation      | Meaning                  | Example by Fig. 2
i               | loop iteration vector    | i = (1, 2)
A               | data array               | the array A with the five references
A_x             | array reference          | one of the five references, such as A[i+1][j]
h               | data access index        | h = (2, 2) being accessed by A_x
f_x             | data access offset       | f_x = h_x - i, constant for A_x
D_Ax            | data domain              | {(i, j) | 2 <= i <= 767, 1 <= j <= 1022}
D_A             | input data domain        | {(i, j) | 0 <= i <= 767, 0 <= j <= 1023}
r_{Ax->Ay}      | reuse distance vector    | (1, -1) from A[i+1][j] to A[i][j+1]
r̄_{Ax->Az}      | maximum reuse distance   | 2048 from A[i+1][j] to A[i-1][j]
≻_l             | lexicographic order      | (1, 0) ≻_l (0, 1) ≻_l (0, 0) ≻_l (-1, 0)

Table 1: Denotations used in the working principle of our memory system.

3.3 Working Principle

Since our microarchitecture is based on non-uniform memory partitioning, in contrast to the uniform partitioning in prior work, the memory-controlling mechanism cannot follow the modulo scheduling of data accesses among memory banks used in prior work [5, 7, 8]. Instead, our microarchitecture is a novel design based on data streaming. Each module in our design is autonomous and can work in a full pipeline as long as its upstream module produces a data element and its downstream module consumes an element every clock cycle. We discuss the working principle of our design in terms of functional correctness, deadlock freedom, minimum reuse buffer size, and minimum number of buffer banks.

3.3.1 Functional Correctness

Each data path splitter in Fig. 7 reads an existing data element from its precedent FIFO and sends the data element to the successive FIFO as well as to the data filter below it. Each data filter customizes the data stream that flows into the computation kernel to fit the access pattern of the associated array reference. A data filter for an array reference A_x receives the data stream which iterates over D_A and sends out the data which iterates over D_Ax. For example, filter 0 in Fig. 7 selects the data elements in the set D_A0 = {(i, j) | 2 <= i <= 767, 1 <= j <= 1022} out of the input data domain D_A in Table 1, discarding the first two rows of the 2D grid in Fig. 2. This guarantees the correctness of the data set sent to each data port of the computation kernel. Due to the property of stencil computation, the data elements accessed by an array reference are in the same lexicographic order as the loop iteration (see Property 1 in Appendix 9.1). Our microarchitecture based on data streaming enforces this order, as long as the input data stream is also in lexicographic order (i.e., data iterated from innermost loop to outermost loop). The lexicographic order of input data is usually realized without hardware overhead, since it fits well with burst accesses to external memory or inter-accelerator communication patterns (see the discussion in Appendix 9.3). By providing the correct data set and the correct data order to the array references, our design guarantees that as the stencil window slides, the data elements received by the computation kernel are always consistent with the array references.

3.3.2 Deadlock Freedom

Since the modules in our design are autonomous instead of being synchronized by a centralized controller, a critical design challenge is to ensure that these modules are free of deadlock. Deadlock is caused by any potential cycle in the dependency graph of all the modules. Due to the chaining structure shown in Fig. 7, any potential cycle must go through a pair of data filters.

Figure 8: The dependency graph of data filters filter_x and filter_y, where x < y, shown in the form of two dependency cycles.

For any pair of data filters filter_x and filter_y where x < y, we can draw all the dependency relations as shown in Fig. 8. There are four possible dependency relations, each marked with the condition under which the dependency happens: e1 (the FIFO between them is empty), e2 (the FIFO between them is full), e3 (data not consumed by A_x), and e4 (data not consumed by A_y). Obviously, the conditions of edges e1 and e2 are mutually exclusive, and the same applies to e3 and e4. Therefore, only the pair (e1, e3) and the pair (e2, e4) could form two potential dependency cycles, as shown in Fig. 8. We can prove that e1 will be mutually exclusive to e3, and e2 will be mutually exclusive to e4, as long as we satisfy the following conditions:

1. For the data access offsets f_x and f_y of any two array references,

    x < y  ==>  f_x ≻_l f_y    (1)

2. For the reuse FIFO between adjacent references A_x and A_y,

    buffer_size_{x->y} >= r̄_{Ax->Ay}    (2)

Both conditions have very clear physical meanings. The first condition means that a data element accessed by an early array reference can be reused by a late array reference, but not the reverse. The second condition means that if an array reference reuses data elements accessed by another array reference, the reuse buffer size should be greater than or equal to the maximum reuse distance between these two references. The detailed proof of these two deadlock-freedom conditions can be found in Appendix 9.2.

To ensure the first condition, we sort the offsets f of the data array references in descending lexicographic order when we map them to data filters 0 to n-1, e.g., (1,0) ≻_l (0,1) ≻_l (0,0) ≻_l (0,-1) ≻_l (-1,0) in Fig. 7. To ensure the second condition, we calculate the maximum reuse distances of all pairs of adjacent array references and allocate the reuse FIFO sizes accordingly, as shown in Table 2 for the example in Fig. 1; a small computation reproducing these sizes is given after the table.

FIFO ID | precedent/successive references | FIFO size | physical impl.
FIFO 0  | A[i+1][j] -> A[i][j+1]          | 1023      | BRAM
FIFO 1  | A[i][j+1] -> A[i][j]            | 1         | register
FIFO 2  | A[i][j]   -> A[i][j-1]          | 1         | register
FIFO 3  | A[i][j-1] -> A[i-1][j]          | 1023      | BRAM

Table 2: Reuse FIFOs with non-uniform sizes calculated from the maximum reuse distances of adjacent array references, and mapped to different physical implementations (block memory, distributed memory, or register) when targeting an FPGA platform.
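The FIFO sizes in Table 2 follow mechanically from the sorted offsets: by Property 2 (Appendix 9.1) the reuse distance between adjacent references is the offset difference, and on a row-major grid of row width W its element count linearizes as below. A small self-checking sketch, assuming the 1024-wide grid of Fig. 2; the function name fifo_size is illustrative.

    #include <stdio.h>

    #define W 1024  /* row width of the data grid in Fig. 2 */

    /* Maximum reuse distance between two adjacent references with
     * offsets (r0,c0) >_l (r1,c1): element count of the difference
     * vector linearized on a row-major grid. */
    static int fifo_size(int r0, int c0, int r1, int c1) {
        return (r0 - r1) * W + (c0 - c1);
    }

    int main(void) {
        /* offsets of the five DENOISE references in descending
         * lexicographic order: A[i+1][j], A[i][j+1], A[i][j],
         * A[i][j-1], A[i-1][j] */
        int f[5][2] = {{1, 0}, {0, 1}, {0, 0}, {0, -1}, {-1, 0}};
        for (int x = 0; x < 4; x++)  /* FIFO x joins refs x and x+1 */
            printf("FIFO %d size = %d\n", x,
                   fifo_size(f[x][0], f[x][1], f[x+1][0], f[x+1][1]));
        return 0;  /* prints 1023, 1, 1, 1023, matching Table 2 */
    }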
3.3.3 Design Optimality

Minimum Reuse Buffer Size. Due to the linearity of maximum reuse distances, the sum of the sizes of all the reuse FIFOs is equal to the maximum reuse distance between array references A_0 and A_{n-1}. Due to the sorting of array references by f in descending lexicographic order, A_0 is the earliest reference and A_{n-1} is the latest reference. Therefore, the total reuse buffer size is equal to the maximum reuse distance between the earliest and latest references, which is the theoretical minimum. As shown in Table 2 for the example in Fig. 1, the total size is 2048, the same as the minimum value discussed in Section 2.3. If the maximum reuse distance is so large that the buffer sizes exceed the on-chip memory capacity, our microarchitecture allows trading off off-chip bandwidth occupation for smaller memory usage (see the discussion in Appendix 9.4).

Minimum Number of Buffer Banks. The structure of our design guarantees that for n array references, there are n-1 buffer banks (reuse FIFOs), as described in Section 3.1. This is the theoretical minimum value, as discussed in Section 2.3. Since our design achieves both the minimum reuse buffer size and the minimum number of buffer banks, our microarchitecture is optimal.

3.4 Insights Gained From RTL Simulation

Our microarchitecture is quite different from conventional designs with centralized controllers [5-8]. The major tasks of a conventional controller include two aspects:

1. Filling up reuse buffers. Before the computation starts, the controller first fills the reuse buffers with the data elements needed by the computation kernel.

2. Evicting expired data from reuse buffers. The challenging part of this task arises when the reuse distance between array references changes as the execution advances. This often happens on a skewed data grid. In this case, the number of data elements stored in the reuse buffers changes over time, and the symmetry between reads and writes is broken.

There is no specific module that takes charge of these key tasks in our microarchitecture. Instead, by observing the execution of our design in RTL simulation, we found that these key tasks are done automatically by the coordination of our distributed modules.

3.4.1 Automatic Filling of Reuse Buffers

The filling process of the reuse buffers in our microarchitecture is shown in Table 3. The data filter 4, associated with A[i-1][j], is first stalled at cycle 1. This is when filter 4 tries to send data A[0][1] to the computation kernel while all the other data filters are discarding this data. As a result, the computation kernel will be waiting for data from the other data filters and will not consume data from filter 4. This stalling leads to the filling of data into FIFO 3 between A[i-1][j] and A[i][j-1]. The other four filters keep the data stream advancing, since all of them discard the first row in the data domain. About one row of iterations later, filter 3 tries to send data A[1][0] to the computation kernel but is stalled as well. Then FIFO 2 starts being filled up, and the data path splitter s3 stops sending data to FIFO 3. The following process is similar, until FIFO 1 and FIFO 0 are filled up consecutively, as shown in Table 3. Then at cycle 2049, filter 0 receives A[2][1] and sends the data to the computation kernel. All the data at the five inputs of the computation kernel become valid, and the kernel consumes all five data elements to produce the first output. From then on, all the previously stalled filters can continue to send new data to the computation kernel every clock cycle until the end of the iteration domain.

Table 3: The execution flow of our microarchitecture in the example of Fig. 1, listing per clock cycle the element in the input data stream, the status of filters 0-4 (f = forwarding, d = discarding, s = stalled), and the number of data elements in FIFOs 0-3. The latency among the data streams at different modules is ignored here for demonstration purposes only.

3.4.2 Automatic Adjustment of Reuse Data Amount

An application can have a skewed data grid, as shown in Fig. 9. This kind of application usually arises when a rectangular grid is iterated along the 45° direction after a certain loop transformation [15]. The skewed grid leads to the challenge that the number of data elements stored in each reuse buffer changes as the iteration goes on, which often requires a complex memory-controlling scheme in a centralized design [5, 7, 8]. However, this difficulty can be automatically handled by our distributed modules.

Figure 9: Non-rectangular iteration domain of an example application with dynamically changing reuse distance.
Following the design schematic in Section 3.1, we order the array references and map them to the five filters 0-4. Among them, filter 2 is for the reference A[i][j] and filter 3 is for the reference A[i-1][j+1]. Note that h_2 of filter 2 is ahead of h_3 of filter 3 by one row when h_2 and h_3 are synchronized by the computation kernel, as shown in Fig. 9. At each turnaround to the next row in the data domain, filter 3 fetches one more data element from its splitter than filter 2 does, since filter 3 is iterating over a longer row. Then the number of data elements stored in FIFO 2 between filter 2 and filter 3 is reduced by one. This achieves the dynamic adaptation of the number of data elements stored in a reuse buffer to the change of reuse distance in the case of a skewed data grid.

3.5 Miscellaneous Design Issues

3.5.1 Heterogeneous Mapping of Reuse Buffers

Reuse buffers in our design have different sizes, as shown in Table 2, and may prefer different physical implementations. For example, if the target platform is an FPGA, the physical implementation candidates include block memory, distributed memory, and slice registers. They are efficient for a large buffer, a medium buffer, and a small buffer, respectively. Table 2 shows the heterogeneous mapping of reuse buffers to different physical implementations.

3.5.2 Data Filter in Polyhedral Domains

Note that though the data domains are rectangles in the example of Fig. 1, they could be any polyhedra on a multi-dimensional grid. Comparison of loop bounds is not a universal solution to data filtering. To select D_Ax out of D_A, the data filter in our microarchitecture is implemented with a data switch controlled by two counters, an input counter and an output counter, as shown in Fig. 10. The input counter iterates over the input data stream D_A of array A, e.g., A[0..767][0..1023]. The output counter iterates over the data domain D_Ax of array reference A_x, e.g., A[2..767][1..1022]. The input counter proceeds when the filter receives an input data element. The output counter proceeds when its value is equal to the input counter; this is also the condition under which the data switch forwards the input data to the output. In contrast, when the output counter is not equal to the input counter, the data switch discards the input data (a software sketch follows).

Figure 10: Structure of the data filter, which can be applied to general polyhedral data domains.
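The counter-and-switch behavior of Fig. 10 can be modeled in C as below. This sketch hard-codes the two example domains as bound checks and ignores the corner exclusions of D_A for brevity; a general instance would derive the scanning order from the polyhedron P h <= b. The names Filter and filter_step are illustrative.

    #include <stdbool.h>

    typedef struct { int i, j; } Idx;   /* 2-D data access index */

    typedef struct {
        Idx in;   /* input counter: position in D_A,  init (0, 0) */
        Idx out;  /* output counter: next element of D_Ax, init (2, 1) */
    } Filter;

    /* lexicographic successors inside the two example domains */
    static Idx next_DA(Idx h)  { if (++h.j > 1023) { h.j = 0; ++h.i; } return h; }
    static Idx next_DAx(Idx h) { if (++h.j > 1022) { h.j = 1; ++h.i; } return h; }

    /* Called once per arriving stream element. Returns true when
     * the data switch forwards the element to the computation
     * kernel; otherwise the element is discarded. */
    bool filter_step(Filter *f, float data, float *to_kernel) {
        bool fwd = (f->out.i == f->in.i && f->out.j == f->in.j);
        if (fwd) { *to_kernel = data; f->out = next_DAx(f->out); }
        f->in = next_DA(f->in);   /* input counter always advances */
        return fwd;
    }

For filter 0 of Fig. 7, the output counter starts at (2, 1), so the first two rows of the grid are discarded exactly as described in Section 3.3.1.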

4. DESIGN AUTOMATION FLOW

Figure 11: Design automation flow of accelerator generation for stencil computation.

We develop a design automation flow to generate the complete accelerator for a given stencil application, as shown in Fig. 11. It starts from the original source code of a user application, e.g., the code in Fig. 1. In the left branch, we first apply polyhedral analysis to extract the polyhedra of the data arrays with stencil accesses. We calculate the data domain of each array reference and the reuse distance of each pair of adjacent array references. This information is used to instantiate the data filters and reuse FIFOs in our microarchitecture. The flow then generates a microarchitecture instance, e.g., the design in Fig. 7, with the memory systems optimized for the stencil accesses in the user application. In the right branch, we first apply source-to-source code transformation to extract the kernel code with pure computation, e.g., Fig. 4. Then high-level synthesis is applied to the transformed code for a fully pipelined hardware implementation of the computation kernel in RTL. Finally, we integrate the microarchitecture with the computation kernel to obtain a complete accelerator with full pipelining and data reuse, e.g., the design in Fig. 3.

5. EXPERIMENTS

5.1 Experiment Setup

Our polyhedral analysis in Fig. 11 is implemented with the LLVM-Polly framework [16]. The kernel transformation is performed by the open-source compiler infrastructure ROSE [17], and the high-level synthesis is performed by Xilinx Vivado HLS [10]. Although our methodology is applicable to both ASIC and FPGA designs, we choose FPGA as the target device in this work due to the availability of downstream behavioral synthesis and implementation tools. The Xilinx Virtex-7 FPGA XC7VX485T and the ISE 14.2 tool suite [18] are used in our experiments. The target clock frequency is set at 200 MHz. The benchmarks used in prior memory partitioning work [7, 8] make up a rich set of real-life stencil computation kernels. Among them, we select the more challenging benchmarks with non-rectangular stencil windows for our experiments. DENOISE (2D/3D), RICIAN (2D), and SEGMENTATION (3D) are from medical imaging [11]. BICUBIC (2D) is from the bicubic interpolation process [13]. SOBEL (2D) is from the Sobel edge detection algorithm [14]. We choose the more recent memory partitioning work [8] as our experiment baseline.

5.2 Results

Table 4: High-level partitioning results: for each benchmark (DENOISE, RICIAN, SOBEL, BICUBIC, DENOISE_3D, SEGMENTATION), the original and target pipeline II, and the number and total size of reuse buffer banks for [8] and for our method.

The comparison results of memory partitioning are shown in Table 4. We list the pipeline II of the original user code, which suffers memory port contention before memory partitioning; it is equal to the number of memory load operations on the data array. We also list the II that the computation kernel targets to achieve via memory partitioning. The number and total size of the reuse buffer banks are reported for both [8] and our method. The buffer size is in units of data elements. As shown in Table 4, our method reduces the number of partitioned banks on all six benchmarks. In addition, our method does not need the padding technique of [8], which increases the grid size in certain dimensions to relax the partitioning complexity. Our method also saves buffer size, especially when the padding introduces more overhead in a high-dimensional data grid, e.g., Fig. 6(c).

Table 5: Post-synthesis results: BRAM, slice, and DSP usage and critical path (ns) of [8] and of our method for each benchmark, with the relative comparison.

The post-synthesis results are listed in Table 5. Physical resource usage (block RAMs, logic slices, and DSPs) and timing information are extracted from the Xilinx ISE report.
As shown in Table 5, we use 66% fewer block RAMs than [8]. This stems from 1) the minimum number of buffer banks achieved, and 2) the heterogeneous mapping of buffer banks to various resources in addition to block RAMs, as demonstrated in Table 2. We also use 25% fewer logic slices than [8], even though we implement some of the small reuse buffers in registers. That is because we avoid the modulo scheduling of conventional uniform memory partitioning, which generates a hardware transformer to map the original data address to the bank ID and local address via a complex calculation involving multiplication and division. Instead, our memory system only needs counters iterating over the data domains in lexicographic order. This advantage is also reflected in the complete elimination of DSPs in our method. The clock period does not show much difference between [8] and our method, since the back-end flow stops optimization as soon as it meets the 200 MHz target. However, our method generally has larger slack from the 5.0 ns target, as shown in Table 5. This is mainly due to the distributed structure of our method. We tried to use the Xilinx XPower Analyzer for power estimation, but found that the FPGA power is dominated by static power and is almost invariant across custom circuits. If power gating were available in the FPGA, the FPGA power would be proportional to resource usage, which is covered by Table 5.

6. CONCLUSIONS AND FUTURE WORK

In this work, we propose non-uniform memory partitioning, which opens a new design space compared to the conventional cyclic partitioning framework. As a starting point, we use stencil computation as the initial design target and show a novel memory system that works with non-uniform sizes of reuse buffers. The memory system in the extended design space can achieve the optimal solution with the minimum reuse buffer size and the minimum number of buffer banks. We develop a design automation flow that generates a microarchitecture with our memory systems and integrates it with the computation kernel for a complete design. Experimental results

show that our method outperforms the recent memory partitioning work in terms of the utilization of various FPGA resources.

As this is the first work on non-uniform memory partitioning, our primary goal is to show the potential of this new approach. Though stencil computation is a popular application domain and attracts the attention of most memory partitioning work, it is still important to extend non-uniform memory partitioning to general cases. Our data streaming method may not be the only solution for utilizing non-uniform reuse buffers. A modified modulo scheduling extended from conventional uniform memory partitioning is also a good candidate. We believe that there are many opportunities for future research.

7. ACKNOWLEDGEMENTS

The authors would like to thank Yuxin Wang for sharing the synthesizable codes transformed by [8]. The authors would also like to thank Xilinx for equipment donation and financial contributions. This work is partially supported by the Center for Domain-Specific Computing (CDSC) and the NSF grant CCF. It is also supported by C-FAR, one of six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA.

8. REFERENCES

[1] Y.-T. Chen, J. Cong, M. A. Ghodrat, M. Huang, C. Liu, B. Xiao, and Y. Zou, "Accelerator-Rich CMPs: From Concept to Real Hardware," in International Conference on Computer Design, 2013.
[2] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang, "High-Level Synthesis for FPGAs: From Prototyping to Deployment," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 30, no. 4, pp. 473-491, Apr. 2011.
[3] J. Cong, P. Zhang, and Y. Zou, "Optimizing memory hierarchy allocation with loop transformations for high-level synthesis," in Design Automation Conference, 2012.
[4] L.-N. Pouchet, P. Zhang, P. Sadayappan, and J. Cong, "Polyhedral-based data reuse optimization for configurable computing," in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA'13), 2013, p. 29.
[5] J. Cong, W. Jiang, B. Liu, and Y. Zou, "Automatic memory partitioning and scheduling for throughput and power optimization," in International Conference on Computer-Aided Design, 2009.
[6] Y. Wang, P. Zhang, C. Xu, and J. Cong, "An integrated and automated memory optimization flow for FPGA behavioral synthesis," in Asia and South Pacific Design Automation Conference, Jan. 2012.
[7] P. Li, Y. Wang, P. Zhang, G. Luo, T. Wang, and J. Cong, "Memory partitioning and scheduling co-optimization in behavioral synthesis," in International Conference on Computer-Aided Design, 2012.
[8] Y. Wang, P. Li, P. Zhang, C. Zhang, and J. Cong, "Memory partitioning for multidimensional arrays in high-level synthesis," in Design Automation Conference, 2013, p. 1.
[9] Y. Ben-Asher and N. Rotem, "Automatic memory partitioning," in International Conference on Hardware/Software Codesign and System Synthesis, 2010.
[10] Xilinx, Vivado High-Level Synthesis. [Online].
[11] J. Cong, V. Sarkar, G. Reinman, and A. Bui, "Customizable Domain-Specific Computing," IEEE Design and Test of Computers, vol. 28, no. 2, pp. 6-15, Mar. 2011.
[12] A. A. Nacci, V. Rana, F. Bruschi, D. Sciuto, I. Beretta, and D. Atienza, "A high-level synthesis flow for the implementation of iterative stencil loop algorithms on FPGA devices," in Design Automation Conference, 2013, p. 1.
[13] Bicubic interpolation. [Online].
[14] S. Verdoolaege, H. Nikolov, and T. Stefanov, "pn: A Tool for Improved Derivation of Process Networks," EURASIP Journal on Embedded Systems, vol. 2007, pp. 1-13, 2007.
[15] W. Zuo, Y. Liang, P. Li, K. Rupnow, D. Chen, and J. Cong, "Improving high level synthesis optimization opportunity through polyhedral transformations," in International Symposium on FPGAs, 2013, p. 9.
[16] LLVM-Polly. [Online].
[17] ROSE compiler infrastructure. [Online].
[18] Xilinx, Virtex-7 FPGA data sheets. [Online].
[19] P. Feautrier, "Some efficient solutions to the affine scheduling problem. I. One-dimensional time," International Journal of Parallel Programming, vol. 21, no. 5, Oct. 1992.
[20] J. Ramanujam and P. Sadayappan, "Compile-time techniques for data distribution in distributed memory machines," IEEE Transactions on Parallel and Distributed Systems, vol. 2, no. 4, 1991.

9. SUPPLEMENTARY MATERIALS

9.1 Polyhedral Model

We support general stencil accesses where neither the data grid nor the stencil window needs to be rectangular. This is often the case when stencil computation is subjected to loop transformations based on the polyhedral model [3, 4, 15] before memory access optimization. Our work is also built on top of the polyhedral model for generality, as described below. Several properties of stencil computation are identified under the framework of the polyhedral model as well.

DEFINITION 1 (ITERATION DOMAIN [19]). The iteration vector of a multi-level loop nest over an m-dimensional grid is a vector of iteration variables, i = (i_0, i_1, ..., i_{m-1})^T, where i_0, ..., i_{m-1} are the iteration variables from the outermost to the innermost loop. The iteration domain D ⊆ Z^m is the set of iteration vectors of the loop nest, expressed by a set of linear inequalities D = { i | P i <= b }.

EXAMPLE 1. For the loop nest in Fig. 1, the iteration domain is the rectangle

    D = { (i_0, i_1) ∈ Z^2 | 1 <= i_0 <= 766, 1 <= i_1 <= 1022 }.

This definition of the iteration domain can describe any polyhedral shape on a multi-dimensional grid, e.g., rectangular, triangular, or diamond.

DEFINITION 2 (LEXICOGRAPHIC ORDER [19]). The lexicographic order relation ≻_l of two iteration vectors i and j is defined as:

    i ≻_l j  <=>  (i_0 > j_0) ∨ (i_0 = j_0 ∧ i_1 > j_1) ∨ (i_0 = j_0 ∧ i_1 = j_1 ∧ i_2 > j_2) ∨ ... ∨ (i_0 = j_0 ∧ ... ∧ i_{m-2} = j_{m-2} ∧ i_{m-1} > j_{m-1})
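Definition 2 translates directly into a short comparison routine; a minimal C sketch (the name lex_gt and the plain-array vector encoding are illustrative):

    #include <stdbool.h>

    /* true iff i >_l j per Definition 2: the first differing
     * coordinate, scanned from the outermost loop inward,
     * decides the order. */
    bool lex_gt(const int *i, const int *j, int m) {
        for (int k = 0; k < m; k++) {
            if (i[k] > j[k]) return true;
            if (i[k] < j[k]) return false;
        }
        return false;  /* equal vectors are not strictly greater */
    }

For example, lex_gt((int[]){1, 0}, (int[]){0, 1}, 2) is true, matching (1, 0) ≻_l (0, 1) in Table 1.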

DEFINITION 3 (ACCESS FUNCTION [19]). For a k-dimensional array reference, its access function H: Z^m -> Z^k is the mapping from the iteration vector i to the access index h:

    H = { i -> h | h = H_{k x m} i + f }

EXAMPLE 2. The access function of the array reference A[i][j+1] in Fig. 1 is

    h = [1 0; 0 1] i + (0, 1)^T.

DEFINITION 4 (STENCIL COMPUTATION). Under the framework of the polyhedral model, stencil computation is defined as computation where the access function of each array reference satisfies

    k = m,  H_{k x m} = I_{m x m},  f constant,

where I is the identity matrix. It means that for any reference A_x of a data array A,

    h_x = i + f_x.    (3)

DEFINITION 5 (DATA DOMAIN). For an array reference A_x, the data domain D_Ax is the set of data elements accessed by A_x, expressed by a set of linear inequalities

    D_Ax = { h | P (h - f_x) <= b }

EXAMPLE 3. The data domain of A[i][j+1] is { (i, j) | 1 <= i <= 766, 2 <= j <= 1023 }.

Following Eq. (3), we have the following property for stencil computation.

PROPERTY 1 (LEXICOGRAPHIC ACCESS PATTERN). Data elements in D_Ax are accessed by the array reference A_x in lexicographic order.

DEFINITION 6 (INPUT DATA DOMAIN). For a data array A with n references A_0, A_1, ..., A_{n-1}, the input data domain of A is the union of the sets of data elements that need to be fetched from external memory, defined as D_A = D_A0 ∪ D_A1 ∪ ... ∪ D_A{n-1}.

EXAMPLE 4. The input data domain of array A is { (i, j) | 0 <= i <= 767, 0 <= j <= 1023 } minus the corners {(0, 0), (0, 1023), (767, 0), (767, 1023)}. In this paper, we use A[0..767][0..1023] to represent this input data domain.

DEFINITION 7 (REUSE DISTANCE VECTOR [19]). Suppose A_x and A_y are two read references of the same data array. If the data element accessed by A_x in loop iteration i is accessed again by A_y in iteration j, the reuse distance vector from A_x to A_y is r_{Ax->Ay} = j - i.

By Definition 7, we have H_x(i) = H_y(j). By Definition 4, we have H(i) = i + f. Combining these two, we have the following property for stencil computation.

PROPERTY 2. The reuse distance vector r_{Ax->Ay} = f_x - f_y is a constant vector.

EXAMPLE 5. The reuse distance vector from A[i+1][j] to A[i-1][j] in Fig. 1 is (2, 0)^T.

DEFINITION 8 (REUSE DISTANCE). The reuse distance r_{Ax->Ay}(h) from array reference A_x to A_y is the number of data elements between h, accessed by A_x, and h + r, accessed by A_y, defined as

    r_{Ax->Ay}(h) = | { g ∈ D_A | h ≺_l g ≼_l h + r_{Ax->Ay} } |

For example, the reuse distance from A[i+1][j] to A[i-1][j] in Fig. 1 is a constant value.

DEFINITION 9 (MAXIMUM REUSE DISTANCE). The maximum reuse distance from array reference A_x to A_y is defined as

    r̄_{Ax->Ay} = max_{h ∈ D_Ax} r_{Ax->Ay}(h)

EXAMPLE 6. The maximum reuse distance from A[i+1][j] to A[i-1][j] in Fig. 1 is 2048.

Suppose three data array references A_x, A_y and A_z satisfy

    r_{Ax->Ay} ≽_l 0,  r_{Ay->Az} ≽_l 0

Following Definition 8, we have

    r_{Ax->Az}(h) = r_{Ax->Ay}(h) + r_{Ay->Az}(h + r_{Ax->Ay})

Following Definition 5 and Property 2, we have

    h ∈ D_Ax  ==>  h + r_{Ax->Ay} ∈ D_Ay

Combining these two with Definition 9, we have the following property for stencil computation.

PROPERTY 3 (LINEARITY OF MAXIMUM REUSE DISTANCE).

    r̄_{Ax->Az} = r̄_{Ax->Ay} + r̄_{Ay->Az}

9.2 Proof of Deadlock Freedom

Figure 12: The dependency graph of data filters filter_x and filter_y, where x < y, with polyhedral expressions of the dependency relations appended.

We extend the dependency graph in Fig. 8 to Fig. 12 by appending the polyhedral expressions of the dependency relations. Suppose at a certain moment, filter_x is iterating over h_x ∈ D_A, as shown in Fig. 12a, and filter_y is iterating over h_y ∈ D_A, as shown in Fig. 12b. The goal is to prove that the two conditions listed in Section 3.3.2 are sufficient to eliminate the dependency cycles in Fig. 12. First we explain the expressions of the four dependency relations. A data filter stalls if a precedent reuse FIFO is empty or a successive reuse FIFO is full. When the condition of e1 is true, there is no data stored in the reuse FIFO(s) between filter_x and filter_y, and filter_y will stall.
It will wait for filter_x to pull more data from the data path splitter s_x, so that s_x can push more data to the downstream modules. When the condition of e2 is true, the number of data elements stored in the reuse FIFO(s) between filter_x and filter_y (i.e., the number of data elements between h_x and h_y) exceeds the sum of the sizes of the reuse FIFOs between filter_x and filter_y, and filter_x will stall. It will wait for filter_y to pull more data from s_y, so that s_y can pull more data from the upstream.

A data filter will also stall if the data element it sends to the computation kernel is not consumed. This situation happens when the kernel is waiting for data elements from other data filters. When the condition of e3 is true, filter_x will stall. It has sent the data element needed by the computation kernel in iteration i_x = h_x - f_x, but the computation kernel is waiting for filter_y to send the data element needed in an iteration i_y = h_y - f_y, which is earlier than i_x. Similarly, when the condition of e4 is true, filter_y will wait for filter_x to send data.

When the first condition, x < y ==> f_x ≻_l f_y, as listed in Section 3.3.2, is true, we have both

    e1 is true  ==>  h_x ≼_l h_y  ==>  h_x - f_x ≺_l h_y - f_y  ==>  e3 is false

and

    e3 is true  ==>  h_y - f_y ≺_l h_x - f_x  ==>  h_y ≺_l h_x  ==>  e1 is false.

Therefore e1 and e3 in Fig. 12a are mutually exclusive. When the condition in Eq. (2) of Section 3.3.2 is true, due to the linearity property of maximum reuse distances in stencil computation (Property 3 in Appendix 9.1), the condition applies to any pair of array references. Then we can apply a scheme similar to that of the first condition to the second condition, and prove that e2 and e4 in Fig. 12b are mutually exclusive. Then none of the cycles in the dependency graph of data filters filter_x and filter_y can be simultaneously satisfied, and our distributed system is deadlock-free.

9.3 System Integration

Figure 13: Integration of an accelerator with our microarchitecture in a system. (a) Original accelerator with multiple data references. (b) Our accelerator with a single data reference can be easily connected to an offchip prefetch module with bus burst accesses. (c) Our accelerator with a single data reference can be easily co-optimized with another accelerator by loop transformation for data forwarding.

Note that our methodology transforms the original accelerator with multiple data references, e.g., Fig. 13a, into a new accelerator with a single reference, e.g., Fig. 13b. An accelerator with our microarchitecture can be easily connected to an offchip prefetch module with bus burst accesses. The prefetch module can directly forward the data stream from the bus pipeline to the accelerator and only needs a small buffer to hide the bus latency. In addition, our microarchitecture can simplify the co-optimization of accelerators for inter-block communication. In the example of Fig. 13c, accelerator 1 writes out A[i][j] in the order of i and j. Accelerator 2 reads A[i][j], A[i][j-1], A[i][j+1], A[i-1][j] and A[i+1][j] in the order of j and i. The load operations of accelerator 2 are very different from the store operations of accelerator 1. To implement the inter-block communication, accelerator 1 would have to write the output data block to an on-chip block memory first, after which accelerator 2 reads the data from the block memory. With our microarchitecture, however, the input data stream of accelerator 2 is transformed to A[i][j], which has the same form as the output of accelerator 1. This allows us to apply loop transformation so that the data orders at the output of accelerator 1 and the input of accelerator 2 are the same, e.g., enabling the technology in [15]. Then data can be directly forwarded from accelerator 1 to accelerator 2 without a large on-chip block memory. In summary, our microarchitecture facilitates system-level synthesis in terms of optimizing module-level communication and offchip communication.

9.4 Enabling Bandwidth/Memory Tradeoff

There is a tradeoff between offchip bandwidth and on-chip memory usage. When we have larger offchip bandwidth, we can use smaller on-chip memory. The extreme case is that when we have sufficiently large offchip bandwidth, we do not need any on-chip memory for data reuse. The different pairs of offchip bandwidth occupation and on-chip memory usage form the points on a design curve. When we move along the curve, the array references that go through reuse buffers will change. The problem is that conventional memory partitioning does not guarantee an optimal solution for an arbitrary access pattern, so the optimality of memory partitioning may be lost when performing a bandwidth/memory tradeoff. One of the benefits of our microarchitecture is that it automatically reduces the buffer size to the minimum when the offchip bandwidth is increased, while keeping the optimal design structure.

Figure 14: Break the chain at the largest reuse buffer when the number of offchip accesses is increased. (a) The original microarchitecture with one offchip access per cycle. (b) The new microarchitecture with two offchip accesses per cycle but smaller total buffer size.

As shown in Fig. 14, when we allow one more offchip access in each clock cycle, we pick the largest reuse buffer in our microarchitecture in Fig. 14(a) and replace it with an input data stream from offchip in Fig. 14(b). By doing so, we achieve a graceful degradation of on-chip memory usage with the increase of offchip bandwidth, as shown in Fig. 15. We use the 19-point stencil in SEGMENTATION in Fig. 6(c) as an example, and sweep the number of offchip accesses per cycle from 1 to 18. The three phases in Fig. 15 show that we first give up inter-plane data reuse, which needs large buffers; then inter-row data reuse, which needs medium buffers; and finally intra-row data reuse, which needs small buffers.
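The degradation policy of Fig. 14 reads as a simple greedy rule: each extra offchip stream removes the largest remaining FIFO from the chain. The sketch below only does the storage bookkeeping of that rule (it does not model the alignment of the new offchip stream); the function name is illustrative, and the sizes are taken from Table 2.

    #include <stdio.h>

    /* On-chip storage left after granting `extra` additional offchip
     * accesses per cycle: each one replaces the largest remaining
     * reuse FIFO with a direct offchip input stream (Fig. 14). */
    static long onchip_after_tradeoff(long sizes[], int n, int extra) {
        for (int k = 0; k < extra; k++) {
            int imax = 0;
            for (int i = 1; i < n; i++)
                if (sizes[i] > sizes[imax]) imax = i;
            sizes[imax] = 0;            /* buffer replaced by a stream */
        }
        long total = 0;
        for (int i = 0; i < n; i++) total += sizes[i];
        return total;
    }

    int main(void) {
        const long table2[4] = {1023, 1, 1, 1023};  /* DENOISE FIFOs */
        for (int extra = 0; extra <= 2; extra++) {
            long fifo[4];
            for (int i = 0; i < 4; i++) fifo[i] = table2[i];
            printf("%d extra stream(s): %ld elements on chip\n",
                   extra, onchip_after_tradeoff(fifo, 4, extra));
        }
        return 0;  /* prints 2048, 1025, 2 */
    }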
9.5 Difference from Memory Partitioning for Distributed Computing

This work mainly targets memory partitioning for behavioral synthesis. In parallel, memory partitioning for distributed computing has been studied for decades [20]. However, there are fundamental differences between these two scenarios. The first difference is


More information

CS152 Computer Architecture and Engineering

CS152 Computer Architecture and Engineering CS152 Computer Architecture and Engineering Caches and the Memory Hierarchy Assigned 9/17/2016 Problem Set #2 Due Tue, Oct 4 http://inst.eecs.berkeley.edu/~cs152/fa16 The problem sets are intended to help

More information

Architecture-Level Synthesis for Automatic Interconnect Pipelining

Architecture-Level Synthesis for Automatic Interconnect Pipelining Architecture-Level Synthesis for Automatic Interconnect Pipelining Jason Cong, Yiping Fan, Zhiru Zhang Computer Science Department University of California, Los Angeles, CA 90095 {cong, fanyp, zhiruz}@cs.ucla.edu

More information

Sparse Matrix-Vector Multiplication FPGA Implementation

Sparse Matrix-Vector Multiplication FPGA Implementation UNIVERSITY OF CALIFORNIA, LOS ANGELES Sparse Matrix-Vector Multiplication FPGA Implementation (SID: 704-272-121) 02/27/2015 Table of Contents 1 Introduction... 3 2 Sparse Matrix-Vector Multiplication...

More information

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, Chunyuan Zhang National University of Defense Technology,

More information

ISSN Vol.05,Issue.09, September-2017, Pages:

ISSN Vol.05,Issue.09, September-2017, Pages: WWW.IJITECH.ORG ISSN 2321-8665 Vol.05,Issue.09, September-2017, Pages:1693-1697 AJJAM PUSHPA 1, C. H. RAMA MOHAN 2 1 PG Scholar, Dept of ECE(DECS), Shirdi Sai Institute of Science and Technology, Anantapuramu,

More information

Parallel graph traversal for FPGA

Parallel graph traversal for FPGA LETTER IEICE Electronics Express, Vol.11, No.7, 1 6 Parallel graph traversal for FPGA Shice Ni a), Yong Dou, Dan Zou, Rongchun Li, and Qiang Wang National Laboratory for Parallel and Distributed Processing,

More information

Using Intel Streaming SIMD Extensions for 3D Geometry Processing

Using Intel Streaming SIMD Extensions for 3D Geometry Processing Using Intel Streaming SIMD Extensions for 3D Geometry Processing Wan-Chun Ma, Chia-Lin Yang Dept. of Computer Science and Information Engineering National Taiwan University firebird@cmlab.csie.ntu.edu.tw,

More information

A Novel Design Framework for the Design of Reconfigurable Systems based on NoCs

A Novel Design Framework for the Design of Reconfigurable Systems based on NoCs Politecnico di Milano & EPFL A Novel Design Framework for the Design of Reconfigurable Systems based on NoCs Vincenzo Rana, Ivan Beretta, Donatella Sciuto Donatella Sciuto sciuto@elet.polimi.it Introduction

More information

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors G. Chen 1, M. Kandemir 1, I. Kolcu 2, and A. Choudhary 3 1 Pennsylvania State University, PA 16802, USA 2 UMIST,

More information

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM

More information

Integrating Customer Requirements into Data Product Design

Integrating Customer Requirements into Data Product Design 114 Econom Informatics, 2003 Integrating Customer Requirements into Data Product Design Ec., Ph.D. candidate Otilia PÎRLOG Successful data products are ones that meet customer needs, are innovative and

More information

General Purpose GPU Programming. Advanced Operating Systems Tutorial 9

General Purpose GPU Programming. Advanced Operating Systems Tutorial 9 General Purpose GPU Programming Advanced Operating Systems Tutorial 9 Tutorial Outline Review of lectured material Key points Discussion OpenCL Future directions 2 Review of Lectured Material Heterogeneous

More information

Fractal Interpolation Representation of A Class of Quadratic Functions

Fractal Interpolation Representation of A Class of Quadratic Functions ISSN 1749-3889 (print, 1749-3897 (online International Journal of Nonlinear Science Vol.24(2017 No.3, pp.175-179 Fractal Interpolation Representation of A Class of Quadratic Functions Chengbin Shen, Zhigang

More information

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs IBM Research AI Systems Day DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs Xiaofan Zhang 1, Junsong Wang 2, Chao Zhu 2, Yonghua Lin 2, Jinjun Xiong 3, Wen-mei

More information

Efficient Self-Reconfigurable Implementations Using On-Chip Memory

Efficient Self-Reconfigurable Implementations Using On-Chip Memory 10th International Conference on Field Programmable Logic and Applications, August 2000. Efficient Self-Reconfigurable Implementations Using On-Chip Memory Sameer Wadhwa and Andreas Dandalis University

More information

E V ER-growing global competition forces. Accuracy Analysis and Improvement for Direct Laser Sintering

E V ER-growing global competition forces. Accuracy Analysis and Improvement for Direct Laser Sintering Accurac Analsis and Improvement for Direct Laser Sintering Y. Tang 1, H. T. Loh 12, J. Y. H. Fuh 2, Y. S. Wong 2, L. Lu 2, Y. Ning 2, X. Wang 2 1 Singapore-MIT Alliance, National Universit of Singapore

More information

DATA REUSE ANALYSIS FOR AUTOMATED SYNTHESIS OF CUSTOM INSTRUCTIONS IN SLIDING WINDOW APPLICATIONS

DATA REUSE ANALYSIS FOR AUTOMATED SYNTHESIS OF CUSTOM INSTRUCTIONS IN SLIDING WINDOW APPLICATIONS Georgios Zacharopoulos Giovanni Ansaloni Laura Pozzi DATA REUSE ANALYSIS FOR AUTOMATED SYNTHESIS OF CUSTOM INSTRUCTIONS IN SLIDING WINDOW APPLICATIONS Università della Svizzera italiana (USI Lugano), Faculty

More information

2.8 Distance and Midpoint Formulas; Circles

2.8 Distance and Midpoint Formulas; Circles Section.8 Distance and Midpoint Formulas; Circles 9 Eercises 89 90 are based on the following cartoon. B.C. b permission of Johnn Hart and Creators Sndicate, Inc. 89. Assuming that there is no such thing

More information

Simplify System Complexity

Simplify System Complexity 1 2 Simplify System Complexity With the new high-performance CompactRIO controller Arun Veeramani Senior Program Manager National Instruments NI CompactRIO The Worlds Only Software Designed Controller

More information

Register Allocation 3/16/11. What a Smart Allocator Needs to Do. Global Register Allocation. Global Register Allocation. Outline.

Register Allocation 3/16/11. What a Smart Allocator Needs to Do. Global Register Allocation. Global Register Allocation. Outline. What a Smart Allocator Needs to Do Register Allocation Global Register Allocation Webs and Graph Coloring Node Splitting and Other Transformations Determine ranges for each variable can benefit from using

More information

RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch

RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC Zoltan Baruch Computer Science Department, Technical University of Cluj-Napoca, 26-28, Bariţiu St., 3400 Cluj-Napoca,

More information

Single Pass Connected Components Analysis

Single Pass Connected Components Analysis D. G. Bailey, C. T. Johnston, Single Pass Connected Components Analysis, Proceedings of Image and Vision Computing New Zealand 007, pp. 8 87, Hamilton, New Zealand, December 007. Single Pass Connected

More information

6.9. Communicating to the Outside World: Cluster Networking

6.9. Communicating to the Outside World: Cluster Networking 6.9 Communicating to the Outside World: Cluster Networking This online section describes the networking hardware and software used to connect the nodes of cluster together. As there are whole books and

More information

High-throughput Online Hash Table on FPGA*

High-throughput Online Hash Table on FPGA* High-throughput Online Hash Table on FPGA* Da Tong, Shijie Zhou, Viktor K. Prasanna Ming Hsieh Dept. of Electrical Engineering University of Southern California Los Angeles, CA 989 Email: datong@usc.edu,

More information

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring Caches and the Memory Hierarchy

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring Caches and the Memory Hierarchy CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring 2019 Caches and the Memory Hierarchy Assigned February 13 Problem Set #2 Due Wed, February 27 http://inst.eecs.berkeley.edu/~cs152/sp19

More information

Migration Based Page Caching Algorithm for a Hybrid Main Memory of DRAM and PRAM

Migration Based Page Caching Algorithm for a Hybrid Main Memory of DRAM and PRAM Migration Based Page Caching Algorithm for a Hybrid Main Memory of DRAM and PRAM Hyunchul Seok Daejeon, Korea hcseok@core.kaist.ac.kr Youngwoo Park Daejeon, Korea ywpark@core.kaist.ac.kr Kyu Ho Park Deajeon,

More information

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI CMPE 655- MULTIPLE PROCESSOR SYSTEMS OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI What is MULTI PROCESSING?? Multiprocessing is the coordinated processing

More information

General Purpose GPU Programming. Advanced Operating Systems Tutorial 7

General Purpose GPU Programming. Advanced Operating Systems Tutorial 7 General Purpose GPU Programming Advanced Operating Systems Tutorial 7 Tutorial Outline Review of lectured material Key points Discussion OpenCL Future directions 2 Review of Lectured Material Heterogeneous

More information

A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing

A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing 727 A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing 1 Bharati B. Sayankar, 2 Pankaj Agrawal 1 Electronics Department, Rashtrasant Tukdoji Maharaj Nagpur University, G.H. Raisoni

More information

Ultra-Fast NoC Emulation on a Single FPGA

Ultra-Fast NoC Emulation on a Single FPGA The 25 th International Conference on Field-Programmable Logic and Applications (FPL 2015) September 3, 2015 Ultra-Fast NoC Emulation on a Single FPGA Thiem Van Chu, Shimpei Sato, and Kenji Kise Tokyo

More information

Fault Grading FPGA Interconnect Test Configurations

Fault Grading FPGA Interconnect Test Configurations * Fault Grading FPGA Interconnect Test Configurations Mehdi Baradaran Tahoori Subhasish Mitra* Shahin Toutounchi Edward J. McCluskey Center for Reliable Computing Stanford University http://crc.stanford.edu

More information

ISSN Vol.05, Issue.12, December-2017, Pages:

ISSN Vol.05, Issue.12, December-2017, Pages: ISSN 2322-0929 Vol.05, Issue.12, December-2017, Pages:1174-1178 www.ijvdcs.org Design of High Speed DDR3 SDRAM Controller NETHAGANI KAMALAKAR 1, G. RAMESH 2 1 PG Scholar, Khammam Institute of Technology

More information

XPU A Programmable FPGA Accelerator for Diverse Workloads

XPU A Programmable FPGA Accelerator for Diverse Workloads XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

Performance Evolution of DDR3 SDRAM Controller for Communication Networks

Performance Evolution of DDR3 SDRAM Controller for Communication Networks Performance Evolution of DDR3 SDRAM Controller for Communication Networks U.Venkata Rao 1, G.Siva Suresh Kumar 2, G.Phani Kumar 3 1,2,3 Department of ECE, Sai Ganapathi Engineering College, Visakhaapatnam,

More information

COSC 6385 Computer Architecture. - Memory Hierarchies (II)

COSC 6385 Computer Architecture. - Memory Hierarchies (II) COSC 6385 Computer Architecture - Memory Hierarchies (II) Fall 2008 Cache Performance Avg. memory access time = Hit time + Miss rate x Miss penalty with Hit time: time to access a data item which is available

More information

Nodes Energy Conserving Algorithms to prevent Partitioning in Wireless Sensor Networks

Nodes Energy Conserving Algorithms to prevent Partitioning in Wireless Sensor Networks IJCSNS International Journal of Computer Science and Network Security, VOL.17 No.9, September 2017 139 Nodes Energy Conserving Algorithms to prevent Partitioning in Wireless Sensor Networks MINA MAHDAVI

More information

TMS320C6678 Memory Access Performance

TMS320C6678 Memory Access Performance Application Report Lit. Number April 2011 TMS320C6678 Memory Access Performance Brighton Feng Communication Infrastructure ABSTRACT The TMS320C6678 has eight C66x cores, runs at 1GHz, each of them has

More information

An Energy Efficient Location Service for Mobile Ad Hoc etworks

An Energy Efficient Location Service for Mobile Ad Hoc etworks An Energ Efficient Location Service for Mobile Ad Hoc etworks Zijian Wang, Euphan Bulut and Boleslaw K. Szmanski, Department of Computer Science, Rensselaer Poltechnic Institute, Tro, NY 280 USA {wangz,

More information

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 133 CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 6.1 INTRODUCTION As the era of a billion transistors on a one chip approaches, a lot of Processing Elements (PEs) could be located

More information

08 - Address Generator Unit (AGU)

08 - Address Generator Unit (AGU) October 2, 2014 Todays lecture Memory subsystem Address Generator Unit (AGU) Schedule change A new lecture has been entered into the schedule (to compensate for the lost lecture last week) Memory subsystem

More information

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1)

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) Department of Electr rical Eng ineering, Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering,

More information

Behavioral Array Mapping into Multiport Memories Targeting Low Power 3

Behavioral Array Mapping into Multiport Memories Targeting Low Power 3 Behavioral Array Mapping into Multiport Memories Targeting Low Power 3 Preeti Ranjan Panda and Nikil D. Dutt Department of Information and Computer Science University of California, Irvine, CA 92697-3425,

More information

A Circle Detection Method Based on Optimal Parameter Statistics in Embedded Vision

A Circle Detection Method Based on Optimal Parameter Statistics in Embedded Vision A Circle Detection Method Based on Optimal Parameter Statistics in Embedded Vision Xiaofeng Lu,, Xiangwei Li, Sumin Shen, Kang He, and Songu Yu Shanghai Ke Laborator of Digital Media Processing and Transmissions

More information

A novel implementation of tile-based address mapping

A novel implementation of tile-based address mapping A novel implementation of tile-based address mapping Sambuddhi Hettiaratchi and Peter Y.K. Cheung Department of Electrical and Electronic Engineering Imperial College of Science, Technolog and Medicine,

More information

[Sahu* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

[Sahu* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY SPAA AWARE ERROR TOLERANT 32 BIT ARITHMETIC AND LOGICAL UNIT FOR GRAPHICS PROCESSOR UNIT Kaushal Kumar Sahu*, Nitin Jain Department

More information

Improving Polyhedral Code Generation for High-Level Synthesis

Improving Polyhedral Code Generation for High-Level Synthesis Improving Polyhedral Code Generation for High-Level Synthesis Wei Zuo 1,5 Peng Li 2,3 Deming Chen 5 Louis-Noël Pouchet 4,3 Shunan Zhong 1 Jason Cong 4,3 1 Beijing Institute of Technology 2 Peking University

More information

Simplify System Complexity

Simplify System Complexity Simplify System Complexity With the new high-performance CompactRIO controller Fanie Coetzer Field Sales Engineer Northern South Africa 2 3 New control system CompactPCI MMI/Sequencing/Logging FieldPoint

More information

Embedded Systems. 8. Hardware Components. Lothar Thiele. Computer Engineering and Networks Laboratory

Embedded Systems. 8. Hardware Components. Lothar Thiele. Computer Engineering and Networks Laboratory Embedded Systems 8. Hardware Components Lothar Thiele Computer Engineering and Networks Laboratory Do you Remember? 8 2 8 3 High Level Physical View 8 4 High Level Physical View 8 5 Implementation Alternatives

More information

Towards Performance Modeling of 3D Memory Integrated FPGA Architectures

Towards Performance Modeling of 3D Memory Integrated FPGA Architectures Towards Performance Modeling of 3D Memory Integrated FPGA Architectures Shreyas G. Singapura, Anand Panangadan and Viktor K. Prasanna University of Southern California, Los Angeles CA 90089, USA, {singapur,

More information

Major Half Served First (MHSF) Disk Scheduling Algorithm

Major Half Served First (MHSF) Disk Scheduling Algorithm Vol. II, Issue I, Januar 13 (ISSN: 2278-77) Major Half Served First (MHSF) Disk Scheduling Algorithm Manish Kumar Mishra Department of Information Technolog College of Computing and Informatics Haramaa

More information

Embedded Systems. 7. System Components

Embedded Systems. 7. System Components Embedded Systems 7. System Components Lothar Thiele 7-1 Contents of Course 1. Embedded Systems Introduction 2. Software Introduction 7. System Components 10. Models 3. Real-Time Models 4. Periodic/Aperiodic

More information

Chapter 8 Memory Basics

Chapter 8 Memory Basics Logic and Computer Design Fundamentals Chapter 8 Memory Basics Charles Kime & Thomas Kaminski 2008 Pearson Education, Inc. (Hyperlinks are active in View Show mode) Overview Memory definitions Random Access

More information

Addendum to Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches

Addendum to Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches Addendum to Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches Gabriel H. Loh Mark D. Hill AMD Research Department of Computer Sciences Advanced Micro Devices, Inc. gabe.loh@amd.com

More information

Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors. By: Anvesh Polepalli Raj Muchhala

Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors. By: Anvesh Polepalli Raj Muchhala Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors By: Anvesh Polepalli Raj Muchhala Introduction Integrating CPU and GPU into a single chip for performance

More information

Chapter 5: ASICs Vs. PLDs

Chapter 5: ASICs Vs. PLDs Chapter 5: ASICs Vs. PLDs 5.1 Introduction A general definition of the term Application Specific Integrated Circuit (ASIC) is virtually every type of chip that is designed to perform a dedicated task.

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays

Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays Éricles Sousa 1, Frank Hannig 1, Jürgen Teich 1, Qingqing Chen 2, and Ulf Schlichtmann

More information

Three DIMENSIONAL-CHIPS

Three DIMENSIONAL-CHIPS IOSR Journal of Electronics and Communication Engineering (IOSR-JECE) ISSN: 2278-2834, ISBN: 2278-8735. Volume 3, Issue 4 (Sep-Oct. 2012), PP 22-27 Three DIMENSIONAL-CHIPS 1 Kumar.Keshamoni, 2 Mr. M. Harikrishna

More information

ECE 5775 High-Level Digital Design Automation, Fall 2016 School of Electrical and Computer Engineering, Cornell University

ECE 5775 High-Level Digital Design Automation, Fall 2016 School of Electrical and Computer Engineering, Cornell University ECE 5775 High-Level Digital Design Automation, Fall 2016 School of Electrical and Computer Engineering, Cornell University Optical Flow on FPGA Ian Thompson (ijt5), Joseph Featherston (jgf82), Judy Stephen

More information

Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks

Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks Abstract Deep Convolutional Neural Networks (DCNN) have proven to be very effective in many pattern recognition applications, such

More information

NONCONGRUENT EQUIDISSECTIONS OF THE PLANE

NONCONGRUENT EQUIDISSECTIONS OF THE PLANE NONCONGRUENT EQUIDISSECTIONS OF THE PLANE D. FRETTLÖH Abstract. Nandakumar asked whether there is a tiling of the plane b pairwise non-congruent triangles of equal area and equal perimeter. Here a weaker

More information

Measurement-based Analysis of TCP/IP Processing Requirements

Measurement-based Analysis of TCP/IP Processing Requirements Measurement-based Analysis of TCP/IP Processing Requirements Srihari Makineni Ravi Iyer Communications Technology Lab Intel Corporation {srihari.makineni, ravishankar.iyer}@intel.com Abstract With the

More information

A Configurable Multi-Ported Register File Architecture for Soft Processor Cores

A Configurable Multi-Ported Register File Architecture for Soft Processor Cores A Configurable Multi-Ported Register File Architecture for Soft Processor Cores Mazen A. R. Saghir and Rawan Naous Department of Electrical and Computer Engineering American University of Beirut P.O. Box

More information

A new Adaptive PID Control Approach Based on Closed-Loop Response Recognition

A new Adaptive PID Control Approach Based on Closed-Loop Response Recognition roceedings of the 7th WSEAS nternational Conference on Automation & nformation, Cavtat, Croatia, June 13-15, 2006 (pp156-160) A Adaptive Control Approach Based on Closed-Loop Response Recognition MARTN

More information

APPLICATION OF RECIRCULATION NEURAL NETWORK AND PRINCIPAL COMPONENT ANALYSIS FOR FACE RECOGNITION

APPLICATION OF RECIRCULATION NEURAL NETWORK AND PRINCIPAL COMPONENT ANALYSIS FOR FACE RECOGNITION APPLICATION OF RECIRCULATION NEURAL NETWORK AND PRINCIPAL COMPONENT ANALYSIS FOR FACE RECOGNITION Dmitr Brliuk and Valer Starovoitov Institute of Engineering Cbernetics, Laborator of Image Processing and

More information

Porting Performance across GPUs and FPGAs

Porting Performance across GPUs and FPGAs Porting Performance across GPUs and FPGAs Deming Chen, ECE, University of Illinois In collaboration with Alex Papakonstantinou 1, Karthik Gururaj 2, John Stratton 1, Jason Cong 2, Wen-Mei Hwu 1 1: ECE

More information

ARAPrototyper: Enabling Rapid Prototyping and Evaluation for Accelerator-Rich Architectures

ARAPrototyper: Enabling Rapid Prototyping and Evaluation for Accelerator-Rich Architectures 1 ARAPrototyper: Enabling Rapid Prototyping and Evaluation for Accelerator-Rich Architectures Yu-Ting Chen, Jason Cong, Zhenman Fang, Bingjun Xiao, Peipei Zhou Center for Domain-Specific Computing, University

More information

Generation of Multigrid-based Numerical Solvers for FPGA Accelerators

Generation of Multigrid-based Numerical Solvers for FPGA Accelerators Generation of Multigrid-based Numerical Solvers for FPGA Accelerators Christian Schmitt, Moritz Schmid, Frank Hannig, Jürgen Teich, Sebastian Kuckuk, Harald Köstler Hardware/Software Co-Design, System

More information

Chapter 3 Part 1 Combinational Logic Design

Chapter 3 Part 1 Combinational Logic Design Universit of Wisconsin - Madison EE/omp Sci 352 igital Sstems undamentals Kewal K. Saluja and Yu Hen Hu Spring 22 hapter 3 Part ombinational Logic esign Originals b: harles R. Kime and Tom Kamisnski Modified

More information

Power Efficient Design of DisplayPort (7.0) Using Low-voltage differential signaling IO Standard Via UltraScale Field Programming Gate Arrays

Power Efficient Design of DisplayPort (7.0) Using Low-voltage differential signaling IO Standard Via UltraScale Field Programming Gate Arrays Efficient of DisplayPort (7.0) Using Low-voltage differential signaling IO Standard Via UltraScale Field Programming Gate Arrays Bhagwan Das 1,a, M.F.L Abdullah 1,b, D M Akbar Hussain 2,c, Bishwajeet Pandey

More information