An Optimal Microarchitecture for Stencil Computation Acceleration Based on Non-Uniform Partitioning of Data Reuse Buffers

An Optimal Microarchitecture for Stencil Computation Acceleration Based on Non-Uniform Partitioning of Data Reuse Buffers

Jason Cong*, Peng Li, Bingjun Xiao*, and Peng Zhang
Computer Science Dept. & Electrical Engineering Dept., University of California, Los Angeles
*{cong, xiao}@cs.ucla.edu

ABSTRACT

High-level synthesis (HLS) tools have made significant progress in compiling high-level descriptions of computation into highly pipelined register-transfer level (RTL) specifications. The high-throughput computation raises a high data demand. To prevent data accesses from becoming the bottleneck, on-chip memories are used as data reuse buffers to reduce off-chip accesses. Memory partitioning is also explored to increase memory bandwidth by scheduling multiple simultaneous memory accesses to different memory banks. Prior work on memory partitioning of data reuse buffers is limited to uniform partitioning. In this paper, we perform an early-stage exploration of non-uniform memory partitioning. We use stencil computation, a popular communication-intensive application domain, as a case study to show the potential benefits of non-uniform memory partitioning. Our novel method always achieves the minimum memory size and the minimum number of memory banks, which cannot be guaranteed by any prior work. We develop a generalized microarchitecture to decouple stencil accesses from computation, and an automated design flow to integrate our microarchitecture with the HLS-generated computation kernel for a complete accelerator.

1. INTRODUCTION

Accelerator-centric architectures can bring large energy-efficiency gains by offloading computation from general-purpose CPU cores to application-specific accelerators [1]. The engineering cost of designing massive heterogeneous accelerators is high, but it can be much reduced by raising the abstraction level beyond RTL to C via high-level synthesis (HLS) [2]. Though HLS tools are good at scheduling computation to the timing slots of clock cycles, they do not optimize data accesses. This weakness motivates recent work on data reuse [3, 4] and memory partitioning [5-9] in HLS.

External memory bandwidth is a dominant bottleneck for system performance and power consumption. Data reuse is an efficient technique of using on-chip memories to reduce external memory accesses. When an application contains a data array with multiple references, we can allocate a reuse buffer and keep each array element in the buffer from its first access until its last access. Then each array element needs to be fetched from external memory only once, and the off-chip traffic is reduced to the minimum. Loop transformation can be applied to improve data locality and reduce the size of the data reuse buffer [3, 4]. When the innermost loop of an application is fully pipelined, an accelerator needs to perform multiple load operations from the same reuse buffer every clock cycle. To avoid contention on memory ports, memory partitioning of the reuse buffer is required.
Since the transistor count of memory control logic is proportional to the number of memory banks after partitioning, the optimization goal of memory partitioning is to minimize the number of memory banks. The constraint is that the multiple array elements to be loaded every clock cycle are always stored in different memory banks. The work in [5, 6] provides solid frameworks for memory partitioning. Further optimizations, including memory access rescheduling [7] and multi-dimensional arrays [8], have also been proposed. But none of them can guarantee the optimal solution for a given case in terms of the number of banks. The reason is that their optimization space is limited to uniform memory partitioning, i.e., all the memory banks have to be of the same size. This is an unnecessary constraint inherited from commodity HLS tools, e.g., [10]. Other work partitions different fields of a single data structure into multiple memory banks for data parallelism based on profiling results [9]. It is orthogonal to the problems on multiple array references addressed in [5-8] and in this paper.

In this paper, we go beyond the limitation of uniform memory partitioning and propose a novel method based on non-uniform memory partitioning. We found that after removal of this limitation, we can achieve fewer memory banks than the optimal solutions in prior work [5-8]. As an early-stage exploration of non-uniform memory partitioning, in this paper we focus on stencil computation, a popular communication-intensive application domain. We develop a microarchitecture with novel memory-system structures that achieve the theoretical minimum number of memory banks for any stencil access pattern. Experimental results show that we can substantially reduce the usage of various resources, including BRAMs, logic slices, and DSPs, compared to prior work [8], along with slightly improved timing.

2. PRELIMINARY

2.1 Stencil Computation

Stencil computation comprises an important class of kernels in many application domains, such as image processing, constituent kernels in multigrid methods, and partial differential equation solvers. These kernels often contribute most of the workload in these applications. Even in the recent publications on memory partitioning [7, 8], which were developed for general applications, all the benchmarks used are in fact stencil computations. The data elements accessed in stencil computation lie on a large multi-dimensional grid which usually exceeds on-chip memory capacity.

The computation is iterated as a stencil window slides over the grid. In each iteration, the computation kernel accesses all the data points in the stencil window to calculate an output. Both the grid shape and the stencil window can be arbitrary, as specified by the given stencil application. A precise definition of stencil computation can be found in Appendix 9.1. Fig. 1 shows an example stencil computation in the kernel DENOISE in medical imaging [11].

Figure 1: Example C code of a typical stencil computation (5-point stencil window in the kernel DENOISE in medical imaging [11]).

Figure 2: Iteration domain of the example stencil computation in Fig. 1.

Its grid shape is a rectangle, and its stencil window contains 5 points, as shown in Fig. 2. Five data elements need to be accessed in each iteration. In addition, many data elements will be repeatedly accessed among these iterations. For example, A[2][2] will be accessed five times, when (i, j) ∈ {(1, 2), (2, 1), (2, 2), (2, 3), (3, 2)}. This leads to high on-chip memory port contention and off-chip traffic, especially when the stencil window is large (e.g., after loop fusion of stencil applications for computation reduction as proposed in [12]). Therefore, during the hardware development of a stencil application, a large portion of engineering effort is spent on data reuse and memory partitioning optimization.

2.2 Microarchitecture for Stencil Accesses

Figure 3: The overall architecture of our microarchitecture for stencil computation. It decouples stencil accesses from computation.

In this work we develop a microarchitecture to decouple the stencil access patterns from the computation, as shown in Fig. 3. The microarchitecture contains multiple memory systems, each optimized for one data array with stencil accesses. Since there are no reuse opportunities among different data arrays, the memory systems for different arrays are independent of each other. Each memory system receives a single data stream which iterates over a multi-dimensional grid without any repeated external access. Each memory system contains data reuse buffers, memory controllers, and interconnects that have been customized for the access patterns of the data array used in the target stencil computation. A memory system sends data to the computation kernel via one data port per array reference in the original user code. If all the data are consumed by the computation kernel, the memory system will immediately prepare the data used in the next cycle to feed into the fully pipelined computation kernel.

Figure 4: Example code of the computation kernel where all the memory accesses are offloaded to our microarchitecture. The keyword volatile in the code informs HLS tools of potential data change after each access. The pragma pipeline is used in Xilinx Vivado HLS [10] to pipeline the innermost loop.

With our microarchitecture, the C code of the computation kernel can be simplified to Fig. 4, where users no longer need to optimize memory accesses; this task is offloaded to our microarchitecture. Users can assume that each data access gets the same data from our microarchitecture as the original load operation would. The C code of the computation kernel can then be compiled by HLS tools into a fully pipelined hardware implementation with the most efficient resource usage.
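As a concrete illustration of the two code shapes in Fig. 1 and Fig. 4, here is a minimal C sketch. The 5-point window, the volatile keyword, and the pipeline pragma are given above, and the 768x1024 grid matches Fig. 2; the averaging expression, the 0.2f coefficient, and the function names are illustrative assumptions rather than the paper's exact listings.

    /* Sketch of Fig. 1: original kernel with five references to A.
     * The computation itself is a placeholder; only the 5-point
     * window A[i][j], A[i][j-1], A[i][j+1], A[i-1][j], A[i+1][j]
     * is given by the paper. */
    void denoise_orig(float A[768][1024], float B[768][1024]) {
        for (int i = 1; i < 767; i++)
            for (int j = 1; j < 1023; j++)
                B[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i][j+1]
                                  + A[i-1][j] + A[i+1][j]);
    }

    /* Sketch of Fig. 4: transformed kernel. Each array reference
     * becomes one data port fed by a data filter of our memory
     * system; volatile tells the HLS tool that the value changes
     * on every access, and the pragma requests a pipelined loop. */
    void denoise_kernel(volatile float *a_c, volatile float *a_w,
                        volatile float *a_e, volatile float *a_n,
                        volatile float *a_s, volatile float *b) {
        for (int i = 1; i < 767; i++)
            for (int j = 1; j < 1023; j++) {
    #pragma HLS pipeline II=1
                *b = 0.2f * (*a_c + *a_w + *a_e + *a_n + *a_s);
            }
    }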
2.3 Targeting Optimal Design

The design target of our microarchitecture has three aspects:

1. Full pipelining. As the stencil window slides every clock cycle, the microarchitecture must send out all the data in the stencil window to the computation kernel and get all the data ready for the consecutive accesses in the next cycle.

2. Minimum data reuse buffer size. When a data element is fully reused for each access, it stays in on-chip memories from its first access until its last access. Meanwhile, other elements in the array are loaded from external memory every clock cycle. Therefore, the theoretical minimum size of the reuse buffer for a data array is equal to the maximum lifetime of an element in the array. In the example in Fig. 2, A[2][2] is accessed by the array reference A[i+1][j] for the first time, and is accessed 2048 cycles later by the reference A[i-1][j] for the last time. Therefore, the minimum size of the data reuse buffer for array A will be 2048.

3. Minimum number of reuse buffer banks. Suppose the stencil window of an input array contains n points, i.e., there are n data references to the array. Then in each clock cycle, n data elements need to be read out, either from reuse buffers or from external memory. Suppose we use dual-port memories to implement the reuse buffers. One port of a buffer bank is occupied by the replacement of an expired data element with a new element from external memory every cycle. There is only one port left in each memory bank for us to read the n elements needed by the stencil window. Suppose one of the n elements happens to be the new element from external memory under a certain smart data reuse mechanism. There are still n-1 data elements to read, and we need at least n-1 memory banks. In the example of Fig. 2, n = 5 indicates that we need at least four memory banks; a worked instance of both bounds is given below.
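For the DENOISE example, both optimality targets can be checked by hand; a worked sketch follows. It assumes the row-major input stream of Fig. 2 with row width W = 1024; the offsets, the value 2048, and the bound of four banks are all stated above.

    % Minimum reuse buffer size: lifetime of an element between its
    % first access by A[i+1][j] (offset (1,0)) and its last access
    % by A[i-1][j] (offset (-1,0)), linearized row-major:
    \bar{r} = \bigl(1 - (-1)\bigr) \cdot W + (0 - 0) = 2 \cdot 1024 = 2048
    % Minimum number of banks: n = 5 references, with one element per
    % cycle servable directly from the incoming off-chip stream:
    \#\text{banks} \ge n - 1 = 5 - 1 = 4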

As the stencil window slides, the five data elements in the window should never be in the same bank. This is a tough constraint to satisfy, and prior work [5-8] had to use more banks to eliminate bank conflicts in difficult cases.

Figure 5: Even for the constant stencil window in Fig. 2, the number of banks varies as the row size of the data grid changes [5].

Fig. 5 shows that the number of banks in [5] ranges from five to eight as the row size of the data grid changes, even though the stencil window keeps the constant shape in Fig. 2. The technologies proposed in [7, 8] can keep the number of banks consistently at five in the case of the stencil window shown in Fig. 2.

Figure 6: Example stencil windows where more banks are needed than the number of array references in [7, 8]. (a) 4-point stencil in BICUBIC [13]. (b) 4-point stencil in RICIAN [14]. (c) 19-point stencil in SEGMENTATION_3D [8].

However, when the stencil window changes to some other shape in other applications, e.g., the ones shown in Fig. 6, the methods in [7, 8] will need 5, 5, and 20 banks respectively, which are larger than the minimum values. In this work we present a generalized microarchitecture that can simultaneously achieve these optimal design targets for any application that falls into the category of stencil computation.

3. METHODOLOGY

3.1 Overview

Figure 7: The example circuit structure of our memory system generated for array A in the stencil computation of Fig. 1.

The internal structure of a memory system in our microarchitecture is illustrated by the example in Fig. 7, which is generated for the stencil computation in Fig. 1. Suppose the stencil window contains n points (n = 5 in the example of Fig. 1). Our memory system will then contain n-1 data reuse FIFOs, as well as n data path splitters and n data filters connected together in the way shown in Fig. 7. The data reuse FIFOs provide the same storage as conventional data reuse buffers, and the data path splitters and filters work as memory controllers and data interconnects. In contrast to conventional cyclic memory partitioning [5, 7, 8], which uses uniform buffer sizes, the sizes of the reuse buffers in our design are non-uniform. They are customized to the shape of the stencil window in the target application; a software sketch of this chain is given below.
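To make the chain topology of Fig. 7 concrete, the following is a minimal cycle-level software model of one splitter stage, assuming simple bounded ring buffers; the type and function names are illustrative, not the paper's RTL. Data enters at splitter s0 and flows through FIFO_0 .. FIFO_{n-2}; each splitter hands every element both to the next FIFO and to its own data filter.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct {            /* bounded ring buffer = reuse FIFO */
        float *buf;
        size_t cap, head, tail, cnt;
    } Fifo;

    static bool fifo_peek(const Fifo *f, float *v) {
        if (f->cnt == 0) return false;
        *v = f->buf[f->head];
        return true;
    }
    static void fifo_pop(Fifo *f)  { f->head = (f->head + 1) % f->cap; f->cnt--; }
    static void fifo_push(Fifo *f, float v) {
        f->buf[f->tail] = v; f->tail = (f->tail + 1) % f->cap; f->cnt++;
    }

    /* One cycle of data path splitter s_x: it fires only when its
     * precedent FIFO has data, its successive FIFO (NULL for the
     * last splitter) has room, and filter_x can accept an element;
     * it then forwards the element downstream and to filter_x. */
    bool splitter_step(Fifo *up, Fifo *down, bool filter_ready,
                       float *to_filter) {
        float v;
        if (!fifo_peek(up, &v)) return false;             /* upstream empty  */
        if (down && down->cnt == down->cap) return false; /* downstream full */
        if (!filter_ready) return false;                  /* filter stalled  */
        fifo_pop(up);
        if (down) fifo_push(down, v);
        *to_filter = v;
        return true;
    }

For splitter s0, up models the external input stream; calling splitter_step once per module per clock cycle reproduces the autonomous, back-pressured behavior described in Section 3.3.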
3.2 Denotations

To better explain the working principle of our memory system, Table 1 lists the denotations used in the following sections. The precise definitions are given in Appendix 9.1.

Denotation      | Meaning                  | Example by Fig. 2
i               | loop iteration vector    | i = (1, 2)
A               | data array               | the array A with the five references
A_x             | array reference          | one of the five references, such as A[i+1][j]
h               | data access index        | h = (2, 2) being accessed by A_x
f_x             | data access offset       | f_x = h_x - i, constant for A_x
D_Ax            | data domain              | {(i, j) | 2 <= i <= 767, 1 <= j <= 1022}
D_A             | input data domain        | {(i, j) | 0 <= i <= 767, 0 <= j <= 1023}
r_{Ax->Ay}      | reuse distance vector    | (1, -1) from A[i+1][j] to A[i][j+1]
r̄_{Ax->Az}      | maximum reuse distance   | 2048 from A[i+1][j] to A[i-1][j]
≻_l             | lexicographic order      | (1, 0) ≻_l (0, 1) ≻_l (0, 0) ≻_l (-1, 0)

Table 1: Denotations used in the working principle of our memory system.

3.3 Working Principle

Since our microarchitecture is based on non-uniform memory partitioning, in contrast to the uniform partitioning in prior work, the memory-controlling mechanism cannot follow the modulo scheduling of data accesses among memory banks used in prior work [5, 7, 8]. Instead, our microarchitecture is a novel design based on data streaming. Each module in our design is autonomous and can work in a full pipeline as long as its upstream module produces a data element and its downstream module consumes an element every clock cycle. We discuss the working principle of our design in terms of functional correctness, deadlock freedom, minimum reuse buffer size, and minimum number of buffer banks.

3.3.1 Functional Correctness

Each data path splitter in Fig. 7 reads an existing data element from its precedent FIFO and sends the data element to the successive FIFO as well as to the data filter below it. Each data filter customizes the data stream that flows into the computation kernel to fit the access pattern of the associated array reference. A data filter for an array reference A_x receives the data stream which iterates over D_A and sends out the data which iterates over D_Ax. For example, filter 0 in Fig. 7 selects the data elements in the set D_A0 = {(i, j) | 2 <= i <= 767, 1 <= j <= 1022} out of the input data domain D_A in Table 1, discarding the first two rows of the 2D grid in Fig. 2. This guarantees the correctness of the data set sent to each data port of the computation kernel. Due to the property of stencil computation, the data elements accessed by an array reference are in the same lexicographic order as the loop iteration (see Property 1 in Appendix 9.1). Our microarchitecture based on data streaming enforces this order, as long as the input data stream is also in lexicographic order (i.e., data iterated from innermost loop to outermost loop). The lexicographic order of input data is usually realized without hardware overhead, since it fits well with burst accesses to external memory or inter-accelerator communication patterns (see the discussion in Appendix 9.3). By providing the correct data set and the correct data order to the array references, our design guarantees that as the stencil window slides, the data elements received by the computation kernel are always consistent with the array references.

3.3.2 Deadlock Freedom

Since the modules in our design are autonomous instead of being synchronized by a centralized controller, a critical design challenge is to ensure that these modules are free of deadlock. Deadlock is caused by any potential cycle in the dependency graph of all the modules. Due to the chaining structure shown in Fig. 7, any potential cycle must go through a pair of data filters.

Figure 8: The dependency graph of data filters filter_x and filter_y, where x < y, shown in the form of two dependency cycles.

For any pair of data filters filter_x and filter_y where x < y, we can draw all the dependency relations as shown in Fig. 8. There are four possible dependency relations, each marked with the condition under which the dependency happens: e1 (the FIFO between them is empty), e2 (the FIFO between them is full), e3 (data not consumed by A_x), and e4 (data not consumed by A_y). Obviously, the conditions of edges e1 and e2 are mutually exclusive, and the same applies to e3 and e4. Therefore, only the pair (e1, e3) and the pair (e2, e4) could form two potential dependency cycles, as shown in Fig. 8. We can prove that e1 will be mutually exclusive to e3, and e2 will be mutually exclusive to e4, as long as we satisfy the following conditions:

1. For the data access offsets f_x and f_y of any two array references,

    x < y  ==>  f_x ≻_l f_y    (1)

2. For the reuse FIFO between adjacent references A_x and A_y,

    buffer_size_{x->y} >= r̄_{Ax->Ay}    (2)

Both conditions have very clear physical meanings. The first condition means that a data element accessed by an early array reference can be reused by a late array reference, but not the reverse. The second condition means that if an array reference reuses data elements accessed by another array reference, the reuse buffer size should be greater than or equal to the maximum reuse distance between these two references. The detailed proof of these two deadlock-freedom conditions can be found in Appendix 9.2.

To ensure the first condition, we sort the offsets f of the data array references in descending lexicographic order when we map them to data filters 0 to n-1, e.g., (1,0) ≻_l (0,1) ≻_l (0,0) ≻_l (0,-1) ≻_l (-1,0) in Fig. 7. To ensure the second condition, we calculate the maximum reuse distances of all pairs of adjacent array references and allocate the reuse FIFO sizes accordingly, as shown in Table 2 for the example in Fig. 1; a small computation reproducing these sizes is given after the table.

FIFO ID | precedent/successive references | FIFO size | physical impl.
FIFO 0  | A[i+1][j] -> A[i][j+1]          | 1023      | BRAM
FIFO 1  | A[i][j+1] -> A[i][j]            | 1         | register
FIFO 2  | A[i][j]   -> A[i][j-1]          | 1         | register
FIFO 3  | A[i][j-1] -> A[i-1][j]          | 1023      | BRAM

Table 2: Reuse FIFOs with non-uniform sizes calculated from the maximum reuse distances of adjacent array references, and mapped to different physical implementations (block memory, distributed memory, or register) when targeting an FPGA platform.
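The FIFO sizes in Table 2 follow mechanically from the sorted offsets: by Property 2 (Appendix 9.1) the reuse distance between adjacent references is the offset difference, and on a row-major grid of row width W its element count linearizes as below. A small self-checking sketch, assuming the 1024-wide grid of Fig. 2; the function name fifo_size is illustrative.

    #include <stdio.h>

    #define W 1024  /* row width of the data grid in Fig. 2 */

    /* Maximum reuse distance between two adjacent references with
     * offsets (r0,c0) >_l (r1,c1): element count of the difference
     * vector linearized on a row-major grid. */
    static int fifo_size(int r0, int c0, int r1, int c1) {
        return (r0 - r1) * W + (c0 - c1);
    }

    int main(void) {
        /* offsets of the five DENOISE references in descending
         * lexicographic order: A[i+1][j], A[i][j+1], A[i][j],
         * A[i][j-1], A[i-1][j] */
        int f[5][2] = {{1, 0}, {0, 1}, {0, 0}, {0, -1}, {-1, 0}};
        for (int x = 0; x < 4; x++)  /* FIFO x joins refs x and x+1 */
            printf("FIFO %d size = %d\n", x,
                   fifo_size(f[x][0], f[x][1], f[x+1][0], f[x+1][1]));
        return 0;  /* prints 1023, 1, 1, 1023, matching Table 2 */
    }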
3.3.3 Design Optimality

Minimum Reuse Buffer Size. Due to the linearity of maximum reuse distances, the sum of the sizes of all the reuse FIFOs is equal to the maximum reuse distance between array references A_0 and A_{n-1}. Due to the sorting of array references by f in descending lexicographic order, A_0 is the earliest reference and A_{n-1} is the latest reference. Therefore, the total reuse buffer size is equal to the maximum reuse distance between the earliest and latest references, which is the theoretical minimum. As shown in Table 2 for the example in Fig. 1, the total size is 2048, the same as the minimum value discussed in Section 2.3. If the maximum reuse distance is so large that the buffer sizes exceed the on-chip memory capacity, our microarchitecture allows trading off off-chip bandwidth occupation for smaller memory usage (see the discussion in Appendix 9.4).

Minimum Number of Buffer Banks. The structure of our design guarantees that for n array references, there are n-1 buffer banks (reuse FIFOs), as described in Section 3.1. This is the theoretical minimum value, as discussed in Section 2.3. Since our design achieves both the minimum reuse buffer size and the minimum number of buffer banks, our microarchitecture is optimal.

3.4 Insights Gained From RTL Simulation

Our microarchitecture is quite different from conventional designs with centralized controllers [5-8]. The major tasks of a conventional controller include two aspects:

1. Filling up reuse buffers. Before the computation starts, the controller first fills the reuse buffers with the data elements needed by the computation kernel.

2. Evicting expired data from reuse buffers. The challenging part of this task arises when the reuse distance between array references changes as the execution advances. This often happens on a skewed data grid. In this case, the number of data elements stored in the reuse buffers changes over time, and the symmetry between reads and writes is broken.

There is no specific module that takes charge of these key tasks in our microarchitecture. Instead, by observing the execution of our design in RTL simulation, we found that these key tasks are done automatically by the coordination of our distributed modules.

3.4.1 Automatic Filling of Reuse Buffers

The filling process of the reuse buffers in our microarchitecture is shown in Table 3. The data filter 4, associated with A[i-1][j], is first stalled at cycle 1. This is when filter 4 tries to send data A[0][1] to the computation kernel while all the other data filters are discarding this data. As a result, the computation kernel will be waiting for data from the other data filters and will not consume data from filter 4. This stalling leads to the filling of data into FIFO 3 between A[i-1][j] and A[i][j-1]. The other four filters keep the data stream advancing, since all of them discard the first row in the data domain. About one row of iterations later, filter 3 tries to send data A[1][0] to the computation kernel but is stalled as well. Then FIFO 2 starts being filled up, and the data path splitter s3 stops sending data to FIFO 3. The following process is similar, until FIFO 1 and FIFO 0 are filled up consecutively, as shown in Table 3. Then at cycle 2049, filter 0 receives A[2][1] and sends the data to the computation kernel. All the data at the five inputs of the computation kernel become valid, and the kernel consumes all five data elements to produce the first output. From then on, all the previously stalled filters can continue to send new data to the computation kernel every clock cycle until the end of the iteration domain.

Table 3: The execution flow of our microarchitecture in the example of Fig. 1, listing per clock cycle the element in the input data stream, the status of filters 0-4 (f = forwarding, d = discarding, s = stalled), and the number of data elements in FIFOs 0-3. The latency among the data streams at different modules is ignored here for demonstration purposes only.

3.4.2 Automatic Adjustment of Reuse Data Amount

An application can have a skewed data grid, as shown in Fig. 9. This kind of application usually arises when a rectangular grid is iterated along the 45° direction after a certain loop transformation [15]. The skewed grid leads to the challenge that the number of data elements stored in each reuse buffer changes as the iteration goes on, which often requires a complex memory-controlling scheme in a centralized design [5, 7, 8]. However, this difficulty can be automatically handled by our distributed modules.

Figure 9: Non-rectangular iteration domain of an example application with dynamically changing reuse distance.
Following the design schematic in Section 3.1, we order the array references and map them to the five filters 0-4. Among them, filter 2 is for the reference A[i][j] and filter 3 is for the reference A[i-1][j+1]. Note that h_2 of filter 2 is ahead of h_3 of filter 3 by one row when h_2 and h_3 are synchronized by the computation kernel, as shown in Fig. 9. At each turnaround to the next row in the data domain, filter 3 fetches one more data element from its splitter than filter 2 does, since filter 3 is iterating over a longer row. Then the number of data elements stored in FIFO 2 between filter 2 and filter 3 is reduced by one. This achieves the dynamic adaptation of the number of data elements stored in a reuse buffer to the change of reuse distance in the case of a skewed data grid.

3.5 Miscellaneous Design Issues

3.5.1 Heterogeneous Mapping of Reuse Buffers

Reuse buffers in our design have different sizes, as shown in Table 2, and may prefer different physical implementations. For example, if the target platform is an FPGA, the physical implementation candidates include block memory, distributed memory, and slice registers. They are efficient for a large buffer, a medium buffer, and a small buffer, respectively. Table 2 shows the heterogeneous mapping of reuse buffers to different physical implementations.

3.5.2 Data Filter in Polyhedral Domains

Note that though the data domains are rectangles in the example of Fig. 1, they could be any polyhedra on a multi-dimensional grid. Comparison of loop bounds is not a universal solution to data filtering. To select D_Ax out of D_A, the data filter in our microarchitecture is implemented with a data switch controlled by two counters, an input counter and an output counter, as shown in Fig. 10. The input counter iterates over the input data stream D_A of array A, e.g., A[0..767][0..1023]. The output counter iterates over the data domain D_Ax of array reference A_x, e.g., A[2..767][1..1022]. The input counter proceeds when the filter receives an input data element. The output counter proceeds when its value is equal to the input counter; this is also the condition under which the data switch forwards the input data to the output. In contrast, when the output counter is not equal to the input counter, the data switch discards the input data (a software sketch follows).

Figure 10: Structure of the data filter, which can be applied to general polyhedral data domains.
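The counter-and-switch behavior of Fig. 10 can be modeled in C as below. This sketch hard-codes the two example domains as bound checks and ignores the corner exclusions of D_A for brevity; a general instance would derive the scanning order from the polyhedron P h <= b. The names Filter and filter_step are illustrative.

    #include <stdbool.h>

    typedef struct { int i, j; } Idx;   /* 2-D data access index */

    typedef struct {
        Idx in;   /* input counter: position in D_A,  init (0, 0) */
        Idx out;  /* output counter: next element of D_Ax, init (2, 1) */
    } Filter;

    /* lexicographic successors inside the two example domains */
    static Idx next_DA(Idx h)  { if (++h.j > 1023) { h.j = 0; ++h.i; } return h; }
    static Idx next_DAx(Idx h) { if (++h.j > 1022) { h.j = 1; ++h.i; } return h; }

    /* Called once per arriving stream element. Returns true when
     * the data switch forwards the element to the computation
     * kernel; otherwise the element is discarded. */
    bool filter_step(Filter *f, float data, float *to_kernel) {
        bool fwd = (f->out.i == f->in.i && f->out.j == f->in.j);
        if (fwd) { *to_kernel = data; f->out = next_DAx(f->out); }
        f->in = next_DA(f->in);   /* input counter always advances */
        return fwd;
    }

For filter 0 of Fig. 7, the output counter starts at (2, 1), so the first two rows of the grid are discarded exactly as described in Section 3.3.1.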

4. DESIGN AUTOMATION FLOW

Figure 11: Design automation flow of accelerator generation for stencil computation.

We develop a design automation flow to generate the complete accelerator for a given stencil application, as shown in Fig. 11. It starts from the original source code of a user application, e.g., the code in Fig. 1. In the left branch, we first apply polyhedral analysis to extract the polyhedra of the data arrays with stencil accesses. We calculate the data domain of each array reference and the reuse distance of each pair of adjacent array references. This information is used to instantiate the data filters and reuse FIFOs in our microarchitecture. The flow then generates a microarchitecture instance, e.g., the design in Fig. 7, with the memory systems optimized for the stencil accesses in the user application. In the right branch, we first apply source-to-source code transformation to extract the kernel code with pure computation, e.g., Fig. 4. Then high-level synthesis is applied to the transformed code for a fully pipelined hardware implementation of the computation kernel in RTL. Finally, we integrate the microarchitecture with the computation kernel to obtain a complete accelerator with full pipelining and data reuse, e.g., the design in Fig. 3.

5. EXPERIMENTS

5.1 Experiment Setup

Our polyhedral analysis in Fig. 11 is implemented with the LLVM-Polly framework [16]. The kernel transformation is performed by the open-source compiler infrastructure ROSE [17], and the high-level synthesis is performed by Xilinx Vivado HLS [10]. Although our methodology is applicable to both ASIC and FPGA designs, we choose FPGA as the target device in this work due to the availability of downstream behavioral synthesis and implementation tools. The Xilinx Virtex-7 FPGA XC7VX485T and the ISE 14.2 tool suite [18] are used in our experiments. The target clock frequency is set at 200 MHz. The benchmarks used in prior memory partitioning work [7, 8] make up a rich set of real-life stencil computation kernels. Among them, we select the more challenging benchmarks with non-rectangular stencil windows for our experiments. DENOISE (2D/3D), RICIAN (2D), and SEGMENTATION (3D) are from medical imaging [11]. BICUBIC (2D) is from the bicubic interpolation process [13]. SOBEL (2D) is from the Sobel edge detection algorithm [14]. We choose the more recent memory partitioning work [8] as our experiment baseline.

5.2 Results

Table 4: High-level partitioning results: for each benchmark (DENOISE, RICIAN, SOBEL, BICUBIC, DENOISE_3D, SEGMENTATION), the original and target pipeline II, and the number and total size of reuse buffer banks for [8] and for our method.

The comparison results of memory partitioning are shown in Table 4. We list the pipeline II of the original user code, which suffers memory port contention before memory partitioning; it is equal to the number of memory load operations on the data array. We also list the II that the computation kernel targets to achieve via memory partitioning. The number and total size of the reuse buffer banks are reported for both [8] and our method. The buffer size is in units of data elements. As shown in Table 4, our method reduces the number of partitioned banks on all six benchmarks. In addition, our method does not need the padding technique of [8], which increases the grid size in certain dimensions to relax the partitioning complexity. Our method also saves buffer size, especially when the padding introduces more overhead in a high-dimensional data grid, e.g., Fig. 6(c).

Table 5: Post-synthesis results: BRAM, slice, and DSP usage and critical path (ns) of [8] and of our method for each benchmark, with the relative comparison.

The post-synthesis results are listed in Table 5. Physical resource usage (block RAMs, logic slices, and DSPs) and timing information are extracted from the Xilinx ISE report.
As shown in Table 5, we use 66% fewer block RAMs than [8]. This stems from 1) the minimum number of buffer banks achieved, and 2) the heterogeneous mapping of buffer banks to various resources in addition to block RAMs, as demonstrated in Table 2. We also use 25% fewer logic slices than [8], even though we implement some of the small reuse buffers in registers. That is because we avoid the modulo scheduling of conventional uniform memory partitioning, which generates a hardware transformer to map the original data address to the bank ID and local address via a complex calculation involving multiplication and division. Instead, our memory system only needs counters iterating over the data domains in lexicographic order. This advantage is also reflected in the complete elimination of DSPs in our method. The clock period does not show much difference between [8] and our method, since the back-end flow stops optimization as soon as it meets the 200 MHz target. However, our method generally has larger slack from the 5.0 ns target, as shown in Table 5. This is mainly due to the distributed structure of our method. We tried to use the Xilinx XPower Analyzer for power estimation, but found that the FPGA power is dominated by static power and is almost invariant across custom circuits. If power gating were available in the FPGA, the FPGA power would be proportional to resource usage, which is covered by Table 5.

6. CONCLUSIONS AND FUTURE WORK

In this work, we propose non-uniform memory partitioning, which opens a new design space compared to the conventional cyclic partitioning framework. As a starting point, we use stencil computation as the initial design target and show a novel memory system that works with non-uniform sizes of reuse buffers. The memory system in the extended design space can achieve the optimal solution with the minimum reuse buffer size and the minimum number of buffer banks. We develop a design automation flow that generates a microarchitecture with our memory systems and integrates it with the computation kernel for a complete design. Experimental results

show that our method outperforms the recent memory partitioning work in terms of the utilization of various FPGA resources.

As this is the first work on non-uniform memory partitioning, our primary goal is to show the potential of this new approach. Though stencil computation is a popular application domain and attracts the attention of most memory partitioning work, it is still important to extend non-uniform memory partitioning to general cases. Our data streaming method may not be the only solution for utilizing non-uniform reuse buffers. A modified modulo scheduling extended from conventional uniform memory partitioning is also a good candidate. We believe that there are many opportunities for future research.

7. ACKNOWLEDGEMENTS

The authors would like to thank Yuxin Wang for sharing the synthesizable codes transformed by [8]. The authors would also like to thank Xilinx for equipment donation and financial contributions. This work is partially supported by the Center for Domain-Specific Computing (CDSC) and the NSF grant CCF. It is also supported by C-FAR, one of six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA.

8. REFERENCES

[1] Y.-T. Chen, J. Cong, M. A. Ghodrat, M. Huang, C. Liu, B. Xiao, and Y. Zou, "Accelerator-Rich CMPs: From Concept to Real Hardware," in International Conference on Computer Design, 2013.
[2] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang, "High-Level Synthesis for FPGAs: From Prototyping to Deployment," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 30, no. 4, pp. 473-491, Apr. 2011.
[3] J. Cong, P. Zhang, and Y. Zou, "Optimizing memory hierarchy allocation with loop transformations for high-level synthesis," in Design Automation Conference, 2012.
[4] L.-N. Pouchet, P. Zhang, P. Sadayappan, and J. Cong, "Polyhedral-based data reuse optimization for configurable computing," in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA'13), 2013, p. 29.
[5] J. Cong, W. Jiang, B. Liu, and Y. Zou, "Automatic memory partitioning and scheduling for throughput and power optimization," in International Conference on Computer-Aided Design, 2009.
[6] Y. Wang, P. Zhang, C. Xu, and J. Cong, "An integrated and automated memory optimization flow for FPGA behavioral synthesis," in Asia and South Pacific Design Automation Conference, Jan. 2012.
[7] P. Li, Y. Wang, P. Zhang, G. Luo, T. Wang, and J. Cong, "Memory partitioning and scheduling co-optimization in behavioral synthesis," in International Conference on Computer-Aided Design, 2012.
[8] Y. Wang, P. Li, P. Zhang, C. Zhang, and J. Cong, "Memory partitioning for multidimensional arrays in high-level synthesis," in Design Automation Conference, 2013, p. 1.
[9] Y. Ben-Asher and N. Rotem, "Automatic memory partitioning," in International Conference on Hardware/Software Codesign and System Synthesis, 2010.
[10] Xilinx, Vivado High-Level Synthesis. [Online].
[11] J. Cong, V. Sarkar, G. Reinman, and A. Bui, "Customizable Domain-Specific Computing," IEEE Design and Test of Computers, vol. 28, no. 2, pp. 6-15, Mar. 2011.
[12] A. A. Nacci, V. Rana, F. Bruschi, D. Sciuto, I. Beretta, and D. Atienza, "A high-level synthesis flow for the implementation of iterative stencil loop algorithms on FPGA devices," in Design Automation Conference, 2013, p. 1.
[13] Bicubic interpolation. [Online].
[14] S. Verdoolaege, H. Nikolov, and T. Stefanov, "pn: A Tool for Improved Derivation of Process Networks," EURASIP Journal on Embedded Systems, vol. 2007, pp. 1-13, 2007.
[15] W. Zuo, Y. Liang, P. Li, K. Rupnow, D. Chen, and J. Cong, "Improving high level synthesis optimization opportunity through polyhedral transformations," in International Symposium on FPGAs, 2013, p. 9.
[16] LLVM-Polly. [Online].
[17] ROSE compiler infrastructure. [Online].
[18] Xilinx, Virtex-7 FPGA data sheets. [Online].
[19] P. Feautrier, "Some efficient solutions to the affine scheduling problem. I. One-dimensional time," International Journal of Parallel Programming, vol. 21, no. 5, Oct. 1992.
[20] J. Ramanujam and P. Sadayappan, "Compile-time techniques for data distribution in distributed memory machines," IEEE Transactions on Parallel and Distributed Systems, vol. 2, no. 4, 1991.

9. SUPPLEMENTARY MATERIALS

9.1 Polyhedral Model

We support general stencil accesses where neither the data grid nor the stencil window needs to be rectangular. This is often the case when stencil computation is subjected to loop transformations based on the polyhedral model [3, 4, 15] before memory access optimization. Our work is also built on top of the polyhedral model for generality, as described below. Several properties of stencil computation are identified under the framework of the polyhedral model as well.

DEFINITION 1 (ITERATION DOMAIN [19]). The iteration vector of a multi-level loop nest over an m-dimensional grid is a vector of iteration variables, i = (i_0, i_1, ..., i_{m-1})^T, where i_0, ..., i_{m-1} are the iteration variables from the outermost to the innermost loop. The iteration domain D ⊆ Z^m is the set of iteration vectors of the loop nest, expressed by a set of linear inequalities D = { i | P i <= b }.

EXAMPLE 1. For the loop nest in Fig. 1, the iteration domain is the rectangle

    D = { (i_0, i_1) ∈ Z^2 | 1 <= i_0 <= 766, 1 <= i_1 <= 1022 }.

This definition of the iteration domain can describe any polyhedral shape on a multi-dimensional grid, e.g., rectangular, triangular, or diamond.

DEFINITION 2 (LEXICOGRAPHIC ORDER [19]). The lexicographic order relation ≻_l of two iteration vectors i and j is defined as:

    i ≻_l j  <=>  (i_0 > j_0) ∨ (i_0 = j_0 ∧ i_1 > j_1) ∨ (i_0 = j_0 ∧ i_1 = j_1 ∧ i_2 > j_2) ∨ ... ∨ (i_0 = j_0 ∧ ... ∧ i_{m-2} = j_{m-2} ∧ i_{m-1} > j_{m-1})
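Definition 2 translates directly into a short comparison routine; a minimal C sketch (the name lex_gt and the plain-array vector encoding are illustrative):

    #include <stdbool.h>

    /* true iff i >_l j per Definition 2: the first differing
     * coordinate, scanned from the outermost loop inward,
     * decides the order. */
    bool lex_gt(const int *i, const int *j, int m) {
        for (int k = 0; k < m; k++) {
            if (i[k] > j[k]) return true;
            if (i[k] < j[k]) return false;
        }
        return false;  /* equal vectors are not strictly greater */
    }

For example, lex_gt((int[]){1, 0}, (int[]){0, 1}, 2) is true, matching (1, 0) ≻_l (0, 1) in Table 1.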

DEFINITION 3 (ACCESS FUNCTION [19]). For a k-dimensional array reference, its access function H: Z^m -> Z^k is the mapping from the iteration vector i to the access index h:

    H = { i -> h | h = H_{k x m} i + f }

EXAMPLE 2. The access function of the array reference A[i][j+1] in Fig. 1 is

    h = [1 0; 0 1] i + (0, 1)^T.

DEFINITION 4 (STENCIL COMPUTATION). Under the framework of the polyhedral model, stencil computation is defined as computation where the access function of each array reference satisfies

    k = m,  H_{k x m} = I_{m x m},  f constant,

where I is the identity matrix. It means that for any reference A_x of a data array A,

    h_x = i + f_x.    (3)

DEFINITION 5 (DATA DOMAIN). For an array reference A_x, the data domain D_Ax is the set of data elements accessed by A_x, expressed by a set of linear inequalities

    D_Ax = { h | P (h - f_x) <= b }

EXAMPLE 3. The data domain of A[i][j+1] is { (i, j) | 1 <= i <= 766, 2 <= j <= 1023 }.

Following Eq. (3), we have the following property for stencil computation.

PROPERTY 1 (LEXICOGRAPHIC ACCESS PATTERN). Data elements in D_Ax are accessed by the array reference A_x in lexicographic order.

DEFINITION 6 (INPUT DATA DOMAIN). For a data array A with n references A_0, A_1, ..., A_{n-1}, the input data domain of A is the union of the sets of data elements that need to be fetched from external memory, defined as D_A = D_A0 ∪ D_A1 ∪ ... ∪ D_A{n-1}.

EXAMPLE 4. The input data domain of array A is { (i, j) | 0 <= i <= 767, 0 <= j <= 1023 } minus the corners {(0, 0), (0, 1023), (767, 0), (767, 1023)}. In this paper, we use A[0..767][0..1023] to represent this input data domain.

DEFINITION 7 (REUSE DISTANCE VECTOR [19]). Suppose A_x and A_y are two read references of the same data array. If the data element accessed by A_x in loop iteration i is accessed again by A_y in iteration j, the reuse distance vector from A_x to A_y is r_{Ax->Ay} = j - i.

By Definition 7, we have H_x(i) = H_y(j). By Definition 4, we have H(i) = i + f. Combining these two, we have the following property for stencil computation.

PROPERTY 2. The reuse distance vector r_{Ax->Ay} = f_x - f_y is a constant vector.

EXAMPLE 5. The reuse distance vector from A[i+1][j] to A[i-1][j] in Fig. 1 is (2, 0)^T.

DEFINITION 8 (REUSE DISTANCE). The reuse distance r_{Ax->Ay}(h) from array reference A_x to A_y is the number of data elements between h, accessed by A_x, and h + r, accessed by A_y, defined as

    r_{Ax->Ay}(h) = | { g ∈ D_A | h ≺_l g ≼_l h + r_{Ax->Ay} } |

For example, the reuse distance from A[i+1][j] to A[i-1][j] in Fig. 1 is a constant value.

DEFINITION 9 (MAXIMUM REUSE DISTANCE). The maximum reuse distance from array reference A_x to A_y is defined as

    r̄_{Ax->Ay} = max_{h ∈ D_Ax} r_{Ax->Ay}(h)

EXAMPLE 6. The maximum reuse distance from A[i+1][j] to A[i-1][j] in Fig. 1 is 2048.

Suppose three data array references A_x, A_y and A_z satisfy

    r_{Ax->Ay} ≽_l 0,  r_{Ay->Az} ≽_l 0

Following Definition 8, we have

    r_{Ax->Az}(h) = r_{Ax->Ay}(h) + r_{Ay->Az}(h + r_{Ax->Ay})

Following Definition 5 and Property 2, we have

    h ∈ D_Ax  ==>  h + r_{Ax->Ay} ∈ D_Ay

Combining these two with Definition 9, we have the following property for stencil computation.

PROPERTY 3 (LINEARITY OF MAXIMUM REUSE DISTANCE).

    r̄_{Ax->Az} = r̄_{Ax->Ay} + r̄_{Ay->Az}

9.2 Proof of Deadlock Freedom

Figure 12: The dependency graph of data filters filter_x and filter_y, where x < y, with polyhedral expressions of the dependency relations appended.

We extend the dependency graph in Fig. 8 to Fig. 12 by appending the polyhedral expressions of the dependency relations. Suppose at a certain moment, filter_x is iterating over h_x ∈ D_A, as shown in Fig. 12a, and filter_y is iterating over h_y ∈ D_A, as shown in Fig. 12b. The goal is to prove that the two conditions listed in Section 3.3.2 are sufficient to eliminate the dependency cycles in Fig. 12. First we explain the expressions of the four dependency relations. A data filter stalls if a precedent reuse FIFO is empty or a successive reuse FIFO is full. When the condition of e1 is true, there is no data stored in the reuse FIFO(s) between filter_x and filter_y, and filter_y will stall.
It will wait for filter_x to pull more data from the data path splitter s_x, so that s_x can push more data to the downstream modules. When the condition of e2 is true, the number of data elements stored in the reuse FIFO(s) between filter_x and filter_y (i.e., the number of data elements between h_x and h_y) exceeds the sum of the sizes of the reuse FIFOs between filter_x and filter_y, and filter_x will stall. It will wait for filter_y to pull more data from s_y, so that s_y can pull more data from the upstream.

A data filter will also stall if the data element it sends to the computation kernel is not consumed. This situation happens when the kernel is waiting for data elements from other data filters. When the condition of e3 is true, filter_x will stall. It has sent the data element needed by the computation kernel in iteration i_x = h_x - f_x, but the computation kernel is waiting for filter_y to send the data element needed in an iteration i_y = h_y - f_y, which is earlier than i_x. Similarly, when the condition of e4 is true, filter_y will wait for filter_x to send data.

When the first condition, x < y ==> f_x ≻_l f_y, as listed in Section 3.3.2, is true, we have both

    e1 is true  ==>  h_x ≼_l h_y  ==>  h_x - f_x ≺_l h_y - f_y  ==>  e3 is false

and

    e3 is true  ==>  h_y - f_y ≺_l h_x - f_x  ==>  h_y ≺_l h_x  ==>  e1 is false.

Therefore e1 and e3 in Fig. 12a are mutually exclusive. When the condition in Eq. (2) of Section 3.3.2 is true, due to the linearity property of maximum reuse distances in stencil computation (Property 3 in Appendix 9.1), the condition applies to any pair of array references. Then we can apply a scheme similar to that of the first condition to the second condition, and prove that e2 and e4 in Fig. 12b are mutually exclusive. Then none of the cycles in the dependency graph of data filters filter_x and filter_y can be simultaneously satisfied, and our distributed system is deadlock-free.

9.3 System Integration

Figure 13: Integration of an accelerator with our microarchitecture in a system. (a) Original accelerator with multiple data references. (b) Our accelerator with a single data reference can be easily connected to an offchip prefetch module with bus burst accesses. (c) Our accelerator with a single data reference can be easily co-optimized with another accelerator by loop transformation for data forwarding.

Note that our methodology transforms the original accelerator with multiple data references, e.g., Fig. 13a, into a new accelerator with a single reference, e.g., Fig. 13b. An accelerator with our microarchitecture can be easily connected to an offchip prefetch module with bus burst accesses. The prefetch module can directly forward the data stream from the bus pipeline to the accelerator and only needs a small buffer to hide the bus latency. In addition, our microarchitecture can simplify the co-optimization of accelerators for inter-block communication. In the example of Fig. 13c, accelerator 1 writes out A[i][j] in the order of i and j. Accelerator 2 reads A[i][j], A[i][j-1], A[i][j+1], A[i-1][j] and A[i+1][j] in the order of j and i. The load operations of accelerator 2 are very different from the store operations of accelerator 1. To implement the inter-block communication, accelerator 1 would have to write the output data block to an on-chip block memory first, after which accelerator 2 reads the data from the block memory. With our microarchitecture, however, the input data stream of accelerator 2 is transformed to A[i][j], which has the same form as the output of accelerator 1. This allows us to apply loop transformation so that the data orders at the output of accelerator 1 and the input of accelerator 2 are the same, e.g., enabling the technology in [15]. Then data can be directly forwarded from accelerator 1 to accelerator 2 without a large on-chip block memory. In summary, our microarchitecture facilitates system-level synthesis in terms of optimizing module-level communication and offchip communication.

9.4 Enabling Bandwidth/Memory Tradeoff

There is a tradeoff between offchip bandwidth and on-chip memory usage. When we have larger offchip bandwidth, we can use smaller on-chip memory. The extreme case is that when we have sufficiently large offchip bandwidth, we do not need any on-chip memory for data reuse. The different pairs of offchip bandwidth occupation and on-chip memory usage form the points on a design curve. When we move along the curve, the array references that go through reuse buffers will change. The problem is that conventional memory partitioning does not guarantee an optimal solution for an arbitrary access pattern, so the optimality of memory partitioning may be lost when performing a bandwidth/memory tradeoff. One of the benefits of our microarchitecture is that it automatically reduces the buffer size to the minimum when the offchip bandwidth is increased, while keeping the optimal design structure.

Figure 14: Break the chain at the largest reuse buffer when the number of offchip accesses is increased. (a) The original microarchitecture with one offchip access per cycle. (b) The new microarchitecture with two offchip accesses per cycle but smaller total buffer size.

As shown in Fig. 14, when we allow one more offchip access in each clock cycle, we pick the largest reuse buffer in our microarchitecture in Fig. 14(a) and replace it with an input data stream from offchip in Fig. 14(b). By doing so, we achieve a graceful degradation of on-chip memory usage with the increase of offchip bandwidth, as shown in Fig. 15. We use the 19-point stencil in SEGMENTATION in Fig. 6(c) as an example, and sweep the number of offchip accesses per cycle from 1 to 18. The three phases in Fig. 15 show that we first give up inter-plane data reuse, which needs large buffers; then inter-row data reuse, which needs medium buffers; and finally intra-row data reuse, which needs small buffers.
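The degradation policy of Fig. 14 reads as a simple greedy rule: each extra offchip stream removes the largest remaining FIFO from the chain. The sketch below only does the storage bookkeeping of that rule (it does not model the alignment of the new offchip stream); the function name is illustrative, and the sizes are taken from Table 2.

    #include <stdio.h>

    /* On-chip storage left after granting `extra` additional offchip
     * accesses per cycle: each one replaces the largest remaining
     * reuse FIFO with a direct offchip input stream (Fig. 14). */
    static long onchip_after_tradeoff(long sizes[], int n, int extra) {
        for (int k = 0; k < extra; k++) {
            int imax = 0;
            for (int i = 1; i < n; i++)
                if (sizes[i] > sizes[imax]) imax = i;
            sizes[imax] = 0;            /* buffer replaced by a stream */
        }
        long total = 0;
        for (int i = 0; i < n; i++) total += sizes[i];
        return total;
    }

    int main(void) {
        const long table2[4] = {1023, 1, 1, 1023};  /* DENOISE FIFOs */
        for (int extra = 0; extra <= 2; extra++) {
            long fifo[4];
            for (int i = 0; i < 4; i++) fifo[i] = table2[i];
            printf("%d extra stream(s): %ld elements on chip\n",
                   extra, onchip_after_tradeoff(fifo, 4, extra));
        }
        return 0;  /* prints 2048, 1025, 2 */
    }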
9.5 Difference from Memory Partitioning for Distributed Computing

This work mainly targets memory partitioning for behavioral synthesis. In parallel, memory partitioning for distributed computing has been studied for decades [20]. However, there are fundamental differences between these two scenarios. The first difference is


More information

CS152 Computer Architecture and Engineering

CS152 Computer Architecture and Engineering CS152 Computer Architecture and Engineering Caches and the Memory Hierarchy Assigned 9/17/2016 Problem Set #2 Due Tue, Oct 4 http://inst.eecs.berkeley.edu/~cs152/fa16 The problem sets are intended to help

More information

Architecture-Level Synthesis for Automatic Interconnect Pipelining

Architecture-Level Synthesis for Automatic Interconnect Pipelining Architecture-Level Synthesis for Automatic Interconnect Pipelining Jason Cong, Yiping Fan, Zhiru Zhang Computer Science Department University of California, Los Angeles, CA 90095 {cong, fanyp, zhiruz}@cs.ucla.edu

More information

Sparse Matrix-Vector Multiplication FPGA Implementation

Sparse Matrix-Vector Multiplication FPGA Implementation UNIVERSITY OF CALIFORNIA, LOS ANGELES Sparse Matrix-Vector Multiplication FPGA Implementation (SID: 704-272-121) 02/27/2015 Table of Contents 1 Introduction... 3 2 Sparse Matrix-Vector Multiplication...

More information

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, Chunyuan Zhang National University of Defense Technology,

More information

ISSN Vol.05,Issue.09, September-2017, Pages:

ISSN Vol.05,Issue.09, September-2017, Pages: WWW.IJITECH.ORG ISSN 2321-8665 Vol.05,Issue.09, September-2017, Pages:1693-1697 AJJAM PUSHPA 1, C. H. RAMA MOHAN 2 1 PG Scholar, Dept of ECE(DECS), Shirdi Sai Institute of Science and Technology, Anantapuramu,

More information

Parallel graph traversal for FPGA

Parallel graph traversal for FPGA LETTER IEICE Electronics Express, Vol.11, No.7, 1 6 Parallel graph traversal for FPGA Shice Ni a), Yong Dou, Dan Zou, Rongchun Li, and Qiang Wang National Laboratory for Parallel and Distributed Processing,

More information

Using Intel Streaming SIMD Extensions for 3D Geometry Processing

Using Intel Streaming SIMD Extensions for 3D Geometry Processing Using Intel Streaming SIMD Extensions for 3D Geometry Processing Wan-Chun Ma, Chia-Lin Yang Dept. of Computer Science and Information Engineering National Taiwan University firebird@cmlab.csie.ntu.edu.tw,

More information

A Novel Design Framework for the Design of Reconfigurable Systems based on NoCs

A Novel Design Framework for the Design of Reconfigurable Systems based on NoCs Politecnico di Milano & EPFL A Novel Design Framework for the Design of Reconfigurable Systems based on NoCs Vincenzo Rana, Ivan Beretta, Donatella Sciuto Donatella Sciuto sciuto@elet.polimi.it Introduction

More information

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors G. Chen 1, M. Kandemir 1, I. Kolcu 2, and A. Choudhary 3 1 Pennsylvania State University, PA 16802, USA 2 UMIST,

More information

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM

More information

Integrating Customer Requirements into Data Product Design

Integrating Customer Requirements into Data Product Design 114 Econom Informatics, 2003 Integrating Customer Requirements into Data Product Design Ec., Ph.D. candidate Otilia PÎRLOG Successful data products are ones that meet customer needs, are innovative and

More information

General Purpose GPU Programming. Advanced Operating Systems Tutorial 9

General Purpose GPU Programming. Advanced Operating Systems Tutorial 9 General Purpose GPU Programming Advanced Operating Systems Tutorial 9 Tutorial Outline Review of lectured material Key points Discussion OpenCL Future directions 2 Review of Lectured Material Heterogeneous

More information

Fractal Interpolation Representation of A Class of Quadratic Functions

Fractal Interpolation Representation of A Class of Quadratic Functions ISSN 1749-3889 (print, 1749-3897 (online International Journal of Nonlinear Science Vol.24(2017 No.3, pp.175-179 Fractal Interpolation Representation of A Class of Quadratic Functions Chengbin Shen, Zhigang

More information

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs IBM Research AI Systems Day DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs Xiaofan Zhang 1, Junsong Wang 2, Chao Zhu 2, Yonghua Lin 2, Jinjun Xiong 3, Wen-mei

More information

Efficient Self-Reconfigurable Implementations Using On-Chip Memory

Efficient Self-Reconfigurable Implementations Using On-Chip Memory 10th International Conference on Field Programmable Logic and Applications, August 2000. Efficient Self-Reconfigurable Implementations Using On-Chip Memory Sameer Wadhwa and Andreas Dandalis University

More information

E V ER-growing global competition forces. Accuracy Analysis and Improvement for Direct Laser Sintering

E V ER-growing global competition forces. Accuracy Analysis and Improvement for Direct Laser Sintering Accurac Analsis and Improvement for Direct Laser Sintering Y. Tang 1, H. T. Loh 12, J. Y. H. Fuh 2, Y. S. Wong 2, L. Lu 2, Y. Ning 2, X. Wang 2 1 Singapore-MIT Alliance, National Universit of Singapore

More information

DATA REUSE ANALYSIS FOR AUTOMATED SYNTHESIS OF CUSTOM INSTRUCTIONS IN SLIDING WINDOW APPLICATIONS

DATA REUSE ANALYSIS FOR AUTOMATED SYNTHESIS OF CUSTOM INSTRUCTIONS IN SLIDING WINDOW APPLICATIONS Georgios Zacharopoulos Giovanni Ansaloni Laura Pozzi DATA REUSE ANALYSIS FOR AUTOMATED SYNTHESIS OF CUSTOM INSTRUCTIONS IN SLIDING WINDOW APPLICATIONS Università della Svizzera italiana (USI Lugano), Faculty

More information

2.8 Distance and Midpoint Formulas; Circles

2.8 Distance and Midpoint Formulas; Circles Section.8 Distance and Midpoint Formulas; Circles 9 Eercises 89 90 are based on the following cartoon. B.C. b permission of Johnn Hart and Creators Sndicate, Inc. 89. Assuming that there is no such thing

More information

Simplify System Complexity

Simplify System Complexity 1 2 Simplify System Complexity With the new high-performance CompactRIO controller Arun Veeramani Senior Program Manager National Instruments NI CompactRIO The Worlds Only Software Designed Controller

More information

Register Allocation 3/16/11. What a Smart Allocator Needs to Do. Global Register Allocation. Global Register Allocation. Outline.

Register Allocation 3/16/11. What a Smart Allocator Needs to Do. Global Register Allocation. Global Register Allocation. Outline. What a Smart Allocator Needs to Do Register Allocation Global Register Allocation Webs and Graph Coloring Node Splitting and Other Transformations Determine ranges for each variable can benefit from using

More information

RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch

RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC Zoltan Baruch Computer Science Department, Technical University of Cluj-Napoca, 26-28, Bariţiu St., 3400 Cluj-Napoca,

More information

Single Pass Connected Components Analysis

Single Pass Connected Components Analysis D. G. Bailey, C. T. Johnston, Single Pass Connected Components Analysis, Proceedings of Image and Vision Computing New Zealand 007, pp. 8 87, Hamilton, New Zealand, December 007. Single Pass Connected

More information

6.9. Communicating to the Outside World: Cluster Networking

6.9. Communicating to the Outside World: Cluster Networking 6.9 Communicating to the Outside World: Cluster Networking This online section describes the networking hardware and software used to connect the nodes of cluster together. As there are whole books and

More information

High-throughput Online Hash Table on FPGA*

High-throughput Online Hash Table on FPGA* High-throughput Online Hash Table on FPGA* Da Tong, Shijie Zhou, Viktor K. Prasanna Ming Hsieh Dept. of Electrical Engineering University of Southern California Los Angeles, CA 989 Email: datong@usc.edu,

More information

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring Caches and the Memory Hierarchy

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring Caches and the Memory Hierarchy CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Spring 2019 Caches and the Memory Hierarchy Assigned February 13 Problem Set #2 Due Wed, February 27 http://inst.eecs.berkeley.edu/~cs152/sp19

More information

Migration Based Page Caching Algorithm for a Hybrid Main Memory of DRAM and PRAM

Migration Based Page Caching Algorithm for a Hybrid Main Memory of DRAM and PRAM Migration Based Page Caching Algorithm for a Hybrid Main Memory of DRAM and PRAM Hyunchul Seok Daejeon, Korea hcseok@core.kaist.ac.kr Youngwoo Park Daejeon, Korea ywpark@core.kaist.ac.kr Kyu Ho Park Deajeon,

More information

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI CMPE 655- MULTIPLE PROCESSOR SYSTEMS OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI What is MULTI PROCESSING?? Multiprocessing is the coordinated processing

More information

General Purpose GPU Programming. Advanced Operating Systems Tutorial 7

General Purpose GPU Programming. Advanced Operating Systems Tutorial 7 General Purpose GPU Programming Advanced Operating Systems Tutorial 7 Tutorial Outline Review of lectured material Key points Discussion OpenCL Future directions 2 Review of Lectured Material Heterogeneous

More information

A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing

A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing 727 A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing 1 Bharati B. Sayankar, 2 Pankaj Agrawal 1 Electronics Department, Rashtrasant Tukdoji Maharaj Nagpur University, G.H. Raisoni

More information

Ultra-Fast NoC Emulation on a Single FPGA

Ultra-Fast NoC Emulation on a Single FPGA The 25 th International Conference on Field-Programmable Logic and Applications (FPL 2015) September 3, 2015 Ultra-Fast NoC Emulation on a Single FPGA Thiem Van Chu, Shimpei Sato, and Kenji Kise Tokyo

More information

Fault Grading FPGA Interconnect Test Configurations

Fault Grading FPGA Interconnect Test Configurations * Fault Grading FPGA Interconnect Test Configurations Mehdi Baradaran Tahoori Subhasish Mitra* Shahin Toutounchi Edward J. McCluskey Center for Reliable Computing Stanford University http://crc.stanford.edu

More information

ISSN Vol.05, Issue.12, December-2017, Pages:

ISSN Vol.05, Issue.12, December-2017, Pages: ISSN 2322-0929 Vol.05, Issue.12, December-2017, Pages:1174-1178 www.ijvdcs.org Design of High Speed DDR3 SDRAM Controller NETHAGANI KAMALAKAR 1, G. RAMESH 2 1 PG Scholar, Khammam Institute of Technology

More information

XPU A Programmable FPGA Accelerator for Diverse Workloads

XPU A Programmable FPGA Accelerator for Diverse Workloads XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

Performance Evolution of DDR3 SDRAM Controller for Communication Networks

Performance Evolution of DDR3 SDRAM Controller for Communication Networks Performance Evolution of DDR3 SDRAM Controller for Communication Networks U.Venkata Rao 1, G.Siva Suresh Kumar 2, G.Phani Kumar 3 1,2,3 Department of ECE, Sai Ganapathi Engineering College, Visakhaapatnam,

More information

COSC 6385 Computer Architecture. - Memory Hierarchies (II)

COSC 6385 Computer Architecture. - Memory Hierarchies (II) COSC 6385 Computer Architecture - Memory Hierarchies (II) Fall 2008 Cache Performance Avg. memory access time = Hit time + Miss rate x Miss penalty with Hit time: time to access a data item which is available

More information

Nodes Energy Conserving Algorithms to prevent Partitioning in Wireless Sensor Networks

Nodes Energy Conserving Algorithms to prevent Partitioning in Wireless Sensor Networks IJCSNS International Journal of Computer Science and Network Security, VOL.17 No.9, September 2017 139 Nodes Energy Conserving Algorithms to prevent Partitioning in Wireless Sensor Networks MINA MAHDAVI

More information

TMS320C6678 Memory Access Performance

TMS320C6678 Memory Access Performance Application Report Lit. Number April 2011 TMS320C6678 Memory Access Performance Brighton Feng Communication Infrastructure ABSTRACT The TMS320C6678 has eight C66x cores, runs at 1GHz, each of them has

More information

An Energy Efficient Location Service for Mobile Ad Hoc etworks

An Energy Efficient Location Service for Mobile Ad Hoc etworks An Energ Efficient Location Service for Mobile Ad Hoc etworks Zijian Wang, Euphan Bulut and Boleslaw K. Szmanski, Department of Computer Science, Rensselaer Poltechnic Institute, Tro, NY 280 USA {wangz,

More information

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 133 CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 6.1 INTRODUCTION As the era of a billion transistors on a one chip approaches, a lot of Processing Elements (PEs) could be located

More information

08 - Address Generator Unit (AGU)

08 - Address Generator Unit (AGU) October 2, 2014 Todays lecture Memory subsystem Address Generator Unit (AGU) Schedule change A new lecture has been entered into the schedule (to compensate for the lost lecture last week) Memory subsystem

More information

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1)

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) Department of Electr rical Eng ineering, Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering,

More information

Behavioral Array Mapping into Multiport Memories Targeting Low Power 3

Behavioral Array Mapping into Multiport Memories Targeting Low Power 3 Behavioral Array Mapping into Multiport Memories Targeting Low Power 3 Preeti Ranjan Panda and Nikil D. Dutt Department of Information and Computer Science University of California, Irvine, CA 92697-3425,

More information

A Circle Detection Method Based on Optimal Parameter Statistics in Embedded Vision

A Circle Detection Method Based on Optimal Parameter Statistics in Embedded Vision A Circle Detection Method Based on Optimal Parameter Statistics in Embedded Vision Xiaofeng Lu,, Xiangwei Li, Sumin Shen, Kang He, and Songu Yu Shanghai Ke Laborator of Digital Media Processing and Transmissions

More information

A novel implementation of tile-based address mapping

A novel implementation of tile-based address mapping A novel implementation of tile-based address mapping Sambuddhi Hettiaratchi and Peter Y.K. Cheung Department of Electrical and Electronic Engineering Imperial College of Science, Technolog and Medicine,

More information

[Sahu* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

[Sahu* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY SPAA AWARE ERROR TOLERANT 32 BIT ARITHMETIC AND LOGICAL UNIT FOR GRAPHICS PROCESSOR UNIT Kaushal Kumar Sahu*, Nitin Jain Department

More information

Improving Polyhedral Code Generation for High-Level Synthesis

Improving Polyhedral Code Generation for High-Level Synthesis Improving Polyhedral Code Generation for High-Level Synthesis Wei Zuo 1,5 Peng Li 2,3 Deming Chen 5 Louis-Noël Pouchet 4,3 Shunan Zhong 1 Jason Cong 4,3 1 Beijing Institute of Technology 2 Peking University

More information

Simplify System Complexity

Simplify System Complexity Simplify System Complexity With the new high-performance CompactRIO controller Fanie Coetzer Field Sales Engineer Northern South Africa 2 3 New control system CompactPCI MMI/Sequencing/Logging FieldPoint

More information

Embedded Systems. 8. Hardware Components. Lothar Thiele. Computer Engineering and Networks Laboratory

Embedded Systems. 8. Hardware Components. Lothar Thiele. Computer Engineering and Networks Laboratory Embedded Systems 8. Hardware Components Lothar Thiele Computer Engineering and Networks Laboratory Do you Remember? 8 2 8 3 High Level Physical View 8 4 High Level Physical View 8 5 Implementation Alternatives

More information

Towards Performance Modeling of 3D Memory Integrated FPGA Architectures

Towards Performance Modeling of 3D Memory Integrated FPGA Architectures Towards Performance Modeling of 3D Memory Integrated FPGA Architectures Shreyas G. Singapura, Anand Panangadan and Viktor K. Prasanna University of Southern California, Los Angeles CA 90089, USA, {singapur,

More information

Major Half Served First (MHSF) Disk Scheduling Algorithm

Major Half Served First (MHSF) Disk Scheduling Algorithm Vol. II, Issue I, Januar 13 (ISSN: 2278-77) Major Half Served First (MHSF) Disk Scheduling Algorithm Manish Kumar Mishra Department of Information Technolog College of Computing and Informatics Haramaa

More information

Embedded Systems. 7. System Components

Embedded Systems. 7. System Components Embedded Systems 7. System Components Lothar Thiele 7-1 Contents of Course 1. Embedded Systems Introduction 2. Software Introduction 7. System Components 10. Models 3. Real-Time Models 4. Periodic/Aperiodic

More information

Chapter 8 Memory Basics

Chapter 8 Memory Basics Logic and Computer Design Fundamentals Chapter 8 Memory Basics Charles Kime & Thomas Kaminski 2008 Pearson Education, Inc. (Hyperlinks are active in View Show mode) Overview Memory definitions Random Access

More information

Addendum to Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches

Addendum to Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches Addendum to Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches Gabriel H. Loh Mark D. Hill AMD Research Department of Computer Sciences Advanced Micro Devices, Inc. gabe.loh@amd.com

More information

Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors. By: Anvesh Polepalli Raj Muchhala

Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors. By: Anvesh Polepalli Raj Muchhala Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors By: Anvesh Polepalli Raj Muchhala Introduction Integrating CPU and GPU into a single chip for performance

More information

Chapter 5: ASICs Vs. PLDs

Chapter 5: ASICs Vs. PLDs Chapter 5: ASICs Vs. PLDs 5.1 Introduction A general definition of the term Application Specific Integrated Circuit (ASIC) is virtually every type of chip that is designed to perform a dedicated task.

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays

Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays Éricles Sousa 1, Frank Hannig 1, Jürgen Teich 1, Qingqing Chen 2, and Ulf Schlichtmann

More information

Three DIMENSIONAL-CHIPS

Three DIMENSIONAL-CHIPS IOSR Journal of Electronics and Communication Engineering (IOSR-JECE) ISSN: 2278-2834, ISBN: 2278-8735. Volume 3, Issue 4 (Sep-Oct. 2012), PP 22-27 Three DIMENSIONAL-CHIPS 1 Kumar.Keshamoni, 2 Mr. M. Harikrishna

More information

ECE 5775 High-Level Digital Design Automation, Fall 2016 School of Electrical and Computer Engineering, Cornell University

ECE 5775 High-Level Digital Design Automation, Fall 2016 School of Electrical and Computer Engineering, Cornell University ECE 5775 High-Level Digital Design Automation, Fall 2016 School of Electrical and Computer Engineering, Cornell University Optical Flow on FPGA Ian Thompson (ijt5), Joseph Featherston (jgf82), Judy Stephen

More information

Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks

Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks Abstract Deep Convolutional Neural Networks (DCNN) have proven to be very effective in many pattern recognition applications, such

More information

NONCONGRUENT EQUIDISSECTIONS OF THE PLANE

NONCONGRUENT EQUIDISSECTIONS OF THE PLANE NONCONGRUENT EQUIDISSECTIONS OF THE PLANE D. FRETTLÖH Abstract. Nandakumar asked whether there is a tiling of the plane b pairwise non-congruent triangles of equal area and equal perimeter. Here a weaker

More information

Measurement-based Analysis of TCP/IP Processing Requirements

Measurement-based Analysis of TCP/IP Processing Requirements Measurement-based Analysis of TCP/IP Processing Requirements Srihari Makineni Ravi Iyer Communications Technology Lab Intel Corporation {srihari.makineni, ravishankar.iyer}@intel.com Abstract With the

More information

A Configurable Multi-Ported Register File Architecture for Soft Processor Cores

A Configurable Multi-Ported Register File Architecture for Soft Processor Cores A Configurable Multi-Ported Register File Architecture for Soft Processor Cores Mazen A. R. Saghir and Rawan Naous Department of Electrical and Computer Engineering American University of Beirut P.O. Box

More information

A new Adaptive PID Control Approach Based on Closed-Loop Response Recognition

A new Adaptive PID Control Approach Based on Closed-Loop Response Recognition roceedings of the 7th WSEAS nternational Conference on Automation & nformation, Cavtat, Croatia, June 13-15, 2006 (pp156-160) A Adaptive Control Approach Based on Closed-Loop Response Recognition MARTN

More information

APPLICATION OF RECIRCULATION NEURAL NETWORK AND PRINCIPAL COMPONENT ANALYSIS FOR FACE RECOGNITION

APPLICATION OF RECIRCULATION NEURAL NETWORK AND PRINCIPAL COMPONENT ANALYSIS FOR FACE RECOGNITION APPLICATION OF RECIRCULATION NEURAL NETWORK AND PRINCIPAL COMPONENT ANALYSIS FOR FACE RECOGNITION Dmitr Brliuk and Valer Starovoitov Institute of Engineering Cbernetics, Laborator of Image Processing and

More information

Porting Performance across GPUs and FPGAs

Porting Performance across GPUs and FPGAs Porting Performance across GPUs and FPGAs Deming Chen, ECE, University of Illinois In collaboration with Alex Papakonstantinou 1, Karthik Gururaj 2, John Stratton 1, Jason Cong 2, Wen-Mei Hwu 1 1: ECE

More information

ARAPrototyper: Enabling Rapid Prototyping and Evaluation for Accelerator-Rich Architectures

ARAPrototyper: Enabling Rapid Prototyping and Evaluation for Accelerator-Rich Architectures 1 ARAPrototyper: Enabling Rapid Prototyping and Evaluation for Accelerator-Rich Architectures Yu-Ting Chen, Jason Cong, Zhenman Fang, Bingjun Xiao, Peipei Zhou Center for Domain-Specific Computing, University

More information

Generation of Multigrid-based Numerical Solvers for FPGA Accelerators

Generation of Multigrid-based Numerical Solvers for FPGA Accelerators Generation of Multigrid-based Numerical Solvers for FPGA Accelerators Christian Schmitt, Moritz Schmid, Frank Hannig, Jürgen Teich, Sebastian Kuckuk, Harald Köstler Hardware/Software Co-Design, System

More information

Chapter 3 Part 1 Combinational Logic Design

Chapter 3 Part 1 Combinational Logic Design Universit of Wisconsin - Madison EE/omp Sci 352 igital Sstems undamentals Kewal K. Saluja and Yu Hen Hu Spring 22 hapter 3 Part ombinational Logic esign Originals b: harles R. Kime and Tom Kamisnski Modified

More information

Power Efficient Design of DisplayPort (7.0) Using Low-voltage differential signaling IO Standard Via UltraScale Field Programming Gate Arrays

Power Efficient Design of DisplayPort (7.0) Using Low-voltage differential signaling IO Standard Via UltraScale Field Programming Gate Arrays Efficient of DisplayPort (7.0) Using Low-voltage differential signaling IO Standard Via UltraScale Field Programming Gate Arrays Bhagwan Das 1,a, M.F.L Abdullah 1,b, D M Akbar Hussain 2,c, Bishwajeet Pandey

More information