SYNTHESIZING LINEAR-ARRAY ALGORITHMS FROM NESTED FOR LOOP ALGORITHMS


Amberly Cooper
5 years ago

Chapter 1 : Zvi Kedem - Research Output - NYU Scholars

Excerpt from Synthesizing Linear-Array Algorithms From Nested for Loop Algorithms: We will study linear systolic arrays in this paper, as linear arrays are attractive for their bounded I/O requirements and a simple global clock whose rate is independent of the size of the array.

Kedem - ACM Trans. Programming Languages and Systems: On shared memory parallel computers (SMPCs) it is natural to focus on decomposing the computation mainly by distributing the iterations of the nested Do-Loops. In contrast, on distributed memory parallel computers (DMPCs) the decomposition of computation and the distribution of data must both be handled in order to balance the computation load and to minimize the migration of data. We propose and validate experimentally a method for handling computations and data synergistically to optimize the overall execution time. The method relies on a number of novel techniques, also presented in this paper. The intuition is that the dominant arrays are the ones whose migration would be the most expensive. Using the correspondence between iteration space mapping vectors and distributed dimensions of the dominant data array in each nested Do-loop, we are able to design algorithms for determin...

As in general, the number of iterations of a nested Do-Loop is much larger than the number of PEs, a set of iterations called a tile is assigned to each PE, with the property that they can be execut...

In this paper, we generalize the parameter-based approach of Li and Wah [1] to map n-dimensional uniform recurrences to any k-dimensional processor arrays, where k < n.
In our approach, operations of the target array are captured by a set of parameters, and constraints are derived to avoid computational conflicts and data collisions. We show that the optimal array for any objective function expressed in terms of these parameters can be found by a systematic enumeration over a polynomial search space. In contrast, previous attempts [2, 3] do not guarantee the optimality of the resulting designs. We illustrate our method with optimal single-pass linear arrays for the re-indexed Warshall-Floyd path-finding algorithm. Finally, we show the application of GPM to practical situations characterized by restriction on resources, such as processors or completion ti...

Data parallelism, in which the same operation is performed on many elements of an n-dimensional array, is one of the most powerful methods of extracting parallelism in scientific computation. One form of data parallelism involves defining a sequence of parallel wavefronts of a computation. Different wavefronts result in different performance, so the question arises how to determine the wavefronts that result in the minimum computation time. Wavefront determination should also define the allocation of wavefront elements to processors. In this paper we present efficient algorithms for determining the optimum wavefront and for partitioning it into sections assigned to individual processors. The presented algorithms are applicable to computations that are defined over two or higher dimensional arrays and are executed on distributed memory machines interco...

Moldovan et al. [7] considered a linear transformation, T, of the algorithm to map it efficiently on a VLSI processor array. The linear transformation consists of two parts:

Parallel and Distributed Systems: Processor arrays are frequently used to deliver high performance in many applications with computationally intensive operations.
This paper presents the General Parameter Method (GPM), a systematic parameter-based approach for synthesizing such algorithm-specific architectures. GPM can synthesize processor arrays of any lower dimension from a uniform-recurrence description of the algorithm. The design objective is a general non-linear and non-monotonic user-specified function, and depends on attributes such as computation time of the recurrence on the processor array, completion time, load time, and drain time. In addition, bounds on some or all of these attributes can be specified. GPM performs an efficient search of polynomial complexity to find the optimal design satisfying the user-specified design constraints. As an illustration, we show how GPM can be used to find optimal linear processor arrays for computing transitive closures. We consider design objectives that minimize co...
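The wavefront idea mentioned in the excerpts above can be illustrated with a small sketch. It assumes a rectangular two-deep loop nest with constant dependence vectors and a linear schedule vector; the function name `wavefronts` and its parameters are hypothetical and are not taken from any of the cited papers. Each iteration point p is assigned the time step pi . p, and all points sharing a time step form one wavefront that could run in parallel.

```python
from itertools import product

def wavefronts(bounds, schedule, dependences):
    """Group the points of a rectangular loop nest into wavefronts.

    bounds      -- loop trip counts, e.g. (4, 3) for 0 <= i < 4, 0 <= j < 3
    schedule    -- integer vector pi; point p runs at time step pi . p
    dependences -- constant dependence vectors d (point p depends on p - d)
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    # A linear schedule is valid only if every dependence advances time.
    for d in dependences:
        if dot(schedule, d) <= 0:
            raise ValueError(f"schedule {schedule} violates dependence {d}")
    fronts = {}
    for p in product(*(range(n) for n in bounds)):
        fronts.setdefault(dot(schedule, p), []).append(p)
    # Return the wavefronts ordered by time step.
    return [fronts[t] for t in sorted(fronts)]

# Illustrative input: each point depends on its left and upper neighbour.
deps = [(1, 0), (0, 1)]
fronts = wavefronts((4, 3), (1, 1), deps)
print(len(fronts))  # 6 time steps for the 12 iterations
```

Different schedule vectors give different numbers of wavefronts for the same loop nest, which is the trade-off the excerpt refers to when it asks which wavefronts minimize computation time.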

Chapter 2 : CiteSeerX - Citation Query Synthesizing linear array algorithms from nested for loop algorith...

The mapping of algorithms structured as depth-p nested FOR loops into special-purpose systolic VLSI linear arrays is addressed. The mappings are done by using linear functions to transform the original sequential algorithms into a form suitable for parallel execution on linear arrays.

During the course of the last decade, a mathematical model for the parallelization of FOR-loops has become increasingly popular. In this model, a perfect nest of r FOR-loops is represented by a convex polytope in Z^r. The boundaries of each loop specify the extent of the polytope in a distinct dimension. These transformations have a very intuitive interpretation and can be easily quantified and automated due to their mathematical foundation in linear programming and linear algebra. With the recent availability of massively parallel computers, the idea of loop parallelization is gaining significance, since it promises execution speed-ups of orders of magnitude. The polytope model for loop parallelization has its origin in systolic design, but it applies in more general settings and methods based on it will become a part of futur...

A full-dimensional solution offers a maximum speed-up.

Cronquist, Paul Franklin: Configurable computing has captured the imagination of many architects who want the performance of application-specific hardware combined with the reprogrammability of general-purpose computers. Unfortunately, configurable computing has had rather limited success largely because the FPGAs on which they are built are more suited to implementing random logic than computing tasks.
This paper presents RaPiD, a new coarse-grained FPGA architecture that is optimized for highly repetitive, computation-intensive tasks. Very deep application-specific computation pipelines can be configured in RaPiD. These pipelines make much more efficient use of silicon than traditional FPGAs and also yield much higher performance for a wide range of applications. RaPiD is not limited to implementing systolic arrays, however. For example, a pipeline can be constructed which comprises different computations at different stages and at different times. RaPiDs can provide significantly higher performance than general purpose processors on a wide range of applications from the areas of video and signal processing, scientific computing, and communications.

A RaPiD architecture is optimized for highly repetitive, computationally-intensive tasks. Very deep application-specific computation pipelines can be configured in RaPiDs that deliver very high performance for a wide range of applications. RaPiDs achieve this using a coarse-grained reconfigurable architecture that mixes the appropriate amount of static configuration with dynamic control. We describe the fundamental features of a RaPiD architecture, including the linear array of functional units, a programmable segmented bus structure, and a programmable control architecture. In addition, we outline the floorplan of the architecture and provide timing data for the most critical paths. We conclude with performance numbers for several applications on an instance of a RaPiD architecture.

The linear structure of the RaPiD datapath was shown in...

The goal of the RaPiD (Reconfigurable Pipelined Datapath) architecture is to provide high performance configurable computing for a range of computationally-intensive applications that demand special-purpose hardware.
This is accomplished by mapping the computation into a deep pipeline using a configurable array of coarse-grained computational units. A key feature of RaPiD is the combination of static and dynamic control. While the underlying computational pipelines are configured statically, a limited amount of dynamic control is provided which greatly increases the range and capability of applications that can be mapped to RaPiD. This paper illustrates this mapping and configuration for several important applications including a FIR filter, 2-D DCT, motion estimation, and parametric curve generation; it also shows how static and dynamic control are used to perform complex computations.

This paper presents the New Systolic Language as a general solution to the problem of systolic programming. The language provides a simple programming interface for systolic algorithms suitable for different hardware platforms and software simulators. The New Systolic Language hides the details and potential systolic data streams. Data flows and systolic cell programs for the co-processor are integrated with host functions, enabling a single file to specify a complete systolic program.

Configurable computers have attracted considerable attention recently because they promise to deliver the performance of application-specific hardware along with the flexibility of general-purpose computers. Unfortunately, configurable computing has had rather limited success to date. We believe that the FPGAs currently used to construct configurable computers are too general to achieve good cost-performance on computationally-intensive applications that demand special-purpose hardware. This paper describes a new architecture called RaPiD (Reconfigurable Pipelined Datapaths), which is optimized for highly repetitive, computationally-intensive tasks. Very deep application-specific computation pipelines can be configured in RaPiD that deliver very high performance for a wide range of applications. RaPiD achieves this using a coarse-grained reconfigurable architecture that mixes the appropriate amount of static configuration with dynamic control.
Lee and Kedem [2, 6] gave a set of necessary and sufficient conditions for the feasibility of a design and conditions to avoid data-link collisions when two data tokens contend for the same link si...

The set of index points along with the set of dependence vectors d...

Important steps towards a formal solution were first made by Lee and Kedem [8]. They presented the concept of data-link collisions (two data tokens contending for the same link simultaneously) and c...

Chapter 3 : ADVIS - Mathematical software - swmath

Synthesizing linear-array algorithms from nested for loop algorithms [P Lee, Z Kedem].

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The methodologies adopted for mapping these algorithms onto parallel hardware often use heuristic search that requires a lot of computational effort to obtain near optimal solutions. The above is used to develop our proposed modified heuristic search to arrive at an optimal design, and the complexity comparisons are given. The MATLAB results of the new search and the design space trade-off analysis using the high-level synthesis tool are presented for two typical computationally intensive nested loop algorithms: the 6D FSBM and the 4D edge detection, alternatively known as the 2D filtering algorithm.

The management of complexity and tapping the full potential of these RSoC architectures present many challenges [1]. A large number of heuristic algorithms have been used in developing many novel scheduling and mapping algorithms [2-5]. However, these approaches face difficulties in dealing with large execution times. Systolic array design style can effectively exploit parallelism inherent in the nested loop algorithm and, therefore, reduce processing time [2, 3]. Often heuristic procedures are used to search for the mapping transformations that are used to map the nested loop algorithms onto array architectures [4, 5]. Since the effort that goes into heuristic search is large and complex, the challenge lies in improving the process to reduce the computational effort in getting the mapping results. Our main contribution in this paper is that we propose an augmented approach to the heuristic search.
A new method of identifying the subspace to which the PE array is to be assigned is proposed, based on the directional index of the computational expression that is explained in Section 2. The new vectors and terminologies used in the procedure are defined and elaborated in Section 2. The complexity analysis is performed by comparing the search space used in our method with the search space in [4]. The high-level synthesis tool GAUT is used to plot the design space trade-off curves for design space exploration. The paper is organized as follows: The 4D nested loop formulation of the 2D filtering problem is explained in Section 4. The methodology and the implementation of the above approach for the 2D filtering algorithm and the mapping results are presented in Section 4. Section 7 discusses the complexity considerations and comparisons. Section 8 gives the conclusion and future work.
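To make the "4D nested loop formulation of the 2D filtering problem" concrete, here is a minimal sketch of 2D filtering written as a depth-4 loop nest. The excerpt does not show the paper's own formulation, so the function name `filter2d`, the loop order, and the choice to compute only the region where the kernel fits entirely inside the image are assumptions made for illustration.

```python
def filter2d(image, kernel):
    """2D filtering (correlation form) as a depth-4 nested loop.

    image  -- list of rows (H x W)
    kernel -- list of rows (K x L)
    Only output positions where the kernel lies fully inside the image
    are computed (an assumption of this sketch).
    """
    H, W = len(image), len(image[0])
    K, L = len(kernel), len(kernel[0])
    out = [[0] * (W - L + 1) for _ in range(H - K + 1)]
    for i in range(H - K + 1):          # output row
        for j in range(W - L + 1):      # output column
            for k in range(K):          # kernel row
                for l in range(L):      # kernel column
                    out[i][j] += image[i + k][j + l] * kernel[k][l]
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
result = filter2d(image, kernel)
print(result)  # [[6, 8], [12, 14]]
```

The four loop indices (i, j, k, l) span the 4D index space that a mapping procedure would project onto a processor array.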

Chapter 4 : Mapping of recursive algorithms onto multi-rate arrays - CORE

During the course of the last decade, a mathematical model for the parallelization of FOR-loops has become increasingly popular. In this model, a (perfect) nest of r FOR-loops is represented by a convex polytope in Z^r.

Research reported in more than 50 scientific publications. Has served on program committees of scientific conferences and on editorial boards of scientific journals. Guided more than 15 doctoral dissertations. Has a total of more than doctoral descendants. A complete set of PowerPoint presentations for an introductory Database Management Systems class is available at https: They may be used in lectures, in individual study, downloaded, and printed. However, they may not be modified or material extracted from them or incorporated elsewhere without prior written permission.

Optimal surface reconstruction from planar contours. Communications of the ACM, Citations including those in more than distinct journals; majority not in Computer Science: Consistency in hierarchical data base systems. Journal of the ACM, With preliminary version as: Controlling concurrency using locking protocols. On visible surface generation by a-priori tree structures. From exclusive to shared locks. Journal of the ACM, Non-two-phase locking protocols with shared and exclusive locks. Synthesizing linear-array algorithms from nested for loop algorithms. Mapping nested loop algorithms into multi-dimensional systolic arrays. Efficient robust parallel computations. Combining tentative and definite executions for very fast dependable parallel computing. Efficient program transformations for resilient parallel computation via randomization. Parallel processing on networks of workstations: A fault-tolerant high performance approach. Parallel suffix-prefix-matching algorithm and applications. A novel software system for fault-tolerant parallel processing on distributed platforms. An infrastructure for network computing with Java applets. Practice and Experience, An infrastructure for distributed web applications. Metacomputing on the Web. Future Generation Computer Systems, An efficient algorithm for discovering the maximum frequent set. Data image management via emulation of non-volatile storage device.

Chapter 5 : CiteSeerX - Citation Query Mapping nested loop algorithms into multidimensional systolic arr...

The mapping of algorithms structured as depth-p nested FOR loops into special-purpose systolic VLSI linear arrays is addressed. The mappings are done by us...

Chapter 6 : Zvi M. Kedem: Brief CV

We will study linear systolic arrays in this paper, as linear arrays are attractive for their bounded I/O requirements and a simple global clock whose rate is independent of the size of the array. We will consider the important class of algorithms structured as (depth) p nested for loops.

Chapter 7 : Synthesizing Linear-Array Algorithms From Nested for Loop Algorithms

The mapping of algorithms structured as depth-p nested FOR loops into special-purpose systolic VLSI linear arrays is addressed. The mappings are done by using linear functions to transform the...

Chapter 8 : Mapping Nested Loop Algorithms Into Multi-Dimensional Systolic Arrays

Synthesizing linear-array algorithms from nested for loop algorithms
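The excerpts above describe mapping depth-p nested loops onto linear arrays "by using linear functions". The sketch below shows one common way to read that: a schedule vector gives each iteration its time step and an allocation vector gives its processing element (PE), and the mapping is rejected if two iterations collide on the same PE at the same time (a computational conflict). The function name, the specific vectors, and the brute-force check are illustrative assumptions, not the procedure from Lee and Kedem's paper; data dependences and data-link collisions are not checked here.

```python
from itertools import product

def map_to_linear_array(bounds, schedule, allocation):
    """Map each iteration p of a rectangular loop nest to a (time, PE) slot.

    time = schedule . p and PE = allocation . p are both linear functions.
    Raises ValueError on a computational conflict, i.e. two iterations
    assigned to the same PE at the same time step.
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    slots = {}
    for p in product(*(range(n) for n in bounds)):
        slot = (dot(schedule, p), dot(allocation, p))
        if slot in slots:
            raise ValueError(f"conflict: {slots[slot]} and {p} share {slot}")
        slots[slot] = p
    return slots

# Hypothetical vectors for a 3 x 3 x 3 nest (depth p = 3).
slots = map_to_linear_array((3, 3, 3), schedule=(1, 3, 1), allocation=(0, 1, -1))
pes = sorted({pe for _, pe in slots})
print(len(slots), pes)  # 27 [-2, -1, 0, 1, 2]
```

With these vectors the 27 iterations occupy distinct (time, PE) slots on five PEs. A pair such as schedule=(3, 1, 1) and allocation=(1, 1, 0) fails the check, because iterations (0, 1, 2) and (1, 0, 0) both land on PE 1 at time 3.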

Rapid: A Configurable Architecture for Compute-Intensive Applications

Carl Ebeling, Dept. of Computer Science and Engineering, University of Washington. Alternatives for High-Performance Systems. ASIC: Use application-specific