THE VLSI IMPLEMENTATION AND EVALUATION OF AREA- AND ENERGY-EFFICIENT STREAMING MEDIA PROCESSORS


THE VLSI IMPLEMENTATION AND EVALUATION OF AREA- AND ENERGY-EFFICIENT STREAMING MEDIA PROCESSORS

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Brucek Khailany
June 2003

© Copyright by Brucek Khailany 2003
All Rights Reserved

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

William J. Dally (Principal Adviser)

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Mark Horowitz

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Teresa Meng

Approved for the University Committee on Graduate Studies.

Abstract

Media applications such as image processing, signal processing, and graphics require tens to hundreds of billions of arithmetic operations per second of sustained performance for real-time application rates, yet also have tight power constraints in many systems. For this reason, these applications often use special-purpose (fixed-function) processors, such as graphics processors in desktop systems. These processors provide several orders of magnitude higher performance efficiency (performance per unit area and performance per unit power) than conventional programmable processors. In this dissertation, we present the VLSI implementation and evaluation of stream processors, which reduce this performance efficiency gap while retaining full programmability.

Imagine is the first implementation of a stream processor. It contains 48 arithmetic units supporting floating-point and integer data-types, organized into eight SIMD arithmetic clusters. Imagine executes applications written as stream programs, consisting of a sequence of computation kernels operating on streams of data records. The prototype Imagine processor is a 21-million transistor chip, implemented in a 0.15 micron CMOS process. At 232 MHz, a peak performance of 9.3 GFLOPS is achieved while dissipating 6.4 Watts with a die size measuring 16 mm on a side.

Furthermore, we extend these experimental results from Imagine to stream processors designed in more area- and energy-efficient custom design methodologies and to future VLSI technologies where thousands of arithmetic units on a single chip will be feasible. Two techniques for increasing the number of arithmetic units in a stream processor are presented: intracluster and intercluster scaling. These scaling techniques are shown to provide high performance efficiencies up to tens of ALUs per cluster and up to hundreds of arithmetic clusters, demonstrating the viability of stream processing for many years to come.

Acknowledgments

During the course of my studies at Stanford University, I have been fortunate to work with a number of talented individuals. First and foremost, thanks go to my research advisor, Professor William J. Dally. Through his vision and leadership, Bill has always been an inspiration to me and everyone else on the Imagine project. He also provided irreplaceable guidance for me when I needed to eventually find a dissertation topic. Professor Dally provided me with the opportunity to take a leadership role on the VLSI implementation of the Imagine processor, an invaluable experience for which I will always be grateful. I would also like to thank the other members of my reading committee, Professor Mark Horowitz and Professor Teresa Meng, for their valuable feedback regarding the work described in this dissertation and for interactions over my years at Stanford.

The Imagine project was the product of the hard work of many graduate students in the Concurrent VLSI Architecture group at Stanford. Most notably, I would like to thank Scott Rixner, Ujval Kapasi, John Owens, and Peter Mattson. Together, we formed a team that took the Imagine project from a research idea to a working silicon prototype. More recently, Jung-Ho Ahn, Abhishek Das, and Ben Serebrin have helped with laboratory measurements. Thanks also go to all of the other team members who helped with the Imagine VLSI implementation, including Jinyung Namkoong, Brian Towles, Abelardo Lopez-Lagunas, Andrew Chang, Ghazi Ben Amor, and Mohamed Kilani. I would also like to thank all of the other members of the CVA group at Stanford, especially my officemates over the years: Ming-Ju Edward Lee, Li-Shiuan Peh, and Patrick Chiang. Many thanks also go to Pamela Elliot and Shelley Russell, the CVA group administrators while I was a graduate student here.

The research described in this dissertation would not have been possible without the

generous funding provided by a number of sources. I would like to specifically thank the Intel Foundation for a one-year fellowship to support this research. For the remainder of my time as a graduate student, I was supported by the Imagine project, which was funded by the Defense Advanced Research Projects Agency under ARPA order E254 and monitored by the Army Intelligence Center under contract DABT63-96-C0037, by ARPA order L172 monitored by the Department of the Air Force under contract F , by Intel Corporation, by Texas Instruments, and by the Interconnect Focus Center Program for Gigascale Integration under DARPA Grant MDA.

Finally, I cannot say enough about the support provided by my friends and family. My parents, Asad (the first Dr. Khailany) and Laura, have been my biggest supporters and for that I am forever grateful. Now that they will no longer be able to ask me when my thesis will be done, we will have to find a new subject to discuss on the telephone. My sister and brother, Raygar and Sheilan, have always provided timely encouragement and advice. To all of my friends and family members who have helped me in one way or another over the years, I would like to say thanks.

Contents

Abstract
Acknowledgments

1 Introduction
   1.1 Contributions
   1.2 Outline

2 Background
   2.1 Media Applications
      2.1.1 Compute Intensity
      2.1.2 Parallelism
      2.1.3 Locality
   2.2 VLSI Technology
   2.3 Media Processing
      2.3.1 Special-purpose Processors
      2.3.2 Microprocessors
      2.3.3 Digital Signal Processors and Programmable Media Processors
      2.3.4 Vector Microprocessors
      2.3.5 Chip Multiprocessors
   2.4 Stream Processing
      2.4.1 Stream Programming
      2.4.2 Stream Architecture
      2.4.3 Stream Processing Related Work
      2.4.4 VLSI Efficiency of Stream Processors

3 Imagine: Microarchitecture and Circuits
   Instruction Set Architecture
      Stream-Level ISA
      Kernel-Level ISA
      Kernel Instruction Format
   Microarchitecture
      Microcontroller
      Arithmetic Clusters
      Kernel Execution Pipeline
      Stream Register File
      SRF Pipeline
      Streaming Memory System
      Network Interface
      Stream Controller
   Arithmetic Cluster Function Units
      ALU Unit
      MUL Unit
      DSQ Unit
      SP Unit
      COMM Unit
      JB/VAL Unit
   Summary

4 Imagine: Design Methodology
   Schedule
   Design Methodology Background
   Imagine Design Methodology
   Imagine Implementation Results
   Imagine Clocking Methodology
   Imagine Verification Methodology

5 Imagine: Experimental Results
   Operating Frequency
   Power Dissipation
   Energy Efficiency
   Sustained Application Performance
   Summary

6 Stream Processor Scalability: VLSI Costs
   VLSI Cost Models
   Stream Processor Cost Models
   VLSI Cost Evaluation
      Intracluster Scaling
      Intercluster Scaling
      Combined Scaling
   Custom and Low-Power Stream Processors

7 Stream Processor Scalability: Performance
   Related Scalability Work
   Technology Trends
      Memory Bandwidth
      Wire Delay
   Performance Evaluation
      Kernel Inner-Loop Performance
      Kernel Short Stream Effects
      Application Performance
      Bandwidth Hierarchy Scaling
   Improving Intercluster and Intracluster Scalability
   Scalability Summary

8 Conclusions
   Future Work

Bibliography

List of Tables

2.1 Media Processor Efficiencies (Normalized to 0.13 µm, 1.2 V)
Kernel ISA - Part 1
Kernel ISA - Part 2
JB/VAL Operation for Conditional Output Streams
Function Unit Area and Complexity
Subchip statistics
Imagine placement results
Imagine timing results
Energy-Efficiency Comparisons
Energy-Delay Comparisons
Sustained Application Performance
Building Block Areas, Energies, and Delays
Scaling Coefficients
Kernel Inner Loop Characteristics
Scaling Cost Models
Building Block Areas, Energies, and Delays for ASIC, CUST, and LP
ASIC, CUST, and LP performance efficiencies
Technology Scaling Parameters
Kernels and Applications used for Performance Evaluation
Intercluster Scaling Performance Efficiency

List of Figures

2.1 A Stereo Depth Extractor
Stereo depth extractor as a stream program
Stream Processor Block Diagram
Arithmetic Cluster Block Diagram
Imagine Arithmetic Cluster VLIW Instruction Format
Microcontroller Block Diagram
Function Unit Details
Local Register File Implementation
Kernel Execution Pipeline Diagram
Stream Register File Block Diagram
SRF Pipeline Diagram
Stream Controller Block Diagram
ALU Unit Block Diagram
Segmented Carry-Select Adder
MUL Unit Block Diagram
DSQ Unit Block Diagram
Computing the COMM Source Index in the JB/VAL unit
Standard ASIC Design Methodology
Tiled Region Design Methodology
Tiled Region Floorplanning Details
Asynchronous FIFO Synchronizer
5.1 Die Photograph
Measured Operating Frequency
Measured Ring Delay
Measured Core Power Dissipation
C_sw distribution during Active Operation
Measured Energy Efficiency
Scalable Grid Floorplan
Intracluster Switch Floorplan
Area of Intracluster Scaling
Energy of Intracluster Scaling
Area of Intercluster Scaling
Energy of Intercluster Scaling
Area of Combined Scaling
Effect of Technology Scaling on Die Area and Power Dissipation
Effect of Technology Scaling on Energy Efficiency
Worst-case Switch Delay with Intracluster Scaling
Worst-case Switch Delay with Intercluster Scaling
Intracluster Scaling with no Loop Transformations
Intracluster Scaling with Software Pipelining
Intracluster Scaling with Software Pipelining and Loop Unrolling
Inner-Loop Performance per Area with Intracluster Scaling
Intercluster Kernel Speedup
Kernel Short Stream Effects
Application Performance
Application Cycles with Intercluster Scaling (N=5)
Bandwidth Hierarchy with Intercluster Scaling (N=5)
Intercluster Switch Locality with 8x8 Cluster Grid Floorplan
Limited-Connectivity Intercluster Switch for 8x8 Cluster Floorplan

Chapter 1

Introduction

Computing devices and applications have recently emerged to interface with, operate on, and process data from real-world samples classified as media. As media applications operating on these data-types have come to the forefront, the design of processors optimized for these applications has emerged as an important research area. Traditional microprocessors have been optimized to execute applications from desktop computing workloads. Media applications are a workload with significantly different characteristics, meaning that large improvements in performance, cost, and power efficiency can be achieved by improving media processors.

Media applications include workloads from the areas of signal processing, image processing, video encoding and decoding, and computer graphics. These workloads require a large and growing amount of arithmetic performance. For example, many current computer graphics and image processing applications in desktop systems require tens to hundreds of billions of arithmetic operations per second for real-time performance [Rixner, 2001]. As scene complexity, screen resolutions, and algorithmic complexity continue to grow, this demand for absolute performance will continue to increase. Similar examples of large and growing performance requirements can be drawn from the other application areas, such as the need for higher communication bandwidth rates in signal processing and higher video quality in video encoding and decoding algorithms. As a result, media processors must be designed to provide large amounts of absolute performance.

While high performance is necessary to meet the computational requirements of media

applications, many media processors will need to be deployed in mobile systems and other systems where cost and power consumption are key concerns. For this reason, low power consumption and high energy efficiency, or high performance per unit power (low average energy dissipated per arithmetic operation), must be a key design goal for any media processor.

Fixed-function processors have been able to provide both high performance and good energy efficiency when compared to their programmable counterparts on media applications. For example, the Nvidia GeForce3 [Montrym and Moreton, 2002; Malachowsky, 2002], a recent graphics processor, provides 1.2 Teraops per second of peak performance at 12 Watts, for an energy efficiency of 10 picojoules per operation. In comparison, programmable digital signal processors and microprocessors are several orders of magnitude worse in absolute performance and in energy efficiency. However, programmability is a key requirement in many systems where algorithms are too complex or change too rapidly to be built into fixed-function hardware. Using programmable rather than fixed-function processors also enables fast time-to-market. Finally, the cost of building fixed-function chips is growing significantly in deep sub-micron technologies, meaning that programmable solutions also have an inherent cost advantage since a single programmable chip can be used in many different systems. For these reasons, a programmable media processor which can provide the performance and energy efficiency of fixed-function media processors is desirable.

Stream processors have recently been proposed as a solution that can provide all three of the above: performance, energy efficiency, and programmability. In this dissertation, the design and evaluation of a prototype stream processor, called Imagine, is presented. This 21-million transistor processor is implemented in a 5-level metal 0.15 micron CMOS technology with a die size measuring 16 millimeters on a side. At 232 MHz, a peak performance of 9.3 GFLOPS is achieved while dissipating 6.4 Watts. Furthermore, in future VLSI technologies, the scalability of stream processors to Teraops per second of peak performance is demonstrated.

1.1 Contributions

This dissertation makes several contributions to the fields of computer architecture and media processing:

- The design and evaluation of the Imagine stream processor. This is the first VLSI implementation of a stream architecture and provides experimental verification of the VLSI feasibility and performance of stream processors.

- Analysis of the performance efficiency of stream processors. This analysis demonstrates the potential for providing high performance per unit area and high performance per unit power when compared to other media processor architectures.

- Analytical models for the area, power, and delay of key components of a stream processor. These models are used to demonstrate the scalability of stream processors to thousands of arithmetic units in future VLSI technologies.

- An analysis of the performance of media applications as the number of arithmetic units per stream processor is increased. This analysis provides insights into the available parallelism in media applications and explores the tradeoffs in area, power, and performance for different methods of scaling to large numbers of arithmetic units per stream processor.

1.2 Outline

Recently, media processing has gained attention in both commercial products and academic research. The important recent trends in media processing are presented in Chapter 2. One such trend which has gained prominence in the research community is stream processing. In Chapter 2, we introduce and explain stream processing, which consists of a programming model and architecture that enables high performance on media applications with fully programmable processors. In order to explore the performance and efficiency of stream processing, a prototype stream processor, Imagine, was designed and implemented in a modern VLSI technology.

In Chapter 3, the instruction set architecture, microarchitecture, and key arithmetic circuits from Imagine are described. In Chapter 4, the design methodology is presented, and finally, in Chapter 5, experimental results are provided. Also in Chapter 5, the energy efficiency of Imagine and a comparison to existing processors are presented. This work on Imagine was then extended to study the scalability of stream processors in future VLSI technologies, when thousands of arithmetic units could fit on a single chip. In Chapter 6, analytical models for the area, power, and delay of key components of a stream processor are presented. These models are then used to explore how area and energy efficiency scales with the number of arithmetic units. In Chapter 7, performance scalability is studied by exploring the available parallelism in media applications and by exploring the tradeoffs between different methods of scaling. Finally, conclusions and future work are presented in Chapter 8.

Chapter 2

Background

Media applications and media processors have recently become an active and important area of research. In this chapter, background and previous work on media processing is presented. First, media application characteristics and previous work on processors for running these applications are presented. Then, stream processors are introduced. Stream processors have recently been proposed as an architecture that exploits media application characteristics to achieve better performance, area efficiency, and energy efficiency than existing programmable processors.

2.1 Media Applications

Media applications are programs with real-time performance requirements that are used to process audio, video, still images, and other data-intensive data types. Example application domains include image processing, computer-generated graphics, video encoding or decoding, and signal processing. As previous researchers have pointed out, these applications share several important characteristics: compute intensity, parallelism, and locality [Rixner, 2001].

A flow-diagram representation of one such media application, a stereo depth extractor, is shown graphically in Figure 2.1 [Kanade et al., 1996]. In this application, using two images offset by a horizontal disparity as input from two cameras, each row from each image is first filtered and then compared using a sum-of-absolute-differences metric to

estimate the disparity between objects in the images. From the disparity calculated at each image pixel, the depth of objects in an image can be approximated. This stereo depth extractor will be used to demonstrate the three important characteristics common to most media applications.

Figure 2.1: A Stereo Depth Extractor

2.1.1 Compute Intensity

The first important characteristic is compute intensity, meaning that media applications require a high number of arithmetic operations per memory reference when compared to traditional desktop applications. Rixner studied application characteristics of four media applications: the stereo depth extractor presented above, a video encoder/decoder, a polygon renderer, and a matrix QR decomposition [Rixner, 2001]. On the stereo depth extractor, over 400 arithmetic operations in the convolution filter and sum-of-absolute-differences calculations were required per inherent memory reference (input, output, and other global data accesses). The other applications ranged upward from 57.9 arithmetic operations per memory reference. In comparison, traditional desktop integer applications have ratios of less than 2: arithmetic operations comprise between 2% and 50% of dynamically executed instructions whereas memory loads and stores account for 15% to 80% of instructions in the SPECint2000 benchmark suite [KleinOsowski et al., 2000]. This difference suggests that architectures optimized for integer benchmarks such as general-purpose microprocessors would not be as well-suited to media applications and vice versa.
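To make the ratio concrete, consider the arithmetic in a single output pixel of the 7x7 convolution kernel used later in this chapter. The accounting below is an illustrative estimate, not the exact operation counts from Rixner's study:

\[
\underbrace{49}_{\text{multiplies}} \;+\; \underbrace{48}_{\text{adds}} \;=\; 97 \ \text{arithmetic operations per output pixel.}
\]

If the filter coefficients and neighboring pixels are held in local storage, each input pixel is read from global memory roughly once and each output pixel is written once, so this kernel alone sustains on the order of 50 arithmetic operations per inherent memory reference; the 3x3 filter and the sum-of-absolute-differences search over many candidate disparities push the whole application well past that.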

2.1.2 Parallelism

Not only do these applications require large numbers of arithmetic operations per memory reference, but many of these arithmetic operations can be executed in parallel. This available parallelism in media applications can be classified into three categories: instruction-level parallelism (ILP), data-level parallelism (DLP), and task-level parallelism (TLP).

The most plentiful parallelism in media applications is at the data level. DLP refers to computation on different data elements occurring in parallel. Furthermore, DLP in media applications can often be exploited with SIMD execution since the same computation is typically applied to all data elements. For example, in the stereo depth extractor, all output pixels in the depth map could theoretically be computed in parallel by the same fixed-function hardware element since there are no dependencies between these pixels and the computation required for every pixel is the same. Other media applications also contain large degrees of DLP.

Some parallelism is also available at the instruction level. In the stereo depth extractor, ILP refers to the parallel execution of individual arithmetic instructions in the convolution filter or sum-of-absolute-differences calculation. For example, the convolution filter computes the product of a coefficient matrix with a sequence of pixels. This matrix-vector product includes a number of multiplies and adds that could be performed in parallel. Such fine-grained parallelism between individual arithmetic operations operating on one data element is classified as ILP and can be exploited in many media applications. As will be shown later in Chapter 7, available ILP in media applications is usually limited to a few instructions per cycle due to dependencies between instructions. Although other researchers have shown that out-of-order superscalar microprocessors are able to execute up to 4.2 instructions per cycle on some media benchmarks [Ranganathan et al., 1999], this is largely due to DLP being converted to ILP with compiler or hardware techniques rather than the true ILP that exists in these applications.

Finally, the stereo depth extractor and other media applications also contain task-level, or thread-level, parallelism. TLP refers to different stages of a computation pipeline being overlapped. For example, in the stereo depth extractor, there are four execution stages: load image data, convolution filter, sum-of-absolute differences, and store output data. TLP is

available in this application because these execution stages could be set up as a pipeline where each stage concurrently processes different portions of the dataset. For example, a pipeline could be set up where each stage operates on a different row: the fourth image rows are loaded from memory, the convolution filter operates on the third rows, sum-of-absolute differences is computed between the second rows, while the first output row is stored back to memory. Note that ILP, DLP, and TLP are all orthogonal types of parallelism, meaning that all three could theoretically be supported simultaneously.

2.1.3 Locality

In addition to compute intensity and parallelism, the other important media application characteristic is locality of reference for data accesses. This locality can be classified into kernel locality and producer-consumer locality. Kernel locality is temporal and refers to reuse of coefficients or data during the execution of computation kernels such as the convolution filter. Producer-consumer locality is also a form of temporal locality that exists between different stages of a computation pipeline or kernels. It refers to data which is produced, or written, by one kernel and consumed, or read, by another kernel and is never read again. This form of locality is seen very frequently in media applications [Rixner, 2001]. In a traditional microprocessor, kernel locality would most often be captured in a register file or a small first-level cache. Producer-consumer locality, on the other hand, is not as easily captured by traditional cache hierarchies in microprocessors since it is not well-matched to the least-recently-used replacement policies typically utilized in caches.

2.2 VLSI Technology

Not only has the typical application domain for programmable processors shifted over the last decade, the technology constraints of modern VLSI (Very Large Scale Integrated Circuits) have evolved as well. In the past, gates used for computation were the critical resource in VLSI design, but in modern technology, computation is cheap and communication between computational elements is expensive. For example, in the Imagine processor [Khailany et al., 2002], a single-precision floating-point multiply-accumulate unit in a

0.18 µm technology dissipates 185 pJ per multiply (0.185 mW per MHz). A thousand of these multipliers could fit on a single die in a 0.13 µm technology. While arithmetic itself is cheap, handling the data and control communication between arithmetic units is expensive. On-chip communication between such arithmetic units requires storage and wires. Small distributed storage elements are not too expensive compared to arithmetic. In the same 0.18 µm technology, a 16-word, 32-bit SRAM with one read port and one write port dissipates 15 pJ per access cycle, assuming both ports are active. However, as additional ports are added to this memory, the area cost increases significantly. Furthermore, the drivers and wires for a 32-bit 5-millimeter bus dissipate 24 pJ per transfer on average [Ho et al., 2001]. If each multiply requires three multi-ported memory accesses and three 5-millimeter bus transfers (two reads and one write), then the cost of the communication is very similar to the cost of a multiply. Architectures must therefore manage this communication effectively in order to keep its area and energy costs from dominating the computation itself. Off-chip communication is an even more critical resource, since there are only hundreds of pins available in large chips today. In addition, each off-chip communication dissipates a lot of energy (typically over 1 nJ for a 32-bit transfer) when compared to arithmetic operations.

Although handling the cost of communication in modern VLSI technology is a challenge, media application characteristics are well-suited to take advantage of cheap computation with highly distributed storage and local communications. Cheap computation can be exploited with large numbers of arithmetic units to take advantage of both compute intensity and parallelism in these applications. Furthermore, producer-consumer locality can be exploited to keep communication local as much as possible, thereby minimizing communication costs.
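Putting the numbers quoted above together makes the comparison concrete (a rough estimate that uses only the per-access and per-transfer energies given in the text):

\[
3 \times 15\,\text{pJ} \;(\text{SRAM accesses}) \;+\; 3 \times 24\,\text{pJ} \;(\text{5 mm bus transfers}) \;=\; 117\,\text{pJ},
\]

which is already within a factor of two of the 185 pJ multiply; with more register file ports or longer wires, the energy spent moving operands can easily exceed the energy spent computing on them.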

2.3 Media Processing

Processors can exploit application characteristics to provide both high performance and, more importantly, performance efficiency. High performance efficiency implies a high ratio of performance per unit area (area efficiency) and a high ratio of performance per unit power (energy or power efficiency). These metrics are often more important than raw performance in many media processing systems since higher area efficiency leads to lower cost and better manufacturability, both important in embedded systems. Energy efficiency implies that for executing a fixed computation task, less energy from a power source such as a battery is used, leading to longer battery life and lower packaging costs in mobile products. In this section, we present previous work on fixed-function and programmable processors for media applications, with data on both performance and performance efficiency.

2.3.1 Special-purpose Processors

Special-purpose, or fixed-function, processors directly map an application's data-flow graph into hardware and can therefore exploit important application characteristics. They contain a large number of computation elements operating in parallel, exploiting both the compute intensity and parallelism in media applications. These computation blocks are then connected together by dedicated wires and memories, exploiting available producer-consumer locality. Using dedicated wires and memories for local storage near the computation elements is very area- and energy-efficient, since it minimizes traversals of long on-chip wires and accesses to large global multi-ported memories. As a result, a large percentage of die area and active power dissipation is allocated to the computation elements rather than to control and communication structures.

An energy-efficiency comparison between a variety of fixed-function and programmable processors for media applications is shown in Table 2.1. All processors have been normalized to a 0.13 micron, 1.2 Volt technology. Energy efficiency is shown as energy per arithmetic operation and is calculated from peak performance and power dissipation (for example, the GeForce3's 6.7 W at 1200 GOPS works out to roughly 5.5 pJ per operation). Although most processors sustain a fraction of peak performance on most applications, sustained performance and power dissipation measurements are not widely available, so peak numbers are used here.

The energy efficiencies of two special-purpose media processors are listed in the first section of Table 2.1: a polygon rendering chip, the Nvidia GeForce3 [Montrym and Moreton, 2002; Malachowsky, 2002], and an MPEG4 video decoder [Ohashi et al., 2002]. These processors provide energy efficiencies of better than 6 pJ per arithmetic operation when normalized to a 0.13 µm technology.

Table 2.1: Media Processor Efficiencies (Normalized to 0.13 µm, 1.2 V)

Processor              Data-type   Peak Perf      Power     Energy/Op
Nvidia GeForce3        8-16b       1200 GOPS      6.7 W     5.5 pJ
MPEG4 Decode           8-16b       2 GOPS         6.2 mW    3.2 pJ
Intel Pentium 4        FP          12 GFLOPS      51.2 W    4266 pJ
  (3.08 GHz)           16b         24 GOPS        51.2 W    2133 pJ
SB-1250                FP          12.8 GFLOPS    8.7 W     677 pJ
  (800 MHz)            64b         6.4 GOPS       8.7 W     1354 pJ
                       16b         12.8 GOPS      8.7 W     677 pJ
TI C67x (225 MHz)      FP          1.35 GFLOPS    1.2 W     889 pJ
TI C64x (600 MHz)      16b         4.8 GOPS       720 mW    150 pJ
VIRAM                  FP          1.6 GFLOPS     1.4 W     875 pJ
                       16b         9.6 GOPS       1.4 W     146 pJ

The other processors in Table 2.1 are all programmable. Although area efficiencies are not provided in the table, comparisons between processors for energy efficiency should be similar to area efficiency. As can be seen, there is an efficiency gap of several orders of magnitude between the special-purpose and programmable processors. The remainder of this section will provide background on these programmable processors and explain their performance efficiency limitations.

2.3.2 Microprocessors

The second section of Table 2.1 includes two microprocessors, a 3.08 GHz Intel Pentium 4¹ [Sager et al., 2001; Intel, 2002] and a SiByte SB-1250, which consists of two on-chip SB-1 CPU cores [Sibyte, 2000]. The Pentium 4 is designed for high performance through deep pipelining and a high clock rate. The SiByte processor is targeted specifically for energy-efficient operation through extensive use of low-power design techniques, and has efficiencies similar to other low-power microprocessors, such as XScale [Clark et al., 2001].

¹ Gate length for this process is actually shorter than 0.13 µm because of poly profile engineering [Tyagi et al., 2000; Thompson et al., 2001].

These processors demonstrate the range of energy efficiencies typically provided by microprocessors, over 500 pJ per instruction when normalized to a 0.13 micron technology. Microprocessors have markedly lower efficiencies than special-purpose processors because of deep pipelining and because of the large amount of area and power taken up by control structures and large global memories such as caches. For example, less than 15% of the die area in the Pentium 3 [Green, 2000], the predecessor to the Pentium 4, is devoted to the arithmetic execution units. In addition, deep pipelining with over 20 pipeline stages, used in the Pentium 4, requires high clock power, large branch predictors, and speculative hardware in order to achieve high performance at the expense of energy efficiency. The SiByte processor is limited to more modest pipeline lengths for energy efficiency, but is still based around an architecture with a global register file and global communications through a cache hierarchy. Caches in microprocessors are not optimized to directly take advantage of producer-consumer locality to increase available on-chip bandwidth, but rather are optimized to exploit temporal and spatial locality to reduce average memory latency.

In addition to energy inefficiencies in control structures, pipelining, and caches, existing microprocessor architectures are unable to take advantage of the compute intensity or parallelism in media applications. A single unified multi-ported register file does not scale efficiently to tens of arithmetic units, limiting the compute intensity and parallelism that can be exploited. Furthermore, microprocessors are mainly optimized to exploit ILP, which is less plentiful than the highly available DLP in media applications. Recently, microprocessors have tried to exploit DLP to achieve higher performance and to overcome register file scalability limitations by adding SIMD extensions to their instruction sets. Some example ISA extensions include VIS [Tremblay et al., 1996], MAX-2 [Lee, 1996], MMX [Peleg and Weiser, 1996], Altivec [Phillip, 1998], SSE [Thakkar and Huff, 1999], and others. However, the amount of data parallelism exploited by SIMD extensions is limited to the width of the SIMD arithmetic units, typically fewer than 4 parallel data elements. This means each SIMD instruction can only capture a small percentage of the DLP available in media applications [Kozyrakis, 2002].

2.3.3 Digital Signal Processors and Programmable Media Processors

Digital signal processors are listed next in Table 2.1. The first DSP, the TI C67x [TI, 2003], is an 8-way VLIW operating at 225 MHz that targets floating-point applications, and has an energy efficiency of 889 pJ per instruction. DSPs targeted for lower-precision fixed-point operation such as the TI C64x [Agarwala et al., 2002], a 600 MHz 8-way VLIW, are able to provide improved energy efficiency over floating-point DSPs and microprocessors when normalized to the same technology, achieving 150 pJ per 16b operation. This improved efficiency is due to arithmetic units optimized for lower-precision fixed-point operation and to SIMD extensions in the C64x. In addition to the C6x DSPs, there are a number of other VLIW DSPs and programmable media processors which achieve similar energy efficiencies, such as the Analog TigerSharc [Olofsson and Lange, 2002], Trimedia [Rathnam and Slavenburg, 1996], the Starcore DSP [Brooks and Shearer, 2000], and others.

DSPs, programmable media processors, and special-purpose processors provide an energy efficiency advantage over microprocessors because they have kept pipeline lengths small and avoided speculative branch predictors for energy efficiency purposes. However, VLIW DSP architectures are not able to scale to tens of ALUs per processor, because they still rely on global register file and control structures in VLIW or superscalar microarchitectures. They also only exploit ILP and limited amounts of DLP through SIMD extensions, similar to microprocessors. As a result, they have area and energy efficiencies significantly better than general-purpose energy-inefficient microprocessors, but are still one to two orders of magnitude worse than special-purpose processors.

2.3.4 Vector Microprocessors

While SIMD extensions enable microprocessors and DSPs to exploit a small degree of DLP, vector processors [Russell, 1978] can exploit much more data parallelism directly with vector instructions and vector memory systems. As technology has advanced, vector processors on a single chip, or vector microprocessors, have become feasible [Wawrzynek et al., 1996]. Recently, researchers have studied the use of vector microprocessors for media applications, such as VIRAM [Kozyrakis, 2002] and others [Lee and Stoodley, 1998]. The performance and energy efficiency of VIRAM is shown in Table 2.1. It is able to provide

energy efficiencies competitive with DSPs at higher performance rates because of its ability to efficiently exploit DLP and its embedded memory system.

Vector processors directly exploit data parallelism by executing vector instructions such as vector adds or multiplies out of a vector register file. These vector instructions are similar to SIMD extensions in that they exploit inner-loop data parallelism in media applications; however, vector lengths are not constrained by the width of the vector units, allowing even more DLP to be exploited. Furthermore, vector memory systems are suitable for media processing because they are optimized for bandwidth and predictable strided accesses, unlike conventional processors, whose memory systems are optimized for reducing latency. For these reasons, vector processors are able to exploit significant data parallelism and compute intensity in media applications.

2.3.5 Chip Multiprocessors

Whereas vector microprocessors use SIMD execution to exploit DLP and achieve higher compute intensities, another approach to providing high arithmetic performance is chip multiprocessors (CMPs). In these solutions, multiple processor cores on the same chip each have their own thread of execution, and mechanisms for on-chip communication and synchronization are provided. Some example research CMPs include RAW [Waingold et al., 1997], Smart Memories [Mai et al., 2000], and others. Other CMPs such as the Cradle 3SOC [Cradle, 2003] and Broadcom's Calisto (formerly Silicon Spice) [Nickolls et al., 2002] have been proposed to specifically target lower-precision digital signal processing applications. During media application execution, CMPs typically use thread-level parallelism to achieve high arithmetic performance by statically assigning tasks to some subset of the available on-chip cores. They can also use SIMD execution of multiple cores to exploit data parallelism within each task. Finally, CMPs are able to exploit producer-consumer locality by passing the output of one task directly to the input of another task without accessing global or off-chip memories. For all of these reasons, CMPs are able to provide arithmetic performance significantly higher than current DSPs or microprocessors by exploiting thread-level parallelism.
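The sketch below illustrates the task-level-parallel execution style described above by mapping the depth extractor's pipeline stages onto separate threads (standing in for separate on-chip cores) connected by FIFO queues. It is an illustrative model, not code from the dissertation or from any of the CMPs cited; the stage bodies are stubbed out.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

using Row = std::vector<int>;

class Fifo {                         // thread-safe FIFO between pipeline stages
 public:
  void push(Row r) {
    std::lock_guard<std::mutex> g(m_);
    q_.push(std::move(r));
    cv_.notify_one();
  }
  Row pop() {
    std::unique_lock<std::mutex> g(m_);
    cv_.wait(g, [this] { return !q_.empty(); });
    Row r = std::move(q_.front());
    q_.pop();
    return r;
  }
 private:
  std::queue<Row> q_;
  std::mutex m_;
  std::condition_variable cv_;
};

Row convolve(const Row& r) { return r; }   // stub for the 7x7/3x3 filters
Row sad(const Row& r) { return r; }        // stub for the disparity search

int main() {
  const int kRows = 4;
  Fifo loaded, filtered;                   // queues carrying rows between stages

  std::thread loader([&] {                 // stage 1: load image rows from memory
    for (int i = 0; i < kRows; ++i) loaded.push(Row(640, i));
    loaded.push(Row{});                    // empty row marks end-of-stream
  });
  std::thread filter([&] {                 // stage 2: convolution filtering
    for (Row r = loaded.pop(); !r.empty(); r = loaded.pop())
      filtered.push(convolve(r));
    filtered.push(Row{});
  });
  std::thread depth([&] {                  // stage 3: SAD and store of depth rows
    for (Row r = filtered.pop(); !r.empty(); r = filtered.pop())
      std::printf("depth row of %zu pixels computed\n", sad(r).size());
  });

  loader.join();
  filter.join();
  depth.join();
  return 0;
}
```

Each stage works on a different row at the same time, which is exactly the pipeline form of TLP described in Section 2.1.2.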

As shown above, there is a wide variety of processors that can be used to run media applications. Special-purpose processors are inflexible, but are matched to both VLSI technology and media application characteristics. As a result, there is a large and growing gap between the performance efficiency of these fixed-function processors and programmable processors. The next section introduces stream processors as a way to bridge this efficiency gap.

2.4 Stream Processing

Stream processors are fully programmable processors that exploit the compute intensity, parallelism, and producer-consumer locality in media applications to provide performance efficiencies comparable to special-purpose processors [Rixner et al., 1998; Khailany et al., 2001; Rixner, 2001]. With stream processing, applications are expressed as stream programs, exposing the locality and parallelism inherent in media applications. A stream processor can then efficiently exploit the exposed locality with a bandwidth hierarchy of register files and can exploit the exposed parallelism with SIMD arithmetic clusters and multiple arithmetic units per cluster.

2.4.1 Stream Programming

Media applications are naturally cast as stream programs. A stream program organizes data as streams and computation as a sequence of kernels. A stream is a finite sequence of related elements. Stream elements are records, such as 21-word triangles or single-word RGBA pixels. A kernel reads from a set of input streams, performs the same computation on all elements of a stream, and writes a set of output streams.

The stereo depth extractor, when mapped into a stream program, is shown in Figure 2.2. Arrows represent streams and circles represent kernels. In this application, each stream is a row of grayscale pixels. The convolution stage of the application is broken into two kernels: a 7x7 blurring filter followed by a 3x3 sharpen filter. The resulting streams are sent to the SAD kernel, which computes the best disparity match in a row and outputs a row of pixels from a depth map.
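The following C++-style sketch shows roughly what this mapping looks like in source form. This is not the StreamC/KernelC syntax used to program Imagine; the Stream alias, the kernel names, and their signatures are hypothetical stand-ins, and the filter bodies are stubbed out.

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

template <typename T>
using Stream = std::vector<T>;   // a stream: a finite sequence of records
using Pixel  = uint8_t;

// Kernels read input streams, apply the same computation to every element,
// and write output streams.
Stream<Pixel> convolve7x7(const Stream<Pixel>& row) { return row; }  // stub
Stream<Pixel> convolve3x3(const Stream<Pixel>& row) { return row; }  // stub

Stream<Pixel> sad(const Stream<Pixel>& left, const Stream<Pixel>& right) {
  Stream<Pixel> depth;           // stand-in for the per-row disparity search
  for (std::size_t i = 0; i < left.size() && i < right.size(); ++i)
    depth.push_back(static_cast<Pixel>(std::abs(int(left[i]) - int(right[i]))));
  return depth;
}

// Stream-level program: one invocation processes one row from each camera.
// The intermediate streams exhibit producer-consumer locality: each is
// written by one kernel, read by the next, and never referenced again.
Stream<Pixel> depthRow(const Stream<Pixel>& leftRow, const Stream<Pixel>& rightRow) {
  Stream<Pixel> sharpLeft  = convolve3x3(convolve7x7(leftRow));
  Stream<Pixel> sharpRight = convolve3x3(convolve7x7(rightRow));
  return sad(sharpLeft, sharpRight);   // one row of the depth map
}
```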

Figure 2.2: Stereo depth extractor as a stream program

Stream programs expose the locality and parallelism in the algorithm to the compiler and hardware. Two key types of locality are exposed: kernel locality and producer-consumer locality. Kernel locality refers to intermediate data values that are live for only a short time during kernel execution, such as temporaries during a convolution filter computation. Producer-consumer locality refers to streams produced by one kernel and consumed by subsequent kernels. Finally, parallelism is exposed because a kernel typically executes the same kernel program on all elements of an input stream. By casting media applications as stream programs, hardware is able to take advantage of the abundant parallelism, compute intensity, and locality in media applications.

2.4.2 Stream Architecture

The Imagine stream processor architecture, which is optimized to take advantage of the application characteristics exposed by the stream programming model, is shown graphically in Figure 2.3. A stream processor runs as a coprocessor to a host executing scalar code.

Figure 2.3: Stream Processor Block Diagram

Instructions sent to the stream processor from the host are sequenced through a stream controller. The stream register file (SRF) is a large on-chip storage for streams. The microcontroller and ALU clusters execute kernels from a stream program. As shown in Figure 2.4, each cluster consists of ALUs fed by two local register files (LRFs) each, external ports for accessing the SRF, and an intracluster switch that connects the outputs of the ALUs and external ports to the inputs of the LRFs. In addition, there is a scratchpad (SP) unit, used for small indexed addressing operations within a cluster, and an intercluster communication (COMM) unit, used to exchange data between clusters. Imagine is a stream processor recently designed at Stanford University that contains six floating-point ALUs per cluster (three adders, two multipliers, and one divide-square-root unit) and eight clusters [Khailany et al., 2001], and was fabricated in a CMOS technology with 0.18 micron metal spacing rules and 0.15 micron drawn gate length.
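The rough data-structure sketch below restates the cluster organization just described (and shown in Figure 2.4). The field names and the LRF depth are illustrative assumptions, not values taken from the Imagine design itself.

```cpp
#include <array>
#include <cstdint>

constexpr int kAlusPerCluster = 6;   // Imagine: 3 adders, 2 multipliers, 1 DSQ unit
constexpr int kLrfDepth       = 16;  // assumed depth of a small two-ported LRF

struct LocalRegisterFile {           // one small LRF feeds one ALU input operand
  std::array<uint32_t, kLrfDepth> words{};
};

struct FunctionUnit {
  LocalRegisterFile operandA;        // two LRFs per ALU, one per input
  LocalRegisterFile operandB;
};

struct ArithmeticCluster {
  std::array<FunctionUnit, kAlusPerCluster> alus;
  FunctionUnit scratchpad;           // SP unit: small indexed accesses within the cluster
  FunctionUnit comm;                 // COMM unit: exchanges data with other clusters
  // The intracluster switch (not modeled here) routes ALU outputs and SRF
  // ports onto the LRF inputs under VLIW control every cycle.
};
```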

Figure 2.4: Arithmetic Cluster Block Diagram

Stream processors directly execute stream programs. Streams are loaded and stored from off-chip memory into the SRF. SIMD execution of kernels occurs in the arithmetic clusters. Although the stream processor in Figure 2.3 contains eight arithmetic clusters, in general, the stream processor architecture can contain an arbitrary number of arithmetic clusters, represented by the variable C. For each iteration of a loop in a kernel, C clusters will read C elements in parallel from an input stream residing in the SRF, perform the exact same series of computations as specified by the kernel inner loop, and write C output elements in parallel back to an output stream in the SRF. Kernels repeat this for several loop iterations until all elements of the input stream have been read and operated on. Data-dependent conditionals in kernels are handled with conditional streams which, like predication, keep control flow in the kernel simple [Kapasi et al., 2000]. However, conditional streams eliminate the extra computation required by predication by converting data-dependent control flow decisions into data-routing decisions.

Stream processors exploit parallelism and locality at both the kernel level and the application level. During kernel execution, data-level parallelism is exploited with C clusters concurrently operating on C elements, and instruction-level parallelism is exploited by VLIW execution within the clusters. At the application level, stream loads and stores can be overlapped with kernel execution, providing more concurrency. Kernel locality is exploited by stream processors because all temporary values produced and consumed during a kernel are stored in the cluster LRFs without accessing the SRF. At the application level,

producer-consumer locality is exploited when streams are passed between subsequent kernels through the SRF, without going back to external memory.

The data in media applications that exhibits kernel locality and producer-consumer locality also has high data bandwidth requirements when compared to available off-chip memory bandwidth. Stream processors are able to support these large bandwidth requirements because their register files provide a three-tiered data bandwidth hierarchy. The first tier is the external memory system, optimized to take advantage of the predictable memory access patterns found in streams [Rixner et al., 2000a]. The available bandwidth in this stage of the hierarchy is limited by pin bandwidth and external DRAM bandwidth. Typically, during a stream program, external memory is only referenced for global data accesses such as input/output data. Programs are strip-mined so that the processor reads only one batch of the input dataset at a time. The second tier of the bandwidth hierarchy is the SRF, which is used to transfer streams between kernels in a stream program. Its bandwidth is limited by the available bandwidth of on-chip SRAMs. The third tier of the bandwidth hierarchy is the cluster LRFs and the intracluster switch between the LRFs, which forwards intermediate data in a kernel between the ALUs in each cluster during kernel execution. The available bandwidth in this tier of the hierarchy is limited by the number of ALUs one can fit on a chip and the size of the intracluster switch between the ALUs.

The peak bandwidth rates of the three tiers of the data bandwidth hierarchy are matched to the bandwidth demands in typical media applications. For example, the Imagine processor contains 40 fully-pipelined ALUs and provides 2.3 GB/s of external memory bandwidth, 19.2 GB/s of SRF bandwidth, and a much larger LRF bandwidth. As discussed in Section 2.1, some media applications such as the stereo depth extractor require over 400 inherent ALU operations per memory reference. Imagine supports a ratio of ALU operations to memory words referenced of 28. Therefore, not only are stream processors in today's technology with tens of ALUs able to exploit this compute intensity, but as VLSI capacity continues to scale at 70% annually and as memory bandwidth continues to increase at 25% annually, this suggests that stream processors with thousands of ALUs could provide significant speedups on media applications without becoming memory bandwidth limited.
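The execution model described in this section, where C clusters step through a strip-mined input stream in lockstep, can be summarized with the small illustrative loop below. This is a behavioral model only, not Imagine microcode; the per-element body is a placeholder.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t C = 8;     // number of arithmetic clusters (Imagine: 8)

using Element = float;
using Stream  = std::vector<Element>;

Element kernelBody(Element x) { return x * 2.0f + 1.0f; }  // placeholder inner loop

// One kernel invocation: in every loop iteration, each of the C clusters reads
// one element from the SRF, runs the identical inner-loop code, and writes one
// result back to the SRF.
void runKernel(const Stream& srfIn, Stream& srfOut) {
  srfOut.resize(srfIn.size());
  for (std::size_t i = 0; i < srfIn.size(); i += C) {        // kernel loop iterations
    for (std::size_t c = 0; c < C && i + c < srfIn.size(); ++c) {
      // conceptually executed in lockstep by cluster c (SIMD)
      srfOut[i + c] = kernelBody(srfIn[i + c]);
    }
  }
}
```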

2.4.3 Stream Processing Related Work

The stream processor architecture described above builds on previous work in data-parallel architectures and programming models. Stream processors share with vector processors the ability to exploit large amounts of data parallelism and compute intensity, but they differ from vector processors in two key ways. First, vector processors execute simple vector instructions such as vector adds and multiplies on vectors located in the vector register file, whereas stream processors execute microcode kernels in SIMD out of the stream register file. Second, the register file storage on a stream processor is split into the stream register file and local register files. These optimizations allow stream processors to both capture producer-consumer locality in the register file hierarchy and to provide improved scalability within the arithmetic clusters with the local register files. Related work in vector processors has explored the use of partitioned register files to improve their scalability [Kozyrakis and Patterson, 2003].

Although designing a programmable architecture to directly execute stream programs is new, programming models similar to the stream model have been proposed in previous work with fixed-function processors. One example of a fixed-function processor that directly executes the stream programming model is Cheops [Bove and Watlington, 1995]. It directly maps the application data flow exposed by the stream programming model into hardware units and consists of a set of specialized stream processors where each processor accepts one or two data streams as input and produces one or two data streams as output. Data streams are either forwarded directly from one stream processor to the next according to the application's data-flow graph or transferred between memory and the stream processors. Other researchers have proposed designing signal processing systems using signal flow graphs specified in Simulink [Simulink, 2002] or other programming models [Lee and Parks, 1995] that have many similarities with the stream programming model. With these systems, signal flow graphs can be synthesized to software running on DSPs [Bhattacharyya et al., 1996; de Kock et al., 2000] or can be mapped into fixed-function processors using hardware generators [Davis et al., 2001]. Designing fixed-function processors with these techniques allows for high efficiency since available parallelism and producer-consumer

locality can easily be exploited. However, unlike programmable processors, fixed-function processors lack the flexibility to execute a wide variety of applications.

Recently, other researchers have applied these same techniques for exploiting parallelism and locality used in fixed-function processors to reconfigurable logic. Streams-C [Gokhale et al., 2000] and others [Caspi et al., 2001] have proposed mapping arithmetic kernels to blocks in FPGAs and mapping streams passed between kernels to FIFO-based communication channels between FPGA blocks. These techniques enable some degree of programmability with a high-level language and are able to exploit large amounts of parallelism in stream programs. However, this approach is inhibited by limitations in reconfigurable logic. When compared to fixed-function logic, large area and energy overheads are incurred when a design is implemented in reconfigurable logic. Furthermore, since stream programs are being spatially mapped onto a fixed resource such as an FPGA, problems arise when applications are too complex to fit onto this fixed resource.

Finally, other researchers have also studied compiling and executing the stream programming model on chip multiprocessors. StreamIt is a programming language that implements the stream model on the RAW CMP [Gordon et al., 2002]. Like hardwired stream processors, CMPs executing compiled stream programs can exploit parallelism with threads and producer-consumer locality between processors to manage communication bandwidth effectively. Like CMPs, programmable stream processors also have the ability to exploit parallelism and locality. However, since CMPs are targeted to run a wide variety of applications and rely mostly on thread-level parallelism, they contain more general control and communication structures per processor. In contrast, stream processors are targeted specifically for media applications, and therefore can use data-parallel hardware to efficiently exploit the available parallelism and a register file organization to efficiently exploit the available locality.

2.4.4 VLSI Efficiency of Stream Processors

The bandwidth hierarchy provided by a stream architecture's register file organization allows stream processors to sustain a large percentage of peak performance with very modest off-chip memory bandwidth requirements. However, the other advantage of the register

file organization is the area and energy efficiency derived from partitioning the register file storage into stream register files, arithmetic clusters, and local register files within the arithmetic clusters. This partitioning enables stream processors to scale to thousands of ALUs with modest area and energy costs.

The area of a register file is the product of three terms: the number of registers R, the number of bits per register, and the size of a register cell. Asymptotically, with a large number of ports p, each register cell has an area that grows with p² because one wire is needed in the word-line direction, and another wire is needed in the bit-line direction, per register file port. Register file energy per access follows similar trends. Therefore, a highly multi-ported register file has area and power that grow asymptotically with Rp² [Rixner et al., 2000b]. A general-purpose processor containing N arithmetic units with a single centralized register file requires approximately 3N ports (two read ports for the operands and one write port for the result per ALU). However, as N increases, working-set sizes also increase, meaning that R should also grow linearly with N. As a result, a single centralized multi-ported register file interconnecting N arithmetic units in a general-purpose microprocessor has area and power that grow with N³, and would quickly begin to dominate processor area and power. Partitioning register files is therefore necessary in order to efficiently scale to large numbers of arithmetic units per processor.

Historically, register file partitioning has been used extensively in programmable processors in order to improve scalability, area and energy efficiency, and to reduce wire delay effects. For example, the TI C6x [Agarwala et al., 2002] is a VLIW architecture split into two partitions, each containing a single multi-ported register file connected to four arithmetic units. Even in high-performance microprocessors not necessarily targeted for energy-efficient operation, such as the Alpha 21264 [Gieseke et al., 1997], register file partitioning has been used.

In the stream architecture, register file partitioning occurs along three dimensions: distributed register files within the clusters, SIMD register files across the clusters, and the stream register organization between the clusters and memory. In the remainder of this section, we explain how the register file partitioning of Imagine along these three dimensions improves area and energy efficiency and how it relates to previous work on partitioned register files.
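The asymptotic argument above can be written out explicitly; this is simply the Rp² model with the stated substitutions, constants omitted:

\[
A_{\mathrm{RF}} \;\propto\; R \, b \, p^{2}, \qquad p \approx 3N, \quad R \propto N
\;\;\Longrightarrow\;\;
A_{\mathrm{RF}} \;\propto\; N \cdot b \cdot (3N)^{2} \;\propto\; N^{3},
\]

where b is the number of bits per register. The same substitution in the energy-per-access model gives the N³ growth in register file power quoted above.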

Distributed Register Partitioning

The first register file partitioning in the stream architecture is along the ILP dimension within a cluster. Given N ALUs per cluster, the area and power of a VLIW cluster with one centralized register file connected to all of the ALUs would grow with N³, as explained above. However, by splitting this centralized multi-ported register file into an organization with one two-ported LRF per ALU input within each arithmetic cluster, the area and power of the LRFs only grow with N, and the intracluster switch connecting the ALU outputs to the LRF inputs grows with N² asymptotically. The exact area efficiency, energy efficiency, and performance when scaling N on a stream architecture will be explored in more detail in Chapter 6. The disadvantage of this approach is that the VLIW compiler must explicitly manage communications across this switch and must deal with replication of data across various LRFs [Mattson et al., 2000]. However, using asymptotic models for the area and energy of register files, Rixner et al. showed that for N = 8, this distributed register organization provides a 6.7x reduction in area and an 8.7x reduction in energy in the ALUs, register files, and switches² [Rixner et al., 2000b].

² Implementation details such as design methodology or available wiring layers would affect the efficiency advantage of certain DRF organizations. For instance, comparing the efficiency of one four-ported LRF per ALU rather than one two-ported LRF per ALU input would provide different results depending on these implementation details.

Partitioned register files in VLIW processors and explicitly scheduled communications between these partitions were proposed on a number of previous processors. For example, the TI C6x [Agarwala et al., 2002] contains two partitions with four arithmetic units per partition. In addition, a number of earlier architectures used partitioned register files of various granularities. The Polycyclic architecture [Rau et al., 1982], the Cydra 5 [Rau et al., 1989], and Transport-triggered architectures [Janssen and Corporaal, 1995] all had distributed register file organizations.

SIMD Register Partitioning

Whereas the distributed register partitioning was along the ILP dimension and was handled by the VLIW compiler, the next partitioning in the stream architecture occurs in the DLP

dimension and corresponds to the SIMD arithmetic clusters. In an architecture with C SIMD clusters, each of these clusters requires interconnecting only N/C ALUs together. Therefore, the area and energy in each cluster's intracluster switch grow much more slowly because there are many fewer ALUs per cluster. The disadvantage is that the complexity of the intercluster switch grows as the number of clusters increases. This tradeoff will be explored in more detail in Chapter 6. The other efficiency advantage of SIMD processing besides register file partitioning comes from amortizing control overhead. Only one instruction fetch unit and sequencer is required for C clusters. The area and efficiency gains achieved through SIMD-partitioned register files and by amortizing control over parallel vector lanes were first proposed in vector microprocessors, and are applied to the stream architecture register file organization as well. Furthermore, SIMD partitioning can be combined with distributed register partitioning, as demonstrated both in the Imagine stream processor and in the CODE vector microarchitecture [Kozyrakis and Patterson, 2003].

Separating the SRF Storage from Cluster Storage

The third and final partition in the stream architecture register file is a split between storage for loads and stores and storage for intermediate buffering between individual ALU operations. This is accomplished by separating the SRF storage from the LRFs within each cluster. This splitting between the SRF and LRFs has two main advantages. First, staging data for loads and stores is capacity-limited because of long memory latencies, rather than bandwidth-limited, meaning that large memories with few ports can be used for the SRF whereas the capacity of the LRFs can be kept relatively small. Second, data can be staged in the SRF as streams, meaning that accesses to the SRF will be sequential and predictable. As a result, streambuffers can be used to prefetch data into and out of the SRF, much like streambuffers are often used to prefetch data from main memory in microprocessors [Jouppi, 1990]. As explained in Section 3.2.4, these streambuffers allow accesses to a stream from each SRF client to be aggregated into larger portions of a stream before they are read or written from the SRF, leading to a much more efficient use of the SRF bandwidth and a more area- and energy-efficient design.
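The ILP and DLP partitionings trade one cost against another, and the shape of that tradeoff can be sketched with a simple extension of the earlier model. The per-term growth rates follow the discussion above; the intercluster switch term and all constants are assumptions included only to illustrate the tradeoff, not Imagine data.

    // Illustrative cost model for C SIMD clusters of N/C ALUs each, with
    // two-ported LRFs per ALU input, one intracluster switch per cluster,
    // and an intercluster switch. The ~C^2 intercluster term and the unit
    // constants are assumptions for illustration only.
    #include <cstdio>

    int main() {
        const int N = 48;                                  // total ALUs
        for (int C = 1; C <= 16; C *= 2) {
            double per_cluster = double(N) / C;
            double lrf   = C * per_cluster;                // LRFs grow ~ N overall
            double intra = C * per_cluster * per_cluster;  // C switches, each ~ (N/C)^2
            double inter = double(C) * C;                  // assumed ~ C^2 intercluster switch
            printf("C=%2d  LRFs ~ %5.0f  intracluster ~ %6.0f  intercluster ~ %4.0f\n",
                   C, lrf, intra, inter);
        }
        return 0;
    }

Under these assumptions, adding clusters shrinks the dominant intracluster switch cost rapidly while the intercluster switch grows slowly from a small base, which is the tradeoff examined quantitatively in Chapter 6.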

VLSI Efficiency Summary

The stream architecture register file organization can be viewed as a combination of the above three register partitionings. Overall, these partitions each provide a large benefit in area and energy efficiency. When compared to a 48-ALU processor with a single unified register file, a C = 8, N = 6 stream processor (eight clusters with six ALUs each) takes 195 times less area and 430 times less energy. A performance degradation of 8% over a hypothetical centralized register file architecture is incurred due to SIMD instruction overheads and explicit data transfers between partitions [Rixner et al., 2000b]. In summary, there is a large and growing gap between the area and energy efficiency of special-purpose and programmable processors on media applications. The stream architecture attempts to bridge that gap through its ability to exploit important application characteristics and its efficient register file organization.

Chapter 3

Imagine: Microarchitecture and Circuits

In the previous chapter, a stream processor architecture [Rixner et al., 1998; Rixner, 2001] was introduced to bridge the efficiency gap between special-purpose and programmable processors. A stream processor's efficiency is derived from several architectural advantages over other programmable processors. The first advantage is a data bandwidth hierarchy for effectively dealing with limited external memory bandwidth that can also exploit compute intensity and producer-consumer locality in media applications. The next advantage is SIMD arithmetic clusters and multiple arithmetic units per cluster that can exploit both DLP and ILP in media processing kernels. Finally, the bandwidth hierarchy and SIMD arithmetic clusters are built around an area- and energy-efficient register file organization. Although the previous analysis qualitatively demonstrates the efficiency of the stream architecture, in order to truly evaluate its performance efficiency, a VLSI prototype Imagine stream processor [Khailany et al., 2001] was developed so that performance, power dissipation, and area could be measured. Not only did this prototype provide a vehicle for experimental measurements, but also, by implementing a stream processor in VLSI, key insights into the effect of technology on the microarchitecture were gained. These insights were then used to study the scalability of stream processors in Chapter 6 and Chapter 7. The next few chapters discuss the Imagine prototype in detail. This chapter presents the instruction set architecture, microarchitecture, and circuits of key components from the Imagine stream processor. Chapter 4 discusses the design methodology used for Imagine, and finally, in Chapter 5, experimental results for Imagine are presented.

3.1 Instruction Set Architecture

The Imagine processor runs stream programs written in KernelC and StreamC. StreamC specifies how streams are passed between kernels and includes reads and writes from memory and I/O. KernelC contains the mathematical operations for the kernels. Software tools then compile StreamC and KernelC into instructions from the stream-level and kernel-level instruction set architectures (ISAs) for execution. StreamC compilation involves high-level data-flow analysis at the stream level including SRF allocation and memory management [Mattson, 2001; Kapasi et al., 2001]. KernelC compilation includes parsing, instruction scheduling, and managing the communication between ALUs and LRFs across the intracluster switch [Mattson et al., 2000]. Once StreamC and KernelC have been compiled, the Imagine processor directly executes instructions from the stream and kernel level ISAs described below.

3.1.1 Stream-Level ISA

There are six main stream-level instructions:

LOAD transfers streams from off-chip SDRAM to the SRF.
STORE transfers streams from the SRF to off-chip DRAM.
RECEIVE transfers streams from the network to the SRF.
SEND transfers streams from the SRF to the network.
CLUSTER OP executes a kernel in the arithmetic clusters that reads input streams from the SRF, computes output streams, and writes the output streams to the SRF.
LOAD MICROCODE loads streams consisting of kernel microcode (576-bit VLIW instructions) from the SRF into the microcontroller instruction store (a total of 2,048 instructions).

In addition to the six main instructions listed above, there are other instructions for writes and reads to on-chip control registers which are inserted as needed by the stream-level compiler. Streams must have lengths that are a multiple of eight (the number of

Figure 3.1: Imagine Arithmetic Cluster

clusters) and lengths from 0 to 8K words are supported, where each word is 32 bits. Stream instructions are fetched and dispatched by a host processor to a scoreboard in the on-chip stream controller. As will be described in Section 3.2.8, the stream controller issues stream instructions to the various on-chip units as their dependencies become satisfied and their resources become available.

3.1.2 Kernel-Level ISA

Kernel-level instructions are scheduled and assembled into VLIW instructions at compile-time, are sequenced by a microcontroller, and then are broadcast to and executed in eight SIMD arithmetic clusters. Each arithmetic cluster, detailed in Figure 3.1, contains eight functional units (plus the special JB and VAL units that are used for conditional streams [Kapasi et al., 2000]). A small two-ported local register file (LRF) connects to each input of each functional unit. An intracluster switch connects the outputs of the functional units to the inputs of the LRFs. Each function unit from Figure 3.1 executes instructions from the kernel-level instruction set, shown in Tables 3.1 and 3.2. Instructions are grouped by supported datatypes. A wide range of datatypes, from fixed-point or integer to single-precision floating-point, is supported in order to accommodate the demands of media applications. The first two columns in the Kernel ISA tables contain the instruction mnemonic and a brief summary of the operation performed. The third column specifies the latency of each operation and the fourth column specifies the supported functional unit(s). As will be explained in Section 3.2, all functional units except the DSQ unit are fully pipelined. In addition to the function unit operations and the stream input/output instructions,

Table 3.1: Kernel ISA - Part 1

Op             Description                                T    Unit

Ops for Floating-Point Data-types
FADD           Add                                        4    ADD
FSUB           Subtract                                   4    ADD
FABS           Absolute value                             1    ADD
FLT            Test <                                     2    ADD
FLE            Test <=                                    2    ADD
FTOI           Convert to int (round-to-zero)             3    ADD
FFRAC          Computes x - ftoi(x)                       4    ADD
ITOF           Convert int to floating-point              4    ADD
FMUL           Multiply                                   4    MUL
FDIV           Divide                                     17   DSQ
FSQRT          Square root                                16   DSQ

Ops for 32b, 16b, and 8b Datatypes
IADD           Add                                        2    ADD
ISUB           Subtract                                   2    ADD
IABD/UABD      Absolute difference (integer/unsigned)     2    ADD
ILT/ULT        Test < (integer/unsigned)                  2    ADD
ILE/ULE        Test <= (integer/unsigned)                 2    ADD
IEQ            Test ==                                    1    ADD
NEQ            Test !=                                    1    ADD
AND            Bitwise AND                                1    ADD
OR             Bitwise OR                                 1    ADD
XOR            Bitwise XOR                                1    ADD
NOT            Bitwise invert                             1    ADD

Ops for 32b and 16b Datatypes
IADDS/UADDS    Integer/Unsigned saturating add            2    ADD
ISUBS/USUBS    Integer/Unsigned saturating subtract       2    ADD
SHIFT          Logical shift                              1    ADD
SHIFTA         Arithmetic shift                           1    ADD
ROTATE         Rotate                                     1    ADD
IMUL/UMUL      Integer/Unsigned multiply                  4    MUL
IMULR/UMULR    Integer/Unsigned multiply & round          4    MUL

Ops for 16b Datatypes
IMULD/UMULD    Integer/Unsigned multiply (32b outputs)    4    MUL

Ops for 32b Datatypes
IDIV/UDIV      Integer/Unsigned divide                    22   DSQ
IDIVR/UDIVR    Integer/Unsigned remainder                 23*  DSQ

Table 3.2: Kernel ISA - Part 2

Op             Description                                             T    Unit

Data Movement Ops
SELECT         Multiplex based on cc                                   1    ALL
NSELECT        Multiplex based on !cc                                  1    ALL
CCTOI          Convert CC to int                                       1    ALL
SHUFFLE        Shuffle bytes                                           1    ADD
SHUFFLED       Shuffle bytes (two outputs)                             1    MUL
SPRD           Scratchpad read                                         2    SP
SPWR           Scratchpad write                                        2    SP
COMM           Cluster RF-controlled permute                           1    COM
COMMUCDATA     Same as COMM w/ UC data input                           1    COM
COMMUCPERM     UC-controlled permute                                   1    COM

Conditional Stream Ops
INIT CISTATE   Initialize JBRF entry for conditional input stream      1    JB
INIT COSTATE   Initialize JBRF entry for conditional output stream     1    JB
GEN CISTATE    Update JBRF entry with new state                        1    JB
GEN COSTATE    Update JBRF entry with new state                        1    JB
SPCRD          Conditional scratchpad read                             2    SP
SPCWR          Conditional scratchpad write                            2    SP
INIT VALID     Initialize valid unit for new conditional stream        1    VAL
GEN CCEND      Computes CC for end of conditional stream               1    VAL
GEN CCFLUSH    Computes CC for end of conditional stream               1    VAL

Stream Input/Output Ops
DATA IN        Read from input stream                                  1    IO
COND IN R      Conditional stream read - intermediate word in record   3    IO
COND IN D      Conditional stream read - last word in record           3    IO
DATA OUT       Write to output stream                                  1    IO
COND OUT R     Conditional stream write - intermediate word in record  1    IO
COND OUT D     Conditional stream write - last word in record          1    IO

Microcontroller Ops
LOOP           Branch to new PC (if last CHK was true)                 3    UC
NLOOP          Branch to new PC (if last CHK was false)                3    UC
UC DATA IN     Load immediate into microcontroller RF                  1    UC
DEC CHK UCR    Decrement and zero-check microcontroller RF value       2    UC
CHK EOS        Check for end of stream                                 2    UC
CHK ANY        Check for true CC in any cluster                        2    UC
CHK ALL        Check for true CCs in all clusters                      2    UC
SYNCH          Synchronize with stream controller                      1    UC
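To illustrate how the stream-level instructions of Section 3.1.1 and the kernel-level operations of Tables 3.1 and 3.2 relate to each other, the following C++-style sketch mimics the programming model. This is not actual StreamC or KernelC syntax; the Stream type, the scale_bias_kernel function, and the main routine are hypothetical and stand in for SRF-resident streams, a compiled kernel, and the stream-level program, respectively.

    // A C++-style sketch of the stream programming model, NOT real
    // StreamC/KernelC syntax. The Stream class and kernel are illustrative
    // assumptions only.
    #include <vector>

    struct Stream { std::vector<float> data; };   // stands in for an SRF-resident stream

    // "Kernel-level" work: per-record arithmetic that would map onto
    // FMUL and FADD operations (Table 3.1) across the eight SIMD clusters.
    void scale_bias_kernel(const Stream& in, Stream& out, float scale, float bias) {
        out.data.resize(in.data.size());
        for (std::size_t i = 0; i < in.data.size(); ++i)
            out.data[i] = in.data[i] * scale + bias;   // FMUL then FADD per element
    }

    // "Stream-level" work: moving streams and invoking kernels, analogous
    // to the LOAD, CLUSTER OP, and STORE instructions described above.
    int main() {
        Stream a, b;
        a.data.assign(1024, 1.0f);             // stands in for LOAD: memory -> SRF
        scale_bias_kernel(a, b, 2.0f, 0.5f);   // stands in for CLUSTER OP
        // b would then be written back with STORE: SRF -> memory
        return 0;
    }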

the kernel-level instructions also control register file accesses and the intracluster switch. Register file reads are handled with an address field in the kernel-level ISA, while writes require both an address field and a software pipeline stage field. Finally, the kernel-level ISA controls the intracluster switch with a bus select field for each write port. This field specifies which function unit output or input port should be written into the register file for this instruction.

3.1.3 Kernel Instruction Format

Once KernelC is mapped into instructions from the kernel-level ISA and scheduled by the VLIW compiler, instructions are then assembled into the 576-bit format specified in Figure 3.2. There are fields for nine functional units (scratchpad, ALUs, MULs, DSQ, COMM, and JB/VAL), the condition code register file (CC), explained in Section 3.2, as well as the microcontroller and eight stream input/output units (SB0:SB7). Each function unit field is further subdivided into sub-fields, containing an opcode, a CCRF read address, read addresses for both LRFs (LRF 0 Rd and LRF 1 Rd), write addresses for both LRFs (LRF 0 Wr and LRF 1 Wr), a software pipelining stage field associated with each LRF write port (LRF 0 Stg and LRF 1 Stg), and a bus select field for the LRF inputs that controls the intracluster switch (LRF 0 Bus and LRF 1 Bus). The location of function unit fields within the instruction word corresponds roughly to the floorplan placement of arithmetic units within an arithmetic cluster. This alignment reduces the length of control wires as instructions are fetched from the microcode store, decoded, and broadcast to the clusters.

3.2 Microarchitecture

In the previous chapter, the architecture of the Imagine stream processor, shown in Figure 2.3, and the basic execution of a stream processor were presented. In this section, this discussion is extended with microarchitectural details from the key components of the Imagine architecture. First, the microarchitecture and pipeline diagrams of the microcontroller and arithmetic clusters are presented. These units execute instructions from the

Figure 3.2: VLIW Instruction Format

kernel-level ISA. Next, both the stream register file microarchitecture and its pipeline diagram are described. Finally, we present the stream controller and the streaming memory system, the other major components of a stream processor.

3.2.1 Microcontroller

The microcontroller provides storage for the kernels' VLIW instructions, and sequences and issues these instructions to the arithmetic clusters during kernel execution. A block diagram of the Imagine microcontroller is shown in Figure 3.3. It is composed of nine banks of microcode storage as well as blocks for loading the microcode, sequencing instructions using a program counter, and instruction decode. Each bank of microcode storage contains a single-ported SRAM where 64 bits of each 576-bit VLIW kernel instruction are stored. Since each bank contains a 128Kb SRAM, a total of 2K instructions can be stored at one time. In order to allow for microcode to be loaded during kernel execution without a performance penalty, two instructions are read at one time from the SRAM array. The first of these instructions is passed directly to the

Figure 3.3: Microcontroller Block Diagram

instruction decoder. The second is stored in a register, so that it can be decoded in the next clock cycle without accessing the SRAM array again. The microcode loader handles the loading of kernel instructions from the SRF to the microcode storage arrays. Since microcode is read from the SRF one word at a time, and 1152 bits of microcode must be written at a time, the microcode loader reads words from a stream in the SRF, then sends them to local buffers in one of the microcode store banks. Once these buffers have all been filled, the microcode loader requests access to write two instructions into the microcode storage banks. A controller, not shown in Figure 3.3, handles this arbitration and also controls the reading of instructions from the microcode storage and intermediate registers during kernel execution. The instruction sequencer contains the program counter which is used to compute the addresses to be read from the microcode storage. At kernel startup, the program counter

is loaded with the address of the first kernel instruction, specified by the stream controller. As kernel execution proceeds, the program counter is either incremented or, on conditional branch instructions, loaded with a newly computed address. Conditional branches are handled with the CHK and LOOP/NLOOP instructions. CHK instructions store a true or false value into a register inside the instruction sequencer. Based on the value of this register, LOOP instructions conditionally branch to a relative offset specified in the instruction field. The final component of the microcontroller is the instruction decoder, which handles the squashing of register file writes, a key part of the software pipeline mechanism on Imagine. In the VLIW instruction, each register file write has a corresponding stage field, which allows the kernel scheduler to easily implement software pipeline priming and draining without a loop preamble and postamble. The kernel scheduler assigns all register file writes to a software pipelining stage, and encodes this stage in the VLIW instruction as the LRF Stg. sub-field from Figure 3.2. During loops, the instruction decoder keeps track of which stages are currently active, and squashes register file writes from inactive stages (a sketch of this mechanism follows below). In addition to squashing register file writes, the instruction decoder also provides pipeline registers and buffers for each ALU and LRF's opcodes before they are distributed to the SIMD ALU clusters. The instruction decoder also handles reads and writes from the microcontroller register file, which is used to store constants and cluster permutations in many kernels.

3.2.2 Arithmetic Clusters

As the microcontroller fetches and sequences VLIW instructions from the microcode storage, the eight SIMD arithmetic clusters on Imagine execute these instructions. As was shown in Figure 3.1, each cluster is composed of nine function units (3 ADDs, 2 MULs, 1 DSQ, 1 SP, 1 JB/VAL, 1 COM). A more detailed view of a function unit (FU) and its associated register files is shown in Figure 3.4. Most FUs have two data inputs, one condition code (cc) input, and one output bus. Data in the arithmetic clusters is stored in the LRFs. The LRFs have one read port, one write port, and 16 entries each, except for the multiplier LRFs, which have 32 entries.
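The stage-based write squashing described in the microcontroller discussion above can be sketched behaviorally as follows. The stage numbering, priming sequence, and data structures are assumptions made for illustration; only the idea of enabling one additional stage per loop iteration while priming, and squashing writes whose stage is not yet active, comes from the text (draining at loop exit is omitted for brevity).

    // Behavioral sketch of software-pipeline write squashing (illustrative).
    #include <cstdio>

    struct RegWrite { int lrf_addr; int stage; };  // from the LRF Wr / LRF Stg. sub-fields

    bool write_enabled(const RegWrite& w, int active_stages) {
        return w.stage < active_stages;            // squash writes from inactive stages
    }

    int main() {
        const int total_stages = 3;                // assumed software pipeline depth
        RegWrite writes[] = { {4, 0}, {7, 1}, {2, 2} };
        for (int iter = 0; iter < 5; ++iter) {
            // While priming, one more stage becomes active each iteration.
            int active = (iter + 1 < total_stages) ? iter + 1 : total_stages;
            for (const RegWrite& w : writes)
                printf("iter %d: write to LRF %d (stage %d) %s\n", iter, w.lrf_addr,
                       w.stage, write_enabled(w, active) ? "performed" : "squashed");
        }
        return 0;
    }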

Figure 3.4: Function Unit Details

Figure 3.5: Local Register File Implementation
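The LRF organization shown in Figure 3.5 and described in the text that follows can be summarized with a small cycle-level model. The 16-entry depth and the single read and write ports come from the text; the modeling style, method names, and same-cycle bypass formulation are illustrative assumptions, not the Imagine RTL.

    // Cycle-level behavioral sketch of a two-ported LRF with output bypass.
    #include <array>
    #include <cstdio>

    struct LocalRegisterFile {
        std::array<unsigned, 16> storage{};   // latch array in the real design
        unsigned output_reg = 0;              // registered output feeding the function unit

        // One clock edge: perform an optional write and capture the read data.
        // If the read address matches a same-cycle write, the bypass mux selects
        // the incoming write data so the FU sees it on the very next cycle.
        void clock(bool wr_en, unsigned wr_addr, unsigned wr_data, unsigned rd_addr) {
            unsigned read_val = (wr_en && wr_addr == rd_addr) ? wr_data
                                                              : storage[rd_addr];
            if (wr_en) storage[wr_addr] = wr_data;
            output_reg = read_val;            // registered LRF output
        }
    };

    int main() {
        LocalRegisterFile lrf;
        lrf.clock(true, 3, 42, 3);            // write entry 3 and read it in the same cycle
        printf("FU operand next cycle: %u\n", lrf.output_reg);  // 42, via the bypass path
        return 0;
    }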

49 CHAPTER 3. IMAGINE: MICROARCHITECTURE AND CIRCUITS 36 Latches were used as the basic storage element for the LRFs, as shown in Figure 3.5. The multiplexer before the LRF output flip flop enables register file bypassing within the LRFs so that data written on one cycle can be read correctly by the FU in the subsequent cycle. Flip flop writes can be disabled by selecting the top feedback path through the multiplexer. Each FU also contains a copy of the condition code register file (CCRF), not shown in Figure 3.1, but shown in the detailed view of Figure 3.4. Condition codes (CCs) are special data values generated by comparison instructions such as IEQ and FLT and are used with SELECT instructions and with conditional streams. Although there is only one CCRF in the ISA, each FU contains a local copy of the CCRF. During writes to the CCRF, data and write addresses are broadcast to each CCRF copy, whereas during reads, each FU reads locally from its own CCRF copy. This structure allows for a CCRF with as many read ports as there are FUs, yet does not incur any wire delay when accessing CCs shared between all of the FUs in a cluster. Finally, data is exchanged between FUs via the intracluster switch. This switch is implemented as a full crossbar where each FU broadcasts its result bus(es) to every LRF in an arithmetic cluster. A multiplexer uses the bus select field for its associated LRF write port to select the correct FU result bus for the LRF write Kernel Execution Pipeline The microcontroller and arithmetic clusters work together to execute kernels. As is typically done in most high-performance microprocessors, they operate in a pipelined manner in order to achieve higher instruction throughput. The kernel execution pipeline diagram in the microcontroller and arithmetic clusters is shown in Figure 3.6. During the first two pipeline stages, FETCH1 and FETCH2, the microcontroller instruction sequencer sends the current program counter to the microcode storage banks and the VLIW instructions are fetched from the SRAMs. During the decode and distribute stage (DECODE/DIST), instructions are decoded and broadcast to the eight arithmetic clusters. Branches are resolved and branch targets are computed in this stage, and the new program counter is computed if necessary. Since this is the third pipeline stage, two branch delay slots are added to all LOOP instructions. During REG READ, more instruction decoding

Figure 3.6: Kernel Execution Pipeline Diagram

occurs and LRFs are accessed locally in each arithmetic cluster. This is followed by the execute (EX) pipeline stages, which vary in length depending on the operation being executed. The last half-cycle of each function unit's last execute stage is used to traverse the intracluster switch, and then in the writeback (WB) stage, the register write occurs. Although the clusters are statically scheduled by a VLIW compiler and sequenced by a single microcontroller, dynamic events during execution can cause the kernel execution pipeline to stall. Stalls are caused by one of three conditions: the SRF not being ready for a write to an output stream, the SRF not being ready for a read from an input stream, or a SYNCH instruction being executed by the microcontroller for synchronization with the host processor. When one of these stall conditions is encountered, all pipeline registers in the clusters and microcontroller are disabled and writes to machine state are squashed until a later cycle when the stall condition is no longer valid. The microcontroller and arithmetic clusters work together to execute kernels from an application's stream program. They execute VLIW instructions made up of operations from the kernel-level ISA in a six-stage (or more for some operations) execution pipeline. The other main blocks in the Imagine processor are used to sequence and execute stream transfers from the stream-level ISA.
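The stall handling just described amounts to a single gating condition applied to every pipeline register. The sketch below captures that idea; the structure and names are illustrative, not taken from the Imagine design database.

    // Sketch of kernel-pipeline stall gating: when any of the three stall
    // conditions from the text holds, pipeline registers hold their values
    // and writes to machine state are squashed. Names are illustrative.
    struct StallConditions {
        bool srf_output_not_ready;   // SRF not ready for a write to an output stream
        bool srf_input_not_ready;    // SRF not ready for a read from an input stream
        bool synch_pending;          // SYNCH waiting on the host processor
    };

    inline bool pipeline_stalled(const StallConditions& s) {
        return s.srf_output_not_ready || s.srf_input_not_ready || s.synch_pending;
    }

    // Per cycle: if pipeline_stalled() is true, pipeline register enables are
    // deasserted and state writes are squashed; otherwise the pipeline advances.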

Figure 3.7: Stream Register File Block Diagram
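Figure 3.7's organization, which the next section describes in detail, time-shares one single-ported SRAM per bank among 22 streambuffers, with one streambuffer granted access per SRF cycle. The sketch below illustrates that arbitration; the simple rotating-priority policy stands in for the actual last-used-last-served scheme, and all names and the example requests are assumptions for illustration.

    // Behavioral sketch of streambuffer arbitration for the SRF (illustrative).
    #include <cstdio>

    const int NUM_CLIENTS = 22;   // 8 cluster, 8 network, 1 ucode, 1 host, 4 memory streams

    int grant(const bool request[NUM_CLIENTS], int last_granted) {
        // Rotate priority so the most recently served client has lowest priority.
        for (int i = 1; i <= NUM_CLIENTS; ++i) {
            int c = (last_granted + i) % NUM_CLIENTS;
            if (request[c]) return c;
        }
        return -1;                // no requests this cycle
    }

    int main() {
        bool req[NUM_CLIENTS] = {};
        req[0] = req[5] = req[21] = true;     // hypothetical pending requests
        int last = 0;
        for (int cycle = 0; cycle < 3; ++cycle) {
            int g = grant(req, last);
            if (g >= 0) {
                printf("cycle %d: SB %d accesses all 8 SRF banks (4 words each)\n", cycle, g);
                req[g] = false;
                last = g;
            }
        }
        return 0;
    }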

3.2.4 Stream Register File

The stream register file (SRF) provides on-chip data storage for streams. The SRF is used during execution to stage data both between stream-level memory and kernel operations and between subsequent kernel operations. As shown in Figure 3.7, the SRF is partitioned into eight parallel banks, where each bank is aligned to an associated cluster. Streams are stored in the SRF with their records strided across the eight banks: bank 0 would contain records 0, 8, 16, ..., bank 1 would contain records 1, 9, 17, ..., and so on for banks 2 through 7. Each SRF bank can store up to 4K words, for a total of 32K words. Each SRF bank contains a single-ported 128Kb SRAM and 22 streambuffer (SB) banks. The SBs are used to interface between the SRF storage and the 22 SRF clients (8 cluster, 8 network, 1 microcontroller, 1 host, 2 memory data, and 2 memory index streams) 1. Using streambuffers as these clients' interface to the SRF takes advantage of the predictable streaming nature of accesses to enable an area- and energy-efficient SRF implementation [Rixner et al., 2000b]. Clients make requests to the streambuffers to read or write elements from a stream. SBs in turn make requests to access the location in the SRF storage where that stream resides. These requests are handled by a 22:1 arbiter in SRF control. One SB is granted access per cycle and all eight banks from the chosen SB read or write 4 words into half of their local storage (each SB contains 8 words of storage to allow for double buffering). Finally, the external clients can read or write data from their associated streambuffer at a lower bandwidth. In this manner, the SBs enable the SRAM's single physical port to function as 22 logical ports, but in a more area- and energy-efficient manner than a multi-ported SRAM.

3.2.5 SRF Pipeline

Not only is kernel execution pipelined in order to provide higher instruction throughput, but the SRF is also pipelined to provide high-throughput access to the SRF storage. The pipeline diagram for the SRF is shown in Figure 3.8, and is designed to operate at half the

1 Each client accesses its streambuffers in a slightly different manner. The clusters read or write 8 words in parallel from their streambuffers, one word per cluster, for a peak supported throughput of 8 words per cycle per streambuffer. Although the network has 8 SBs, only 2 can be active on a given cycle, and only 2 words per SB can be read. All of the other streambuffers support 1 word per cycle.

Figure 3.8: SRF Pipeline Diagram

frequency of the kernel pipeline, in order to ease timing constraints, and therefore reduce overall design effort. The SRF pipeline consists of three stages: stream select (SEL), memory access (MEM), and streambuffer writeback (WB). During the SEL stage, SBs arbitrate for access to the SRAM array, and one of the SBs is granted access. Meanwhile, the arbiter state is updated using a last-used-last-served scheme to ensure fairness among SB accesses. During the MEM stage, the SB that was granted access transfers data between its local storage and the SRAM array. Finally, during the WB stage, which only occurs on SRF reads, data from the SRAM array is written locally to the eight SB banks. While the SRF storage and control operates at half speed, the SBs operate at full speed, so the WB stage only takes one additional clock cycle to complete.

3.2.6 Streaming Memory System

The streaming memory system executes stream load and store instructions from the stream-level ISA and supports up to two simultaneous instructions. The memory system is composed of two address generators (one for each instruction being executed), four memory banks, each with their own external DRAM interface (memory addresses are interleaved among the four banks), and a reordering streambuffer in the SRF. Rixner provides details

54 CHAPTER 3. IMAGINE: MICROARCHITECTURE AND CIRCUITS 41 on the memory bank and address generator microarchitecture [Rixner, 2001]. Four types of accesses are supported by the streaming memory system: sequential, strided, indexed, and bit-reversed. The address generators issue at most one word per cycle of memory read or write requests based on these access patterns to the appropriate memory banks. Within the memory banks, accesses are buffered, scheduled, and reordered by a memory controller in order to maximize utilization of the off-chip DRAM [Rixner et al., 2000a]. While the latency of individual memory accesses could increase by this reordering, the overall latency of the stream load or store is reduced since memory bandwidth is improved with this technique. Since memory accesses are issued to the DRAM out of order, stream elements read during loads are reordered when they are written back into the streambuffer within the SRF to ensure proper ordering later during kernel execution Network Interface The network interface on Imagine is used to connect the SRF to other Imagine chips in multiprocessor systems or to read or write from I/O devices. Stream send or receive instructions are used to transfer streams across the network using source routing. Four external network input channels and four output channels are supported. Each channel is able to transfer 2 bytes each clock cycle, for a total network bandwidth of 2 input words and 2 output words per cycle per node. This is matched to the total bandwidth supported by the network streambuffers. Destinations and routes are written from the host processor into an entry in the Network Routing Register File. Since source routing is used, arbitrary network topologies with up to four physical channels per node are supported. One example of a supported topology would be a two-dimensional mesh network. Streams sent across the network are packaged into 64-bit flits and virtual channel flow control is used to manage communication across the network [Dally, 1992]. When the network interface receives a header flit into its ejection queue, it signals the stream controller to start an SRF transfer. As data flits are received, they are written two words at a time into one of the eight network streambuffers. A tail flit signals the end of the stream causing all remaining data flits in the streambuffer to be written to the SRF storage. Sending network streams work in a similar manner but in

Figure 3.9: Stream Controller Block Diagram

reverse with data being read from the streambuffers and packaged into flits as they are sent into the network interface injection queue.

3.2.8 Stream Controller

The above blocks from the Imagine processor execute instructions from the stream-level ISA. These instructions are issued by the host processor during the execution of stream programs. However, since the execution time of stream instructions is dynamically dependent on stream lengths, memory access patterns, and kernel code, dynamic scheduling of stream instructions is important in order to provide high utilization in both the memory system and arithmetic clusters. The stream controller handles this dynamic scheduling of stream instructions. A block diagram of the stream controller is shown in Figure 3.9. Stream instructions sent by the host processor are written into one of the 32 entries in the operation buffer. Along with the instruction, the host processor sends bitmasks that specify dependencies between this instruction and the other instructions currently in the operation buffer. This information is separated from the actual operation and is stored in the scoreboard. Meanwhile, a resource analyzer monitors status bits from the execution units and sends this information to the scoreboard as well. When a stream instruction's required resources

56 CHAPTER 3. IMAGINE: MICROARCHITECTURE AND CIRCUITS 43 are free and its dependencies have been satisfied, it makes a request to an arbiter to be issued this cycle. One instruction is granted access and is sent from the operation buffer to the issue and decode logic. The issue and decode logic converts the instruction into control information that start the stream instruction in the individual execution units. A stream controller register file (SCTRF) is used to transfer scalar data such as stream lengths and scalar outputs from kernels if necessary. Once the stream instruction execution completes, its scoreboard entry is freed and subsequent instructions dependent on it can be issued. By using dynamic scheduling of stream instructions, the stream controller ensures that stream execution units can stay highly utilized. This allows Imagine to exploit task-level parallelism by efficiently overlapping memory operations and kernel operations. Furthermore, the 32-entry operation buffer also allows the host processor to work ahead of the stream processor since the host can issue up to 32 stream instructions until it is forced to stall waiting for more scoreboard entries to be free. This buffering mitigates any effect the latency of sending stream instructions to the stream processor would have on performance. 3.3 Arithmetic Cluster Function Units In this section, the design of the arithmetic cluster function units will be discussed. These function units execute the kernel-level instruction set from Table 3.1 and Table 3.2 and were developed with a number of design goals in mind, including low area, high throughput, low design complexity, low power, and low operation latency ALU Unit Each cluster contains three ALU units that execute the addition, shift, and logical instructions listed in Table 3.1. Many of these instructions include support for floating-point, 32-bit integer, dual 16-bit, and quad 8-bit instructions. A block diagram of the ALU is shown in Figure It is divided into three major functional sub-blocks, corresponding to pipeline stages in the execution of four-cycle operations. The ALU X1 sub-block executes integer shifts, logical operations, and the alignment shift portion of floating-point

additions. The ALU X23 sub-block contains two pipeline stages and implements integer additions and the addition portion of floating-point adds. Rounding also occurs in the ALU X23 stage during floating-point adds. Finally, the ALU X4 sub-block executes a normalizing shift operation. Operations requiring floating-point additions, such as FADD, FSUB, and others, are 4-cycle operations and therefore use all three major sub-blocks. The ALU supports floating-point arithmetic adhering to the IEEE 754 standard, although only the round-to-nearest-even rounding mode is implemented and support for denormals and NaNs is limited [Coonen, 1980]. Additions supporting this standard can be implemented with an alignment shifter, a carry-select adder for summing the mantissas and doing the rounding, and a normalizing shifter [Goldberg, 2002; Kohn and Fu, 1989]. This basic architecture was used in the ALU unit. Floating-point operands are composed of a sign bit, eight bits for an exponent, and 23 bits for a fraction with an implied leading one. In the ALU X1 block, a logarithmic shifter [Weste and Eshraghian, 1993] is used to shift the operand with the smaller exponent to the right by the difference between the two exponents. If the sign bits of the two operands are different, then the shifted result is also bitwise inverted, so that subtraction rather than addition will be computed in the ALU X23 stage. Furthermore, both the unshifted and shifted fractions are then shifted to the left by two bits such that the leading one of the unshifted operand is at bit position 25 in the datapath (there are 32 bit positions numbering 0 to 31). This is necessary because guard, round, and sticky bits must also be added into the two operands in the ALU X23 stage [Goldberg, 2002; Santoro et al., 1989]. In the ALU X23 stage, the shifted and unshifted operands are added together using a carry-select adder [Goldberg, 2002]. A block diagram of this adder is shown in Figure 3.11. For each byte in the result, the adder computes two additions in parallel, one assuming the carry-in to the byte was zero and the other assuming it was one. Meanwhile, a two-level tree computes the actual carry-ins to each byte. For integer additions, the carry-ins are based on the results of the group PGKs, the operation type, and the result sign bits. For floating-point adds, the carry-ins are based on the group PGKs and the overflow bit. 32-bit integer and lower-precision subword data-types also use the carry-select adder in

Figure 3.10: ALU Unit Block Diagram

Figure 3.11: Segmented Carry-Select Adder

the ALU X23 stage to compute fast additions, subtractions, and absolute difference computations. During these operations, the adder also computes two additions in parallel for each byte, but the global carry chain takes into account both the data-type and the operation being executed to determine whether the carry-in to each byte should be zero or one. Furthermore, when a subtraction occurs, the B operand must be inverted (not shown in the figure). Using this carry-select adder architecture, it was possible to design one adder that could be used for floating-point, 32-bit, 16-bit, and 8-bit additions and subtractions with little additional area or complexity over an adder that supports only integer additions.

3.3.2 MUL Unit

Like the ALU, the MUL unit also executes both floating-point and integer operations. A block diagram of the MUL unit is shown in Figure 3.12. There are two MUL units per cluster. Each unit has four pipeline stages and uses radix-4 Booth encoding [Booth, 1951]. Since operands are up to 32 bits long, with radix-4 encoding, 16 partial products must be summed together. These partial products are summed using an architecture based around two half arrays [Kapadia et al., 1995] followed by a 7:2 combiner. In the first pipeline stage, the multiplier operand is analyzed by the Booth encoder and

Figure 3.12: MUL Unit Block Diagram

control information is sent to the two half arrays. Based on this control information, each partial product contains a shifted version of -2, -1, 0, 1, or 2 times the multiplicand, which can easily be computed with a few logic gates per bit and a 1-bit shifter within the half arrays. Once the partial products have been computed, each half array sums eight of the partial products with 6 rows of full adders. The first row combines 3 of the partial products and each of the other 5 rows adds in one more partial product. Three of these additions occur in the X1 pipeline stage and the other three occur in the X2 stage. Once the half arrays have summed the 8 products, each half array sends two 48-bit outputs to a 7:2 combiner. This combiner sums these four values with three other buses from the sign extension and two's complement logic. These three buses ensure a correctly sign-extended result and also add a one into the lsb location of partial products that were -2 or -1 times the multiplicand during Booth encoding. To keep the half arrays modular and simple, this occurs here rather than in the half arrays. The 7:2 combiner is implemented with 5 full adders: three of the adders are in the X2 stage and two are in the X3 stage. The 7:2 combiner outputs two 64-bit buses that are converted back into non-redundant form with a 64-bit carry-select adder. Its architecture is similar to the 32-bit integer adder shown in Figure 3.11, but is extended to 64 bits. The adder spans two pipeline stages: the actual additions and carry propagation occur in X3 while the carry select and final multiplexing occur in X4. This result is then analyzed and sent through muxes which handle alignment shifting during floating-point operations and saturation during some integer operations before it is buffered and broadcast across the intracluster switch. Like the ALU unit, the MUL unit is also designed to execute 16-bit, 32-bit, and floating-point multiplications. 8-bit multiplications were not implemented to reduce design complexity. During floating-point or 32-bit multiplications, the multiplier operates as described above. However, during 16-bit multiplications, some parts of the multiplier half array must be disabled, otherwise partial products from the upper half-word would be added into the result from the lower half-word and vice versa. To avoid this problem, a mode bit is sent to both half arrays so that during 16-bit operation, the upper 16 bits of the multiplicand are set to zero in the lower half array and the lower 16 bits of the multiplicand are set to zero in the upper half array. Although lower-latency 16-bit multiplications could be achieved by summing fewer partial products together, this optimization was not made in order to minimize

Figure 3.13: DSQ Unit Block Diagram

unnecessary design complexity and wiring congestion.

3.3.3 DSQ Unit

The DSQ unit supports floating-point divide and square root operations, as well as integer divide and remainder functions. Its block diagram is shown in Figure 3.13 and is based around a radix-4 SRT iterative divide algorithm [Goldberg, 2002]. The DSQ unit is split into four parts: a pre-processor, two cores, and a post-processor. In the pre-processor, operands are converted to internal formats used by the core, requiring 1 cycle for floating-point operations and 2 cycles for integer operations. These results are then passed to one of two cores, which takes 13 to 17 cycles depending on the operation and the data-type to execute the iterative SRT algorithm. Each cycle, the core processes 2 bits of the operands starting with the most significant bits, and continues to iterate until it has processed the least significant bits. Its output is sent in carry-save redundant form to the post-processor which performs several additions in order to compute the final quotient 2. Unlike the ALU and

2 An alignment shift is also required when computing the remainder.

63 CHAPTER 3. IMAGINE: MICROARCHITECTURE AND CIRCUITS 50 MUL units, the DSQ is not fully pipelined, but more than one operation can be executed concurrently because once an operation has passed through the pre-processor and into one of the cores, a new operation can be issued and executed in the other core as long as the operations will not conflict in the post-processor stage SP Unit While the ALU, MUL, and DSQ units support all of the arithmetic operations in a cluster, several important non-arithmetic operations are supported by the SP, COMM, and JB/VAL units. The scratchpad (SP) unit provides a small indexable memory within the clusters. This 256-word memory contains one read port and one write port and supports base plus index addressing, where the base is specified in the VLIW instruction word and the index comes from a local LRF. This allows small table lookups to occur in each cluster without using LRF storage or sacrificing SRF bandwidth COMM Unit The next non-arithmetic function unit is the COMM unit. It is used to exchange data between the clusters when kernels are not completely data parallel. The COMM unit is implemented with 9 32-bit repeatered buses that transmit data broadcast from all eight clusters and the microcontroller. Each cluster COMM unit then contains a 9:1 multiplexer which selects which of these buses should be selected and output across the intracluster switch JB/VAL Unit The last cluster function unit is the JB/VAL unit. It is used in coordination with the SP and COMM units to execute conditional streams [Kapasi et al., 2000]. During the execution of conditional input or output streams, condition codes in each cluster specify whether that cluster should execute a conditional input or output on this loop iteration. The COMM unit is used to route data between clusters so that a cluster requesting the next element of a conditional stream will read or write from the correct streambuffer bank. The SP unit is

64 CHAPTER 3. IMAGINE: MICROARCHITECTURE AND CIRCUITS 51 Table 3.3: JB/VAL Operation for Conditional Output Streams Clusters Loop Iteration 1 Condition codes COMM source cluster X X X Next cluster pointer 5 Ready bit 0 Loop Iteration 2 Condition codes COMM source cluster X X X X 7 Next cluster pointer 1 Ready bit 1 Loop Iteration 3 Condition codes COMM source cluster X X X X X 2 0 X Next cluster pointer 3 Ready bit 0 used as a double buffer in order to stage data between the streambuffers and the COMM unit. Finally, the JB/VAL functional unit manages the control wires that are sent to the streambuffers, the COMM unit, and the SP unit in each cluster during conditional streams. To explain the operation of conditional output streams, consider the example shown in Table 3.3. In this example, single-word records are assumed, so there are five instructions involved with each conditional output stream during each loop iteration: GEN COSTATE, COMM, SPCWR, SPCRD, and COND OUT D. During the first iteration through the loop, condition codes specify that only five clusters have valid data to send to the output stream. In each cluster, the GEN COSTATE instruction in the JB/VAL unit reads these condition codes and computes a COMM source cluster, a next cluster pointer, and a ready bit (the values for the next cluster pointer and the ready bit are the same across all eight clusters). In this case, the five clusters with valid data (clusters 1, 2, 3, 5, and 6) will send their data to the first five clusters (0 through 4). When the COMM is executed, each cluster uses its

65 CHAPTER 3. IMAGINE: MICROARCHITECTURE AND CIRCUITS 52 COMM source cluster value to read the appropriate data from the intercluster switch and buffers this data locally in its scratchpad using SPCWR. The next cluster pointer keeps track of where the next valid element should be written during subsequent loop iterations. The ready bit keeps track of whether eight new valid elements have been accumulated and should be written to the output streambuffer from the scratchpad. During the first loop iteration, since only five valid elements have been stored in the scratchpad, the next cluster pointer is set to 5 and the ready bit is set to zero. When the SPCRD and COND OUT D are executed this loop iteration, the write to the streambuffer is squashed because the ready bit was set to zero. During the second iteration, four clusters have valid data. In this case, when the JB/VAL unit executes GEN COSTATE, it uses the next cluster pointer (set to 5 by the previous iteration) and new condition codes to compute the source clusters to be used during the COMM. Again, the data is buffered locally in the scratchpad with SPCWR. However, this time since eight valid elements have been accumulated across the clusters (five from the first iteration and three from the second), the ready bit is set to one. When the COND OUT D instruction is executed, these eight values stored in the scratchpad are written to the output streambuffer. Double buffering is used in the scratchpad so that the values written into cluster 0 during the first two iterations do not conflict. The third and final iteration in the example contains only two valid elements from clusters 0 and 2, and in this case, those elements are written into clusters 1 and 2. Subsequent iterations continue in a similar manner, with the JB/VAL unit providing the control information for the streambuffers, COMM unit, and SP unit. Figure 3.14 shows the circuit used in the JB/VAL unit to compute the COMM source cluster. Each cluster computes this by subtracting the next cluster pointer from its cluster number, then using that difference to select one of the source clusters with a valid CC. For example, if the difference were three, then this cluster is looking for the third cluster starting from cluster 0 with a CC set to 1. The selection occurs by converting the 3-bit difference into a one-hot 8-bit value, then using each CC to conditionally shift this one-hot value by one position. Once enough valid CCs have been encountered, the lone one in the one-hot value will have been shifted off the end. Since only one row will shift a one off the end, the COMM source index can be easily computed by encoding the bits shifted off the end back into

Figure 3.14: Computing the COMM Source Index in the JB/VAL unit

67 CHAPTER 3. IMAGINE: MICROARCHITECTURE AND CIRCUITS 54 Table 3.4: Function Unit Area and Complexity Unit Quantity Area Area Standard Cell Area (mm 2 ) (grids) Cells (NAND2s) 16-word LRF K CCRF K ALU K MUL K DSQ K SP K N/A N/A Cluster Crossbar K COMM Switch K SB Bank K SRF Bank K N/A N/A Microcode Store K N/A N/A a 3-bit value. The computations required for the next cluster pointer and ready bit are not shown in Figure 3.14, but they can be computed by simply adding the eight 1-bit CC values together. In addition to computing the COMM source index, next cluster pointer, and ready bit, the JB/VAL unit also keeps track of when the stream ends. This is necessary for padding streams when the total number of valid elements in a conditional stream is not a multiple of the number of clusters. Conditional input streams function similarly to conditional output streams, except buffering in the scratchpad occurs before traversing the intercluster switch rather than vice versa. 3.4 Summary The six function units described above along with the LRFs, CCRFs, and intracluster switch are the components of an arithmetic cluster on the Imagine stream processor. The main design goals for these arithmetic units were low design complexity, low area, high throughput, and low power. Although latency was important to keep limited to a reasonable value, it

was not a primary design goal. As described in the next chapter, these arithmetic cluster components were implemented in a standard cell CMOS technology with 0.15 micron drawn gate length transistors and five layers of Aluminum with metal spacing typical of a 0.18 micron process. For a number of these arithmetic units and other key components, Table 3.4 shows their silicon area (both in mm^2 and in wire grids 3), number of standard cells, and total standard cell area if additional area required for wiring between standard cells is discounted (normalized to the area of a NAND2 standard cell). In summary, the ISA, microarchitecture, and functional unit circuits from the Imagine stream processor are designed to directly execute stream programs in an area- and energy-efficient manner. The next two chapters will describe the design methodology and performance efficiency results achieved when this microarchitecture was implemented in modern VLSI technology.

3 A wire grid in this process is 0.40 square microns.
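The two area units used in Table 3.4 can be converted using the 0.40 square-micron wire grid from the footnote above. The example block area below is made up purely to show the arithmetic.

    // Converting between mm^2 and wire grids (0.40 um^2 per grid, per the
    // footnote). The 0.1 mm^2 example area is hypothetical.
    #include <cstdio>

    int main() {
        const double um2_per_grid = 0.40;     // from the footnote
        const double um2_per_mm2  = 1.0e6;
        double area_mm2   = 0.1;              // hypothetical block area
        double area_grids = area_mm2 * um2_per_mm2 / um2_per_grid;
        printf("%.2f mm^2 = %.0f wire grids\n", area_mm2, area_grids);  // 250000 grids
        return 0;
    }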

Chapter 4

Imagine: Design Methodology

To demonstrate the applicability of the Imagine stream processor to modern VLSI technology, a prototype Imagine processor was designed by a collaboration between Stanford University and Texas Instruments (TI). Stanford completed the microarchitecture specification, logic design, logic verification, floorplanning, and cell placement. TI completed the layout and layout verification. Imagine was implemented in a standard cell CMOS technology with 0.15 micron drawn gate length transistors and five layers of Aluminum with metal spacing typical of a 0.18 micron process. The key challenge with the VLSI implementation of Imagine was working with the limited resources afforded by a small team of fewer than five graduate students, yet without sacrificing performance. In total, the final Imagine design included 701,000 unique placeable instances and achieved a cycle time of 45 fan-out-of-4 inverter delays, as reported by static timing analysis tools. This was accomplished with a total design effort of 11 person-years on logic design, floorplanning, and placement, significantly smaller than the design effort typical of comparable industrial designs [Malachowsky, 2002]. This chapter provides an overview of the design process and experiences for the Imagine processor. Section 4.1 presents the design schedule for Imagine, followed by background on the standard-cell design methodologies typically used for large digital VLSI circuits in Section 4.2. Section 4.3 introduces a tiled region design methodology, the approach used for Imagine, where the designer is given fine-grained control over placement of small regions of standard cells in a datapath style. Finally, the clocking and verification

methodologies used for Imagine are presented.

4.1 Schedule

By summer 1998, the Imagine architecture specification had been defined and a cycle-accurate C++ simulator for Imagine was completed and running. In November 1998, logic design had begun with one Stanford graduate student writing the RTL for an ALU cluster. By December 2000, the team working on Imagine implementation had grown to five graduate students and the entire behavioral RTL model for Imagine had been completed and functionally verified. The Imagine floorplanning, placement, and layout were carried out by splitting the design into five unique subchips and one top-level design. In November 2000, the first trial placement of one of these subchips, an ALU cluster, was completed by Stanford. By August 2001, the final placement of all five subchips and the full-chip design was complete and Stanford handed the design off to TI for layout and layout verification. In total, between November 1998 when behavioral RTL was started and August 2001 when the placed design was handed off to TI, Stanford expended 11 person-years of work on the logic design, floorplanning, and placement of the Imagine processor. Imagine parts entered a TI fab in February 2002. First silicon was received in April 2002, and full functionality was verified in the laboratory in subsequent months.

4.2 Design Methodology Background

Typically, with a small design team, an ASIC design methodology is used. This is in contrast to a full-custom design methodology, which is used for more aggressive designs targeting higher clock rates. Although there can be more than a factor-of-3 difference in both area and performance between custom and ASIC designs [Dally and Chang, 2000] [Chinnery and Keutzer, 2000] [Chang, 1998], with the small size of the Stanford design team, using a full-custom design methodology was not possible. Hence, logic design was restricted to using gates and register elements from a standard-cell library.

Figure 4.1: Standard ASIC Design Methodology (RTL, wire models, and the cell library feed synthesis; the resulting netlist is placed and routed to layout; extracted R and C feed timing analysis, with slow paths fed back for manual redesign).

Figure 4.1 shows the typical ASIC tool flow. RTL is written in a hardware description language such as Verilog and is mapped to a standard-cell library with a logic synthesis tool such as Synopsys Design Compiler [Synopsys, 2000a]. Wire lengths are estimated from statistical models and timing violations are fixed by resynthesizing with new timing constraints or by restructuring the logic. After pre-placement timing convergence, designs are then passed through an automatic place and route tool, which usually uses a timing-driven placement algorithm. After placement, wire lengths from the placed design are extracted and back-annotated to a static timing analysis (STA) tool. However, when actual wire lengths do not match the pre-placement statistical wire-length predictions, timing problems can arise and lead to costly design iterations, shown in the bottom feedback loop of Figure 4.1.

Recent work in industry and academia has addressed many of the inefficiencies in ASIC flows. This work can be grouped into two categories: improving timing convergence and incorporating datapath-style design in ASIC flows. Physically-aware synthesis approaches [Synopsys, 2000b] attempt to address the shortcomings of timing convergence in traditional flows by concurrently optimizing the logical and physical design, rather than relying on statistically-based wire-length models. The principal benefit of these techniques is to reduce the number of iterations required for timing convergence and, as a result, to deliver modest improvements in timing performance and area.

Datapaths are examples of key design structures that ASIC flows handle poorly. There are three limitations. First, aggregating many simple standard cells to create a complex function is inefficient.

Second, the typical logical partitions (functional) often differ from the desirable physical partitions (bit-slices). Finally, since the correct bit-sliced datapath solution is very constrained, small errors in placement and routing during automated optimization can result in spiraling congestion and can quickly destroy the inherent regularity.

When developing the design methodology for Imagine, the goal was to keep the inherent advantages of standard-cell design, but to eliminate some of the inefficiencies of ASIC methodologies by retaining datapath structure. Many researchers have demonstrated that identifying and exploiting regularity yields significant improvements in density and performance for datapath structures in comparison to standard ASIC place and route results [Chinnery and Keutzer, 2002]. In particular, researchers have shown numerous automated techniques for extracting datapath structures from synthesized designs and doing datapath-style placement [Kutzschebauch and Stok, 2000] [Nijssen and van Eijk, 1997] [Chowdhary et al., 1999]. However, widespread adoption of these techniques into industry-standard tools had not yet occurred by the time the VLSI design for the Imagine processor was started.

4.3 Imagine Design Methodology

Given the small size of the Stanford design team and the need to interface with industry-standard tools, the design methodology for Imagine was constrained to use the basic tool flow shown in Figure 4.1. However, a large percentage of the logic in Imagine consists of structured arrays and arithmetic units that could benefit from datapath-style placement. To take advantage of this datapath regularity and to expedite timing convergence, this tool flow was modified. Physically-aware synthesis techniques were not available when the VLSI design was carried out, so a tiled region design methodology was used instead. This methodology provides gate-density advantages similar to those of the techniques presented in Section 4.2 for doing datapath-style design in a standard-cell technology.

The total Imagine design contains 1.78 million standard cells. However, many of these standard cells are parts of large blocks which are repeated many times, such as arithmetic units or register files. In order to leverage this modularity, and to reduce the maximum design size handled by the CAD tools, the Imagine design was partitioned into five subchips and one top-level design.

In the ASIC methodology used on Imagine, flat placement within a subchip is used, where all of the standard cells in each subchip are placed at once. This is in contrast to hierarchical placement techniques, where subcomponents of a subchip are placed first and larger designs are built from smaller sub-designs. After routing each subchip, the top-level design then includes instances of the placed and routed subchips as well as additional standard cells. Table 4.1 shows the number of instances, the gate area in equivalent NAND2 gates, and the number of copies of each of the five subchips: the ALU cluster (CLUST), the micro-controller (UC), the stream register file (SRF), the host interface / stream controller / network interface (HISCNI), and the memory bank (MBANK). Each of these subchips corresponds directly to units in Figure 2.3 except the MBANK. The streaming memory system is composed of 4 MBANK units: 1 per SDRAM channel. Also shown is the top-level design, which includes glue logic between subchips and I/O interfaces.

Table 4.1: Subchip statistics

  Subchip      Instances   Gate Area   #
  CLUST        130,...     ...K        8
  UC           27,000      27K         1
  SRF          314,...     ...M        1
  HISCNI       98,...      ...K        1
  MBANK        57,...      ...K        4
  Top Level    75,...      ...K        1
  Full Chip    701,...     ...M        1

In addition to the gates listed in Table 4.1, some of the subchips also contain SRAMs instantiated from the TI ASIC library. The UC contains storage for VLIW instructions, organized as 9 banks of single-ported, 1024-word, 128-bit SRAMs. The SRF contains 128 KBytes of storage for stream data, organized as 8 banks of single-ported, 1024-word, 128-bit SRAMs. There is a dual-ported, 256-word, 32-bit SRAM in each ALU cluster for scratchpad memory. Finally, the HISCNI subchip contains SRAMs for input buffers in the network interface and for stream instruction storage in the stream controller.
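As a concrete illustration of this memory organization, the following behavioral Verilog sketch models one of the eight single-ported SRF storage banks. The module and signal names are illustrative only; on Imagine the banks were compiled SRAM macros instantiated from the TI ASIC library rather than synthesized logic.

  module srf_bank (
    input              clk,
    input              en,      // bank enable: at most one access per cycle (single-ported)
    input              we,      // 1 = write, 0 = read
    input        [9:0] addr,    // 1024 words per bank
    input      [127:0] wdata,   // 128-bit words match the streambuffer width
    output reg [127:0] rdata
  );
    // 1024 words x 128 bits = 16 KBytes; eight such banks give the 128-KByte SRF.
    reg [127:0] mem [0:1023];

    always @(posedge clk) begin
      if (en) begin
        if (we)
          mem[addr] <= wdata;
        else
          rdata <= mem[addr];   // registered read through the single access port
      end
    end
  endmodule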

Several of the subchips listed above benefit from using datapath-style design. Specifically, each ALU cluster contains six 32-bit floating-point arithmetic units and fifteen 32-bit register files. Exploiting the datapath regularity for these units keeps wire lengths within a bitslice very short, which in turn leads to smaller buffers, and therefore a more compact design. In addition, control wires are distributed across a bitslice very efficiently since cells controlled by the same control wires can be optimally aligned. The SRF, which contains 22 8-entry, 256-bit streambuffers, also benefits from the use of datapaths. The 256 bits in the streambuffers align to the 8 clusters' 32-bit-wide datapaths, keeping wires predictable and short and allowing for efficient distribution of control wires.

Figure 4.2: Tiled Region Design Methodology (the standard ASIC flow of Figure 4.1, augmented with a structured RTL model, a floorplan of structured regions, and a wire plan for key wires that supplies short-wire models, placement constraints, and wire loads to synthesis and timing analysis).

The basic tiled-region flow used on Imagine is shown in Figure 4.2. It is similar to the typical ASIC methodology shown previously in Figure 4.1. However, several key additional steps, shown in gray, have been added in order to allow for datapath-style placement and to reduce costly design iterations. First, in order to make sure that datapath structure is maintained all the way through the flow, two RTL models were used. A second RTL model, labeled structured RTL, was written; it is logically equivalent to the behavioral RTL, but contains additional logical hierarchy. Datapath units such as adders, multipliers, and register files contain submodules that correspond to datapath bitslices. These bitslices correspond to a physical location along the datapath called a region. Regions provide a hard boundary during placement, meaning cells assigned to a region will only be placed within the associated datapath bitslice. Regions are often used in typical ASIC design methodologies in order to provide constraints on automatic place and route tools, but the tiled-region flow has a much larger number of smaller regions (typically 10 to 50 instances per region) when compared to timing-driven placement flows.

In addition to the floorplanning of regions, the subchip designer must also take into account the wire plan for a subchip. The wire plan involves manually annotating all wires of length greater than one millimeter with an estimated capacitance and resistance based on the wire length between regions. By using these manual wire-length annotations during synthesis and timing analysis runs, the statistical wire models generated during synthesis are restricted to short wires. Manual buffers and repeaters were also inserted in the structured RTL for long wires; a small illustrative sketch of this appears after Figure 4.3. With wire planning, pre-placement timing more closely matches post-placement timing with annotated wire resistance and capacitance.

A more detailed view of the floorplanning and placement portion of the tiled-region methodology is shown in Figure 4.3. Consider an 8-bit adder. It would be modeled with the statement y=a+b in behavioral RTL. In the structured RTL, however, it is split up by hand into bitslices as shown in Figure 4.3. The structured RTL is then either mapped by hand or synthesized into a standard-cell netlist using Synopsys Design Compiler [Synopsys, 2000a]. In conjunction with the netlist generation, before placement can be run, floorplanning has to be completed. In the tiled-region design methodology, this is done by writing a tile file. An example tile file containing two 8-bit adders is shown in the upper right of Figure 4.3. The tile file contains a mapping between the logical hierarchy in the standard-cell netlist and a bounding box on the datapath given in x-y coordinates. The example tile file shows how the eight bitslices in each adder would be tiled if the height of each bitslice were 30 units. Arbitrary levels of hierarchy are allowed in a tile file, allowing one to take advantage of modularity in a design when creating the floorplan. In this example, two levels of hierarchy are used, so cells belonging to the adder_1/slice5 region would be placed in the bounding box given by 40 < x < 80 and 150 < y < 180. Once the floorplan has been completed using a tile file, it is then passed through a tool developed by Stanford called tileparse. Tileparse flattens the hierarchy of the tile file and outputs scripts which are later run by the placer to set up the regions. Once the regions have been set up, but before running placement, the designer can look at the number of cells in a region and iterate by changing region sizes and shapes until a floorplan that fits is found.

Figure 4.3: Tiled Region Floorplanning Details (the structured RTL is synthesized into a standard-cell netlist while the tile file describing the floorplan is processed by tileparse; the resulting Create_groups.scr and Place_groups.scr scripts drive region-based placement). The structured RTL and tile file for the two-adder example are reproduced below.

  // Structured RTL: an 8-bit adder split by hand into bitslices
  module adder (a, b, y);
    input  [7:0] a, b;
    output [7:0] y;
    wire   [6:0] c;
    ...
    adder_slice slice3 (a[3], b[3], c[2], c[3], y[3]);
    adder_slice slice4 (a[4], b[4], c[3], c[4], y[4]);
    ...
  endmodule

  module adder_slice (a, b, ci, co, y);
    input  a, b, ci;
    output co, y;
    assign y  = a ^ b ^ ci;
    assign co = (a & b) | (a & ci) | (b & ci);
  endmodule

  // Tile file: two 8-bit adders, each bitslice mapped to a 40-by-30-unit region
  Module adder {
    region slice0 x1=0 x2=40 y1=0   y2=30
    region slice1 x1=0 x2=40 y1=30  y2=60
    region slice2 x1=0 x2=40 y1=60  y2=90
    region slice3 x1=0 x2=40 y1=90  y2=120
    region slice4 x1=0 x2=40 y1=120 y2=150
    region slice5 x1=0 x2=40 y1=150 y2=180
    region slice6 x1=0 x2=40 y1=180 y2=210
    region slice7 x1=0 x2=40 y1=210 y2=240
  }
  inst adder adder_0 x=0  y=0
  inst adder adder_1 x=40 y=0
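To illustrate the manual buffering of long wires mentioned in the wire-planning discussion above, the following sketch shows how an explicitly repeated bus between two regions might be written in the structured RTL. This is only a minimal illustration under assumed names: the modules and signals are invented here, and on Imagine the repeaters would be buffer cells instantiated from the TI standard-cell library, with the route also annotated with estimated wire resistance and capacitance for synthesis and timing analysis.

  // A stand-in for a high-drive standard-cell buffer.
  module rep_buf #(parameter W = 32) (input [W-1:0] a, output [W-1:0] y);
    assign y = a;
  endmodule

  // A long (greater than 1 mm) 32-bit route broken into segments by repeaters.
  // Each rep_buf instance would be assigned to its own small region in the tile
  // file so that it is placed roughly where the wire plan assumes it will be.
  module long_bus_route #(parameter W = 32) (
    input  [W-1:0] din,
    output [W-1:0] dout
  );
    wire [W-1:0] seg1, seg2;
    rep_buf #(W) u_rpt0 (.a(din),  .y(seg1));
    rep_buf #(W) u_rpt1 (.a(seg1), .y(seg2));
    assign dout = seg2;
  endmodule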

Finally, the Avant! Apollo-II automatic placement and global route tool [Chen, 1999] is used to generate a trial placement of the whole subchip. These steps are then iterated until a floorplan and placement with satisfactory wiring congestion and timing have been achieved. The steps following placement in the tiled-region design methodology do not differ from the typical ASIC design methodology.

4.4 Imagine Implementation Results

Table 4.2 shows the placement results for the subchips and the top-level design. Standard-cell occupancy is given as the ratio of standard-cell area to placeable area. Area devoted to large power buses or SRAMs is not considered placeable area. Tiled-region placement was used on all of the subchips except for the smaller MBANK subchip, which did not have logic conducive to datapath-style placement.

Table 4.2: Imagine placement results

  Subchip      Occ      mm^2    # Regions   Placement
  CLUST        65.1%    ...     ...,556     Tiled-Region
  UC           56.3%    ...     ...         Tiled-Region
  SRF          54.5%    ...     ...,640     Tiled-Region
  HI/SC/NI     38.9%    ...     ...         Tiled-Region
  MBANK        69.1%    ...     ...         Timing-Driven
  Top Level    63.3%    N/A     1,095       Tiled-Region

It is important to note that occupancy is most dependent on the characteristics of the subchip, such as overall wire utilization and floorplan considerations. For example, high wiring congestion contributed to the lower occupancy of the HISCNI subchip, and low wiring congestion allowed for high occupancy in the MBANK subchip. The SRF has regions of low occupancy for interfacing with the SRAMs and other subchips that reduce its overall occupancy. However, in regions where large numbers of datapaths were used and the designs were less wire-limited, such as in the streambuffer datapaths, occupancy was over 80%.

By using tiled regions, large subchips such as the SRF and CLUST with logic conducive to datapath-style placement were easily managed by the designer. For example, placement runs for the SRF, which contained over 300,000 instances, took only around one hour on a 450 MHz UltraSPARC II processor. This meant that when using tiled-region placement on these large subchips, design iterations proceeded very quickly. Furthermore, the designer had fine-grained control over the placement of regions to easily fix wiring congestion problems. For example, the size and aspect ratio of datapath bitslices could be modified as necessary to provide adequate wiring resources.

Timing results for each of these subchips are included in Table 4.3. The maximum clock frequency and the critical path of each clock domain in fan-out-of-4 inverter delays (FO4s) are shown. Results were measured using standard RC extraction and STA tools at the typical process corner.

Table 4.3: Imagine timing results

  Clock     Max Freq   T_cycle (FO4s)   Clock Loads
  iclk      296 MHz    ...              ...K
  sclk      148 MHz    ...              ...K
  hclk      175 MHz    ...              ...K
  mclk      233 MHz    ...              ...K
  nclkin    296 MHz    ...              ...
  nclkout   296 MHz    ...              ...

4.5 Imagine Clocking Methodology

Most ASICs use a tree-based clock distribution scheme. This approach was also used on Imagine, but distributing a high-speed clock with low skew across a large die with many clock loads was challenging. Typical high-performance custom designs use latch-based design to enable skew tolerance and time borrowing. However, a large variety of high-performance latches was not available in Imagine's standard-cell library, so an edge-triggered clocking scheme, in which clock skew affects the maximum operating frequency, was used. Latches, instead of flip-flops, were used in some register file structures in the ALU clusters in order to reduce area and power dissipation.

In order to distribute a clock to loads in several subchips while minimizing skew between the loads, the standard flow in the TI ASIC methodology was used. First, after each subchip was placed, a clock tree was expanded within each subchip using available locations in the floorplan to place clock buffers and wires. Skew between the clock loads was minimized using Avant! Apollo [Chen, 1999]. Later, when all of the subchips were instantiated in the full-chip design, delay elements were inserted in front of the clock pins of the subchips so that the insertion delay from the inputs of the delay elements to all of the final clock loads would be matched for the average insertion delay case. Next, the same flow used on the subchips was used to synthesize a balanced clock tree to all of the inputs of the delay elements and to the leaf-level clock loads of clocked elements in the top-level design.

Imagine must interface with several different types of I/O, each running at a different clock speed. For example, the memory controller portion of each MBANK runs at the SDRAM clock speed. Rather than coupling the SDRAM clock speed to an integer multiple of the Imagine core clock speed, completely separate clock trees running at arbitrarily different frequencies were used. In total, Imagine has 11 clock domains: the core clock (iclk), a clock running at half the core clock speed (sclk), the memory controller clock (mclk), the host interface clock (hclk), four network input channel clocks (nclkin_n, nclkin_s, nclkin_e, nclkin_w), and four network output channel clocks (nclkout_n, nclkout_s, nclkout_e, nclkout_w). These clocks and the loads on each clock are shown in Table 4.3, but for clarity, only one of the network channel clocks is shown. The maximum speed of the network clocks was architecturally constrained to be the same as iclk, but they can operate slower if needed in certain systems. Mclk and hclk are also constrained by the frequency of other chips in the system, such as SDRAM chips, rather than by the speed of the logic on Imagine. Sclk was used to run the SRF and stream controller at half the iclk speed. The relaxed timing constraints significantly reduced the design effort in those blocks, and architectural experiments showed that running these units at half speed would have little impact on overall performance.

The decoupling provided by Imagine's 11 independent clock domains reduces the complexity of the clock distribution problem. Also, non-critical timing violations within one clock domain can be waived without affecting the performance of the others.

To facilitate these many clock domains, a synchronizing FIFO was used to pass data back and forth between different clock domains. Figure 4.4 shows the FIFO design used [Dally and Poulton, 1998]. In this design, synchronization delay is only propagated to the external inputs and outputs when going from the full to the non-full state or vice versa, and similarly for the empty and non-empty states. Brute-force synchronizers were used to do the synchronization. By making the number of entries in the FIFO large enough, write and read bandwidths are not affected by the FIFO design.

Figure 4.4: Asynchronous FIFO Synchronizer (data enters the FIFO with a write enable in the write-clock domain and is shifted out in the read-clock domain; the write and read pointers are compared, and the resulting full and empty flags are passed through synchronizers into the opposite clock domains).
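As an illustration of the brute-force synchronization used for the full and empty flags, the following minimal Verilog sketch shows a conventional two-flip-flop synchronizer; the module and signal names here are illustrative rather than taken from the Imagine design.

  module sync2 (
    input      clk,   // destination-domain clock
    input      d,     // flag generated in the source clock domain
    output reg q      // synchronized copy, available two destination cycles later
  );
    reg meta;
    always @(posedge clk) begin
      meta <= d;      // first flip-flop may go metastable
      q    <= meta;   // second flip-flop gives metastability time to resolve
    end
  endmodule

In the FIFO of Figure 4.4, only the slowly changing full and empty comparisons pass through such synchronizers, so the synchronization latency is paid only on full/non-full and empty/non-empty transitions rather than on every data transfer.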

4.6 Imagine Verification Methodology

Functional verification of the Imagine processor was a challenge given the limited resources available in a university research group. A functional verification test suite was written and run on the behavioral RTL. The same test suite was subsequently run on the structured RTL. Tests in the suite were categorized as either module-level or chip-level tests. Standard industry tools performed RTL-to-netlist and netlist-to-netlist comparisons for functional equivalency using formal methods.

Module-level tests exercised individual modules in isolation. These tests were used on modules whose functionality was well-defined and did not rely on large amounts of complex control interaction with other modules. Module-level tests that exercised specific corner cases were used for testing Imagine's floating-point adder, multiplier, divide-square-root (DSQ) unit, memory controller, and network interface. In each of these units, significant random testing was also used. For example, in the memory controller, large sequences of random memory reads and writes were issued. In addition, square-root functionality in the DSQ unit was tested exhaustively.

Chip-level tests were used to target modules whose control was highly coupled to other parts of the chip and for running portions of real applications. Rather than relying only on end-to-end correctness comparisons, a more aggressive comparison methodology was used for these tests. A cycle-accurate C++ simulator had already been written for Imagine. During chip-level tests, a comparison checker verified that identical writes had occurred to architecturally-visible registers and memory in both the C++ simulator and the RTL model. This technique was very useful due to the large number of architecturally-visible registers on Imagine. Also, since this comparison occurred every cycle, it simplified debugging, since any bug would be seen immediately as a register-write mismatch. A number of chip-level tests were written to target modules such as the stream register file and the microcontroller. In order to generate additional test coverage, insertion of random stalls and timing perturbations of some of the control signals were included in nightly regression runs.

In total, there were 24 focused tests, 10 random tests, and 11 application portions run nightly as part of a regression suite. Some focused tests included random timing perturbations. Every night, 0.7 million cycles of focused tests, 3.6 million cycles of random tests, and 1.3 million cycles of application portions were run as part of the functional verification test suite on the C++ simulator, the behavioral RTL, and the structured RTL. These three simulators ran at 600, 75, and 3 Imagine cycles per second, respectively, when simulated on a 750 MHz UltraSPARC III processor.
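A minimal Verilog sketch of a per-cycle register-write comparison checker in the spirit of the one described above is shown below. The interface, the trace file name, and the field widths are assumptions made here for illustration; the actual checker compared the RTL against the cycle-accurate C++ simulator rather than against a pre-dumped trace file.

  module regwrite_checker (
    input         clk,
    input         reg_we,     // asserted when an architecturally-visible register is written
    input  [7:0]  reg_addr,   // which register was written (width is an assumption)
    input  [31:0] reg_data    // value written
  );
    integer fd, status, errors;
    reg [7:0]  exp_addr;
    reg [31:0] exp_data;

    initial begin
      errors = 0;
      // Expected writes, one "addr data" pair per line, dumped by the reference simulator.
      fd = $fopen("expected_writes.txt", "r");
    end

    always @(posedge clk) begin
      if (reg_we) begin
        status = $fscanf(fd, "%h %h\n", exp_addr, exp_data);
        if (reg_addr !== exp_addr || reg_data !== exp_data) begin
          errors = errors + 1;
          $display("MISMATCH at %0t: wrote reg %0d = %h, expected reg %0d = %h",
                   $time, reg_addr, reg_data, exp_addr, exp_data);
        end
      end
    end
  endmodule

Because every architecturally-visible write is checked on the cycle it occurs, a bug shows up immediately as a mismatch rather than much later as a wrong end-to-end result.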

In summary, the design, clocking, and verification methodologies used on Imagine enabled the design of a 0.7M-instance ASIC without sacrificing performance, and with a considerably smaller design team than comparable industrial designs.

Chapter 5

Imagine: Experimental Results

In this chapter, experimental results measured from the Imagine stream processor are presented. Imagine was fabricated in a Texas Instruments CMOS process with metal spacing typical of a 0.18 micron process and with 0.15 micron drawn-gate-length transistors. Figure 5.1 shows a die photograph of the Imagine processor with the five subchips presented in Chapter 4 highlighted. Its die size is 16 mm by 16 mm. The I/Os are peripherally bonded in a 792-pin BGA package. There are 456 signal pins (140 network, 233 memory system, 45 host, 38 core clock and debug), 333 power pins (for the core supply and the I/O supplies), and 3 voltage reference pins. The additional empty area in the chip plot is either glue logic and buffers between subchips or is devoted to power distribution.

5.1 Operating Frequency

The operating frequency of Imagine was tested on a variety of applications and over a range of core supply voltages. As presented in Chapter 4, static timing analysis tools predicted Imagine to be fully functional with a clock period of 46 fan-out-of-4 inverter delays, corresponding to 296 MHz operation at the typical process corner at 1.5V and 25°C (188 MHz at the slow process corner, 1.35V, and 125°C). As shown in Figure 5.2, laboratory measurements of the Imagine processor show significantly slower operation, with a maximum clock speed of 288 MHz at 2.1V, and a clock speed of only 132 MHz at 1.5V. Package temperature was monitored during these measurements, and stayed under 40°C with the

Figure 5.1: Die Photograph (the die photo is annotated with the HI, SC, and NI blocks, the four memory banks MBANK 0 through MBANK 3, the SRF, the UC, and the eight arithmetic clusters CLUST0 through CLUST7).
