Reconfigurable Computing Systems Cost/Benefit Analysis Model


Reconfigurable Computing Systems Cost/Benefit Analysis Model

by

William W.C. Chu

A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Applied Science in Electrical and Computer Engineering

Waterloo, Ontario, Canada, 2005

© William W.C. Chu 2005

I hereby declare that I am the sole author of this thesis.

I authorize the University of Waterloo to lend this thesis to other institutions or individuals for the purpose of scholarly research.

William W.C. Chu

I further authorize the University of Waterloo to reproduce this thesis by photocopying or by other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research.

William W.C. Chu

Abstract

The tradeoff between flexibility and performance has long existed in the world of digital design: whether to use ASICs for high performance at the price of high development cost and poor flexibility, or to use general purpose processors for their software flexibility at the price of performance overhead. The introduction of reconfigurable computing systems has been good news, since they offer a good balance between performance and flexibility. The objective of this thesis is to develop a generic analysis model that can be applied to virtually any reconfigurable computing device and provide a basis for comparison and evaluation. The developed model is applied to several reconfigurable computing systems, demonstrating its usefulness.

Acknowledgements

First I would like to express my gratitude to my supervisor, Professor Catherine Gebotys of the University of Waterloo. Thank you for your supervision, support, kindness, and guidance over the past two years. I would also like to thank Associate Professor Sagar Naik and Associate Professor Mark Aagaard. Thank you for your positive and valuable comments and suggestions on my thesis. My very special acknowledgement is to my parents, my sister, and Wilson Fung. Thank you for your persistent love, encouragement, and understanding about my decision to pursue my Master's at Waterloo, which meant many days and weeks away from home.

William W.C. Chu
April 2005
Waterloo, Canada

Contents

1 Introduction
  1.1 Motivation
  1.2 Significance of This Work
  1.3 Thesis Organization

2 Reconfigurable Computing Technology
  2.1 ASICs and Microprocessors
  2.2 What is Reconfigurable Computing?
  2.3 Framework of Reconfigurable Computing
  2.4 Examples of Available Reconfigurable Computing Systems

3 Reconfigurable Computing - Proposed Cost/Benefit Analytical Model
  3.1 Motivation
  3.2 Factors
  3.3 Metrics and Modelling
    3.3.1 Performance
    3.3.2 Power
    3.3.3 Area
  3.4 Proposed Analytical Model
  3.5 Other Metrics

4 Xputer System
  4.1 Motivation and Methodology
  4.2 Xputer Machine Paradigm
    4.2.1 Data Sequencer
    4.2.2 Scan Caches
    4.2.3 ralu
    4.2.4 Residual Control
    4.2.5 Comparison between Computer and Xputer
  4.3 Software
  4.4 Xputer Prototype - Map-Oriented Machine 3 (MoM-3)
    4.4.1 KressArray 3
  4.5 Evaluation Models
    4.5.1 Performance Model
    4.5.2 Power Model
    4.5.3 Area Model
    4.5.4 Other Metrics
  4.6 Summary

5 NEC DRP System
  5.1 Motivation and Methodology
  5.2 Architecture
  5.3 Software
  5.4 Specifications of DRP-1 Prototype Chip
  5.5 Evaluation Model
    5.5.1 Performance
    5.5.2 Power Model
    5.5.3 Area Model
    5.5.4 Other Metrics
  5.6 Summary

6 MIT Raw System
  6.1 Motivation and Methodology
  6.2 Architecture
    6.2.1 Raw Tiles
    6.2.2 Static Network
    6.2.3 Dynamic Network
    6.2.4 I/O
  6.3 Software
  6.4 Evaluation Model
    6.4.1 Performance Model
    6.4.2 Power Model
    6.4.3 Area Model
    6.4.4 Others
  6.5 Summary

7 CMU PipeRench System
  7.1 Motivation and Methodology
  7.2 Architecture
    7.2.1 Switch Fabric
    7.2.2 Configuration Controller
    7.2.3 Data Management
    7.2.4 Hardware Interface
    7.2.5 On-Chip Memory
  7.3 Software
  7.4 Specifications
  7.5 Evaluation Model
    7.5.1 Throughput
    7.5.2 Power
    7.5.3 Area Model
      Memory Area
      Computational Area
    7.5.4 Others
  7.6 Summary

8 Analysis based on the Proposed Analytical Model
  8.1 Comparisons between Raw and PipeRench
    8.1.1 Performance
    8.1.2 Power
    8.1.3 Area
  8.2 Comparison by Integration
  8.3 Summary
  8.4 Estimation Limitations

9 Concluding Remarks
  9.1 Discussions
  9.2 Conclusions
  9.3 Future Research

List of Tables

3.1 Simple Instruction Mix Example
4.1 Comparison between Computer components and Xputer components [6]
Execution time of Selected Benchmark programs in Raw system [29]
Average Power Breakdown of the Raw Processor
Current Breakdown of Raw Tiles for a highly parallel application
Area breakdown of a Raw Tile
Power Breakdown of XC4003 [36]
Estimated Power Breakdown of PipeRench - Non Virtualization case
Estimated Component Power Breakdown of PipeRench - Virtualization case
Estimated Operation Power Breakdown of PipeRench using Proposed Model
Area breakdown of a PipeRench Chip [3]
Area breakdown of a PipeRench PE [3]
Component Power Breakdown for PipeRench and Raw
Estimated Power Breakdown of PipeRench and Raw, by Operation Phase

List of Figures

3.1 Reconfigurable Computing - Proposed Cost/Benefit Analytical Model
4.1 Scan Pattern Examples [6]
4.2 Architectural Overview of the Xputer Paradigm [6]
4.3 Mapping of Scan Window following the Scan Pattern in the Data Map [14]
4.4 Architectural Overview of a Map-Oriented Machine 3 Machine [17]
4.5 Architectural Overview of KressArray 3 rdpa [15]
Structure of a DRP Tile [24]
Structure of a DRP Processing Element [24]
Raw Tile Structure [5]
Tile Interconnects in Raw processor [4]
I/O Interface in Raw system [4]
Raw Tile Floor Plan [4]
Example of Hardware Virtualization - Virtual pipeline stages [2]
Example of Hardware Virtualization - Physical pipeline stages [2]
Architectural View of Switch Fabric in PipeRench [2]
Block Diagram of a Processing Element in PipeRench [3]
Architectural Overview of a PipeRench system [31]
Power statistics of a FIR filter application running on PipeRench [3]
PipeRench Chip Floorplan [3]
Floor plan of a PipeRench PE [3]

Chapter 1

Introduction

1.1 Motivation

Digital designers face a fundamental tradeoff between flexibility and performance when choosing between different computing systems. Customized hardwired technology provides high performance and low power consumption through specialization, but lacks flexibility, since any change requires redesign and rewiring. Software-based solutions operate with software instructions. Great flexibility comes from easy development and maintenance of the software code, but execution of instructions introduces high overhead in performance and area.

In the past decade, a new class of technology, reconfigurable computing, has been introduced. It overcomes the traditional tradeoff and is able to achieve high performance while maintaining flexibility. Promising as this technology is, an analytical model is needed that can evaluate the benefits and tradeoffs of any reconfigurable computing system. As well, with many reconfigurable computing systems available, it is difficult to compare them due to their various innovative designs, architectures, and implementations.

The objective of this thesis is to develop a generic analytical model that can be applied to virtually any reconfigurable computing device and provide a basis for comparison and evaluation. The model will be applied to several selected reconfigurable systems to produce meaningful results for evaluation.

1.2 Significance of This Work

The variety of reconfigurable computing systems gives the market and consumers more choices, but also makes selection confusing and difficult. The purpose of this thesis is to propose an innovative high level model for evaluation and comparison among different reconfigurable computing systems. Furthermore, insights can be obtained about individual devices through application of the proposed model. This proposed analytical model contains the design knowledge and results from the author's perspective over twenty months of research at the University of Waterloo.

1.3 Thesis Organization

This thesis is organized so that each chapter is self-contained. It begins with an introduction to reconfigurable computing technology in Chapter 2. Chapter 3 introduces the proposed Cost/Benefit Analysis Model. The next four chapters, Chapter 4 to Chapter 7, introduce four reconfigurable computing systems: Xputer, NEC DRP, MIT Raw, and CMU PipeRench. These chapters discuss the design methodologies and hardware architectures of each system. Corresponding software tools are briefly introduced, but are not the focus of this report. The proposed model is applied to each system individually. Unfortunately, insufficient statistics could be collected on the NEC DRP and Xputer systems due to limited publication, so the primary focus is on the Raw and PipeRench systems. Through integration of the model and the collected statistics, more insights can be gained. Chapter 8 compares the Raw system and the PipeRench system based on the proposed analytical model. Conclusions and recommendations are given in Chapter 9, at the end of the thesis.

Chapter 2

Reconfigurable Computing Technology

2.1 ASICs and Microprocessors

In the conventional computing world, there are two primary methods for executing algorithms. The first is to use hardwired technology, such as an Application Specific Integrated Circuit (ASIC), to perform the operations in hardware. Custom hardware is specially designed to perform an application or dedicated operation. The high degree of specialization results in very fast and efficient execution of the designed task. As well, power consumption overhead is minimal, since designers avoid unnecessary parts. However, a huge amount of development work is required to achieve this extreme efficiency. Also, the hardwired circuit cannot be altered after fabrication. Any changes, modifications, or updates to the circuit require a redesign and refabrication of the chip, making hardwired technology expensive to maintain in effort, time, and cost.

The second execution method is to use software-programmed microprocessors. The application is represented in the form of sequenced code. The microprocessor executes the code, or instructions, to perform a computation. The Instruction Set Architecture (ISA) is the interface between the instructions and the execution hardware; changes at either end will not affect the functionality of the other side as long as the ISA specification is followed. Therefore, changes in software instructions can alter the functionality of an operation without the need to change any of the hardware resources.

This gives designers great flexibility to freely modify the software code. The flexibility comes at the cost of a large performance overhead: during execution, the processor fetches each instruction from memory, decodes its meaning, and only then performs the operation of the instruction. These overheads result in degraded performance and greater power consumption. While the ISA serves as a great interface between software and hardware, it also limits the potential growth of the system. After chip fabrication, any operation to be implemented must be built on the ISA specification. Improvements in the hardware must maintain the full ISA specification, even obsolete parts of it, to remain backward compatible with existing software programs.

2.2 What is Reconfigurable Computing?

Reconfigurable computing technology was introduced to fill the gap between hardware- and software-based design. The goal is to achieve performance better than that of software, while maintaining greater flexibility than hardware solutions. Reconfigurable computing devices are composed of many computational elements whose functionality is determined through programmable configurations. These elements, sometimes known as logic blocks or processing units, are connected by programmable routing resources. The idea of configuration is to map the logic functions of a design onto the processing units within a reconfigurable device, and to use the programmable interconnects to connect the processing units together to form the necessary circuit. Great flexibility comes from the programmable nature of the processing elements and routing. Performance can be better than software-based approaches due to reduced execution overhead.

Under this definition, the Field Programmable Gate Array (FPGA) is a form of reconfigurable computing system. FPGAs and other reconfigurable computing systems have been shown to accelerate a variety of applications, such as encryption algorithms and streaming applications [1]. While FPGAs demonstrate good performance in various applications, they still have shortcomings:

1. Logic Granularity: Classic FPGAs have a low granularity in their design. When processing units are chained together to form a bigger operation, the low granularity enables better utilization. However, there is a large overhead to control and connect the many processing units, resulting in a performance penalty.

2. Support for Reconfiguration: Configuration of an FPGA is done at initialization. Reconfiguration for a new application usually requires the chip to be taken down and reprogrammed. Certain FPGAs may support run time reconfiguration, but it may take up to hundreds of milliseconds to complete.

3. Hard Constraints: An FPGA can only implement applications within the size of its hardware constraints. This size restriction also makes compilation more difficult.

These disadvantages make FPGAs an unsuitable choice for certain applications. Many reconfigurable computing systems have been developed, or are under research, to make up for the shortcomings of FPGAs.

2.3 Framework of Reconfigurable Computing

While different reconfigurable computing systems have different design goals, methodologies, and implementations, they share the same design framework:

Configurable Structure: Programmable processing units with configurable interconnects are the basis of a reconfigurable computing platform. Various configuration combinations can define numerous possible functionalities. A processing unit can be implemented as a simple microprocessor, or as a gate-level operator such as a lookup table. Interconnects in different systems have different structures as well, such as mesh, linear, and crossbar.

Compilation Environments: A tool is required to map an application onto the reconfigurable computing system. The mapping is expressed in configuration bits that define the operation of each processing unit and route. This compilation tool can range from an assisting tool that helps a programmer perform manual mapping, to a fully automated system that handles all the configuration work by itself.

Reconfiguration: The configurable nature of a reconfigurable computing system allows the hardware to be programmed with new sets of configurations to support new operations.

Depending on the architecture, some systems can only be reprogrammed in a non-executing state, while others support dynamic reconfiguration at run time, allowing an operation to be altered during execution. The duration of the reconfiguration process also varies.

2.4 Examples of Available Reconfigurable Computing Systems

There are many reconfigurable computing systems available in the market, and many more still under research. The following are some examples:

CMU PipeRench [2, 3] is a hardware-based solution specialized for pipelined applications. Through run time reconfiguration of hardware, a large application can be executed using a small amount of hardware resources. Although low power is not one of the primary design objectives, the efficient architecture and simple implementation dissipate less than one watt of average power while achieving good performance. This architecture is a perfect candidate for pipelined applications because of its highly specialized design, small area, and low power implementation.

MIT Raw [4, 5] is a mesh architecture of interconnected simple RISC processors. Its goal is to benefit from parallel execution across multiple microprocessors in a coarse-grained environment. The static communication network in the architecture makes good use of communication patterns predetermined at compile time and reduces network latency through preparation well ahead of execution. This architecture can provide great flexibility and processing power beyond that of a single processor. Raw performs well with arbitrary programs, but better with parallel applications. However, high power consumption results from the execution of multiple processors, which is a big drawback of the architecture.

Xputer [6, 7] is a computer organization that uses data-driven control instead of the instruction sequence control of conventional computers. It aims to avoid data latency and data dependency problems by executing in the order of the data accessing sequence. Applications with regular data patterns, such as multimedia, streaming, and encryption applications, fit well with the Xputer organization.

NASA Evolvable Hardware [8] is reconfigurable hardware whose configuration is under the control of an evolutionary/genetic algorithm.

In evolutionary synthesis of analog and digital circuits, a hardware electronic circuit evolves to realize a design specification dynamically at run time, without the need for any predetermined information. The ultimate goal of this research is to develop an architecture that can adapt to any possible environment without any human control: an evolvable intelligent machine that can work independently in environments such as space exploration. The drawback is that the evolution process is very resource demanding and time consuming.

NEC DRP [9] is a coarse-grained reconfigurable processing system. The system is composed of many small processing elements for computation. A repository of contexts is stored on-chip; by choosing a different context, the chip implements a different datapath to represent a new operation. This feature enables dynamic run time reconfiguration in a single clock cycle. Applications such as network, image, and signal processing work well with the parallel processing environment and the fast run time reconfiguration for dynamic events.

IPFlex DAPDNA [10], or dynamically reconfigurable processor, is a dual-core processor pairing a RISC core with a two-dimensional processing matrix. Reconfiguration of the processing matrix is controlled by the RISC core to support different operations and achieve parallel processing efficiently.

MathStar FPOA, or Field Programmable Object Array [11], is an enhanced FPGA-based solution. Instead of using CLBs or lookup tables as the elementary cell of the device, the FPOA uses its own building blocks as the foundation. Having pre-defined block types allows the blocks to achieve higher performance, less area, and better communication with other blocks.

A few of the listed systems have been selected for analysis in this thesis. PipeRench is a hardware-based pipelined architecture with great flexibility, while Raw is a software-based, general purpose processor approach with enhanced parallelism; the two systems are very representative of the two extremes of design. NEC DRP, IPFlex DAPDNA, and MathStar FPOA are commercial products; they are FPGA-based solutions with higher granularity and advanced features, and NEC DRP is selected to represent this class of products. Beyond hardware, software, and FPGA based solutions, the Xputer system is also included in the analysis due to its unique design approach.

The selected systems, Xputer, NEC DRP, Raw, and PipeRench, will be covered in detail in the following chapters, including their methodologies, architectures, and analysis. Because NEC DRP is proprietary intellectual property of NEC Electronics, only limited information on its architecture, operations, tools, and execution statistics could be obtained. Limited knowledge of Xputers could be obtained as well. Thorough analysis of these two systems is therefore prohibited.

Chapter 3

Reconfigurable Computing - Proposed Cost/Benefit Analytical Model

3.1 Motivation

Each reconfigurable computing system represents a unique combination of design, architecture, and implementation. Conventional evaluation approaches are not sufficient for the analysis of reconfigurable computing systems, since these systems are fundamentally different in nature. The objective of this thesis is to develop a generic, high level analytical model that can be used to evaluate virtually any reconfigurable computing system without resorting to gate level details. The proposed Reconfigurable Computing Cost/Benefit Analytical Model provides a common framework for different reconfigurable systems, allowing comparisons to be made. Through application of the model, it is hoped that users can gain insights into different reconfigurable systems.

3.2 Factors

The goal of any computational device is to formalize an application and automate the computation. ASICs, FPGAs, general purpose processors, and reconfigurable systems all share this purpose. The same application on different platforms will typically result in different performance; different applications on the same platform will also yield different performance.

This is an outcome of how well an application fits a system. Specifically, if a device or system is customized for an application, resources can be better utilized to achieve better performance. Effort can be focused on meeting performance requirements and specification constraints, and on avoiding or reducing potential bottlenecks. For example, the amount of on-chip storage dictates how often external communication is required. For a large, data-demanding application, more on-chip memory can be implemented to reduce the load on I/O; the increase in area is well paid off by the avoidance of the I/O bottleneck. To mathematically formalize the situation, an application can be represented by its size, and a device by its processing power.

3.3 Metrics and Modelling

The three main metrics in the proposed model are performance, power, and area.

3.3.1 Performance

Why bother to switch from one device to another if the original device can already handle the task? Because more output in a shorter period of time is always required by the next generation of standards. Performance has always been one of the most critical metrics in any form of development. In this cost/benefit analytical model, performance represents the benefit that can be achieved. Different applications and devices have different ways to measure and express performance. Some common performance units are:

Throughput is a very representative measurement unit. It is the ratio between the amount of output and the execution time, as in Equation 3.1:

$$\mathrm{Throughput} = \frac{\text{Total amount of output}}{\text{Total execution time}} \tag{3.1}$$

Throughput represents the amount of processing done per time unit. A common time unit enables the processing power of different systems to be compared in terms of output generated. Throughput can also be applied to individual systems for insight: it is a very good indicator of a system's ability to handle different applications, and designers can investigate potential bottlenecks based on throughput results.

CPI, or cycles per instruction, is an important metric used for general purpose processors [12]. In the software world, an application compiles into a sequence of instructions for execution on the processor. CPI represents the number of clock cycles required to execute an instruction. Since there are many different instructions in an ISA, each with a different execution time, usually an average CPI is computed based on the occurrence frequency of each instruction, as in Equation 3.2:

$$\text{average CPI} = \sum_{i=1}^{\text{instructions in ISA}} (\text{execution time}_i \times freq_i) \tag{3.2}$$

where $i$ is an index over the instructions in the ISA, and $freq_i$ is how often instruction $i$ occurs relative to the total number of instructions. As a simple example, assume the instruction mix listed in Table 3.1:

Instruction           Proportion of All Instructions    Clock Cycles
Stores                15%                               3
Loads                 25%                               3
Branches              15%                               4
Integer Arithmetic    45%                               1

Table 3.1: Simple Instruction Mix Example

average CPI = (0.15 × 3) + (0.25 × 3) + (0.15 × 4) + (0.45 × 1) = 2.25 cycles per instruction

(A short script below reproduces this calculation.) The reciprocal of CPI is IPC, or instructions per cycle. It measures the number of instructions that can be executed in a single clock cycle. IPC is a form of throughput at a very low level, with instructions as the amount of processing and the clock cycle as the time unit.

Latency refers to the amount of time required to perform a task. If two systems have the same functionality, the one with the smaller execution latency will yield better throughput. Measuring the latency of components within a system enables identification of execution bottlenecks as well.
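To make the calculation concrete, the following short Python sketch reproduces the average CPI computation of Equation 3.2 for the instruction mix of Table 3.1 (the mix and cycle counts are the illustrative values from the table, not measurements from a real processor):

```python
# Instruction mix from Table 3.1: (fraction of all instructions, clock cycles).
instruction_mix = {
    "stores":             (0.15, 3),
    "loads":              (0.25, 3),
    "branches":           (0.15, 4),
    "integer arithmetic": (0.45, 1),
}

# Equation 3.2: average CPI = sum over instructions of (execution time_i * freq_i)
average_cpi = sum(freq * cycles for freq, cycles in instruction_mix.values())
average_ipc = 1.0 / average_cpi  # IPC is the reciprocal of CPI

print(f"average CPI = {average_cpi:.2f}")  # prints 2.25
print(f"average IPC = {average_ipc:.2f}")  # prints 0.44
```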

CPI is only a meaningful metric for microprocessors from the same processor family sharing the same ISA. To compare different microprocessor architecture families fairly, the benchmark program approach is commonly used nowadays. Given a small program written in a high level language, each processor compiles it into its own instruction set and executes it. The amount of time used to complete the execution, in other words the time to complete the same task, is the basis for comparison. Common benchmark suites such as those of the Standard Performance Evaluation Corporation (SPEC) [13] can be found online for further information.

3.3.2 Power

The power consumption of devices has become more and more important nowadays; a certain level of performance may be sacrificed for lower power consumption. To gain more insight into power dissipation, the power consumption of each system component can be measured. Another perspective is to model the energy or power consumed in each particular operation phase or execution state. Because different execution states span different time periods, it is best to use energy to add up the amount of work being done; average power can then be calculated by dividing the amount of work done by the execution time. However, it may be difficult to obtain all measurements in full detail. For high level modelling purposes, accuracy can be sacrificed for simplicity: instead of modelling with energy and transforming to average power afterward, average power can be used directly. The notion of power will refer to average power within the scope of this thesis. To model the power dissipation of each operational stage, the components involved in the process are identified.

1. Initial Configuration Loading: This step involves the initial loading of configuration bits and data onto the chip. It mainly identifies the system's initial I/O activities. During this phase, the I/O interface and pins are involved. Configuration data loaded on-chip is either stored or used for configuration directly, depending on the device architecture. Either way consumes power, which leads to Equation 3.3:

$$P_{\text{cfg loading}} = f(P_{\text{IO interface}}, P_{\text{IO pad}}, P_{\text{storing the configuration data on chip}}) \tag{3.3}$$

2. On-chip Configuration: After configuration data has been loaded on-chip, some architectures require a step to map the configuration bits onto the processing units. This phase involves reading the configuration data out of on-chip storage, transferring it to the processing units, and then configuring the processing units, as shown in Equation 3.4. The process is repeated until all necessary processing units are configured.

$$P_{\text{chip cfg}} = f(P_{\text{reading on-chip configuration memory}}, P_{\text{processing unit configuration}}) \tag{3.4}$$

3. Execution: Execution power is the most critical, because it represents the power consumption related to device computation and processing. The scope of this stage is defined to cover only on-chip data; any I/O activity, such as loading application data from external devices, is considered another phase. To model the execution stage, the power consumption of the processing units is mandatory. Other activities may include reading from and writing to local memory, on-chip inter-component communication, and the chip controller, depending on a device's design. Equation 3.5 is the power equation for the execution stage:

$$P_{\text{chip execution}} = f(P_{\text{execution of processing unit}}, P_{\text{local memory access}}, P_{\text{on-chip communication}}, P_{\text{controller}}) \tag{3.5}$$

4. Data Loading: This refers to the I/O activities that read/write application data from/to external memory for execution. It is separated from Initial Configuration Loading to distinguish I/O activities for initialization from those for execution. I/O components are always involved in the process. Depending on a device's design, data may be stored to or read from local memory, or sent directly to or read directly from the processing units. All components listed are involved in Equation 3.6:

$$P_{\text{io data loading}} = f(P_{\text{IO interface}}, P_{\text{IO pad}}, P_{\text{local memory access}}) \tag{3.6}$$

5. Reconfiguration: Any power overhead related to reconfiguration falls under this category. I/O activities to load new configuration data beyond initialization, mapping of new configuration bits onto processing units, context switching, storing and restoring state information, and related issues are all considered reconfiguration overheads. There are many possible forms of reconfiguration due to the varied nature of reconfigurable designs. Equation 3.7 attempts to include some common approaches to reconfiguration, but is not limited to them:

$$P_{\text{reconfiguration}} = f(P_{\text{IO interface}}, P_{\text{IO pad}}, P_{\text{storing the configuration data on chip}}, P_{\text{mapping configuration data to processing unit}}, P_{\text{context switching}}, P_{\text{storing and restoring of state information}}, P_{\text{reconfiguration control}}, P_{\text{others}}) \tag{3.7}$$

In later chapters, these power models are implemented by adding up the average power consumption of the components involved in each particular phase. The use of average power in the model is not totally accurate; strictly, energy should be used and transformed to average power afterward. However, it is often difficult to obtain full knowledge of energy dissipation and timing information for each operation state, so accuracy is sacrificed for simpler high level power modelling. As well, estimation may be applied to fill gaps in missing information. In later chapters, readers will often see $K$ used in the power equations; it represents the unit power consumption of the particular component.
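The following Python sketch shows one way the phase-based power model of Equations 3.3 to 3.7 can be applied, realizing each $f(\ldots)$ as a simple sum of the average power of the components involved, as done in later chapters. The component names follow the equations; the milliwatt values are invented purely for illustration:

```python
# Hypothetical average power per component, in mW (illustrative values only).
COMPONENT_POWER_MW = {
    "io_interface":        25.0,
    "io_pad":              40.0,
    "cfg_storage_write":   10.0,
    "cfg_storage_read":     8.0,
    "processing_unit_cfg": 12.0,
    "processing_unit_exe": 150.0,
    "local_memory":        30.0,
    "on_chip_comm":        20.0,
    "controller":          15.0,
}

# Components involved in each operational phase (Equations 3.3-3.7).
PHASES = {
    "initial_cfg_loading": ["io_interface", "io_pad", "cfg_storage_write"],
    "on_chip_cfg":         ["cfg_storage_read", "processing_unit_cfg"],
    "execution":           ["processing_unit_exe", "local_memory",
                            "on_chip_comm", "controller"],
    "data_loading":        ["io_interface", "io_pad", "local_memory"],
    "reconfiguration":     ["io_interface", "io_pad", "cfg_storage_write",
                            "processing_unit_cfg"],
}

def phase_power(phase: str) -> float:
    """Average power of one operational phase: f(...) realized as a sum."""
    return sum(COMPONENT_POWER_MW[c] for c in PHASES[phase])

for phase in PHASES:
    print(f"{phase}: {phase_power(phase):.1f} mW")
```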

3.3.3 Area

Area is another cost factor in the proposed model; it is desirable to keep area small. The area model goes beyond just the total chip area to address area efficiency within a design. In the equations used in this thesis, $A$ denotes area.

1. Total Chip Area: Total chip area is the most direct information about area, as well as the easiest to acquire. Beyond the actual area of all device components, a system also has area overhead due to interconnects, place and route, fragmentation, and possibly other causes. For this reason, total area overhead is included when modelling area in Equation 3.8. The overhead is represented by $\alpha$ and expressed as a percentage of the sum of the component areas, so that it is scalable.

$$A_{system} = \left( \sum_{i=1}^{\text{all system components}} A_{\text{component}_i} \right) \times (1 + \alpha_{\text{area overhead}}) \tag{3.8}$$

2. Computational Area: Different reconfigurable systems have different architectures and different ways to implement their primitive processing units. One approach to comparing effective area is to look at computational area, the amount of area dedicated to computational units. This figure shows how efficiently area is used for processing versus other overheads. To generate the computational area efficiency, add up the area of all computational units, then divide by the total chip area, as shown in Equation 3.9:

$$\text{Computational Efficiency} = \frac{\sum_{i=1}^{\text{all computational components}} A_{\text{computational component}_i}}{\text{Total chip area}} \tag{3.9}$$

3. Memory Area: Another interest in the area model is the amount of memory or storage area on-chip. Systems benefit from reduced external memory access time by implementing on-chip memory; this, however, increases the total chip area. Memory area information can give evaluators insight into the tradeoff of using on-chip memory. Using an equation similar to computational efficiency, the proportion of on-chip memory usage can be calculated as in Equation 3.10:

$$\text{Memory Usage} = \frac{\sum_{i=1}^{\text{all on-chip memories}} A_{\text{memories}_i}}{\text{Total chip area}} \tag{3.10}$$

3.4 Proposed Analytical Model

The idea of the proposed Cost/Benefit Analytical Model is illustrated in Figure 3.1, which integrates the two factors and the three metrics together.

[Figure 3.1: Reconfigurable Computing - Proposed Cost/Benefit Analytical Model]

In the model, an application is represented by application size, and a reconfigurable computing system is abstracted as processing power. The dotted line is the relationship between the two factors, representing an application executing on a reconfigurable system. Through execution, physical performance and power consumption can be measured. It should be noted that the measured performance and power values are only valid for a particular application on a particular system. The physical area of a chip is fixed at fabrication. After obtaining the experimental measurements for performance, power, and chip area, the metric models can be applied to interpret the meaning of the values.

Successful modelling of the three main metrics, performance, power, and area, enables insights to be gained about a system's strengths, weaknesses, bottlenecks, limitations, and other issues. Metric modelling looks at the details of a system. The proposed analytical model also supports looking at a system from a high level, by computing the benefit/cost ratio shown in Equation 3.11:

$$\text{Score} = \frac{\text{Benefits}}{\text{Costs}} \tag{3.11}$$

One way to implement this ratio is to take the three main metrics, with performance as the benefit, and power and area as the costs. Integration gives a score to a design, as in Equation 3.12:

$$\text{Score} = \frac{\alpha_{Performance} \times Performance}{(\alpha_{Power} \times Power) \times (\alpha_{Area} \times Area)} \tag{3.12}$$

where each $\alpha$ represents the importance of the metric to the evaluator. Evaluators can adjust the importance variables based on application needs and design constraints. Experimental measurements can be used to quantitatively represent the metrics; for example, throughput to represent performance, total power dissipation to denote the power metric, and total chip area for the area metric. The implementation of Equation 3.11 is not limited to Equation 3.12; evaluators can freely use other benefit and cost variables. For example, Equation 3.13 is another benefit/cost ratio:

$$\text{Score} = \frac{(\alpha_{Throughput} \times Throughput) \times (\alpha_{\text{Computational Area Efficiency}} \times \text{Computational Area Efficiency})}{\alpha_{\text{Reconfiguration Power}} \times \text{Reconfiguration Power}} \tag{3.13}$$

The purpose of Equation 3.11 is to provide a single, easily interpreted quantitative result from the application of the analytical model. This quantitative score enables numerical comparison between different reconfigurable computing systems executing the same application. It is particularly useful when many systems are involved, since qualitative comparison requires considering all system combinations. Different applications executing on the same reconfigurable system will yield different scores as well, which can reveal how well the system handles the corresponding application.
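As a usage illustration, the sketch below evaluates the benefit/cost score of Equation 3.12 for two hypothetical systems running the same application. All measurement values and importance weights are made up; in practice they would come from the experimental measurements and the evaluator's priorities discussed above:

```python
def score(performance: float, power: float, area: float,
          a_perf: float = 1.0, a_power: float = 1.0, a_area: float = 1.0) -> float:
    """Equation 3.12: (a_perf * performance) / ((a_power * power) * (a_area * area))."""
    return (a_perf * performance) / ((a_power * power) * (a_area * area))

# Hypothetical measurements: throughput in Mbit/s, power in W, area in mm^2.
system_a = score(performance=480.0, power=2.5, area=16.0)
system_b = score(performance=600.0, power=18.0, area=331.0)
print(f"system A: {system_a:.3f}")  # 12.000
print(f"system B: {system_b:.3f}")  # 0.101

# Raising the power weight (e.g., for a battery-powered target) shifts the result:
print(f"system A, power-weighted: {score(480.0, 2.5, 16.0, a_power=4.0):.3f}")  # 3.000
```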

3.5 Other Metrics

While very representative, the three main metrics of performance, power, and area are still unable to cover all characteristics of a system and how well it can execute an application. To analyze a system completely, other characteristics of the device need to be taken into consideration as well. The following are some common ones:

1. Critical Path: The longest execution path within a computation dictates the clock period of execution, or the achievable clock frequency. This metric is more meaningful when an architecture is built with irregular processing cells.

2. On-Chip Memory: A connection with external memory is necessary to accept user data in various situations. On-chip memory allows faster data access in comparison to external memory, and reducing the amount of external memory access also reduces the amount of I/O activity. The tradeoff is larger chip area and increased power consumption. The amount of memory to be placed on-chip must balance these factors.

3. I/O: The ability to communicate with external devices is important. Certain devices also provide interfaces to other external devices, allowing a device to adopt different roles in a larger system, such as routing, DMA, or co-processing. To support a wide range of I/O activities, the available I/O bandwidth becomes the key factor.

4. Scalability: One common design idea in many reconfigurable systems is scalability. Designers put much effort into scalable architectures, so that the processing power of the system can be scaled or extended with minimal future work.

These metrics are suggestions only. Each metric has a different meaning and importance for different reconfigurable systems. Due to fundamental differences in design nature, a measurement may not be suitable or applicable to a given system. The design methodology of a reconfigurable system needs to be carefully considered to select representative metrics.

Chapter 4

Xputer System

4.1 Motivation and Methodology

The Xputer is a machine paradigm for efficient implementation of parallel algorithms, suggested by researchers at Kaiserslautern University of Technology, Germany [7]. The objective is to develop a design that supports parallel applications with:

1. high throughput;
2. low hardware cost; and
3. good flexibility in design.

Customized hardware is able to achieve the first two criteria, but with significant design and maintenance work. General purpose processors come with great flexibility, at the price of degraded performance from the overhead caused by data dependencies and addressing conflicts during execution. The researchers observed that parallel and concurrent computer systems are still unable to achieve all the objectives, due to high communication overhead or inflexibility within the system [6].

The suggested solution is the Xputer, a deterministically data-driven computational machine paradigm. It is a processor organization that uses data sequencing as its model for parallel computation, instead of the control sequencing of conventional computers [7].

In conventional computers, the concept of control sequencing integrates memory access into the control path: data is loaded from memory when needed, and then goes through the control path for processing. The Xputer paradigm takes a different approach, based on the idea of data sequencing. The sequence of memory accesses within an application is identified as a scan pattern. This data scan pattern becomes the sequence of program execution, instead of an instruction sequence. Figure 4.1 illustrates some scan pattern examples.

[Figure 4.1: Scan Pattern Examples [6]. (a) Skewed video fill pattern (b) Horizontal video fill pattern (c) Circular loop pattern]

A component in the system is responsible for handling the scan pattern. Based on the scan pattern, the component generates memory addresses and initiates memory requests to external memory. A processing unit accepts the data and performs the necessary data manipulation. Memory access and data manipulation are connected only by a loosely coupled relationship. The direct sequencing from data avoids most data-related problems, such as data dependencies and data overhead, providing great improvements in power and performance.

4.2 Xputer Machine Paradigm

Figure 4.2 shows the general structure of the Xputer paradigm. The data sequencer is the component that stores data access patterns and initiates memory accesses. Computations take place in the ralu.

[Figure 4.2: Architectural Overview of the Xputer Paradigm [6]. The data sequencer (scan pattern generator) and the residual control (code and SSR soft state registers) drive scan caches that connect the data memory to the ralu subnets.]

4.2.1 Data Sequencer

The data sequencer is the primary control for Xputer operations. As previously mentioned, the data sequencer is the component that accepts a scan pattern and generates the sequence of memory addresses. It is equivalent to the program counter in a computer; because the sequence is now based on data instead of instructions, the design is data-driven. The main memory space is abstracted as a two-dimensional data map. This data map serves as the basis for all memory accesses in the system. To make address generation more efficient, the data sequencer has a built-in repertory of hardwired generic scan patterns.

Examples include shuffle, butterfly, linear scan, and video scan patterns. For non-generic list-driven scan patterns, a next-address field is stored with the data [6].

4.2.2 Scan Caches

Scan caches are the interface between the data memory and the data manipulator. After the data sequencer generates an address from the scan pattern, it loads the data into the scan caches. This load is not restricted to the particular datum identified by the generated address, but covers a window of data necessary for an operation; the generated address is actually a relative position that locates the scan window for each computation. Figure 4.3 is an example showing how scan windows are mapped in the data map.

[Figure 4.3: Mapping of Scan Window following the Scan Pattern in the Data Map [14]]

The grey arrow is the starting point of the scan pattern, and the black arrows define the sequence of the scan pattern. Each time an element is selected by the scan pattern, a 2x3 window of data is selected relative to the pointed element. This is the scan window. Data within the scan window is transferred to the scan caches for processing. Scan caches are connected to the processing unit fully in parallel to support high speed reads and writes. Scan windows are resizable at run time, since they vary for different computations. This feature, together with fully deterministic caching strategies supported by the compiler, provides an almost 100% cache hit rate [6].
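To make the data-sequencing idea concrete, here is a small Python sketch of a scan pattern walking a two-dimensional data map and selecting a 2x3 scan window at each step, in the spirit of Figures 4.1 and 4.3. The pattern, map size, and window shape are hypothetical choices for illustration; in hardware, a generic address generator would issue these addresses and fill the scan caches:

```python
def horizontal_video_scan(rows: int, cols: int, step: int):
    """Yield (row, col) anchor positions of a horizontal video-fill scan pattern."""
    for r in range(rows):
        for c in range(0, cols, step):
            yield (r, c)

def scan_window(anchor, height: int, width: int):
    """Addresses of a height x width scan window relative to the anchor element."""
    r0, c0 = anchor
    return [(r0 + dr, c0 + dc) for dr in range(height) for dc in range(width)]

# Walk a 4x6 data map with a 2x3 window; anchors stay in rows 0-2 so the
# window never leaves the map.
for anchor in horizontal_video_scan(rows=3, cols=6, step=3):
    window = scan_window(anchor, height=2, width=3)
    print(anchor, "->", window)  # the window contents would be loaded into a scan cache
```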

4.2.3 ralu

The reconfigurable Arithmetic Logical Unit (ralu) within an Xputer system is responsible for data manipulation. The ralu only needs to perform computations, without worrying about the source of the data or dependencies between data. The ralu is composed of many small logical processing cells. Through configuration and connection of these small cells, the ralu can be configured to perform a single operation, or the cells can be distributed to handle several independent parallel tasks. Reconfiguration of the cells is designed to be done at high speed. With a good configuration, the ralu can achieve high hardware resource utilization and a high degree of optimization.

4.2.4 Residual Control

The data sequencer can start and continue the task of address generation independently, given the initial memory address and relevant information. However, the system still requires minimal control for initialization and termination. Residual control is the insertion of tagged control words (TCWs) into the data sequence. Whenever a TCW is encountered, it is loaded and decoded into the soft state registers (SSRs). The encountered TCW serves as a control to select the next ralu configuration and the next scan pattern [6].

4.2.5 Comparison between Computer and Xputer

Table 4.1 lists the corresponding components and the key differences between a computer and an Xputer.

Computer Component               Xputer Component    Difference
ALU                              ralu                The ralu is not hardwired, but reconfigurable and adaptable to varying operations
Cache                            Scan caches         Scan caches are size-adjustable and fully parallel to the ralu
Fetch unit and program counter   Data sequencer      Xputer operations are deterministically data-driven by the data pattern, not by instructions
Branch and jump unit             Residual control    Only residual control determines the selection of operations and data scan patterns

Table 4.1: Comparison between Computer components and Xputer components [6]

4.3 Software

The compiler for an Xputer system accepts its own language as input, with statement syntax following the C programming language. The compiler parses the code into two parts: sequential code and structural code. Sequential code represents the scan pattern, mainly identified from loops within the program; it is used by the data sequencer to generate data access patterns from memory. Structural code defines the operations to be performed on the data; it serves as configuration code to reconfigure the ralu to process incoming data. TCWs are generated to synchronize the operation of the two components. For further information about parsing techniques and the mapping of code to hardware, please refer to [15, 14, 16].

4.4 Xputer Prototype - Map-Oriented Machine 3 (MoM-3)

The Map-Oriented Machine 3 (MoM-3) is one of the systems that implement the Xputer paradigm. Figure 4.4 gives an architectural overview of the implementation.

[Figure 4.4: Architectural Overview of a Map-Oriented Machine 3 Machine [17]]

The implementation of MoM-3 adopts the characteristics of the Xputer paradigm described in the previous section. The data sequencer is composed of multiple generic address generators (GAGs). Under hardware control, each generic address generator can produce address sequences corresponding to up to three nested loops of a program. Architecturally, the ralu is made up of multiple ralu subnets.

Each ralu subnet has its own associated scan caches. A ralu subnet can be further decomposed into a ralu control unit and an rdpa. The rdpa is a word-oriented, scalable, regular array of simple processing elements. The control unit is responsible for driving the subnet, identifying and locating PEs, and managing the configuration. MoM-3 has direct access to the host's main memory; ralu subnets can receive their data directly from scan caches or via the MoMbus from the main memory.

4.4.1 KressArray 3

The implementation of the rdpa inside a ralu subnet uses the concept of KressArray 3, an implementation of the Field Programmable ALU Array (FPAA) approach [15]. The rdpa consists of simple, word-oriented processing elements called datapath units (rdpus). The rdpa is built of identical units and is scalable in nature. Figure 4.5 shows a simple rdpa consisting of nine rdpus.

[Figure 4.5: Architectural Overview of KressArray 3 rdpa [15]]

An rdpu is built as a tiny microprocessor, with built-in registers, that supports all C operators. The rdpus are transport-triggered, which means that the availability of data triggers the operation [18]. The rdpus, as well as the datapath of the entire architecture, are 32 bits wide. This provides a higher granularity for the basic function units than usual FPGAs. An rdpu can serve as an arithmetic unit, as an arithmetic and routing unit at the same time, or exclusively for routing purposes. Each rdpu can store up to four configurations, or contexts, in its configuration memory. Reconfigurations can be performed very quickly by a context switch mechanism.

Local interconnects connect each rdpu to its four neighbors. They are used to pass data between rdpus. Local interconnects at the chip boundary can be connected to other devices to create a bigger system. To reduce the total number of pins, the 32-bit local interconnects at the boundary are connected in serial mode. In addition to the local interconnect network, there is also a hierarchical global routing network. The two buses in the global network are used for access to the MoMbus shown in Figure 4.4. To avoid a bottleneck on the MoMbus, the bus network within each ralu subnet is connected to a switch.

The switch controls whether the inner buses connect to the external MoMbus or not. If the inner buses are isolated from the MoMbus, the ralu subnet is also isolated from the other subnets; the buses can then serve as a communication network within the ralu subnet, allowing non-neighboring rdpus connected to these buses to communicate freely. There are also two external parallel buses to the outside of the chip, which can be connected to each inner bus, enabling parallel data transfers from outside the array to achieve more parallelism [15].

4.5 Evaluation Models

4.5.1 Performance Model

The role of the compiler is to abstract an application into a sequence of tasks, and then map the identified tasks onto the executable processing elements. An application can be broken down into multiple operations. While one processing element may not hold a whole operation, multiple processing cells can be connected together to support the computation. This gives the realization in Equation 4.1:

$$\text{Application} \rightarrow \text{Operation} \rightarrow \text{Processing Element} \tag{4.1}$$

In the context of the Xputer, the processing element is the rdpu cell. It is the fundamental processing power of the system. In this realization, an operation is defined as a task that can be executed independently of any other task, given its input; that is, the operations can execute in parallel with each other. To support a large operation, multiple rdpu cells may be required, forming a small pipelined environment. This formation requires more steps and thus more processing time. Due to branch statements, there may also be several paths within an operation. Assuming each rdpu has the same clock period, the longest time to execute one operation is equal to the number of rdpus in the longest path of the operation, or a multiple of this factor. The operation with the longest path becomes the critical path for processing. Let $L$ denote latency, $S$ size, $N$ number, $LP$ the longest path in an operation, and $T$ execution time:

$$L_{\text{operation}} = LP_{\text{operation}} \times T_{DPU}$$

This equation covers only a single datum through the operation. Since long operations are implemented as pipelines, output can be generated continuously once the pipeline is filled. Equation 4.2 defines the resulting latency for an operation:

$$L_{\text{operation}} = (LP_{\text{operation}} + N_{\text{data}}) \times T_{DPU} \tag{4.2}$$

The condition of the data determines which path it takes through an operation, which is not necessarily the longest path; Equation 4.2 considers the worst-case latency. As well, this equation considers only the latency from processing. The latency from scan cache access is omitted for two reasons:

1. Scan caches are built for very high speed access, and thus contribute only a small latency;
2. The nature of pipelining allows events to happen concurrently: the time to access the scan window overlaps with rdpu processing.

For simplicity, the scan window access latency is simply omitted.

The performance of an Xputer system can be represented by throughput. Computing throughput requires the amount of output and the execution time, both of which are tightly related to the number and size of the operations of an application. Each operation generates one data output at the end of its manipulation, while its latency is defined in Equation 4.2:

$$Throughput_{\text{operation}} = \frac{N_{\text{data}} \times S_{\text{data}}}{L_{\text{operation}}}$$

The total throughput of a ralu subnet, considering context switching as well, leads to Equation 4.3:

$$Throughput_{\text{ralu subnet}} = \sum_{i=1}^{N_{\text{operation}}} \sum_{j=1}^{N_{\text{context operation}_i}} (Throughput_{\text{operation}_{i,j}} \times freq_j) \tag{4.3}$$

The inner summation accounts for the average throughput over all contexts, based on the probability $freq_j$ of each context; the outer summation then sums the result for all operations within the ralu subnet. Summing the throughput of all ralu subnets, as shown in Equation 4.4, gives the overall system throughput:

$$Throughput_{\text{system}} = \sum_{k=1}^{N_{\text{ralu subnet}}} Throughput_{\text{ralu subnet}_k} \tag{4.4}$$

Measuring the throughput of every operation in every ralu subnet is a tedious job. A simpler approach is to measure the average throughput of one ralu subnet and multiply it by the number of available ralu subnets. This generates an average system throughput in much less time.
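A minimal Python sketch of the throughput model follows, implementing Equations 4.2 to 4.4. The operation shapes, context frequencies, and the DPU clock period are illustrative assumptions, not MoM-3 measurements:

```python
T_DPU = 10e-9  # assumed rdpu clock period, in seconds

def operation_latency(longest_path: int, n_data: int) -> float:
    """Equation 4.2: worst-case pipelined latency of one operation."""
    return (longest_path + n_data) * T_DPU

def operation_throughput(longest_path: int, n_data: int, data_size_bits: int) -> float:
    """Throughput of one operation: N_data * S_data / L_operation (bits/s)."""
    return n_data * data_size_bits / operation_latency(longest_path, n_data)

def subnet_throughput(operations) -> float:
    """Equation 4.3: sum over operations and contexts of throughput * freq."""
    return sum(tp * freq
               for contexts in operations   # one context list per operation
               for tp, freq in contexts)    # (throughput, context frequency) pairs

# One subnet with two operations; the second switches between two contexts.
op1 = [(operation_throughput(5, 1000, 32), 1.0)]
op2 = [(operation_throughput(8, 1000, 32), 0.7),
       (operation_throughput(12, 1000, 32), 0.3)]
subnet = subnet_throughput([op1, op2])

# Equation 4.4: system throughput sums over all ralu subnets; with identical
# subnets this reduces to multiplying by the subnet count (8 in MoM-3).
print(f"system throughput ~ {8 * subnet / 1e9:.2f} Gbit/s")
```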

4.5.2 Power Model

The approach of modelling average power by operational stage, proposed in Chapter 3, is used here. In particular, Equations 3.3 to 3.7 are applied to the Xputer system. In the following power equations, $K$ represents the average unit power consumption per word size.

Configuration Loading

Configuration of an Xputer system separates into two parts: the ralu and the GAGs. The two parts are loosely coupled but closely related in operation. In the MoM-3 implementation there are eight GAGs and eight ralu subnets available.

$$P_{\text{GAG cfg loading}} = \sum_{a=1}^{N_{GAG}} \sum_{b=1}^{N_{\text{nested loop GAG}_a}} (K_{\text{bus interface}} + K_{\text{IO pad}} + K_{\text{GAG cfg write}}) \times S_{\text{GAG cfg}_{a,b}} \tag{4.5}$$

In Equation 4.5, one summation represents all the nested loops within each GAG, and the other represents the number of GAGs available in the system. $S_{\text{GAG cfg}_{a,b}}$ is the size of the configuration for each nested loop. With the same idea, Equation 4.6 models the power consumption of loading configuration into the ralu. The two summations represent the number of rdpus in a ralu subnet and the number of active ralu subnets, respectively. The number of active rdpus includes the configurations of all contexts.

$$P_{\text{ralu cfg loading}} = \sum_{i=1}^{N_{\text{ralu subnet}}} \sum_{j=1}^{N_{\text{rDPU subnet}_i}} (K_{\text{bus interface}} + K_{\text{IO pad}} + K_{\text{ralu cfg write}}) \times S_{\text{ralu cfg}_{i,j}} \tag{4.6}$$

On-Chip Configuration

Configuring a GAG is simply loading the predefined data pattern into storage, to be read during execution; no extra configuration mapping is needed. For the ralu, the system supports context switching of rdpus during run time, so only the first context is mapped from storage to the rdpus at initialization. Each ralu subnet may use a different number of rdpus for processing. In Equation 4.7, one summation covers all active rdpus in a ralu subnet, and the second summation covers all ralu subnets:

$$P_{\text{ralu cfg}} = \sum_{i=1}^{N_{\text{ralu subnet}}} \sum_{k=1}^{N_{\text{active rDPU subnet}_i}} (K_{\text{ralu cfg read}} + K_{\text{rDPU cfg}}) \times S_{\text{active rDPU}_{i,k}} \tag{4.7}$$

I/O Data

The Xputer system relies on heavy use of the scan window to maintain good performance. Data to be used in a ralu subnet is first loaded from external memory into its corresponding scan window, a fast cache connected in parallel with the ralu subnet. This allows fast data access by the ralu, and avoids memory latency problems and global bus bottlenecks. Equation 4.8 gives the average power consumption for the GAGs to generate memory addresses and initiate memory requests based on the scan window size; data is transferred from external memory to the scan caches via the I/O interface. There are two summations in the equation: one indexes the GAGs, and the other the nested loops within each GAG.

$$P_{\text{input data}} = \sum_{a=1}^{N_{GAG}} \sum_{b=1}^{N_{\text{nested loop GAG}_a}} (K_{\text{GAG exe}} + K_{\text{bus interface}} + K_{\text{IO pad}} + K_{\text{scan cache write}}) \times N_{\text{scan pattern elements}_{a,b}} \times S_{\text{scan window}_{a,b}} \tag{4.8}$$

$K_{\text{GAG exe}}$ is the execution power of a GAG, which loads a scan pattern and generates memory access addresses. Upon completion of work on a set of data, the data in the scan window is written back to memory. A control word within the ralu subnet generates the output memory address and performs the memory writeback:

$$P_{\text{output data}} = \sum_{i=1}^{N_{\text{ralu subnet}}} \sum_{n=1}^{N_{\text{write back subnet}_i}} (K_{\text{scan cache read}} + K_{\text{bus interface}} + K_{\text{IO pad}}) \times N_{\text{write back elements}_{i,n}} \times S_{\text{write back}_{i,n}} \tag{4.9}$$

Execution

Execution in the Xputer is very simple. The processing elements, or rdpu units, inside a ralu subnet are not concerned with the source of the data; they are built as computational units that perform processing whenever data is available. In the KressArray 3 implementation, rdpus are built as very simple microprocessors using standard cells. Assuming all rdpus consume the same amount of power, processor power consumption can be modelled as a summation over all active rdpus of all ralu subnets:

$$P_{\text{ralu exe}} = \sum_{i=1}^{N_{\text{ralu subnet}}} K_{\text{rDPU exe}} \times N_{\text{active rDPU subnet}_i} \tag{4.10}$$

Equation 4.10 only covers the case where no context switching occurs. If context switching happens, the number of active rdpus within a ralu subnet may vary. Let $j$ denote the index over contexts and $freq_j$ the frequency with which context $j$ is active during execution. The average power consumption over the available contexts is determined by their relative usage:

$$P_{\text{ralu exe}} = \sum_{i=1}^{N_{\text{ralu subnet}}} \sum_{j=1}^{N_{\text{context subnet}_i}} K_{\text{rDPU exe}} \times N_{\text{active rDPU subnet}_{i,j}} \times freq_j \tag{4.11}$$

Reconfiguration Overhead

Because the system supports context switching, rdpus may be reconfigured during run time. The only reconfiguration overhead is mapping the new context onto the rdpus and reprogramming the rdpu connections; after that, execution and data loading operate the same way as before. Equation 4.12 models the reconfiguration overhead from context switching. One summation covers the rdpu reconfigurations within a ralu subnet, with $S_{\text{recfg rDPU}_{i,m}}$ the size of the configuration, and the second summation totals this over all ralu subnets:

$$P_{\text{ralu recfg}} = \sum_{i=1}^{N_{\text{ralu subnet}}} \sum_{m=1}^{N_{\text{rDPU recfg subnet}_i}} K_{\text{rDPU cfg}} \times S_{\text{recfg rDPU}_{i,m}} \tag{4.12}$$
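As a usage sketch, the execution-power model with context switching (Equation 4.11) can be written directly in Python. The unit power $K_{\text{rDPU exe}}$ and the context usage figures are assumptions for illustration:

```python
K_RDPU_EXE = 5.0  # assumed average execution power per active rdpu, in mW

def subnet_exe_power(contexts) -> float:
    """Average execution power of one ralu subnet.

    contexts: (active_rdpus, freq) pairs, where freq is the fraction of
    execution time each context is active (the freqs sum to 1).
    """
    return sum(K_RDPU_EXE * n_active * freq for n_active, freq in contexts)

def ralu_exe_power(subnets) -> float:
    """Equation 4.11: sum of average execution power over all ralu subnets."""
    return sum(subnet_exe_power(contexts) for contexts in subnets)

subnets = [
    [(9, 1.0)],            # subnet 1: 9 rdpus active in a single context
    [(6, 0.8), (4, 0.2)],  # subnet 2: context switching between 6 and 4 rdpus
]
print(f"P_ralu_exe ~ {ralu_exe_power(subnets):.1f} mW")  # 73.0 mW
```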

Reconfiguration Overhead

Because the system supports context switching, rdpUs may be reconfigured at runtime. The only reconfiguration overhead is mapping the new context to the rdpUs and reprogramming the rdpU connections; afterwards, execution and data loading operate the same way as before. Equation 4.12 models the reconfiguration overhead from context switching. One summation covers the rdpU reconfigurations within a ralu subnet, with S_{recfg\ rdpU_{i,m}} the size of the configuration, and the second summation totals that over all ralu subnets.

P_{ralu\ recfg} = \sum_{i=1}^{N_{ralu\ subnet}} \sum_{m=1}^{N_{rdpU\ recfg\ subnet_i}} ( K_{rdpU\ cfg} \cdot S_{recfg\ rdpU_{i,m}} ) \quad (4.12)

4.5.3 Area Model

The two major components in an Xputer system are the Data Sequencer and the ralu. There is always area overhead within a chip due to place-and-route and interconnect issues. To keep the area model scalable, this overhead is expressed as a percentage of the total component area. Let A denote the area of a component and N its count. The total area, including area overhead, is given by Equation 4.13:

A_{total} = (A_{data\ sequencer} + A_{ralu}) \cdot (1 + \alpha_{total\ overhead}) \quad (4.13)

The Xputer itself is an organization rather than a particular chip, so actual area usage varies across implementations. The following area model is based on the MoM-3 and KressArray-3 implementation.

A_{data\ sequencer} = ( (A_{GAG} + A_{scan\ pattern\ storage}) \cdot N_{GAG} ) \cdot (1 + \alpha_{data\ sequencer\ overhead}) \quad (4.14)

A_{ralu\ subnet} = ( (A_{rdpU} \cdot N_{rdpU}) + A_{ralu\ subnet\ ctrl} ) \cdot (1 + \alpha_{ralu\ subnet\ overhead}) \quad (4.15)

A_{ralu} = (A_{ralu\ subnet} \cdot N_{ralu\ subnet}) \cdot (1 + \alpha_{ralu\ overhead}) \quad (4.16)
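A minimal sketch of how Equations 4.13 to 4.16 compose; every component area and overhead fraction below is an assumed placeholder, since the thesis does not publish MoM-3 figures.

    def with_overhead(area, alpha):
        # Apply a place-and-route/interconnect overhead fraction.
        return area * (1.0 + alpha)

    def xputer_area(a_gag, a_scan_store, n_gag,
                    a_rdpu, n_rdpu, a_subnet_ctrl, n_subnet, alpha):
        # Equation 4.14: the data sequencer replicates GAG + scan storage.
        a_ds = with_overhead((a_gag + a_scan_store) * n_gag, alpha["ds"])
        # Equation 4.15: one ralu subnet = its rdpUs plus subnet control.
        a_sub = with_overhead(a_rdpu * n_rdpu + a_subnet_ctrl, alpha["sub"])
        # Equation 4.16: the ralu replicates the subnet N_ralu_subnet times.
        a_ralu = with_overhead(a_sub * n_subnet, alpha["ralu"])
        # Equation 4.13: total area with the global overhead on top.
        return with_overhead(a_ds + a_ralu, alpha["total"])

    # Assumed mm^2 figures with 10% overhead at every level:
    alpha = {"ds": 0.1, "sub": 0.1, "ralu": 0.1, "total": 0.1}
    print(xputer_area(0.5, 0.2, 8, 0.1, 32, 0.4, 8, alpha))  # ~45.1 mm^2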

4.5.4 Other Metrics

I/O

Two 32-bit global buses are implemented and connected to all GAGs and ralu subnets. The I/O bandwidth in the MoM-3 implementation is quite limited. The Xputer design approach is to use scan caches as fast buffers between memory and the ralu, reducing the possibility of a memory bottleneck. For I/O-intensive programs, such as network data routing, the limited I/O bandwidth of the Xputer is insufficient.

Scalability

Processing power can be scaled in two ways. The first is to increase the number of rdpUs within each ralu subnet. This requires recompilation of the program, since the existing structural code that configures the ralu subnets cannot otherwise take advantage of the additional resources. The other way is to add more GAGs and ralu subnets to the system, which achieves a greater amount of parallelism in execution. A problem with this approach is that every added component puts more pressure on the MoMBus, the main I/O system bus; the resulting memory bottleneck limits the potential gain from parallel execution.

Adaptability

The Xputer works well with data-streaming programs, such as video processing applications, whose data sequences are easily predictable and await processing in a streaming fashion. As discussed in previous sections, the MoM-3 implementation suffers from I/O limitations and from memory bottlenecks when scaled. This discourages the use of the Xputer beyond streaming applications.

4.6 Summary

The Xputer functions well with streaming and data-driven applications, since the system is designed for this purpose. Its avoidance of data latency and data dependency problems, together with software-like design flexibility, makes it a good choice for data-driven programs.

Although it was mentioned that the MoM-3 is not a very adaptable design, its strength lies in its data-handling specialization rather than in adaptability. As well, implementations other than the MoM-3 may provide better I/O and scalability.

It is unfortunate that insufficient information could be collected about the performance, power, area, and other operating statistics of an Xputer system; this limits the amount of analysis that can be performed.

Chapter 5

NEC DRP System

5.1 Motivation and Methodology

NEC Electronics Corporation announced the development of the dynamically reconfigurable logic engine (DRLE) in 1999 [19, 20, 21] and succeeded in developing the DRP-1 prototype chip in 2002 [9]. DRP stands for Dynamically Reconfigurable Processor. The main feature of the logic chip is that it can be reconfigured at very high speed at runtime: by pre-programming multiple configuration contexts within the chip, the reconfiguration engine can select from the repository of configurations to implement different datapaths during execution. The design of the DRP, with its highly dynamic reconfigurable nature, is aimed at the following [22]:

1. Support for Parallel Processing: Applications like the packet processing operations of network devices, image applications, and signal processing [9] can benefit from the simultaneous parallel operation of the processing elements.

2. Support for Specification Revisions: The specification of an application is programmed as contexts within a DRP chip. The specification or functionality of the application can therefore be revised through software even after fabrication and shipping.

3. Support for Software-like Applications: The DRP development environment is based on an ASIC development environment that supports C language programming. It provides ease

in development and allows reuse of existing software code.

The development of the DRP does not stop here: ongoing research at NEC Electronics aims to provide higher performance and lower power consumption from the design [23].

5.2 Architecture

The DRP is a coarse-grained reconfigurable processor. Its primitive unit is a Tile, and a DRP processor consists of an arbitrary number of tiles, meaning the system can be scaled by increasing the tile count. In the current prototype, the DRP-1 chip, each tile consists of 8x8 Processing Elements (PEs), a State Transition Controller (STC), sixteen 2-ported memories (VMEMs, or Vertical Memories), a VMEM Controller (VMEMCtrl), and eight 1-ported memories (HMEMs, or Horizontal Memories). The structure of a tile is shown in Figure 5.1.

Figure 5.1: Structure of a DRP Tile [24]

PEs are the fundamental computational units of the DRP architecture. Each PE has an 8-bit ALU, an 8-bit DMU, an 8-bit x 16-word register file, and an 8-bit flip-flop [25]. Figure 5.2 is an

architectural overview of a PE.

Figure 5.2: Structure of a DRP Processing Element [24]

Within each PE is a repository of configurations, or contexts; the current prototype can store up to sixteen contexts in each PE. The State Transition Controller (STC) within each tile is the tile's main controller. Implemented as a finite-state-machine-based sequencer, the STC has full control over the instruction pointer (IP) signal, which determines the context to be used by each PE. DRP operation can be reconfigured by dynamically altering the context used by the PEs and by reprogramming the PE interconnections, without halting the operation of the device. The on-chip repository of configurations avoids the overhead of loading a new configuration into the system, enabling context switching to be done in a single clock cycle [25]. STCs from different tiles can run as a group or independently to derive different combinations, providing great flexibility to support various operations.

The prototype DRP-1 was developed as the first chip to implement the DRP architecture. The chip supports a maximum system operating clock frequency of 133 MHz, and four phase-locked loops (PLLs) are included to support four types of clock input. There are 256 data pins for I/O, plus 48 flag data pins for control purposes; some pins are shared to reduce the total pin count.

Each DRP-1 chip includes 4x2 DRP tiles, external SRAM memory controllers, PCI controllers, and eight 32-bit multipliers. The development chip supports various interfaces as well, including PCI, synchronous DRAM, CAM, and SRAM memory interfaces, enabling a wide range of system applications [9].

5.3 Software

The DRP system has its own compiler to perform application compilation. It is a Linux-based design tool that includes a synthesis tool and a compiler. Application information can be described in the C language, and the compiler then generates an optimized circuit from the given information [24]. Source code is synthesized and divided into datapath configuration data and control programs, which are mapped to the PEs and the STC, respectively. Coordination between the STC and the PEs enables data processing tasks to be dynamically reconfigured at each clock cycle [26, 24].

5.4 Specifications of DRP-1 Prototype Chip

Specifications are extracted from [9, 24]:

Technology: 150-nm CMOS process
Package: 696-pin TBGA
Operating voltage: 3.3 V (external), 1.5 V (internal)
Number of PEs: 512
Clock frequency: 133 MHz (maximum)
Number of I/O pins: 256 data pins, 48 control pins
Number of transistors: approximately 22 million
Installed memory: 2.2 MB
Supported interfaces: PCI, synchronous DRAM, CAM, and SRAM memory interfaces

5.5 Evaluation Model

In the following modelling, let K represent unit power consumption, L denote latency, S represent size, N represent a count, LP the longest path in an operation, and T an execution time.

5.5.1 Performance

The computational fabrics of a DRP chip and an Xputer system are very similar in nature: both are made up of arrays of processing cells with context switching support. Because of this, the DRP can reuse the performance model previously developed for the Xputer system with minor changes. The domain realization from Equation 4.1 in Chapter 4,

Application \rightarrow Operation \rightarrow Processing\ Element \quad (5.1)

is just as effective for the DRP, where the processing elements are the PEs of the DRP tiles. With the same domain realization and the same definition of an operation, the latency equation for an operation from Equation 4.2 can be reused as well:

L_{operation} = (LP_{operation} - 1 + N_{data}) \cdot T_{PE} \quad (5.2)

Again, memory access latency is omitted from Equation 5.2 due to the overlapping nature of pipelining. Performance of the DRP system can be expressed in terms of throughput. Adapting Equation 3.1 to the DRP system gives

\mathit{Throughput}_{operation} = \frac{N_{data} \cdot S_{data}}{L_{operation}}

where N_{data} is the number of data items or iterations and S_{data} is the size of each data item. The current implementation of the PEs supports 8-bit operations. The total throughput of a DRP tile, taking context switching into account, is given by Equation 5.3:

\mathit{Throughput}_{tile} = \sum_{i=1}^{N_{operation}} \sum_{j=1}^{N_{context\ operation_i}} ( \mathit{Throughput}_{operation_{i,j}} \cdot freq_j ) \quad (5.3)

The variable i indexes operations within a tile, and j indexes one of the sixteen supported on-chip contexts; freq_j is the probability that context j will be activated. Summing the throughput of all tiles gives the throughput of the system, as in Equation 5.4:

\mathit{Throughput}_{system} = \sum_{k=1}^{N_{tiles}} \mathit{Throughput}_{tile_k} \quad (5.4)

Similar to the Xputer, the average-throughput scheme applies here as well: throughput measured for a single tile and multiplied by the number of available tiles gives an average performance rating.
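A short sketch of Equations 5.2 and the throughput expression in Python; the clock is the DRP-1's 133 MHz maximum, one PE stage is assumed to take one cycle, and the operation parameters are illustrative assumptions only.

    CLOCK_HZ = 133e6              # DRP-1 maximum clock frequency
    T_PE = 1.0 / CLOCK_HZ         # assume one PE stage per clock cycle

    def operation_latency(longest_path, n_data):
        # Equation 5.2: pipeline fill (LP - 1 cycles), then one result
        # per cycle for each of the n_data items.
        return (longest_path - 1 + n_data) * T_PE

    def operation_throughput(n_data, data_bits, longest_path):
        # Throughput = N_data * S_data / L_operation, in bits per second.
        return n_data * data_bits / operation_latency(longest_path, n_data)

    # An assumed 6-stage operation streaming 1000 items through the
    # 8-bit-wide PEs sustains roughly 1.06 Gbit/s:
    print(operation_throughput(1000, 8, 6))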

5.5.2 Power Model

Adapting the power model from Chapter 3, with Equations 3.3 to 3.7, generates the following.

Configuration Loading

At initialization, configuration contexts are loaded into each tile, particularly into the PEs, from external memory.

P_{cfg\ loading} = \sum_{m=1}^{N_{tile}} ( (K_{IO} + K_{PE\ cfg\ write}) \cdot N_{PE\ tile_m} + (K_{IO} + K_{STC\ cfg\ write}) ) \quad (5.5)

The I/O power consumption here includes both the actual I/O pins and the interface used, such as PCI or SRAM. Besides the configuration, data are loaded into on-chip memory as well.

P_{mem\ loading} = \sum_{m=1}^{N_{tile}} ( (K_{IO} + K_{mem\ write}) \cdot N_{mem\ tile_m} ) \quad (5.6)

There are two types of on-chip memory: 2-ported memories (VMEMs) and 1-ported ones (HMEMs). The equation can be decomposed further if more accurate detail is desired.

On Chip Configuration

In the DRP, on-chip configuration is the process of configuring the context repository within each PE. This has already been done by the time the configuration is loaded on chip, as accounted for in Equation 5.5. As a further step, the STC starts functioning and generates the initial IP signal for the PEs.

Execution

Execution in the DRP is very systematic: the PEs perform continuous processing as needed. The only complication is context switching at runtime. For simplicity, the power consumption of each PE can be assumed to be equal. Equation 5.7 adds the power of the STC and of all active PEs in a tile:

P_{DRP\ tile} = ( \sum_{n=1}^{N_{active\ PE}} K_{PE\ exe} ) + K_{STC} \quad (5.7)

Summing the power consumption of all active PEs in the tile is the basic approach when no context switching occurs. If context switches happen, the number of active PEs within a DRP tile may vary. Let j denote the index over contexts and freq_j represent the probability of context j being active. The average power consumption over all available contexts is then determined by their relative usage:

P_{DRP\ tile} = ( \sum_{j=1}^{N_{active\ context\ tile}} ( K_{PE\ exe} \cdot N_{active\ PE_j} \cdot freq_j ) ) + K_{STC}

Equation 5.8 sums all computational components within a DRP processor, including all the tiles and the multipliers, to give the overall power consumption:

P_{DRP\ chip} = \sum_{i=1}^{N_{active\ tile}} K_{DRP\ tile_i} + \sum_{k=1}^{N_{active\ multiplier}} K_{multiplier_k} \quad (5.8)

I/O Data

Due to insufficient published information about this product, the data activity in the NEC DRP is not very clear. The following equations, Equations 5.9 and 5.10, attempt to generalize the basic data operations. The term I/O here includes the actual I/O pins and the SRAM interface being used.

P_{input\ data} = (K_{IO} + K_{mem\ write}) \cdot N_{incoming\ data} \quad (5.9)

P_{output\ data} = (K_{IO} + K_{mem\ read}) \cdot N_{outgoing\ data} \quad (5.10)
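Taken together, the stage equations of this section suggest a simple time-weighted aggregation over one application run, sketched below. The watt and second figures are invented placeholders, and this weighting scheme is only one plausible reading of the Chapter 3 stage decomposition, not its definitive form.

    def average_power(stage_power_w, stage_time_s):
        # Weight each operational stage's power by the time spent in it:
        # total energy over total time gives the run's average power.
        total_time = sum(stage_time_s.values())
        energy = sum(stage_power_w[s] * stage_time_s[s] for s in stage_power_w)
        return energy / total_time

    # Hypothetical one-second run dominated by execution:
    power = {"cfg_load": 0.8, "exe": 2.5, "io": 1.2, "recfg": 0.3}
    time_s = {"cfg_load": 0.01, "exe": 0.90, "io": 0.08, "recfg": 0.01}
    print(average_power(power, time_s))  # ~2.36 W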

Reconfiguration Overhead

Reconfiguration in the DRP is as simple as changing the IP signal to each PE: the PEs switch the context used for execution, and the STC defines the new connection patterns between PEs. The whole reconfiguration process is done in a single clock cycle, so the reconfiguration overhead is minimal.

5.5.3 Area Model

The major components in a DRP tile are the PEs, the STC, and the memories. To make the area model scalable, area overhead is included in the equation and expressed as a percentage of component area. Equations 5.11 and 5.12 generalize the area of a tile and of a chip, respectively.

A_{DRP\ tile} = ( N_{PE} \cdot A_{PE} + N_{2\ ported\ mem} \cdot A_{2\ ported\ mem} + N_{2\ ported\ mem\ ctrl} \cdot A_{2\ ported\ mem\ ctrl} + N_{1\ ported\ mem} \cdot A_{1\ ported\ mem} + A_{STC} ) \cdot (1 + \alpha_{DRP\ tile\ overhead}) \quad (5.11)

A_{DRP\ chip} = ( N_{tile} \cdot A_{DRP\ tile} + N_{multiplier} \cdot A_{multiplier} + A_{SRAM\ ctrl} + A_{PCI\ interface} ) \cdot (1 + \alpha_{DRP\ chip\ overhead}) \quad (5.12)

5.5.4 Other Metrics

On-Chip Memory

There is a total of 2 MB of on-chip memory in the NEC DRP-1 chip, a relatively large amount compared to other systems. A DRP chip can therefore maintain a large amount of data on chip for processing without needing to access external memory; moreover, a large amount of storage or buffering is required by applications such as broadband processing.

Scalability

The array structure of PEs in a tile, and of tiles in the chip, makes it very easy to scale the system by expanding these numbers.

Adaptability

The 8-bit processing width was a carefully selected size. It provides coarser granularity than classic bit-oriented FPGAs, reducing computation overhead, while remaining finer-grained than 32-bit datapaths, improving utilization. An 8-bit granularity lets the system maintain a good balance between flexibility and utilization.

5.6 Summary

The granularity, the amount of on-chip memory, the fast reconfiguration time, the scalability, and the wide range of interfaces all show that the NEC DRP chip is a flexible and adaptable design. It is able to implement many applications efficiently, such as broadband communication, parallel processing, and multimedia applications.

The NEC DRP is aimed to be a commercial product, so many of its operating statistics and implementation details have not, at least not yet, been released. The lack of information makes it difficult to proceed with further evaluation. However, this chapter shows that high-level modelling can still be applied even when statistics are not available.

Chapter 6

MIT Raw System

6.1 Motivation and Methodology

In the past, wires were abstracted as instantaneous connections at the design stage; in other words, wire delay could be ignored at the architectural level to simplify design work. This was a reasonable assumption and worked well, since wire delay was small compared to the clock periods achievable at the time [4].

As technology improved, this changed. Processor clock frequencies have risen dramatically over the years, and an increase in clock frequency means a decrease in clock cycle period. Wire delay has become more and more significant compared to the clock period and must now be taken into consideration even at the architectural level. Careful placement and routing to minimize communication length has become a critical issue in modern processor design; however, it is an NP-complete problem and does not scale well.

To attack the wire delay problem, the methodology of the MIT Raw system is to rely on a simple, highly parallel VLSI architecture. Short interconnect between processing units limits the length of wires, and memories are distributed among multiple execution units to avoid memory bottlenecks [27]. The hardware architecture exposes all of its details to the software through a scalable ISA, allowing the compiler to determine and implement the best allocation of resources for each application [4]. Simple hardware, complex software is the fundamental idea of the Raw design.

One key feature of Raw is that its on-chip networks are exposed to the software through the Raw ISA. This exposure grants the compiler the ability to directly program the wiring resources of the processor and to schedule data transfers between tiles. With careful planning and scheduling, the compiler can determine at compile time when data are needed by a tile; it can issue the data request and set up the communication path for the transfer ahead of time. When the data are needed, the tile finds them already available, with no need to wait. The compiler's ability to pre-schedule communication patterns can mask a great deal of network latency [27].

6.2 Architecture

6.2.1 Raw Tiles

A Raw microprocessor is a set of interconnected Raw tiles. Each tile has:

- an eight-stage, in-order, single-issue, MIPS-style processor
- a four-stage, pipelined floating-point unit
- a 32 Kbyte data cache
- a 32 Kbyte processor instruction cache
- 64 Kbytes of software-managed instruction cache for the switch
- one static communication router
- two dynamic communication routers

The structures of a Raw microprocessor and of a Raw tile are shown in Figure 6.1. Each tile is a simple RISC-like pipelined processor and is interconnected with the other tiles over a pipelined, point-to-point network [4, 28]. The Raw processor design keeps each tile simple and small to maximize the number of tiles on a chip; doing so helps increase the chip's achievable clock speed and the amount of parallelism. Each tile, as a simple microprocessor, is able to support multi-granular operations. Each tile also has its own registers and memory, resulting in a distributed storage structure. On-chip memories distributed across the tiles eliminate the memory bandwidth bottleneck and provide

Figure 6.1: Raw Tile Structure [5]

shorter latency to each memory module. With this architecture, a Raw processor can be viewed as a system of multiple parallel processors.

A Raw machine uses a switched interconnect instead of buses. Each tile is interconnected to its neighbors by four 32-bit full-duplex on-chip networks, two static and two dynamic. This creates a network consisting of over 12,500 wires on-chip [4]. Figure 6.2 is an architectural view of the on-chip networks. Each tile is connected only to its four neighbors. Every wire is registered to ensure that the length of the longest wire never exceeds the length or width of a tile; this is a design property of the Raw system, used to control wire lengths.

Figure 6.2: Tile Interconnects in the Raw processor [4]

Such minimization of wire length allows a high clock frequency for the system. The replicated tile structure with short interconnects provides great scalability: tiles can be added to the system with no concern about how to place and route the wires.

Since tiles are connected only to their neighbors, any data transfer to a farther tile must pass through the connected tiles in between. Each tile-to-tile transfer can be abstracted as one network hop [4]. In a 4x4 tile structure, travelling from one corner to the opposite corner takes six hops, corresponding to six cycles of wire delay; the sketch below illustrates this hop-latency rule. To minimize the latency of inter-tile data transport, the on-chip networks are not only register-mapped but also integrated directly into the bypass paths of the processor pipeline. This allows a tile's computational result to reach the network without having to go through the tile registers.
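Since every hop crosses one registered tile boundary per cycle, the wire-delay cost of a transfer is just the Manhattan distance between the two tiles; the sketch below (with zero-indexed tile coordinates) reproduces the corner-to-corner figure.

    def hop_latency(src, dst):
        # Cycles of network wire delay between two tiles on the Raw mesh:
        # one cycle per hop, and hops = Manhattan distance.
        (x1, y1), (x2, y2) = src, dst
        return abs(x1 - x2) + abs(y1 - y2)

    # Corner to corner on the 4x4 array: (0,0) -> (3,3) is six hops,
    # i.e. six cycles of wire delay.
    print(hop_latency((0, 0), (3, 3)))  # 6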

6.2.2 Static Network

The static network is used to route values between tiles in a predefined manner. It provides ordered, flow-controlled, and reliable transfer of single-word operands and data streams between the functional units of the tiles [4]. In-order delivery is used to make sure that the correct data are sent; in the case of unpredictable events such as cache misses and interrupts, flow control of operands keeps the overall communication in the correct order.

The static router is a five-stage pipeline that controls two routing crossbars, or two physical networks. Each static router contains a 64 KB software-managed instruction cache. At compile time, the compiler generates the routing instructions for the static router; these are loaded and cached on-chip at system start. Each instruction is 64 bits, encoding a small command and several pieces of routing information, one for each crossbar output. Different tiles can carry simultaneous communication patterns to support parallel computation.

For each word sent between tiles on the static network, there must exist a corresponding instruction in the instruction memory of each router on the word's path. This guarantees flow control of the communication pattern between tiles, making sure that the correct data passes through. As well, the static router only proceeds to the next instruction after all of the routes in the current instruction are completed. This ensures that destination tiles receive incoming data in a known order, even when tiles suffer branch mispredicts, cache misses, interrupts, or other unpredictable events [4].

The identification of parallel instructions, the distribution of memory space, and the routing instructions are all determined at compile time [4]. The static router in a tile therefore has knowledge of future incoming data. The major advantage is that the router can set up a route even before the data arrive; upon arrival, the data find the routing path already set up and ready. The routing preparation is effectively masked out by the predetermined static communication pattern.

6.2.3 Dynamic Network

While the static network provides low latency, it is built upon predetermined knowledge at compile time. Any information that cannot be resolved at compile time, and any unpredictable events at runtime, are beyond the ability of the static network. The dynamic networks in Raw exist to support everything a static router cannot handle, such as cache misses, interrupts, dynamic messages, and other asynchronous events.

The dynamic network is built using a pair of dimension-ordered, wormhole-routed routers. A message on the dynamic network begins with a header that specifies the destination tile, a user field, and the length of the message; up to 31 data words can be sent in each message. The message worms its way through the network to the destination tile. Dynamic routes have more latency than static ones because a dynamic route requires the arrival and decoding of header information from the incoming data [4].

The Raw processor supports external device I/O interrupts, and each tile is able to process an interrupt independently of the others. Upon an interrupt signal, a tile stops its current operation and services the interrupt; in the meantime it queries the interrupt controller for the cause of the interrupt and contacts the appropriate device or DRAM [4].

6.2.4 I/O

On the boundary of the system, the tile interconnects are multiplexed and mapped onto the pins of the chip, as shown in Figure 6.3. Raw uses a 1657-pin ceramic-column grid-array package. There are fourteen full-duplex, 32-bit, 7.5 Gbps I/O ports running at 225 MHz, giving a total of 25 Gbytes/s of bandwidth [4]. The interface can be connected to DRAM or other external devices. Beyond transferring data directly to the tiles, the Raw system can also serve as a router connecting external devices together.

6.3 Software

The most important part of the Raw system is not the hardware structure but the software. The hardware architecture allows an arbitrary number of Raw tiles in the system, enabling the platform to gain scalable processing power as technology improvements increase the number of Raw tiles on a chip. Rawcc, the Raw parallelizing compiler, extracts fine-grained instruction-level parallelism from sequential programs and distributes it across the Raw tiles. This is a very complicated task, since it must consider not just parallelism but also data dependencies and dynamic events.

Figure 6.3: I/O Interface in the Raw system [4]

Maps is the part of the compiler that manages the distributed memory system. After breaking down the code sequence, Maps attempts to ensure that the memory references made by an instruction reside on the same tile. This maximizes local memory access on a per-tile basis and thus minimizes communication between tiles. The memory address space is distributed across the tiles in a balanced manner to enable memory parallelism. Many techniques are involved in the compilation process, such as static promotion, equivalence class unification, modulo unrolling, and other advanced techniques; for more information about the role of the compiler, readers may refer to [27, 28].

The complexity of the compiler comes not from instruction or memory distribution, but from weighing the tradeoffs between locality, parallelism, and communication cost. These decisions are the key determinant of the performance achievable by the Raw system.

6.4 Evaluation Model

6.4.1 Performance Model

A Raw system consists of an arbitrary number of Raw tiles, and each Raw tile can be treated as a general-purpose RISC processor. There are two common schemes for measuring the performance of a microprocessor. The first is cycles per instruction (CPI) [12] (or instructions per cycle (IPC), the reciprocal of CPI). As the name suggests, the measurement is based on the average number of clock cycles required to complete an instruction. Because each instruction in the ISA has a different execution time, the average CPI for a processor is calculated from the occurrence frequency of each instruction. Using Equation 3.2 from Chapter 3:

average\ CPI = \sum_{i=1}^{N_{instructions\ in\ ISA}} ( execution\ time_i \cdot freq_i )

where i indexes the instructions in the ISA and freq_i is the frequency of occurrence of instruction i.

Inside the Raw processor there are 16 Raw tiles executing in parallel. On top of the average CPI, there is latency due to communication between tiles over both the static and the dynamic networks; since this adds to overall computation time, the value of CPI increases. A much easier way of measuring the average CPI of a tile is to count the total number of execution cycles and the total number of instructions executed on that tile and divide the two. The per-tile CPI gives valuable information about each tile's processing. Dividing the total system execution time by the total number of instructions executed on all tiles gives an overall system CPI as well, so CPI can serve as a good indicator at both the tile and the system level. The ratio between tile CPI and system CPI is the speedup factor, an indication of how well the Raw system benefits from its parallel processing units on a given application.

However, a CPI measurement is only meaningful when comparing similar processors, since the instructions in different ISAs are completely different: RISC instructions are simpler, while CISC instructions are complex but get more work done per instruction. Achievable clock frequencies also vary greatly between processor designs.
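The two measurements just described reduce to a weighted sum and a ratio, as the sketch below shows; the instruction mix and cycle counts are hypothetical, chosen only to make the arithmetic concrete.

    def average_cpi(mix):
        # Equation 3.2: weighted CPI over the ISA, where mix is a list
        # of (cycles_for_instruction, occurrence_frequency) pairs.
        return sum(cycles * freq for cycles, freq in mix)

    def speedup(one_tile_cycles, all_tile_cycles):
        # Ratio of single-tile to whole-system cycle counts: how much
        # the application benefits from the parallel tiles.
        return one_tile_cycles / all_tile_cycles

    # Assumed mix: 80% 1-cycle ALU ops, 15% 3-cycle memory ops,
    # 5% 5-cycle floating-point ops:
    print(average_cpi([(1, 0.80), (3, 0.15), (5, 0.05)]))  # 1.5
    # A run needing ~130M cycles on one tile but 14.5M on sixteen tiles
    # is roughly the 9x speedup of the Swim row in Table 6.1 below:
    print(round(speedup(130.5e6, 14.5e6), 1))  # 9.0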

To compare different microprocessor families, say the Raw processor against an Intel Pentium processor, the benchmark-program approach is very commonly used nowadays. Table 6.1 lists the number of execution clock cycles for Raw on selected benchmark programs. It includes results from both single-tile and sixteen-tile processing, and the speedup achieved by multiple-tile execution is also listed.

Benchmark       Source        Cycles on Raw (1 Tile)   Cycles on Raw (16 Tiles)   Speedup
Swim            Spec          ...M                     14.5M                      9.0
Tomcatv         Nasa7:Spec    ...M                     2.05M                      8.2
Btrix           Nasa7:Spec    ...M                     516K                       33.4
Cholesky        Nasa7:Spec    ...M                     3.09M                      10.3
Mxm             Nasa7:Spec    ...M                     247K                       8.3
Vpenta          Nasa7:Spec    ...M                     272K                       41.8
Fpppp-kernel    Spec          ...M                     169K                       6.9

Table 6.1: Execution time of selected benchmark programs on the Raw system [29]

From Table 6.1, the achieved speedup varies from 6.9 to 41.8. Replicating the same processing unit X times does not guarantee an X-fold performance speedup. Some of the factors that affect the speedup achievable by Raw are:

1. Tile Parallelism: in the ideal situation, X tiles run totally independently of each other to achieve an X-fold performance speedup. Any parallelism identified across tiles contributes positively to the performance gain.

2. Distributed Memory: the distributed memory system supports parallel tile execution, letting tiles maximize local memory accesses and so minimize inter-tile communication. The memory bottleneck problem is also reduced: a centralized memory system can handle only a limited number of memory accesses at once, whereas the distributed memory system can serve multiple tiles at the same time.

3. Communication Overhead: communication between tiles adds latency to tile execution. The Raw system suffers minimal interruption from static communication, but dynamic events are unavoidable and can generate a lot of network latency.

4. Synchronization Overhead: a tile may be interrupted in its processing by dynamic events. Worse, any tile attempting to communicate statically with an interrupted tile must first re-synchronize the static communication pattern between them; this latency is the synchronization overhead.

For example, the benchmark applications Btrix and Vpenta gain more than thirty-fold improvements in performance, with speedups of 33.4 and 41.8 respectively. Both are dense-matrix scientific applications that benefit the most from tile parallelism. In addition, the extra memory contributed by the extra tiles serves as a greater cache for processing, producing these extreme speedups.

6.4.2 Power Model

Let K denote the unit power consumption per word in the following power equations.

Configuration Loading

The MIT Raw system is a general-purpose-processor approach to reconfigurable computing with enhanced parallelism: execution is driven by compiled instructions, so loading the configuration simply means writing the instructions into the on-chip instruction caches. Equation 6.1 models the power for loading instructions on-chip:

P_{IMEM\ loading} = \sum_{i=1}^{N_{tile}} \sum_{j=1}^{N_{instruction\ tile_i}} (K_{memory\ ctrl} + K_{IO\ pad} + K_{IMEM\ write}) \quad (6.1)

At full utilization there are 16 tiles, each with 32 KB of instruction cache at 32 bits per instruction, meaning 8K instructions can be stored per tile. In addition to processor instructions, Raw also requires switch instructions to be loaded on chip, which is modelled by Equation 6.2:

P_{SMEM\ loading} = \sum_{i=1}^{N_{tile}} \sum_{j=1}^{N_{sw\ instruction\ tile_i}} (K_{memory\ ctrl} + K_{IO\ pad} + K_{SMEM\ write}) \quad (6.2)

Each switch instruction is 64 bits wide; with 64 KB of switch-instruction storage, 8K static router instructions can be stored on-chip per tile.

Initial Data Loading

Along with the instructions, application data are also downloaded and stored on-chip for faster memory access. Data are distributed among the sixteen tiles, as modelled in Equation 6.3. There are 32 KB of data cache per tile with a 32-bit word size, meaning 8K words can be stored per tile.

P_{DMEM\ loading} = \sum_{i=1}^{N_{tile}} \sum_{j=1}^{N_{data\ tile_i}} (K_{memory\ ctrl} + K_{IO\ pad} + K_{DMEM\ write}) \quad (6.3)

Execution

Each Raw tile is a simple single-issue RISC processor. Processing begins with an instruction fetch from the instruction cache; the ALU, with the aid of the registers, then performs computation according to the instruction. There are three main classes of Raw instructions: integer, floating point, and memory. Integer and floating-point instructions are both handled by execution units within the tile, while memory-related instructions interact with the data cache.

In Equation 6.4, there is power consumption for accessing the instruction memory on every instruction fetched. Arithmetic, branch and jump, and comparison instructions are all classified as computational instructions, since they all go through the tile's execution units; they are treated as one instruction type here for simplicity, and can be decomposed into specific instructions if more accuracy and detail are desired. The other class is data-memory-related instructions, namely memory reads and memory writes. The power consumed by these two instruction classes depends on their proportion of the total instructions fetched.

P_{RAW\ exe} = \sum_{i=1}^{N_{tile}} ( K_{IMEM\ access} + K_{exe\ units} \cdot freq_{computational\ instructions} + K_{DMEM\ access} \cdot freq_{memory\ instructions} ) \cdot N_{total\ instructions\ fetched\ tile_i} \quad (6.4)
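A small sketch of Equation 6.4's per-tile accounting; all the K coefficients (treated here as per-access unit costs) and the instruction mix are assumed values for illustration.

    def raw_exe_power(tiles, k_imem, k_exe, k_dmem):
        # Equation 6.4: every fetched instruction pays an instruction-
        # memory access; execution-unit and data-cache costs are weighted
        # by the computational and memory instruction fractions.
        # tiles is a list of (n_fetched, f_computational, f_memory).
        return sum((k_imem + k_exe * f_comp + k_dmem * f_mem) * n
                   for n, f_comp, f_mem in tiles)

    # One tile fetching 1e6 instructions, 75% computational and 25%
    # memory-related, with assumed unit costs:
    print(raw_exe_power([(1e6, 0.75, 0.25)], 2e-10, 1e-10, 3e-10))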

The switch network is responsible for communication between tiles, with the static accesses predetermined at compile time. A switch operation includes accessing the switch memory for the switch configuration, setting up the crossbar, and connecting the appropriate tiles together. The result is Equation 6.5:

P_{RAW\ static\ network} = \sum_{i=1}^{N_{tile}} ( (K_{SMEM\ access} + K_{switch\ setup} + K_{switch\ data\ transfer}) \cdot N_{total\ sw\ instructions\ tile_i} ) \quad (6.5)

The dynamic network is not modelled here, since it operates in an unpredictable manner: the frequency of occurrence, the duration, and the amount of resulting latency are all very uncertain.

Reconfiguration Overhead

Cache misses occur when the on-chip memory is not large enough to hold the compiled application, whether its instructions, application data, or switch instructions. In that case external memory is accessed to download the required data onto the chip. The data path involved is similar to that at initialization: a data request goes out from the Raw system, data are returned from external memory via the I/O pads and memory controller, and are finally written into a tile's memory. The process repeats until the necessary data are obtained or the on-chip memory can hold no more. Equations 6.6 to 6.8 model this situation for the instruction memory, the switch instruction memory, and the application data memory in case of cache misses.

P_{IMEM\ overhead\ loading} = \sum_{i=1}^{N_{tile}} ( (K_{memory\ ctrl} + K_{IO\ pad} + K_{IMEM\ write}) \cdot N_{instruction\ overhead\ tile_i} ) \quad (6.6)

P_{SMEM\ overhead\ loading} = \sum_{i=1}^{N_{tile}} ( (K_{memory\ ctrl} + K_{IO\ pad} + K_{SMEM\ write}) \cdot N_{sw\ instruction\ overhead\ tile_i} ) \quad (6.7)

P_{DMEM\ overhead\ loading} = \sum_{i=1}^{N_{tile}} ( (K_{memory\ ctrl} + K_{IO\ pad} + K_{DMEM\ write}) \cdot N_{data\ overhead\ tile_i} ) \quad (6.8)

Statistics

Table 6.2 summarizes the average power consumption of the Raw system over different benchmark programs; the core of a Raw system consumes an average of 18.2 W. Percentage data are extracted from [30].

Component         Power Consumption (W)   Percentage
Clock             -                       -
Computation       -                       33.80%
Static Network    -                       17.33%
Dynamic Network   -                       -
IO                2.8                     -
Total             18.2                    100%

Table 6.2: Average Power Breakdown of the Raw Processor

Most power is dissipated within the tiles for computation, up to 33.80% of the total. It is interesting that the 17.33% average power consumption of the static network is only a few percent higher than that of the dynamic network. Static communications far outnumber dynamic ones, yet the two networks show similar power consumption; this demonstrates the efficiency of the static network in contrast to the power inefficiency of dynamic events.

Table 6.3 is a breakdown of the current consumption of the Raw tiles while executing a highly parallel application, with statistics extracted from [30]; these values correspond to the Computation entry of Table 6.2. The instruction cache is the dominant power consumer in a tile's operation, while the power drawn by the static instruction cache can almost be ignored in this case.

Component                  Current (A)   Percentage
Instruction cache          -             -
Data cache                 -             -
Static instruction cache   -             -
Computational units        -             -
Total                      -             100%

Table 6.3: Current Breakdown of Raw Tiles for a highly parallel application

To apply the power model to the collected statistics, the following assumptions are made:

1. Memory consumes power on its read and write operations. For simplicity, the power consumed by a read and by a write are assumed to be equal.

2. The large on-chip memory can hold a sizeable application. The benchmark programs used to obtain the above statistics are relatively small, so it is assumed that the on-chip memory contains the whole application without any cache misses, and thus without reconfiguration overhead.

For initial configuration loading, instructions and static router instructions are loaded from external memory into on-chip memory; P_{cfg\ loading} accounts for the average power consumed over this process. By assumption 2, each entry in the instruction memory and the static router instruction memory is written and read only once, so half of the memory power is consumed in the initial storing of the configurations. Within a tile, the instruction and static router instruction memories add up to 67.5% of the power dissipation; halving this for write accesses gives 33.75%. Applying this percentage to the computation power in Table 6.2 yields 2.40 W. Again by assumption 2, there is no I/O activity other than memory initialization. Among the three types of memory, the instruction and static router instruction memories account for 95.8% of the activity; applying this ratio to the total I/O power of 2.8 W gives 2.68 W. Summing the I/O power and the memory write power produces P_{cfg\ loading} = 5.08 W.

Using exactly the same approach produces the value of P_{DMEM\ loading} for the initial loading of application data onto the chip: only 0.42 W is spent. This result reveals that data accesses are not very frequent in typical Raw programs.

Beyond initialization, all remaining power is consumed by execution of the system, including the processing units, the static network, and the dynamic network. The result is P_{exe} = 15.5 W; about 73.8% of system power is spent in the execution stage.

6.4.3 Area Model

The primitive processing unit of the Raw system is the Raw tile, so the area projection for a Raw system is simply the summation over all available Raw tiles:

A_{Raw\ system} = \sum_{i=1}^{N_{tile}} A_{Raw\ tile}

Usually the tiles come in an even number and are organized in a well-balanced placement; adding a single extra tile to the existing 4 x 4 placement would create an odd shape and is very unlikely. Adapting the area model developed in Chapter 3, overhead is added to the area equation to give:

A_{Raw\ system} = ( \sum_{i=1}^{N_{tile}} A_{Raw\ tile} ) \cdot (1 + \alpha_{Raw\ overhead}) \quad (6.9)

For the Raw system, the array of 4 x 4 Raw tiles adds up to 16 mm x 16 mm, or an area of 256 mm^2. However, the total chip size grows to 18.2 mm x 18.2 mm, or 331 mm^2, to include the pins [30]. This is a 29.29% increase in area, i.e. an overhead of 29.29%.
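The overhead fraction quoted above can be checked directly from the published dimensions:

    core_area = 16.0 * 16.0   # 4 x 4 tiles, 16 mm x 16 mm -> 256 mm^2
    chip_area = 331.0         # 18.2 mm x 18.2 mm including the pins [30]
    alpha = chip_area / core_area - 1.0
    print(round(alpha, 4))    # 0.293, the ~29.3% overhead of Equation 6.9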

Computational Area

Figure 6.4 is a floor plan of a Raw tile, and Table 6.4 summarizes the area of the different components within a Raw tile based on Figure 6.4. Within a tile, computational units take up only 12.47% of the area, 41.93% goes to storage, and the remainder serves control and the communication network. A Raw processor includes sixteen tiles; the execution units of all sixteen tiles total 31.92 mm^2, only 9.64% of the total chip area of 331 mm^2 [30]. Overall, this is low efficiency in computational area.

Memory Area

Among the storage elements, the registers within a tile are insignificant in area. The three types of memory (instruction, data, and switch router instruction) occupy 33.06% of the total chip area. The large area usage comes from the fact that Raw requires storage for several types of memory, producing a large overhead.

Figure 6.4: Raw Tile Floor Plan [4]

While a large portion of the total chip area is occupied by memories, they provide 2 MB of on-chip memory to the system. The large amount of memory pays off well: it allows computational and static router instructions to be fetched in a single clock cycle, and application data can be read and written on chip quickly, reducing potential data cache misses. Fast-access memory is the key support for the pre-scheduling algorithm of the static communication network.

If there is constant execution interruption due to cache misses or long-latency memory accesses, the static network cannot provide any gain even though it has prior knowledge of the communication pattern.

Component                           Area (mm^2)   Proportion of Total Tile Area
Die                                 16.00         100%
Execution Units
  ALU (simple)                      -             -
  ALU (medium)                      -             -
  ALU (critical)                    -             -
  Integer Multiply                  -             -
  Floating Point Unit               -             -
  Total Execution Units             1.995         12.47%
Storage
  Register                          -             -
  Instruction cache                 -             -
  Data cache                        -             -
  Static-router instruction cache   -             -
  Total Storage                     -             41.93%

Table 6.4: Area breakdown of a Raw Tile

6.4.4 Others

I/O

The 25 Gbyte/s I/O bandwidth provided by Raw is very powerful, in fact over-provisioned. The high-bandwidth I/O allows external memory accesses to complete very quickly. Not limited to communication with Raw itself, a Raw system can also be used as a router, connecting multiple external devices with its I/O strength.

Scalability

The primitive processing unit of a Raw processor is the Raw tile, and sixteen tiles make up one Raw processor, or chip. Multiple chips can be linked to provide even greater processing power: the design has been tested to support glueless connection of up to 64 Raw chips in any rectangular mesh pattern, creating a virtual Raw system with up to 1024 Raw tiles [4]. Theoretically the scalability of the Raw system is unlimited; in practice, the performance speedup will exhibit diminishing returns due to longer communication paths. A Raw system with 1024 tiles is already a satisfactory result, and further research is needed to test the true limit of the Raw design.

Adaptability

Raw compiles an application into software instructions and executes them like a general-purpose processor, which provides great flexibility to programmers. Together with the large amount of on-chip memory, the tremendous I/O bandwidth, and the highly scalable architecture, a Raw system can adapt to essentially any type of application. This has been the design goal of Raw: to serve as a replacement for the general-purpose processor. Raw can outperform conventional processors while offering the same design flexibility, with greater potential processing power and scalability.

6.5 Summary

In summary, Raw's approach is to provide a very flexible architecture with great potential that can adapt to any type of application. It is well suited as a replacement for general-purpose processors in a computing environment. While as software-flexible as a general-purpose processor, the Raw architecture is also scalable in nature, providing more processing power. The drawback of this great flexibility is that it generates a corresponding amount of overhead: processing overhead, area overhead, and power overhead. Many applications may not require the large on-chip memory or the great I/O bandwidth, but these features must remain to ensure the system stays adaptable to all types of applications.

Chapter 7

CMU PipeRench System

7.1 Motivation and Methodology

The general idea of reconfigurable computing systems is to combine the flexibility of general-purpose processors with the efficiency of customized hardware. While this is the fundamental idea, different researchers take different views and approaches in their designs. Researchers at Carnegie Mellon University (CMU) believe there are two problems with the static nature of configuration. First, a compiled application may require more hardware than is available. Second, a compiled hardware design may not be able to take advantage of the additional resources of future process technology without recompilation [2].

To handle these two problems, an approach analogous to software compilation for general-purpose processors is taken. The key to the general-purpose processor platform is the predefined Instruction Set Architecture (ISA): any application can be compiled into a sequence of events in the form of instructions, and a fixed microprocessor is built that can handle any instruction in the ISA. The benefit of this approach is that, with a clear standard, the compilation process is separated from the hardware. Optimizations made to a compiled application require no change to the hardware, and improvements in the processor require no software recompilation. Another advantage is that an application of theoretically any size can be compiled and executed on this platform; the application's size affects the execution time, but execution remains feasible.

By analogy, PipeRench separates the compilation of an application from the available hardware

resources to gain the flexibility of a general-purpose processor. This makes the compilation algorithm simpler, because hardware resources need not be considered during the process. Forward compatibility is also achieved: additional hardware resources can be added to increase processing power without recompiling the code, since software and hardware are separated. The concept of Hardware Virtualization is what makes this possible [2, 31, 32, 33].

Execution in the PipeRench system is based on a technique called Pipeline Reconfiguration [34, 2], which executes a large logical configuration on a small set of computation units by continuously reconfiguring the hardware. The process involves breaking a pipelined application into configuration pieces that correspond to the application's pipeline stages. These configurations are then loaded, one per cycle, onto the interconnected network of programmable processing elements (PEs). In PipeRench terminology, pipeline stages are called stripes: the pipeline stages available in the hardware are referred to as physical stripes, while the pipeline stages the application is compiled into are referred to as virtual stripes. The compiler-generated virtual stripes are mapped to the physical stripes at runtime. If there are more virtual stripes than physical ones, the system overwrites already-used physical stripes in a round-robin manner. To better illustrate the idea of hardware virtualization, an example is shown below in Figure 7.1 and Figure 7.2.

Figure 7.1: Example of Hardware Virtualization - Virtual pipeline stages [2]

Figure 7.2: Example of Hardware Virtualization - Physical pipeline stages [2]

In this example, an application has a five-stage virtual pipeline while only three physical stripes are available. Pipeline reconfiguration makes execution possible even though the hardware at first glance appears insufficient. In clock cycle 1, the system configures virtual pipe stage 1 into physical stripe 1, ready for execution in the next clock cycle. Virtual pipe stages 2 and 3 are loaded into hardware over the next two clock cycles. In clock cycle 4, due to insufficient resources, physical stripe 1 is reconfigured as virtual stage 4 so that the pipelined application can continue. After virtual stage 5 is loaded in clock cycle 5, the system loads virtual stage 1 back in for execution. Inputs are consumed while virtual stage 1 is executing, and outputs are produced by the last virtual stage, virtual stage 5. In this example, cycles 2, 3, 7, 8, ... consume inputs, and cycles 6, 7, 11, 12, ... generate outputs: two outputs are generated every five clock cycles.

The nature of this design allows large applications to run on small systems, and virtualization occurs less frequently when more resources are added, resulting in better performance. In the previous example, if one more physical stripe is added, there are four physical stripes for the five-virtual-stage application. Reconfiguration then begins in clock cycle 5 instead of clock cycle 4, and each virtual pipe stage can run for three clock cycles before replacement. Three outputs can therefore be generated every five clock cycles, a significant gain in performance. The system continues to benefit from additional resources until hardware virtualization is no longer required, that is, until the number of physical stripes equals the number of virtual stripes.
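The round-robin schedule described above is simple enough to simulate. The sketch below is a deliberately rough model: it only tracks which virtual stage occupies which physical stripe and assumes a stage executes from the cycle after it is configured until it is overwritten, yet it reproduces the output rates of the example.

    def outputs_in(cycles, n_virtual, n_physical):
        # Each cycle the next virtual stage (round robin) is written into
        # a physical stripe, evicting the oldest occupant; the last
        # virtual stage emits one output per cycle it spends executing.
        resident = {}                    # physical stripe -> virtual stage
        outputs, nxt, last = 0, 0, n_virtual - 1
        for cycle in range(cycles):
            resident[cycle % n_physical] = nxt
            configured_now = nxt
            nxt = (nxt + 1) % n_virtual
            if last in resident.values() and configured_now != last:
                outputs += 1
        return outputs

    # Five virtual stages on three physical stripes yield 2 outputs per
    # 5-cycle period once the pipeline fills; a fourth stripe gives 3.
    print(outputs_in(50, 5, 3))  # 18 -> 2 per period after the fill
    print(outputs_in(50, 5, 4))  # 27 -> 3 per period after the fill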

7.2 Architecture

7.2.1 Switch Fabric

The key components of the PipeRench architecture are the reconfigurable physical stripes. The physical stripes must have the following characteristics:

1. they allow reconfiguration at runtime;

2. reconfiguration is done in a short period of time, ideally a single clock cycle;

3. the physical stripe design is flexible enough to adapt to different operations.

The number of physical stripes in a PipeRench system is expandable by nature; the current PipeRench prototype has sixteen physical stripes for processing. All the physical stripes together are referred to as the Switch Fabric. Each physical stripe is composed of a set of Processing Elements (PEs), the fundamental processing units of the system. Each stripe has interconnects used to route values to the next stripe and to other PEs in the same stripe. The PEs and interconnects are programmable and reconfigurable to meet the needs of any operation. Figure 7.3 is an architectural view of the switch fabric, and Figure 7.4 is a detailed view of a PE.

As a fundamental processing unit, each PE is eight bits wide with both logic and registers inside. The functional unit within a PE is capable of all possible bit-wise functions on two operands, as well as addition, subtraction, and multiplexing. It is built using eight 3-input lookup tables (3-LUTs) with carry-chain support. Special-purpose interconnects combine adjacent PEs to perform operations more than eight bits in width, and shifters connect adjacent PEs to allow efficient multi-PE shift operations [3].

Each PE contains a register file with eight registers. As shown in Figure 7.4, each register in the register file has two possible input sources: the output of the functional unit, or the value of the corresponding register in the previous stripe. Register R0 is special: if a stripe is the first in the pipeline, R0 is used to accept input from the global data bus. For the same reason,

Figure 7.3: Architectural View of Switch Fabric in PipeRench [2]

R0 is likewise used to output data to the global data bus if the stripe is the last in the pipeline [3]. Another use of R0 is the State Store and State Restore feature explained in a later section.

Configuration or reconfiguration means changing the configuration bits used to select the inputs of all multiplexors and shifters within a PE. With 42 configuration bits, the functionality of a PE can be fully specified [3]. The current PipeRench prototype has 16 PEs per physical stripe and 16 physical stripes per fabric, which means 672 configuration bits for each stripe and 10,752 bits for the whole fabric.

7.2.2 Configuration Controller

To effectively manage the process of hardware virtualization, a configuration controller is essential. Figure 7.5 is an architectural overview of a PipeRench system, showing how the configuration controller is connected to the other components. The configuration controller is a finite state machine that controls the operation of the system. The

Figure 7.4: Block Diagram of a Processing Element in PipeRench [3]

configuration controller is responsible for the following tasks [31]: interfacing between the host and the fabric, mapping and scheduling configuration bits onto the fabric, managing the on-chip configuration memory, and iteration counting.

Interfacing

PipeRench is designed as a coprocessor to a host machine, such as a general-purpose processor, and the host must trigger PipeRench to start its operation. The host initiates PipeRench by specifying the memory address of an application's configuration words, the number of iterations to execute, and the memory addresses for data input and output [31]. Upon completion of a task, the host can initiate a new operation.

Figure 7.5: Architectural Overview of a PipeRench system [31]

Mapping and Scheduling

Mapping refers to the task of loading configuration bits from main memory into the on-chip configuration memory and the switch fabric. If the number of available physical stripes is enough to hold an application, each stripe is configured only once; no replacement, and hence no virtualization, is required. This is the simple case, in which mapping needs to be done only once per pipe stage. When an application's size exceeds the available hardware, hardware virtualization is required and stripes must be swapped during the process. The controller's role is to schedule the replacement of stripes at the appropriate times. If there are p physical stripes available, a virtual stripe can remain active in the fabric for at most p-1 cycles before it is swapped out. A virtual stripe is swapped into the fabric again for future iteration runs.

Managing

Effective management of the on-chip configuration memory helps increase the efficiency of the system. If an application, or multiple applications, include common configuration words, only

one copy needs to be saved. This reduces loading accesses from main memory and saves on-chip storage space. Each virtual stripe in an application includes a next-address field indicating the location of the next virtual stripe, which allows the entire application to be identified from its first virtual stripe. When the application is loaded into on-chip memory, the next-address field is translated to the corresponding on-chip memory space; a record of this translation is maintained in a fully associative on-chip Stripe Address Translation Table (SATT) [31].

Iteration Count

During execution, the controller is responsible for keeping track of the number of iterations performed. This is done simply by counting the number of active cycles of the first virtual stripe. The last stripe is also tracked, so that the system can tell when all iterations have ended.

7.2.3 Data Management

The challenge of data management in PipeRench arises from the distributed execution required for hardware virtualization: the swapping of virtual stripes makes timing more difficult. To handle this, there are four data controllers, shown in Figure 7.5, covering the two types of data management during execution: Data Input/Output and State Store/Restore [31].

Data Input/Output

After the fabric is configured to perform a pipelined application, input data is required from external memory. The data controllers communicate with the memory bus controller via address and control logic. The configuration word for the first stripe includes a Read flag; upon seeing the Read flag, one of the data controllers generates the proper read address and control signals for the memory controller, and the external memory processes the request and returns input data to the pipeline. By the same analogy, the last stripe contains a Write flag and provides output data: a data controller generates the write address and control signals and sends them to the memory controller for writing to external memory.

7.2.3 Data Management

The challenge of data management in PipeRench arises from the distributed execution required by hardware virtualization: the swapping of virtual stripes makes timing more difficult. To handle this task there are four data controllers, shown in Figure 7.5, covering the two types of data management during execution: data input/output and state store/restore [31].

Data Input/Output

After the fabric is configured to perform a pipelined application, input data is required from external memory. The data controllers communicate with the memory bus controller via address and control logic. The configuration word for the first stripe includes a Read flag; when it is encountered, one of the data controllers generates the proper read address and control signals for the memory controller. External memory processes the request and returns input data to the pipeline. Analogously, the last stripe contains a Write flag and provides output data: a data controller generates the write address and control signals and sends them to the memory controller for writing to external memory.

State Store/Restore

One potential problem with hardware virtualization is the storing and restoring of state information. A pipe stage can be a function of itself; in other words, its computation requires the result of the same pipe stage from the previous clock cycle. In PipeRench this is handled by storing computational results in registers, which are fed back into the functional unit on the next clock cycle. With hardware virtualization, this state information would be lost during stripe reconfiguration. The solution is to make a copy of the state information before the reconfiguration takes place. Among the four data controllers, two have the extra capability to handle state store and restore. If a stripe is to be swapped out and the Store flag in its configuration word is set, the data store controller saves a copy of the value in register R0 to the state memory. To keep track of the source of the state information inside the state memory, the swapped stripe's configuration memory address is written into the Address Translation Table (ATT). When the stripe is loaded back into the fabric, its Restore flag triggers the data restore controller: the ATT is accessed to locate the stripe's previous state, and the state information is written into the fabric along with the reconfiguration bits [31].
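A minimal sketch of this store/restore bookkeeping follows; the Python structures, flag names, and fields are illustrative assumptions, not taken from [31].

```python
def swap_out(stripe, state_mem, att):
    """On eviction, save R0 if the Store flag is set, and record where it
    went in the ATT, keyed by the stripe's configuration memory address."""
    if stripe['store']:
        att[stripe['cfg_addr']] = len(state_mem)
        state_mem.append(stripe['r0'])

def swap_in(stripe, state_mem, att):
    """On reload, restore R0 from state memory if the Restore flag is set."""
    if stripe['restore'] and stripe['cfg_addr'] in att:
        stripe['r0'] = state_mem[att[stripe['cfg_addr']]]

stage = {'cfg_addr': 7, 'store': True, 'restore': True, 'r0': 42}
state_mem, att = [], {}
swap_out(stage, state_mem, att)
stage['r0'] = None                 # value lost during stripe reconfiguration
swap_in(stage, state_mem, att)
print(stage['r0'])                 # 42, recovered via the ATT
```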

7.2.4 Hardware Interface

PipeRench's input and output interfaces consist of a 32-bit data bus plus a valid bit and a full bit, a design compatible with most FIFO memory devices. All communication is done as a 32-bit data stream organized into packets. A header within each packet identifies its destination and function. A packet may contain the configuration bits that specify a PipeRench design, or execution data to be passed between PipeRench and external devices.

7.2.5 On-Chip Memory

The configuration data is stored in 22 SRAMs, each holding 256 32-bit words, so up to 256 virtual stripes can be stored for execution. Since there are 16 physical stripes available in the fabric, this implementation can virtualize a design sixteen times the size of the physical hardware. There are four 256-word by 32-bit dual-port SRAMs used for state information storage in the fabric. As well, sixteen dual-port SRAMs (32 x 16 bits) are used to queue data between the interface and the fabric [3].

7.3 Software

PipeRench applications are written in a language called Dataflow Intermediate Language (DIL). It is a single-assignment language with C operators. Given the input program, the compiler unrolls all loops and generates a straight-line single-assignment program containing only native operators of the architecture. Place and route is performed last, mapping the operators onto the PipeRench stripe abstraction. The place-and-route algorithm is a deterministic, linear-time, greedy algorithm; the tradeoff is improved compilation speed at the cost of configuration size [2]. The output of the compiler is a set of configuration bits used to configure the physical stripes at run time. For more details about the PipeRench compiler and its compilation algorithm, readers can refer to [35, 33].

7.4 Specifications

The following are specifications of the current PipeRench prototype, extracted from reference [3]:

Built using 0.18 micron technology
Dimensions: 7.3mm x 7.6mm
Die area: 55.48mm²
Processing units: 256 PEs (16 PEs per stripe, 16 physical stripes)
Transistor count: 3.65 million
Clock frequency: 120MHz
Supply voltage: 1.8V for core, 3.3V for I/O

7.5 Evaluation Model

The proposed analytical model for performance, power, and area is applied to PipeRench in this section. The following variables are used in the model:

V - number of virtual stripes in an application
p - number of physical stripes available in hardware
N - number of PEs per stripe
B - size of the ALU (in bits) in each PE
X - number of iterations
C - (re)configuration clock cycles per stripe
T - execution time per cycle (clock period)
F - clock frequency of the system

The PipeRench model covers two main operational scenarios: non-virtualization mode when p = V, and virtualization mode when p < V. If p = V, there are enough physical stripes to implement the virtual pipeline design, which happens with small applications; since there is enough hardware, no stripe reconfiguration or replacement is needed during execution. A special case, p = 1 and p < V, is also included in the model and is explained along with the model's development.

7.5.1 Throughput

Throughput is a very important quantity when modelling performance, because it represents the amount of output PipeRench can generate per unit of time. It is a positive performance measure: maximizing throughput is one of the design goals.

Scenario 1: Non-Virtualization Case, p = V

\[
\text{Throughput} = \frac{\text{total amount of output}}{\text{total execution time}}
= \frac{\text{iterations} \times \text{output data width}}{\text{clock cycles} \times \text{clock period}}
= \frac{F \cdot X \cdot N \cdot B}{C \cdot V + X} \tag{7.1}
\]

The maximum achievable output size per clock cycle is the width of the last stripe, i.e. the number of PEs per stripe times the PE width (N · B).

Stripe configuration is required initially and needs to be done only once. After configuration, the pipeline proceeds to execution without interruption, and one output can be produced each cycle; thus the execution time equals the number of iterations in the above equation. This equation assumes an ideally pipelined application with perfectly identical pipe stages, such that each stripe is used at full utilization. In reality this is often not the case: a random application, or even a slight imbalance between stripes, can reduce stripe utilization and degrade performance. The variable α_utilization is added to the equation to represent the utilization factor of the virtual design, leading to Equation 7.2:

\[ \text{Throughput} = \frac{F \cdot X \cdot N \cdot B \cdot \alpha_{utilization}}{C \cdot V + X} \tag{7.2} \]

Note that certain operations may generate more output bits than input bits, such as a decoding operation. The utilization factor is not meant to account for the actual amount of output produced, but for the overall utilization of the system. If the number of iterations X is large, the initial configuration time C · V can be ignored, and Equation 7.2 simplifies to

\[ \text{Throughput} = F \cdot N \cdot B \cdot \alpha_{utilization} \tag{7.3} \]

Scenario 2: Virtualization Case, p < V

Using the same throughput equation from Chapter 3,

\[
\text{Throughput} = \frac{\text{iterations} \times \text{output data width} \times \alpha_{utilization}}{\text{initial configuration time} + \text{execution time} + \text{reconfiguration time}}
= \frac{F \cdot X \cdot N \cdot B \cdot \alpha_{utilization}}{C \cdot V + \frac{V}{p-1} \cdot C \cdot X} \tag{7.4}
\]

The key difference between Equation 7.2 and Equation 7.4 is the time spent on execution and reconfiguration after the initial configuration.

The rate of processing is proportional to the ratio of virtual stripes to available physical stripes. Among the p physical stripes, some are unavailable due to reconfiguration, leaving (p − C) available physical stripes. A design specification of PipeRench is that configuration and reconfiguration of a physical stripe must complete in a single clock cycle, therefore C = 1. Because each iteration must pass through all V pipe stages, the ratio is V/(p − 1). For example, with V = 5 and p = 3, V/(p − 1) = 2.5, meaning output is generated at an average rate of one per 2.5 cycles after initial configuration. Referring back to the example illustrated in Figure 7.2, outputs are generated at cycles 6, 7, 11, 12, and so on: two outputs every five cycles, an average output rate of 2.5 cycles. Again, if the number of iterations is large, the initial configuration time can be ignored, and Equation 7.4 simplifies to Equation 7.5:

\[ \text{Throughput} = \frac{(p-1) \cdot F \cdot N \cdot B \cdot \alpha_{utilization}}{V} \tag{7.5} \]

Scenario 3: p = 1 and p < V

\[ \text{Throughput} = \frac{F \cdot X \cdot N \cdot B}{V \cdot (C + X)} \tag{7.6} \]

Equation 7.6 covers a very special case: only one physical stripe is available while the number of virtual stripes is greater than one. The single hardware stripe alternates between reconfiguration and execution. This case will rarely occur in practice, since it is highly impractical due to its inefficiency and low achievable throughput. It is included in the proposed performance model for completeness, to cover all theoretical cases; moreover, if the model is extended to other applications, or used as an analogy for other models, a complete model is necessary.
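The three scenarios can be collected into one small Python helper. This is a sketch of the model, not code from the thesis; C defaults to one cycle per stripe as specified above, and the output width is folded into "samples" so the N · B term drops out, as in the FIR example that follows.

```python
def throughput(F, V, p, X, alpha=1.0, C=1):
    """Samples per second under the proposed model (Eqs. 7.1, 7.4, 7.6)."""
    if p >= V:                                  # Scenario 1: no virtualization
        cycles = C * V + X
    elif p == 1:                                # Scenario 3: single physical stripe
        cycles = V * (C + X)
    else:                                       # Scenario 2: virtualization
        cycles = C * V + V / (p - 1) * C * X
    return F * X * alpha / cycles

# For large X, Scenario 2 approaches (p-1) * F * alpha / V, i.e. Eq. 7.5:
print(throughput(F=120e6, V=43, p=16, X=10**7) / 1e6)   # ~41.86 MSPS at alpha = 1
```

With α_utilization = 1 this reproduces the asymptotic rate of Equation 7.5; the worked example below then solves the same equation for the utilization of a measured design.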

Example

To better illustrate the throughput equations, an example based on collected statistics is given below. At 120MHz, PipeRench executes a 40-tap 16-bit FIR filter application at 41.8 million samples per second (MSPS) [3]. The design uses about 43 virtual stripes, with 16 physical stripes available in the system. The variables are therefore F = 120MHz, V = 43, p = 16, and Throughput = 41.8 MSPS. Because throughput is given in samples rather than data size, the data-width term can be left out of Equation 7.5:

\[ \text{Throughput} = \frac{(p-1) \cdot F \cdot \alpha_{utilization}}{V} \]
\[ 41.8\ \text{MSPS} = \frac{(16-1) \times 120\ \text{MHz} \times \alpha_{utilization}}{43} \]
\[ \alpha_{utilization} \approx 0.9986 \]

From this example it can be seen that PipeRench implements the FIR filter application at an almost 100% utilization rate. This is a reasonable result, since a FIR filter is a well-balanced pipelined application whose operations are repeating taps.

Power

The proposed power model from Chapter 3 is now applied to the PipeRench system. Let K denote the unit power consumption of a component in the following power equations.

Configuration Loading

The host initiates operation of the PipeRench system by specifying the starting address of the configuration data. Each configuration word contains a next-address field pointing to the next configuration word in memory. Configurations read from memory pass through the chip I/O and the memory controller before being stored into on-chip configuration memory. This process is repeated for all configuration words of an application. During the process, the address translation table within the configuration controller updates the next-address field relative to the on-chip memory's address space. This process is modelled in Equation 7.7:

\[ P_{cfg\,loading} = (K_{mem\,ctrl} + K_{IO\,pad} + K_{ctrl} + K_{cfg\,mem\,write}) \cdot V \tag{7.7} \]

All I/O activity is done as a 32-bit data stream. The current prototype allows storage of 256 virtual stripes on chip. PipeRench has 132 pins in total, 72 of which are data pins.

Initial On-Chip Configuration

This phase deals with the initial mapping of configurations onto physical stripes. Because physical stripes begin unconfigured, this phase is necessary whether or not hardware virtualization will occur. The mapping process involves reading configuration data out of the configuration memory and configuring the physical stripes; the configuration controller schedules when and where to map the configurations.

\[ P_{fabric\,cfg} = (K_{cfg\,mem\,read} + K_{stripe\,cfg\,write} + K_{ctrl}) \cdot V \tag{7.8} \]

Execution

Non-Virtualization Mode: Execution in non-virtualization mode is simple, and so is the equation modelling it. After initial configuration, input data arrives at the first pipe stage and output data is generated from the last stage every clock cycle. All pipe stages of the application execute continuously without interruption.

\[ P_{no\,virt\,exe} = P_{stripe\,exe} \cdot V \tag{7.9} \]

Virtualization Mode: The configuration memory contains more entries than there are physical stripes in the hardware fabric: it can store up to 256 configuration entries, compared to the 16 available physical stripes. If virtualization is required, one physical stripe undergoes reconfiguration while the rest continue executing. Equation 7.10 captures only the execution power of the actively running physical stripes; it does not include power related to reconfiguration, which is discussed in a later section.

\[ P_{virt\,exe} = P_{stripe\,exe} \cdot (p-1) \tag{7.10} \]

The assumption that each stripe consumes the same amount of power is an idealization. A stripe can have a maximum of N active PEs, each B bits wide, but the number of active PEs within each stripe varies from application to application, which varies the power consumption.

Data Loading

Non-Virtualization Mode: Due to continuous operation, the pipeline can accept an input and generate an output every clock cycle, i.e. one data-input request and one data-output production per cycle. Two data controllers handle the two transfers. However, there is only one memory bus controller, able to serve one data request at a time; FIFOs buffer the requests and allow the memory bus controller to alternate between the operations.

\[ P_{no\,virt\,IO\,data} = 2 \cdot P_{mem\,ctrl} + P_{mem\,bus\,ctrl} \tag{7.11} \]

Virtualization Mode: I/O access actually decreases under hardware virtualization. With the swapping of stripes, the first and last stripes are sometimes out of the fabric, and data input and output operations are not needed at those moments. The amount of I/O data activity is therefore proportional to the fraction of time the first and last stripes are resident in hardware. This is the same factor as in the throughput model, (p − 1)/V, and it is what distinguishes Equation 7.12 from Equation 7.11.

\[ P_{virt\,IO\,data} = \frac{p-1}{V} \cdot (2 \cdot P_{mem\,ctrl} + P_{mem\,bus\,ctrl}) \tag{7.12} \]

Reconfiguration

To reconfigure a stripe, configuration data is loaded from the configuration memory and mapped onto that particular physical stripe; the controller schedules the reconfiguration. The single stripe under reconfiguration has the following power consumption:

\[ P_{stripe\,recfg} = P_{cfg\,mem\,read} + P_{stripe\,cfg\,write} + P_{ctrl} \tag{7.13} \]

When virtualization occurs, the Store and Restore flags within a configuration word come into play. They indicate whether state information for the corresponding stripe must be stored or restored during the virtualization process, which is needed when a pipe stage is a function of itself. If a stripe is swapped out with its Store flag set, its state information is saved: the values of the R0 registers in the stripe are copied into the state memory. A restore takes place when such a pipe stage is brought back into the fabric: the Restore flag within the configuration data tells the system that this pipe stage has previously stored state information, and the R0 values are loaded into the fabric along with the configuration data.

One physical stripe undergoes reconfiguration every clock cycle. The likelihood of a state store or restore depends on the number of self-dependent pipe stages in the design: if D of the V pipe stages require state store/restore, the factor is D/V. Store and restore happen in a balanced manner, since any state that is stored will be restored at a later time.

\[ P_{state\,store} = P_{stripe\,R0\,read} + P_{state\,mem\,write} \tag{7.14} \]

\[ P_{state\,restore} = P_{state\,mem\,read} + P_{stripe\,R0\,write} \tag{7.15} \]

\[ P_{recfg} = P_{stripe\,recfg} + \frac{D}{V} \cdot (P_{state\,store} + P_{state\,restore}) \tag{7.16} \]
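As a compact restatement of Equations 7.13 through 7.16, the following Python sketch computes the average reconfiguration power. The unit power terms are placeholders to be filled with measured or estimated values; the numbers in the example call are purely illustrative.

```python
def reconfig_power(P_cfg_read, P_cfg_write, P_ctrl,
                   P_r0_read, P_state_write, P_state_read, P_r0_write,
                   D, V):
    """Average reconfiguration power: one stripe is reconfigured per cycle,
    and a fraction D/V of stripes also store and restore R0 state."""
    P_stripe_recfg = P_cfg_read + P_cfg_write + P_ctrl        # Eq. 7.13
    P_store = P_r0_read + P_state_write                       # Eq. 7.14
    P_restore = P_state_read + P_r0_write                     # Eq. 7.15
    return P_stripe_recfg + D / V * (P_store + P_restore)     # Eq. 7.16

# Illustrative mW figures for a design where 1 in 4 stages is self-dependent:
print(reconfig_power(5, 40, 12, 2, 2, 2, 2, D=1, V=4))        # 59.0
```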

Statistics

Figure 7.6 shows the power consumption of a FIR filter application executed on the PipeRench system, measured for a variety of FIR filter sizes. The system runs in non-virtualization mode for up to 13 filter taps. Power consumption appears constant at 519mW in this small-filter region because the pass register file is not disabled in unconfigured stripes: significant false switching occurs, producing the same amount of activity whether one physical stripe is in use or all sixteen. This problem has been identified, and clock gating is expected to be implemented in the future to avoid the wasted power [3].

Figure 7.6: Power statistics of a FIR filter application running on PipeRench [3]

A significant jump in power consumption happens between 13 and 14 filter taps: fourteen taps require 17 virtual stripes, so hardware virtualization becomes necessary, and about 630mW of power is dissipated in this setting. To integrate the proposed power model with these statistics, a breakdown of the power consumption is required. Table 7.1, from reference [36], lists a power breakdown performed on an XC4003 FPGA over a set of benchmark netlists:

Component      Percentage of Power Dissipation
Interconnect   65%
Clock          21%
IO             9%
Logic          5%

Table 7.1: Power Breakdown of XC4003 [36]

PipeRench is very similar to an FPGA in architecture: both are overwhelmed with interconnect and have very simple processing units.

The power breakdown listed in Table 7.1 can therefore be used as a basis for estimating the power consumption of the PipeRench system in the non-virtualization case. Memory power is missing from Table 7.1, since that FPGA has no on-chip memory. In PipeRench's non-virtualized operation, the configuration data is written into on-chip configuration memory at initialization. Because this is the non-virtualization case, the application comprises at most 16 configuration words, a size that occupies only about two of the configuration SRAMs. Using the memory model equations provided by [37], two such SRAMs can be estimated to consume about 3.5mW per memory access. There are two accesses to the configuration memories: one to write the configuration data at initialization, and another to read the values out when configuring the hardware fabric. To be conservative, 10mW is estimated for memory power dissipation, and another 10mW is assigned to the configuration controller. Deducting (10mW + 10mW) from the measured non-virtualization power dissipation of 519mW [3], and then normalizing the remainder with the breakdown from Table 7.1, gives the results listed in Table 7.2:

Component      Estimated Power (mW)   Percentage
Interconnect   324.4                  62.5%
Clock          104.8                  20.2%
IO             44.9                   8.7%
Logic          25.0                   4.8%
Memory         10.0                   1.9%
Controller     10.0                   1.9%
Total          519.0                  100%

Table 7.2: Estimated Power Breakdown of PipeRench - Non-Virtualization case
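The numbers in Table 7.2 follow mechanically from the procedure just described; a few lines of Python reproduce them (the component fractions come from Table 7.1, and the 10mW memory and controller figures from the estimates above):

```python
xc4003 = {'Interconnect': 0.65, 'Clock': 0.21, 'IO': 0.09, 'Logic': 0.05}
measured, memory_mw, ctrl_mw = 519.0, 10.0, 10.0      # mW
scaled = measured - memory_mw - ctrl_mw               # 499 mW left to distribute
estimate = {name: frac * scaled for name, frac in xc4003.items()}
estimate.update(Memory=memory_mw, Controller=ctrl_mw)
for name, mw in estimate.items():
    print(f'{name:12s} {mw:6.1f} mW  {100 * mw / measured:4.1f}%')
```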

Recall Equation 7.7, the equation for initial configuration loading:

\[ P_{cfg\,loading} = (K_{mem\,ctrl} + K_{IO\,pad} + K_{ctrl} + K_{cfg\,mem\,write}) \cdot V \]

Writing to the configuration memory dissipates about 5mW. Adding some power dissipation from I/O, controller, interconnect, and clock, P_cfg_loading can be estimated at about 35mW.

On-chip configuration requires no I/O activity, but involves accessing the configuration memory and loading the configuration data into the hardware fabric:

\[ P_{fabric\,cfg} = (K_{cfg\,mem\,read} + K_{stripe\,cfg\,write} + K_{ctrl}) \cdot V \]

Again, 5mW is estimated for the memory power of accessing the configuration memory; the other activities are writing the stripe configurations and controller participation. There is a fair amount of routing from the memory to the fabric, so interconnect power is higher than in the initial configuration loading phase, and P_fabric_cfg can be estimated at about 55mW.

The remaining 519 − 35 − 55 = 429mW is made up of execution and I/O data power. Due to the complex interconnect structure in the fabric, execution dissipates a great deal of interconnect power; most of the interconnect and clock power is estimated to fall into this category. Together with logic power, P_exe is estimated at 300mW. The remaining 129mW is attributed to P_IO_data, the activity of transferring application data between the chip and external memory.

With virtualization, the increase in power consumption comes from extra accesses to the configuration memory, data transitions over bus wires, replacement of stripes in the hardware fabric, state store and restore, and management of the controller and translation tables. From Figure 7.6, power increases from 519mW to 630mW at the point of virtualization, an additional 111mW. This is the value of the reconfiguration power, the variable P_recfg_overhead. Breaking this figure down further: one configuration memory, and possibly two state memories and the translation tables, are accessed each clock cycle, which according to the memory power estimations of [37] amounts to about 2mW. The two data controllers involved are estimated to consume an average of 5mW. Interconnect power dissipation is predicted to still dominate, since reconfiguration requires data transmission over long wires: configuration data from memory to fabric, state stored to and restored from the state memory, and additional control signals to and from the controller. If interconnect takes 60% of the reconfiguration power, that is 66mW. The remaining roughly 40mW is divided between clock and controller, at 28mW and 12mW respectively.
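Collecting the per-phase estimates above gives the operational breakdown tabulated next; the totals can be checked with a trivial script:

```python
phases_mw = {'Initial configuration loading': 35,
             'Initial on-chip configuration': 55,
             'Execution': 300,
             'I/O data': 129,
             'Reconfiguration overhead': 111}
total = sum(phases_mw.values())                 # 630 mW, matching Figure 7.6
for phase, mw in phases_mw.items():
    print(f'{phase:30s} {mw:4d} mW  {100 * mw / total:4.1f}%')
```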

The newly estimated component power breakdown for PipeRench with virtualization is shown in Table 7.3, and the estimated operational power breakdown in Table 7.4:

Component      Estimated Power (mW)   Percentage
Interconnect   390.4                  61.8%
Clock          132.8                  21.0%
IO             44.9                   7.1%
Logic          25.0                   4.0%
Memory         12.0                   1.9%
Controller     27.0                   4.3%
Total          632.1                  100%

Table 7.3: Estimated Component Power Breakdown of PipeRench - Virtualization case

Operational Phase                               Estimated Power (mW)   Percentage
Initial Configuration Loading (P_cfg_loading)   35                     5.6%
Initial Configuration (P_fabric_cfg)            55                     8.7%
Execution (P_exe)                               300                    47.6%
I/O Data (P_IO_data)                            129                    20.5%
Reconfiguration (P_recfg_overhead)              111                    17.6%
Total                                           630                    100%

Table 7.4: Estimated Operational Power Breakdown of PipeRench using the Proposed Model

7.5.3 Area Model

A PipeRench system consists of the following components: the hardware fabric, memory, and the configuration controller. Due to place and route, interconnect, fragmentation, I/O pins, and other issues, there is area overhead on top of these components. To make the area model scalable, this overhead is expressed as a percentage of component area. Let A denote the area of a component and N the number of instances of that component.

\[ A_{total} = (A_{fabric} + A_{mem} + A_{ctrl}) \cdot (1 + \alpha_{total\,overhead}) \tag{7.17} \]

Hardware Fabric

The hardware fabric consists of the physical stripes. For the same reasons as above, the area of the fabric does not equal the sum of the individual stripe areas; again, the overhead is expressed as a percentage.

\[ A_{fabric} = (A_{stripe} \cdot N_{stripe}) \cdot (1 + \alpha_{fabric\,overhead}) \tag{7.18} \]

Virtualization and Interface Logic

In the fabricated PipeRench chip, the memory and the configuration controller are integrated together under the name Virtualization and Interface Logic. Since they are difficult to separate in terms of area, they are considered together. As explained in Section 7.2.5, there are three types of on-chip memory: configuration memory, state memory, and interface memory. The controller includes the finite state machine for system control, the address translation tables, the iteration counter, and the interface logic.

\[ A_{virt\,interface\,logic} = (A_{cfg\,mem} \cdot N_{cfg\,mem} + A_{state\,mem} \cdot N_{state\,mem} + A_{cfg\,ctrl} + A_{interface\,mem} \cdot N_{interface\,mem}) \cdot (1 + \alpha_{virt\,logic\,overhead}) \tag{7.19} \]

Figure 7.7 is a die photo of a PipeRench chip from reference [3]; area data extracted from it are listed in Table 7.5. Substituting the appropriate figures into Equation 7.17 yields the overhead values:

\[ A_{total} = (A_{fabric} + A_{mem} + A_{ctrl}) \cdot (1 + \alpha_{total\,overhead}) \]
\[ 55.48\,\text{mm}^2 = (26.23\,\text{mm}^2 + 19.46\,\text{mm}^2) \cdot (1 + \alpha_{total\,overhead}) \]
\[ \alpha_{total\,overhead} = 21.43\% \]

Figure 7.7: PipeRench Chip Floorplan [3]

Sixteen stripes give a total of 20.08mm². Compared to the fabric size of 26.23mm², this is an overhead of 30.63%, i.e. α_fabric_overhead = 30.63%. For the virtualization and interface logic, the component sum in Equation 7.19 (A_cfg_mem · N_cfg_mem + A_state_mem · N_state_mem + A_interface_mem · N_interface_mem + A_cfg_ctrl) adds up to 17.75mm². Compared to the actual area of 19.46mm², the overhead α_virt_logic_overhead equals 9.63%.
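These three overheads are all instances of the same rearrangement of Equations 7.17 through 7.19, α = A_actual / ΣA_parts − 1, which a short Python check confirms:

```python
def overhead(actual_mm2, part_areas_mm2):
    """alpha = actual / sum(parts) - 1, per Eqs. 7.17-7.19 rearranged."""
    return actual_mm2 / sum(part_areas_mm2) - 1

print(f'{overhead(55.48, [26.23, 19.46]):.2%}')    # total:             21.43%
print(f'{overhead(26.23, [20.08 / 16] * 16):.2%}') # fabric:            30.63%
print(f'{overhead(19.46, [17.75]):.2%}')           # virt./interface:    9.63%
```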

Component                                              Area (mm²)   Proportion of Total Die Area
Die                                                    55.48        100%
Fabric                                                 26.23        47.3%
Single stripe (A_stripe)                               1.26         2.3%
16 Stripes (A_stripe · N_stripe)                       20.08        36.2%
Virtualization and Interface Logic                     19.46        35.1%
Single SRAM (A_cfg_mem and A_state_mem)                -            -
Configuration Memory (A_cfg_mem · N_cfg_mem)           -            -
State Memory (A_state_mem · N_state_mem)               -            -
Interface Memory cell (A_interface_mem)                -            -
Interface Memory (A_interface_mem · N_interface_mem)   -            -
Controller (A_cfg_ctrl)                                -            -

Table 7.5: Area breakdown of a PipeRench Chip [3]

7.5.4 Memory Area

From Table 7.5, the on-chip memories add up to about 22% of the total chip area. Among the three types of on-chip memory, the configuration memory is the largest in capacity as well as in area. On-chip memory is critical in the PipeRench system: it is the short access time that enables PipeRench to obtain new configuration data and reconfigure a stripe in a single clock cycle. Likewise, the on-chip state memory is accessed at the same speed during pipeline reconfiguration.

7.5.5 Computational Area

Figure 7.8 is a floorplan of a PipeRench PE from [3], and Table 7.6 lists values extracted from the diagram. Each PE measures 325um x 225um [3].

Component   Area (um²)   Proportion of Total PE Area
PE          73,125       100%
ALU         6,713        9.18%
Registers   7,232        9.89%

Table 7.6: Area breakdown of a PipeRench PE [3]
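The system-level efficiency figures discussed next follow directly from Table 7.6 and the PE dimensions; a quick numeric check in Python:

```python
pe_um2 = 325 * 225                       # 73,125 um^2 per PE [3]
alu_frac, reg_frac = 0.0918, 0.0989      # ALU and register shares of a PE
n_pe, die_mm2 = 256, 55.48

alu_mm2 = n_pe * pe_um2 * alu_frac / 1e6                  # ~1.72 mm^2 of ALU total
print(f'ALU area efficiency:     {alu_mm2 / die_mm2:.1%}')                # ~3.1%
print(f'Register area occupancy: '
      f'{n_pe * pe_um2 * reg_frac / 1e6 / die_mm2:.1%}')                  # ~3.3%
```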

Figure 7.8: Floor plan of a PipeRench PE [3]

Sixteen PEs compose a stripe, and sixteen stripes make up the fabric, for a total of 256 PEs in the system. Within a PE, the ALU (the computational unit) makes up 9.18% of the area, and the registers another 9.89%; the rest of the area consists of multiplexors, interconnect, and control drivers. Extending this to the system level, the ALUs of all 256 PEs add up to 1.719mm², or 3.1% of the total chip area of 55.48mm² [3]. The computational area efficiency of the PipeRench system is therefore only 3.1%, which is relatively low. Applying the same calculation to the registers gives a register area occupancy of 3.34%. The ALU within a PE is built from lookup tables (LUTs), whose simple structure allows fast execution in a very small area; as a result, the chip area is dominated by other parts and overhead, which produces the low computational area efficiency.

7.5.6 Others

I/O

The 32-bit hardware interface provides the continuous data stream for pipelined execution. However, this interface is quite limited in bandwidth and is constantly occupied with data transfers, which limits the PipeRench system's potential for high-I/O applications.


More information

Modeling Arbitrator Delay-Area Dependencies in Customizable Instruction Set Processors

Modeling Arbitrator Delay-Area Dependencies in Customizable Instruction Set Processors Modeling Arbitrator Delay-Area Dependencies in Customizable Instruction Set Processors Siew-Kei Lam Centre for High Performance Embedded Systems, Nanyang Technological University, Singapore (assklam@ntu.edu.sg)

More information

Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing

Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing Walter Stechele, Stephan Herrmann, Andreas Herkersdorf Technische Universität München 80290 München Germany Walter.Stechele@ei.tum.de

More information

Application of Power-Management Techniques for Low Power Processor Design

Application of Power-Management Techniques for Low Power Processor Design 1 Application of Power-Management Techniques for Low Power Processor Design Sivaram Gopalakrishnan, Chris Condrat, Elaine Ly Department of Electrical and Computer Engineering, University of Utah, UT 84112

More information

Performance, Power, Die Yield. CS301 Prof Szajda

Performance, Power, Die Yield. CS301 Prof Szajda Performance, Power, Die Yield CS301 Prof Szajda Administrative HW #1 assigned w Due Wednesday, 9/3 at 5:00 pm Performance Metrics (How do we compare two machines?) What to Measure? Which airplane has the

More information

A Model for Experiment Setups on FPGA Development Boards

A Model for Experiment Setups on FPGA Development Boards Bachelor Informatica Informatica Universiteit van Amsterdam A Model for Experiment Setups on FPGA Development Boards Matthijs Bos 2016-08-17 Supervisors: A. van Inge & T. Walstra, University of Amsterdam

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition The Processor - Introduction

More information