730 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 14, NO. 7, JULY 2006

System-Level Power-Performance Tradeoffs for Reconfigurable Computing

Juanjo Noguera and Rosa M. Badia

Abstract: In this paper, we propose a configuration-aware data-partitioning approach for reconfigurable computing. We show how the reconfiguration overhead impacts the data-partitioning process. Moreover, we explore the system-level power-performance tradeoffs available when implementing streaming embedded applications on fine-grained reconfigurable architectures. For a certain group of streaming applications, we show that an efficient hardware/software partitioning algorithm is required when targeting low power. However, if the application objective is performance, then we propose the use of dynamically reconfigurable architectures. We propose a design methodology that adapts the architecture and algorithms to the application requirements. The methodology has been proven to work on a real research platform based on Xilinx devices. Finally, we have applied our methodology and algorithms to the case study of image sharpening, which is required nowadays in digital cameras and mobile phones.

Index Terms: Hardware/software (HW/SW) codesign, power-performance tradeoffs, reconfigurable computing (RC).

I. INTRODUCTION AND MOTIVATION

Reconfigurable computing (RC) [5] is an interesting alternative to application-specific integrated circuits (ASICs) and general-purpose processors for implementing embedded systems, since it provides the flexibility of software processors together with the efficiency and throughput of hardware coprocessors. Programmable systems-on-chip have become a reality, combining a wide range of complex functions on a single die. An example is the Virtex-II Pro from Xilinx, which integrates a core processor (PowerPC405), embedded memory, and configurable logic.
Additionally, the importance of having on-chip programmable logic regions in system-on-chip (SoC) platforms is becoming increasingly evident. Partitioning an application among software and programmable-logic hardware can substantially improve performance; it can also reduce power consumption by performing computations more efficiently and by allowing longer microprocessor shutdown periods. Dynamic reconfiguration [25] has emerged as a particularly attractive technique to increase the effective use of programmable logic blocks, since it allows changing the device configuration on the fly during application execution. However, this attractive idea of time-multiplexing the needed device configurations does not come for free: the reconfiguration overhead has to be minimized in order to improve application performance. Temporal partitioning [16] and context scheduling [9] can be used to minimize this penalty. In summary, the system-level approaches to reconfigurable computing can be divided into two broad categories: 1) hardware/software (HW/SW) partitioning for statically reconfigurable architectures and 2) temporal partitioning and context scheduling for dynamically reconfigurable architectures.

Manuscript received July 2, 2005; revised January 9. This work was supported by the CICYT under Project TIN CO2-01 and by DURSI under Project 2001SGR. J. Noguera was with the Computer Architecture Department, Technical University of Catalonia, Barcelona, Spain. He is now with Xilinx Research Laboratories, Saggart, Co. Dublin, Ireland (juanjo.noguera@xilinx.com; jnoguera@ac.upc.edu). R. M. Badia is with the Computer Architecture Department, Technical University of Catalonia, Barcelona, Spain (rosab@ac.upc.edu). Digital Object Identifier /TVLSI
On the other hand, energy-efficient computation is a major challenge in embedded systems design, especially when portable, battery-powered systems (e.g., mobile phones or digital cameras) are considered [1]. It is well known that the memory hierarchy is one of the major contributors to the system-level power budget [1], [22]. Thus, the way we partition data between on-chip and off-chip memory impacts the overall system-level power consumption. In this paper, we investigate the power-performance tradeoffs of these two system-level approaches to RC. We show that, when targeting streaming applications, the choice of approach (i.e., HW/SW partitioning or context scheduling) depends on the application requirements (i.e., power or performance). Moreover, we propose that, in the context-scheduling approach, the reconfigurable architecture should process large blocks of data, which should be stored in external memory resources. Executing large blocks of data minimizes the reconfiguration overhead, but it also increases the power consumption due to the use of external memory. On the other hand, the HW/SW partitioning-based approach should process small blocks of data that can be stored in on-chip memory, which reduces the overall system power consumption. The paper is organized as follows. Section II reviews the related work. In Section III, we introduce our target architecture. The proposed design methodology for embedded systems is presented in Section IV. Section V introduces the concept of configuration-aware data partitioning. In Section VI, we explain the benchmarks, the experimental setup, and the obtained results. Finally, the conclusions of this paper are presented in Section VII.

Note: in this paper, context scheduling refers to the scheduling of: 1) task executions and 2) the reconfiguration processes of the configurable blocks in partially (not multicontext) reconfigurable devices.
II. PREVIOUS WORK

HW/SW partitioning for reconfigurable computing has been addressed in several research efforts [2], [6], [8]. An integrated algorithm for HW/SW partitioning and scheduling, temporal partitioning, and context scheduling is presented in [2]. On the other hand, context scheduling has also been widely addressed in many publications [9], [16], [21], [23]. However, none of these papers addresses power-performance tradeoffs. A review of design techniques for system-level dynamic power management can be found in [1]. In addition, a survey of power-aware design techniques for real-time systems is given in [22]. However, none of these papers considers the use of reconfigurable architectures. Power consumption of field-programmable gate-array (FPGA) devices has been addressed in several research efforts [3], [19], [24]. In addition, several power estimation models have been proposed [7], [15]. However, all of these approaches study the power requirements at the device level and not at the system level. Only a few recent research efforts have addressed low-power task scheduling for dynamically reconfigurable devices. The technique proposed in [10] tries to minimize power consumption during reconfiguration by minimizing the number of bit changes between reconfiguration contexts. However, neither power-performance tradeoffs nor power measurements are presented. More recently, in [12], it was shown that configuration prefetching and frequency scaling can reduce energy consumption without affecting performance. However, that work does not cover the benefits of HW/SW partitioning. Additional techniques are given in [18] and [20]. A technique for application partitioning between configurable logic and an embedded processor is given in [20]. That work shows that such partitioning helps to improve both performance and energy.
However, it considers only statically configurable logic and does not consider dynamically reconfigurable architectures. A different approach, for coarse-grained RC, is presented in [18], where a data-scheduler algorithm is proposed to reduce the overall system energy. However, that work does not consider the benefits of HW/SW partitioning.

A. Contributions of This Work

This paper explores the system-level power-performance tradeoffs for fine-grained reconfigurable computing. More specifically, the paper compares, in terms of energy savings and performance improvements, the two key approaches existing in reconfigurable computing: 1) partitioning an application between software and configurable hardware and 2) context scheduling for dynamically reconfigurable architectures. To the best of our knowledge, this open issue has not been addressed in previous research efforts. In addition, the study presented in this paper focuses on a data-size-based partitioning approach for streaming applications. This differs from the majority of the traditional HW/SW partitioning and context scheduling approaches in the literature, which focus on task-graph dependency analysis.

Fig. 1. Dynamically reconfigurable CMP architecture.

Fig. 2. (a) Dynamically reconfigurable processor. (b) Architecture of the L2 on-chip memory subsystem.

III. TARGET ARCHITECTURE

The target architecture is a heterogeneous architecture, which includes an embedded processor, a given number of dynamically reconfigurable processors (DRPs), an on-chip L2 multibank memory subsystem, and external DRAM memory resources. An example of this architecture is shown in Fig. 1, where we can see a four-DRP-based architecture. This architecture follows the chip multiprocessor (CMP) paradigm. The data that must be transferred between tasks executed in the DRP processors are stored in the on-chip L2 memory subsystem. Each DRP processor can be independently reconfigured.
The proposed target architecture supports multiple reconfigurations running concurrently, which is not the case for most of the architectures proposed in the literature. Each DRP processor has a local L1 memory buffer, and a hardware-based data prefetching mechanism is proposed to hide the memory latency. Each DRP has a point-to-point link to the L2 buffers (omitted from Fig. 1 for simplicity). This link is shown in Fig. 2(a), which depicts the internal architecture of a DRP processor. There are three main components in this architecture: 1) the load unit; 2) the store unit; and 3) the dynamically reconfigurable logic. The DRPs are single-context devices. It can be observed in Fig. 2(a) that the load and store units have internal L1 data buffers; each unit has two internal buffers. This approach allows three processes to run concurrently: 1) the load unit receives data for the next computation; 2) the reconfigurable logic processes data from a buffer in the load unit and stores the processed data in a buffer of the store unit; and 3) the store
unit sends the previously processed data to the L2 memory subsystem. The on-chip L2 memory subsystem is based on a multibank approach [see Fig. 2(b)]. Each of these banks is logically divided into two independent subbanks (i.e., this enables reading from one subbank while concurrently writing to the other subbank of the same physical bank). These buffers interact on one side with the data prefetch units [see the left-hand side of Fig. 2(b)] through a crossbar and on the other side with an on-chip bus connected to the external DRAM memory controller. In this L2 memory subsystem, there must be as many data prefetch units as DRP processors. The proposed architecture also includes, for each DRP, a dedicated hardware-based configuration prefetch unit (not shown in the figures for simplicity). Thus, the architecture supports the transfer of data in one DRP overlapped with the reconfiguration of a different DRP. Each DRP processor has its own clock signal, which makes this a kind of globally asynchronous, locally synchronous (GALS) architecture. The architecture supports the use of clock-gating and frequency-scaling techniques for power-consumption minimization independently for each DRP.

Fig. 3. Design methodology for embedded systems.

IV. DESIGN METHODOLOGY FOR EMBEDDED SYSTEMS

The proposed design methodology is depicted in Fig. 3. It is divided into three steps: 1) application phase; 2) static phase; and 3) dynamic phase.

A. Application Phase

The proposed methodology assumes that the input application is specified as a task graph, where nodes represent tasks (i.e., coarse-grained computations) and edges represent data dependencies. Each edge has a weight that represents the amount of data that must be transferred between tasks.
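As a concrete illustration, the task-graph input described above can be captured in a small data structure. The following Python sketch is ours, not the authors' specification; the class, field, and task names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class TaskGraph:
    # task name -> task type (several tasks may share one type/configuration)
    tasks: dict = field(default_factory=dict)
    # (src, dst) edge -> weight: amount of data transferred between the tasks
    edges: dict = field(default_factory=dict)

    def add_task(self, name, task_type):
        self.tasks[name] = task_type

    def add_edge(self, src, dst, nbytes):
        self.edges[(src, dst)] = nbytes

    def predecessors(self, name):
        return [s for (s, d) in self.edges if d == name]

# A hypothetical two-task fragment: color-space conversion feeding a
# 3 x 3 convolution, with a 16-kB block passed between them.
g = TaskGraph()
g.add_task("T1", "rgb2ycrcb")
g.add_task("T2", "conv3x3")
g.add_edge("T1", "T2", 16384)
```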
Finally, each task has an associated task type (i.e., in the task-graph specification, several tasks may implement the same type of computation).

B. Static Phase

In this phase, there are four main processes: 1) task-level graph transformations; 2) HW/SW synthesis; 3) HW/SW partitioning; and 4) priority task assignment. We can apply task-level graph transformation techniques in order to increase the architecture performance. These transformations include task pipelining, task blocking, and task (configuration) replication. The output of this step is the modified task graph. HW/SW synthesis is the process of implementing the tasks found in the application. The output of this process is a set of estimators; typical estimators are HW execution time, SW execution time, HW area, and reconfiguration time. These estimators can be obtained using accurate implementation tools (i.e., compiler, logic synthesis, and place&route tools) or using high-level estimation tools. The HW/SW partitioning process decides which tasks are mapped to hardware or software depending on: 1) the architecture parameters (i.e., the number of DRP processors or the external DRAM size); 2) the modified task graph; and 3) the tasks' estimators. Note that the application requirements do not directly affect the HW/SW partitioning process; they affect it indirectly through the modified task graph. The partitioning algorithm must take the configuration prefetch technique into account in its implementation. Finally, in the static phase, we also find the priority task assignment process. In this process, we statically assign an execution priority to each task. This information is used at run-time to decide the execution order of the tasks. An example of a priority function is critical-path analysis.

C. Dynamic Phase

This phase is responsible for the scheduling of the tasks and also for the scheduling of the DRPs' reconfigurations.
The task scheduler and the DRP context scheduler cooperate and run in parallel during application run-time execution. Their functionality is based on a look-ahead strategy into the list of tasks ready for execution (i.e., tasks whose predecessors have finished their execution). At run-time, the task scheduler assigns tasks to DRPs and decides the execution order of the tasks in the ready-for-execution list. The DRP context (configuration) scheduler is used to minimize the reconfiguration overhead. Its objective is to decide: 1) which DRP processor must be reconfigured and 2) which reconfiguration context, or hardware task from the list of tasks ready for reconfiguration (i.e., tasks whose predecessors have initiated their execution), must be loaded into the DRP processor. This scheduler tries to minimize the reconfiguration overhead by overlapping the execution of tasks with DRP reconfigurations. These algorithms are implemented in hardware using the dynamic scheduling unit (DSU) found in our architecture (see Fig. 1) [13]. Several research efforts in the field of SoC design propose moving into hardware functionality that has traditionally been assigned to operating systems [17].

V. CONFIGURATION-AWARE DATA PARTITIONING

Fig. 4. (a) Task graph for this example. (b) Sequential scheduling. (c) Scheduling with configuration prefetching. (d) Scheduling with data partitioning and configuration prefetching.

Here, we explain how, depending on the application requirements (e.g., power or performance), the reconfiguration overhead impacts the data-partitioning process. Moreover, we show that the proposed data-partitioning technique strongly influences the HW/SW partitioning results. Finally, we explain the power-performance design tradeoffs involved in the data-partitioning technique.

A. Introduction and Motivation

In our approach, we want to execute an application modeled as a task graph on a hybrid reconfigurable architecture with a given number of DRP processors, each characterized by its reconfiguration time. Moreover, the application must process an input data set of a given fixed size. In many streaming embedded applications, we can assume that the execution time of the application is proportional to the size of the data to be processed; in other words, the execution time of each task in the application is proportional to the amount of data that the task has to process. The data-partitioning process that we propose assumes that the execution time of the tasks is longer than the DRP reconfiguration time. Obviously, there are several alternatives when scheduling an application on a dynamically reconfigurable architecture. In Fig. 4, we can observe three possible solutions for an application with five tasks and a three-DRP-based architecture. The task graph used in this example is shown in Fig. 4(a), where we can also observe the execution time of each task. In the following paragraphs, we explain these three possible solutions.

1) Sequential Scheduling: This is the simplest solution, where task executions and DRP reconfigurations are sequentially scheduled in the DRPs [see Fig. 4(b)].
We can observe that the execution time of the tasks is longer than the DRP reconfigurations (shown as a shaded R in the figure). We should also notice the performance penalty due to the reconfiguration overhead.

2) Scheduling With Configuration Prefetching: Configuration caching [27] and configuration prefetching [4] are well-known mechanisms in reconfigurable computing to hide the reconfiguration overhead. Configuration prefetching is based on the idea of loading the required configuration on a DRP before it is actually required, thus overlapping execution in one DRP with reconfiguration in a different DRP. In our approach, the configuration prefetching of a task can start when all its predecessor tasks have started their execution. For instance, the configuration prefetching of task T2 can start after task T1 has begun its execution. On the other hand, the execution of a task may start only when all of its predecessors have finished their execution (task T2 can start when task T1 has finished). As we can observe in Fig. 4(c), this technique completely hides the reconfiguration overhead for all DRP processors, thus improving the application performance. This approach is based on the idea that the task graph is executed only once (i.e., each task processes the whole input data set). Its benefit is that it requires the minimum number of DRP reconfigurations (e.g., five reconfigurations in this example). However, this approach has two main drawbacks. First, the shared memory buffers used for task communication are large (they must be able to store the maximum data size required by all the tasks). Second, the DRP processors wait for their incoming data for a significant amount of time (i.e., they have finished the reconfiguration but cannot start execution because the input streams are not yet in the shared memory buffers).
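The two conditions above (a task's configuration may be prefetched once all its predecessors have started; the task may execute once its reconfiguration is done and all its predecessors have finished) can be sketched as follows. This is our illustrative Python rendering, not the authors' scheduler:

```python
def schedule_times(preds, start, finish, t_rec):
    """Earliest prefetch-start and execution-start times per task.

    preds:  task -> list of predecessor tasks
    start:  predecessor task -> time its execution started
    finish: predecessor task -> time its execution finished
    t_rec:  DRP reconfiguration time
    """
    prefetch, execute = {}, {}
    for t, ps in preds.items():
        # prefetch may begin when every predecessor has started
        prefetch[t] = max((start[p] for p in ps), default=0)
        # execution needs the reconfiguration done and every predecessor finished
        execute[t] = max(prefetch[t] + t_rec,
                         max((finish[p] for p in ps), default=0))
    return prefetch, execute

# T1 runs from t=0 to t=10; reconfiguration takes 4 time units.
# T2's context, prefetched at t=0, is ready at t=4 < 10, so the
# overhead is completely hidden and T2 starts at t=10.
pf, ex = schedule_times({"T2": ["T1"]}, {"T1": 0}, {"T1": 10}, t_rec=4)
```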
3) Scheduling With Configuration Prefetching and Data Partitioning: This approach tries to overcome the limitations of the previous one. It also uses configuration prefetching, but the input data set is not processed all at once; instead, it is partitioned into several data blocks of a given size. This also means that the task graph must be iterated as many times as there are input data blocks. In the example shown in Fig. 4(d), the input data set has been partitioned into two data blocks (named 0 and 1), and the task graph is iterated twice. This technique reduces the size of the shared memory buffers required for task communication. Moreover, the latency from DRP reconfiguration to DRP execution is also reduced. However, this approach has the drawback of increasing the number of reconfigurations, because the task graph must be iterated several times. For example, in Fig. 4(d), we now have nine reconfigurations compared with the five required in Fig. 4(c). In addition, this technique also impacts performance, since we cannot use configuration prefetching between two iterations of the task graph.
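The cost of this third alternative can be quantified with a simple count. The following Python sketch is ours; it gives only an upper bound on reconfigurations, since reuse of a configuration across iterations (which yields nine rather than ten reconfigurations in the Fig. 4(d) example) is not modeled:

```python
from math import ceil

def partition_cost(data_size, block_size, n_hw_tasks):
    """Iterations of the task graph and an upper bound on the number of
    DRP reconfigurations when the input data set is split into blocks."""
    iterations = ceil(data_size / block_size)
    reconfigs = n_hw_tasks * iterations  # configuration reuse may lower this
    return iterations, reconfigs

# One block (the whole data set): minimum number of reconfigurations.
partition_cost(100, 100, 5)   # -> (1, 5)
# Two blocks: the graph iterates twice; up to 10 reconfigurations.
partition_cost(100, 50, 5)    # -> (2, 10)
```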
Fig. 5. Example of how increasing the amount of processed data can help to minimize the reconfiguration overhead.

B. Model for the Reconfiguration Overhead

It has been demonstrated that the parameters of the reconfigurable architecture (i.e., the number of DRP processors or the reconfiguration time) have a direct impact on the performance obtained by the HW/SW partitioning process [11]. The partitioning process must take into account the reconfiguration time and the configuration prefetching technique for reconfiguration latency minimization. This is summarized in the following expression, which shows how the execution time of a task mapped to hardware must be modified to account for the reconfiguration overhead (see also Fig. 5):

    $T_{hw}(\tau_i) = t_{exec}(\tau_i) + P_R \cdot \max(0,\; T_R - \bar{t}_{exec})$   (1)

where $t_{exec}(\tau_i)$ is the execution time of task $\tau_i$ without any reconfiguration overhead; $P_R$ is the probability of reconfiguration, which is a function of the number of tasks mapped to hardware and the number of DRP processors; $T_R$ is the reconfiguration time needed for a DRP processor to change its context (configuration); and $\bar{t}_{exec}$ is the average execution time of all tasks, which configuration prefetching overlaps with the reconfiguration. In this paper, the execution time of a task includes the time required to: 1) read the data from memory; 2) process the data; and 3) write the processed data back to memory.

On the other hand, in the design of embedded systems, we would like to minimize the number of accesses to external memory in order to reduce the overall system-level power consumption. Thus, data transfers between tasks should be kept to a size that fits into the on-chip L2 memory. In many streaming embedded applications, we can assume that the execution time of a given task implemented in hardware or software is proportional to the size of the data to be processed. Thus, if the data are stored in an on-chip memory of smaller capacity, the average execution time of the tasks will be small compared with the reconfiguration time (we assume reconfiguration times on the order of 800 µs to 1.4 ms). In this case, applying expression (1), we will have a significant reconfiguration overhead (because $\bar{t}_{exec} < T_R$), which may prevent moving a task from software to hardware. In order to overcome this limitation and reduce the reconfiguration overhead, we can increase the amount of data processed by each task. Increasing the amount of data means that we are forced to use external memory. With this approach, we increase performance (because more tasks can be mapped to hardware), but we also increase the overall system-level power consumption. An example of this concept [see (1)] can be observed in Fig. 5, where we consider the execution of two tasks. In Fig. 5(b), we can see that, even when using the configuration prefetching technique, we cannot completely hide the reconfiguration overhead for task T2, since task T1 has a shorter execution time because it processes ten data units [see Fig. 5(a)]. As previously introduced, this can be improved by increasing the amount of processed data. In this example, we have increased the amount of processed data to twenty data units [see Fig. 5(c)], which increases the execution time of task T1 until it equals the reconfiguration time of task T2, hence completely hiding the reconfiguration overhead [see Fig. 5(d)].

C. Data Partitioning for Reconfigurable Architectures

How the input data set is partitioned mainly determines which approach is used: 1) HW/SW partitioning for statically reconfigurable architectures using on-chip memory or 2) context scheduling for dynamically reconfigurable architectures using off-chip memory.
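The overhead model of expression (1) maps directly to code. The following Python sketch is our rendering (the names t_exec, p_reconf, t_reconf, and t_avg are ours, not the authors'): prefetching hides, on average, the average task execution time of each reconfiguration, and the remainder appears as overhead weighted by the reconfiguration probability:

```python
def hw_exec_time(t_exec, p_reconf, t_reconf, t_avg):
    """Effective execution time of a task mapped to hardware, per (1).

    t_exec:   task execution time without reconfiguration overhead
    p_reconf: probability that running the task requires a reconfiguration
    t_reconf: DRP reconfiguration time
    t_avg:    average task execution time (hidden by prefetching)
    """
    exposed = max(0.0, t_reconf - t_avg)  # overhead prefetching cannot hide
    return t_exec + p_reconf * exposed

# Small on-chip blocks: t_avg (600 us) < t_reconf (1 ms) -> visible overhead,
# 500 us + 0.5 * 400 us = 700 us.
hw_exec_time(500e-6, 0.5, 1.0e-3, 600e-6)
# Larger blocks: t_avg >= t_reconf -> the overhead is fully hidden.
hw_exec_time(2.0e-3, 0.5, 1.0e-3, 2.0e-3)
```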
Thus, the input streaming data set must be partitioned into several blocks, and the size of these blocks is mainly driven by the objective of the application (i.e., power or performance). Consequently, if the application objective is performance, then we should process large blocks of data, because we want to minimize the reconfiguration overhead. Moreover, if we are processing large blocks of data, it is more likely that these blocks do not fit in the on-chip L2 memory subsystem, and we are forced to use off-chip memory, thus increasing the overall system-level power consumption. On the other hand, if the application objective is low power, we must process small blocks of data so that they can be stored in the on-chip memory, thus minimizing the system-level power consumption. The drawback of this solution is that, with small blocks of data, the reconfiguration overhead becomes more significant, which may prevent mapping more tasks onto the run-time reconfigurable hardware. In summary, processing small blocks of data, stored in on-chip memory, reduces the power consumption but also reduces the application performance. In addition, the type of on-chip/off-chip data partitioning determines the number of iterations of the task graph. This is shown in Fig. 6, where we can observe an example of an image-processing application with three different image sizes.
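The block-size decision just described can be sketched as a simple policy. This is our illustrative Python, not the authors' algorithm; the on-chip capacity and processing rate used in the example are assumed figures, not measurements from the paper:

```python
def choose_block_size(objective, onchip_bytes, t_reconf, bytes_per_sec):
    """Pick a data-block size for the streaming input.

    objective:     "performance" or "power"
    onchip_bytes:  on-chip L2 capacity available for data blocks
    t_reconf:      DRP reconfiguration time (seconds)
    bytes_per_sec: task processing rate (execution time is assumed
                   proportional to block size)
    """
    # Smallest block whose execution time equals the reconfiguration time,
    # i.e., just large enough to hide the reconfiguration overhead.
    hides_reconf = round(t_reconf * bytes_per_sec)
    if objective == "performance":
        # Large blocks, even if they spill to off-chip memory.
        return hides_reconf
    # Low power: never exceed on-chip capacity, accepting extra overhead.
    return min(hides_reconf, onchip_bytes)

# A 1-ms reconfiguration at an assumed 200 MB/s needs 200-kB blocks; with
# only 64 kB of on-chip L2, the low-power policy stays on chip instead.
choose_block_size("performance", 64_000, 1e-3, 200e6)  # 200000
choose_block_size("power", 64_000, 1e-3, 200e6)        # 64000
```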
Fig. 6. (a) Initial task graph. (b) Data partitioning for dynamically reconfigurable architectures. (c) Data partitioning for HW/SW partitioning.

Fig. 7. Image sharpening benchmarks. (a) Unsharp masking. (b) Sobel filter. (c) Laplacian filter.

In Fig. 6(b), we can observe that the blocks of data are large. The amount of data to be processed must be such that the task execution time at least equals the reconfiguration time. In this situation, the number of task-graph iterations is small; for example, for one of the image sizes considered, we must iterate the task graph four times, given the block size used. In the opposite case, we process small blocks of data [e.g., as shown in Fig. 6(c)] but have a large number of iterations of the task graph; an input image may then require 16 iterations. This example assumes that the data are partitioned into square blocks, but the input data set (e.g., an image) could also have been partitioned into blocks of several rows or columns. We would like to clarify at this point that the techniques proposed in this paper may not apply to all kinds of streaming applications; there are other types of applications where this idea of block-based data partitioning and processing is not possible (e.g., video coding).

VI. EXPERIMENTS AND RESULTS

A. Image Sharpening Benchmarks

The proposed dynamically reconfigurable architecture addresses streaming-data (computationally intensive) embedded applications, that is, applications with a large amount of data-level parallelism. It is not the goal of the proposed architecture to address control-dominated applications. Image-processing applications are a good example of the type of applications that we are addressing.
This kind of application is becoming more and more sensitive to power consumption, especially if we consider the increasing market share of digital cameras and mobile phones with embedded cameras, which require this type of image processing. In this sense, we have selected three benchmarks that implement an image-sharpening application (see Fig. 7). The three benchmarks follow the same basic process: 1) transform the input image from the RGB to the YCrCb color space; 2) improve the image quality by processing the luminance (mainly using sliding-window operations like 3 x 3 linear convolutions); and 3) transform from YCrCb back to the RGB color space. Three different input data sets (image sizes) have been used in the experiments.

B. Prototype Implementation

A prototype of the proposed architecture has been designed and implemented. The Galapagos system is a PCI-based system (64 b/66 MHz). It is based on leading-edge FPGAs from Xilinx and high-bandwidth DDR SDRAM memory (see the left-hand side of Fig. 8). This reconfigurable system is based on a Virtex-II Pro device. The device used is an XC2VP20, which includes two PowerPC processors. The dynamic scheduling unit (DSU in Fig. 1) and the data prefetch units of the L2 memory subsystem [see Fig. 2(b)] have been mapped to the Virtex-II Pro device, which also includes the SDRAM memory controller. The design of these blocks has been done in Verilog HDL, and the implementation has been done using Synplicity (synthesis) and Xilinx (place&route) tools. The DRP processors of our architecture are implemented in the Galapagos system using three Virtex-II devices (i.e., XC2V1000). The load and store units have been implemented using Virtex-II on-chip memory. The size of the buffers in the load/store units is 2 KB each (i.e., 4 KB per unit). The width of the memory words is 64 b.

Fig. 8. Galapagos prototyping platform.

Fig. 8 shows a picture of the Galapagos system in a PC environment.
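The two 2-KB buffers per load/store unit implement the ping-pong overlap described in Section III: while one buffer is being filled, the other feeds the datapath. A minimal Python sketch of that mechanism (our illustration only; the real units are hardware, and the increment is a stand-in computation):

```python
def pingpong_stream(blocks):
    """Process a stream with two buffers: while one buffer is being
    (re)filled by the load unit, the other one is processed."""
    buffers = [None, None]   # the two load-unit buffers
    out = []
    fill = 0                 # index of the buffer currently being filled
    for blk in blocks:
        buffers[fill] = blk          # the load unit fills one buffer...
        ready = buffers[1 - fill]    # ...while the other is processed
        if ready is not None:
            out.append([x + 1 for x in ready])  # stand-in computation
        fill = 1 - fill
    # drain the last filled buffer
    last = buffers[1 - fill]
    if last is not None:
        out.append([x + 1 for x in last])
    return out
```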
Fig. 9. Final placed and routed task on three Virtex-II devices: (a) XC2V1000, (b) XC2V500, and (c) XC2V250.

Fig. 10. HW/SW task execution time.

C. Task Performance Results

Fig. 10 shows the execution time of the tasks of the unsharp masking application running on: 1) an embedded PowerPC405 processor (300 MHz); 2) a DRP processor from the Galapagos system (60 MHz); and 3) a DRP processor processing blocks of data of a different size. It is interesting to note the order-of-magnitude improvement obtained in the implementation of the blur task (3 x 3 linear convolution). It is not the objective of this paper to explain the implementation details of the several tasks in hardware. These tasks have been designed in Verilog HDL, simulated using ModelSim, and implemented using Synplicity (synthesis) and Xilinx (place&route) tools. In order to reduce the reconfiguration overhead, we have used the partial reconfiguration capability of the Virtex-II devices [26]. In this sense, the Virtex-II resources used by the hardware tasks have been fixed to the center of the device, where we time-multiplex the required task (see Fig. 9). The left and right sides of the device are used by the DRP's load and store units, which are not run-time reconfigured [see Fig. 2(a)]. We have implemented the DRP processors in three different Xilinx Virtex-II devices (i.e., XC2V250, XC2V500, and XC2V1000), which mainly differ in the amount of hardware area available to the reconfigurable unit. Using this capability of the Virtex-II devices, with a reconfiguration clock of 66 MHz, we have obtained the following average reconfiguration times for the three devices: a) 949 µs for an XC2V250; b) 1087 µs for an XC2V500; and c) 1337 µs for an XC2V1000.

Fig. 11. Hardware/software task power consumption.

D. Task Power Results

Fig.
11 shows the power consumption of a Galapagos DRP in its several states using on-chip or off-chip memory. Moreover, we can also observe the power consumption of the embedded PowerPC405 processor, which is used to execute the tasks mapped to software. In Fig. 11, we give three values for the DRP's power consumption (a different value for each Xilinx Virtex-II device). These power consumption values have been obtained using XPower, the power estimation tool from Xilinx. Moreover, using XPower, we have estimated the power consumption of the on-chip memory. Finally, the power consumption of the off-chip memory (i.e., external DRAM) has been obtained from Micron datasheets. We have used two memory chips of 64 MB running at 100 MHz. In the following paragraphs, we explain the DRP processor's power consumption in its several states.

The power consumption in the idle/wait state represents: 1) the static (i.e., leakage) power associated with a complete DRP processor (i.e., load, store, and reconfigurable units) and 2) the static power taken by the on-chip or off-chip memory resources. Clearly, the static power increases when we: 1) use external memory or 2) increase the size of the device (i.e., increase the hardware area).

The power in the reconfiguration state includes: 1) the power of the DRP processor itself (from Xilinx, we have learned that this power consumption is mainly driven by the device leakage power); 2) the dynamic power consumption of the L2 configuration prefetch unit; and 3) the dynamic power of the on-chip or off-chip memory resources (in the latter case, we also include the power consumption of the I/O buffers). It is interesting to note that the dynamic power taken by the Xilinx Virtex-II devices during the reconfiguration process can be ignored; the static (i.e., leakage) power is so significant that the dynamic power is lost in the noise. Keep in mind that, during the reconfiguration process, only a small amount of logic is actually switching, since the reconfiguration context (i.e., bitstream) is sequentially loaded into the reconfigurable hardware.

The power consumption in the execution state accounts for: 1) the static and dynamic power of the full DRP processor (i.e., load, store, and reconfigurable units); 2) the dynamic power of the L2 data prefetch units; and 3) the power consumption of the associated (i.e., on-chip or off-chip) memory resources. As in the previous case, when dealing with external memory, we also take into account the power consumption of the I/O buffers (e.g., LVTTL 3.3 V). The DRP power consumption in execution is an average power obtained when the tasks of the unsharp masking application run at 60 MHz. This average power has been obtained by performing gate-level accurate simulations after the place&route process for all tasks. Finally, let us briefly explain the power consumption of the embedded CPU.
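The reconfiguration times reported in the previous subsection grow with device size because larger devices have larger partial bitstreams. As a rough sanity check, under the assumption that the configuration port accepts one bitstream byte per configuration-clock cycle (and with an illustrative, not measured, bitstream size), the latency can be estimated as:

```python
# Back-of-the-envelope reconfiguration-latency model (an assumption-laden
# sketch, not the authors' measurement method): latency scales linearly
# with partial-bitstream size and inversely with the configuration clock.

def reconfig_time_us(bitstream_bytes, clk_mhz=66.0, bytes_per_cycle=1):
    """Estimated reconfiguration latency in microseconds, assuming
    bytes_per_cycle bitstream bytes are loaded per configuration clock."""
    return bitstream_bytes / (clk_mhz * bytes_per_cycle)

# An assumed ~63 KB partial bitstream at 66 MHz lands near the ~949 us
# reported for the smallest device; larger devices have proportionally
# larger bitstreams and hence longer reconfiguration times.
print(round(reconfig_time_us(63_000), 1))
```

This first-order model is consistent with the observation that the measured times for the three devices order themselves by device size.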
According to Xilinx, the PowerPC405 takes 0.9 mW/MHz. Assuming a clock frequency of 300 MHz, we obtain a power consumption of 270 mW. We should also add the power consumption of the data prefetch units attached to the embedded processor.

E. Energy-Performance Tradeoff Results

In this subsection, we explain the energy-performance tradeoff results obtained when applying the proposed configuration-aware data-partitioning technique. The performance results have been obtained from real executions on the Galapagos system. The execution generates a log file with the state changes of the Virtex-II devices and the embedded PowerPC. We have obtained the energy from: 1) the power consumption of the components as described in Fig. 11 and 2) the execution log file, which gives the amount of time that each device has spent in a given state. Fig. 13(a) shows the performance results and Fig. 13(b) shows the energy consumption results for the unsharp masking application. In all plots, we can observe the results obtained for the image size when we change the target device (i.e., we show the results for the three Virtex-II devices).

Fig. 12. Unsharp masking HW/SW task partitioning.

In addition, we present the following four implementations.

Software implementation (named seq_sw): this implementation is based on the use of the embedded PowerPC405; the associated performance and power results are shown in Figs. 10 and 11, respectively. In this experiment, we assume that the input images have been partitioned into blocks of pixels (i.e., since we have partitioned the input image into several blocks, we must iterate the task graph several times; for example, 16 times in the case of an image with pixels).

HW/SW partitioning (named seq_hw_sw): in this approach, we use on-chip memory, since we process small data blocks (i.e., pixels).
Moreover, we have used the HW/SW partitioning algorithm proposed in [11], assuming two or three DRP processors and the average reconfiguration times introduced in the previous subsection. The obtained partitioning can be observed in Fig. 12, where we see that the reconfiguration overhead prevents us from mapping more tasks to hardware than the number of available DRP processors.

Dynamic reconfiguration (named seq_dr): in this case, we increase the size of the data blocks to process. Specifically, we process blocks of pixels, which means that we must use off-chip memory (i.e., external DRAM). This amount of data implies that the tasks' execution times are much closer to the DRP reconfiguration time. As a result, when we apply the HW/SW partitioning algorithm, all tasks are mapped to the reconfigurable hardware.

Hardware implementation (named seq_hw): this approach assumes that: 1) we use five DRP processors and 2) we use on-chip memory, since we process blocks of data of pixels. This should be considered the optimum solution in terms of both power and performance, since: 1) there is no reconfiguration overhead (i.e., we have as many DRPs as tasks) and 2) we use on-chip memory.

Fig. 13. Unsharp masking application. (a) Performance results. (b) Energy results.

In Fig. 13(a), we show the performance obtained using the four implementations. We can observe that the software implementation (i.e., the PowerPC405-based solution) obtains the worst performance. The use of the HW/SW partitioning approach contributes to a major improvement in performance, since critical tasks are mapped to the configurable hardware. Obviously, increasing the number of DRP processors helps to improve performance, since more tasks are implemented in hardware (i.e., a 29% improvement when moving from two to three DRP processors). Moreover, it is clear that the reconfiguration time does not affect this approach, since there are no reconfigurations. The dynamic reconfiguration technique improves performance even further. Dynamic reconfiguration improves on the HW/SW partitioning approach by: 1) 62.5% when using two DRP processors and 2) 47.3% when using three DRP processors. In addition, dynamic reconfiguration improves on the solution based on the embedded CPU by 83.14%. Finally, it is worth mentioning that, in the unsharp masking benchmark, the dynamic reconfiguration approach does not benefit from increasing the number of DRP processors (i.e., we obtain the same results in both situations). Since we are using the unmodified linear task graph, two DRP processors are enough to completely hide the reconfiguration overhead (i.e., one DRP processor is in reconfiguration while the other one is in execution). On the other hand, Fig. 13(b) shows the energy consumption of all four approaches. It is clear that the solution based on the embedded CPU consumes the largest amount of energy. Despite using on-chip memory and requiring the minimum amount of power (see Fig. 11), the long execution times of the tasks implemented on the PowerPC405 contribute to this large energy consumption. Obviously, the hardware-based approach is the optimum solution in terms of energy consumption, thanks to the use of on-chip memory and the short execution times, which do not incur any reconfiguration overhead.
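The energy numbers discussed next follow directly from the bookkeeping described earlier: per-state power (Fig. 11) multiplied by the per-state residence time taken from the execution log. A minimal sketch, with made-up per-state power values (the real ones come from Fig. 11 and the Galapagos trace):

```python
# Sketch of the energy bookkeeping behind Fig. 13(b): total energy is the
# sum over the execution log of (per-state power) x (time in that state).
# The per-state powers below are illustrative assumptions, not measured.

POWER_MW = {
    "idle": 50.0,              # static (leakage) power only
    "reconfiguration": 120.0,  # leakage + config prefetch + memory
    "execution": 300.0,        # full DRP static + dynamic power
}

def energy_uj(state_log):
    """state_log: list of (state, duration_ms) pairs from the execution
    trace; since mW * ms = uJ, the result is in microjoules."""
    return sum(POWER_MW[state] * dur_ms for state, dur_ms in state_log)

log = [("reconfiguration", 1.0), ("execution", 4.0), ("idle", 2.0)]
print(energy_uj(log))  # 120*1 + 300*4 + 50*2 = 1420.0 uJ
```

This formulation makes the tradeoff explicit: dynamic reconfiguration shortens the execution term but adds reconfiguration terms and raises the per-state powers when off-chip memory is involved.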
Then, as an intermediate solution, we have the results for the mixed HW/SW and dynamic reconfiguration approaches. We must first observe, in both approaches, that the energy increases when: 1) having fixed the number of DRP processors, we increase the size of the reconfigurable unit (e.g., we move from two XC2V250 to two XC2V500 devices) or 2) having fixed a given Virtex-II device, we increase the number of DRP processors (i.e., we move from two to three DRP processors). In both situations, this increase in energy is due to the increase in static (i.e., idle) leakage power that comes with the larger hardware area. From Fig. 13(b), we can observe that, independently of the number of DRP processors, the mixed HW/SW solution requires less energy than the dynamic reconfiguration approach does;7 that is, the dynamic reconfiguration approach, despite its performance advantages, requires more energy due to its high power requirements, which come from the use of off-chip memory. It is interesting to note that the dynamic reconfiguration approach has the same energy requirements for execution and reconfiguration as in the case where we use two DRP processors. As a summary of Fig. 13(b), both solutions based on configurable logic give an average 43% energy reduction when compared with the energy required by the embedded CPU implementation. This energy improvement can be as high as 60%. Moreover, HW/SW partitioning improves on the dynamic reconfiguration approach, in terms of energy consumption, by 16.4% when using two DRP processors and 35% when using three DRP processors.

VII. CONCLUSION

In this paper, we have explored the system-level power-performance tradeoffs for fine-grained reconfigurable computing. We have proposed a configuration-aware data-partitioning technique for reconfigurable architectures, and we have shown how the reconfiguration overhead directly impacts this data-partitioning process.
When targeting many streaming applications (like the image-processing applications considered here), we have shown that the choice of approach (i.e., HW/SW partitioning for statically reconfigurable architectures or context scheduling for dynamically reconfigurable architectures) depends on the application requirements (i.e., power or performance). Thus, for this type of application, if the objective is energy efficiency, then HW/SW partitioning for statically reconfigurable logic is the most favorable solution. On the other

7 In the calculation of the energy taken by the dynamic reconfiguration approach, we assume that we can completely power off the embedded CPU (i.e., we do not consider the leakage power due to the PowerPC).
hand, if the application objective is performance, then context scheduling for dynamically reconfigurable architectures is the optimum solution. Finally, future work includes the study of the same tradeoffs in a mixed environment, where HW/SW partitioning could be combined with context scheduling for dynamically reconfigurable architectures. Other topics of future research include applying the techniques proposed in this paper to other types of embedded applications and proposing a detailed implementation of the L2 memory subsystem.

REFERENCES

[1] L. Benini, A. Bogliolo, and G. De Micheli, A survey of design techniques for system-level dynamic power management, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 8, no. 3, pp , Jun
[2] K. Chatha and R. Vemuri, Hardware-software co-design for dynamically reconfigurable architectures, in Proc. FPL, 1999, pp
[3] V. George, H. Zhang, and J. Rabaey, The design of a low energy FPGA, in Proc. Int. Symp. ISLPED, 1999, pp
[4] S. Hauck, Configuration prefetch for single context reconfigurable coprocessors, in Proc. ACM Int. Symp. FPGA, 1998, pp
[5] R. Hartenstein, A decade of reconfigurable computing: A visionary retrospective, in Proc. DATE, 2001, pp
[6] B. Jeong, Hardware-software co-synthesis for run-time incrementally reconfigurable FPGAs, in Proc. ASP-DAC, 2000, pp
[7] F. Li, D. Chen, L. He, and J. Cong, Architecture evaluation for power efficient FPGAs, in Proc. ACM Int. Symp. FPGA, 2003, pp
[8] Y. Li, Hardware-software co-design of embedded reconfigurable architectures, in Proc. DAC, 2000, pp
[9] R. Maestre, A framework for reconfigurable computing: Task scheduling and context management, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 9, no. 6, pp , Dec
[10] R. Maestre, Configuration management in multi-context reconfigurable systems for simultaneous performance and power optimizations, in Proc.
ISSS, 2000, pp
[11] J. Noguera and R. M. Badia, HW/SW co-design techniques for dynamically reconfigurable architectures, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 10, no. 4, pp , Aug
[12] J. Noguera and R. M. Badia, System-level power-performance trade-offs in task scheduling for dynamically reconfigurable architectures, in Proc. CASES, 2003, pp
[13] J. Noguera and R. M. Badia, Multitasking on reconfigurable architectures: Micro-architecture support and dynamic scheduling, ACM Trans. Embedded Comput. Syst. (TECS), 2004, pp
[14] J. Noguera and R. M. Badia, Power-performance trade-offs for reconfigurable computing, in Proc. CODES+ISSS, 2004, pp
[15] K. W. Poon, A. Yan, and S. J. E. Wilton, A flexible power model for FPGAs, in Proc. 12th Int. Conf. Field-Programmable Logic Appl. (FPL), 2002, pp
[16] K. Purna and D. Bhatia, Temporal partitioning and scheduling data flow graphs for reconfigurable computers, IEEE Trans. Comput., vol. 48, no. 6, pp , Jun
[17] B. E. Saglam (Akgul) and V. Mooney, System-on-a-chip processor synchronization support in hardware, in Proc. DATE, 2001, pp
[18] M. Sánchez-Élez, A complete data scheduler for multi-context reconfigurable architectures, in Proc. DATE, 2002, pp
[19] L. Shang, A. S. Kaviani, and K. Bathala, Dynamic power consumption in Virtex-II FPGA family, in Proc. Int. Symp. FPGA, 2002, pp
[20] G. Stitt, F. Vahid, and S. Nemetebaksh, Energy savings and speedups from partitioning critical software loops to hardware in embedded systems, ACM Trans. Embedded Comput. Syst. (TECS), 2004, pp
[21] S. Trimberger, D. Carberry, A. Johnson, and J. Wong, A time-multiplexed FPGA, in Proc. 5th IEEE Symp. Field-Programmable Custom Computing Machines (FCCM), 1997, pp
[22] O. S. Unsal and I. Koren, System-level power-aware design techniques in real-time systems, Proc. IEEE, vol. 91, pp , Jul
[23] M. Vasilko and D. Ait-Boudaoud, Scheduling for dynamically reconfigurable FPGAs, in Proc. Int. Workshop Logic Arch. Synthesis (IFIP TC10 WG10.5), 1995, pp
[24] K. Weiß, C. Oetker, I. Katchan, T. Steckstor, and W. Rosenstiel, Power estimation approach for SRAM-based FPGAs, in Proc. 8th ACM Int. Symp. Field-Programmable Gate Arrays (FPGA), 2000, pp
[25] M. J. Wirthlin and B. L. Hutchings, Improving functional density through run-time circuit reconfiguration, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 6, no. 2, pp , Jun
[26] Two Flows for Partial Reconfiguration: Module Based or Small Bit Manipulations, Xilinx Corp., San Jose, CA, 2005, Xilinx Application Note XAPP290.
[27] Z. Li, K. Compton, and S. Hauck, Configuration caching management techniques for reconfigurable computing, in Proc. 8th IEEE Symp. Field-Programmable Custom Computing Machines (FCCM), 2000, pp

Juanjo Noguera received the B.Sc. degree in computer science from the Autonomous University of Barcelona, Barcelona, Spain, in 1997, and the Ph.D. degree in computer science from the Technical University of Catalonia, Barcelona, Spain, in . He has worked for the Spanish National Center for Microelectronics, the Technical University of Catalonia, and the Hewlett-Packard Inkjet Commercial Division. In January 2006, he joined the Xilinx Research Labs, Dublin, Ireland. His interests include system-level design, reconfigurable architectures, and low-power design techniques. He has published papers in international journals and conference proceedings.

Rosa M. Badia received the B.Sc. and Ph.D. degrees in computer science from the Technical University of Catalonia, Barcelona, Spain, in 1989 and 1994, respectively. She is currently an Associate Professor in the Computer Architecture Department of the Technical University of Catalonia, and a Project Manager at the Barcelona Supercomputing Center, Barcelona, Spain. Her interests include CAD tools for VLSI, reconfigurable architectures, performance prediction and analysis of message-passing applications, and GRID computing. She has published papers in international journals and conference proceedings.
More informationISSN Vol.05,Issue.09, September-2017, Pages:
WWW.IJITECH.ORG ISSN 2321-8665 Vol.05,Issue.09, September-2017, Pages:1693-1697 AJJAM PUSHPA 1, C. H. RAMA MOHAN 2 1 PG Scholar, Dept of ECE(DECS), Shirdi Sai Institute of Science and Technology, Anantapuramu,
More informationEnergy Aware Optimized Resource Allocation Using Buffer Based Data Flow In MPSOC Architecture
ISSN (Online) : 2319-8753 ISSN (Print) : 2347-6710 International Journal of Innovative Research in Science, Engineering and Technology Volume 3, Special Issue 3, March 2014 2014 International Conference
More informationIMPLEMENTATION OF DISTRIBUTED CANNY EDGE DETECTOR ON FPGA
IMPLEMENTATION OF DISTRIBUTED CANNY EDGE DETECTOR ON FPGA T. Rupalatha 1, Mr.C.Leelamohan 2, Mrs.M.Sreelakshmi 3 P.G. Student, Department of ECE, C R Engineering College, Tirupati, India 1 Associate Professor,
More informationImplementation of A Optimized Systolic Array Architecture for FSBMA using FPGA for Real-time Applications
46 IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.3, March 2008 Implementation of A Optimized Systolic Array Architecture for FSBMA using FPGA for Real-time Applications
More informationPilot: A Platform-based HW/SW Synthesis System
Pilot: A Platform-based HW/SW Synthesis System SOC Group, VLSI CAD Lab, UCLA Led by Jason Cong Zhong Chen, Yiping Fan, Xun Yang, Zhiru Zhang ICSOC Workshop, Beijing August 20, 2002 Outline Overview The
More informationthe main limitations of the work is that wiring increases with 1. INTRODUCTION
Design of Low Power Speculative Han-Carlson Adder S.Sangeetha II ME - VLSI Design, Akshaya College of Engineering and Technology, Coimbatore sangeethasoctober@gmail.com S.Kamatchi Assistant Professor,
More informationImplementation of Asynchronous Topology using SAPTL
Implementation of Asynchronous Topology using SAPTL NARESH NAGULA *, S. V. DEVIKA **, SK. KHAMURUDDEEN *** *(senior software Engineer & Technical Lead, Xilinx India) ** (Associate Professor, Department
More informationDesign Partitioning Methodology for Systems on Programmable Chip
Design Partitioning Methodology for Systems on Programmable Chip Abdo Azibi and Ramzi Ayadi Department of Electronics College of Technology at Alkharj, Saudi Arabia Email: aazibi, amzi.ayadi@tvtc.gov.sa
More informationLecture 41: Introduction to Reconfigurable Computing
inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture 41: Introduction to Reconfigurable Computing Michael Le, Sp07 Head TA April 30, 2007 Slides Courtesy of Hayden So, Sp06 CS61c Head TA Following
More informationFPGA Implementation of Multiplierless 2D DWT Architecture for Image Compression
FPGA Implementation of Multiplierless 2D DWT Architecture for Image Compression Divakara.S.S, Research Scholar, J.S.S. Research Foundation, Mysore Cyril Prasanna Raj P Dean(R&D), MSEC, Bangalore Thejas
More informationReconfigurable Computing. Introduction
Reconfigurable Computing Tony Givargis and Nikil Dutt Introduction! Reconfigurable computing, a new paradigm for system design Post fabrication software personalization for hardware computation Traditionally
More informationTowards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing
Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing Walter Stechele, Stephan Herrmann, Andreas Herkersdorf Technische Universität München 80290 München Germany Walter.Stechele@ei.tum.de
More informationMicroelectronics. Moore s Law. Initially, only a few gates or memory cells could be reliably manufactured and packaged together.
Microelectronics Initially, only a few gates or memory cells could be reliably manufactured and packaged together. These early integrated circuits are referred to as small-scale integration (SSI). As time
More informationA hardware operating system kernel for multi-processor systems
A hardware operating system kernel for multi-processor systems Sanggyu Park a), Do-sun Hong, and Soo-Ik Chae School of EECS, Seoul National University, Building 104 1, Seoul National University, Gwanakgu,
More informationMemory Systems IRAM. Principle of IRAM
Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several
More informationInterfacing a High Speed Crypto Accelerator to an Embedded CPU
Interfacing a High Speed Crypto Accelerator to an Embedded CPU Alireza Hodjat ahodjat @ee.ucla.edu Electrical Engineering Department University of California, Los Angeles Ingrid Verbauwhede ingrid @ee.ucla.edu
More informationVerification of Multiprocessor system using Hardware/Software Co-simulation
Vol. 2, 85 Verification of Multiprocessor system using Hardware/Software Co-simulation Hassan M Raza and Rajendra M Patrikar Abstract--Co-simulation for verification has recently been introduced as an
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationDesign of a System-on-Chip Switched Network and its Design Support Λ
Design of a System-on-Chip Switched Network and its Design Support Λ Daniel Wiklund y, Dake Liu Dept. of Electrical Engineering Linköping University S-581 83 Linköping, Sweden Abstract As the degree of
More informationEmbedded Real-Time Video Processing System on FPGA
Embedded Real-Time Video Processing System on FPGA Yahia Said 1, Taoufik Saidani 1, Fethi Smach 2, Mohamed Atri 1, and Hichem Snoussi 3 1 Laboratory of Electronics and Microelectronics (EμE), Faculty of
More informationDesign Issues in Hardware/Software Co-Design
Volume-2, Issue-1, January-February, 2014, pp. 01-05, IASTER 2013 www.iaster.com, Online: 2347-6109, Print: 2348-0017 ABSTRACT Design Issues in Hardware/Software Co-Design R. Ganesh Sr. Asst. Professor,
More informationLecture 7: Introduction to Co-synthesis Algorithms
Design & Co-design of Embedded Systems Lecture 7: Introduction to Co-synthesis Algorithms Sharif University of Technology Computer Engineering Dept. Winter-Spring 2008 Mehdi Modarressi Topics for today
More informationPerformance Improvements of Microprocessor Platforms with a Coarse-Grained Reconfigurable Data-Path
Performance Improvements of Microprocessor Platforms with a Coarse-Grained Reconfigurable Data-Path MICHALIS D. GALANIS 1, GREGORY DIMITROULAKOS 2, COSTAS E. GOUTIS 3 VLSI Design Laboratory, Electrical
More informationWITH the development of the semiconductor technology,
Dual-Link Hierarchical Cluster-Based Interconnect Architecture for 3D Network on Chip Guang Sun, Yong Li, Yuanyuan Zhang, Shijun Lin, Li Su, Depeng Jin and Lieguang zeng Abstract Network on Chip (NoC)
More informationEmbedded Systems. 8. Hardware Components. Lothar Thiele. Computer Engineering and Networks Laboratory
Embedded Systems 8. Hardware Components Lothar Thiele Computer Engineering and Networks Laboratory Do you Remember? 8 2 8 3 High Level Physical View 8 4 High Level Physical View 8 5 Implementation Alternatives
More informationPower Estimation of System-Level Buses for Microprocessor-Based Architectures: A Case Study
Power Estimation of System-Level Buses for Microprocessor-Based Architectures: A Case Study William Fornaciari Politecnico di Milano, DEI Milano (Italy) fornacia@elet.polimi.it Donatella Sciuto Politecnico
More informationImplementing Photoshop Filters in Virtex
Implementing Photoshop Filters in Virtex S. Ludwig, R. Slous and S. Singh Springer-Verlag Berlin Heildelberg 1999. This paper was first published in Field-Programmable Logic and Applications, Proceedings
More informationIntegrating MRPSOC with multigrain parallelism for improvement of performance
Integrating MRPSOC with multigrain parallelism for improvement of performance 1 Swathi S T, 2 Kavitha V 1 PG Student [VLSI], Dept. of ECE, CMRIT, Bangalore, Karnataka, India 2 Ph.D Scholar, Jain University,
More informationBus Encoding Technique for hierarchical memory system Anne Pratoomtong and Weiping Liao
Bus Encoding Technique for hierarchical memory system Anne Pratoomtong and Weiping Liao Abstract In microprocessor-based systems, data and address buses are the core of the interface between a microprocessor
More informationTestability Design for Sleep Convention Logic
Advances in Computational Sciences and Technology ISSN 0973-6107 Volume 11, Number 7 (2018) pp. 561-566 Research India Publications http://www.ripublication.com Testability Design for Sleep Convention
More informationFPGA: What? Why? Marco D. Santambrogio
FPGA: What? Why? Marco D. Santambrogio marco.santambrogio@polimi.it 2 Reconfigurable Hardware Reconfigurable computing is intended to fill the gap between hardware and software, achieving potentially much
More informationMulti processor systems with configurable hardware acceleration
Multi processor systems with configurable hardware acceleration Ph.D in Electronics, Computer Science and Telecommunications Ph.D Student: Davide Rossi Ph.D Tutor: Prof. Roberto Guerrieri Outline Motivations
More informationFPGA BASED ADAPTIVE RESOURCE EFFICIENT ERROR CONTROL METHODOLOGY FOR NETWORK ON CHIP
FPGA BASED ADAPTIVE RESOURCE EFFICIENT ERROR CONTROL METHODOLOGY FOR NETWORK ON CHIP 1 M.DEIVAKANI, 2 D.SHANTHI 1 Associate Professor, Department of Electronics and Communication Engineering PSNA College
More informationA Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid Cache in 3D chip Multi-processors
, July 4-6, 2018, London, U.K. A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid in 3D chip Multi-processors Lei Wang, Fen Ge, Hao Lu, Ning Wu, Ying Zhang, and Fang Zhou Abstract As
More informationA 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation
A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation Abstract: The power budget is expected to limit the portion of the chip that we can power ON at the upcoming technology nodes. This problem,
More informationA Novel Deadlock Avoidance Algorithm and Its Hardware Implementation
A ovel Deadlock Avoidance Algorithm and Its Hardware Implementation + Jaehwan Lee and *Vincent* J. Mooney III Hardware/Software RTOS Group Center for Research on Embedded Systems and Technology (CREST)
More informationOn GPU Bus Power Reduction with 3D IC Technologies
On GPU Bus Power Reduction with 3D Technologies Young-Joon Lee and Sung Kyu Lim School of ECE, Georgia Institute of Technology, Atlanta, Georgia, USA yjlee@gatech.edu, limsk@ece.gatech.edu Abstract The
More informationFPGA IMPLEMENTATION FOR REAL TIME SOBEL EDGE DETECTOR BLOCK USING 3-LINE BUFFERS
FPGA IMPLEMENTATION FOR REAL TIME SOBEL EDGE DETECTOR BLOCK USING 3-LINE BUFFERS 1 RONNIE O. SERFA JUAN, 2 CHAN SU PARK, 3 HI SEOK KIM, 4 HYEONG WOO CHA 1,2,3,4 CheongJu University E-maul: 1 engr_serfs@yahoo.com,
More informationFast FPGA Routing Approach Using Stochestic Architecture
. Fast FPGA Routing Approach Using Stochestic Architecture MITESH GURJAR 1, NAYAN PATEL 2 1 M.E. Student, VLSI and Embedded System Design, GTU PG School, Ahmedabad, Gujarat, India. 2 Professor, Sabar Institute
More informationAn Integration of Imprecise Computation Model and Real-Time Voltage and Frequency Scaling
An Integration of Imprecise Computation Model and Real-Time Voltage and Frequency Scaling Keigo Mizotani, Yusuke Hatori, Yusuke Kumura, Masayoshi Takasu, Hiroyuki Chishiro, and Nobuyuki Yamasaki Graduate
More informationOptimized architectures of CABAC codec for IA-32-, DSP- and FPGAbased
Optimized architectures of CABAC codec for IA-32-, DSP- and FPGAbased platforms Damian Karwowski, Marek Domański Poznan University of Technology, Chair of Multimedia Telecommunications and Microelectronics
More informationThree DIMENSIONAL-CHIPS
IOSR Journal of Electronics and Communication Engineering (IOSR-JECE) ISSN: 2278-2834, ISBN: 2278-8735. Volume 3, Issue 4 (Sep-Oct. 2012), PP 22-27 Three DIMENSIONAL-CHIPS 1 Kumar.Keshamoni, 2 Mr. M. Harikrishna
More informationArchitectural Differences nc. DRAM devices are accessed with a multiplexed address scheme. Each unit of data is accessed by first selecting its row ad
nc. Application Note AN1801 Rev. 0.2, 11/2003 Performance Differences between MPC8240 and the Tsi106 Host Bridge Top Changwatchai Roy Jenevein risc10@email.sps.mot.com CPD Applications This paper discusses
More information