730 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 14, NO. 7, JULY 2006

System-Level Power-Performance Tradeoffs for Reconfigurable Computing

Juanjo Noguera and Rosa M. Badia

Abstract: In this paper, we propose a configuration-aware data-partitioning approach for reconfigurable computing. We show how the reconfiguration overhead impacts the data-partitioning process. Moreover, we explore the system-level power-performance tradeoffs available when implementing streaming embedded applications on fine-grained reconfigurable architectures. For a certain group of streaming applications, we show that an efficient hardware/software partitioning algorithm is required when targeting low power. However, if the application objective is performance, then we propose the use of dynamically reconfigurable architectures. We propose a design methodology that adapts the architecture and algorithms to the application requirements. The methodology has been proven to work on a real research platform based on Xilinx devices. Finally, we have applied our methodology and algorithms to the case study of image sharpening, which is required nowadays in digital cameras and mobile phones.

Index Terms: Hardware/software (HW/SW) codesign, power-performance tradeoffs, reconfigurable computing (RC).

I. INTRODUCTION AND MOTIVATION

Reconfigurable computing (RC) [5] is an interesting alternative to application-specific integrated circuits (ASICs) and general-purpose processors for implementing embedded systems, since it provides the flexibility of software processors together with the efficiency and throughput of hardware coprocessors. Programmable systems-on-chip have become a reality, combining a wide range of complex functions on a single die. An example is the Virtex-II Pro from Xilinx, which integrates a core processor (PowerPC405), embedded memory, and configurable logic.
Additionally, the importance of having on-chip programmable logic regions in system-on-chip (SoC) platforms is becoming increasingly evident. Partitioning an application among software and programmable-logic hardware can substantially improve performance; it can also reduce power consumption by performing computations more efficiently and by allowing longer microprocessor shutdown periods. Dynamic reconfiguration [25] has emerged as a particularly attractive technique to increase the effective use of programmable logic blocks, since it allows changing the device configuration on the fly during application execution. However, this attractive idea of time-multiplexing the needed device configurations does not come for free: the reconfiguration overhead has to be minimized in order to improve application performance. Temporal partitioning [16] and context scheduling [9] can be used to minimize this penalty. In summary, the system-level approaches to reconfigurable computing can be divided into two broad categories: 1) hardware/software (HW/SW) partitioning for statically reconfigurable architectures and 2) temporal partitioning and context scheduling for dynamically reconfigurable architectures.

Manuscript received July 2, 2005; revised January 9. This work was supported by the CICYT under Project TIN CO2-01 and by DURSI under Project 2001SGR. J. Noguera was with the Computer Architecture Department, Technical University of Catalonia, Barcelona, Spain. He is now with Xilinx Research Laboratories, Saggart, Co. Dublin, Ireland (juanjo.noguera@xilinx.com; jnoguera@ac.upc.edu). R. M. Badia is with the Computer Architecture Department, Technical University of Catalonia, Barcelona, Spain (rosab@ac.upc.edu). Digital Object Identifier /TVLSI
On the other hand, energy-efficient computation is a major challenge in embedded systems design, especially when portable, battery-powered systems (e.g., mobile phones or digital cameras) are considered [1]. It is well known that the memory hierarchy is one of the major contributors to the system-level power budget [1], [22]. Thus, the way we partition data between on-chip and off-chip memory impacts the overall system-level power consumption. In this paper, we investigate the power-performance tradeoffs of these two system-level approaches to RC. We show that, when targeting streaming applications, the choice of approach (i.e., HW/SW partitioning or context scheduling) depends on the application requirements (i.e., power or performance). Moreover, we propose that, in the context-scheduling approach, the reconfigurable architecture should process large blocks of data, which should be stored in external memory resources. Executing large blocks of data minimizes the reconfiguration overhead, but it also increases the power consumption due to the use of external memory. On the other hand, the HW/SW partitioning-based approach should process small blocks of data that can be stored in on-chip memory, which reduces the overall system power consumption. The paper is organized as follows. Section II reviews the related work. In Section III, we introduce our target architecture. The proposed design methodology for embedded systems is presented in Section IV. Section V introduces the concept of configuration-aware data partitioning. In Section VI, we explain the benchmarks, the experimental setup, and the obtained results. Finally, the conclusions of this paper are presented in Section VII.

Note: in this paper, context scheduling refers to the scheduling of: 1) task executions and 2) the reconfiguration processes of the configurable blocks in partially (not multicontext) reconfigurable devices.
II. PREVIOUS WORK

HW/SW partitioning for reconfigurable computing has been addressed in several research efforts [2], [6], [8]. An integrated algorithm for HW/SW partitioning and scheduling, temporal partitioning, and context scheduling is presented in [2]. On the other hand, context scheduling has also been widely addressed in many publications [9], [16], [21], [23]. However, none of these papers addresses power-performance tradeoffs. A review of design techniques for system-level dynamic power management can be found in [1]. In addition, a survey of power-aware design techniques for real-time systems is given in [22]. However, none of these papers considers the use of reconfigurable architectures. Power consumption of field-programmable gate-array (FPGA) devices has been addressed in several research efforts [3], [19], [24]. In addition, several power estimation models have been proposed [7], [15]. However, all of these approaches study the power requirements at the device level and not at the system level. Only a few recent research efforts have addressed low-power task scheduling for dynamically reconfigurable devices. The technique proposed in [10] tries to minimize power consumption during reconfiguration by minimizing the number of bit changes between reconfiguration contexts. However, neither power-performance tradeoffs nor power measurements are presented. More recently, in [12], it was shown that configuration prefetching and frequency scaling can reduce energy consumption without affecting performance. However, that work does not cover the benefits of HW/SW partitioning. Additional techniques are given in [18] and [20]. A technique for application partitioning between configurable logic and an embedded processor is given in [20]. That work shows that such partitioning helps to improve both performance and energy.
However, it considers only statically configurable logic and does not consider dynamically reconfigurable architectures. A different approach, for coarse-grained RC, is presented in [18], where a data-scheduler algorithm is proposed to reduce the overall system energy. However, that work does not consider the benefits of HW/SW partitioning.

A. Contributions of This Work

This paper explores the system-level power-performance tradeoffs for fine-grained reconfigurable computing. More specifically, the paper compares, in terms of energy savings and performance improvements, the two key approaches existing in reconfigurable computing: 1) partitioning an application between software and configurable hardware and 2) context scheduling for dynamically reconfigurable architectures. To the best of our knowledge, this open issue has not been addressed in previous research efforts. In addition, the study presented in this paper focuses on a data-size-based partitioning approach for streaming applications. This differs from the majority of the traditional HW/SW partitioning and context scheduling approaches in the literature, which focus on task-graph dependency analysis.

Fig. 1. Dynamically reconfigurable CMP architecture.

Fig. 2. (a) Dynamically reconfigurable processor. (b) Architecture of the L2 on-chip memory subsystem.

III. TARGET ARCHITECTURE

The target architecture is a heterogeneous architecture, which includes an embedded processor, a given number of dynamically reconfigurable processors (DRPs), an on-chip L2 multibank memory subsystem, and external DRAM memory resources. An example of this architecture is shown in Fig. 1, where we can see a four-DRP-based architecture. This architecture follows the chip multiprocessor (CMP) paradigm. The data that must be transferred between tasks executed in the DRP processors are stored in the on-chip L2 memory subsystem. Each DRP processor can be independently reconfigured.
The proposed target architecture supports multiple reconfigurations running concurrently, which is not the case for most of the architectures proposed in the literature. Each DRP processor has a local L1 memory buffer, and a hardware-based data prefetching mechanism is proposed to hide the memory latency. Each DRP has a point-to-point link to the L2 buffers (omitted from Fig. 1 for simplicity). This link is shown in Fig. 2(a), which depicts the internal architecture of a DRP processor. There are three main components in this architecture: 1) the load unit; 2) the store unit; and 3) the dynamically reconfigurable logic. The DRPs are single-context devices. It can be observed in Fig. 2(a) that the load and store units have internal L1 data buffers; each unit has two internal buffers. This approach allows three processes to run concurrently: 1) the load unit receives data for the next computation; 2) the reconfigurable logic processes data from a buffer in the load unit and stores the processed data in a buffer of the store unit; and 3) the store
unit sends the previously processed data to the L2 memory subsystem. The on-chip L2 memory subsystem is based on a multibank approach [see Fig. 2(b)]. Each of these banks is logically divided into two independent subbanks (i.e., this enables reading from one subbank while concurrently writing to the other subbank of the same physical bank). These buffers interact on one side with the data prefetch units [see the left-hand side of Fig. 2(b)] through a crossbar and on the other side with an on-chip bus connected to the external DRAM memory controller. In this L2 memory subsystem, there must be as many data prefetch units as DRP processors. The proposed architecture also includes, for each DRP, a dedicated hardware-based configuration prefetch unit (not shown in the figures for simplicity). Thus, the architecture supports the transfer of data in one DRP overlapped with the reconfiguration of a different DRP. Each DRP processor has its own clock signal, which makes this a kind of globally asynchronous, locally synchronous (GALS) architecture. The architecture supports the use of clock-gating and frequency-scaling techniques for power-consumption minimization independently for each DRP.

Fig. 3. Design methodology for embedded systems.

IV. DESIGN METHODOLOGY FOR EMBEDDED SYSTEMS

The proposed design methodology is depicted in Fig. 3. It is divided into three steps: 1) application phase; 2) static phase; and 3) dynamic phase.

A. Application Phase

The proposed methodology assumes that the input application is specified as a task graph, where nodes represent tasks (i.e., coarse-grained computations) and edges represent data dependencies. Each edge has a weight that represents the amount of data that must be transferred between tasks.
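As a concrete illustration, the task-graph input described above can be captured in a small data structure. The following Python sketch is ours, not the authors' specification; the class, field, and task names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class TaskGraph:
    # task name -> task type (several tasks may share one type/configuration)
    tasks: dict = field(default_factory=dict)
    # (src, dst) edge -> weight: amount of data transferred between the tasks
    edges: dict = field(default_factory=dict)

    def add_task(self, name, task_type):
        self.tasks[name] = task_type

    def add_edge(self, src, dst, nbytes):
        self.edges[(src, dst)] = nbytes

    def predecessors(self, name):
        return [s for (s, d) in self.edges if d == name]

# A hypothetical two-task fragment: color-space conversion feeding a
# 3 x 3 convolution, with a 16-kB block passed between them.
g = TaskGraph()
g.add_task("T1", "rgb2ycrcb")
g.add_task("T2", "conv3x3")
g.add_edge("T1", "T2", 16384)
```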
Finally, each task has an associated task type (i.e., in the task-graph specification, several tasks may implement the same type of computation).

B. Static Phase

In this phase, there are four main processes: 1) task-level graph transformations; 2) HW/SW synthesis; 3) HW/SW partitioning; and 4) priority task assignment. We can apply task-level graph transformation techniques in order to increase the architecture performance. These transformations include task pipelining, task blocking, and task (configuration) replication. The output of this step is the modified task graph. HW/SW synthesis is the process of implementing the tasks found in the application. The output of this process is a set of estimators; typical estimators are HW execution time, SW execution time, HW area, and reconfiguration time. These estimators can be obtained using accurate implementation tools (i.e., compiler, logic synthesis, and place&route tools) or using high-level estimation tools. The HW/SW partitioning process decides which tasks are mapped to hardware or software depending on: 1) the architecture parameters (i.e., the number of DRP processors or the external DRAM size); 2) the modified task graph; and 3) the tasks' estimators. Note that the application requirements do not directly affect the HW/SW partitioning process; they affect it indirectly through the modified task graph. The partitioning algorithm must take the configuration prefetch technique into account in its implementation. Finally, in the static phase, we also find the priority task assignment process. In this process, we statically assign an execution priority to each task. This information is used at run-time to decide the execution order of the tasks. An example of a priority function is critical-path analysis.

C. Dynamic Phase

This phase is responsible for the scheduling of the tasks and also for the scheduling of the DRPs' reconfigurations.
The task scheduler and the DRP context scheduler cooperate and run in parallel during application run-time execution. Their functionality is based on a look-ahead strategy into the list of tasks ready for execution (i.e., tasks whose predecessors have finished their execution). At run-time, the task scheduler assigns tasks to DRPs and decides the execution order of the tasks in the ready-for-execution list. The DRP context (configuration) scheduler is used to minimize the reconfiguration overhead. Its objective is to decide: 1) which DRP processor must be reconfigured and 2) which reconfiguration context, or hardware task from the list of tasks ready for reconfiguration (i.e., tasks whose predecessors have initiated their execution), must be loaded into the DRP processor. This scheduler tries to minimize the reconfiguration overhead by overlapping the execution of tasks with DRP reconfigurations. These algorithms are implemented in hardware using the dynamic scheduling unit (DSU) found in our architecture (see Fig. 1) [13]. Several research efforts in the field of SoC design propose moving into hardware functionality that has traditionally been assigned to operating systems [17].

V. CONFIGURATION-AWARE DATA PARTITIONING

Fig. 4. (a) Task graph for this example. (b) Sequential scheduling. (c) Scheduling with configuration prefetching. (d) Scheduling with data partitioning and configuration prefetching.

Here, we explain how, depending on the application requirements (e.g., power or performance), the reconfiguration overhead impacts the data-partitioning process. Moreover, we show that the proposed data-partitioning technique strongly influences the HW/SW partitioning results. Finally, we explain the power-performance design tradeoffs involved in the data-partitioning technique.

A. Introduction and Motivation

In our approach, we want to execute an application modeled as a task graph on a hybrid reconfigurable architecture with a given number of DRP processors, each characterized by its reconfiguration time. Moreover, the application must process an input data set of a given fixed size. In many streaming embedded applications, we can assume that the execution time of the application is proportional to the size of the data to be processed; in other words, the execution time of each task in the application is proportional to the amount of data that the task has to process. The data-partitioning process that we propose assumes that the execution time of the tasks is longer than the DRP reconfiguration time. Obviously, there are several alternatives when scheduling an application on a dynamically reconfigurable architecture. In Fig. 4, we can observe three possible solutions for an application with five tasks and a three-DRP-based architecture. The task graph used in this example is shown in Fig. 4(a), where we can also observe the execution time of each task. In the following paragraphs, we explain these three possible solutions.

1) Sequential Scheduling: This is the simplest solution, where task executions and DRP reconfigurations are sequentially scheduled in the DRPs [see Fig. 4(b)].
We can observe that the execution time of the tasks is longer than the DRP reconfigurations (shown as a shaded R in the figure). We should also notice the performance penalty due to the reconfiguration overhead.

2) Scheduling With Configuration Prefetching: Configuration caching [27] and configuration prefetching [4] are well-known mechanisms in reconfigurable computing to hide the reconfiguration overhead. Configuration prefetching is based on the idea of loading the required configuration on a DRP before it is actually required, thus overlapping execution in one DRP with reconfiguration in a different DRP. In our approach, the configuration prefetching of a task can start when all its predecessor tasks have started their execution. For instance, the configuration prefetching of task T2 can start after task T1 has begun its execution. On the other hand, the execution of a task may start only when all of its predecessors have finished their execution (task T2 can start when task T1 has finished). As we can observe in Fig. 4(c), this technique completely hides the reconfiguration overhead for all DRP processors, thus improving the application performance. This approach is based on the idea that the task graph is executed only once (i.e., each task processes the whole input data set). Its benefit is that it requires the minimum number of DRP reconfigurations (e.g., five reconfigurations in this example). However, this approach has two main drawbacks. First, the shared memory buffers used for task communication are large (they must be able to store the maximum data size required by all the tasks). Second, the DRP processors wait for their incoming data for a significant amount of time (i.e., they have finished the reconfiguration but cannot start execution because the input streams are not yet in the shared memory buffers).
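The two conditions above (a task's configuration may be prefetched once all its predecessors have started; the task may execute once its reconfiguration is done and all its predecessors have finished) can be sketched as follows. This is our illustrative Python rendering, not the authors' scheduler:

```python
def schedule_times(preds, start, finish, t_rec):
    """Earliest prefetch-start and execution-start times per task.

    preds:  task -> list of predecessor tasks
    start:  predecessor task -> time its execution started
    finish: predecessor task -> time its execution finished
    t_rec:  DRP reconfiguration time
    """
    prefetch, execute = {}, {}
    for t, ps in preds.items():
        # prefetch may begin when every predecessor has started
        prefetch[t] = max((start[p] for p in ps), default=0)
        # execution needs the reconfiguration done and every predecessor finished
        execute[t] = max(prefetch[t] + t_rec,
                         max((finish[p] for p in ps), default=0))
    return prefetch, execute

# T1 runs from t=0 to t=10; reconfiguration takes 4 time units.
# T2's context, prefetched at t=0, is ready at t=4 < 10, so the
# overhead is completely hidden and T2 starts at t=10.
pf, ex = schedule_times({"T2": ["T1"]}, {"T1": 0}, {"T1": 10}, t_rec=4)
```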
3) Scheduling With Configuration Prefetching and Data Partitioning: This approach tries to overcome the limitations of the previous one. It also uses configuration prefetching, but the input data set is not processed all at once; instead, it is partitioned into several data blocks of a given size. This also means that the task graph must be iterated as many times as there are input data blocks. In the example shown in Fig. 4(d), the input data set has been partitioned into two data blocks (named 0 and 1), and the task graph is iterated twice. This technique reduces the size of the shared memory buffers required for task communication. Moreover, the latency from DRP reconfiguration to DRP execution is also reduced. However, this approach has the drawback of increasing the number of reconfigurations, because the task graph must be iterated several times. For example, in Fig. 4(d), we now have nine reconfigurations compared with the five required in Fig. 4(c). In addition, this technique also impacts performance, since we cannot use configuration prefetching between two iterations of the task graph.
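The cost of this third alternative can be quantified with a simple count. The following Python sketch is ours; it gives only an upper bound on reconfigurations, since reuse of a configuration across iterations (which yields nine rather than ten reconfigurations in the Fig. 4(d) example) is not modeled:

```python
from math import ceil

def partition_cost(data_size, block_size, n_hw_tasks):
    """Iterations of the task graph and an upper bound on the number of
    DRP reconfigurations when the input data set is split into blocks."""
    iterations = ceil(data_size / block_size)
    reconfigs = n_hw_tasks * iterations  # configuration reuse may lower this
    return iterations, reconfigs

# One block (the whole data set): minimum number of reconfigurations.
partition_cost(100, 100, 5)   # -> (1, 5)
# Two blocks: the graph iterates twice; up to 10 reconfigurations.
partition_cost(100, 50, 5)    # -> (2, 10)
```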
Fig. 5. Example of how increasing the amount of processed data can help to minimize the reconfiguration overhead.

B. Model for the Reconfiguration Overhead

It has been demonstrated that the parameters of the reconfigurable architecture (i.e., the number of DRP processors or the reconfiguration time) have a direct impact on the performance obtained by the HW/SW partitioning process [11]. The partitioning process must take into account the reconfiguration time and the configuration prefetching technique for reconfiguration latency minimization. This is summarized in the following expression, which shows how the execution time of a task mapped to hardware must be modified to account for the reconfiguration overhead (see also Fig. 5):

    $T_{hw}(\tau_i) = t_{exec}(\tau_i) + P_R \cdot \max(0,\; T_R - \bar{t}_{exec})$   (1)

where $t_{exec}(\tau_i)$ is the execution time of task $\tau_i$ without any reconfiguration overhead; $P_R$ is the probability of reconfiguration, which is a function of the number of tasks mapped to hardware and the number of DRP processors; $T_R$ is the reconfiguration time needed for a DRP processor to change its context (configuration); and $\bar{t}_{exec}$ is the average execution time of all tasks, which configuration prefetching overlaps with the reconfiguration. In this paper, the execution time of a task includes the time required to: 1) read the data from memory; 2) process the data; and 3) write the processed data back to memory.

On the other hand, in the design of embedded systems, we would like to minimize the number of accesses to external memory in order to reduce the overall system-level power consumption. Thus, data transfers between tasks should be kept to a size that fits into the on-chip L2 memory. In many streaming embedded applications, we can assume that the execution time of a given task implemented in hardware or software is proportional to the size of the data to be processed. Thus, if the data are stored in an on-chip memory of smaller capacity, the average execution time of the tasks will be small compared with the reconfiguration time (we assume reconfiguration times on the order of 800 µs to 1.4 ms). In this case, applying expression (1), we will have a significant reconfiguration overhead (because $\bar{t}_{exec} < T_R$), which may prevent moving a task from software to hardware. In order to overcome this limitation and reduce the reconfiguration overhead, we can increase the amount of data processed by each task. Increasing the amount of data means that we are forced to use external memory. With this approach, we increase performance (because more tasks can be mapped to hardware), but we also increase the overall system-level power consumption. An example of this concept [see (1)] can be observed in Fig. 5, where we consider the execution of two tasks. In Fig. 5(b), we can see that, even when using the configuration prefetching technique, we cannot completely hide the reconfiguration overhead for task T2, since task T1 has a shorter execution time because it processes ten data units [see Fig. 5(a)]. As previously introduced, this can be improved by increasing the amount of processed data. In this example, we have increased the amount of processed data to twenty data units [see Fig. 5(c)], which increases the execution time of task T1 until it equals the reconfiguration time of task T2, hence completely hiding the reconfiguration overhead [see Fig. 5(d)].

C. Data Partitioning for Reconfigurable Architectures

How the input data set is partitioned mainly determines which approach is used: 1) HW/SW partitioning for statically reconfigurable architectures using on-chip memory or 2) context scheduling for dynamically reconfigurable architectures using off-chip memory.
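The overhead model of expression (1) maps directly to code. The following Python sketch is our rendering (the names t_exec, p_reconf, t_reconf, and t_avg are ours, not the authors'): prefetching hides, on average, the average task execution time of each reconfiguration, and the remainder appears as overhead weighted by the reconfiguration probability:

```python
def hw_exec_time(t_exec, p_reconf, t_reconf, t_avg):
    """Effective execution time of a task mapped to hardware, per (1).

    t_exec:   task execution time without reconfiguration overhead
    p_reconf: probability that running the task requires a reconfiguration
    t_reconf: DRP reconfiguration time
    t_avg:    average task execution time (hidden by prefetching)
    """
    exposed = max(0.0, t_reconf - t_avg)  # overhead prefetching cannot hide
    return t_exec + p_reconf * exposed

# Small on-chip blocks: t_avg (600 us) < t_reconf (1 ms) -> visible overhead,
# 500 us + 0.5 * 400 us = 700 us.
hw_exec_time(500e-6, 0.5, 1.0e-3, 600e-6)
# Larger blocks: t_avg >= t_reconf -> the overhead is fully hidden.
hw_exec_time(2.0e-3, 0.5, 1.0e-3, 2.0e-3)
```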
Thus, the input streaming data set must be partitioned into several blocks, and the size of these blocks is mainly driven by the objective of the application (i.e., power or performance). Consequently, if the application objective is performance, then we should process large blocks of data, because we want to minimize the reconfiguration overhead. Moreover, if we are processing large blocks of data, it is more likely that these blocks do not fit in the on-chip L2 memory subsystem, and we are forced to use off-chip memory, thus increasing the overall system-level power consumption. On the other hand, if the application objective is low power, we must process small blocks of data so that they can be stored in the on-chip memory, thus minimizing the system-level power consumption. The drawback of this solution is that, with small blocks of data, the reconfiguration overhead becomes more significant, which may prevent mapping more tasks onto the run-time reconfigurable hardware. In summary, processing small blocks of data, stored in on-chip memory, reduces the power consumption but also reduces the application performance. In addition, the type of on-chip/off-chip data partitioning determines the number of iterations of the task graph. This is shown in Fig. 6, where we can observe an example of an image-processing application with three different image sizes.
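The block-size decision just described can be sketched as a simple policy. This is our illustrative Python, not the authors' algorithm; the on-chip capacity and processing rate used in the example are assumed figures, not measurements from the paper:

```python
def choose_block_size(objective, onchip_bytes, t_reconf, bytes_per_sec):
    """Pick a data-block size for the streaming input.

    objective:     "performance" or "power"
    onchip_bytes:  on-chip L2 capacity available for data blocks
    t_reconf:      DRP reconfiguration time (seconds)
    bytes_per_sec: task processing rate (execution time is assumed
                   proportional to block size)
    """
    # Smallest block whose execution time equals the reconfiguration time,
    # i.e., just large enough to hide the reconfiguration overhead.
    hides_reconf = round(t_reconf * bytes_per_sec)
    if objective == "performance":
        # Large blocks, even if they spill to off-chip memory.
        return hides_reconf
    # Low power: never exceed on-chip capacity, accepting extra overhead.
    return min(hides_reconf, onchip_bytes)

# A 1-ms reconfiguration at an assumed 200 MB/s needs 200-kB blocks; with
# only 64 kB of on-chip L2, the low-power policy stays on chip instead.
choose_block_size("performance", 64_000, 1e-3, 200e6)  # 200000
choose_block_size("power", 64_000, 1e-3, 200e6)        # 64000
```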
Fig. 6. (a) Initial task graph. (b) Data partitioning for dynamically reconfigurable architectures. (c) Data partitioning for HW/SW partitioning.

Fig. 7. Image sharpening benchmarks. (a) Unsharp masking. (b) Sobel filter. (c) Laplacian filter.

In Fig. 6(b), we can observe that the blocks of data are large. The amount of data to be processed must be such that the task execution time at least equals the reconfiguration time. In this situation, the number of task-graph iterations is small; for example, for one of the image sizes considered, we must iterate the task graph four times, given the block size used. In the opposite case, we process small blocks of data [e.g., as shown in Fig. 6(c)] but have a large number of iterations of the task graph; an input image may then require 16 iterations. This example assumes that the data are partitioned into square blocks, but the input data set (e.g., an image) could also have been partitioned into blocks of several rows or columns. We would like to clarify at this point that the techniques proposed in this paper may not apply to all kinds of streaming applications; there are other types of applications where this idea of block-based data partitioning and processing is not possible (e.g., video coding).

VI. EXPERIMENTS AND RESULTS

A. Image Sharpening Benchmarks

The proposed dynamically reconfigurable architecture addresses streaming-data (computationally intensive) embedded applications, that is, applications with a large amount of data-level parallelism. It is not the goal of the proposed architecture to address control-dominated applications. Image-processing applications are a good example of the type of applications that we are addressing.
This kind of application is becoming more and more sensitive to power consumption, especially if we consider the increasing market share of digital cameras and mobile phones with embedded cameras, which require this type of image processing. In this sense, we have selected three benchmarks that implement an image-sharpening application (see Fig. 7). The three benchmarks follow the same basic process: 1) transform the input image from the RGB to the YCrCb color space; 2) improve the image quality by processing the luminance (mainly using sliding-window operations like 3 x 3 linear convolutions); and 3) transform from YCrCb back to the RGB color space. Three different input data sets (image sizes) have been used in the experiments.

B. Prototype Implementation

A prototype of the proposed architecture has been designed and implemented. The Galapagos system is a PCI-based system (64 b/66 MHz). It is based on leading-edge FPGAs from Xilinx and high-bandwidth DDR SDRAM memory (see the left-hand side of Fig. 8). This reconfigurable system is based on a Virtex-II Pro device. The device used is an XC2VP20, which includes two PowerPC processors. The dynamic scheduling unit (DSU in Fig. 1) and the data prefetch units of the L2 memory subsystem [see Fig. 2(b)] have been mapped to the Virtex-II Pro device, which also includes the SDRAM memory controller. The design of these blocks has been done in Verilog HDL, and the implementation has been done using Synplicity (synthesis) and Xilinx (place&route) tools. The DRP processors of our architecture are implemented in the Galapagos system using three Virtex-II devices (i.e., XC2V1000). The load and store units have been implemented using Virtex-II on-chip memory. The size of the buffers in the load/store units is 2 KB each (i.e., 4 KB per unit). The width of the memory words is 64 b.

Fig. 8. Galapagos prototyping platform.

Fig. 8 shows a picture of the Galapagos system in a PC environment.
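The two 2-KB buffers per load/store unit implement the ping-pong overlap described in Section III: while one buffer is being filled, the other feeds the datapath. A minimal Python sketch of that mechanism (our illustration only; the real units are hardware, and the increment is a stand-in computation):

```python
def pingpong_stream(blocks):
    """Process a stream with two buffers: while one buffer is being
    (re)filled by the load unit, the other one is processed."""
    buffers = [None, None]   # the two load-unit buffers
    out = []
    fill = 0                 # index of the buffer currently being filled
    for blk in blocks:
        buffers[fill] = blk          # the load unit fills one buffer...
        ready = buffers[1 - fill]    # ...while the other is processed
        if ready is not None:
            out.append([x + 1 for x in ready])  # stand-in computation
        fill = 1 - fill
    # drain the last filled buffer
    last = buffers[1 - fill]
    if last is not None:
        out.append([x + 1 for x in last])
    return out
```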
Fig. 9. Final placed and routed task on three Virtex-II devices: (a) XC2V1000, (b) XC2V500, and (c) XC2V250.

Fig. 10. HW/SW task execution time.

C. Task Performance Results

Fig. 10 shows the execution time of the tasks of the unsharp masking application running on: 1) an embedded PowerPC405 processor (300 MHz); 2) a DRP processor from the Galapagos system (60 MHz); and 3) a DRP processor processing blocks of data of a different size. It is interesting to note the order-of-magnitude improvement obtained in the implementation of the blur task (3 x 3 linear convolution). It is not the objective of this paper to explain the implementation details of the several tasks in hardware. These tasks have been designed in Verilog HDL, simulated using ModelSim, and implemented using Synplicity (synthesis) and Xilinx (place&route) tools. In order to reduce the reconfiguration overhead, we have used the partial reconfiguration capability of the Virtex-II devices [26]. In this sense, the Virtex-II resources used by the hardware tasks have been fixed to the center of the device, where we time-multiplex the required task (see Fig. 9). The left and right sides of the device are used by the DRP's load and store units, which are not run-time reconfigured [see Fig. 2(a)]. We have implemented the DRP processors in three different Xilinx Virtex-II devices (i.e., XC2V250, XC2V500, and XC2V1000), which mainly differ in the amount of hardware area available to the reconfigurable unit. Using this capability of the Virtex-II devices, with a reconfiguration clock of 66 MHz, we have obtained the following average reconfiguration times for the three devices: a) 949 µs for an XC2V250; b) 1087 µs for an XC2V500; and c) 1337 µs for an XC2V1000.

Fig. 11. Hardware/software task power consumption.

D. Task Power Results

Fig.
11 shows the power consumption of a Galapagos DRP in its several states using on-chip or off-chip memory. Moreover, we can also observe the power consumption of the embedded PowerPC405 processor, which is used to execute the tasks mapped to software. In Fig. 11, we give three values for the DRP's power consumption (a different value for each Xilinx Virtex-II device). These power consumption values have been obtained using XPower, the power estimation tool from Xilinx. Moreover, using XPower, we have estimated the power consumption of the on-chip memory. Finally, the power consumption of the off-chip memory (i.e., external DRAM) has been obtained from Micron datasheets. We have used two memory chips of 64 MB running at 100 MHz. In the following paragraphs, we explain the DRP processor's power consumption in its several states.

The power consumption in the idle/wait state represents: 1) the static (i.e., leakage) power associated with a complete DRP processor (i.e., load, store, and reconfigurable units) and 2) the static power taken by the on-chip or off-chip memory resources. Clearly, the static power increases when we: 1) use external memory or 2) increase the size of the device (i.e., increase the hardware area).

The power in the reconfiguration state includes: 1) the power of the DRP processor itself (from Xilinx, we have learned that this power consumption is mainly driven by the device leakage power); 2) the dynamic power consumption of the L2 configuration prefetch unit; and 3) the dynamic power of the on-chip or off-chip memory resources (in the latter case, we also include the power consumption of the I/O buffers). It is interesting to note that the dynamic power taken by the Xilinx Virtex-II devices during the reconfiguration process can be ignored; the static (i.e., leakage) power is so significant that the dynamic power is lost in the noise. Keep in mind that, during the reconfiguration process, only a small amount of logic is actually switching, since the reconfiguration context (i.e., bitstream) is sequentially loaded into the reconfigurable hardware.

The power consumption in the execution state accounts for: 1) the static and dynamic power of the full DRP processor (i.e., load, store, and reconfigurable units); 2) the dynamic power of the L2 data prefetch units; and 3) the power consumption of the associated (i.e., on-chip or off-chip) memory resources. As in the previous case, when dealing with external memory, we also take into account the power consumption of the I/O buffers (e.g., LVTTL 3.3 V). The DRP power consumption in execution is an average power obtained when the tasks of the unsharp masking application run at 60 MHz. This average power has been obtained by performing gate-level accurate simulations after the place&route process for all tasks. Finally, let us briefly explain the power consumption of the embedded CPU.
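The reconfiguration times reported in the previous subsection grow with device size because larger devices have larger partial bitstreams. As a rough sanity check, under the assumption that the configuration port accepts one bitstream byte per configuration-clock cycle (and with an illustrative, not measured, bitstream size), the latency can be estimated as:

```python
# Back-of-the-envelope reconfiguration-latency model (an assumption-laden
# sketch, not the authors' measurement method): latency scales linearly
# with partial-bitstream size and inversely with the configuration clock.

def reconfig_time_us(bitstream_bytes, clk_mhz=66.0, bytes_per_cycle=1):
    """Estimated reconfiguration latency in microseconds, assuming
    bytes_per_cycle bitstream bytes are loaded per configuration clock."""
    return bitstream_bytes / (clk_mhz * bytes_per_cycle)

# An assumed ~63 KB partial bitstream at 66 MHz lands near the ~949 us
# reported for the smallest device; larger devices have proportionally
# larger bitstreams and hence longer reconfiguration times.
print(round(reconfig_time_us(63_000), 1))
```

This first-order model is consistent with the observation that the measured times for the three devices order themselves by device size.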
According to Xilinx, the PowerPC405 takes 0.9 mW/MHz. Assuming a clock frequency of 300 MHz, we obtain a power consumption of 270 mW. We should also add the power consumption of the data prefetch units attached to the embedded processor.

E. Energy-Performance Tradeoff Results

In this subsection, we explain the energy-performance tradeoff results obtained when applying the proposed configuration-aware data-partitioning technique. The performance results have been obtained from real executions on the Galapagos system. The execution generates a log file with the state changes of the Virtex-II devices and the embedded PowerPC. We have obtained the energy from: 1) the power consumption of the components as described in Fig. 11 and 2) the execution log file, which gives the amount of time that each device has spent in a given state. Fig. 13(a) shows the performance results and Fig. 13(b) shows the energy consumption results for the unsharp masking application. In all plots, we can observe the results obtained for the image size when we change the target device (i.e., we show the results for the three Virtex-II devices).

Fig. 12. Unsharp masking HW/SW task partitioning.

In addition, we present the following four implementations.

Software implementation (named seq_sw): this implementation is based on the use of the embedded PowerPC405; the associated performance and power results are shown in Figs. 10 and 11, respectively. In this experiment, we assume that the input images have been partitioned into blocks of pixels (i.e., since we have partitioned the input image into several blocks, we must iterate the task graph several times; for example, 16 times in the case of an image with pixels).

HW/SW partitioning (named seq_hw_sw): in this approach, we use on-chip memory, since we process small data blocks (i.e., pixels).
Moreover, we have used the HW/SW partitioning algorithm proposed in [11], assuming two or three DRP processors and the average reconfiguration times introduced in the previous subsection. The obtained partitioning can be observed in Fig. 12, where we see that the reconfiguration overhead prevents us from mapping more tasks to hardware than the number of available DRP processors.

Dynamic reconfiguration (named seq_dr): in this case, we increase the size of the data blocks to process. Specifically, we process blocks of pixels, which means that we must use off-chip memory (i.e., external DRAM). This amount of data implies that the tasks' execution times are much closer to the DRP reconfiguration time. As a result, when we apply the HW/SW partitioning algorithm, all tasks are mapped to the reconfigurable hardware.

Hardware implementation (named seq_hw): this approach assumes that: 1) we use five DRP processors and 2) we use on-chip memory, since we process blocks of data of pixels. This should be considered the optimum solution in terms of both power and performance, since: 1) there is no reconfiguration overhead (i.e., we have as many DRPs as tasks) and 2) we use on-chip memory.

Fig. 13. Unsharp masking application. (a) Performance results. (b) Energy results.

In Fig. 13(a), we show the performance obtained using the four implementations. We can observe that the software implementation (i.e., the PowerPC405-based solution) obtains the worst performance. The use of the HW/SW partitioning approach contributes to a major improvement in performance, since critical tasks are mapped to the configurable hardware. Obviously, increasing the number of DRP processors helps to improve performance, since more tasks are implemented in hardware (i.e., a 29% improvement when moving from two to three DRP processors). Moreover, it is clear that the reconfiguration time does not affect this approach, since there are no reconfigurations. The dynamic reconfiguration technique improves performance even further. Dynamic reconfiguration improves on the HW/SW partitioning approach by: 1) 62.5% when using two DRP processors and 2) 47.3% when using three DRP processors. In addition, dynamic reconfiguration improves on the solution based on the embedded CPU by 83.14%. Finally, it is worth mentioning that, in the unsharp masking benchmark, the dynamic reconfiguration approach does not benefit from increasing the number of DRP processors (i.e., we obtain the same results in both situations). Since we are using the unmodified linear task graph, two DRP processors are enough to completely hide the reconfiguration overhead (i.e., one DRP processor is in reconfiguration while the other one is in execution). On the other hand, Fig. 13(b) shows the energy consumption of all four approaches. It is clear that the solution based on the embedded CPU consumes the largest amount of energy. Despite using on-chip memory and requiring the minimum amount of power (see Fig. 11), the long execution times of the tasks implemented on the PowerPC405 contribute to this large energy consumption. Obviously, the hardware-based approach is the optimum solution in terms of energy consumption, thanks to the use of on-chip memory and the short execution times, which do not incur any reconfiguration overhead.
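The energy numbers discussed next follow directly from the bookkeeping described earlier: per-state power (Fig. 11) multiplied by the per-state residence time taken from the execution log. A minimal sketch, with made-up per-state power values (the real ones come from Fig. 11 and the Galapagos trace):

```python
# Sketch of the energy bookkeeping behind Fig. 13(b): total energy is the
# sum over the execution log of (per-state power) x (time in that state).
# The per-state powers below are illustrative assumptions, not measured.

POWER_MW = {
    "idle": 50.0,              # static (leakage) power only
    "reconfiguration": 120.0,  # leakage + config prefetch + memory
    "execution": 300.0,        # full DRP static + dynamic power
}

def energy_uj(state_log):
    """state_log: list of (state, duration_ms) pairs from the execution
    trace; since mW * ms = uJ, the result is in microjoules."""
    return sum(POWER_MW[state] * dur_ms for state, dur_ms in state_log)

log = [("reconfiguration", 1.0), ("execution", 4.0), ("idle", 2.0)]
print(energy_uj(log))  # 120*1 + 300*4 + 50*2 = 1420.0 uJ
```

This formulation makes the tradeoff explicit: dynamic reconfiguration shortens the execution term but adds reconfiguration terms and raises the per-state powers when off-chip memory is involved.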
Then, as an intermediate solution, we have the results for the mixed HW/SW and dynamic reconfiguration approaches. We must first observe, in both approaches, that the energy increases when: 1) having fixed the number of DRP processors, we increase the size of the reconfigurable unit (e.g., we move from two XC2V250 to two XC2V500 devices) or 2) having fixed a given Virtex-II device, we increase the number of DRP processors (i.e., we move from two to three DRP processors). In both situations, this increase in energy is due to the increase in static (i.e., idle) leakage power that comes with the larger hardware area. From Fig. 13(b), we can observe that, independently of the number of DRP processors, the mixed HW/SW solution requires less energy than the dynamic reconfiguration approach does;7 that is, the dynamic reconfiguration approach, despite its performance advantages, requires more energy due to its high power requirements, which come from the use of off-chip memory. It is interesting to note that the dynamic reconfiguration approach has the same energy requirements for execution and reconfiguration as in the case where we use two DRP processors. As a summary of Fig. 13(b), both solutions based on configurable logic give an average 43% energy reduction when compared with the energy required by the embedded CPU implementation. This energy improvement can be as high as 60%. Moreover, HW/SW partitioning improves on the dynamic reconfiguration approach, in terms of energy consumption, by 16.4% when using two DRP processors and 35% when using three DRP processors.

VII. CONCLUSION

In this paper, we have explored the system-level power-performance tradeoffs for fine-grained reconfigurable computing. We have proposed a configuration-aware data-partitioning technique for reconfigurable architectures, and we have shown how the reconfiguration overhead directly impacts this data-partitioning process.
When targeting many streaming applications (like the image-processing applications considered here), we have shown that the choice of approach (i.e., HW/SW partitioning for statically reconfigurable architectures or context scheduling for dynamically reconfigurable architectures) depends on the application requirements (i.e., power or performance). Thus, for this type of application, if the objective is energy efficiency, then HW/SW partitioning for statically reconfigurable logic is the most favorable solution. On the other

7 In the calculation of the energy taken by the dynamic reconfiguration approach, we assume that we can completely power off the embedded CPU (i.e., we do not consider the leakage power due to the PowerPC).
hand, if the application objective is performance, then context scheduling for dynamically reconfigurable architectures is the optimum solution. Finally, future work includes the study of the same tradeoffs in a mixed environment, where HW/SW partitioning could be combined with context scheduling for dynamically reconfigurable architectures. Other topics of future research include applying the techniques proposed in this paper to other types of embedded applications and proposing a detailed implementation of the L2 memory subsystem.

REFERENCES

[1] L. Benini, A. Bogliolo, and G. De Micheli, A survey of design techniques for system-level dynamic power management, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 8, no. 3, pp , Jun
[2] K. Chatha and R. Vemuri, Hardware-software co-design for dynamically reconfigurable architectures, in Proc. FPL, 1999, pp
[3] V. George, H. Zhang, and J. Rabaey, The design of a low energy FPGA, in Proc. Int. Symp. ISLPED, 1999, pp
[4] S. Hauck, Configuration prefetch for single context reconfigurable coprocessors, in Proc. ACM Int. Symp. FPGA, 1998, pp
[5] R. Hartenstein, A decade of reconfigurable computing: A visionary retrospective, in Proc. DATE, 2001, pp
[6] B. Jeong, Hardware-software co-synthesis for run-time incrementally reconfigurable FPGAs, in Proc. ASP-DAC, 2000, pp
[7] F. Li, D. Chen, L. He, and J. Cong, Architecture evaluation for power efficient FPGAs, in Proc. ACM Int. Symp. FPGA, 2003, pp
[8] Y. Li, Hardware-software co-design of embedded reconfigurable architectures, in Proc. DAC, 2000, pp
[9] R. Maestre, A framework for reconfigurable computing: Task scheduling and context management, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 9, no. 6, pp , Dec
[10] R. Maestre, Configuration management in multi-context reconfigurable systems for simultaneous performance and power optimizations, in Proc.
ISSS, 2000, pp
[11] J. Noguera and R. M. Badia, HW/SW co-design techniques for dynamically reconfigurable architectures, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 10, no. 4, pp , Aug
[12] J. Noguera and R. M. Badia, System-level power-performance trade-offs in task scheduling for dynamically reconfigurable architectures, in Proc. CASES, 2003, pp
[13] J. Noguera and R. M. Badia, Multitasking on reconfigurable architectures: Micro-architecture support and dynamic scheduling, ACM Trans. Embedded Comput. Syst. (TECS), 2004, pp
[14] J. Noguera and R. M. Badia, Power-performance trade-offs for reconfigurable computing, in Proc. CODES+ISSS, 2004, pp
[15] K. W. Poon, A. Yan, and S. J. E. Wilton, A flexible power model for FPGAs, in Proc. 12th Int. Conf. Field-Programmable Logic Appl. (FPL), 2002, pp
[16] K. Purna and D. Bhatia, Temporal partitioning and scheduling data flow graphs for reconfigurable computers, IEEE Trans. Comput., vol. 48, no. 6, pp , Jun
[17] B. E. Saglam (Akgul) and V. Mooney, System-on-a-chip processor synchronization support in hardware, in Proc. DATE, 2001, pp
[18] M. Sánchez-Élez, A complete data scheduler for multi-context reconfigurable architectures, in Proc. DATE, 2002, pp
[19] L. Shang, A. S. Kaviani, and K. Bathala, Dynamic power consumption in Virtex-II FPGA family, in Proc. Int. Symp. FPGA, 2002, pp
[20] G. Stitt, F. Vahid, and S. Nemetebaksh, Energy savings and speedups from partitioning critical software loops to hardware in embedded systems, ACM Trans. Embedded Comput. Syst. (TECS), 2004, pp
[21] S. Trimberger, D. Carberry, A. Johnson, and J. Wong, A time-multiplexed FPGA, in Proc. 5th IEEE Symp. Field-Programmable Custom Computing Machines (FCCM), 1997, pp
[22] O. S. Unsal and I. Koren, System-level power-aware design techniques in real-time systems, Proc. IEEE, vol. 91, pp , Jul
[23] M. Vasilko and D. Ait-Boudaoud, Scheduling for dynamically reconfigurable FPGAs, in Proc. Int. Workshop Logic Arch. Synthesis (IFIP TC10 WG10.5), 1995, pp
[24] K. Weiß, C. Oetker, I. Katchan, T. Steckstor, and W. Rosenstiel, Power estimation approach for SRAM-based FPGAs, in Proc. 8th ACM Int. Symp. Field-Programmable Gate Arrays (FPGA), 2000, pp
[25] M. J. Wirthlin and B. L. Hutchings, Improving functional density through run-time circuit reconfiguration, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 6, no. 2, pp , Jun
[26] Two Flows for Partial Reconfiguration: Module Based or Small Bit Manipulations, Xilinx Corp., San Jose, CA, 2005, Xilinx Application Note XAPP290.
[27] Z. Li, K. Compton, and S. Hauck, Configuration caching management techniques for reconfigurable computing, in Proc. 8th IEEE Symp. Field-Programmable Custom Computing Machines (FCCM), 2000, pp

Juanjo Noguera received the B.Sc. degree in computer science from the Autonomous University of Barcelona, Barcelona, Spain, in 1997, and the Ph.D. degree in computer science from the Technical University of Catalonia, Barcelona, Spain, in . He has worked for the Spanish National Center for Microelectronics, the Technical University of Catalonia, and the Hewlett-Packard Inkjet Commercial Division. In January 2006, he joined the Xilinx Research Labs, Dublin, Ireland. His interests include system-level design, reconfigurable architectures, and low-power design techniques. He has published papers in international journals and conference proceedings.

Rosa M. Badia received the B.Sc. and Ph.D. degrees in computer science from the Technical University of Catalonia, Barcelona, Spain, in 1989 and 1994, respectively. She is currently an Associate Professor in the Computer Architecture Department of the Technical University of Catalonia, and a Project Manager at the Barcelona Supercomputing Center, Barcelona, Spain. Her interests include CAD tools for VLSI, reconfigurable architectures, performance prediction and analysis of message-passing applications, and GRID computing. She has published papers in international journals and conference proceedings.
More informationISSN Vol.05,Issue.09, September-2017, Pages:
WWW.IJITECH.ORG ISSN 2321-8665 Vol.05,Issue.09, September-2017, Pages:1693-1697 AJJAM PUSHPA 1, C. H. RAMA MOHAN 2 1 PG Scholar, Dept of ECE(DECS), Shirdi Sai Institute of Science and Technology, Anantapuramu,
More informationEnergy Aware Optimized Resource Allocation Using Buffer Based Data Flow In MPSOC Architecture
ISSN (Online) : 2319-8753 ISSN (Print) : 2347-6710 International Journal of Innovative Research in Science, Engineering and Technology Volume 3, Special Issue 3, March 2014 2014 International Conference
More informationIMPLEMENTATION OF DISTRIBUTED CANNY EDGE DETECTOR ON FPGA
IMPLEMENTATION OF DISTRIBUTED CANNY EDGE DETECTOR ON FPGA T. Rupalatha 1, Mr.C.Leelamohan 2, Mrs.M.Sreelakshmi 3 P.G. Student, Department of ECE, C R Engineering College, Tirupati, India 1 Associate Professor,
More informationImplementation of A Optimized Systolic Array Architecture for FSBMA using FPGA for Real-time Applications
46 IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.3, March 2008 Implementation of A Optimized Systolic Array Architecture for FSBMA using FPGA for Real-time Applications
More informationPilot: A Platform-based HW/SW Synthesis System
Pilot: A Platform-based HW/SW Synthesis System SOC Group, VLSI CAD Lab, UCLA Led by Jason Cong Zhong Chen, Yiping Fan, Xun Yang, Zhiru Zhang ICSOC Workshop, Beijing August 20, 2002 Outline Overview The
More informationthe main limitations of the work is that wiring increases with 1. INTRODUCTION
Design of Low Power Speculative Han-Carlson Adder S.Sangeetha II ME - VLSI Design, Akshaya College of Engineering and Technology, Coimbatore sangeethasoctober@gmail.com S.Kamatchi Assistant Professor,
More informationImplementation of Asynchronous Topology using SAPTL
Implementation of Asynchronous Topology using SAPTL NARESH NAGULA *, S. V. DEVIKA **, SK. KHAMURUDDEEN *** *(senior software Engineer & Technical Lead, Xilinx India) ** (Associate Professor, Department
More informationDesign Partitioning Methodology for Systems on Programmable Chip
Design Partitioning Methodology for Systems on Programmable Chip Abdo Azibi and Ramzi Ayadi Department of Electronics College of Technology at Alkharj, Saudi Arabia Email: aazibi, amzi.ayadi@tvtc.gov.sa
More informationLecture 41: Introduction to Reconfigurable Computing
inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture 41: Introduction to Reconfigurable Computing Michael Le, Sp07 Head TA April 30, 2007 Slides Courtesy of Hayden So, Sp06 CS61c Head TA Following
More informationFPGA Implementation of Multiplierless 2D DWT Architecture for Image Compression
FPGA Implementation of Multiplierless 2D DWT Architecture for Image Compression Divakara.S.S, Research Scholar, J.S.S. Research Foundation, Mysore Cyril Prasanna Raj P Dean(R&D), MSEC, Bangalore Thejas
More informationReconfigurable Computing. Introduction
Reconfigurable Computing Tony Givargis and Nikil Dutt Introduction! Reconfigurable computing, a new paradigm for system design Post fabrication software personalization for hardware computation Traditionally
More informationTowards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing
Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing Walter Stechele, Stephan Herrmann, Andreas Herkersdorf Technische Universität München 80290 München Germany Walter.Stechele@ei.tum.de
More informationMicroelectronics. Moore s Law. Initially, only a few gates or memory cells could be reliably manufactured and packaged together.
Microelectronics Initially, only a few gates or memory cells could be reliably manufactured and packaged together. These early integrated circuits are referred to as small-scale integration (SSI). As time
More informationA hardware operating system kernel for multi-processor systems
A hardware operating system kernel for multi-processor systems Sanggyu Park a), Do-sun Hong, and Soo-Ik Chae School of EECS, Seoul National University, Building 104 1, Seoul National University, Gwanakgu,
More informationMemory Systems IRAM. Principle of IRAM
Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several
More informationInterfacing a High Speed Crypto Accelerator to an Embedded CPU
Interfacing a High Speed Crypto Accelerator to an Embedded CPU Alireza Hodjat ahodjat @ee.ucla.edu Electrical Engineering Department University of California, Los Angeles Ingrid Verbauwhede ingrid @ee.ucla.edu
More informationVerification of Multiprocessor system using Hardware/Software Co-simulation
Vol. 2, 85 Verification of Multiprocessor system using Hardware/Software Co-simulation Hassan M Raza and Rajendra M Patrikar Abstract--Co-simulation for verification has recently been introduced as an
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationDesign of a System-on-Chip Switched Network and its Design Support Λ
Design of a System-on-Chip Switched Network and its Design Support Λ Daniel Wiklund y, Dake Liu Dept. of Electrical Engineering Linköping University S-581 83 Linköping, Sweden Abstract As the degree of
More informationEmbedded Real-Time Video Processing System on FPGA
Embedded Real-Time Video Processing System on FPGA Yahia Said 1, Taoufik Saidani 1, Fethi Smach 2, Mohamed Atri 1, and Hichem Snoussi 3 1 Laboratory of Electronics and Microelectronics (EμE), Faculty of
More informationDesign Issues in Hardware/Software Co-Design
Volume-2, Issue-1, January-February, 2014, pp. 01-05, IASTER 2013 www.iaster.com, Online: 2347-6109, Print: 2348-0017 ABSTRACT Design Issues in Hardware/Software Co-Design R. Ganesh Sr. Asst. Professor,
More informationLecture 7: Introduction to Co-synthesis Algorithms
Design & Co-design of Embedded Systems Lecture 7: Introduction to Co-synthesis Algorithms Sharif University of Technology Computer Engineering Dept. Winter-Spring 2008 Mehdi Modarressi Topics for today
More informationPerformance Improvements of Microprocessor Platforms with a Coarse-Grained Reconfigurable Data-Path
Performance Improvements of Microprocessor Platforms with a Coarse-Grained Reconfigurable Data-Path MICHALIS D. GALANIS 1, GREGORY DIMITROULAKOS 2, COSTAS E. GOUTIS 3 VLSI Design Laboratory, Electrical
More informationWITH the development of the semiconductor technology,
Dual-Link Hierarchical Cluster-Based Interconnect Architecture for 3D Network on Chip Guang Sun, Yong Li, Yuanyuan Zhang, Shijun Lin, Li Su, Depeng Jin and Lieguang zeng Abstract Network on Chip (NoC)
More informationEmbedded Systems. 8. Hardware Components. Lothar Thiele. Computer Engineering and Networks Laboratory
Embedded Systems 8. Hardware Components Lothar Thiele Computer Engineering and Networks Laboratory Do you Remember? 8 2 8 3 High Level Physical View 8 4 High Level Physical View 8 5 Implementation Alternatives
More informationPower Estimation of System-Level Buses for Microprocessor-Based Architectures: A Case Study
Power Estimation of System-Level Buses for Microprocessor-Based Architectures: A Case Study William Fornaciari Politecnico di Milano, DEI Milano (Italy) fornacia@elet.polimi.it Donatella Sciuto Politecnico
More informationImplementing Photoshop Filters in Virtex
Implementing Photoshop Filters in Virtex S. Ludwig, R. Slous and S. Singh Springer-Verlag Berlin Heildelberg 1999. This paper was first published in Field-Programmable Logic and Applications, Proceedings
More informationIntegrating MRPSOC with multigrain parallelism for improvement of performance
Integrating MRPSOC with multigrain parallelism for improvement of performance 1 Swathi S T, 2 Kavitha V 1 PG Student [VLSI], Dept. of ECE, CMRIT, Bangalore, Karnataka, India 2 Ph.D Scholar, Jain University,
More informationBus Encoding Technique for hierarchical memory system Anne Pratoomtong and Weiping Liao
Bus Encoding Technique for hierarchical memory system Anne Pratoomtong and Weiping Liao Abstract In microprocessor-based systems, data and address buses are the core of the interface between a microprocessor
More informationTestability Design for Sleep Convention Logic
Advances in Computational Sciences and Technology ISSN 0973-6107 Volume 11, Number 7 (2018) pp. 561-566 Research India Publications http://www.ripublication.com Testability Design for Sleep Convention
More informationFPGA: What? Why? Marco D. Santambrogio
FPGA: What? Why? Marco D. Santambrogio marco.santambrogio@polimi.it 2 Reconfigurable Hardware Reconfigurable computing is intended to fill the gap between hardware and software, achieving potentially much
More informationMulti processor systems with configurable hardware acceleration
Multi processor systems with configurable hardware acceleration Ph.D in Electronics, Computer Science and Telecommunications Ph.D Student: Davide Rossi Ph.D Tutor: Prof. Roberto Guerrieri Outline Motivations
More informationFPGA BASED ADAPTIVE RESOURCE EFFICIENT ERROR CONTROL METHODOLOGY FOR NETWORK ON CHIP
FPGA BASED ADAPTIVE RESOURCE EFFICIENT ERROR CONTROL METHODOLOGY FOR NETWORK ON CHIP 1 M.DEIVAKANI, 2 D.SHANTHI 1 Associate Professor, Department of Electronics and Communication Engineering PSNA College
More informationA Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid Cache in 3D chip Multi-processors
, July 4-6, 2018, London, U.K. A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid in 3D chip Multi-processors Lei Wang, Fen Ge, Hao Lu, Ning Wu, Ying Zhang, and Fang Zhou Abstract As
More informationA 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation
A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation Abstract: The power budget is expected to limit the portion of the chip that we can power ON at the upcoming technology nodes. This problem,
More informationA Novel Deadlock Avoidance Algorithm and Its Hardware Implementation
A ovel Deadlock Avoidance Algorithm and Its Hardware Implementation + Jaehwan Lee and *Vincent* J. Mooney III Hardware/Software RTOS Group Center for Research on Embedded Systems and Technology (CREST)
More informationOn GPU Bus Power Reduction with 3D IC Technologies
On GPU Bus Power Reduction with 3D Technologies Young-Joon Lee and Sung Kyu Lim School of ECE, Georgia Institute of Technology, Atlanta, Georgia, USA yjlee@gatech.edu, limsk@ece.gatech.edu Abstract The
More informationFPGA IMPLEMENTATION FOR REAL TIME SOBEL EDGE DETECTOR BLOCK USING 3-LINE BUFFERS
FPGA IMPLEMENTATION FOR REAL TIME SOBEL EDGE DETECTOR BLOCK USING 3-LINE BUFFERS 1 RONNIE O. SERFA JUAN, 2 CHAN SU PARK, 3 HI SEOK KIM, 4 HYEONG WOO CHA 1,2,3,4 CheongJu University E-maul: 1 engr_serfs@yahoo.com,
More informationFast FPGA Routing Approach Using Stochestic Architecture
. Fast FPGA Routing Approach Using Stochestic Architecture MITESH GURJAR 1, NAYAN PATEL 2 1 M.E. Student, VLSI and Embedded System Design, GTU PG School, Ahmedabad, Gujarat, India. 2 Professor, Sabar Institute
More informationAn Integration of Imprecise Computation Model and Real-Time Voltage and Frequency Scaling
An Integration of Imprecise Computation Model and Real-Time Voltage and Frequency Scaling Keigo Mizotani, Yusuke Hatori, Yusuke Kumura, Masayoshi Takasu, Hiroyuki Chishiro, and Nobuyuki Yamasaki Graduate
More informationOptimized architectures of CABAC codec for IA-32-, DSP- and FPGAbased
Optimized architectures of CABAC codec for IA-32-, DSP- and FPGAbased platforms Damian Karwowski, Marek Domański Poznan University of Technology, Chair of Multimedia Telecommunications and Microelectronics
More informationThree DIMENSIONAL-CHIPS
IOSR Journal of Electronics and Communication Engineering (IOSR-JECE) ISSN: 2278-2834, ISBN: 2278-8735. Volume 3, Issue 4 (Sep-Oct. 2012), PP 22-27 Three DIMENSIONAL-CHIPS 1 Kumar.Keshamoni, 2 Mr. M. Harikrishna
More informationArchitectural Differences nc. DRAM devices are accessed with a multiplexed address scheme. Each unit of data is accessed by first selecting its row ad
nc. Application Note AN1801 Rev. 0.2, 11/2003 Performance Differences between MPC8240 and the Tsi106 Host Bridge Top Changwatchai Roy Jenevein risc10@email.sps.mot.com CPD Applications This paper discusses
More information