IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 14, NO. 7, JULY 2006

System-Level Power-Performance Tradeoffs for Reconfigurable Computing

Juanjo Noguera and Rosa M. Badia

Abstract
In this paper, we propose a configuration-aware data-partitioning approach for reconfigurable computing. We show how the reconfiguration overhead impacts the data-partitioning process. Moreover, we explore the system-level power-performance tradeoffs available when implementing streaming embedded applications on fine-grained reconfigurable architectures. For a certain group of streaming applications, we show that an efficient hardware/software partitioning algorithm is required when targeting low power. However, if the application objective is performance, then we propose the use of dynamically reconfigurable architectures. We propose a design methodology that adapts the architecture and algorithms to the application requirements. The methodology has been proven to work on a real research platform based on Xilinx devices. Finally, we have applied our methodology and algorithms to the case study of image sharpening, which is required nowadays in digital cameras and mobile phones.

Index Terms
Hardware/software (HW/SW) codesign, power-performance tradeoffs, reconfigurable computing (RC).

I. INTRODUCTION AND MOTIVATION

RECONFIGURABLE computing (RC) [5] is an interesting alternative to application-specific integrated circuits (ASICs) and general-purpose processors for implementing embedded systems, since it provides the flexibility of software processors and the efficiency and throughput of hardware coprocessors. Programmable systems-on-chip have become a reality, combining a wide range of complex functions on a single die. An example is the Virtex-II Pro from Xilinx, which integrates a core processor (PowerPC405), embedded memory, and configurable logic. Additionally, the importance of having on-chip programmable logic regions in system-on-chip (SoC) platforms is becoming increasingly evident. Partitioning an application among software and programmable logic hardware can substantially improve performance, but such partitioning can also improve power consumption by performing computations more effectively and by allowing for longer microprocessor shutdown periods. Dynamic reconfiguration [25] has emerged as a particularly attractive technique to increase the effective use of programmable logic blocks, since it allows the device configuration to be changed on the fly during application execution. However, this attractive idea of time-multiplexing the needed device configuration does not come for free. The reconfiguration overhead has to be minimized in order to improve application performance. Temporal partitioning [16] and context scheduling [9] can be used to minimize this penalty.

Manuscript received July 2, 2005; revised January 9. This work was supported by the CICYT under Project TIN CO2-01 and by DURSI under Project 2001SGR. J. Noguera was with the Computer Architecture Department, Technical University of Catalonia, Barcelona, Spain. He is now with Xilinx Research Laboratories, Saggart, Co. Dublin, Ireland (juanjo.noguera@xilinx.com; jnoguera@ac.upc.edu). R. M. Badia is with the Computer Architecture Department, Technical University of Catalonia, Barcelona, Spain (rosab@ac.upc.edu).
We could summarize that the system-level approaches to reconfigurable computing can be divided into two broad categories: 1) hardware/software (HW/SW) partitioning for statically reconfigurable architectures and 2) temporal partitioning and context scheduling 2 for dynamically reconfigurable architectures. On the other hand, energy-efficient computation is a major challenge in embedded systems design, especially if portable, battery-powered systems (e.g., mobile phones or digital cameras) are considered [1]. It is well known that the memory hierarchy is one of the major contributors to the system-level power budget [1], [22]. Thus, the way we partition the data between on-chip and off-chip memory will impact the overall system-level power consumption.

In this paper, we investigate the power-performance tradeoffs for these two system-level approaches to RC. We show that, when targeting streaming applications, the choice of approach (i.e., HW/SW partitioning or context scheduling) depends on the application requirements (i.e., power or performance). Moreover, we propose that, in the reconfiguration context scheduling approach, the reconfigurable architecture should process large blocks of data, which should be stored in external memory resources. The execution of large blocks of data minimizes the reconfiguration overhead, but it also increases the power consumption due to the use of external memory. On the other hand, the HW/SW partitioning-based approach should process small blocks of data that can be stored in on-chip memory, which means that we reduce the overall system power consumption.

The paper is organized as follows. Section II explains the related work. In Section III, we introduce our target architecture. The proposed design methodology for embedded systems is presented in Section IV. Section V introduces the concept of configuration-aware data partitioning. In Section VI, we explain the benchmarks, the experimental setup, and the obtained results. Finally, the conclusions of this paper are presented in Section VII.

2 In this paper, context scheduling refers to the scheduling of: 1) task executions and 2) the reconfiguration processes of the configurable blocks in partially (not multicontext) reconfigurable devices.

II. PREVIOUS WORK

HW/SW partitioning for reconfigurable computing has been addressed in several research efforts [2], [6], [8]. An integrated algorithm for HW/SW partitioning and scheduling, temporal partitioning, and context scheduling is presented in [2]. On the other hand, context scheduling has also been widely addressed in many publications [9], [16], [21], [23]. However, none of these papers introduces any power-performance tradeoffs.

A review of design techniques for system-level dynamic power management can be found in [1]. In addition, a survey on power-aware design techniques for real-time systems is given in [22]. However, none of these papers considers the use of reconfigurable architectures. Power consumption of field-programmable gate-array (FPGA) devices has been addressed in several research efforts [3], [19], [24]. In addition, several power estimation models have been proposed [7], [15]. However, all of these approaches study the power requirements at the device level and not at the system level.

Recently, very few research efforts have addressed low-power task scheduling for dynamically reconfigurable devices. The technique proposed in [10] tries to minimize power consumption during reconfiguration by minimizing the number of bit changes between reconfiguration contexts. However, no power-performance tradeoffs and power measurements are presented. More recently, in [12], it was shown that configuration prefetching and frequency scaling could reduce the energy consumption without affecting performance. However, this paper does not cover the benefits of HW/SW partitioning. Additional techniques are given in [18] and [20]. A technique for application partitioning between configurable logic and an embedded processor is given in [20]. This paper shows that such partitioning helps to improve both performance and energy. However, the paper only considers statically configurable logic and does not consider dynamically reconfigurable architectures. A different approach for coarse-grained RC is presented in [18]. In that paper, a data-scheduler algorithm is proposed to reduce the overall system energy. However, the paper does not consider the benefits of HW/SW partitioning.

A. Contributions of This Work

This paper explores the system-level power-performance tradeoffs for fine-grained reconfigurable computing. More specifically, the paper compares, in terms of energy savings and performance improvements, the two key approaches existing in reconfigurable computing: 1) partitioning an application between software and configurable hardware and 2) context scheduling for dynamically reconfigurable architectures. To the best of our knowledge, this open issue has not been addressed in previous research efforts. In addition, the study presented in this paper focuses on a data-size-based partitioning approach for streaming applications. This is different from the majority of the traditional HW/SW partitioning and context scheduling approaches in the literature, which focus on task-graph dependency analysis.

Fig. 1. Dynamically reconfigurable CMP architecture.

Fig. 2. (a) Dynamically reconfigurable processor. (b) Architecture of the L2 on-chip memory subsystem.

III. TARGET ARCHITECTURE

The target architecture is a heterogeneous architecture, which includes an embedded processor, a given number of dynamically reconfigurable processors (DRPs), an on-chip L2 multibank memory subsystem, and external DRAM memory resources. An example of this architecture is shown in Fig. 1, where we can see a four-DRP-based architecture. This architecture follows the chip multiprocessor (CMP) paradigm. The data that must be transferred between tasks executed in the DRP processors are stored in the on-chip L2 memory subsystem. Each DRP processor can be independently reconfigured. The proposed target architecture supports multiple reconfigurations running concurrently, which is not the case for most of the architectures proposed in the literature. Each DRP processor has a local L1 memory buffer. A hardware-based data prefetching mechanism is proposed to hide the memory latency. Each DRP has a point-to-point link to the L2 buffers (in order to simplify Fig. 1, this is not shown in the picture). However, this is shown in Fig. 2(a), which depicts the internal architecture of a DRP processor. There are three main components in this architecture: 1) the load unit; 2) the store unit; and 3) the dynamically reconfigurable logic. The DRPs are single-context devices. It can be observed in Fig. 2(a) that the load and store units have internal L1 data buffers. As shown in the figure, each unit (i.e., load and store) has two internal buffers. This approach enables three processes to run concurrently, as illustrated by the sketch below: 1) the load unit receives data for the next computation; 2) the reconfigurable logic processes data from a buffer in the load unit and stores the processed data in a buffer of the store unit; and 3) the store unit sends the previously processed data to the L2 memory subsystem.
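The throughput effect of this double-buffered overlap can be modeled with a simple estimate. The following Python sketch is illustrative only (the actual load, store, and reconfigurable units are implemented in hardware); the per-block stage latencies are hypothetical parameters, not values taken from the paper.

# Minimal sketch (not the authors' hardware implementation): a software model of the
# DRP's double-buffered load/execute/store overlap described above.

def drp_pipeline_time(num_blocks, t_load, t_exec, t_store):
    """Approximate completion time of a three-stage pipeline with double buffering.

    With two buffers per unit, load of block i+1, execution of block i, and store of
    block i-1 proceed concurrently, so steady-state throughput is limited by the
    slowest stage.
    """
    if num_blocks == 0:
        return 0.0
    bottleneck = max(t_load, t_exec, t_store)
    # Fill latency for the first block, then one bottleneck period per additional block.
    return (t_load + t_exec + t_store) + (num_blocks - 1) * bottleneck

if __name__ == "__main__":
    # Example: 16 data blocks, with execution as the bottleneck stage (arbitrary units).
    print(drp_pipeline_time(num_blocks=16, t_load=1.0, t_exec=2.5, t_store=1.0))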

The on-chip L2 memory subsystem is based on a multibank approach [see Fig. 2(b)]. Each one of these banks is logically divided into two independent subbanks (i.e., this enables reading from one subbank while concurrently writing to the other subbank of the same physical bank). These buffers interact on one side with the data prefetch units [see the left-hand side of Fig. 2(b)] using a crossbar and on the other side with an on-chip bus that connects to the external DRAM memory controller. In this L2 memory subsystem, there must be as many data prefetch units as DRP processors. The proposed architecture also includes a dedicated hardware-based configuration prefetch unit for each DRP. This is not shown in the pictures in order to simplify the figures. Thus, the architecture supports the transfer of data in one DRP overlapped with the reconfiguration of a different DRP. Each DRP processor has its own clock signal, which means that this is a kind of globally-asynchronous locally-synchronous (GALS) architecture. The architecture supports the use of clock-gating and frequency-scaling techniques for power consumption minimization independently for each DRP.

Fig. 3. Design methodology for embedded systems.

IV. DESIGN METHODOLOGY FOR EMBEDDED SYSTEMS

The proposed design methodology is depicted in Fig. 3. We can observe that it is divided into three steps: 1) application phase; 2) static phase; and 3) dynamic phase.

A. Application Phase

The proposed methodology assumes that the input application is specified as a task graph, where nodes represent tasks (i.e., coarse-grained computations) and edges represent data dependencies. Each edge has a weight to represent the amount of data that must be transferred between tasks. Finally, each task has an associated task type (i.e., in the task-graph specification, we could have several tasks implementing the same type of computation).

B. Static Phase

In this phase, there are four main processes: 1) task-level graph transformations; 2) HW/SW synthesis; 3) HW/SW partitioning; and 4) priority task assignment. We can apply some task-level graph transformation techniques in order to increase the architecture performance. These transformations include task pipelining, task blocking, and task (configuration) replication. The output of this step is the modified task graph. The HW/SW synthesis is the process of implementing the tasks found in the application. The output of this process is a set of estimators. Typical estimators are HW execution time, SW execution time, HW area, and reconfiguration time. These estimators could be obtained using accurate implementation tools (i.e., compiler, logic synthesis, and place-and-route tools) or using high-level estimation tools. The HW/SW partitioning process decides which tasks are mapped to hardware or software depending on: 1) the architecture parameters (i.e., the number of DRP processors or external DRAM size); 2) the modified task graph; and 3) the tasks' estimators. Note that the application requirements do not directly affect the HW/SW partitioning process, but they do affect it indirectly through the modified task graph. The partitioning algorithm must take into account the configuration prefetch technique in its implementation. Finally, in the static phase, we also find the priority task assignment process. In this process, we statically assign an execution priority to each task. This information will be used at run-time to decide the execution order of the tasks. An example of a priority function is critical-path analysis.

C. Dynamic Phase

This phase is responsible for the scheduling of the tasks and also for the scheduling of the DRP reconfigurations. The task scheduler and DRP context scheduler cooperate and run in parallel during application run-time execution. Their functionality is based on the use of a look-ahead strategy into the list of tasks ready for execution (i.e., tasks whose predecessors have finished their execution). At run-time, the task scheduler assigns tasks to DRPs and decides the execution order of the tasks found in the ready-for-execution list. The DRP context (configuration) scheduler is used to minimize the reconfiguration overhead. The objective of the DRP context scheduler is to decide: 1) which DRP processor must be reconfigured and 2) which reconfiguration context, or hardware task from the list of tasks ready for reconfiguration (i.e., tasks whose predecessors have initiated their execution), must be loaded in the DRP processor. This scheduler tries to minimize the reconfiguration overhead by overlapping the execution of tasks with DRP reconfigurations. A simplified sketch of this cooperation is given below. These algorithms are implemented in hardware using the dynamic scheduling unit (DSU) found in our architecture (see Fig. 1) [13]. Several research efforts in the field of SoC design propose moving into hardware functionality that has traditionally been assigned to operating systems [17].
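As a reading aid, the sketch below renders the static priority assignment and one cooperative step of the task/context schedulers in Python. It is illustrative only: the real schedulers are implemented in hardware (the DSU), and the data structures, function names, and selection rules shown are assumptions, not the authors' exact algorithms.

# Sketch of critical-path priorities plus one look-ahead scheduling step, assuming a DAG
# task graph. Hypothetical structures: succs maps task -> successor list, exec_time maps
# task -> estimated execution time, loaded maps DRP -> configuration currently resident.

def critical_path_priority(tasks, succs, exec_time):
    """Priority = longest path (sum of execution times) from the task to any sink."""
    prio = {}
    def longest(t):
        if t not in prio:
            prio[t] = exec_time[t] + max((longest(s) for s in succs.get(t, [])), default=0)
        return prio[t]
    for t in tasks:
        longest(t)
    return prio

def schedule_step(ready_exec, ready_reconf, idle_drps, loaded, prio):
    """One cooperative step of the task scheduler and the DRP context scheduler.

    ready_exec   : tasks whose predecessors have finished (candidates for execution)
    ready_reconf : tasks whose predecessors have started (candidates for prefetching)
    """
    # Task scheduler: start the highest-priority ready task on a DRP that already
    # holds its configuration (no reconfiguration overhead).
    for task in sorted(ready_exec, key=prio.get, reverse=True):
        for drp in idle_drps:
            if loaded.get(drp) == task:
                return ("execute", task, drp)
    # Context scheduler: otherwise, prefetch the highest-priority future task onto an
    # idle DRP, overlapping its reconfiguration with the ongoing executions.
    for task in sorted(ready_reconf, key=prio.get, reverse=True):
        for drp in idle_drps:
            if loaded.get(drp) != task:
                return ("reconfigure", task, drp)
    return ("wait", None, None)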

V. CONFIGURATION-AWARE DATA PARTITIONING

Here, we explain how, depending on the application requirements (e.g., power or performance), the reconfiguration overhead impacts the data-partitioning process. Moreover, we show that the proposed data-partitioning technique strongly influences the HW/SW partitioning results. Finally, we explain the power-performance design tradeoffs that are involved in the data-partitioning technique.

Fig. 4. (a) Task graph for this example. (b) Sequential scheduling. (c) Scheduling with configuration prefetching. (d) Scheduling with data partitioning and configuration prefetching.

A. Introduction and Motivation

In our approach, we want to execute an application that is modeled as a task graph on a hybrid reconfigurable architecture with a given number of DRP processors, each of them characterized by its reconfiguration time. Moreover, the application must process an input data set of a given fixed size. In many streaming embedded applications, we can assume that the execution time of the application is proportional to the size of the data that has to be processed. In other words, this means that the execution time of the tasks found in the application is proportional to the amount of data that each task has to process. The data-partitioning process that we are proposing assumes that the execution time of the tasks is longer than the DRP reconfiguration time.

Obviously, there are several alternatives when scheduling an application on a dynamically reconfigurable architecture. In Fig. 4, we can observe three possible solutions for an application with five tasks and a three-DRP-based architecture. The task graph used in this example is shown in Fig. 4(a), where we can also observe the execution time of each task. In the following paragraphs, we explain these three possible solutions.

1) Sequential Scheduling: This is the simplest solution, where task executions and DRP reconfigurations are sequentially scheduled in the DRPs [see Fig. 4(b)]. We can observe that the execution time of the tasks is longer than the DRP reconfigurations (shown as a shaded R in the figure). Finally, we should also notice the performance penalty due to the reconfiguration overhead.

2) Scheduling With Configuration Prefetching: Configuration caching [27] and configuration prefetching [4] are well-known mechanisms in reconfigurable computing to hide the reconfiguration overhead. Configuration prefetching is based on the idea of loading the required configuration on a DRP before it is actually required, thus overlapping execution in a DRP with reconfiguration in a different DRP. In our approach, the configuration prefetching of a task can start when all its predecessor tasks have started their execution. For instance, the configuration prefetching of task T2 could start after task T1 has begun its execution. On the other hand, the execution of a task may start only when all of its predecessors have finished their execution (task T2 can start when task T1 has finished). As we can observe in Fig. 4(c), this technique completely hides the reconfiguration overhead for all DRP processors, thus improving the application performance. This approach is based on the idea that the task graph is executed only once (i.e., each task processes the whole input data set). The benefit of this approach is that it requires the minimum number of DRP reconfigurations (e.g., five reconfigurations in this example). However, this approach has two main drawbacks. First, the size of the shared memory buffers used for task communication is large (they must be able to store the maximum data size required by all the tasks). Second, the DRP processors wait for their incoming data during a significant amount of time (i.e., they have finished the reconfiguration but cannot start execution because the input streams are not yet in the shared memory buffers).

3) Scheduling With Configuration Prefetching and Data Partitioning: This approach tries to overcome the limitations of the previous approach. This solution also uses the concept of configuration prefetching, but the input data set is not processed all at once. Instead, the input data set is partitioned into several data blocks of a given size. This also means that the task graph must be iterated as many times as the number of input data blocks. In the example shown in Fig. 4(d), we can observe that the input data set has been partitioned into two data blocks (named 0 and 1) and that the task graph is iterated twice. This technique reduces the size of the shared memory buffers required for task communication. Moreover, we can also see that the latency from DRP reconfiguration to DRP execution is reduced. However, this approach has the drawback that it increases the number of reconfigurations, because the task graph must be iterated several times. For example, in Fig. 4(d), we can see that we now have nine reconfigurations compared with the five reconfigurations required in Fig. 4(c). In addition, this technique also impacts performance, since in this approach we cannot use reconfiguration prefetching between two iterations of the task graph. A small sketch quantifying this tradeoff is given below.
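The buffer-size versus reconfiguration-count tradeoff just described can be summarized with a short calculation. The sketch below is purely illustrative; the task count, data sizes, and configuration-reuse assumption are hypothetical and do not reproduce the exact schedule of Fig. 4.

# Estimate (reconfigurations, per-buffer size) for a given number of data blocks.

def partitioning_tradeoff(num_hw_tasks, total_data, num_blocks, reused_configs=0):
    """With a single block, each hardware task is configured once. With k blocks, the
    task graph is iterated k times; configurations that remain resident on a DRP
    between iterations (reused_configs per iteration) are not reloaded."""
    block_size = total_data / num_blocks
    reconfigs = num_hw_tasks + (num_blocks - 1) * (num_hw_tasks - reused_configs)
    return reconfigs, block_size

if __name__ == "__main__":
    # One block: 5 reconfigurations, but buffers must hold the full data set.
    print(partitioning_tradeoff(num_hw_tasks=5, total_data=1024, num_blocks=1))
    # Two blocks with one configuration reused across iterations: more reconfigurations,
    # half-sized buffers (qualitatively mirroring Fig. 4(c) vs. Fig. 4(d)).
    print(partitioning_tradeoff(num_hw_tasks=5, total_data=1024, num_blocks=2, reused_configs=1))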

Fig. 5. Example of how increasing the amount of processed data might help to minimize the reconfiguration overhead.

B. Model for the Reconfiguration Overhead

It has been demonstrated that the parameters of the reconfigurable architecture (i.e., number of DRP processors or reconfiguration time) have a direct impact on the performance obtained by the HW/SW partitioning process [11]. The partitioning process must take into account the reconfiguration time and the configuration prefetching technique for reconfiguration latency minimization. This is summarized in the following expression, which shows how the execution time of a task mapped to hardware must be modified to consider the reconfiguration overhead (see also Fig. 5):

    T'_exec(t_i) = T_exec(t_i) + P_reconf * max(0, T_reconf - T_avg)    (1)

where T_exec(t_i) is the execution time of task t_i without any reconfiguration overhead; P_reconf is the probability of reconfiguration, which is a function of the number of tasks mapped to hardware and the number of DRP processors; T_reconf is the reconfiguration time needed for a DRP processor to change its context (configuration); and T_avg is the average execution time of all tasks. In this paper, the execution time of a task includes the time required to: 1) read the data from memory; 2) process the data; and 3) write the processed data back to memory.

On the other hand, in the design of embedded systems, we would like to minimize the number of accesses to external memory, in order to reduce the overall system-level power consumption. Thus, data transfers between tasks should be kept to a size that fits into the on-chip L2 memory. In many streaming embedded applications, we can assume that the execution time of a given task implemented in hardware or software is proportional to the size of the data that must be processed. Thus, if the data are stored in on-chip memory with its smaller capacity, then the average execution time of the tasks will be smaller when compared with the reconfiguration time (we are assuming reconfiguration times on the order of 800 µs to 1.4 ms). If this is the case, and applying expression (1), we will have a significant reconfiguration overhead (because T_avg < T_reconf), which may prevent moving the task from software to hardware.

In order to overcome this limitation and reduce the reconfiguration overhead, we can increase the amount of data to be processed by each task. Increasing the amount of data means that we will be forced to use external memory. Using this approach, we increase the performance (because more tasks can be mapped to hardware), but we also increase the overall system-level power consumption. An example of this concept [see (1)] can be observed in Fig. 5, where we consider the execution of two tasks. In Fig. 5(b), we can see that, even using the reconfiguration prefetching technique, we cannot completely hide the reconfiguration overhead for task T2, since task T1 has a shorter execution time because it processes ten data units [see Fig. 5(a)]. As previously introduced, this can be improved by increasing the amount of processed data. In this example, we have increased the amount of processed data to twenty data units [see Fig. 5(c)], which increases the execution time of task T1 so that it equals the reconfiguration time of task T2, hence completely hiding the reconfiguration overhead [see Fig. 5(d)]. A numerical sketch of this effect is given below.
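The following sketch evaluates expression (1) as reconstructed above and reproduces the Fig. 5 effect of enlarging a task's data block until its execution time covers the reconfiguration time. All numeric values (per-unit processing time, reconfiguration time) are hypothetical.

# Illustrative model of the non-hidden reconfiguration overhead and of the minimum block
# size needed to hide it entirely under configuration prefetching.

def effective_exec_time(t_exec, p_reconf, t_reconf, t_avg):
    """Execution time of a hardware task including the non-hidden reconfiguration overhead."""
    return t_exec + p_reconf * max(0.0, t_reconf - t_avg)

def min_block_size(t_reconf, time_per_data_unit):
    """Smallest data block such that task execution time >= reconfiguration time."""
    return t_reconf / time_per_data_unit

if __name__ == "__main__":
    t_reconf = 1.0e-3    # 1 ms reconfiguration time (hypothetical)
    per_unit = 0.5e-4    # 50 us of processing per data unit (hypothetical)

    # Ten data units: execution (0.5 ms) is shorter than reconfiguration -> overhead remains.
    print(effective_exec_time(10 * per_unit, p_reconf=1.0, t_reconf=t_reconf, t_avg=10 * per_unit))
    # Twenty data units: execution (1.0 ms) equals reconfiguration -> overhead fully hidden.
    print(effective_exec_time(20 * per_unit, p_reconf=1.0, t_reconf=t_reconf, t_avg=20 * per_unit))
    print(min_block_size(t_reconf, per_unit))   # -> 20.0 data units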
C. Data Partitioning for Reconfigurable Architectures

How the input data set is partitioned will mainly drive the use of a given approach: 1) HW/SW partitioning for statically reconfigurable architectures using on-chip memory or 2) context scheduling for dynamically reconfigurable architectures using off-chip memory. Thus, the input streaming data set must be partitioned into several blocks, and the size of these blocks is mainly driven by the objectives of the application (i.e., power or performance).

Consequently, if the application objective is performance, then we should process large blocks of data, because we want to minimize the reconfiguration overhead. Moreover, if we are processing large blocks of data, it is more likely that these blocks do not fit in the on-chip L2 memory subsystem and we are forced to use off-chip memory, thus increasing the overall system-level power consumption. On the other hand, we have the situation where the application objective is low power. In this case, we must process small blocks of data so that they can be stored in the on-chip memory, thus minimizing the system-level power consumption. The drawback of this solution is that, if we process small blocks of data, then the reconfiguration overhead becomes more significant, and this might prevent mapping more tasks onto the run-time reconfigurable hardware. In summary, processing small blocks of data, which are stored in on-chip memory, reduces the power consumption but it also reduces the application performance. A sketch of this block-size decision is given below.
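The block-size decision just described can be expressed as a simple rule of thumb. The sketch below is not the authors' algorithm; the parameter values (image size, on-chip capacity, per-byte processing time, reconfiguration time) are hypothetical and only illustrate the two regimes.

# Performance objective: blocks large enough that execution covers one reconfiguration time
# (possibly spilling to off-chip DRAM). Power objective: largest block fitting in on-chip L2.

import math

def choose_block_size(objective, total_data, onchip_capacity, t_reconf, time_per_byte):
    if objective == "performance":
        block = math.ceil(t_reconf / time_per_byte)
        memory = "on-chip" if block <= onchip_capacity else "off-chip"
    else:  # "power"
        block = onchip_capacity
        memory = "on-chip"
    iterations = math.ceil(total_data / block)   # number of task-graph iterations
    return block, memory, iterations

if __name__ == "__main__":
    # Hypothetical: 1-MB image, 64-KB on-chip buffers, 1-ms reconfiguration, 10 ns/byte.
    print(choose_block_size("performance", 1024 * 1024, 64 * 1024, 1e-3, 10e-9))
    print(choose_block_size("power", 1024 * 1024, 64 * 1024, 1e-3, 10e-9))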

In addition, the on-chip/off-chip data partitioning determines the number of iterations of the task graph. This is shown in Fig. 6, where we can observe an example of an image-processing application with three different image sizes. In Fig. 6(b), we can observe that the size of the data blocks is large. The amount of data to be processed must be such that the task execution time at least equals the reconfiguration time. In this situation, the number of task-graph iterations is reduced; for example, with these large blocks, one of the input images requires only four iterations of the task graph. In the opposite case, we process small blocks of data [e.g., as shown in Fig. 6(c)], but we have a large number of iterations of the task graph; for example, with the smaller block size, one of the input images requires 16 iterations of the task graph. This example assumes that the data are partitioned into square blocks, but the input data set (e.g., an image) could also have been partitioned into blocks of several rows or columns. The authors would like to clarify at this point that the techniques proposed in this paper might not apply to all kinds of streaming applications. There are other types of applications where this idea of block-based data partitioning and processing is not possible (e.g., video coding).

Fig. 6. (a) Initial task graph. (b) Data partitioning for dynamically reconfigurable architectures. (c) Data partitioning for HW/SW partitioning.

Fig. 7. Image sharpening benchmarks. (a) Unsharp masking. (b) Sobel filter. (c) Laplacian filter.

VI. EXPERIMENTS AND RESULTS

A. Image Sharpening Benchmarks

The proposed dynamically reconfigurable architecture addresses streaming, computationally intensive embedded applications, that is, applications with a large amount of data-level parallelism. It is not the goal of the proposed architecture to address control-dominated applications. Image-processing applications are a good example of the type of applications that we are addressing. This kind of application is becoming more and more sensitive to power consumption, especially if we consider the increasing market share of digital cameras and mobile phones with embedded cameras, which require this type of image-processing technique. In this sense, we have selected three benchmarks that implement an image sharpening application (see Fig. 7). The three benchmarks follow the same basic process: 1) transform the input image from RGB to YCrCb color space; 2) improve the image quality by processing the luminance (mainly using sliding-window operations like 3x3 linear convolutions); and 3) transform from YCrCb back to RGB color space. Three different input data sets (image sizes) have been used in the experiments.

Fig. 8. Galapagos prototyping platform.

B. Prototype Implementation

A prototype of the proposed architecture has been designed and implemented. The Galapagos system is a PCI-based system (64 b/66 MHz). It is based on leading-edge FPGAs from Xilinx and high-bandwidth DDR SDRAM memory (see the left-hand side of Fig. 8). This reconfigurable system is based on a Virtex-II Pro device. The device used is a XC2VP20, which includes two PowerPC processors. The dynamic scheduling unit (DSU, in Fig. 1) and the data prefetch units of the L2 memory subsystem [see Fig. 2(b)] have been mapped to the Virtex-II Pro device, which also includes the SDRAM memory controller. The design of these blocks has been done in Verilog HDL, and the implementation has been done using Synplicity (synthesis) and Xilinx (place-and-route) tools. The DRP processors of our architecture are implemented in the Galapagos system using three Virtex-II devices (i.e., XC2V1000). The load and store units have been implemented using Virtex-II on-chip memory. The size of the buffers in the load/store units is 2 KB per buffer (i.e., 4 KB per unit). The width of the memory words is 64 b. Fig. 8 shows a picture of the Galapagos system in a PC environment. A sketch of the sliding-window operation used in the benchmarks is given below.
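For readers unfamiliar with the benchmark kernels, the following Python sketch shows the 3x3 sliding-window (linear convolution) step applied to the luminance channel. The hardware tasks themselves are implemented in Verilog; this software version and the example kernel are assumptions used only to show the computation pattern.

# Apply a 3x3 kernel to a 2-D list of luminance values (borders are left unchanged).

def convolve3x3(image, kernel):
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            acc = 0
            for ky in range(3):
                for kx in range(3):
                    acc += kernel[ky][kx] * image[y + ky - 1][x + kx - 1]
            out[y][x] = min(255, max(0, int(acc)))   # clamp to 8-bit luminance range
    return out

# A commonly used sharpening kernel (hypothetical choice, not taken from the paper).
SHARPEN = [[ 0, -1,  0],
           [-1,  5, -1],
           [ 0, -1,  0]]

if __name__ == "__main__":
    tiny = [[10 * (x + y) % 256 for x in range(8)] for y in range(8)]
    print(convolve3x3(tiny, SHARPEN)[4][4])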

Fig. 9. Final placed-and-routed task on three Virtex-II devices: (a) XC2V1000, (b) XC2V500, and (c) XC2V250.

Fig. 10. HW/SW task execution time.

C. Task Performance Results

Fig. 10 shows the execution time of the tasks for the unsharp masking application running on: 1) an embedded processor, a PowerPC405 (300 MHz), processing small blocks of data; 2) a DRP processor from the Galapagos system (60 MHz) processing small blocks of data; and 3) a DRP processor processing larger blocks of data. It is interesting to note the order-of-magnitude improvement obtained in the implementation of the blur task (3x3 linear convolution). It is not the objective of this paper to explain the details of the implementation of the several tasks in hardware. These tasks have been designed in Verilog HDL, simulated using ModelSim, and implemented using Synplicity (synthesis) and Xilinx (place-and-route) tools.

In order to reduce the reconfiguration overhead, we have used the partial reconfiguration capability of the Virtex-II devices [26]. In this sense, the Virtex-II resources used by the hardware tasks have been constrained to the center of the device, where we time-multiplex the required task (see Fig. 9). The left and right sides of the device are used by the DRP's load and store units, which are not run-time reconfigured [see Fig. 2(a)]. We have implemented the DRP processors in three different Xilinx Virtex-II devices (i.e., XC2V250, XC2V500, and XC2V1000), which mainly differ in the amount of hardware area available to the reconfigurable unit. Using this capability of the Virtex-II devices, we have obtained, using a reconfiguration clock of 66 MHz, the following average reconfiguration times for the three devices: a) 949 µs for a XC2V250; b) 1087 µs for a XC2V500; and c) 1337 µs for a XC2V1000. A rough estimate of how such reconfiguration times arise is sketched below.
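As a back-of-the-envelope check, partial reconfiguration time is roughly the partial bitstream size divided by the configuration port bandwidth. The sketch below is not from the paper: the 1-byte-per-cycle port width and the example bitstream size are assumptions used only to show that the measured averages above have the expected order of magnitude.

# Estimated time (seconds) to load a partial bitstream through the configuration port.

def reconfig_time_estimate(bitstream_bytes, clock_hz=66e6, bytes_per_cycle=1):
    return bitstream_bytes / (bytes_per_cycle * clock_hz)

if __name__ == "__main__":
    # A hypothetical ~63-KB partial bitstream at 66 MHz, 1 byte per cycle -> roughly 0.95 ms.
    print(reconfig_time_estimate(63_000))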

Fig. 11. Hardware/software task power consumption.

D. Task Power Results

Fig. 11 shows the power consumption of a Galapagos DRP in its several states using on-chip or off-chip memory. Moreover, we can also observe the power consumption of the embedded PowerPC405 processor, which is used to execute the tasks mapped to software. In Fig. 11, we give three values for the DRP's power consumption (a different value for each Xilinx Virtex-II device). These power consumption values for the several Virtex-II devices have been obtained using XPower, which is the power estimation tool from Xilinx. Moreover, using XPower, we have estimated the power consumption of the on-chip memory. Finally, the power consumption of the off-chip memory (i.e., external DRAM) has been obtained from Micron datasheets. We have used two memory chips of 64 MB running at 100 MHz. In the following paragraphs, we explain the DRP processor's power consumption in its several states.

The power consumption in the idle/wait state represents: 1) the static (i.e., leakage) power associated with a complete DRP processor (i.e., load, store, and reconfigurable units) and 2) the static power taken by the on-chip or off-chip memory resources. Clearly, the static power increases when we: 1) use external memory or 2) increase the size of the device (i.e., increase the hardware area). The power in the reconfiguration state includes: 1) the power of the DRP processor itself (from Xilinx, we have obtained that this power consumption is mainly driven by the device leakage power); 2) the dynamic power consumption of the L2 configuration prefetch unit; and 3) the dynamic power of the on-chip or off-chip memory resources (in the latter case, we also include the power consumption of the I/O buffers). It is interesting to note that the dynamic power taken by the Xilinx Virtex-II devices during the reconfiguration process can be ignored; the static (i.e., leakage) power is so dominant that the dynamic power amounts to noise. Keep in mind that, during the reconfiguration process, only a minor amount of logic is actually switching, since the reconfiguration context (i.e., bitstream) is sequentially loaded into the reconfigurable hardware. The power consumption in the execution state accounts for: 1) the static and dynamic power of the full DRP processor (i.e., load, store, and reconfigurable units); 2) the dynamic power of the L2 data prefetch units; and 3) the power consumption of the associated (i.e., on-chip or off-chip) memory resources. As in the previous case, when dealing with external memory, we also take into account the power consumption of the I/O buffers (e.g., LVTTL 3.3 V). The DRP power consumption in execution is an average power obtained when the tasks of the unsharp masking application run at 60 MHz. This average power consumption has been obtained by performing a gate-level accurate simulation after the place-and-route process for all the tasks. Finally, let us briefly explain the power consumption of the embedded CPU. According to Xilinx, the PowerPC405 takes 0.9 mW/MHz. Assuming a clock frequency of 300 MHz, we obtain a power consumption of 270 mW. We should also add here the power consumption of the data prefetch units attached to the embedded processor.

E. Energy-Performance Tradeoff Results

In this subsection, we explain the energy-performance tradeoff results obtained when applying the proposed configuration-aware data-partitioning technique. The performance results have been obtained from real executions on the Galapagos system. The execution generates a log file with the state changes of the Virtex-II devices and the embedded PowerPC. We have obtained the energy from: 1) the power consumption of the components as described in Fig. 11 and 2) the execution log file, which gives the amount of time that a device has spent in each state, as the sketch below illustrates.
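The energy derivation just described amounts to summing, over the intervals of the state-change log, the per-state power multiplied by the time spent in that state. The sketch below illustrates this; the state names and the numeric power values are hypothetical placeholders, not the measured values of Fig. 11.

# Energy = sum over log intervals of (state power x interval duration).

POWER_W = {            # hypothetical per-state power values (watts)
    "idle": 0.10,
    "reconfiguration": 0.25,
    "execution": 0.60,
}

def energy_from_log(log, power=POWER_W):
    """log: list of (timestamp_seconds, state) entries sorted by time; the last entry closes the run."""
    energy = 0.0
    for (t0, state), (t1, _) in zip(log, log[1:]):
        energy += power[state] * (t1 - t0)   # joules accumulated while in 'state'
    return energy

if __name__ == "__main__":
    example_log = [(0.000, "reconfiguration"), (0.001, "execution"),
                   (0.004, "idle"), (0.005, "idle")]
    print(energy_from_log(example_log))   # 0.25*0.001 + 0.60*0.003 + 0.10*0.001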
Fig. 12. Unsharp masking HW/SW task partitioning.

Fig. 13(a) shows the performance results and Fig. 13(b) shows the energy consumption results for the unsharp masking application. In all plots, we show the obtained results as we change the target device (i.e., we show the results for the three Virtex-II devices). In addition, we present the following four implementations.

Software implementation (named seq_sw): this implementation is based on the use of the embedded PowerPC405 (recall that the associated performance and power results are shown in Figs. 10 and 11, respectively). In this experiment, we assume that the input images have been partitioned into small blocks of pixels; since the input image is partitioned into several blocks, we must iterate the task graph several times (for example, 16 times for one of the image sizes).

HW/SW partitioning (named seq_hw_sw): in this approach, we use on-chip memory, since we process small data blocks. Moreover, we have used the HW/SW partitioning algorithm proposed in [11], assuming two or three DRP processors and the average reconfiguration times introduced in the previous subsection. The obtained partitioning can be observed in Fig. 12, where we see that the reconfiguration overhead prevents us from moving into HW more tasks than the number of available DRP processors.

Dynamic reconfiguration (named seq_dr): in this case, we increase the size of the data blocks to be processed, which means that we must use off-chip memory (i.e., external DRAM). This amount of data makes the tasks' execution time much closer to the DRP reconfiguration time. As a result, when we apply the HW/SW partitioning algorithm, all tasks are mapped to the reconfigurable hardware.

Hardware implementation (named seq_hw): this approach assumes that: 1) we use five DRP processors and 2) we use on-chip memory, since we process small blocks of data. This should be considered the optimum solution in terms of both power and performance, since: 1) there is no reconfiguration overhead (i.e., we have as many DRPs as tasks) and 2) we use on-chip memory.

In Fig. 13(a), we show the performance that we have obtained using the four implementations. We can observe that the software implementation (i.e., the PowerPC405-based solution) obtains the worst performance results. The use of the HW/SW partitioning approach contributes to a major improvement in performance, since critical tasks are mapped to the configurable hardware.

Fig. 13. Unsharp masking application. (a) Performance results. (b) Energy results.

Obviously, increasing the number of DRP processors helps to improve performance, since more tasks are implemented in hardware (i.e., a 29% improvement when moving from two to three DRP processors). Moreover, it is clear that the reconfiguration time does not affect this approach, since there are no reconfigurations. The dynamic reconfiguration technique helps to improve performance even more. Dynamic reconfiguration improves on the HW/SW partitioning approach by: 1) 62.5% when using two DRP processors and 2) 47.3% when using three DRP processors. In addition, dynamic reconfiguration improves on the solution based on the embedded CPU by 83.14%. Finally, it is worth mentioning that, in the unsharp masking benchmark, the dynamic reconfiguration approach does not benefit from increasing the number of DRP processors (i.e., we obtain the same results in both situations). Since we are using the unmodified linear task graph, two DRP processors are enough to completely hide the reconfiguration overhead (i.e., one DRP processor is reconfiguring while the other one is executing).

On the other hand, Fig. 13(b) shows the energy consumption for all four approaches. It is clear that the solution based on the embedded CPU is the approach that consumes the largest amount of energy. Despite using on-chip memory and requiring the minimum amount of power (see Fig. 11), the long execution time of the tasks implemented in the PowerPC405 leads to this large energy consumption. Obviously, the hardware-based approach is the optimum solution in terms of energy consumption, thanks to the use of on-chip memory and the short execution times, which do not have any reconfiguration overhead. In between, we have the results for the mixed HW/SW and dynamic reconfiguration approaches. We must first observe, in both approaches, that the energy increases when: 1) having fixed the number of DRP processors, we increase the size of the reconfigurable unit (e.g., we move from two XC2V250 to two XC2V500 devices) or 2) having fixed a given Virtex-II device, we increase the number of DRP processors (i.e., we move from two to three DRP processors). In both situations, this increase in energy is due to the increase in static (i.e., idle) leakage power that comes with the increase in hardware area.

From Fig. 13(b), we can observe that, independently of the number of DRP processors, the mixed HW/SW solution requires less energy than the dynamic reconfiguration approach does; that is, the dynamic reconfiguration approach, despite its performance advantages, requires more energy due to its higher power requirements, which come from the use of off-chip memory (in the calculation of the energy taken by the dynamic reconfiguration approach, we assume that we can completely power off the embedded CPU, i.e., we do not consider the leakage power of the PowerPC). It is interesting to note that the dynamic reconfiguration approach has the same energy requirements for execution and reconfiguration as in the case where we use two DRP processors. In summary, from Fig. 13(b) we obtain that both solutions based on configurable logic give an average 43% energy reduction when compared with the energy required by the embedded CPU implementation. This energy improvement can reach up to 60%. Moreover, HW/SW partitioning improves on the dynamic reconfiguration approach, in terms of energy consumption, by 16.4% when using two DRP processors and 35% when using three DRP processors.

VII. CONCLUSION

In this paper, we have explored the system-level power-performance tradeoffs for fine-grained reconfigurable computing. We have proposed a configuration-aware data-partitioning technique for reconfigurable architectures, and we have shown how the reconfiguration overhead directly impacts this data-partitioning process. When targeting many streaming applications (like the image-processing applications considered here), we have shown that the choice of approach (i.e., HW/SW partitioning for statically reconfigurable logic or context scheduling for dynamically reconfigurable architectures) depends on the application requirements (i.e., power or performance). Thus, in this type of application, if the objective is energy efficiency, then HW/SW partitioning for statically reconfigurable logic is the most favorable solution. On the other hand, if the application objective is performance, then context scheduling for dynamically reconfigurable architectures is the optimum solution.

Finally, future work includes the study of the same tradeoffs in a mixed environment, where HW/SW partitioning could be used together with context scheduling for dynamically reconfigurable architectures. Other topics of future research include applying the techniques proposed in this paper to other types of embedded applications and proposing a detailed implementation for the L2 memory subsystem.

REFERENCES

[1] L. Benini, A. Bogliolo, and G. De Micheli, A survey of design techniques for system-level dynamic power management, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 8, no. 3, pp., Jun.
[2] K. Chatta and R. Vemuri, Hardware-software co-design for dynamically reconfigurable architectures, in Proc. FPL, 1999, pp.
[3] V. George, H. Zhang, and J. Rabaey, The design of a low energy FPGA, in Proc. Int. Symp. ISLPED, 1999, pp.
[4] S. Hauck, Configuration prefetch for single context reconfigurable coprocessors, in Proc. ACM Int. Symp. FPGA, 1998, pp.
[5] R. Hartenstein, A decade of reconfigurable computing: A visionary retrospective, in Proc. DATE, 2001, pp.
[6] B. Jeong, Hardware-software co-synthesis for run-time incrementally reconfigurable FPGAs, in Proc. ASP-DAC, 2000, pp.
[7] F. Li, D. Chen, L. He, and J. Cong, Architecture evaluation for power efficient FPGAs, in Proc. ACM Int. Symp. FPGA, 2003, pp.
[8] Y. Li, Hardware-software co-design of embedded reconfigurable architectures, in Proc. DAC, 2000, pp.
[9] R. Maestre, A framework for reconfigurable computing: Task scheduling and context management, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 9, no. 6, pp., Dec.
[10] R. Maestre, Configuration management in multi-context reconfigurable systems for simultaneous performance and power optimizations, in Proc. ISSS, 2000, pp.
[11] J. Noguera and R. M. Badia, HW/SW co-design techniques for dynamically reconfigurable architectures, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 10, no. 4, pp., Aug.
[12] J. Noguera and R. M. Badia, System-level power-performance trade-offs in task scheduling for dynamically reconfigurable architectures, in Proc. CASES, 2003, pp.
[13] J. Noguera and R. M. Badia, Multitasking on reconfigurable architectures: Micro-architecture support and dynamic scheduling, in Proc. ACM TECS, 2004, pp.
[14] J. Noguera and R. M. Badia, Power-performance trade-offs for reconfigurable computing, in Proc. CODES + ISSS, 2004, pp.
[15] K. W. Poon, A. Yan, and S. J. E. Wilton, A flexible power model for FPGAs, in Proc. 12th Int. Conf. Field-Programmable Logic Appl. (FPL), 2002, pp.
[16] K. Purna and D. Bhatia, Temporal partitioning and scheduling data flow graphs for reconfigurable computers, IEEE Trans. Computers, vol. 48, no. 6, pp., Jun.
[17] B. E. Saglam (Akgul) and V. Mooney, System-on-a-chip processor synchronization support in hardware, in Proc. DATE, 2001, pp.
[18] M. Sánchez-Élez, A complete data scheduler for multi-context reconfigurable architectures, in Proc. DATE, 2002, pp.
[19] L. Shang, A. S. Kaviani, and K. Bathala, Dynamic power consumption in Virtex-II FPGA family, in Proc. Int. Symp. FPGA (FPGA), 2002, pp.
[20] G. Stitt, F. Vahid, and S. Nemetebaksh, Energy savings and speedups from partitioning critical software loops to hardware in embedded systems, in Proc. ACM TECS, 2004, pp.
[21] S. Trimberger, D. Carberry, A. Johnson, and J. Wong, A time-multiplexed FPGA, in Proc. 5th IEEE Symp. Field-Programmable Custom Computing Machines (FCCM), 1997, pp.
[22] O. S. Unsal and I. Koren, System-level power-aware design techniques in real-time systems, Proc. IEEE, vol. 91, pp., Jul.
[23] M. Vasilko and D. Ait-Boudaoud, Scheduling for dynamically reconfigurable FPGAs, in Proc. Int. Workshop Logic Arch. Synthesis (IFIP TC10 WG10.5), 1995, pp.
[24] K. Weiß, C. Oetker, I. Katchan, T. Steckstor, and W. Katchan, Power estimation approach for SRAM-based FPGAs, in Proc. 8th ACM Int. Symp. Field-Programmable Gate Arrays (FPGA), 2000, pp.
[25] M. J. Wirthlin and B. L. Hutchings, Improving functional density through run-time circuit reconfiguration, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 6, no. 2, pp., Jun.
[26] Two Flows for Partial Reconfiguration: Module Based or Small Bit Manipulations, Xilinx Corp., San Jose, CA, 2005, Xilinx Application Note XAPP290.
[27] Z. Li, K. Compton, and S. Hauck, Configuration caching management techniques for reconfigurable computing, in Proc. 8th IEEE Symp. Field-Programmable Custom Computing Machines, 2000, pp.

Juanjo Noguera received the B.Sc. degree in computer science from the Autonomous University of Barcelona, Barcelona, Spain, in 1997, and the Ph.D. degree in computer science from the Technical University of Catalonia, Barcelona, Spain. He has worked for the Spanish National Center for Microelectronics, the Technical University of Catalonia, and the Hewlett-Packard Inkjet Commercial Division. In January 2006, he joined the Xilinx Research Labs, Dublin, Ireland. His interests include system-level design, reconfigurable architectures, and low-power design techniques. He has published papers in international journals and conference proceedings.

Rosa M. Badia received the B.Sc. and Ph.D. degrees in computer science from the Technical University of Catalonia, Barcelona, Spain, in 1989 and 1994, respectively. She is currently an Associate Professor in the Computer Architecture Department of the Technical University of Catalonia, and Project Manager at the Barcelona Supercomputing Center, Barcelona, Spain. Her interests include CAD tools for VLSI, reconfigurable architectures, performance prediction and analysis of message passing applications, and GRID computing. She has published papers in international journals and conference proceedings.


More information

AN OCM BASED SHARED MEMORY CONTROLLER FOR VIRTEX 4. Bas Breijer, Filipa Duarte, and Stephan Wong

AN OCM BASED SHARED MEMORY CONTROLLER FOR VIRTEX 4. Bas Breijer, Filipa Duarte, and Stephan Wong AN OCM BASED SHARED MEMORY CONTROLLER FOR VIRTEX 4 Bas Breijer, Filipa Duarte, and Stephan Wong Computer Engineering, EEMCS Delft University of Technology Mekelweg 4, 2826CD, Delft, The Netherlands email:

More information

Hardware Software Codesign of Embedded System

Hardware Software Codesign of Embedded System Hardware Software Codesign of Embedded System CPSC489-501 Rabi Mahapatra Mahapatra - Texas A&M - Fall 00 1 Today s topics Course Organization Introduction to HS-CODES Codesign Motivation Some Issues on

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Massively Parallel Computing on Silicon: SIMD Implementations. V.M.. Brea Univ. of Santiago de Compostela Spain

Massively Parallel Computing on Silicon: SIMD Implementations. V.M.. Brea Univ. of Santiago de Compostela Spain Massively Parallel Computing on Silicon: SIMD Implementations V.M.. Brea Univ. of Santiago de Compostela Spain GOAL Give an overview on the state-of of-the- art of Digital on-chip CMOS SIMD Solutions,

More information

Soft-Core Embedded Processor-Based Built-In Self- Test of FPGAs: A Case Study

Soft-Core Embedded Processor-Based Built-In Self- Test of FPGAs: A Case Study Soft-Core Embedded Processor-Based Built-In Self- Test of FPGAs: A Case Study Bradley F. Dutton, Graduate Student Member, IEEE, and Charles E. Stroud, Fellow, IEEE Dept. of Electrical and Computer Engineering

More information

of Soft Core Processor Clock Synchronization DDR Controller and SDRAM by Using RISC Architecture

of Soft Core Processor Clock Synchronization DDR Controller and SDRAM by Using RISC Architecture Enhancement of Soft Core Processor Clock Synchronization DDR Controller and SDRAM by Using RISC Architecture Sushmita Bilani Department of Electronics and Communication (Embedded System & VLSI Design),

More information

An Approach for Adaptive DRAM Temperature and Power Management

An Approach for Adaptive DRAM Temperature and Power Management IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 An Approach for Adaptive DRAM Temperature and Power Management Song Liu, Yu Zhang, Seda Ogrenci Memik, and Gokhan Memik Abstract High-performance

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Implementing Logic in FPGA Memory Arrays: Heterogeneous Memory Architectures

Implementing Logic in FPGA Memory Arrays: Heterogeneous Memory Architectures Implementing Logic in FPGA Memory Arrays: Heterogeneous Memory Architectures Steven J.E. Wilton Department of Electrical and Computer Engineering University of British Columbia Vancouver, BC, Canada, V6T

More information

LECTURE 5: MEMORY HIERARCHY DESIGN

LECTURE 5: MEMORY HIERARCHY DESIGN LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive

More information

Hardware-Software Codesign. 1. Introduction

Hardware-Software Codesign. 1. Introduction Hardware-Software Codesign 1. Introduction Lothar Thiele 1-1 Contents What is an Embedded System? Levels of Abstraction in Electronic System Design Typical Design Flow of Hardware-Software Systems 1-2

More information

Using Dynamic Voltage Scaling to Reduce the Configuration Energy of Run Time Reconfigurable Devices

Using Dynamic Voltage Scaling to Reduce the Configuration Energy of Run Time Reconfigurable Devices Using Dynamic Voltage Scaling to Reduce the Configuration Energy of Run Time Reconfigurable Devices Yang Qu 1, Juha-Pekka Soininen 1 and Jari Nurmi 2 1 Technical Research Centre of Finland (VTT), Kaitoväylä

More information

A Low Power Asynchronous FPGA with Autonomous Fine Grain Power Gating and LEDR Encoding

A Low Power Asynchronous FPGA with Autonomous Fine Grain Power Gating and LEDR Encoding A Low Power Asynchronous FPGA with Autonomous Fine Grain Power Gating and LEDR Encoding N.Rajagopala krishnan, k.sivasuparamanyan, G.Ramadoss Abstract Field Programmable Gate Arrays (FPGAs) are widely

More information

Embedded Systems. 7. System Components

Embedded Systems. 7. System Components Embedded Systems 7. System Components Lothar Thiele 7-1 Contents of Course 1. Embedded Systems Introduction 2. Software Introduction 7. System Components 10. Models 3. Real-Time Models 4. Periodic/Aperiodic

More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

FPGA Provides Speedy Data Compression for Hyperspectral Imagery

FPGA Provides Speedy Data Compression for Hyperspectral Imagery FPGA Provides Speedy Data Compression for Hyperspectral Imagery Engineers implement the Fast Lossless compression algorithm on a Virtex-5 FPGA; this implementation provides the ability to keep up with

More information

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency with ML accelerators Michael

More information

Real Time NoC Based Pipelined Architectonics With Efficient TDM Schema

Real Time NoC Based Pipelined Architectonics With Efficient TDM Schema Real Time NoC Based Pipelined Architectonics With Efficient TDM Schema [1] Laila A, [2] Ajeesh R V [1] PG Student [VLSI & ES] [2] Assistant professor, Department of ECE, TKM Institute of Technology, Kollam

More information

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building

More information

Multi MicroBlaze System for Parallel Computing

Multi MicroBlaze System for Parallel Computing Multi MicroBlaze System for Parallel Computing P.HUERTA, J.CASTILLO, J.I.MÁRTINEZ, V.LÓPEZ HW/SW Codesign Group Universidad Rey Juan Carlos 28933 Móstoles, Madrid SPAIN Abstract: - Embedded systems need

More information

Hardware/Software Co-design

Hardware/Software Co-design Hardware/Software Co-design Zebo Peng, Department of Computer and Information Science (IDA) Linköping University Course page: http://www.ida.liu.se/~petel/codesign/ 1 of 52 Lecture 1/2: Outline : an Introduction

More information

Co-synthesis and Accelerator based Embedded System Design

Co-synthesis and Accelerator based Embedded System Design Co-synthesis and Accelerator based Embedded System Design COE838: Embedded Computer System http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer

More information

Abstract. 1 Introduction. Reconfigurable Logic and Hardware Software Codesign. Class EEC282 Author Marty Nicholes Date 12/06/2003

Abstract. 1 Introduction. Reconfigurable Logic and Hardware Software Codesign. Class EEC282 Author Marty Nicholes Date 12/06/2003 Title Reconfigurable Logic and Hardware Software Codesign Class EEC282 Author Marty Nicholes Date 12/06/2003 Abstract. This is a review paper covering various aspects of reconfigurable logic. The focus

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints

Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints Amit Kulkarni, Tom Davidson, Karel Heyse, and Dirk Stroobandt ELIS department, Computer Systems Lab, Ghent

More information

Long Term Trends for Embedded System Design

Long Term Trends for Embedded System Design Long Term Trends for Embedded System Design Ahmed Amine JERRAYA Laboratoire TIMA, 46 Avenue Félix Viallet, 38031 Grenoble CEDEX, France Email: Ahmed.Jerraya@imag.fr Abstract. An embedded system is an application

More information

Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS

Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Who am I? Education Master of Technology, NTNU, 2007 PhD, NTNU, 2010. Title: «Managing Shared Resources in Chip Multiprocessor Memory

More information

FPGA Implementation and Validation of the Asynchronous Array of simple Processors

FPGA Implementation and Validation of the Asynchronous Array of simple Processors FPGA Implementation and Validation of the Asynchronous Array of simple Processors Jeremy W. Webb VLSI Computation Laboratory Department of ECE University of California, Davis One Shields Avenue Davis,

More information

DESIGN OF EFFICIENT ROUTING ALGORITHM FOR CONGESTION CONTROL IN NOC

DESIGN OF EFFICIENT ROUTING ALGORITHM FOR CONGESTION CONTROL IN NOC DESIGN OF EFFICIENT ROUTING ALGORITHM FOR CONGESTION CONTROL IN NOC 1 Pawar Ruchira Pradeep M. E, E&TC Signal Processing, Dr. D Y Patil School of engineering, Ambi, Pune Email: 1 ruchira4391@gmail.com

More information

ISSN Vol.05,Issue.09, September-2017, Pages:

ISSN Vol.05,Issue.09, September-2017, Pages: WWW.IJITECH.ORG ISSN 2321-8665 Vol.05,Issue.09, September-2017, Pages:1693-1697 AJJAM PUSHPA 1, C. H. RAMA MOHAN 2 1 PG Scholar, Dept of ECE(DECS), Shirdi Sai Institute of Science and Technology, Anantapuramu,

More information

Energy Aware Optimized Resource Allocation Using Buffer Based Data Flow In MPSOC Architecture

Energy Aware Optimized Resource Allocation Using Buffer Based Data Flow In MPSOC Architecture ISSN (Online) : 2319-8753 ISSN (Print) : 2347-6710 International Journal of Innovative Research in Science, Engineering and Technology Volume 3, Special Issue 3, March 2014 2014 International Conference

More information

IMPLEMENTATION OF DISTRIBUTED CANNY EDGE DETECTOR ON FPGA

IMPLEMENTATION OF DISTRIBUTED CANNY EDGE DETECTOR ON FPGA IMPLEMENTATION OF DISTRIBUTED CANNY EDGE DETECTOR ON FPGA T. Rupalatha 1, Mr.C.Leelamohan 2, Mrs.M.Sreelakshmi 3 P.G. Student, Department of ECE, C R Engineering College, Tirupati, India 1 Associate Professor,

More information

Implementation of A Optimized Systolic Array Architecture for FSBMA using FPGA for Real-time Applications

Implementation of A Optimized Systolic Array Architecture for FSBMA using FPGA for Real-time Applications 46 IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.3, March 2008 Implementation of A Optimized Systolic Array Architecture for FSBMA using FPGA for Real-time Applications

More information

Pilot: A Platform-based HW/SW Synthesis System

Pilot: A Platform-based HW/SW Synthesis System Pilot: A Platform-based HW/SW Synthesis System SOC Group, VLSI CAD Lab, UCLA Led by Jason Cong Zhong Chen, Yiping Fan, Xun Yang, Zhiru Zhang ICSOC Workshop, Beijing August 20, 2002 Outline Overview The

More information

the main limitations of the work is that wiring increases with 1. INTRODUCTION

the main limitations of the work is that wiring increases with 1. INTRODUCTION Design of Low Power Speculative Han-Carlson Adder S.Sangeetha II ME - VLSI Design, Akshaya College of Engineering and Technology, Coimbatore sangeethasoctober@gmail.com S.Kamatchi Assistant Professor,

More information

Implementation of Asynchronous Topology using SAPTL

Implementation of Asynchronous Topology using SAPTL Implementation of Asynchronous Topology using SAPTL NARESH NAGULA *, S. V. DEVIKA **, SK. KHAMURUDDEEN *** *(senior software Engineer & Technical Lead, Xilinx India) ** (Associate Professor, Department

More information

Design Partitioning Methodology for Systems on Programmable Chip

Design Partitioning Methodology for Systems on Programmable Chip Design Partitioning Methodology for Systems on Programmable Chip Abdo Azibi and Ramzi Ayadi Department of Electronics College of Technology at Alkharj, Saudi Arabia Email: aazibi, amzi.ayadi@tvtc.gov.sa

More information

Lecture 41: Introduction to Reconfigurable Computing

Lecture 41: Introduction to Reconfigurable Computing inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture 41: Introduction to Reconfigurable Computing Michael Le, Sp07 Head TA April 30, 2007 Slides Courtesy of Hayden So, Sp06 CS61c Head TA Following

More information

FPGA Implementation of Multiplierless 2D DWT Architecture for Image Compression

FPGA Implementation of Multiplierless 2D DWT Architecture for Image Compression FPGA Implementation of Multiplierless 2D DWT Architecture for Image Compression Divakara.S.S, Research Scholar, J.S.S. Research Foundation, Mysore Cyril Prasanna Raj P Dean(R&D), MSEC, Bangalore Thejas

More information

Reconfigurable Computing. Introduction

Reconfigurable Computing. Introduction Reconfigurable Computing Tony Givargis and Nikil Dutt Introduction! Reconfigurable computing, a new paradigm for system design Post fabrication software personalization for hardware computation Traditionally

More information

Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing

Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing Walter Stechele, Stephan Herrmann, Andreas Herkersdorf Technische Universität München 80290 München Germany Walter.Stechele@ei.tum.de

More information

Microelectronics. Moore s Law. Initially, only a few gates or memory cells could be reliably manufactured and packaged together.

Microelectronics. Moore s Law. Initially, only a few gates or memory cells could be reliably manufactured and packaged together. Microelectronics Initially, only a few gates or memory cells could be reliably manufactured and packaged together. These early integrated circuits are referred to as small-scale integration (SSI). As time

More information

A hardware operating system kernel for multi-processor systems

A hardware operating system kernel for multi-processor systems A hardware operating system kernel for multi-processor systems Sanggyu Park a), Do-sun Hong, and Soo-Ik Chae School of EECS, Seoul National University, Building 104 1, Seoul National University, Gwanakgu,

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Interfacing a High Speed Crypto Accelerator to an Embedded CPU

Interfacing a High Speed Crypto Accelerator to an Embedded CPU Interfacing a High Speed Crypto Accelerator to an Embedded CPU Alireza Hodjat ahodjat @ee.ucla.edu Electrical Engineering Department University of California, Los Angeles Ingrid Verbauwhede ingrid @ee.ucla.edu

More information

Verification of Multiprocessor system using Hardware/Software Co-simulation

Verification of Multiprocessor system using Hardware/Software Co-simulation Vol. 2, 85 Verification of Multiprocessor system using Hardware/Software Co-simulation Hassan M Raza and Rajendra M Patrikar Abstract--Co-simulation for verification has recently been introduced as an

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Design of a System-on-Chip Switched Network and its Design Support Λ

Design of a System-on-Chip Switched Network and its Design Support Λ Design of a System-on-Chip Switched Network and its Design Support Λ Daniel Wiklund y, Dake Liu Dept. of Electrical Engineering Linköping University S-581 83 Linköping, Sweden Abstract As the degree of

More information

Embedded Real-Time Video Processing System on FPGA

Embedded Real-Time Video Processing System on FPGA Embedded Real-Time Video Processing System on FPGA Yahia Said 1, Taoufik Saidani 1, Fethi Smach 2, Mohamed Atri 1, and Hichem Snoussi 3 1 Laboratory of Electronics and Microelectronics (EμE), Faculty of

More information

Design Issues in Hardware/Software Co-Design

Design Issues in Hardware/Software Co-Design Volume-2, Issue-1, January-February, 2014, pp. 01-05, IASTER 2013 www.iaster.com, Online: 2347-6109, Print: 2348-0017 ABSTRACT Design Issues in Hardware/Software Co-Design R. Ganesh Sr. Asst. Professor,

More information

Lecture 7: Introduction to Co-synthesis Algorithms

Lecture 7: Introduction to Co-synthesis Algorithms Design & Co-design of Embedded Systems Lecture 7: Introduction to Co-synthesis Algorithms Sharif University of Technology Computer Engineering Dept. Winter-Spring 2008 Mehdi Modarressi Topics for today

More information

Performance Improvements of Microprocessor Platforms with a Coarse-Grained Reconfigurable Data-Path

Performance Improvements of Microprocessor Platforms with a Coarse-Grained Reconfigurable Data-Path Performance Improvements of Microprocessor Platforms with a Coarse-Grained Reconfigurable Data-Path MICHALIS D. GALANIS 1, GREGORY DIMITROULAKOS 2, COSTAS E. GOUTIS 3 VLSI Design Laboratory, Electrical

More information

WITH the development of the semiconductor technology,

WITH the development of the semiconductor technology, Dual-Link Hierarchical Cluster-Based Interconnect Architecture for 3D Network on Chip Guang Sun, Yong Li, Yuanyuan Zhang, Shijun Lin, Li Su, Depeng Jin and Lieguang zeng Abstract Network on Chip (NoC)

More information

Embedded Systems. 8. Hardware Components. Lothar Thiele. Computer Engineering and Networks Laboratory

Embedded Systems. 8. Hardware Components. Lothar Thiele. Computer Engineering and Networks Laboratory Embedded Systems 8. Hardware Components Lothar Thiele Computer Engineering and Networks Laboratory Do you Remember? 8 2 8 3 High Level Physical View 8 4 High Level Physical View 8 5 Implementation Alternatives

More information

Power Estimation of System-Level Buses for Microprocessor-Based Architectures: A Case Study

Power Estimation of System-Level Buses for Microprocessor-Based Architectures: A Case Study Power Estimation of System-Level Buses for Microprocessor-Based Architectures: A Case Study William Fornaciari Politecnico di Milano, DEI Milano (Italy) fornacia@elet.polimi.it Donatella Sciuto Politecnico

More information

Implementing Photoshop Filters in Virtex

Implementing Photoshop Filters in Virtex Implementing Photoshop Filters in Virtex S. Ludwig, R. Slous and S. Singh Springer-Verlag Berlin Heildelberg 1999. This paper was first published in Field-Programmable Logic and Applications, Proceedings

More information

Integrating MRPSOC with multigrain parallelism for improvement of performance

Integrating MRPSOC with multigrain parallelism for improvement of performance Integrating MRPSOC with multigrain parallelism for improvement of performance 1 Swathi S T, 2 Kavitha V 1 PG Student [VLSI], Dept. of ECE, CMRIT, Bangalore, Karnataka, India 2 Ph.D Scholar, Jain University,

More information

Bus Encoding Technique for hierarchical memory system Anne Pratoomtong and Weiping Liao

Bus Encoding Technique for hierarchical memory system Anne Pratoomtong and Weiping Liao Bus Encoding Technique for hierarchical memory system Anne Pratoomtong and Weiping Liao Abstract In microprocessor-based systems, data and address buses are the core of the interface between a microprocessor

More information

Testability Design for Sleep Convention Logic

Testability Design for Sleep Convention Logic Advances in Computational Sciences and Technology ISSN 0973-6107 Volume 11, Number 7 (2018) pp. 561-566 Research India Publications http://www.ripublication.com Testability Design for Sleep Convention

More information

FPGA: What? Why? Marco D. Santambrogio

FPGA: What? Why? Marco D. Santambrogio FPGA: What? Why? Marco D. Santambrogio marco.santambrogio@polimi.it 2 Reconfigurable Hardware Reconfigurable computing is intended to fill the gap between hardware and software, achieving potentially much

More information

Multi processor systems with configurable hardware acceleration

Multi processor systems with configurable hardware acceleration Multi processor systems with configurable hardware acceleration Ph.D in Electronics, Computer Science and Telecommunications Ph.D Student: Davide Rossi Ph.D Tutor: Prof. Roberto Guerrieri Outline Motivations

More information

FPGA BASED ADAPTIVE RESOURCE EFFICIENT ERROR CONTROL METHODOLOGY FOR NETWORK ON CHIP

FPGA BASED ADAPTIVE RESOURCE EFFICIENT ERROR CONTROL METHODOLOGY FOR NETWORK ON CHIP FPGA BASED ADAPTIVE RESOURCE EFFICIENT ERROR CONTROL METHODOLOGY FOR NETWORK ON CHIP 1 M.DEIVAKANI, 2 D.SHANTHI 1 Associate Professor, Department of Electronics and Communication Engineering PSNA College

More information

A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid Cache in 3D chip Multi-processors

A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid Cache in 3D chip Multi-processors , July 4-6, 2018, London, U.K. A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid in 3D chip Multi-processors Lei Wang, Fen Ge, Hao Lu, Ning Wu, Ying Zhang, and Fang Zhou Abstract As

More information

A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation

A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation Abstract: The power budget is expected to limit the portion of the chip that we can power ON at the upcoming technology nodes. This problem,

More information

A Novel Deadlock Avoidance Algorithm and Its Hardware Implementation

A Novel Deadlock Avoidance Algorithm and Its Hardware Implementation A ovel Deadlock Avoidance Algorithm and Its Hardware Implementation + Jaehwan Lee and *Vincent* J. Mooney III Hardware/Software RTOS Group Center for Research on Embedded Systems and Technology (CREST)

More information

On GPU Bus Power Reduction with 3D IC Technologies

On GPU Bus Power Reduction with 3D IC Technologies On GPU Bus Power Reduction with 3D Technologies Young-Joon Lee and Sung Kyu Lim School of ECE, Georgia Institute of Technology, Atlanta, Georgia, USA yjlee@gatech.edu, limsk@ece.gatech.edu Abstract The

More information

FPGA IMPLEMENTATION FOR REAL TIME SOBEL EDGE DETECTOR BLOCK USING 3-LINE BUFFERS

FPGA IMPLEMENTATION FOR REAL TIME SOBEL EDGE DETECTOR BLOCK USING 3-LINE BUFFERS FPGA IMPLEMENTATION FOR REAL TIME SOBEL EDGE DETECTOR BLOCK USING 3-LINE BUFFERS 1 RONNIE O. SERFA JUAN, 2 CHAN SU PARK, 3 HI SEOK KIM, 4 HYEONG WOO CHA 1,2,3,4 CheongJu University E-maul: 1 engr_serfs@yahoo.com,

More information

Fast FPGA Routing Approach Using Stochestic Architecture

Fast FPGA Routing Approach Using Stochestic Architecture . Fast FPGA Routing Approach Using Stochestic Architecture MITESH GURJAR 1, NAYAN PATEL 2 1 M.E. Student, VLSI and Embedded System Design, GTU PG School, Ahmedabad, Gujarat, India. 2 Professor, Sabar Institute

More information

An Integration of Imprecise Computation Model and Real-Time Voltage and Frequency Scaling

An Integration of Imprecise Computation Model and Real-Time Voltage and Frequency Scaling An Integration of Imprecise Computation Model and Real-Time Voltage and Frequency Scaling Keigo Mizotani, Yusuke Hatori, Yusuke Kumura, Masayoshi Takasu, Hiroyuki Chishiro, and Nobuyuki Yamasaki Graduate

More information

Optimized architectures of CABAC codec for IA-32-, DSP- and FPGAbased

Optimized architectures of CABAC codec for IA-32-, DSP- and FPGAbased Optimized architectures of CABAC codec for IA-32-, DSP- and FPGAbased platforms Damian Karwowski, Marek Domański Poznan University of Technology, Chair of Multimedia Telecommunications and Microelectronics

More information

Three DIMENSIONAL-CHIPS

Three DIMENSIONAL-CHIPS IOSR Journal of Electronics and Communication Engineering (IOSR-JECE) ISSN: 2278-2834, ISBN: 2278-8735. Volume 3, Issue 4 (Sep-Oct. 2012), PP 22-27 Three DIMENSIONAL-CHIPS 1 Kumar.Keshamoni, 2 Mr. M. Harikrishna

More information

Architectural Differences nc. DRAM devices are accessed with a multiplexed address scheme. Each unit of data is accessed by first selecting its row ad

Architectural Differences nc. DRAM devices are accessed with a multiplexed address scheme. Each unit of data is accessed by first selecting its row ad nc. Application Note AN1801 Rev. 0.2, 11/2003 Performance Differences between MPC8240 and the Tsi106 Host Bridge Top Changwatchai Roy Jenevein risc10@email.sps.mot.com CPD Applications This paper discusses

More information