Thermal-aware Fault-Tolerant System Design with Coarse-Grained Reconfigurable Array Architecture


2010 NASA/ESA Conference on Adaptive Hardware and Systems

Ganghee Lee and Kiyoung Choi
Department of Electrical Engineering and Computer Science
Seoul National University, Seoul, Korea
{berean97,

Abstract

Coarse-grained reconfigurable array architectures have drawn increasing attention due to their performance and flexibility. A typical coarse-grained reconfigurable array architecture has many PEs in the array, which makes it suitable for implementing the spatial redundancy used in fault-tolerant system design. In this paper, we propose to implement replications and a voting function on the PE array of a coarse-grained reconfigurable array architecture to design a fault-tolerant system. We also introduce thermal-aware application mapping onto the coarse-grained reconfigurable array architecture for reliability. The experiment with a Viterbi decoder shows that our approach enables implementing fault tolerance with 12% area overhead, which comes from implementing conditional execution.

1. Introduction

Fault tolerance is the property that enables a system to continue operating properly in the event of failure. Especially in aerospace and biomedical applications, the system has to be highly reliable since the effect of a fault can be catastrophic. To attain the required reliability, soft-error-tolerant design has been widely attempted by replicating multiple identical instances of the same system, executing all of them in parallel, and choosing the correct result on the basis of a majority vote [1]. The same inputs are provided to each replication, so the same outputs are expected, and the outputs of the replications are compared using a voter.

Reconfigurable computing is becoming more and more popular with the increasing requirements for more flexibility and higher performance.
In particular, coarse-grained reconfigurable array architectures [2][3] are gaining popularity, since they not only avoid the huge NRE (non-recurring engineering) cost of custom VLSI chips but also have higher area efficiency than fine-grained architectures such as FPGAs. Typically, a coarse-grained reconfigurable array architecture consists of a reconfigurable array of processing elements (PEs) and its controller. Due to the large number of PEs in the array, coarse-grained reconfigurable array architectures are suitable for implementing spatial redundancy for fault tolerance. However, conventional coarse-grained reconfigurable array architectures are inefficient at implementing a voter, since they are usually designed for the data-intensive kernel part of an application rather than a control-intensive part such as a voter.

Most previous research on reliability for reconfigurable architectures focuses on fine-grained reconfigurable devices such as FPGAs [4][5]. The work in [6] introduces a coarse-grained reconfigurable architecture enabling flexible reliability. However, it incurs much area overhead due to the voter implementation.

In this paper, we introduce an approach to designing fault-tolerant systems efficiently with a coarse-grained reconfigurable array architecture. In a preliminary effort [16], we presented an approach to supporting conditional execution on the reconfigurable architecture. The support of conditional execution enables efficient implementation of a voter without additional overhead. The novelty of our approach is as follows.

- We implement a low-overhead fault-tolerant system with the existing conditional execution mechanism. Since we implement both the replications and the voter on the reconfigurable PE array, we do not incur additional area overhead for implementing a voter, unlike the approach in [6].
- We consider thermal effects when generating configuration code for the coarse-grained reconfigurable array architecture for reliability.

The remainder of this paper is organized as follows. Section 2 introduces our coarse-grained reconfigurable array architecture. Section 3 explains the design flow for reliability. Section 4 shows experiments with a Viterbi decoder. Finally, Section 5 concludes with some remarks on future work.


2. Target architecture

2.1. Coarse-grained reconfigurable array architecture

Our target architecture consists of an array of PEs, several sets of data memories, and a configuration cache memory [7]. Figure 1 shows our coarse-grained reconfigurable array architecture and the internal structure of the PEs. Each PE is connected to its nearest neighboring PEs: top, bottom, left and right. The size of the array can be optimized to a specific application domain [7]. In Figure 1, for example, the architecture contains a 4x4 reconfigurable array of PEs.

Figure 1. Coarse-grained reconfigurable array architecture.

The area-critical functional units (such as multipliers or dividers) are located outside the PEs and shared among a set of PEs [7]. Each area-critical functional unit is pipelined to curtail the critical path delay, and its execution is initiated by scheduling the area-critical operation on one of the PEs that share this area-critical resource. Thus each PE can be dynamically reconfigured either to perform arithmetic and logical operations with its own ALU in one clock cycle, or to perform multiplication or division operations using the corresponding shared functional unit in several clock cycles with pipelining.

The data memory in Figure 1 is used for storing data that can be accessed by the PEs. There are two sets of memory, each of which consists of three banks: one connected to the write bus and the other two connected to the read buses. These read/write buses are also shared by the PEs, like the area-critical shared functional units. The two sets of memory are used for double buffering.

The configuration cache is composed of an array of Cache Elements (CEs), whose size is the same as the size of the array of PEs. More specifically, each PE has its own CE, and therefore the two arrays (PE array and CE array) have the same dimensions.
Each CE has many layers, with each layer holding a different context, such that the entire array of PEs can be reconfigured within just one cycle by switching the layers. Note that the area-critical resources are shared by the PEs on the same row, as shown in Figure 1, and activated through the individual PEs, and thus need not be modeled separately from the PEs.

2.2. Feature for supporting conditional execution

To support conditional execution on the reconfigurable architecture, our target architecture [16] has a condition signal, as shown in Figure 2. The condition signal can be issued by conditional operations such as comparison or logical negation, and the PE can select one of the results from multiple sources (between $A_{sel}$ and $B_{sel}$). An interconnection network is also introduced for conditional execution [16]. Among various interconnect architectures, we use the column-wide bus architecture, where buses are placed on the array along each column. Figure 3 shows the column-wide bus architecture, where the total number of buses on the array is the same as the number of columns. Each bus is 1 bit wide and carries the condition signal. Note that a conditional operation should be executed just before the resulting condition signal is used, since in the current implementation a PE broadcasts the condition signal value to the column-wide bus and the value is preserved only for the next cycle. The other PEs then read the value from the column-wide bus in that next cycle.

Figure 2. PE structure for supporting conditional execution.

Figure 3. Column-wide bus architecture.

3. Design flow for reliability

We implement three different levels of reliability with the coarse-grained reconfigurable array architecture. By exploiting the flexibility of the reconfigurable architecture, we can easily change the level of reliability without incurring any additional overhead. Figure 4 shows the three different levels of reliability: i) TMR (triple-modular redundancy), ii) DMR (double-modular redundancy) and iii) NR (no redundancy).

Figure 4. Three different levels of reliability: (a) triple-modular redundancy (TMR), (b) double-modular redundancy (DMR), (c) no redundancy (NR).

Figure 5. Design flow for fault-tolerant code generation.

In TMR mode, three replications of each element are used for reliability. The voting circuit can determine which replication is in error when a two-to-one vote is observed. In this case, the voting circuit can output the correct result and discard the erroneous version. In DMR mode, two replications of each element are used for reliability. Thus, the voting circuit can only detect a mismatch. In NR mode, the system does not check for failures. For performance-oriented applications (such as multimedia), we run the system in NR mode.
However, for applications that have to be highly reliable (such as biomedical or aerospace applications), we run the system in DMR or TMR mode.

3.1. Fault-tolerant code generation

Figure 5 shows the design flow for fault-tolerant code generation considering different levels of reliability. At the first step, we adopt HLS (high-level synthesis) techniques to map application kernels onto the reconfigurable array architecture through scheduling and binding. The mapping requires solving multiple problems simultaneously. First, we should compile the application and generate the configuration of the architecture while maximally exploiting the parallelism in both the application and the architecture. Consider two operations that have a data or control dependency between them and are mapped onto two different PEs with no direct interconnection. In this case, other PEs are used for routing, and for this we add extra dummy move operations for data forwarding [9][10].

Figure 6. TMR voter implementation (panels (a) and (b)).

Table I. Error detection result ($O_D$) of TMR voter

O_D result    Description
00            No error
01            Unrecoverable error in voter
10            Two-to-one vote
11            Unrecoverable error in replica

Our kernel mapping algorithm in Figure 5 consists of two phases: i) list scheduling to get an initial solution, and ii) a quantum-inspired evolutionary algorithm (QEA) [14] to get a more refined solution. QEA is a kind of evolutionary algorithm which is known to be very efficient. We seed the QEA with the list scheduling result and try to minimize the total latency. Since the QEA starts with a relatively good initial solution, it tends to reach a better solution sooner than when starting with a random seed. As a result of the QEA, the schedule and binding of each vertex are determined. Once the schedule and binding are given, the mapper tries to find routing paths among the vertices using the remaining unused PEs, to see whether the schedule and binding results are implementable with the limited interconnect resources.

In the scheduling and binding algorithm, the number of allowed resources in a column is given by $\lfloor M / R \rfloor$, where $M$ is the number of PEs in a column and $R$ is the number of replications. For example, if the reconfigurable array has 8 PEs in a column and requires the TMR level of reliability, the allowed number of resources for each replication is $\lfloor 8 / 3 \rfloor = 2$.

For each replica, we need the replicated data.
If we put only one copy of the data in the data memory, we may suffer from data contention because of the limited interconnect resources. To resolve such contention caused by memory accesses, we allow storing the replicated data for each replica. Thus, according to the level of reliability, we store multiple copies of the data in the data memory using DMA.

Figure 7. CDFG representations for voter implementation: (a) TMR voter, (b) DMR voter.

After the configuration for one replication has been generated, we replicate the configuration according to the level of reliability. Then, at the final stage, we insert the voting operation. The voter uses compare operations of the PEs (a PE in our coarse-grained reconfigurable array architecture can perform conditional operations as well as arithmetic or logical operations; see Section 2.2). The outputs ($O_R$ and $O_D$) of the voter are stored in two different locations. One ($O_R$) is the computational result and the other ($O_D$) is the error detection result.

Figure 6 shows the input ($X$, $Y$ and $Z$) and output ($O_D$ and $O_R$) table of the voter with triple-modular redundancy (TMR). The inputs ($X$, $Y$ and $Z$) of the table (in Figure 6(a)) are the comparison results of the replications ($A$, $B$ and $C$), as shown in Figure 6(b). Table I describes the $O_D$ result. In Table I, there are two kinds of unrecoverable errors: one from the voter part and the other from the replication part. When 11 is observed as the $O_D$ result, we see that three different results were generated by the three replicas. On the other hand, if 01 is observed as the $O_D$ result, we know that the voter has a fault. For example, $X=0$, $Y=0$ and $Z=1$ (one of the cases where 01 is observed at $O_D$) means that $A$ is equal to $B$ and $A$ is equal to $C$, but $B$ is not equal to $C$. Since this is logically impossible, we infer that the voting logic has a fault. We cannot recover from the two errors, 01 and 11 at $O_D$.
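
The voter behavior described above can be illustrated with a small software sketch. This is not the authors' PE-level configuration; it is a minimal Python model under stated assumptions: the function names (`error_code`, `tmr_vote`, `dmr_vote`) are hypothetical, a comparison flag of 0 is taken to mean "equal" (as in the $X=0$, $Y=0$, $Z=1$ example), and $O_D$ is assumed to be the number of mismatching pairs, which happens to reproduce the four rows of Table I.

```python
def error_code(x, y, z):
    """Encode the three comparison flags as the 2-bit O_D string.

    Assumption: O_D is the count of mismatching pairs (0..3), which
    matches Table I: 00 no error, 01 voter fault, 10 two-to-one vote,
    11 all replicas differ.
    """
    return format(x + y + z, "02b")


def tmr_vote(a, b, c):
    """Return (O_R, O_D) for the three replica outputs a, b, c."""
    # Compare operations: 0 means "equal", 1 means "not equal".
    x = int(a != b)
    y = int(a != c)
    z = int(b != c)
    # Select operation: keep A if it agrees with B or C, otherwise use B.
    # When O_D is "11" the selected value is not trustworthy.
    o_r = a if (x == 0 or y == 0) else b
    return o_r, error_code(x, y, z)


def dmr_vote(a, b):
    """DMR voter: a single compare can only flag a mismatch (1)."""
    return a, int(a != b)


if __name__ == "__main__":
    print(tmr_vote(5, 5, 5))    # all replicas agree
    print(tmr_vote(7, 5, 5))    # replica A is outvoted by B and C
    print(tmr_vote(1, 2, 3))    # three different results
    print(error_code(0, 0, 1))  # inconsistent flags: voter fault
    print(dmr_vote(5, 6))       # mismatch detected, not corrected
```

Because a fault-free software comparison can never produce exactly one mismatch, the voter-fault code is exercised here by passing the flags to `error_code` directly.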

Figure 7 shows the CDFG (control data flow graph) representation of the voter implementation, where each operation (compare, add, logical and, and select) is supported by a PE in the array. Compared to the TMR voter, the DMR voter can be easily implemented with only one compare operation, as shown in Figure 7(b). In [6], most of the area overhead (27% compared to the original architecture) for designing a reliable system arises from the voter implementation. However, in our implementation, since the PE array is used for the voting operation instead of extra dedicated hardware logic, there is no additional area overhead.

3.2. Thermal-aware application mapping

When we map an application onto the coarse-grained reconfigurable array architecture, we can also consider thermal effects, based on the fact that, given a certain compute processor and steady ambient temperature, tasks with longer run-times generally cause more heat and therefore higher peak temperatures. Accordingly, tasks with shorter run-times cause less heat and therefore lower peak temperatures [11]. The reason why we consider thermal effects is that the FIT (failures-in-time) rate increases dramatically with temperature (see footnote 1).

Footnote 1. Time to failure is known to be a function of $e^{-E_a/kT}$ (acceleration factor in the Arrhenius equation [17]), where $E_a$ is the activation energy of the failure mechanism being accelerated, $k$ is Boltzmann's constant, and $T$ is the absolute temperature.

Thermal management can be characterized as temporal or spatial. A temporal thermal management scheme [12] controls the amount of computation on the processing element to reduce the temperature. On the other hand, a spatial thermal management scheme [13] can reduce the temperature by scheduling hot tasks on cool processing elements. In this paper, we perform spatial thermal management for reliability, since it can reduce the temperature effectively without throttling the computation [19].

The advantage of the coarse-grained reconfigurable array architecture is its flexibility. For example, the mapping of the application running on the coarse-grained reconfigurable array architecture can be changed dynamically as time elapses. Thus the temperature of each PE can change dynamically according to the workload. Assuming that the temperature of each PE can be measured (or estimated by a thermal model such as the one in [15]), we map the application considering thermal effects so that the reliability of the system is improved. For the mapping, we calculate the thermal cost of each PE as

$$\mathrm{Cost} = a_1 T_C + a_2 T_{DA} + a_3 T_{da} \qquad (1)$$

where $T_C$ is the temperature of the candidate PE, the $T_{DA}$ are the temperatures of the directly neighboring PEs, the $T_{da}$ are the temperatures of the diagonally neighboring PEs, and the $a_i$ are the weights of the parameters. The values of the weights are determined statically through analysis and/or experiments. Equation (1) is obtained from [18] after slight modification for our mapping approach.

As shown in Figure 5, our mapping tool for the coarse-grained reconfigurable array architecture [9] takes a two-phase approach of list scheduling followed by refinement with a quantum-inspired evolutionary algorithm (QEA) [14]. In the second phase, the fitness function that we use for the QEA is the performance. In addition, we consider the thermal cost in (1). At the evaluation stage of the QEA, we calculate the thermal cost for every possible mapping. If several candidate solutions give the same performance, we choose the one with the lowest cost.

Thermal model

In (1), the temperature of each PE can be measured by a thermal sensor (the details of how to measure the temperature of each PE are out of the scope of this paper) or calculated by a thermal model as follows. Figure 8 shows the thermally different locations of the PE array. For an $N \times N$ array of identical square PEs, there are $(\lceil N/2 \rceil (\lceil N/2 \rceil + 1)) / 2$ different possible locations [18]. Thermally different locations reflect the fact that central PEs, such as C in Figure 8(b), tend to have a higher temperature than edge PEs such as A.

Figure 8. Thermally different locations: (a) 2x2 (1 thermal location), (b) 3x3 (3 thermal locations), (c) 4x4 (3 thermal locations), (d) 5x5 (6 thermal locations).

Building on the idea of thermally different locations, the authors of [15] present a post-thermal map calculation that estimates the temperature change after task allocation.
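
The thermal cost in (1), the tie-breaking rule used at the QEA evaluation stage, and the count of thermally different locations can be sketched in a few lines of Python. This is only an illustration under assumptions that the text does not fix: the neighbor temperatures are summed, neighbors outside the array contribute nothing, the cost of a whole mapping is the sum of the costs of the PEs it uses, and the default weights are placeholders. All function names are hypothetical.

```python
def thermal_locations(n):
    """Number of thermally different locations in an n x n PE array."""
    half = (n + 1) // 2          # integer form of ceil(n / 2)
    return half * (half + 1) // 2


def thermal_cost(temps, i, j, a1=3, a2=2, a3=1):
    """Cost of PE (i, j) following equation (1).

    Assumptions: neighbor temperatures are summed, and positions
    outside the array contribute 0. The weights are placeholders.
    """
    rows, cols = len(temps), len(temps[0])

    def t(r, c):
        return temps[r][c] if 0 <= r < rows and 0 <= c < cols else 0

    direct = t(i - 1, j) + t(i + 1, j) + t(i, j - 1) + t(i, j + 1)
    diagonal = (t(i - 1, j - 1) + t(i - 1, j + 1)
                + t(i + 1, j - 1) + t(i + 1, j + 1))
    return a1 * temps[i][j] + a2 * direct + a3 * diagonal


def pick_mapping(candidates, temps):
    """Pick the best (latency, [(i, j), ...]) candidate.

    Latency is the primary key; the summed thermal cost of the PEs
    used (an assumed aggregation) breaks ties.
    """
    def key(candidate):
        latency, pes = candidate
        return latency, sum(thermal_cost(temps, i, j) for i, j in pes)

    return min(candidates, key=key)


if __name__ == "__main__":
    print([thermal_locations(n) for n in (2, 3, 4, 5)])
    snapshot = [[1, 2], [3, 4]]  # made-up temperatures
    print(pick_mapping([(10, [(1, 1)]), (10, [(0, 0)])], snapshot))
```

For array sizes 2 to 5 the count agrees with the numbers of thermal locations listed in the caption of Figure 8.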
A 2D thermal map is defined for the $N \times N$ PE array, where a cell value in the thermal map represents the temperature of the corresponding PE. The current thermal map is referred to as the pre-thermal map, and the temperature of the PE at location $(i, j)$, i.e., $i$-th row and $j$-th column, in the pre-thermal map is denoted by $T_0(i, j)$. The thermal map predicting the temperature change after task allocation is referred to as the post-thermal map, and the temperature of the PE at location $(i, j)$ in the post-thermal map is denoted by $T(i, j)$. To quickly calculate the thermal distribution associated with adding a task to the PE array, they use the following equation [15].


$$T(i, j) = T_0(i, j) + (1 - e^{-e/\tau}) \cdot p \cdot \mathrm{LUT}(k, i, j) \qquad (2)$$

where $(1 - e^{-e/\tau})$ is the architecture-dependent constant, which is calculated statically, $p$ is the power dissipated at thermally different location $k$, and $\mathrm{LUT}(k, i, j)$ is the element at location $(i, j)$ of the look-up table for $k$. There are $(\lceil N/2 \rceil (\lceil N/2 \rceil + 1)) / 2$ pre-built look-up tables, one for each thermally different location. The look-up table stores the steady-state temperature of each PE, which would be reached if the application were executed indefinitely. More specifically, an element in the $k$-th LUT gives the increase in temperature at the corresponding PE after one watt of power is dissipated by the application running at $k$. The details of this thermal model can be found in [15].

Figure 9. Thermal snapshot and cost analysis: (a) thermal snapshot, (b) application graph, (c) cost analysis, (d) application mapping.

Thermal-aware application mapping

Figure 9(a) shows the thermal snapshot of the PEs at a certain time $t$. Now we want to map an application, represented by the data flow graph shown in Figure 9(b), onto the PEs. Among several mapping candidates that give the same performance, we choose the one giving the lowest cost. Figure 9(c) shows the cost calculated by (1) for every PE when we simply assume $a_1 = 3$, $a_2 = 2$ and $a_3 = 1$. Finally, we map the application onto the PE array as shown in Figure 9(d), which gives the best performance and the lowest peak temperature while satisfying the given resource constraints. Regarding the resource constraints, there are several problems to be considered for the mapping. For example, in Figure 9(d), diagonal interconnections and shared resources such as multipliers are not considered, for simplicity. We do not address such problems since they are out of the scope of this paper. The details of mapping under resource constraints can be found in our previous paper [9].

In Figure 9, darker gray in the PE array means a hotter area. From Figure 9(a) and (c), we see that the hot spots in the thermal snapshot and in the cost analysis results may differ.

3.3. Discussions on reliability

In the coarse-grained reconfigurable array architecture, when permanent faults such as manufacturing faults are detected, we can correct the problem relatively easily by reconfiguration. When some broken PEs are detected in the PE array (the details of how to find the broken PEs are not addressed in this paper, since that is another difficult problem), we map the kernel so that it avoids the broken PEs.

Transient faults can be detected and corrected by the TMR or DMR implementation. A transient fault can occur either in a PE executing the replicated tasks or in a PE executing the voting operation. In some cases, both the replications and the voter can have faults. In our fault-tolerant approach, however, only errors due to faults in the replications can be both detected and corrected, with a two-to-one vote. Errors due to faults in the voter can be detected but cannot be corrected. Table II summarizes this.

Table II. Error detection and correction

            Replication part    Voter part
Detect      O
Correct     (two-to-one)        X

O: possible, : partially possible, X: cannot

We should also consider faults occurring in memory. We expect that memory faults can be detected and corrected by inserting an ECC (error correcting code) circuit.

4. Experiment

4.1. Architecture overhead analysis

To see the implementation overhead for reliability, we designed the coarse-grained reconfigurable array architecture at the register-transfer level and synthesized a gate-level circuit targeting an FPGA. The area overhead for implementing reliability was 12% compared to the original architecture. This overhead comes from implementing conditional execution for handling the control path of the application [16]. As mentioned in Section 2.2, we added condition signals, 1-bit registers and column-wide buses for interconnection.
However, most of the area overhead (10.3%) comes from the increased logic needed to implement the extended operations (such as comparison or logical negation) rather than from the control interconnects (1.7%).


In our synthesis result, there was no degradation of clock speed for the reliability implementation compared to the original architecture. As mentioned in Section 2.2, we use the column-wide bus architecture for the control signal shared by the PEs in each column of the array. Adding such a 1-bit column-wide bus for the control signal does not degrade the clock speed, since the original architecture already has 16-bit column-wide buses for data memory.

4.2. Evaluation

As a sample application, we implemented a kernel part of a Viterbi decoder on our coarse-grained reconfigurable array architecture. The Viterbi decoding algorithm is widely used for decoding convolutional codes in satellite communications and in bioinformatics, where the system has to be highly reliable. One of the most time-consuming operations in a Viterbi decoder is the ACS (add-compare-select) operation shown in Figure 10. Since our coarse-grained reconfigurable array architecture enables conditional execution, we can easily map this ACS operation onto our reconfigurable architecture.

Figure 10. ACS (add-compare-select) operation in Viterbi decoder.

Figure 11(a) shows an example for TMR, where three replicas of the ACS operation and the voter are mapped onto a coarse-grained reconfigurable array architecture that has eight PEs in a column. With eight PEs in a column, one replica can use two PEs. The gray region in Figure 11 represents the voter implementation.

Figure 11. Application mapping with different reliability levels: (a) TMR implementation, (b) DMR implementation.

Figure 12. Tradeoff between performance and reliability level for the ACS operation of Viterbi decoder.

We can trade reliability level for performance. If we implement DMR instead of TMR, one replica can use $8 / 2 = 4$ PEs. Thus we can run two ACS operations concurrently for each replica, which leads to a performance improvement. Figure 11(b) shows the DMR implementation, whose voting function is simpler than that of the TMR implementation.
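
The resource arithmetic behind this tradeoff can be written down as a short Python sketch. It only restates the $\lfloor M / R \rfloor$ rule from Section 3; the names are hypothetical, and the assumption that one ACS operation occupies two PEs of a column is inferred from the TMR and DMR examples above rather than stated as a general rule.

```python
# Number of replications R for each reliability level.
REPLICAS = {"NR": 1, "DMR": 2, "TMR": 3}

# Assumption inferred from the examples above: one ACS operation
# occupies two PEs of a column.
PES_PER_ACS = 2


def pes_per_replica(pes_in_column, mode):
    """Allowed PEs per replica in one column: floor(M / R)."""
    return pes_in_column // REPLICAS[mode]


def concurrent_acs(pes_in_column, mode):
    """ACS operations that one replica can run side by side."""
    return pes_per_replica(pes_in_column, mode) // PES_PER_ACS


if __name__ == "__main__":
    for mode in ("TMR", "DMR"):
        print(mode, pes_per_replica(8, mode), concurrent_acs(8, mode))
```

The PE budget alone does not determine the speedup, since the voter's own operations and latency also depend on the reliability level.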

Figure 12 shows the tradeoff between performance and reliability level when running the ACS operations of the Viterbi decoder. The performance is normalized to the TMR implementation. The performance is not simply linear in the number of PEs used for implementing one replica, since the implementation has a different voting operation and latency depending on the reliability level.

5. Conclusion

In this paper, we presented a thermal-aware fault-tolerant system design with a coarse-grained reconfigurable array architecture. The proposed system has several reliability levels, so that one can exploit the tradeoff between performance and reliability by adjusting the reliability level. We used the conditional execution feature to implement a reliable system; this feature accounts for 12% area overhead compared to the original architecture. We also introduced temperature-aware application mapping onto the coarse-grained reconfigurable array architecture for reliability. We experimented with a Viterbi decoder, where all replications and the voting function are implemented on the reconfigurable PE array without causing additional logic overhead. As future work, we are working on detailed reliability analysis for different implementations and on designing reliable systems including fault-tolerant memory and a run-time adaptor.

Acknowledgment

This work was supported by KOSEF under NRL Program Grant (R0A ) funded by MEST, Korea and Nano IP/SoC Promotion Group under Seoul R&BD Program (10560).

References

[1] L. Anghel, D. Alexandrescu, and M. Nicolaidis, Evaluation of a soft error tolerance technique based on time and/or space redundancy, in Proc. ICSD.
[2] H. Singh, M. H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, and E. M. C. Filho, Morphosys: an integrated reconfigurable system for data-parallel and computation-intensive applications, IEEE Trans. Computers, vol. 49, May.
[3] B. Mei, S. Vernalde, D. Verkest, H. D. Man, and R. Lauwereins, ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix, in Proc. FPLA.
[4] J. A. Cheatham, J. M. Emmert, and S. R. Baumgart, A survey of fault tolerant methodologies for FPGAs, ACM Trans. Design Automation of Computer Systems, April.
[5] S. K. Lu, F. M. Yesh, J. S. Shih, Fault detection and fault diagnosis techniques for lookup table FPGAs, VLSI Design, vol. 15.
[6] D. Alnajjar, Y. Ko, T. Imagawa, M. Hiromoto, Y. Mitsuyama, M. Hashimoto, H. Ochi, and T. Onoye, A coarse-grained dynamically reconfigurable architecture enabling flexible reliability, in Proc. FPL.
[7] Y. Kim, M. Kiemb, C. Park, J. Jung, and K. Choi, Resource sharing and pipelining in coarse-grained reconfigurable architecture for domain-specific optimization, in Proc. DATE.
[8] Y. Kim, I. Park, K. Choi, and Y. Paek, Power-conscious configuration cache structure and code mapping for coarse-grained reconfigurable architecture, in Proc. ISLPED.
[9] G. Lee, S. Lee, K. Choi, and N. Dutt, Routing-aware application mapping considering Steiner points for coarse-grained reconfigurable architecture, in Proc. ARC.
[10] G. Lee, K. Chang, and K. Choi, Automatic mapping of control-intensive kernels onto coarse-grained reconfigurable array architecture with speculative execution, in Proc. RAW.
[11] D. C. Vanderster, A. Baniasadi, and N. J. Dimopoulos, Exploiting task temperature profiling in temperature-aware task scheduling for computational clusters, in Proc. APCSAC.
[12] D. Brooks and M. Martonosi, Dynamic thermal management for high-performance microprocessors, in Proc. HPCA.
[13] K. Skadron, M. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan, Temperature-aware microarchitecture: modeling and implementation, ACM Trans. Architecture and Code Optimization, vol. 1, March.
[14] K. Han and J. Kim, Quantum-inspired evolutionary two phase scheme, IEEE Trans. Evolutionary Computation 8, April.
[15] J. Cui and D. L. Maskell, Dynamic thermal-aware scheduling on chip multiprocessor for soft real-time system, in Proc. GLSVLSI.
[16] J. Lee, Y. Kim, J. Jung, S. Kang, and K. Choi, Reconfigurable ALU array architecture with conditional execution, in Proc. ISOCC.
[17] Compendium of Chemical Terminology, International Union of Pure and Applied Chemistry, Gold Book.
[18] K. Stavrou and P. Trancoso, Thermal-aware scheduling for future chip multiprocessors, EURASIP Journal on Embedded Systems, January.
[19] M. D. Powell, M. Gomaa, and T. N. Vijaykumar, Heat-and-run: Leveraging SMT and CMP to manage power density through the operating system, in Proc. ASPLOS.
