Power Estimation of UVA CS754 CMP Architecture

Mateja Putic
mateja@virginia.edu

Introduction

Early power analysis has become an essential part of determining the feasibility of a microprocessor design. As power density continues to rise, it is beneficial to eliminate, early in the design process, choices that would render the design impractical or impossible. Furthermore, power analysis data can also be used to steer power-aware compiler decisions within a constrained energy budget. This paper presents a power and area estimation model for the UVA CS754 architecture.

The trouble with most early power analysis schemes is that they are either computationally complex or statistically inaccurate. In addition, the relative novelty of general-purpose CMP architectures has resulted in a lack of related high-level power analysis techniques. Traditional methods designed for single-core architectures cannot scale to provide rapid and accurate estimates of CMP power consumption. These approaches do not take into account the on-chip networks frequently present on architectures which contain multiple major computation elements. With the number of cores in projected technologies already on the rise, inter-core message passing will undoubtedly become a more dominant factor in power consumption, so the ability of a power estimation technique to incorporate these quantities is similarly important. In this approach, we extend the processor modeling strategy offered by Eisley et al. in High-Level Power Analysis for Multi-Core Chips to the UVA CS754 CMP architecture.

High-level Power Analysis Methodology

Power analysis of complex systems such as CMP architectures is completed by breaking the problem down into smaller blocks. Each instruction in the ISA follows a path through the processor which is composed of smaller common sub-paths. The main idea behind this approach is to identify which sub-paths each instruction traverses and then sum the power consumption of each traversal.

Power and area models go hand in hand. It is impossible to estimate energy without estimating area first, because energy is directly proportional to capacitance, which is in turn directly proportional to area. For high-level power estimation techniques, it is sufficient to estimate area based on component input bit widths. Any more detailed attempt at estimating capacitance would be unnecessary, because the power estimates produced by this model are relative: a power estimation for a trace of instructions produces a unitless quantity. It only makes sense to compare results among power estimations for several testbenches produced by this same model.
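Stated as a formula (a compact restatement implied by the text, not taken verbatim from it), the relative energy charged to an instruction i is the sum of the energies of the links on its path, with each link's energy proportional to its area and hence to its input bit width w_l:

    E_i \;=\; \sum_{l \,\in\, \mathrm{path}(i)} E_l, \qquad E_l \;\propto\; A_l \;\propto\; w_l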

Delegating area estimation to the bit width of a component mitigates the error of high-level estimation by using a unified metric across the entire model. Error is therefore normalized and is prevented from threatening the integrity of the power model.

Modeling the CS754 Architecture

Eisley et al. provide a set of building blocks which can be used to create a network model for any microprocessor architecture. A directed network implicitly discloses dependencies among resources. When a network representation of a microarchitecture is created, the specified dependencies reflect the order in which an instruction passing through the processor traverses components within the datapath. Messages pass through the network of the power model by traversing a series of links, and each pipeline stage is represented by one or more links. Canonical message paths in the network are identified by grouping together links which must be traversed in sequence due to architectural or logical constraints.

Modeling the CS754 architecture entailed first developing models for the control processor (CP), the acceleration grid (AG), and the interconnect network. Both the CP and a single AG core were modeled as the same datapath, but with different dimensions; this sizing ratio is a parameter in the model which can be adjusted. The datapath is made up of stages common to most pipelined architectures: instruction fetch and decode, operand fetch, integer ALU, and cache and write-back stages. The CP also contains a two-stage multiplier and a four-stage floating-point unit, not found in the AG cores.

Figure 1: Network representation of a CP / AG core.
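To illustrate how such a network model might be parameterized in code, the following sketch builds the shared CP/AG pipeline out of links whose normalized widths scale with a per-core multiplier (4 for the four-way superscalar CP, 1 for an AG core, as discussed below). The Link structure, the function name, and the FP1-FP4 stage labels are hypothetical conveniences, not part of the actual model; the remaining link names follow the text.

#include <string>
#include <vector>

// One link in the directed network; width is normalized to the
// 32-bit instruction width, so width 4 means a 128-bit datapath.
struct Link {
    std::string name;
    int width;            // normalized link width (multiple of the 32-bit base)
    double utilization;   // filled in later from an instruction trace
};

// Build the common pipeline datapath shared by the CP and an AG core.
// widthMultiplier is 4 for the four-way superscalar CP and 1 for an
// AG core.
std::vector<Link> buildCorePipeline(int widthMultiplier, bool isCP) {
    std::vector<Link> links = {
        {"IF", widthMultiplier, 0.0},      // instruction fetch
        {"ID", widthMultiplier, 0.0},      // instruction decode
        {"R1", widthMultiplier, 0.0},      // first operand fetch
        {"R2", widthMultiplier, 0.0},      // second operand fetch
        {"BR", widthMultiplier, 0.0},      // branch
        {"INT_ALU", widthMultiplier, 0.0}, // integer ALU
        {"A", widthMultiplier, 0.0},       // cache address
        {"TV", widthMultiplier, 0.0},      // tag verify
        {"TL", widthMultiplier, 0.0},      // tag lookup
        {"WB", widthMultiplier, 0.0},      // write-back
    };
    if (isCP) {
        // Units found only in the CP: a two-stage multiplier and a
        // four-stage floating-point unit.
        for (const char* n : {"M1", "M2", "FP1", "FP2", "FP3", "FP4"})
            links.push_back({n, widthMultiplier, 0.0});
    }
    return links;
}

With the parameters used in this exploration, the model would instantiate buildCorePipeline(4, true) once for the CP and buildCorePipeline(1, false) sixteen times for the AG, plus TO GRID and FROM GRID links of normalized width four.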

The interconnect network is modeled as a single link whose width is a parameter within the model. When data traverses the TO GRID link to arrive at node 14, it is understood that it has arrived at its destination. For example, if the CP sends data to a core on the AG via the on-chip network, when the data reaches node 14 of the CP, it also reaches node 14 of the AG core. At this point, the data has crossed the data network grid and has left the CP. Node 14 can be thought of as an intermediate point between source and destination.

In order to create a parameterized model of the CS754 architecture, which is a work in progress, it was necessary to make some assumptions. Each instruction is assumed to be 32 bits wide, which sets the minimum link width to 32 bits; the width of any link is therefore a multiple of 32. For brevity, link width can simply be normalized to 32, so a normalized link width of four corresponds to a 128-bit-wide component. For the purpose of this exploration, it was assumed that there are 16 discrete cores in the AG, but this number can be specified as a parameter in the model. The only significant matter concerning the grid size is its size relative to the CP.

Although the power model of the CP's pipeline looks the same as that of an AG core (with the addition of the floating-point and multiply units), the CP is assumed to be a four-way, out-of-order superscalar core. Therefore, each link in the CP pipeline is modeled four times as wide as a link in a single AG core. Furthermore, since the CP is capable of issuing four instructions at a time to the AG via the on-chip network, the normalized width of the TO GRID and FROM GRID links is four. The interconnect network communicates four words of data, plus routing information.

Calculating Power Constituents

Power is calculated as suggested by Eisley et al., using the product of the energy and utilization of each link. Each link in the network which represents the underlying architecture has its own component energy cost and utilization. When these individual quantities are summed, an approximation of the total power utilization of the architecture is obtained.

The energy of a link is approximated by estimating the area of the component which that link represents. This can be done because the other factors which make up capacitance are held constant throughout the design, and only area is varied. Since this power model is relative, only the datapath width of a component is used to approximate area. Pipeline stages are assumed to have no slack, so for the purposes of estimation the propagation delay of each stage is the same.

    E \propto CV^2, \qquad C = \frac{\varepsilon A}{d} \quad\Longrightarrow\quad E = kA
    (Lemma 1)

Each link is assigned a static energy cost. Link power, as specified by Eisley et al., is calculated by multiplying the energy of the link by its utilization. Lemma 1 shows that energy is directly proportional to area.
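Combining the area proxy of Lemma 1 with per-link utilizations gives the unitless figure the model reports. As a sketch (restating the Eisley et al. product formulation under the width-as-area assumption, with w_l the normalized width and U_l the utilization of link l):

    P \;\propto\; \sum_{l \,\in\, \mathrm{links}} E_l\,U_l \;=\; k \sum_{l \,\in\, \mathrm{links}} w_l\,U_l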

Since path width is used as a proxy for area, the energy consumption of major architectural structures can simply be characterized by path width.

Utilization is estimated from an instruction trace. The utilization of each link within the core pipelines was estimated by counting occurrences of related instructions. Every instruction must be fetched and decoded, so the utilization of both the IF and ID links is 1 in every core. Operand fetches contribute to the utilization of links R1 and R2: an instruction with one operand uses only R1, and an instruction with two operands uses both R1 and R2. Branch instructions such as jump were counted toward the BR link. Integer arithmetic instructions such as add and subtract contribute to the utilization of the INT ALU link. Memory operations must traverse the A, TV, and TL (address, tag verify, tag lookup) links in addition to the on-chip network link. The WB stage is traversed by every instruction which has a destination register. CP-only instructions which use the four-stage floating-point and two-stage multiply units correspond to the FP1-FP4 and M1, M2 links, respectively. Each instruction issued to the AG incurs the additional penalty of having to be transferred across the on-chip network.
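To make the utilization rules above concrete, the following sketch counts link traversals over a toy instruction trace. The mnemonics and the instruction-to-link mapping are illustrative assumptions, since the actual CS754 ISA is still in flux; only the counting scheme reflects the method described above.

#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    // Hypothetical mapping from instruction mnemonic to the pipeline
    // links it traverses beyond fetch and decode; IF, ID, and WB are
    // handled separately below.
    std::map<std::string, std::vector<std::string>> extraLinks = {
        {"add",  {"R1", "R2", "INT_ALU"}},  // two-operand integer op
        {"sub",  {"R1", "R2", "INT_ALU"}},
        {"jump", {"BR"}},                   // branch
        {"load", {"R1", "A", "TV", "TL"}},  // memory op: address, tag verify, tag lookup
    };
    // Instructions with a destination register also traverse WB.
    std::map<std::string, bool> writesRegister = {
        {"add", true}, {"sub", true}, {"jump", false}, {"load", true}};

    std::vector<std::string> trace = {"load", "add", "sub", "jump"};

    std::map<std::string, double> utilization;
    for (const auto& instr : trace) {
        utilization["IF"] += 1;  // every instruction is fetched...
        utilization["ID"] += 1;  // ...and decoded
        for (const auto& link : extraLinks[instr]) utilization[link] += 1;
        if (writesRegister[instr]) utilization["WB"] += 1;
    }
    // Normalize counts by trace length so IF and ID come out to 1.
    for (auto& [link, count] : utilization) {
        count /= trace.size();
        std::cout << link << ": " << count << "\n";
    }
    return 0;
}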

Applications

The power model was applied to small testbenches which are most appropriate to the highly parallel nature of the architecture. The general algorithm for Scalar Alpha X plus Y (SAXPY), expressed in C++, is shown in Listing 1.

for (int i = m; i < n; i++) {
    y[i] = a * x[i] + y[i];
}

Listing 1.

However, in order to evaluate this algorithm in the power model, it must be presented in assembly. The assembly version of this testbench (double precision, hence labeled DAXPY) is given in the appendix. This particular procedure is a good example of an algorithm which can take advantage of an acceleration grid, such as the one found in the CS754 architecture. The iteration for each element of x and y can be processed in parallel on a separate core, provided that the elements of the two arrays have no data dependencies. A special instruction issues several iterations of the algorithm at once to several cores. Depending on other concurrent operations, the number of cores devoted to this procedure may vary. For the purpose of power evaluation, it is not necessary to know exactly how many cores are dedicated to a particular kernel: whether the entire algorithm is run in one cycle across multiple cores or sequentially on a single core is of no consequence to the power estimation, since the cores are architecturally identical and consume the same amount of power. It is possible that there are differences between these two scenarios, and further investigation is warranted.

The second testbench, 4ACC, accumulates a list of floating-point values using four floating-point accumulators. Since the only floating-point unit in the architecture is found in the CP, this testbench serves as a demonstration of CP power versus AG power. The C++ representation of the 4ACC testbench is given in Listing 2.

double list[100], sum = 0.;
for (int i = 0; i < 100; i++)
    sum += list[i];

Listing 2.

Figure 2 shows a comparison of the power estimates of the two testbenches. It is interesting to note the distribution of power in each program. The CP is clearly very energy-thrifty compared to the AG, even though it is a superscalar core. Not only does the AG consume four times the power of the CP, but it also incurs the additional overhead of network traffic for the issuance of instructions. The 4ACC testbench uses the network extensively because each element of the floating-point array is fetched directly from memory.

Figure 2: Benchmark comparison of relative component power (CP, AG, interconnect, and total) for the DAXPY and 4ACC testbenches.

Discussion

The parallel nature of the AG introduces unique challenges to the power model. For each step of execution, it is necessary to know what instruction each core is executing. Given a traditional instruction trace, it is possible to infer the work of several cores at a time. In reality, however, the entire AG does not run a single thread at a time. Nevertheless, this power model assumes that only one thread is run at a time, using all available cores. The accuracy of the power estimate is not compromised, because it is not dependent on execution time: the execution of two threads in sequence, each using the entire AG, is modeled the same as both threads running concurrently, each using half of the AG.
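The equivalence can be made explicit with a small worked example (an illustration consistent with the model above, not a derivation from the paper). If each of two identical threads contributes per-link traversal counts c_l, both schedules, sequential on the full AG or concurrent on two halves of it, produce the same total of 2c_l traversals per link, and therefore the same estimate:

    P_{\mathrm{seq}} \;\propto\; k \sum_{l} w_l\,(c_l + c_l) \;=\; k \sum_{l} w_l\,(2c_l) \;\propto\; P_{\mathrm{conc}}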

Further Investigation

With the continuing development of the CS754 architecture, further details will emerge that will most likely have to be incorporated into this power model. At this stage in its development, many crucial details concerning task delegation, inter-component communication, and memory access in the CS754 architecture are unknown or in flux. The advantage of a parameterized, high-level power estimation methodology is that as these details become available, they can simply be modified or incorporated into the model. Nevertheless, as the architecture evolves, new physical structures may need to be introduced to meet specific challenges, which may tilt the distribution of power consumption.

The testbenches used in this investigation did not take into account instructions executed by the CP. It was assumed that the CP is only responsible for the delegation of tasks to AG cores and for running the operating system, which represents a constant power consumption compared to the AG. However, reversing this assumption will most likely reveal greater differences between testbench power consumption results. Due to the ongoing development of the CS754 architecture, it is difficult to predict the activity of the CP given a small instruction trace in the AG. A scheduling algorithm for delegating tasks to AG cores is still under development; this procedure will most likely be run often, due to the forecasted frequency of AG context switches, and with more information about it, its contribution can be incorporated into the model.

In this exploration, cache traces could not be estimated. A full architecture simulator is necessary to reveal data-dependent cache traces; such data would augment the accuracy of the power model.

Finally, the complete absence of a comprehensive clock-tree power estimation methodology is astounding. It is possible that clock power may be estimated as a constant overhead cost, but with clock gating becoming more prevalent, clock power becomes data dependent. This is definitely an area which deserves a closer look.

Appendix: Testbench Assembly Code

; Example 12.6b. DAXPY algorithm, 32-bit mode (DAXPY)
n = 100                          ; Define constant n (even and positive)
    mov ecx, n * 8               ; Load n * sizeof(double)
    xor eax, eax                 ; i = 0
    lea esi, X                   ; X must be aligned by 16
    lea edi, Y                   ; Y must be aligned by 16
    movsd xmm2, DA               ; Load DA
    shufpd xmm2, xmm2, 0         ; Get DA into both qwords of xmm2
    ; This loop does 2 DAXPY calculations per iteration, using vectors:
L1: movapd xmm1, [esi+eax]       ; X[i], X[i+1]
    mulpd xmm1, xmm2             ; X[i] * DA, X[i+1] * DA
    movapd xmm0, [edi+eax]       ; Y[i], Y[i+1]
    subpd xmm0, xmm1             ; Y[i]-X[i]*DA, Y[i+1]-X[i+1]*DA
    movapd [edi+eax], xmm0       ; Store result
    add eax, 16                  ; Add size of two elements to index
    cmp eax, ecx                 ; Compare with n*8
    jl L1                        ; Loop back

; Example 12.8b. Four floating point accumulators (4ACC)
    lea esi, list                ; Pointer to list
    fld qword ptr [esi]          ; accum1 = list[0]
    fld qword ptr [esi+8]        ; accum2 = list[1]
    fld qword ptr [esi+16]       ; accum3 = list[2]
    fld qword ptr [esi+24]       ; accum4 = list[3]
    fxch st(3)                   ; Get accum1 to top
    add esi, 800                 ; Point to end of list
    mov eax, 32-800              ; Index to list[4] from end of list
L1: fadd qword ptr [esi+eax]     ; Add list[i]
    fxch st(1)
    fadd qword ptr [esi+eax+8]   ; Add list[i+1]
    fxch st(2)
    fadd qword ptr [esi+eax+16]  ; Add list[i+2]
    fxch st(3)
    add eax, 24                  ; i += 3
    js L1                        ; Loop
    faddp st(1), st(0)           ; Add two accumulators together
    fxch st(1)
    faddp st(2), st(0)           ; Add the two other accumulators
    faddp st(1), st(0)           ; Add these sums
    fstp qword ptr [sum]         ; Store the result