Power Estimation of UVA CS754 CMP Architecture

Mateja Putic
mateja@virginia.edu

Introduction

Early power analysis has become an essential part of determining the feasibility of a microprocessor design. As power density continues to rise, it is beneficial to eliminate, early in the design process, choices that would render the design impractical or impossible. Furthermore, power analysis data can also be used to steer power-aware compiler decisions within a constrained energy budget. This paper presents a power and area estimation model for the UVA CS754 architecture.

The trouble with most early power analysis schemes is that they are either computationally complex or statistically inaccurate. In addition, the relative novelty of general-purpose CMP architectures has resulted in a lack of related high-level power analysis techniques. Traditional methods designed for single-core architectures cannot scale to provide rapid and accurate estimates of CMP power consumption. These approaches do not take into account the on-chip networks frequently present on architectures which contain multiple major computation elements. With the number of cores in projected technologies already on the rise, inter-core message passing will undoubtedly become a more dominant factor in power consumption, so the ability of a power estimation technique to incorporate these quantities is similarly important. In this approach, we extend the processor modeling strategy offered by Eisley et al. in High-Level Power Analysis for Multi-Core Chips to the UVA CS754 CMP architecture.

High-level Power Analysis Methodology

Power analysis of complex systems such as CMP architectures is completed by breaking the problem down into smaller blocks. Each instruction in the ISA follows a path through the processor which is composed of smaller common sub-paths. The main idea behind this approach is to identify which sub-paths each instruction traverses and then sum the power consumption of each traversal.

Power and area models go hand in hand. It is impossible to estimate energy without estimating area first, because energy is directly proportional to capacitance, which is in turn directly proportional to area. For high-level power estimation techniques, it is sufficient to estimate area based on component input bit widths. Any more detailed attempt at estimating capacitance would be unnecessary, because the power estimates produced by this model are relative: a power estimation for a trace of instructions produces a unitless quantity. It only makes sense to compare results among power estimations for several testbenches produced by this same model.
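Stated as a formula (a compact restatement implied by the text, not taken verbatim from it), the relative energy charged to an instruction i is the sum of the energies of the links on its path, with each link's energy proportional to its area and hence to its input bit width w_l:

    E_i \;=\; \sum_{l \,\in\, \mathrm{path}(i)} E_l, \qquad E_l \;\propto\; A_l \;\propto\; w_l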

Delegating area estimation to the bit width of a component mitigates the error of high-level estimation by using a unified metric across the entire model. Error is therefore normalized and is prevented from threatening the integrity of the power model.

Modeling the CS754 Architecture

Eisley et al. provide a set of building blocks which can be used to create a network model for any microprocessor architecture. A directed network implicitly discloses dependencies among resources. When a network representation of a microarchitecture is created, the specified dependencies reflect the order in which an instruction passing through the processor traverses components within the datapath. Messages pass through the network of the power model by traversing a series of links, and each pipeline stage is represented by one or more links. Canonical message paths in the network are identified by grouping together links which must be traversed in sequence due to architectural or logical constraints.

Modeling the CS754 architecture entailed first developing models for the control processor (CP), the acceleration grid (AG), and the interconnect network. Both the CP and a single AG core were modeled as the same datapath, but with different dimensions; this sizing ratio is a parameter in the model which can be adjusted. The datapath is made up of stages common to most pipelined architectures: instruction fetch and decode, operand fetch, integer ALU, and cache and write-back stages. The CP also contains a two-stage multiplier and a four-stage floating-point unit, not found in the AG cores.

Figure 1: Network representation of a CP / AG core.
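To illustrate how such a network model might be parameterized in code, the following sketch builds the shared CP/AG pipeline out of links whose normalized widths scale with a per-core multiplier (4 for the four-way superscalar CP, 1 for an AG core, as discussed below). The Link structure, the function name, and the FP1-FP4 stage labels are hypothetical conveniences, not part of the actual model; the remaining link names follow the text.

#include <string>
#include <vector>

// One link in the directed network; width is normalized to the
// 32-bit instruction width, so width 4 means a 128-bit datapath.
struct Link {
    std::string name;
    int width;            // normalized link width (multiple of the 32-bit base)
    double utilization;   // filled in later from an instruction trace
};

// Build the common pipeline datapath shared by the CP and an AG core.
// widthMultiplier is 4 for the four-way superscalar CP and 1 for an
// AG core.
std::vector<Link> buildCorePipeline(int widthMultiplier, bool isCP) {
    std::vector<Link> links = {
        {"IF", widthMultiplier, 0.0},      // instruction fetch
        {"ID", widthMultiplier, 0.0},      // instruction decode
        {"R1", widthMultiplier, 0.0},      // first operand fetch
        {"R2", widthMultiplier, 0.0},      // second operand fetch
        {"BR", widthMultiplier, 0.0},      // branch
        {"INT_ALU", widthMultiplier, 0.0}, // integer ALU
        {"A", widthMultiplier, 0.0},       // cache address
        {"TV", widthMultiplier, 0.0},      // tag verify
        {"TL", widthMultiplier, 0.0},      // tag lookup
        {"WB", widthMultiplier, 0.0},      // write-back
    };
    if (isCP) {
        // Units found only in the CP: a two-stage multiplier and a
        // four-stage floating-point unit.
        for (const char* n : {"M1", "M2", "FP1", "FP2", "FP3", "FP4"})
            links.push_back({n, widthMultiplier, 0.0});
    }
    return links;
}

With the parameters used in this exploration, the model would instantiate buildCorePipeline(4, true) once for the CP and buildCorePipeline(1, false) sixteen times for the AG, plus TO GRID and FROM GRID links of normalized width four.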

The interconnect network is modeled as a single link whose width is a parameter within the model. When data traverses the TO GRID link to arrive at node 14, it is understood that it has arrived at its destination. For example, if the CP sends data to a core on the AG via the on-chip network, when the data reaches node 14 of the CP, it also reaches node 14 of the AG core. At this point, the data has crossed the data network grid and has left the CP. Node 14 can be thought of as an intermediate point between source and destination.

In order to create a parameterized model of the CS754 architecture, which is a work in progress, it was necessary to make some assumptions. Each instruction is assumed to be 32 bits wide, which sets the minimum link width to 32 bits; the width of any link is therefore a multiple of 32. For brevity, link width can simply be normalized to 32, so a normalized link width of four corresponds to a 128-bit-wide component. For the purpose of this exploration, it was assumed that there are 16 discrete cores in the AG, but this number can be specified as a parameter in the model. The only significant matter concerning the grid size is its size relative to the CP.

Although the power model of the CP's pipeline looks the same as that of an AG core (with the addition of the floating-point and multiply units), the CP is assumed to be a four-way, out-of-order superscalar core. Therefore, each link in the CP pipeline is modeled four times as wide as a link in a single AG core. Furthermore, since the CP is capable of issuing four instructions at a time to the AG via the on-chip network, the normalized width of the TO GRID and FROM GRID links is four. The interconnect network communicates four words of data, plus routing information.

Calculating Power Constituents

Power is calculated as suggested by Eisley et al., using the product of the energy and utilization of each link. Each link in the network which represents the underlying architecture has its own component energy cost and utilization. When these individual quantities are summed, an approximation of the total power utilization of the architecture is obtained.

The energy of a link is approximated by estimating the area of the component which that link represents. This can be done because the other factors which make up capacitance are held constant throughout the design, and only area is varied. Since this power model is relative, only the datapath width of a component is used to approximate area. Pipeline stages are assumed to have no slack, so for the purposes of estimation the propagation delay of each stage is the same.

    E \propto CV^2, \qquad C = \frac{\varepsilon A}{d} \quad\Longrightarrow\quad E = kA
    (Lemma 1)

Each link is assigned a static energy cost. Link power, as specified by Eisley et al., is calculated by multiplying the energy of the link by its utilization. Lemma 1 shows that energy is directly proportional to area.
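Combining the area proxy of Lemma 1 with per-link utilizations gives the unitless figure the model reports. As a sketch (restating the Eisley et al. product formulation under the width-as-area assumption, with w_l the normalized width and U_l the utilization of link l):

    P \;\propto\; \sum_{l \,\in\, \mathrm{links}} E_l\,U_l \;=\; k \sum_{l \,\in\, \mathrm{links}} w_l\,U_l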

Since path width is used as a proxy for area, the energy consumption of major architectural structures can simply be characterized by path width.

Utilization is estimated from an instruction trace. The utilization of each link within the core pipelines was estimated by counting occurrences of related instructions. Every instruction must be fetched and decoded, so the utilization of both the IF and ID links is 1 in every core. Operand fetches contribute to the utilization of links R1 and R2: an instruction with one operand uses only R1, and an instruction with two operands uses both R1 and R2. Branch instructions such as jump were counted toward the BR link. Integer arithmetic instructions such as add and subtract contribute to the utilization of the INT ALU link. Memory operations must traverse the A, TV, and TL (address, tag verify, tag lookup) links in addition to the on-chip network link. The WB stage is traversed by every instruction which has a destination register. CP-only instructions which use the four-stage floating-point and two-stage multiply units correspond to the FP1-FP4 and M1, M2 links, respectively. Each instruction issued to the AG incurs the additional penalty of having to be transferred across the on-chip network.
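To make the utilization rules above concrete, the following sketch counts link traversals over a toy instruction trace. The mnemonics and the instruction-to-link mapping are illustrative assumptions, since the actual CS754 ISA is still in flux; only the counting scheme reflects the method described above.

#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    // Hypothetical mapping from instruction mnemonic to the pipeline
    // links it traverses beyond fetch and decode; IF, ID, and WB are
    // handled separately below.
    std::map<std::string, std::vector<std::string>> extraLinks = {
        {"add",  {"R1", "R2", "INT_ALU"}},  // two-operand integer op
        {"sub",  {"R1", "R2", "INT_ALU"}},
        {"jump", {"BR"}},                   // branch
        {"load", {"R1", "A", "TV", "TL"}},  // memory op: address, tag verify, tag lookup
    };
    // Instructions with a destination register also traverse WB.
    std::map<std::string, bool> writesRegister = {
        {"add", true}, {"sub", true}, {"jump", false}, {"load", true}};

    std::vector<std::string> trace = {"load", "add", "sub", "jump"};

    std::map<std::string, double> utilization;
    for (const auto& instr : trace) {
        utilization["IF"] += 1;  // every instruction is fetched...
        utilization["ID"] += 1;  // ...and decoded
        for (const auto& link : extraLinks[instr]) utilization[link] += 1;
        if (writesRegister[instr]) utilization["WB"] += 1;
    }
    // Normalize counts by trace length so IF and ID come out to 1.
    for (auto& [link, count] : utilization) {
        count /= trace.size();
        std::cout << link << ": " << count << "\n";
    }
    return 0;
}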

Applications

The power model was applied to small testbenches which are most appropriate to the highly parallel nature of the architecture. The general algorithm for Scalar Alpha X plus Y (SAXPY), expressed in C++, is shown in Listing 1.

for (int i = m; i < n; i++) {
    y[i] = a * x[i] + y[i];
}

Listing 1.

However, in order to evaluate this algorithm in the power model, it must be presented in assembly. The assembly version of this testbench (double precision, hence labeled DAXPY) is given in the appendix. This particular procedure is a good example of an algorithm which can take advantage of an acceleration grid, such as the one found in the CS754 architecture. The iteration for each element of x and y can be processed in parallel on a separate core, provided that the elements of the two arrays have no data dependencies. A special instruction issues several iterations of the algorithm at once to several cores. Depending on other concurrent operations, the number of cores devoted to this procedure may vary. For the purpose of power evaluation, it is not necessary to know exactly how many cores are dedicated to a particular kernel: whether the entire algorithm is run in one cycle across multiple cores or sequentially on a single core is of no consequence to the power estimation, since the cores are architecturally identical and consume the same amount of power. It is possible that there are differences between these two scenarios, and further investigation is warranted.

The second testbench, 4ACC, accumulates a list of floating-point values using four floating-point accumulators. Since the only floating-point unit in the architecture is found in the CP, this testbench serves as a demonstration of CP power versus AG power. The C++ representation of the 4ACC testbench is given in Listing 2.

double list[100], sum = 0.;
for (int i = 0; i < 100; i++)
    sum += list[i];

Listing 2.

Figure 2 shows a comparison of the power estimates of the two testbenches. It is interesting to note the distribution of power in each program. The CP is clearly very energy-thrifty compared to the AG, even though it is a superscalar core. Not only does the AG consume four times the power of the CP, but it also incurs the additional overhead of network traffic for the issuance of instructions. The 4ACC testbench uses the network extensively because each element of the floating-point array is fetched directly from memory.

Figure 2: Benchmark comparison of relative component power (CP, AG, interconnect, and total) for the DAXPY and 4ACC testbenches.

Discussion

The parallel nature of the AG introduces unique challenges to the power model. For each step of execution, it is necessary to know what instruction each core is executing. Given a traditional instruction trace, it is possible to infer the work of several cores at a time. In reality, however, the entire AG does not run a single thread at a time. Nevertheless, this power model assumes that only one thread is run at a time, using all available cores. The accuracy of the power estimate is not compromised, because it is not dependent on execution time: the execution of two threads in sequence, each using the entire AG, is modeled the same as both threads running concurrently, each using half of the AG.
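The equivalence can be made explicit with a small worked example (an illustration consistent with the model above, not a derivation from the paper). If each of two identical threads contributes per-link traversal counts c_l, both schedules, sequential on the full AG or concurrent on two halves of it, produce the same total of 2c_l traversals per link, and therefore the same estimate:

    P_{\mathrm{seq}} \;\propto\; k \sum_{l} w_l\,(c_l + c_l) \;=\; k \sum_{l} w_l\,(2c_l) \;\propto\; P_{\mathrm{conc}}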

Further Investigation

With the continuing development of the CS754 architecture, further details will emerge that will most likely have to be incorporated into this power model. At this stage in its development, many crucial details concerning task delegation, inter-component communication, and memory access in the CS754 architecture are unknown or in flux. The advantage of a parameterized, high-level power estimation methodology is that as these details become available, they can simply be modified or incorporated into the model. Nevertheless, as the architecture evolves, new physical structures may need to be introduced to meet specific challenges, which may tilt the distribution of power consumption.

The testbenches used in this investigation did not take into account instructions executed by the CP. It was assumed that the CP is only responsible for the delegation of tasks to AG cores and for running the operating system, which represents a constant power consumption compared to the AG. However, reversing this assumption will most likely reveal greater differences between testbench power consumption results. Due to the ongoing development of the CS754 architecture, it is difficult to predict the activity of the CP given a small instruction trace in the AG. A scheduling algorithm for delegating tasks to AG cores is still under development; this procedure will most likely be run often, due to the forecasted frequency of AG context switches, and with more information about it, its contribution can be incorporated into the model.

In this exploration, cache traces could not be estimated. A full architecture simulator is necessary to reveal data-dependent cache traces; such data would augment the accuracy of the power model.

Finally, the complete absence of a comprehensive clock-tree power estimation methodology is astounding. It is possible that clock power may be estimated as a constant overhead cost, but with clock gating becoming more prevalent, clock power becomes data dependent. This is definitely an area which deserves a closer look.

Appendix: Testbench Assembly Code

; Example 12.6b. DAXPY algorithm, 32-bit mode (DAXPY)
n = 100                          ; Define constant n (even and positive)
    mov ecx, n * 8               ; Load n * sizeof(double)
    xor eax, eax                 ; i = 0
    lea esi, X                   ; X must be aligned by 16
    lea edi, Y                   ; Y must be aligned by 16
    movsd xmm2, DA               ; Load DA
    shufpd xmm2, xmm2, 0         ; Get DA into both qwords of xmm2
    ; This loop does 2 DAXPY calculations per iteration, using vectors:
L1: movapd xmm1, [esi+eax]       ; X[i], X[i+1]
    mulpd xmm1, xmm2             ; X[i] * DA, X[i+1] * DA
    movapd xmm0, [edi+eax]       ; Y[i], Y[i+1]
    subpd xmm0, xmm1             ; Y[i]-X[i]*DA, Y[i+1]-X[i+1]*DA
    movapd [edi+eax], xmm0       ; Store result
    add eax, 16                  ; Add size of two elements to index
    cmp eax, ecx                 ; Compare with n*8
    jl L1                        ; Loop back

; Example 12.8b. Four floating point accumulators (4ACC)
    lea esi, list                ; Pointer to list
    fld qword ptr [esi]          ; accum1 = list[0]
    fld qword ptr [esi+8]        ; accum2 = list[1]
    fld qword ptr [esi+16]       ; accum3 = list[2]
    fld qword ptr [esi+24]       ; accum4 = list[3]
    fxch st(3)                   ; Get accum1 to top
    add esi, 800                 ; Point to end of list
    mov eax, 32-800              ; Index to list[4] from end of list
L1: fadd qword ptr [esi+eax]     ; Add list[i]
    fxch st(1)
    fadd qword ptr [esi+eax+8]   ; Add list[i+1]
    fxch st(2)
    fadd qword ptr [esi+eax+16]  ; Add list[i+2]
    fxch st(3)
    add eax, 24                  ; i += 3
    js L1                        ; Loop
    faddp st(1), st(0)           ; Add two accumulators together
    fxch st(1)
    faddp st(2), st(0)           ; Add the two other accumulators
    faddp st(1), st(0)           ; Add these sums
    fstp qword ptr [sum]         ; Store the result