Power Estimation of UVA CS754 CMP Architecture

Mateja Putic

Introduction

Early power analysis has become an essential part of determining the feasibility of a microprocessor design. As power density continues to increase, it is beneficial to eliminate early in the design process those choices which would render the design impractical or impossible. Furthermore, power analysis data can be used to steer power-aware compiler decisions within a constrained energy budget. This paper presents a power and area estimation model for the UVA CS754 architecture.

The trouble with most early power analysis schemes is that they are either computationally complex or statistically inaccurate. In addition, the relative novelty of general-purpose CMP architectures has resulted in a lack of related high-level power analysis techniques. Traditional methods designed for single-core architectures cannot scale to provide rapid and accurate estimates of CMP power consumption. These approaches do not take into account the on-chip networks which are frequently present on architectures that contain multiple major computation elements. With the number of cores in projected technologies already on the rise, inter-core message passing will become a more dominant factor in power consumption, so the ability of a power estimation technique to incorporate these quantities is correspondingly important. In this approach, we extend the processor modeling strategy offered by Eisley et al. in High-Level Power Analysis for Multi-Core Chips to the UVA CS754 CMP architecture.

High-level Power Analysis Methodology

Power analysis of complex systems such as CMP architectures is completed by breaking the problem down into smaller blocks. Each instruction in the ISA follows a path which is composed of smaller common paths. Identifying which sub-paths each instruction traverses, and then summing the power consumption of each traversal, is the main idea behind this approach.

Power and area models go hand in hand. It is impossible to estimate energy without first estimating area, because energy is directly proportional to capacitance, which is in turn directly proportional to area. For high-level power estimation techniques, it is sufficient to estimate area from component input bit widths. More detailed attempts at estimating capacitance would be unnecessary, because the power estimates produced by this model are relative: a power estimate for a trace of instructions is a unitless quantity. It only makes sense to compare results among power estimates for several testbenches produced by this same model.
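The decomposition described above can be illustrated with a minimal C++ sketch. It is not the author's implementation: the `Link`, `Path`, `pathEnergy`, and `traceEnergy` names are hypothetical, and the sketch assumes that a link's relative energy is simply its width in 32-bit units, with the constant of proportionality dropped because the results are relative.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One link of the network model. Its relative energy cost is taken to be
// its width in 32-bit units (an assumption of this sketch).
struct Link {
    std::string name;
    int width;  // normalized width, 1 == 32 bits
};

// A sub-path is an ordered list of links traversed in sequence.
using Path = std::vector<Link>;

// Relative energy of one traversal of a path: the sum of its link widths.
inline double pathEnergy(const Path& path) {
    double energy = 0.0;
    for (const Link& link : path) {
        assert(link.width > 0);
        energy += link.width;
    }
    return energy;
}

// Relative energy of a trace. Each instruction name maps to the sub-paths
// it traverses, and every traversal adds that sub-path's energy.
inline double traceEnergy(
    const std::vector<std::string>& trace,
    const std::map<std::string, std::vector<Path>>& pathsByInstruction) {
    double total = 0.0;
    for (const std::string& op : trace) {
        auto it = pathsByInstruction.find(op);
        assert(it != pathsByInstruction.end());  // unknown instruction
        for (const Path& path : it->second) total += pathEnergy(path);
    }
    return total;
}
```

A caller would build one shared front-end path (for example fetch and decode) and reuse it in the table entry of every instruction, so that common sub-paths are described only once.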

The method of delegating area estimation to the bit width of a component mitigates the error of high-level estimation by using a unified metric across the entire model. Error is therefore normalized and is prevented from threatening the integrity of the power model.

Modeling the CS754 Architecture

Eisley et al. provide a set of building blocks which can be used to create a network model for any microprocessor architecture. A directed network implicitly discloses dependencies among resources. When a network representation of a microarchitecture is created, the dependencies specified reflect the order in which an instruction passing through the processor traverses components within the datapath. Messages pass through the network of the power model by traversing a series of links. Pipeline stages are represented by one or more links in the power model. Canonical message paths in the network are identified by grouping together links which must be traversed in sequence due to architectural or logical constraints.

Modeling the CS754 architecture entailed first developing models for the control processor (CP), the acceleration grid (AG), and the interconnect network. Both the CP and a single AG core were modeled as the same datapath, but with different dimensions. This datapath is made up of stages which are common to most pipelined architectures: an instruction fetch and decode stage, operand fetch, integer ALU, and cache and write-back stages. The CP also contains a two-stage multiplier and a four-stage floating-point unit, which are not found in the AG cores. The sizing ratio between the CP and an AG core is a parameter in the model which can be adjusted.

Figure 1: Network representation of CP / AG core

The interconnect network is modeled as a single link whose width is a parameter within the model. When data traverses the TO GRID link to arrive at node 14, it is understood that it has arrived at its destination. For example, if the CP sends data to a core on the AG via the on-chip network, when the data reaches node 14 of the CP, it also reaches node 14 of the AG core. At this point, the data has crossed the data network grid and has left the CP. Node 14 can be thought of as an intermediate point between source and destination.

In order to create a parameterized model of the CS754 architecture, which is a work in progress, it was necessary to make some assumptions. Each instruction is assumed to be 32 bits wide, which sets the minimum link width to 32 bits. Therefore, the width of any link is a multiple of 32. For brevity, link width can simply be normalized to 32, so a normalized link width of four corresponds to a 128-bit wide component. For the purpose of this exploration, it was assumed that there are 16 discrete cores in the AG, but this number can be specified as a parameter in the model. The only significant aspect of the grid size is its size relative to the CP.

Although the power model of the CP's pipeline looks the same as that of an AG core (with the addition of the floating-point and multiply units), the CP is assumed to be a four-way, out-of-order superscalar core. Therefore, each link in the CP pipeline is modeled as four times as wide as the corresponding link in a single AG core. Furthermore, since the CP is capable of issuing four instructions at a time to the AG via the on-chip network, the width of the TO GRID and FROM GRID links is four. The interconnect network communicates four words of data, plus routing information.

Calculating Power Constituents

Power is calculated as suggested by Eisley et al., using the product of energy and utilization of each link. Each link in the network which represents the underlying architecture has its own component energy cost and utilization. When the individual link powers are summed, an approximation of the total power consumption of the architecture is obtained.
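The parameterization and the energy-times-utilization sum described above can be sketched as follows. This is an illustrative sketch only: the `ModelParams`, `LinkActivity`, `linkBits`, `pipelineLinkWidth`, and `totalPower` names are hypothetical, and the default values simply mirror the assumptions stated in the text (32-bit words, a four-way CP, 16 AG cores, grid links four words wide).

```cpp
#include <cassert>
#include <vector>

// Parameters named in the text; the defaults mirror its stated assumptions.
struct ModelParams {
    int wordBits = 32;      // instruction width, the minimum link width
    int cpIssueWidth = 4;   // CP modeled as a four-way superscalar core
    int agCores = 16;       // number of cores in the acceleration grid
    int gridLinkWidth = 4;  // TO GRID / FROM GRID width, in words
};

// Physical width in bits of a link with the given normalized width.
inline int linkBits(const ModelParams& p, int normalizedWidth) {
    assert(normalizedWidth > 0);
    return normalizedWidth * p.wordBits;
}

// Normalized width of a pipeline link: AG links are one word wide,
// CP links are cpIssueWidth times as wide.
inline int pipelineLinkWidth(const ModelParams& p, bool isControlProcessor) {
    return isControlProcessor ? p.cpIssueWidth : 1;
}

// A link's relative energy (its normalized width) and its utilization.
struct LinkActivity {
    double energy;
    double utilization;
};

// Total relative power: the sum over links of energy times utilization.
inline double totalPower(const std::vector<LinkActivity>& links) {
    double power = 0.0;
    for (const LinkActivity& link : links) {
        assert(link.utilization >= 0.0);
        power += link.energy * link.utilization;
    }
    return power;
}
```

With the default parameters, `linkBits(p, 4)` gives 128, matching the 128-bit example in the text. Keeping every assumption in one parameter structure makes it straightforward to rerun the estimate when the architecture changes.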
The energy of a link is approximated by estimating the area of the component which that link represents. This can be done because the other factors which make up capacitance are held constant throughout the design, and only area is varied. Since this power model is relative, only the datapath width of a component is used to approximate area. Pipeline stages are assumed to have no slack, so for the purposes of estimation the propagation delay of each stage is the same.

\[ E \propto C V^2, \qquad C = \varepsilon \frac{A}{d}, \qquad E = kA \]

Lemma 1.

Each link is assigned a static energy cost. Link power, as specified by Eisley et al., is calculated by multiplying the energy of the link by its utilization. Lemma 1 shows that energy is directly proportional to area. Since path width is used as a proxy for area, the energy consumption of major architectural structures can simply be characterized by their path width.

Utilization is estimated from an instruction trace. The utilization of each link within the core pipelines was estimated by counting occurrences of the related instructions. Every instruction must be fetched and decoded, so the utilization of both the IF and ID links is 1 in every core. Operand fetches contribute to the utilization of links R1 and R2, using only R1 for one operand and both R1 and R2 for two operands. Branch instructions such as jump were counted for the BR link. The integer arithmetic instructions add and subtract are part of the utilization of the INT ALU link. Memory operations must traverse the A, TV, TL (address, tag verify, tag lookup) links in addition to the on-chip network link. The WB stage is traversed by every instruction which has a destination register. CP-only instructions which use the four-stage floating-point unit and the multiplier correspond to the F, P, U, FP4 and M1, M2 links, respectively. Each instruction issued to the AG incurs the additional penalty of having to be transferred across the on-chip network.

Applications

The power model was applied to small testbenches which are most appropriate to the highly parallel architecture. The general algorithm for Scalar Alpha X plus Y (SAXPY) expressed in C++ is shown in Listing 1.

    for (int i = m; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }

Listing 1.

However, in order to evaluate this algorithm in the power model, it must be presented in assembly. The assembly version of this testbench is given in the appendix. This particular procedure is a good example of an algorithm which can take advantage of an acceleration grid, such as the one found in the CS754 architecture. Each iteration for each element of x and y can be processed in parallel on separate cores. This parallel processing can only be done assuming that the elements of both arrays have no data dependencies.
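The utilization rules given under Calculating Power Constituents can be sketched as a simple counting pass over a trace. This is a minimal illustration, not the tool used in the paper: the `Instr` record and its fields are hypothetical simplifications, only the AG-core links are covered, and `NET` is a placeholder name for the on-chip network link.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Simplified instruction record for this sketch (hypothetical fields).
struct Instr {
    std::string kind;     // "alu", "branch", or "mem"
    int operands;         // number of register source operands: 0, 1, or 2
    bool writesRegister;  // true if the instruction has a destination register
};

// Per-link utilization: traversals of the link divided by the trace length.
inline std::map<std::string, double> linkUtilization(
    const std::vector<Instr>& trace) {
    assert(!trace.empty());
    std::map<std::string, double> count;
    const char* memLinks[] = {"A", "TV", "TL", "NET"};
    for (const Instr& in : trace) {
        count["IF"] += 1;  // every instruction is fetched
        count["ID"] += 1;  // and decoded
        if (in.operands >= 1) count["R1"] += 1;
        if (in.operands >= 2) count["R2"] += 1;
        if (in.kind == "branch") count["BR"] += 1;
        if (in.kind == "alu") count["INT ALU"] += 1;
        if (in.kind == "mem") {
            for (const char* link : memLinks) count[link] += 1;
        }
        if (in.writesRegister) count["WB"] += 1;
    }
    for (auto& entry : count) {
        entry.second /= static_cast<double>(trace.size());
    }
    return count;
}
```

The resulting map can be paired with per-link energies to form the energy-times-utilization products that the model sums.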
A special instruction issues several iterations of the algorithm at once to several cores. Depending on other concurrent operations, the number of cores devoted to this procedure may vary. For the purpose of power evaluation, it is not necessary to know exactly how many cores are dedicated to a particular kernel. Whether the entire algorithm is run in one cycle across multiple cores or sequentially on a single core is of no consequence to the power estimate: since the cores are architecturally identical, they consume the same amount of power. It is possible that there are differences between these two scenarios, and further investigation is warranted.

The second testbench included four floating-point operations. Since there is only one floating-point unit in the architecture, found in the CP, this testbench is a demonstration of CP power vs. AG power. The C++ representation of the 4ACC testbench is given in Listing 2.

    double list[100], sum = 0.;
    for (int i = 0; i < 100; i++) sum += list[i];

Listing 2.

Figure 2 shows a comparison of the power estimates of the two testbenches. It is interesting to note the distribution of power in each program. The CP is clearly very energy thrifty compared to the AG, even though it is a superscalar core. Not only does the AG consume four times as much power as the CP, but it also incurs the additional overhead of network traffic for the issuance of instructions. The 4ACC testbench uses the network extensively because each element of the floating-point array is fetched directly from memory.

Figure 2: Benchmark Comparison (power by component: CP, AG, Int, Total; testbenches: DAXPY, 4ACC)

Discussion

The parallel nature of the AG introduces unique challenges to the power model. For each step of execution, it is necessary to know what instruction each core is executing. Given a traditional instruction trace, it is possible to infer the work of several cores at a time. In reality, however, the AG as a whole is not running a single thread at a time. Nevertheless, this power model assumes that only one thread is being run at a time, using all available cores. The accuracy of the power estimate is not compromised, because it does not depend on execution time. The execution of two threads in sequence which both use the entire AG is the same as both threads running concurrently, each using half of the AG.
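The schedule-independence argument above can be checked with a small sketch. It rests on the same assumptions as the model (identical cores, an estimate that does not depend on execution time) and deliberately ignores any difference in network traffic between schedules. The `scheduleEnergy` and `spread` helpers and the constant per-instruction energy are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Relative energy of one instruction on one AG core. All cores are
// architecturally identical, so a single constant serves for every core
// (the value is a placeholder for this sketch).
const double kEnergyPerInstruction = 1.0;

// Relative energy of a schedule, given how many instructions each core
// executes. Execution time does not appear anywhere in the sum.
inline double scheduleEnergy(const std::vector<long>& instructionsPerCore) {
    double total = 0.0;
    for (long n : instructionsPerCore) {
        assert(n >= 0);
        total += kEnergyPerInstruction * static_cast<double>(n);
    }
    return total;
}

// Spread `work` instructions as evenly as possible over `cores` cores.
inline std::vector<long> spread(long work, int cores) {
    assert(work >= 0 && cores > 0);
    std::vector<long> perCore(cores, work / cores);
    for (long i = 0; i < work % cores; ++i) perCore[i] += 1;
    return perCore;
}
```

Running two threads one after the other on all 16 cores, or side by side on 8 cores each, gives the same sum under these assumptions, because every instruction contributes the same energy wherever and whenever it executes.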

Further Investigation

With the continuing development of the CS754 architecture, further details will emerge that will most likely have to be incorporated into this power model. At this stage in its development, many crucial details concerning task delegation, inter-component communication, and memory access of the CS754 architecture are unknown or are in flux. The advantage of a parameterized, high-level power estimation methodology is that as these parameters become available, they can simply be modified or incorporated into the model. Nevertheless, as the architecture evolves, new physical structures may need to be introduced to meet specific challenges, which may tilt the distribution of power consumption.

Testbenches used in this investigation did not take into account instructions executed by the CP. It was assumed that the CP is only responsible for the delegation of tasks to AG cores and for running the operating system, which represents a constant power consumption compared to the AG. However, the reversal of this assumption will most likely reveal greater differences between testbench power consumption results. Due to the ongoing development of the CS754 architecture, it is difficult to predict the activity of the CP given a small instruction trace in the AG. A scheduling algorithm for delegating tasks to AG cores is still under development. This procedure will most likely be run often, due to the forecasted frequency of AG context switches. With more

In this exploration, cache traces could not be estimated. It is necessary to have a full architecture simulator to reveal data-dependent cache traces. Such data would improve the accuracy of the power model.

Finally, the complete absence of a comprehensive clock-tree power estimation methodology is astounding. It is possible that clock power may be estimated as a constant overhead cost, but with clock gating becoming more prevalent, clock power becomes data dependent. This is definitely an area which deserves a closer look.

Appendix: Testbench Assembly Code

    ; Example 12.6b. DAXPY algorithm, 32-bit mode (DAXPY)
    n = 100                          ; Define constant n (even and positive)
        mov    ecx, n * 8            ; Load n * sizeof(double)
        xor    eax, eax              ; i = 0
        lea    esi, X                ; X must be aligned by 16
        lea    edi, Y                ; Y must be aligned by 16
        movsd  xmm2, DA              ; Load DA
        shufpd xmm2, xmm2, 0         ; Get DA into both qwords of xmm2
    ; This loop does 2 DAXPY calculations per iteration, using vectors:
    L1: movapd xmm1, [esi+eax]       ; X[i], X[i+1]
        mulpd  xmm1, xmm2            ; X[i] * DA, X[i+1] * DA
        movapd xmm0, [edi+eax]       ; Y[i], Y[i+1]
        subpd  xmm0, xmm1            ; Y[i]-X[i]*DA, Y[i+1]-X[i+1]*DA
        movapd [edi+eax], xmm0       ; Store result
        add    eax, 16               ; Add size of two elements to index
        cmp    eax, ecx              ; Compare with n*8
        jl     L1                    ; Loop back

    ; Example 12.8b, Four floating point accumulators (4ACC)
        lea    esi, list             ; Pointer to list
        fld    qword ptr [esi]       ; accum1 = list[0]
        fld    qword ptr [esi+8]     ; accum2 = list[1]
        fld    qword ptr [esi+16]    ; accum3 = list[2]
        fld    qword ptr [esi+24]    ; accum4 = list[3]
        fxch   st(3)                 ; Get accum1 to top
        add    esi, 800              ; Point to end of list
        mov    eax,                  ; Index to list[4] from end of list
    L1: fadd   qword ptr [esi+eax]   ; Add list[i]
        fxch   st(1)
        fadd   qword ptr [esi+eax+8] ; Add list[i+1]
        fxch   st(2)
        fadd   qword ptr [esi+eax+16] ; Add list[i+2]
        fxch   st(3)
        add    eax, 24               ; i += 3
        js     L1                    ; Loop
        faddp  st(1), st(0)          ; Add two accumulators together
        fxch   st(1)
        faddp  st(2), st(0)          ; Add the two other accumulators
        faddp  st(1), st(0)          ; Add these sums
        fstp   qword ptr [sum]       ; Store the result
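For readers less familiar with x87 code, the idea behind the 4ACC listing above can be sketched in C++. This is an approximation for illustration, not a translation: the assembly rotates four accumulators while adding three elements per iteration, whereas this sketch adds four elements per iteration, and `sumFourAccumulators` is a hypothetical name. Splitting the sum across accumulators changes the order of floating-point additions, so results may differ in the last bits from the simple loop in Listing 2.

```cpp
#include <cassert>
#include <cstddef>

// Sum `n` doubles using four independent accumulators, so that
// consecutive additions do not depend on each other's result.
inline double sumFourAccumulators(const double* list, std::size_t n) {
    assert(list != nullptr || n == 0);
    double acc[4] = {0.0, 0.0, 0.0, 0.0};
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc[0] += list[i];
        acc[1] += list[i + 1];
        acc[2] += list[i + 2];
        acc[3] += list[i + 3];
    }
    for (; i < n; ++i) acc[0] += list[i];  // leftover elements
    // Combine pairwise, as the final faddp sequence in the listing does.
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}
```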


Announcements HW1 is due on this Friday (Sept 12th) Appendix A is very helpful to HW1. Check out system calls Announcements HW1 is due on this Friday (Sept 12 th ) Appendix A is very helpful to HW1. Check out system calls on Page A-48. Ask TA (Liquan chen: liquan@ece.rutgers.edu) about homework related questions.

More information

History of the Intel 80x86

History of the Intel 80x86 Intel s IA-32 Architecture Cptr280 Dr Curtis Nelson History of the Intel 80x86 1971 - Intel invents the microprocessor, the 4004 1975-8080 introduced 8-bit microprocessor 1978-8086 introduced 16 bit microprocessor

More information

The Instruction Set. Chapter 5

The Instruction Set. Chapter 5 The Instruction Set Architecture Level(ISA) Chapter 5 1 ISA Level The ISA level l is the interface between the compilers and the hardware. (ISA level code is what a compiler outputs) 2 Memory Models An

More information

ECE 486/586. Computer Architecture. Lecture # 7

ECE 486/586. Computer Architecture. Lecture # 7 ECE 486/586 Computer Architecture Lecture # 7 Spring 2015 Portland State University Lecture Topics Instruction Set Principles Instruction Encoding Role of Compilers The MIPS Architecture Reference: Appendix

More information

Pentium 4 Processor Block Diagram

Pentium 4 Processor Block Diagram FP FP Pentium 4 Processor Block Diagram FP move FP store FMul FAdd MMX SSE 3.2 GB/s 3.2 GB/s L D-Cache and D-TLB Store Load edulers Integer Integer & I-TLB ucode Netburst TM Micro-architecture Pipeline

More information

Registers. Registers

Registers. Registers All computers have some registers visible at the ISA level. They are there to control execution of the program hold temporary results visible at the microarchitecture level, such as the Top Of Stack (TOS)

More information

void twiddle1(int *xp, int *yp) { void twiddle2(int *xp, int *yp) {

void twiddle1(int *xp, int *yp) { void twiddle2(int *xp, int *yp) { Optimization void twiddle1(int *xp, int *yp) { *xp += *yp; *xp += *yp; void twiddle2(int *xp, int *yp) { *xp += 2* *yp; void main() { int x = 3; int y = 3; twiddle1(&x, &y); x = 3; y = 3; twiddle2(&x,

More information

12.1. CS356 Unit 12. Processor Hardware Organization Pipelining

12.1. CS356 Unit 12. Processor Hardware Organization Pipelining 12.1 CS356 Unit 12 Processor Hardware Organization Pipelining BASIC HW 12.2 Inputs Outputs 12.3 Logic Circuits Combinational logic Performs a specific function (mapping of 2 n input combinations to desired

More information

CSCE 5610: Computer Architecture

CSCE 5610: Computer Architecture HW #1 1.3, 1.5, 1.9, 1.12 Due: Sept 12, 2018 Review: Execution time of a program Arithmetic Average, Weighted Arithmetic Average Geometric Mean Benchmarks, kernels and synthetic benchmarks Computing CPI

More information

ECE 411, Exam 1. Good luck!

ECE 411, Exam 1. Good luck! This exam has 6 problems. Make sure you have a complete exam before you begin. Write your name on every page in case pages become separated during grading. You will have three hours to complete this exam.

More information

ECE 411 Exam 1. Name:

ECE 411 Exam 1. Name: This exam has 5 problems. Make sure you have a complete exam before you begin. Write your name on every page in case pages become separated during grading. You will have 3 hours to complete this exam.

More information

Digital Forensics Lecture 3 - Reverse Engineering

Digital Forensics Lecture 3 - Reverse Engineering Digital Forensics Lecture 3 - Reverse Engineering Low-Level Software Akbar S. Namin Texas Tech University Spring 2017 Reverse Engineering High-Level Software Low-level aspects of software are often the

More information

CS 61C: Great Ideas in Computer Architecture. The Flynn Taxonomy, Intel SIMD Instructions

CS 61C: Great Ideas in Computer Architecture. The Flynn Taxonomy, Intel SIMD Instructions CS 61C: Great Ideas in Computer Architecture The Flynn Taxonomy, Intel SIMD Instructions Guest Lecturer: Alan Christopher 3/08/2014 Spring 2014 -- Lecture #19 1 Neuromorphic Chips Researchers at IBM and

More information

Computer System Architecture

Computer System Architecture CSC 203 1.5 Computer System Architecture Department of Statistics and Computer Science University of Sri Jayewardenepura Instruction Set Architecture (ISA) Level 2 Introduction 3 Instruction Set Architecture

More information

CS252 Graduate Computer Architecture Midterm 1 Solutions

CS252 Graduate Computer Architecture Midterm 1 Solutions CS252 Graduate Computer Architecture Midterm 1 Solutions Part A: Branch Prediction (22 Points) Consider a fetch pipeline based on the UltraSparc-III processor (as seen in Lecture 5). In this part, we evaluate

More information

Unit 9 : Fundamentals of Parallel Processing

Unit 9 : Fundamentals of Parallel Processing Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

Where we are. Instruction selection. Abstract Assembly. CS 4120 Introduction to Compilers

Where we are. Instruction selection. Abstract Assembly. CS 4120 Introduction to Compilers Where we are CS 420 Introduction to Compilers Andrew Myers Cornell University Lecture 8: Instruction Selection 5 Oct 20 Intermediate code Canonical intermediate code Abstract assembly code Assembly code

More information

For your convenience Apress has placed some of the front matter material after the index. Please use the Bookmarks and Contents at a Glance links to

For your convenience Apress has placed some of the front matter material after the index. Please use the Bookmarks and Contents at a Glance links to For your convenience Apress has placed some of the front matter material after the index. Please use the Bookmarks and Contents at a Glance links to access them. Contents at a Glance About the Author...xi

More information

Optimizing Memory Bandwidth

Optimizing Memory Bandwidth Optimizing Memory Bandwidth Don t settle for just a byte or two. Grab a whole fistful of cache. Mike Wall Member of Technical Staff Developer Performance Team Advanced Micro Devices, Inc. make PC performance

More information

Chapter 2. lw $s1,100($s2) $s1 = Memory[$s2+100] sw $s1,100($s2) Memory[$s2+100] = $s1

Chapter 2. lw $s1,100($s2) $s1 = Memory[$s2+100] sw $s1,100($s2) Memory[$s2+100] = $s1 Chapter 2 1 MIPS Instructions Instruction Meaning add $s1,$s2,$s3 $s1 = $s2 + $s3 sub $s1,$s2,$s3 $s1 = $s2 $s3 addi $s1,$s2,4 $s1 = $s2 + 4 ori $s1,$s2,4 $s2 = $s2 4 lw $s1,100($s2) $s1 = Memory[$s2+100]

More information

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2 Lecture 5: Instruction Pipelining Basic concepts Pipeline hazards Branch handling and prediction Zebo Peng, IDA, LiTH Sequential execution of an N-stage task: 3 N Task 3 N Task Production time: N time

More information

Low-Level Essentials for Understanding Security Problems Aurélien Francillon

Low-Level Essentials for Understanding Security Problems Aurélien Francillon Low-Level Essentials for Understanding Security Problems Aurélien Francillon francill@eurecom.fr Computer Architecture The modern computer architecture is based on Von Neumann Two main parts: CPU (Central

More information

Computer Systems Laboratory Sungkyunkwan University

Computer Systems Laboratory Sungkyunkwan University ARM & IA-32 Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ARM (1) ARM & MIPS similarities ARM: the most popular embedded core Similar basic set

More information

CS 101, Mock Computer Architecture

CS 101, Mock Computer Architecture CS 101, Mock Computer Architecture Computer organization and architecture refers to the actual hardware used to construct the computer, and the way that the hardware operates both physically and logically

More information

CS 265. Computer Architecture. Wei Lu, Ph.D., P.Eng.

CS 265. Computer Architecture. Wei Lu, Ph.D., P.Eng. CS 265 Computer Architecture Wei Lu, Ph.D., P.Eng. Part 5: Processors Our goal: understand basics of processors and CPU understand the architecture of MARIE, a model computer a close look at the instruction

More information

Compiler construction. x86 architecture. This lecture. Lecture 6: Code generation for x86. x86: assembly for a real machine.

Compiler construction. x86 architecture. This lecture. Lecture 6: Code generation for x86. x86: assembly for a real machine. This lecture Compiler construction Lecture 6: Code generation for x86 Magnus Myreen Spring 2018 Chalmers University of Technology Gothenburg University x86 architecture s Some x86 instructions From LLVM

More information

INSTRUCTION LEVEL PARALLELISM

INSTRUCTION LEVEL PARALLELISM INSTRUCTION LEVEL PARALLELISM Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix H, John L. Hennessy and David A. Patterson,

More information

Complex Pipelines and Branch Prediction

Complex Pipelines and Branch Prediction Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1 Processor Performance Time Program Instructions Program Cycles Instruction CPI Time Cycle

More information

COSC 6385 Computer Architecture. Instruction Set Architectures

COSC 6385 Computer Architecture. Instruction Set Architectures COSC 6385 Computer Architecture Instruction Set Architectures Spring 2012 Instruction Set Architecture (ISA) Definition on Wikipedia: Part of the Computer Architecture related to programming Defines set

More information

Instruction Set Overview

Instruction Set Overview MicroBlaze Instruction Set Overview ECE 3534 Part 1 1 The Facts MicroBlaze Soft-core Processor Highly Configurable 32-bit Architecture Master Component for Creating a MicroController Thirty-two 32-bit

More information

Processors. Young W. Lim. May 9, 2016

Processors. Young W. Lim. May 9, 2016 Processors Young W. Lim May 9, 2016 Copyright (c) 2016 Young W. Lim. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version

More information

General Purpose Signal Processors

General Purpose Signal Processors General Purpose Signal Processors First announced in 1978 (AMD) for peripheral computation such as in printers, matured in early 80 s (TMS320 series). General purpose vs. dedicated architectures: Pros:

More information

Chapter 13 Reduced Instruction Set Computers

Chapter 13 Reduced Instruction Set Computers Chapter 13 Reduced Instruction Set Computers Contents Instruction execution characteristics Use of a large register file Compiler-based register optimization Reduced instruction set architecture RISC pipelining

More information

ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013

ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013 ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013 Professor: Sherief Reda School of Engineering, Brown University 1. [from Debois et al. 30 points] Consider the non-pipelined implementation of

More information

Chapter 12. CPU Structure and Function. Yonsei University

Chapter 12. CPU Structure and Function. Yonsei University Chapter 12 CPU Structure and Function Contents Processor organization Register organization Instruction cycle Instruction pipelining The Pentium processor The PowerPC processor 12-2 CPU Structures Processor

More information

Inserting Data Prefetches into Loops in Dynamically Translated Code in IA-32EL. Inserting Prefetches IA-32 Execution Layer - 1

Inserting Data Prefetches into Loops in Dynamically Translated Code in IA-32EL. Inserting Prefetches IA-32 Execution Layer - 1 I Inserting Data Prefetches into Loops in Dynamically Translated Code in IA-32EL Inserting Prefetches IA-32 Execution Layer - 1 Agenda IA-32EL Brief Overview Prefetching in Loops IA-32EL Prefetching in

More information

Chapter 1. Computer Abstractions and Technology. Lesson 3: Understanding Performance

Chapter 1. Computer Abstractions and Technology. Lesson 3: Understanding Performance Chapter 1 Computer Abstractions and Technology Lesson 3: Understanding Performance Manufacturing ICs 1.7 Real Stuff: The AMD Opteron X4 Yield: proportion of working dies per wafer Chapter 1 Computer Abstractions

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

Computer Architecture Today (I)

Computer Architecture Today (I) Fundamental Concepts and ISA Computer Architecture Today (I) Today is a very exciting time to study computer architecture Industry is in a large paradigm shift (to multi-core and beyond) many different

More information

CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07

CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07 CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07 Objectives ---------- 1. To introduce the basic concept of CPU speedup 2. To explain how data and branch hazards arise as

More information

Lecture 4: Review of MIPS. Instruction formats, impl. of control and datapath, pipelined impl.

Lecture 4: Review of MIPS. Instruction formats, impl. of control and datapath, pipelined impl. Lecture 4: Review of MIPS Instruction formats, impl. of control and datapath, pipelined impl. 1 MIPS Instruction Types Data transfer: Load and store Integer arithmetic/logic Floating point arithmetic Control

More information

CS311 Lecture: Pipelining and Superscalar Architectures

CS311 Lecture: Pipelining and Superscalar Architectures Objectives: CS311 Lecture: Pipelining and Superscalar Architectures Last revised July 10, 2013 1. To introduce the basic concept of CPU speedup 2. To explain how data and branch hazards arise as a result

More information

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Computer Architecture ECE 568 Sample Midterm I Questions Israel Koren ECE568/Koren Sample Midterm.1.1 1. The cost of a pipeline can

More information

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3) Hardware Speculation and Precise

More information

Instructions: Language of the Computer

Instructions: Language of the Computer CS359: Computer Architecture Instructions: Language of the Computer Yanyan Shen Department of Computer Science and Engineering 1 The Language a Computer Understands Word a computer understands: instruction

More information

Computer organization by G. Naveen kumar, Asst Prof, C.S.E Department 1

Computer organization by G. Naveen kumar, Asst Prof, C.S.E Department 1 Pipelining and Vector Processing Parallel Processing: The term parallel processing indicates that the system is able to perform several operations in a single time. Now we will elaborate the scenario,

More information

Instruction Pipelining Review

Instruction Pipelining Review Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number

More information

Parts A and B both refer to the C-code and 6-instruction processor equivalent assembly shown below:

Parts A and B both refer to the C-code and 6-instruction processor equivalent assembly shown below: CSE 30321 Computer Architecture I Fall 2010 Homework 02 Architectural Performance Metrics 100 points Assigned: September 7, 2010 Due: September 14, 2010 Problem 1: (20 points) The scope of this 1 st problem

More information