Computer Organization and Components


Course Structure, IS1500, fall 2015
Lecture: Parallelism, Concurrency, Speedup, and ILP
Module 6: Parallel Processors and Programs

[Course-structure slide: the course consists of the modules C and Assembly Programming, Logic Design, Processor Design, I/O Systems, Memory Hierarchy (Module 5), and Parallel Processors and Programs (Module 6), together with lab sessions and a project expo. This lecture belongs to Module 6.]

Abstractions in Computer Systems
[Layered figure: Networked Systems and Systems of Systems; Application Software; Operating System; Instruction Set Architecture (the hardware/software interface); Microarchitecture; Logic and Building Blocks; Digital Circuits; Analog Circuits; Devices and Physics. Software covers the top layers, Digital Hardware Design the middle layers, and Analog Design and Physics the bottom layers; together they form a Computer System.]

Agenda
Parallelism, concurrency, speedup, and instruction-level parallelism (ILP).

Lecturer: Associate Professor, KTH Royal Institute of Technology; Assistant Research Engineer, University of California, Berkeley.

How is this computer revolution possible? (Revisited)
Moore's law: integrated circuit resources (transistors) double every 18-24 months. Stated by Gordon E. Moore, Intel's co-founder, in the 1960s.
Possible because of an ever more refined manufacturing process; e.g., 4th-generation Intel Core i7 processors use a 22 nm manufacturing process.
Sometimes considered a self-fulfilling prophecy, since it has served as a goal for the semiconductor industry.

Acknowledgement: The structure and several of the good examples are derived from the book Computer Organization and Design (2014) by David A. Patterson and John L. Hennessy.

Have we reached the limit? Why? The Power Wall (Revisited)
During the last decades, the clock rate increased dramatically:
1989: 80486, 25 MHz
1993: Pentium, 66 MHz
1997: Pentium Pro, 200 MHz
2001: Pentium 4, 2.0 GHz
2004: Pentium 4, 3.6 GHz
2014: Core i7, roughly 3-4 GHz
Increased clock rate implies increased power, and we cannot cool the system enough to keep increasing the clock rate.
New trend since 2006: multicore. Moore's law still holds, so more processors are placed on a chip (multicore). The new challenge is parallel programming.

What is a multiprocessor?
A multiprocessor is a computer system with two or more processors. By contrast, a computer with one processor is called a uniprocessor.
Multicore microprocessors are multiprocessors where all processors (cores) are located on a single integrated circuit.
A cluster is a set of computers that are connected over a local area network (LAN). It may be viewed as one large multiprocessor.
[Photo credits: photo by Eric Gaba, CC BY-SA 3.0, no modifications made; http://www.publicdomainpictures.net/view-image.php?image=8&picture=tegelvagg; cluster photo by Robert Harker.]
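The "increased clock rate implies increased power" claim above can be made concrete with the classic CMOS dynamic-power relation covered in P&H section 1.7: power is roughly proportional to capacitive load times voltage squared times frequency. The sketch below is not from the lecture; the constants are made up and only illustrate that, at a fixed voltage, doubling the clock rate doubles the dynamic power.

    #include <stdio.h>

    /* Dynamic power ~ capacitive load * voltage^2 * frequency (CMOS).
     * The capacitive load below is an invented, illustrative value. */
    static double dynamic_power(double cap_load, double voltage, double freq_hz)
    {
        return cap_load * voltage * voltage * freq_hz;
    }

    int main(void)
    {
        double c = 1e-9;   /* made-up effective switched capacitance */
        printf("3.0 GHz at 1.2 V: %.2f W\n", dynamic_power(c, 1.2, 3.0e9));
        printf("6.0 GHz at 1.2 V: %.2f W\n", dynamic_power(c, 1.2, 6.0e9));
        return 0;
    }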

Different Kinds of Computer Systems (Revisited)
Embedded Real-Time Systems, Personal Computers and Personal Mobile Devices, and Warehouse Scale Computers. [Photos by Kyro and Robert Harker.]

Why multiprocessors?
Performance: it is possible to execute many computation tasks in parallel.
Dependability: if one out of N processors fails, N-1 processors are still functioning.
Energy: replace energy-inefficient processors in data centers with many efficient smaller processors.

Parallelism and Concurrency - what is the difference?
Concurrency is about handling many things at the same time. Concurrency may be viewed from the software viewpoint.
Parallelism is about doing (executing) many things at the same time. Parallelism may be viewed from the hardware viewpoint.

                        Sequential software                 Concurrent software
    Serial hardware     Matrix multiplication on a          A Linux OS running on a
                        unicore processor.                  unicore processor.
    Parallel hardware   Matrix multiplication on a          A Linux OS running on a
                        multicore processor.                multicore processor.

Note: as always, not everybody agrees on the definitions of concurrency and parallelism. The matrix is from H&P 2014, and the informal definitions above are similar to what was said in a talk by Rob Pike.

How much can we improve the performance using parallelization?
Speedup = Tbefore / Tafter, where Tbefore is the execution time of one program before the improvement and Tafter is the execution time after the improvement.
[Graph: speedup versus number of processors.] Linear speedup (or ideal speedup) means the speedup equals the number of processors. Below that there is still increased speedup, but it is less efficient. Superlinear speedup is either wrong, or due to e.g. cache effects.
Danger: relative speedup measures only the same program; true speedup also compares with the best known sequential program.
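As a concrete, hypothetical illustration of the "matrix multiplication on a multicore processor" example above, the sketch below (not part of the original slides) runs the same loop nest once sequentially and once parallelized with OpenMP, and prints the measured speedup Tbefore / Tafter. The matrix size and compiler flags are assumptions; any OpenMP-capable C compiler should work, e.g. gcc -O2 -fopenmp matmul.c -o matmul.

    #include <stdio.h>
    #include <omp.h>

    #define N 600

    static double a[N][N], b[N][N], c[N][N];

    static void matmul(int parallel)
    {
        /* The if-clause turns the parallelization on or off at run time. */
        #pragma omp parallel for if(parallel)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }
    }

    int main(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                a[i][j] = i + j;
                b[i][j] = i - j;
            }

        double t0 = omp_get_wtime();
        matmul(0);                                  /* sequential run */
        double t_before = omp_get_wtime() - t0;

        t0 = omp_get_wtime();
        matmul(1);                                  /* parallel run */
        double t_after = omp_get_wtime() - t0;

        printf("Tbefore = %.3f s, Tafter = %.3f s, speedup = %.2f\n",
               t_before, t_after, t_before / t_after);
        return 0;
    }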

Amdahl's Law (1/4)
Can we achieve linear speedup? Divide the execution time before the improvement into two parts:
T = Taffected + Tunaffected
where Taffected is the time affected by the improvement of parallelization and Tunaffected is the time unaffected by the improvement (the sequential part).
The execution time after the improvement, where N is the amount of improvement (N times improvement), is
Tafter = Taffected / N + Tunaffected
Speedup = T / Tafter = (Taffected + Tunaffected) / (Taffected / N + Tunaffected)
This is sometimes referred to as Amdahl's law.

Amdahl's Law (2/4)
Exercise: Assume a program consists of an image analysis task, sequentially followed by a statistical computation task. Only the image analysis task can be parallelized. How much do we need to improve the image analysis task to be able to achieve 4 times speedup? Assume that the program takes 80 ms in total and that the image analysis task takes 60 ms out of this time.
Solution: 4 = 80 / (60/N + (80 - 60)) gives 60/N + 20 = 20, i.e., 60/N = 0. It is impossible to achieve this speedup!

Amdahl's Law (3/4)
Assume that we perform 10 scalar integer additions, followed by one matrix addition, where the matrices are 10x10. Assume all additions take the same amount of time and that we can only parallelize the matrix addition.
Exercise A: What is the speedup with 10 processors?
Exercise B: What is the speedup with 40 processors?
Exercise C: What is the maximal speedup (the limit when N goes to infinity)?
Solution A: (10 + 10*10) / (10*10/10 + 10) = 5.5
Solution B: (10 + 10*10) / (10*10/40 + 10) = 8.8
Solution C: (10 + 10*10) / ((10*10)/N + 10) = 11 when N goes to infinity

Amdahl's Law (4/4)
Example continued. What if we change the size of the problem (make the matrices larger)?

    Size of matrices    10 processors    40 processors
    10x10               5.5              8.8
    20x20               8.2              20.5

But was not 11 the maximal speedup when N goes to infinity?
Strong scaling = measuring speedup while keeping the problem size fixed.
Weak scaling = measuring speedup when the problem size grows proportionally to the increased number of processors.
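A small C sketch, not part of the original slides, that reproduces the numbers in the exercises above; the function name amdahl is illustrative.

    #include <stdio.h>

    /* Speedup when the parallelizable part t_affected is improved by a
     * factor n and t_unaffected stays sequential (Amdahl's law). */
    static double amdahl(double t_affected, double t_unaffected, double n)
    {
        double t_before = t_affected + t_unaffected;
        double t_after  = t_affected / n + t_unaffected;
        return t_before / t_after;
    }

    int main(void)
    {
        /* 10 sequential scalar additions + one 10x10 matrix addition
         * (100 parallelizable additions); one time unit per addition. */
        printf("10x10, 10 processors: %.1f\n", amdahl(100.0, 10.0, 10.0));  /* 5.5  */
        printf("10x10, 40 processors: %.1f\n", amdahl(100.0, 10.0, 40.0));  /* 8.8  */
        printf("10x10, N -> infinity: %.1f\n", amdahl(100.0, 10.0, 1e9));   /* ~11  */

        /* Weak scaling: growing the problem to a 20x20 matrix
         * (400 additions) improves the speedup for the same N. */
        printf("20x20, 10 processors: %.1f\n", amdahl(400.0, 10.0, 10.0));  /* 8.2  */
        printf("20x20, 40 processors: %.1f\n", amdahl(400.0, 10.0, 40.0));  /* 20.5 */
        return 0;
    }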

Main Classes of Parallelism
Task-Level Parallelism (TLP): different tasks of work that can be carried out independently and in parallel.
Example: many tasks at the farm. Assume that there are many different things that can be done on the farm (fix the barn, sheep shearing, feed the pigs, etc.). Task-level parallelism would be to let the farm hands do the different tasks in parallel.
Data-Level Parallelism (DLP): many data items can be processed at the same time.
Example: sheep shearing. Assume that the sheep are data items and the task for the farmer is to do sheep shearing (remove the wool). Data-level parallelism would be the same as using several farm hands to do the shearing.

SISD, SIMD, and MIMD
An old (from the 1960s) but still very useful classification of processors uses the notion of instruction and data streams:

                                  Data stream: Single            Data stream: Multiple
    Instruction stream: Single    SISD, e.g. Intel Pentium 4     SIMD, e.g. SSE instructions in x86
    Instruction stream: Multiple  MISD, no examples today        MIMD, e.g. Intel Core i7

Data-level parallelism (DLP): examples are multimedia extensions (e.g., SSE, streaming SIMD extensions) and vector processors.
Task-level parallelism (TLP): examples are multicore and cluster computers.
Graphics processing units (GPUs) are both SIMD and MIMD. (A small C example using SSE intrinsics is sketched at the end of this section.)
Physical Q/A: What is a modern Intel CPU, such as the Core i7? Stand up for MIMD, hands on the table for SIMD.

What is Instruction-Level Parallelism?
Instruction-level parallelism (ILP) may increase performance without involvement of the programmer. It may be implemented in SISD, SIMD, and MIMD computers. There are two main approaches:
1. Deep pipelines with more pipeline stages. If the lengths of all pipeline stages are balanced, we may increase performance by increasing the clock speed.
2. Multiple issue. A technique where multiple instructions are issued in each clock cycle.
ILP may decrease the CPI to lower than 1 or, using the inverse metric instructions per clock cycle (IPC), increase it above 1.
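The SSE extensions mentioned in the SIMD table above can be used from C through compiler intrinsics. The sketch below is a minimal example, not from the lecture, and assumes an x86 processor and a compiler providing <xmmintrin.h>: a single _mm_add_ps instruction adds four single-precision floats at once, which is exactly the data-level parallelism idea.

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void)
    {
        float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
        float r[4];

        __m128 va = _mm_loadu_ps(a);     /* load four floats */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vr = _mm_add_ps(va, vb);  /* four additions, one instruction */
        _mm_storeu_ps(r, vr);

        printf("%.0f %.0f %.0f %.0f\n", r[0], r[1], r[2], r[3]);
        return 0;
    }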

Multiple Issue Approaches
Problems with multiple issue:
Determining how many and which instructions to issue in each cycle. This is problematic since instructions typically depend on each other.
How to deal with data and control hazards.
The two main approaches of multiple issue are:
1. Static multiple issue. Decisions on when and which instructions to issue at each clock cycle are determined by the compiler.
2. Dynamic multiple issue. Many of the decisions on issuing instructions are made by the processor, dynamically, during execution.

Static Multiple Issue (1/3) - VLIW
Very Long Instruction Word (VLIW) processors issue several instructions in each cycle using issue packages.

    add $s0, $s1, $s2    F D E M W
    add $t0, $t0, $0     F D E M W
    and $t1, $t1, $s0      F D E M W
    lw  $t0, 0($s0)        F D E M W

An issue package may be seen as one large instruction with multiple operations; here each issue package has two issue slots. The compiler may insert no-ops to avoid hazards.
How is VLIW affecting the hardware implementation?

Static Multiple Issue (2/3) - Changing the hardware
[Figure: the single-issue MIPS datapath with PC, instruction memory, register file, sign extend, ALU, and data memory.]
To issue two instructions in each cycle, we need the following (not shown in the picture):
Fetch and decode 64 bits (two instructions) each cycle.
Double the number of ports for the register file.
Add another ALU.

Static Multiple Issue (3/3) - VLIW Exercise
Exercise: Assume we have a 5-stage VLIW MIPS processor that can issue two instructions in the same clock cycle. For the following code, schedule the instructions into issue slots and compute the IPC.

        addi $s2, $0, 10
    L:  lw   $t0, 0($s1)
        ori  $t1, $t0, 7
        sw   $t1, 0($s1)
        addi $s1, $s1, 4
        addi $s2, $s2, -1
        bne  $s2, $0, L

Solution (rescheduled to avoid stalls because of the lw; empty slots mean no-ops):

    Cycle   Slot 1                 Slot 2
    1       addi $s2, $0, 10
    2   L:  lw   $t0, 0($s1)       addi $s1, $s1, 4
    3       addi $s2, $s2, -1
    4       ori  $t1, $t0, 7
    5       bne  $s2, $0, L        sw $t1, -4($s1)

IPC = (1 + 6*10) / (1 + 4*10) = 61/41, approximately 1.49 (the maximum is 2).
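For readers who prefer C, the exercise loop above corresponds roughly to the following function. This is an illustrative reconstruction, assuming $s1 points to an array of 32-bit words and the loop count is 10 as in the listing; the name set_low_bits is made up. The compiler's job in static multiple issue is to schedule the corresponding machine instructions into the two issue slots, as in the table above.

    /* p plays the role of $s1; the counter plays the role of $s2. */
    void set_low_bits(int *p)
    {
        for (int i = 10; i != 0; i--) {   /* addi $s2,$0,10 / addi $s2,$s2,-1 / bne */
            *p = *p | 7;                  /* lw, ori, sw */
            p++;                          /* addi $s1, $s1, 4 */
        }
    }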

Dynamic Multiple Issue Processors (1/2) - Superscalar Processors
In many modern processors (e.g., the Intel Core i7), instruction issuing is performed dynamically by the processor while executing the program. Such a processor is called superscalar.

[Figure: instruction fetch and decode unit -> reservation stations (RS) -> functional units (FU) -> commit unit.]

The first unit fetches and decodes several instructions in order.
Reservation stations (RS) buffer operands for the functional units before execution. Data dependencies are handled dynamically.
Functional units (FU) execute the instructions. Examples are integer units, floating-point units, and load/store units.
Results are sent back to the reservation stations (if a unit waits on the operand) and to the commit unit.
The commit unit commits instructions (stores results in registers) conservatively, respecting the observable order of instructions. This is called in-order commit.
If the superscalar processor can reorder the instruction execution order, it is an out-of-order execution processor.

Dynamic Multiple Issue Processors (2/2) - Out-of-Order Execution, RAW, WAR
Example of a read-after-write (RAW) dependency (dependency on $t0). The superscalar pipeline must make sure that the data is available before it is read:

    lw   $t0, 0($s1)
    addi $t1, $t0, 7

Example of a write-after-read (WAR) dependency (dependency on $t1). If the order is flipped due to out-of-order execution, we have a hazard:

    sub  $t0, $t1, $s0
    addi $t1, $s0, 10

WAR dependencies can be resolved using register renaming, where the processor writes to a non-architectural renaming register (here called $r1 in the pseudo assembly code, not accessible to the programmer):

    addi $r1, $s0, 10
    sub  $t0, $t1, $s0

Note that an out-of-order processor is not rewriting the code, but keeps track of the renamed registers during execution. (A rough software analogy is sketched after the processor table below.)

Some observations on Multi-Issue Processors
VLIW processors tend to be more energy efficient than superscalar out-of-order processors (less hardware; the compiler does the job).
Superscalar processors with dynamic scheduling can hide some latencies that are not statically predictable (e.g., cache misses, dynamic branch predictions).
Although modern processors issue 4 to 6 instructions per clock cycle, few applications result in an IPC over 2. The reason is dependencies.

Intel Microprocessors, some examples

    Processor                    Year   Clock Rate   Pipeline Stages   Issue Width   Cores   Power
    Intel 486                    1989   25 MHz       5                 1             1       5 W
    Intel Pentium                1993   66 MHz       5                 2             1       10 W
    Intel Pentium Pro            1997   200 MHz      10                3             1       29 W
    Intel Pentium 4 Willamette   2001   2000 MHz     22                3             1       75 W
    Intel Pentium 4 Prescott     2004   3600 MHz     31                3             1       103 W
    Intel Core                   2006   2930 MHz     14                4             2       75 W
    Intel Core i5 Nehalem        2010   3300 MHz     14                4             4       87 W
    Intel Core i5 Ivy Bridge     2012   3400 MHz     14                4             8       77 W

Clock rate increase stopped (the power wall) around 2006. Pipeline stages first increased with the Pentium 4; power peaked and then decreased, while the number of cores increased after 2006.
Source: Patterson and Hennessy, 2014.
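As a loose software analogy of the WAR example above (an illustration only, not how the hardware is implemented), the effect of register renaming can be pictured in C: letting the second write go to a fresh variable removes the conflict on t1, so the two statements no longer have to stay in their original order. In the real processor this mapping from $t1 to $r1 is maintained in hardware, invisibly to the programmer.

    #include <stdio.h>

    int t0, t1 = 100, r1, s0 = 5;

    /* WAR dependency on t1: the second statement must not overtake the first. */
    void with_war(void)
    {
        t0 = t1 - s0;     /* reads the old value of t1 */
        t1 = s0 + 10;     /* writes t1 (WAR on t1)     */
    }

    /* "Renamed": the write targets a fresh location r1, so the statements are
     * independent; later readers of t1 would be redirected to r1. */
    void renamed(void)
    {
        r1 = s0 + 10;     /* may now execute before or with the line below */
        t0 = t1 - s0;     /* still reads the old t1                        */
    }

    int main(void)
    {
        renamed();
        printf("t0 = %d, r1 = %d\n", t0, r1);   /* 95, 15 */
        return 0;
    }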

Bonus Part: Time-Aware Systems - Design and Research Challenges (not part of the examination)

Time-Aware Systems - Examples
Cyber-Physical Systems (CPS): automotive (@ KTH), process industry and industrial automation, aircraft.
Time-Aware Simulation Systems: physical simulations (Simulink, Modelica, etc.).
Time-Aware Distributed Systems: time-stamped distributed systems (e.g. Google Spanner).

Programming Model and Time
"Everything should be made as simple as possible, but not simpler" - attributed to Albert Einstein.
Timing is not part of the software semantics: correct execution of programs (e.g., in C, C++, C#, Java, Scala, Haskell, OCaml) has nothing to do with how long things take to execute.

What is our goal?
"Execution time should be as short as possible, but not shorter."
Objective: minimize area, memory, and energy. Minimize the slack; there is no point in making the execution time shorter, as long as the deadline is met. [Figure: a task, its slack, and its deadline.]
Challenge: still guarantee to meet all timing constraints.

Traditional approach: timing is dependent on the hardware platform.
Our objective: make time an abstraction within the programming model, so that timing is independent of the hardware platform (within certain constraints).

Are you interested in being challenged?
If you are interested in programming language design, compilers, or computer architecture; if you are ambitious and interested in learning new things; and if you want to do a real research project as part of your Bachelor's or Master's thesis project - please send an email so that we can discuss some ideas.

Are you interested to be challenged? If you are interested in You know the drill Programming language design, or Compilers, or keep calm Computer Architecture You are ambitious and interested in learning new things You want to do a real research project as part of you Bachelor s or Master s thesis project Please send an email to, so that we can discuss some ideas. I I 5 6 Reading Guidelines Summary Module 6: Parallel Processors and Programs Some key take away points: Lecture :, Concurrency,, and ILP H&H Chapter.8,.6, 7.8.-7.8.5 P&H5 Chapters.7-.8,.0,.0, 6.-6. or P&H Chapters.5-.6,.8,.0, 7.-7. Amdahl s law can be used to estimate maximal speedup when introducing parallelism in parts of a program. Lecture : SIMD, MIMD, and Parallel Programming H&H Chapter 7.8.6-7.8.9 P&H5 Chapters.,.6, 5.0, 6.-6.7 or P&H Chapters.,,6, 5.8, 7.-7.7 (ILP) has been very important for performance improvements over the years, but improvements have not been as significant lately. Reading Guidelines See the course webpage for more information. Thanks for listening! I I